IEEE Software Blog: Has empiricism in Software Engineering amounted to anything?

Associate Editor: Bogdan Vasilescu, University of California, Davis, USA (@b_vasilescu)

A 2014 talk by Jim Larus, currently (after a long tenure at Microsoft Research) Dean of the School of Computer and Communication Sciences at EPFL, was circulated recently over email among some colleagues. In the talk, Larus reflects on success and failure stories of software engineering (SE) tools used at Microsoft over the years: interestingly, he lists the empirical work on data mining and failure prediction, from Microsoft Research, as one of the biggest successes, together with Windows error reporting; in contrast, Larus mentions both PREfast (static analysis) and SDV (formal verification) as examples of tools that didn't matter as much.

Needless to say, this sparked quite the controversy among participants in my email thread, about the value (or lack thereof) of empiricism in SE. I found the discussion thoughtful and useful, and I’d like to share it. This blog post is an attempt to summarize two main arguments, drawing mostly verbatim from the email thread participants. Active participants included, at various times, Alex Wolf, David Rosenblum, Audris Mockus, Jim Herbsleb, Vladimir Filkov, and Prem Devanbu.

Is empiricism just blind correlation?

The main criticism is that empirical SE work does not move past mere correlations to find deeper understanding, yet it's understanding that leads to knowledge, not correlation. The analysis of data should enable understanding, explanation, and prediction, not be an end in and of itself. The accessibility of correlation among statistical methods, coupled with the flood of data from open-source development and social programming websites, has created a recipe for endless paper milling. As a result, many empirical SE papers don’t ask critically important research questions that matter to both researchers and practitioners; instead, they present obvious, expected, and unactionable insights that researchers cannot exploit to develop new or improved approaches and that practitioners cannot use to produce guaranteed improvements to their development efforts.

In response to this criticism, three counterexamples:

Are automated debugging techniques actually helping programmers?

C Parnin, A Orso. ISSTA ’11

On the naturalness of software

A Hindle, ET Barr, Z Su, M Gabel, P Devanbu. ICSE ’12, CACM Research Highlights ’16

Are mutants a valid substitute for real faults in software testing?

R Just, D Jalali, L Inozemtseva, MD Ernst, R Holmes, G Fraser. FSE ’14

Arguably, the successes at Microsoft and other places, with saving money on quality control using process metrics, have led to a sort of Gold Rush and probably some excesses. However, it is clear that empirical SE has advanced by leaps and bounds since the early correlation days. Admittedly, earlier work was trying to understand the usefulness of the data and stats, and resulted in some fluff; perhaps, as with a new gadget, people were more interested in the technology than in how useful it is in daily life. However, thanks in part to all that playing around, we now have lots of solid work, and a community that is open to scientific methods.

Moreover, these days, to pass muster at top conferences, one needs to have a good theoretical grounding, not only in past work in SE & PL, but (often) also in sociology, psychology, management science, and behavioral science -- creating and maintaining software is inherently human-centric. Theory helps to predict and explain results; frame future research questions; provide continuity across studies and research groups; accumulate results into something resembling knowledge; and point toward important open questions. For example, there is a lot of theory in psychology about multitasking and context switching that really helps to explain the quantitative results relating productivity to patterns of work across projects. Such theory can both push further theory development and work toward deeper explanations of the phenomena we study. What exactly happens when humans context switch, why is context switching costly, and why do costs vary in different conditions?

This is also why correlations have largely fallen out of favour. Careful consideration of the relevant phenomena (in the context of existing work) will lead one to use, e.g., more sophisticated mixed-effects / multivariate models and bootstrapped distributions, and to carefully argue why their experimental methods are well-suited to validate the theory. The use of these models is driven precisely by more sophisticated theories (not statistical, but rather SE / behavioral science theories).
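To make the statistical point concrete, here is a minimal sketch (not something from the email thread) of what a bootstrapped distribution adds over a bare correlation coefficient. Everything in it is an assumption for illustration: the `churn` and `defects` numbers are made up, and `pearson` and `bootstrap_ci` are hypothetical helper names. It only shows the resampling idea; it is not a substitute for the mixed-effects models mentioned above.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Degenerate resample (no variation); treated as "no correlation"
        # here, which is a simplifying assumption of this sketch.
        return 0.0
    return cov / sqrt(var_x * var_y)


def bootstrap_ci(xs, ys, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation of paired data."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(xs)
    estimates = []
    for _ in range(n_resamples):
        # Resample (x, y) pairs with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    estimates.sort()
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical per-module data: lines changed vs. defects reported.
churn = [120, 45, 300, 80, 15, 210, 60, 500, 95, 30]
defects = [4, 1, 7, 3, 0, 9, 1, 11, 2, 2]

point = pearson(churn, defects)
low, high = bootstrap_ci(churn, defects)
print(f"r = {point:.2f}, 95% bootstrap interval = [{low:.2f}, {high:.2f}]")
```

With only ten made-up data points the interval will typically be wide, and that is the point: a single coefficient hides how uncertain the estimate is, and even a tight interval would still say nothing about why the two quantities move together.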

Can empiricism be valuable without fully understanding the “why”?

Although the demand for understanding why something works is prima facie not entirely unreasonable, if one insisted on that, then many engineering disciplines would come to a halt: e.g., airfoil design, medicine, natural language processing (no one really understands why n-gram models work better than just about anything else). Knowing that something works or doesn’t can be quite enough to help a great many people. We have many extremely useful things in the world that work by correlation and “hidden” models. For example, theoretically sound approaches such as topic models (e.g., LDA) don’t work well, but highly engineered algorithms (e.g., word2vec) work fantastically. By insisting on why, one would: stop using SVM, deep learning, many drugs; not fly on airplanes; and not use Siri/Echo, because in many cases the why of these things is not well understood. Even such a trivial thing as GPS has a lot of engineering adjustments that have not been derived from first principles but just work.

However, without explicit understanding of why things work, it’s not clear how far one can go with engineering. For example, many ancient civilizations were able to predict the phases of the moon and positions of the stars without understanding the why of gravity and orbital mechanics, simply based on repeated measurement. But would that have gotten us satellite communication, planetary exploration, and the Higgs boson? Similarly, would you feel comfortable in a plane that lands using a machine-learned autoland system?

Closing remarks

In summary, it seems there is a central place for empirical work in SE; not as an end in itself, but rather as the beginning of an inquiry leading to understanding and useful insights that we can use to truly advance our discipline. See also Peggy Storey's Lies, Damned Lies and Software Analytics talk along these lines.

4 comments:

Kate Thompson, May 23, 2016 at 12:09 AM
Studies lead to data, data leads to knowledge. I don't think redefining empiricism as "blind correlation" is helpful: empiricism is better understood as "based on evidence."
Kate Thompson, May 23, 2016 at 1:41 PM
To clarify, the proper way to do empiricism is this:

Create a hypothesis, do a study, collect data. Try to understand data, which leads to new hypothesis. Create a new study to test that hypothesis.

Rinse, Recycle, Repeat.
Tao Xie, May 24, 2016 at 1:50 PM
The discussion/reflection from different viewpoints in our community (reflected by this IEEE Software Blog post) is useful and healthy for resolving differences and working together to aim for high impact, which should be the goal in the first place. In the machine learning community, even when it is so well known that machine learning is highly useful and impactful in practice, there is still voice on "Machine Learning that Matters" (ICML 2012 paper: http://teamcore.usc.edu/WeeklySeminar/Aug31.pdf) and a blog on "Is Machine Learning Losing Impact?" (http://blog.mikiobraun.de/2012/06/is-machine-learning-losing-impact.html). For those who are interested in digging out more related references/materials, feel free to browse my talk slides on "Towards Mining Software Repositories Research that Matters", presented at Next Generation of Mining Software Repositories '14 (Pre-FSE 2014 Event): http://www.slideshare.net/taoxiease/towards-mining-software-repositories-research-that-matters. For those who are interested in producing practice impact with data-driven software engineering research, take a look at FSE 2014 tutorial slides (or earlier versions) on "Software Analytics - Achievements and Challenges": http://research.microsoft.com/en-US/groups/sa/fse2014tutorial.pdf and high-practice-impact research done at the Software Analytics group at Microsoft Research Asia (http://research.microsoft.com/en-us/groups/sa/). More discussion and sharing on how to make more and higher impact in our research field are especially welcome!
Stefan Wagner, May 25, 2016 at 6:11 AM
I also really don't understand why this post equates empiricism with machine learning. Doesn't make sense.

Sunday, May 22, 2016
