Sunday, May 29, 2016

Insights for practitioners into empirical software architecture research

Associate Editor: Mehdi Mirakhorli (@MehdiMirakhorli)

Software architecture is an applied discipline. Therefore, research in that area should eventually offer practitioners useful and trustworthy results.

What we did: To find out more about the state of empirical software architecture research and its relevance for practitioners, we looked at research papers published between 1999 and 2015 at the most popular architecture conferences:
· CBSE (International ACM SIGSOFT Symposium on Component-based Software Engineering)
· ECSA (European Conference on Software Architecture)
· QoSA (International ACM SIGSOFT Conference on the Quality of Software Architectures)
· WICSA (Working IEEE/IFIP Conference on Software Architecture)
We went through a total of 667 “full” papers, of which 115 (around 17%) turned out to be “empirical”. In other words, 17% presented research based on systematically collected evidence from observation or experience [1], using methods such as case studies, experiments or surveys (the remaining 83% of papers presented solutions or new methods without evaluation, experience reports, tools, etc.). The full study has been published as a research paper at WICSA 2016 [2], and a preprint is available [5]. In this post, we briefly highlight some of the insights.

Insight 1 – Amount of empirical work is increasing: Since around 2010, there has been an increasing trend to publish empirical research in software architecture. The figure below shows the proportion of empirical papers among all papers published in a year.
Empirical methods are used equally to (a) study phenomena (e.g., to understand how architecture decisions are made in practice) and (b) evaluate approaches (e.g., to evaluate a new approach for capturing architectural knowledge). Therefore, current empirical research helps us understand software architecture practice (for example, by identifying the real problems that research should tackle), but it also increases our confidence in the validity of newly proposed approaches and methods.

[Figure: proportion of empirical papers relative to all papers published per year, 1999-2015]

Insight 2 – Involvement of practitioners as researchers/co-authors is limited: On average, for every two papers there is one author from industry but five authors from academia. Around 70% of papers have no industry author at all. Overall, empirical papers have roughly the same number of industry authors as non-empirical papers. The limited active involvement of practitioners (assuming that co-authors are actively involved in the research presented in a paper) might be due to the challenges of establishing successful industry-academia collaborations (e.g., finding champions in industry organizations, and securing buy-in, support and commitment to contribute to industry needs) [3].

Insight 3 – Involvement of practitioners as study participants increases: Half of the empirical studies involve humans as subjects or study participants (e.g., as participants in case studies or respondents to surveys). Most of these studies (68%) involve practitioners from industry (with the exception of experiments, where students are the dominant participants). This involvement of practitioners is one way to increase the applicability and relevance of research findings for practice.

Insight 4 – Trustworthiness and rigor vary: Most studies (60%) discuss validity threats and acknowledge limitations. On the other hand, we should be careful when interpreting research findings: many so-called “evaluations” presented in papers are merely illustrations using toy examples. Also, replications to confirm or refute previous findings and to build confidence in research outcomes are almost non-existent in software architecture research. There might be a misconception in the research community that replications have little scientific value.

Insight 5 – There is a potential mismatch between research topics and trends in industry: Component selection and composition, architecture reasoning and architecture decisions are the dominant themes in empirical papers in the period 1999-2015. Michael Keeling, in his reflection on emerging trends at the SATURN conferences [4], identified architecting for DevOps, flexible design, lightweight architecture design methods, and a renewed interest in architecture fundamentals. We could not find these topics as trends in the empirical papers.

Summary: Applying empirical methods puts increasing demands on planning, conducting and evaluating research, and on presenting results in a way that is accessible and meaningful for practitioners. On the other hand, empirical research allows practitioners to get actively involved in collaborations with academics, both to explore emerging phenomena of interest (and to find out about the real problems in software architecture practice) and to evaluate new approaches that address practically relevant problems.

References:
  1. D. Sjoberg, T. Dyba, and M. Jorgensen, "The Future of Empirical Methods in Software Engineering Research," in Future of Software Engineering (FOSE) Minneapolis, MN: IEEE Computer Society, 2007, pp. 358-378.
  2. M. Galster and D. Weyns, "Empirical Research in Software Architecture - How far have we come?," in 13th Working IEEE/IFIP Conference on Software Architecture (WICSA) Venice, Italy: IEEE Computer Society, 2016.
  3. C. Wohlin, A. Aurum, L. Angelis, L. Philips, Y. Dietrich, T. Gorschek, H. Grahn, K. Henningsson, S. Kagstrom, G. Low, P. Rovegard, P. Tomaszewski, C. Van Toorn, and J. Winter, "The Success Factors Powering Industry-Academia Collaboration," IEEE Software, vol. 29, pp. 67-73, 2012.
  4. M. Keeling, "Lightweight and Flexible - Emerging Trends in Software Architecture from the SATURN Conferences," IEEE Software, vol. 32, pp. 7-11, 2015.
  5. https://people.cs.kuleuven.be/danny.weyns/papers/2016WICSA.pdf


Sunday, May 22, 2016

Has empiricism in Software Engineering amounted to anything?

Associate Editor: Bogdan Vasilescu, University of California, Davis, USA (@b_vasilescu)


A 2014 talk by Jim Larus, currently (after a long tenure at Microsoft Research) Dean of the School of Computer and Communication Sciences at EPFL, was circulated recently over email among some colleagues. In the talk, Larus reflects on success and failure stories of software engineering (SE) tools used at Microsoft over the years: interestingly, he lists the empirical work on data mining and failure prediction, from Microsoft Research, as one of the biggest successes, together with Windows error reporting; in contrast, Larus mentions both PREfast (static analysis) and SDV (formal verification) as examples of tools that didn't matter as much.

Needless to say, this sparked quite the controversy among participants in my email thread, about the value (or lack thereof) of empiricism in SE. I found the discussion thoughtful and useful, and I’d like to share it. This blog post is an attempt to summarize two main arguments, drawing mostly verbatim from the email thread participants. Active participants included, at various times, Alex Wolf, David Rosenblum, Audris Mockus, Jim Herbsleb, Vladimir Filkov, and Prem Devanbu.

Is empiricism just blind correlation?

The main criticism is that empirical SE work does not move past mere correlations to find deeper understanding, yet it is understanding that leads to knowledge, not correlation. The analysis of data should enable understanding, explanation, and prediction, not be an end in and of itself. The accessibility of correlation analysis, coupled with the flood of data from open-source development and social programming websites, has created a recipe for endless paper milling. As a result, many empirical SE papers do not ask critically important research questions that matter to both researchers and practitioners; instead, they present obvious, expected, and unactionable insights that neither researchers can exploit to develop new or improved approaches, nor practitioners can use to produce guaranteed improvements to their development efforts.

In response to this criticism, three counterexamples were offered:
C Parnin, A Orso. ISSTA ’11 PDF
A Hindle, ET Barr, Z Su, M Gabel, P Devanbu. ICSE ’12, CACM Research Highlights ’16 PDF
R Just, D Jalali, L Inozemtseva, MD Ernst, R Holmes, G Fraser. FSE ’14 PDF

Arguably, the successes at Microsoft and other places in saving money on quality control using process metrics have led to a sort of gold rush and probably some excesses. However, it is clear that empirical SE has advanced leaps and bounds since the early correlation days. Admittedly, earlier work was trying to understand the usefulness of the data and statistics, and it resulted in some fluff; perhaps, as with a new gadget, people were more interested in the technology than in how useful it is in daily life. However, thanks in part to all that playing around, we now have lots of solid work and an openness in the community to scientific methods.

Moreover, these days, to pass muster at top conferences, one needs a good theoretical grounding, not only in past work in SE & PL, but (often) also in sociology, psychology, management science, and behavioral science -- creating and maintaining software is inherently human-centric. Theory helps to predict and explain results; frame future research questions; provide continuity across studies and research groups; accumulate results into something resembling knowledge; and point toward important open questions. For example, there is a lot of theory in psychology about multitasking and context switching that really helps to explain the quantitative results relating productivity to patterns of work across projects. Such theory can both push further theory development and work toward deeper explanations of the phenomena we study: what exactly happens when humans context switch, why is context switching costly, and why do costs vary in different conditions?

This is also why correlations have largely fallen out of favour. Careful consideration of the relevant phenomena (in the context of existing work) will lead one to use, for example, more sophisticated mixed-effects or multivariate models and bootstrapped distributions, and to carefully argue why these experimental methods are well suited to validate the theory. The use of these models is driven precisely by more sophisticated theories (not statistical theories, but SE and behavioral science theories).
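
To make this concrete, here is a minimal sketch of what such an analysis might look like in Python with statsmodels: a mixed-effects model with a random intercept per project, plus a bootstrapped confidence interval for a group difference. The data file, column names and variables are illustrative assumptions, not taken from any of the studies discussed above.

# Sketch: a mixed-effects model and a bootstrap, as one might use them in
# empirical SE work instead of a plain correlation. All names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per developer-month, with a boolean
# "multitasker" column and a "project" grouping column.
data = pd.read_csv("developer_months.csv")

# A random intercept per project accounts for repeated observations within
# the same project, instead of treating every row as independent
# (which a simple correlation would do).
model = smf.mixedlm("productivity ~ context_switches + experience",
                    data, groups=data["project"])
result = model.fit()
print(result.summary())

# Bootstrapped distribution of the difference in median productivity between
# developers who work across many projects and those who do not.
rng = np.random.default_rng(0)
multi = data.loc[data["multitasker"], "productivity"].to_numpy()
single = data.loc[~data["multitasker"], "productivity"].to_numpy()
diffs = [np.median(rng.choice(multi, multi.size)) -
         np.median(rng.choice(single, single.size))
         for _ in range(2000)]
print("95% CI for difference in medians:", np.percentile(diffs, [2.5, 97.5]))

The point is not the particular model, but that the modelling choices (the grouping structure, the covariates) are justified by the behavioral theory rather than by the statistics alone.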

Can empiricism be valuable without fully understanding the “why”?

Although the demand for understanding why something works is prima facie not entirely unreasonable, if one insisted on it, many engineering disciplines would come to a halt: e.g., airfoil design, medicine, natural language processing (no one really understands why n-gram models work better than just about anything else). Knowing that something works or doesn't can be quite enough to help a great many people. We have many extremely useful things in the world that work by correlation and “hidden” models. For example, theoretically sound approaches such as topic models (e.g., LDA) don't work well, but highly engineered algorithms (e.g., word2vec) work fantastically. By insisting on why, one would stop using SVMs, deep learning, and many drugs; not fly on airplanes; and not use Siri or Echo, because in many cases the why of these things is not well understood. Even such a trivial thing as GPS involves a lot of engineering adjustments that have not been derived from first principles but simply work.

However, without explicit understanding of why things work, it’s not clear how far one can go with engineering. For example, many ancient civilizations were able to predict the phases of the moon and positions of the stars without understanding the why of gravity and orbital mechanics, simply based on repeated measurement. But would that have gotten us satellite communication, planetary exploration, and the Higgs boson? Similarly, would you feel comfortable in a plane that lands using a machine-learned autoland system?

Closing remarks

In summary, it seems there is a central place for empirical work in SE; not as an end in itself, but rather as the beginning of an inquiry leading to understanding and useful insights that we can use to truly advance our discipline. See also Peggy Storey's Lies, Damned Lies and Software Analytics talk along these lines.

Sunday, May 15, 2016

Does your Android App Collect More than it Promises to?

by Rocky Slavin, University of Texas at San Antonio, USA (@RockySlavin)
Associate Editor: Sarah Nadi, Technische Universität Darmstadt, Germany (@sarahnadi)

How do we know that the apps on our mobile devices access and collect only the private information we are told they do? This question is particularly important for mobile devices because of their many sensors that can produce private information. Typically, end users can read an app’s privacy policy, provided by the publisher, to get details on what private information is being collected by the app. But even so, it is difficult to verify that the app’s code does indeed adhere to the promises made in the policy. This is an important problem not only for end users who care about their right to privacy, but also for developers who have moral and legal obligations to be honest about their code.
In order to aid developers and end users in answering these questions, we have created an approach that connects the natural language used in privacy policies with the code used to access sensitive information on Android devices [4]. This connection, or mapping, allows for a fully-automated violation detection process that can check for consistency between a compiled Android application and its corresponding natural language privacy policy.

Privacy Policies
If you look for an app on the Google Play store, you’ll commonly find a link to a legal document disclosing the private information that is accessed or collected through the app. Perhaps the biggest hindrance in understanding and analyzing these privacy policies is their lack of a canonical format. Privacy policies exist in all lengths and levels of detail, yet under United States law, they must all provide the end user with enough information to be able to make an informed decision on the app’s access to their private information [3].

Sensors and Code
As mentioned above, mobile devices often provide access to various sensors, including GPS, Bluetooth, cameras, networking devices, and many others. In order for an app’s code to access data from these sensors, it must invoke methods from an application programming interface (API). For the Android operating system, accessing this API is as simple as invoking the appropriate methods, such as android.location.LocationManager.getLastKnownLocation(), directly in the app’s code. It is these invocations that need to align with the app’s privacy policy for consistency to hold.

Bridging the Gap
For our approach, we created associations between the API methods used for accessing private data and the natural language used in privacy policies to describe that data.

First, we used the popular crowd-sourcing platform Amazon Mechanical Turk to identify commonly-used phrases in privacy policies that describe information that can be produced by Android’s API. The tasks involved users reading through short excerpts from a set of 50 random privacy policies and annotating the phrases used to describe information that was collected. For example, phrases such as “IP address”, “location”, and “device identifier” were among the most frequently found. The resulting privacy policy lexicon represented the general language used in privacy policies when referencing sensitive data.

Next, we used a similar approach to identify words descriptive of the data produced from all of the publicly-accessible API methods that are sources [2] of private information. Tasks for this portion consisted of individual methods with their descriptions from the API documentation. Users annotated phrases in the description that described the information being produced by the method. This created a natural language representation of the methods’ data to which we could associate phrases from the privacy policy lexicon. The result was a many-to-many mapping of 154 methods to 76 phrases.
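
As a rough illustration (a simplification with invented entries, not excerpts from the actual 154-method mapping), the result can be thought of as a dictionary from fully qualified API method names to the sets of policy phrases associated with the data they produce, for example in Python:

# Illustrative sketch of the many-to-many method-to-phrase mapping.
# The entries below are hypothetical examples, not the real mapping.
api_to_phrases = {
    "android.location.LocationManager.getLastKnownLocation":
        {"location", "geographic location"},
    "android.net.wifi.WifiInfo.getMacAddress":
        {"mac address", "device identifier"},
    "android.telephony.TelephonyManager.getDeviceId":
        {"device identifier", "unique identifier"},
}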

Detecting Violations
The resulting mapping between API methods and the language used in privacy policies made violation detection possible. To do so, we used the information flow analysis tool FlowDroid [1] to detect API invocations that produce sensitive information and then relay it to the network. We considered such invocations probable instances of data collection. If such a method invocation did not have a corresponding phrase in the app’s privacy policy, it was flagged as a potential privacy policy violation.
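
The final check can be sketched as follows. This is a simplified Python illustration under our assumptions: the flow analysis has already produced the list of leaking methods, a naive substring match stands in for the actual phrase analysis, and api_to_phrases is the mapping sketched above.

def find_potential_violations(leaking_methods, api_to_phrases, policy_text):
    # Flag API methods whose data reaches the network but whose associated
    # phrases never appear in the privacy policy.
    policy = policy_text.lower()
    violations = []
    for method in leaking_methods:
        phrases = api_to_phrases.get(method, set())
        if not any(phrase in policy for phrase in phrases):
            violations.append(method)
    return violations

# Hypothetical usage: methods reported by the flow analysis as reaching
# a network sink, checked against the app's published policy text.
leaking = ["android.location.LocationManager.getLastKnownLocation",
           "android.telephony.TelephonyManager.getDeviceId"]
with open("privacy_policy.txt") as f:
    print(find_potential_violations(leaking, api_to_phrases, f.read()))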

Using the above technique, we discovered 341 violations in the top 477 Android applications. We believe this implies a lack of a policy verification system for developers and end users alike.

Implications for Developers
Based on our results, we believe that this information and framework can be used to help developers ensure consistency for their own privacy policies. To this end, we are extending our work with an IDE plugin to aid developers in consistency verification, as well as a web-based tool for checking compiled apps against their policies. We believe that such tools could be invaluable, especially to smaller development teams that may not have the legal resources available to more established development firms. Ultimately, access to such tools could lead not only to a better development experience, but also to a better product for the end user.

References
[1] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel. FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. In 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014.
[2] S. Rasthofer, S. Arzt, and E. Bodden. A machine-learning approach for classifying and categorizing Android sources and sinks. In Network and Distributed System Security Symposium, 2014.
[3] J.R. Reidenberg, T. D. Breaux, L. F. Cranor, B. French, A. Grannis, J. T. Graves, F. Liu, A. M. McDonald, T. B. Norton, R. Ramanath, et al. Disagreeable privacy policies: Mismatches between meaning and users’ understanding. Berkeley Tech. LJ 30 (2014): 39.
[4] R. Slavin, X. Wang, M. Hosseini, W. Hester, R. Krishnan, J. Bhatia, T. D. Breaux, and J. Niu. Toward a framework for detecting privacy policy violations in Android application code. In 38th ACM/IEEE International Conference on Software Engineering, 2016, Austin, Texas.


Sunday, May 1, 2016

Why not use open source code examples? A Case Study of Prejudice in a Community of Practice

by Ohad Barzilay, Tel Aviv University (ohadbr@tau.ac.il)
Associate Editor: Christoph Treude (@ctreude)

With so much open source code available online, why do some software developers avoid using it? That was the research question guiding a recently published qualitative grounded-theory study [1]. We analyzed the perceptions of professional software developers as manifested in the LinkedIn online community, and used the theoretical lens of prejudice theory to interpret their answers in a broader context.

We focused on developers’ perception of (re)using code examples - existing code snippets that are used in a new context. Our definition of a code ‘example’ is broad; some of the code examples which appear on the Internet were not written in order to be reused. Code examples may accompany answers on Q&A sites [2], illustrate an idea in an online tutorial, or even be extracted from an open source project [7].

We suggest that developers’ approach to using code examples is dominated by their personality and affected by concerns such as community identity, ownership and trust. We find that developers’ perception of such reuse goes beyond activities and practices, and that some developers associate the use of code examples with a negative character. Some of these developers stereotype habitual example users as inferior and unprofessional.

It should be noted that not only human aspects are associated with example usage – other issues are involved in this activity, such as engineering aspects (e.g., search techniques and tools) and legal issues (e.g., copyright and licensing). These issues are outside the scope of our discussion; however, we believe that these challenges can be mitigated with proper tools [9], training and organizational support (e.g., leveraging social media cues [8] and teaching developers which code they can use, and under what circumstances).

Code Writers vs. Copy-and-Paste Monkeys

Some software developers perceive themselves as code writers, and feel strongly about it. Their identity and sometimes even their self-esteem are derived from perceiving themselves that way. As suggested by Brewer [5], this may result in the creation of ingroup bias, which can in turn be used as a platform for hate of the outgroup – in this domain, example users. For the (virtual) group of code writers, new code is the unit of progress, a sign of productivity (however misleading it may sometimes be). Copying, on the other hand, is perceived as a devalued shortcut – an imitation rather than a creation. In most university courses, students are not allowed to share their work with fellow students, but are expected to write their own code.

Ingroup bias often limits the boundaries of trust and cooperation [6], which may explain why some developers avoid copy and paste at all costs. They do not trust other programmers enough to take responsibility for and ownership of their code. These programmers find it difficult to understand existing code; they feel that they can neither identify fallacies in someone else's code nor test it thoroughly. They prefer to write the code themselves and take responsibility for it rather than trust others and perhaps lose control over their code.

Furthermore, we find that opponents of example usage do not conform to organizational goals, specifically the need for speed. They do not acknowledge the dexterity and practices required for effective example usage, and they aspire to be held in high regard (as opposed to being "merely plumbers" [4], a view that reduces the essence of the software engineering job to putting components together and making small, non-glorious fixes). After all, some of them might have chosen programming as their profession because of its status.

Implications

In a commercial context, revealing implicit prejudice and disarming it may allow developers to leverage further benefits of code reuse, and may improve collaboration among individuals, teams and organizations. Moreover, prejudice may interfere with achieving organizational goals or with conducting organizational change. Some of these concerns may be mitigated by providing a comprehensive ecosystem of tools, practices, training and organizational support [3]. With the prejudice lens in mind, one may incorporate methods that have proven effective in addressing prejudice in other contexts (racism, sexism, nationalism) into the software engineering management toolbox.

Finally, this study may also be considered in the broader context of the changing software engineering landscape. The recent availability of information on the Web – and, in our context, of source code – is challenging the way software is produced. Some of the main abstractions used in the software domain, namely development and construction, do not adequately describe the emerging practices involving pragmatic and opportunistic reuse. These practices favor composing over constructing and finding over developing. In this context, prejudice can be perceived as a reaction to change and an act resulting from fear of the new and unknown.

References

[1] O. Barzilay and C. Urquhart. Understanding reuse of software examples: A case study of prejudice in a community of practice. Information and Software Technology 56, pages 1613-1628, 2014.
[2] O. Barzilay, C. Treude, and A. Zagalsky. Facilitating crowd sourced software engineering via Stack Overflow. In S. E. Sim and R. E. Gallardo-Valencia, editors, Finding Source Code on the Web for Remix and Reuse, pages 289–308. Springer New York, 2013.
[3] O. Barzilay. Example embedding. In Proceedings of the 10th SIGPLAN symposium on New ideas, new paradigms, and reflections on programming and software, pages 137-144, 2011, ACM.
[4] O. Barzilay, A. Yehudai, and O. Hazzan. Developers attentiveness to example usage. In Human Aspects of Software Engineering, HAoSE ’10, pages 1–8, New York, NY, USA, 2010. ACM.
[5] M. B. Brewer. The psychology of prejudice: Ingroup love and outgroup hate? Journal of social issues, 55(3):429–444, 1999.
[6] S. L. Jarvenpaa and A. Majchrzak. Knowledge collaboration among professionals protecting national security: Role of transactive memories in ego-centered knowledge networks. ORGANIZATION SCIENCE, 19(2):260–276, 2008.
[7] S. E. Sim and R. E. Gallardo-Valencia, editors. Finding Source Code on the Web for Remix and Reuse. Springer, 2013.
[8] C. Treude and M. P. Robillard. Augmenting API documentation with insights from Stack Overflow. Forthcoming ICSE ’16: 38th Int’l. Conf. on Software Engineering, 2016.
[9] A. Zagalsky, O. Barzilay, and A. Yehudai. Example overflow: Using social media for code recommendation. In Proceedings of the Third International Workshop on Recommendation Systems for Software Engineering, pages 38-42, 2012, IEEE Press.