Sunday, September 25, 2016

The Value of Applied Research in Software Engineering



By: David C. Shepherd, ABB Corporate Research, USA (@davidcshepherd)

Associate Editor: Sonia Haiduc, Florida State University, USA (@soniahaiduc)


Jean Yang’s thoughtful post “Why It’s Not Academia’s Job to Produce Code that Ships” posits that academics should (and must) be free to explore ideas with no (immediate) practical relevance. Agreed, as these long-term, deep, seemingly impractical research projects often end up causing major advances. However, she goes further. In her defense of fundamental research she claims, “In reality, most research--and much of the research worth doing--is far from being immediately practical.” This subtle disdain for applied research, all too common in academic circles, has had deep implications for our field’s impact and ultimately widens the already cavernous gap between academic research and industrial practice. In this example-driven article I’ll discuss why applied research is a key enabler for impact, identify common vehicles for applied research, and argue for better metrics of scientific evaluation that measure both fundamental and applied contributions. Impacting the practice of software engineering is a huge challenge; it will require the best from our fundamental AND applied researchers.


Applied Research Reduces Risk

Software engineering research has produced thousands of novel solutions in the past decade. In many cases these approaches are rigorously evaluated, but only in the lab: that is, not by real developers and not in the field, since a field evaluation would require implementing the approach in a production-quality tool. Thus, companies looking to adopt more effective software engineering techniques face a huge amount of risk. Should they adopt technique A, which had great success in lab evaluations but has never been tested in the field? The answer to this question is almost always “No”. Fortunately, applied research can reduce this risk dramatically, leading to more adoption and greater impact on industrial practice. Below I briefly discuss two examples.

In 2004, Andrian Marcus first published the concept of searching source code using information retrieval techniques. It is tempting to think that this work should have been immediately transferred to modern development environments. However, when my colleagues and others implemented prototypes of the concept and released them to the public, it became evident that many research challenges remained. We discovered that the query/term mismatch problem is even more pronounced in source code, and thus developed a source-code-focused autocomplete functionality. We (and others) discovered how difficult it is to split identifiers into words (e.g., ‘OpenFile’ into ‘open’ and ‘file’) due to the many different naming conventions. We found that indexing speed is a major issue, as most searches occur within about 15 seconds of opening a project. By discovering and solving these applied research challenges, this work is now ready to be transferred at a much lower level of risk.
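Identifier splitting sounds trivial, but mixed naming conventions make it messy. A minimal sketch of the idea (my own illustration, not the authors' implementation) that handles snake_case, camelCase, PascalCase, and embedded acronyms:

```python
import re

def split_identifier(identifier):
    """Split a source-code identifier into lowercase words.

    Handles snake_case, camelCase, PascalCase, and ALLCAPS acronyms,
    e.g. 'parseXMLFile' -> ['parse', 'xml', 'file'].
    """
    # Break on underscores, digits, and other non-word separators first.
    parts = re.split(r"[_\W0-9]+", identifier)
    words = []
    for part in parts:
        if not part:
            continue
        # Split acronym/word boundaries ('XMLFile' -> 'XML', 'File')
        # and lower/upper transitions ('openFile' -> 'open', 'File').
        for m in re.finditer(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+", part):
            words.append(m.group(0).lower())
    return words

print(split_identifier("OpenFile"))       # ['open', 'file']
print(split_identifier("parseXMLFile"))   # ['parse', 'xml', 'file']
```

Even this sketch fails on identifiers like 'HTMLparser' or abbreviations with no case hints ('strcpy'), which is exactly why the problem turned out to be research-worthy.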

The Microsoft Research team behind Pex, an automated test generation tool, can also attest to the importance of refining and revising research contributions through real-world usage. Their work actually began as the Spec Explorer project, a model-based testing tool with an embedded model checker, which they quickly discovered was too complex for the average user. They noted, “After observing the difficulties of [training] the target tool users, the first author moved on to propose and focus on a more lightweight formal-testing methodology, parameterized unit testing.” Later, when they began releasing Pex as an independent extension for Visual Studio, they soon discovered users had “…strong needs of tool support for mocking (independent of whether Pex… is used), [and so] the Pex team further invested efforts to develop Moles.” Having refined their approach via user feedback and solved practical blockers by creating Moles (a mocking framework), they have since shipped a simplified version of Pex, called IntelliTest, with Visual Studio, which has received many positive reviews.

Vehicles for Applied Research

While I may have convinced you that applied research is a key step in the innovation pipeline, it can be difficult to imagine how it can be conducted in today’s academic and economic climate. Below I detail ways in which researchers have used open source software, startups, and product groups as vehicles for driving applied research. Their approaches can serve as patterns for applied researchers to follow.

Creating or contributing to open source software can be a great vehicle for applied research. It allows you to gather contributions from others and get your work out to users with a minimum of hassle. Perhaps one of the best examples of using open source as a successful means for performing applied research is Terence Parr and his work on ANTLR, a project he founded in 1988. In the course of his work on this parser generator he has balanced acquiring users with contributing to theory. In the past few years he has both published at competitive venues (e.g., OOPSLA and PLDI) and supported major products at Twitter, Google, Oracle, and IBM. Impressive. Unfortunately, it appears that his achievements are not appreciated in academia, as he recently tweeted, “I wish tenure/promotion committees knew what software was…and valued it for CS professors.”

Startups can be another great vehicle for applied research, fleshing out the ideas started during a thesis or project into a full-fledged technology. The team at AnswerDash provides a great example. Parmit Chilana’s thesis work focused on providing contextual help to website visitors. The concept appeared promising when first published but was untested in the “real world”. The authors, Parmit along with professors Andrew Ko and Jacob Wobbrock, spent significant effort performing applied research tasks such as deploying and gathering feedback from live websites prior to creating a spinoff. The applied research brought up (and answered) crucial questions, reducing risk and ultimately creating enough value that AnswerDash has received over $5M in funding to date.

Finally, for those working in, or even collaborating with, a large company, product improvements can be a great vehicle for applied research. In this space there is no shortage of examples. From Pex and Moles (which became IntelliTest) to Excel’s Flash Fill to refactorings in NetBeans, researchers have found it very effective to convince an existing product team that their improvements can help the product, and then leverage that team’s development power. As Sumit Gulwani of Flash Fill fame says, “If we want to do more, and receive more funding, then we need a system in which we give back to society in a shorter time-frame by solving real problems, and putting innovation into tangible products.”

Rewards for Applied Research

“One of the major reasons I dropped out of my PhD was because I didn’t believe academia could properly value software contributions” – Wes McKinney, author of Pandas package for data analysis

Unfortunately, in spite of the importance of applied research for increasing our field’s impact, there are few, if any, rewards for undertaking this type of work. Tenure committees, conferences, professional societies, and even other researchers do not properly value software contributions. In this section I discuss how we as a community can make changes that continue to reward fundamental researchers while also rewarding applied researchers.

One of the most important issues to address, for both academic and industrial researchers, is which metrics they are evaluated on. Here I hold up ABB Corporate Research as an example of an enlightened approach. When I applied for ABB’s global “Senior Principal Scientist” position, which is awarded annually and maintained at 5-10% of the overall researcher population, I had to create an application packet similar to a tenure packet. Unlike an academic tenure packet, however, I was encouraged to list applied research metrics such as tool downloads, talks at developer conferences, tool usage rates, blog post hits, and ABB-internal users. While I did not use it at the time, I would also add the amount of collected telemetry data to this list (e.g., collected 1,000 hours of activity from 100 users working in the IDE). At ABB Corporate Research these applied metrics are weighted equally with, if not more heavily than, traditional metrics such as citation count.

The CRA also provides guidelines on how to evaluate researchers for promotion and tenure. In 1999 the CRA recommended that, when measuring the impact of a researcher’s work, factors like “…the number of downloads of a (software) artifact, number of users, number of hits on a Web page, etc…” may be used, which, especially for the time, was progressive. Unfortunately, after listing these potential metrics the report spends two paragraphs discrediting them, stating that “…popularity is not equivalent to impact…” and that “it is possible to write a valuable, widely used piece of software inducing a large number of downloads and not make any academically significant contribution.” In brief, the CRA stops short of recommending the type of metrics that would reward applied researchers.

Second to how researchers are evaluated is how their outputs are evaluated. In this area there have been small changes in how grants are evaluated, yet the impact of these changes remains to be seen. The NSF, the US’s primary funding agency for computer science research, started allowing researchers to list “Products” instead of “Publications” in their biographical sketch in 2012. This allows researchers to list contributions such as software and data sets in addition to relevant publications. Unfortunately, I have no evidence as to how seriously these non-publication “Products” are taken when grants are evaluated (I would love to hear from those reviewing NSF grants).

While some companies like ABB and institutions like the NSF, and to a lesser extent the CRA, have begun to consider applied metrics, adoption is not widespread enough to effect sweeping change. To increase the impact our field has on practice, I estimate we would need at least one quarter of our field to be composed of applied researchers; as of today we have less than five percent. The adoption and acceptance of applied metrics may be the single biggest change we could make as a community to increase impact.

Conclusion

Academia’s not-so-subtle disdain for applied research does more than damage a few promising careers; it renders our field’s output useless, destined to collect dust on the shelves of Elsevier. We cannot and should not accept this fate. In this article I have outlined why applied research is valuable (it reduces risk), how it can be undertaken in the current climate (e.g., open source, startups, and product contributions), and how to measure its value (e.g., downloads, hits, and usage data). I hope these discussions can serve as a starting point for chairs, deans, and industrial department managers wishing to encourage applied research.

Wondering how to publish your applied research work at a competitive venue? Consider submitting to an industry track: the SANER Industrial Track, the ICPC Industry Track, the FSE Industry Track, or ICSE SEIP. I know one of the PC co-chairs and can assure you that he values applied work.


Monday, September 19, 2016

Comprehensively testing software patches with symbolic execution

Associate Editor: Sarah Nadi, University of Alberta (@sarahnadi)


A large fraction of the cost of maintaining software applications is associated with detecting and fixing errors introduced by patches. Patches are prone to introducing failures [4, 10], and as a result users are often reluctant to upgrade their software to the most recent version [3], relying instead on older versions which typically expose a reduced set of features and are frequently susceptible to critical bugs and security vulnerabilities. To properly test a patch, each line of code in the patch, and all the new behaviours introduced by the patch, should be covered by at least one test case. However, the large effort involved in coming up with relevant test cases means that such thorough testing rarely happens in practice.

The Software Reliability Group at Imperial College London has invested a significant amount of effort in the last few years on devising techniques and tools for comprehensively testing software patches. The main focus of our work has been on developing dynamic symbolic execution techniques that automatically detect bugs and augment program test suites. In this article, we discuss two interrelated projects, KATCH [7] and Shadow [9], whose objectives and relationship are depicted in Figure 1. In a nutshell, KATCH aims to generate program inputs that cover the lines of code added or modified by a patch, while Shadow further modifies existing test inputs in order to trigger the new behaviour introduced by a patch.






It is rather surprising that lots of code is added or modified in mature software projects without a single test that exercises it [8]. In response to this poor culture of testing software patches, we have designed KATCH, a system whose goal is to automatically generate inputs that exercise the lines of code of a patch. KATCH is based on symbolic execution [2], a program analysis technique that can systematically explore a program’s possible executions. Symbolic execution replaces regular program inputs with symbolic variables that initially represent any possible value. Whenever the program executes a conditional branch instruction that depends on symbolic data, the feasibility of each branch is analysed and execution is forked for each feasible branch.
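The forking behaviour can be sketched with a toy model: instead of a real constraint solver, we approximate a symbolic integer by the finite set of concrete values it could still take, and "fork" by partitioning that set at every branch. The program and names below are my own illustration, not KATCH itself:

```python
def explore(domain, branches, path=()):
    """Recursively fork execution at each symbolic branch predicate.

    `domain` stands in for the path constraint: the set of concrete
    values the symbolic input can still take on this path.
    Returns (branch-decisions, feasible inputs) per explored path.
    """
    if not branches:
        return [(path, sorted(domain))]
    pred, rest = branches[0], branches[1:]
    paths = []
    for taken in (True, False):
        sub = {x for x in domain if pred(x) == taken}
        if sub:  # fork only into feasible branches
            paths += explore(sub, rest, path + (taken,))
    return paths

# A program sketch with two branches on symbolic input x:
#   if x > 10: ...        then
#   if x % 2 == 0: ...
branches = [lambda x: x > 10, lambda x: x % 2 == 0]
for path, inputs in explore(set(range(20)), branches):
    print(path, "covered by e.g. x =", inputs[0])
```

Each explored path yields a concrete witness input, which is exactly what makes symbolic execution useful for test generation: one test case per feasible path.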

In its standard instantiation, symbolic execution aims to achieve high coverage of the entire program. Instead, the challenge for KATCH is to focus on the patch code in order to quickly generate inputs that cover it. KATCH accomplishes this goal by starting from existing test cases that partially cover or come close to covering the patch (such test cases often exist, so why not make use of them?), and combining symbolic execution with targeted heuristics that aim to direct the exploration toward uncovered parts of the patch.

We evaluated KATCH on 19 programs from the GNU diffutils, binutils and findutils systems, a set of mature and widely-used programs installed on virtually all UNIX-based distributions. We included all the patches written over an extended period of time, approximately six years cumulatively across all programs. Such an unbiased selection is essential if one is to show both the strengths and weaknesses of a technique.

The results are presented in Table 1, which shows the number of basic blocks across all patches for each application suite, and the fraction of basic blocks exercised by the existing test suite and the existing test suite plus KATCH. It also shows the number of crash bugs revealed by KATCH. The highlights of our results are: (1) KATCH can significantly increase the patch coverage in these applications in a completely automatic fashion, and (2) there are still significant (engineering and fundamental) limitations: for instance, patch coverage for binutils is still at only 33% of basic blocks, although the additional coverage achieved by KATCH was enough to find 14 previously-unknown crash bugs in these programs.




Now consider a hypothetical scenario where KATCH can generate a test input for every single line in a patch. Would that be sufficient in order to adequately test the patch? Let us illustrate the question using a simple example. Consider the two versions of code below, in which the second version changes only x % 2 to x % 3, and let us assume that this statement can be executed at most once by a deterministic program and that the variable x is only referenced here:
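Based on the description, the two versions can be sketched as follows (function names and branch bodies are illustrative placeholders; only the branch condition changes):

```python
def old_version(x):
    if x % 2 == 0:           # original condition
        return "then-branch"
    else:
        return "else-branch"

def new_version(x):
    if x % 3 == 0:           # the patch: x % 2 becomes x % 3
        return "then-branch"
    else:
        return "else-branch"
```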


Let us assume that the developers chose inputs x = 6 and x = 7 to test the patch. Do these inputs comprehensively test the patch? The first reaction might be to say yes, since they cover each side of the branch in both the old and the new version. However, the true answer is that these are bad choices, as each of these inputs follows the same side of the branch in both versions, making the program behave identically before and after the patch. Instead, inputs such as x = 8 or x = 9 are good choices, as they trigger different behaviour at the code level: e.g. x = 8 follows the ‘then’ side in the old version, but the ‘else’ side in the new version.


Shadow symbolic execution is a new technique that assists developers with the generation of such inputs. Our tool Shadow starts with an input that exercises the patch, either constructed manually or synthesised by KATCH, and generates new inputs that trigger different behaviour at the code level in the unpatched and patched versions. Like KATCH, Shadow is based on symbolic execution, this time augmented with the ability to perform a “four-way fork”, as shown in Figure 2. Whenever we reach a branch condition that evaluates to semantically-different expressions in the two versions (say, ‘old’ in the old version and ‘new’ in the new version), instead of forking execution into two paths based on the behaviour of the new version, as in standard symbolic execution, we fork into up to four ways. On two of these paths, the two versions behave identically (denoted by same in the figure): both versions take either the then branch (new ∧ old) or the else branch (¬new ∧ ¬old). On the other two paths, the executions of the two versions diverge (denoted by diff in the figure): either the new version takes the then branch and the old version the else branch (new ∧ ¬old), or vice versa (¬new ∧ old). Of course, the last two cases are the ones of interest to Shadow, and when it encounters them, it generates a new input triggering the divergent behaviour between the two versions.
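For the x % 2 to x % 3 patch above, the four conjunctions can be solved by brute force over a small input range (a real implementation uses a constraint solver; this enumeration is just to convey the idea):

```python
# Four-way fork for the x % 2 -> x % 3 patch, by enumeration.
old = lambda x: x % 2 == 0   # branch condition in the old version
new = lambda x: x % 3 == 0   # branch condition in the new version

cases = {
    "same, both then (new AND old)":         lambda x: new(x) and old(x),
    "same, both else (NOT new AND NOT old)": lambda x: not new(x) and not old(x),
    "diff (new AND NOT old)":                lambda x: new(x) and not old(x),
    "diff (NOT new AND old)":                lambda x: not new(x) and old(x),
}

for name, cond in cases.items():
    # the first satisfying input is a test case for that path
    witness = next(x for x in range(1, 20) if cond(x))
    print(f"{name}: e.g. x = {witness}")
```

The two "diff" cases produce exactly the kind of divergence-triggering inputs (like x = 8 or x = 9) that the text argues developers should be testing with.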

Shadow is not a fully automatic technique, requiring developers to create a single unified program in which the two versions are merged via ‘change’ annotations. For instance, in our example, the two versions would be unified by creating the if statement if change(x % 2, x % 3), in which the first argument represents the code expression from the old version and the second argument the corresponding expression from the new version. While mapping program elements across versions is a difficult task  [5], we discovered that in practice the process can be made sufficiently precise and could be (partially) automated using predefined patterns.
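A rough sketch of what such a unified program looks like (Shadow's real annotations operate on C source analysed by the tool; this hypothetical Python `change` helper only conveys the idea):

```python
# Hypothetical 'change' annotation: at a changed program point the
# unified program exposes both versions' expressions, so the analysis
# can compare them; plain concrete execution follows the new version.
divergence_log = []

def change(old_val, new_val):
    """Record a candidate divergence site and return the new value."""
    if old_val != new_val:
        divergence_log.append((old_val, new_val))
    return new_val

def unified_version(x):
    # unified form of the patch: old 'x % 2' vs. new 'x % 3'
    if change(x % 2, x % 3) == 0:
        return "then-branch"
    return "else-branch"

print(unified_version(8))  # behaves as the new version
```

In the real tool the comparison is done symbolically, of course, but the unified-program shape is the same: one executable that carries both versions of every changed expression.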

We evaluated Shadow on the 22 patches from the GNU Coreutils programs included in the CoREBench suite of regression bugs. Similar to the benchmarks used to evaluate KATCH, these are mature, widely-used applications available in most UNIX distributions. We chose the CoREBench patches because they are known to introduce bugs, and furthermore, the bug fixes are known.

After applying Shadow, we were able to generate inputs that trigger code-level divergences for all but one patch. We were further able to generate inputs for which the two versions generate different outputs, as well as inputs that abort or trigger memory errors in the new version. Some sample inputs generated by Shadow are shown in Table 2.

While Shadow was not successful in all cases, the results are promising in terms of its ability to find regression bugs and augment the test suite in a meaningful way. Furthermore, even generated inputs exposing expected divergences are great candidates for addition to the program test suites, and can act as good documentation for program changes.





As most readers would agree, continuously changing a program without a single input that exercises those changes is unsustainable. On the other hand, as all developers know, comprehensively testing program patches is a difficult, tedious, and time-consuming task. Fortunately, program analysis techniques such as symbolic execution are becoming more and more scalable, and can be effectively extended to generate inputs that exercise program changes (as we did in KATCH) and trigger different behaviour across versions (as we did in Shadow). More work is required to scale these approaches to large software systems and integrate them into the development cycle, but initial results are promising. To stimulate research in this space, we make our benchmarks and our systems available for comparison (see https://srg.doc.ic.ac.uk/projects/katch/ and https://srg.doc.ic.ac.uk/projects/shadow/). More details on our techniques can be found in our publications [1, 6, 7, 9].


[1] C. Cadar and H. Palikareva. Shadow symbolic execution for better testing of evolving software. In ICSE NIER’14, May 2014.
[2] C. Cadar and K. Sen. Symbolic execution for software testing: three decades later. CACM, 56(2):82–90, 2013.
[3] O. Crameri, N. Knezevic, D. Kostic, R. Bianchini, and W. Zwaenepoel. Staged deployment in Mirage, an integrated software upgrade testing and distribution system. In SOSP’07, Oct. 2007.
[4] Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? In ICSE’10, May 2010.
[5] M. Kim and D. Notkin. Program element matching for multi-version program analyses. In MSR’06, May 2006.
[6] P. D. Marinescu and C. Cadar. High-coverage symbolic patch testing. In SPIN’12, July 2012.
[7] P. D. Marinescu and C. Cadar. KATCH: high-coverage testing of software patches. In ESEC/FSE’13, Aug. 2013.
[8] P. D. Marinescu, P. Hosek, and C. Cadar. Covrig: a framework for the analysis of code, test, and coverage evolution in real software. In ISSTA’14, July 2014.
[9] H. Palikareva, T. Kuchta, and C. Cadar. Shadow of a doubt: testing for divergences between software versions. In ICSE’16, May 2016.
[10] Z. Yin, D. Yuan, Y. Zhou, S. Pasupathy, and L. Bairavasundaram. How do fixes become bugs? In ESEC/FSE’11, Sept. 2011.











Sunday, September 11, 2016

Should we care about context-specific code smells?

by Maurício Aniche, Delft University of Technology (@mauricioaniche)
Associate Editor: Christoph Treude (@ctreude)

Detecting the code smells that a software development team actually cares about can be challenging. I remember the last software development team I was part of. We were responsible for developing a medium-sized web application. Code quality was, as in many other real-world systems, ok-ish: some components were very well-written; others were a nightmare.

That team used to get together often to discuss how to improve the quality of the system's source code. I wanted to talk about God Classes, Feature Envies, and Brain Methods, and all the new strategies to detect them. To my surprise, they never wanted to talk about those: "Yes, this is problematic, but take a look at this Controller and count the number of different endpoints it has... That's more important!", or "Yes, I understand that, but take a look at this Repository... It has lots of complicated SQL. We need to refactor it first".

I realized something that we all take for granted, but sometimes forget: context matters. My team wanted to talk about code smells that were specific to their context: a Java-based web application that uses Spring MVC as its MVC framework and Hibernate as its persistence framework. We decided to investigate these smells in more detail. Similar research has already been conducted by other researchers, for example on the usage of object-relational mapping frameworks [1], Android apps [2,3], and Cascading Style Sheets (CSS) [4].

By means of interviews and surveys with many software developers experienced in this kind of architecture, we catalogued 6 new smells. However, these smells do not apply to every kind of system; they only apply to systems that use Spring MVC. They are specific to the context!

As examples, we can cite Brain Repository, which occurs when a repository contains too much logic (in SQL, in code, or both), and Promiscuous Controller, which occurs when a Controller offers too many actions/endpoints.
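Detection strategies for such smells can be quite simple. For illustration only (this is not Springlint's implementation, and the threshold is an arbitrary placeholder), a Promiscuous Controller check might just count request-mapping annotations in a controller's source:

```python
import re

# Arbitrary illustrative threshold; a real tool derives it empirically.
MAX_ENDPOINTS = 10

# Spring MVC mapping annotations (class-level @RequestMapping is
# counted too, which a real detector would want to exclude).
ENDPOINT_RE = re.compile(
    r"@(?:RequestMapping|GetMapping|PostMapping|PutMapping|DeleteMapping)\b")

def is_promiscuous_controller(java_source):
    """Roughly flag a controller that exposes too many actions/endpoints."""
    return len(ENDPOINT_RE.findall(java_source)) > MAX_ENDPOINTS

controller = '''
@Controller
public class UserController {
    @RequestMapping("/users")     public String list()   { return "list"; }
    @RequestMapping("/users/new") public String create() { return "new"; }
}
'''
print(is_promiscuous_controller(controller))  # only 2 endpoints: False
```

As the bullet list further down argues, a computer can run a check like this across the whole codebase far faster than any human reviewer.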

We have shown [5] that these smells are indeed harmful. Classes affected by these smells are more change-prone than clean classes. As part of our study, we performed an experiment with 17 developers who did not know about these smells. They perceived smelly classes as problematic. Curiously, although without statistical significance, they rated "context smells" as more problematic than "traditional smells".

We explored the problem further and investigated how code metric assessment behaves for specific architectures. We found that, for specific roles in the system, traditional threshold-based code metric assessment can lead to unexpected results. For example, in MVC applications, Controllers have significantly higher coupling metrics than other classes. Thus, if we use the same threshold to find problematic Controllers as we do for other classes, we end up with lots of false positives.

To deal with this problem, we propose SATT [11]. The approach

  1. analyzes the distribution of code metric values for each architectural role in a software system,
  2. verifies whether the distribution is significantly higher or lower than other classes, and
  3. provides a specific threshold for that architectural role in that software architecture.
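The three steps above can be sketched as follows (a stdlib-only toy: the "significantly higher" test and the percentile-based threshold are crude stand-ins for SATT's actual statistics, and the "lower" direction is omitted for brevity):

```python
from statistics import median

def satt_thresholds(metric_by_role, default_threshold, factor=1.5):
    """Derive role-specific metric thresholds (steps 1-3, sketched).

    1. look at each architectural role's metric distribution;
    2. crudely test whether it is 'significantly higher' than the
       rest (median more than `factor` times the others' median);
    3. if so, use the role's own 90th percentile as its threshold.
    """
    thresholds = {}
    for role, values in metric_by_role.items():
        others = [v for r, vs in metric_by_role.items() if r != role for v in vs]
        if others and median(values) > factor * median(others):
            ranked = sorted(values)
            thresholds[role] = ranked[int(0.9 * (len(ranked) - 1))]
        else:
            thresholds[role] = default_threshold
    return thresholds

# Toy coupling values: Controllers are systematically more coupled,
# so they get their own threshold instead of the default.
coupling = {
    "Controller": [14, 16, 18, 20, 22, 25],
    "Other":      [2, 3, 3, 4, 5, 6, 7, 8],
}
print(satt_thresholds(coupling, default_threshold=9))
```

With a single global threshold of 9, every Controller above would be flagged; the role-specific threshold flags only the genuine outliers within the Controller population.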

Our approach seems to provide better assessments for architectural roles which are significantly different from other classes.

If you use Spring MVC, you can make use of all of this with our open source Springlint tool: http://www.github.com/mauricioaniche/springlint.

With that being said, we see a couple of tasks that practitioners can do in order to better spot code smells:

  • Understand their system's architecture and what the specific smells are. For example, if you are developing a mobile application, your code might suffer from smells that would not occur in a web service.
  • Share this knowledge with the rest of your team. In another study, we found that developers do not have a common perception about their system's architecture [6].
  • Look for simple detection strategies and implement them. A computer can spot these classes much faster than you!
  • Monitor and safely refactor the smelly classes.

We are not arguing that God Classes or other traditional smells are not useful. They are, and researchers have shown their negative impact on source code before [7,8,9,10]. But your application may smell differently from these, and the smell may not be good.

References

[1] Tse-Hsun Chen, Weiyi Shang, Zhen Ming Jiang, Ahmed E. Hassan, Mohamed Nasser, and Parminder Flora. 2014. Detecting performance anti-patterns for applications developed using object-relational mapping. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). ACM, New York, NY, USA, 1001-1012.
[2] Daniël Verloop. 2013. Code Smells in the Mobile Applications Domain. Master thesis, Delft University of Technology.
[3] Geoffrey Hecht, Romain Rouvoy, Naouel Moha, and Laurence Duchien. 2015. Detecting antipatterns in Android apps. In Proceedings of the Second ACM International Conference on Mobile Software Engineering and Systems (MOBILESoft '15). IEEE Press, Piscataway, NJ, USA, 148-149.
[4] Davood Mazinanian and Nikolaos Tsantalis. 2016. An empirical study on the use of CSS preprocessors. In Proceedings of the 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER '16). IEEE Computer Society, Washington, DC, USA, 168-178.
[5] Maurício Aniche, Gabriele Bavota, Christoph Treude, Arie van Deursen, Marco Gerosa. 2016. A Validated Set of Smells in Model-View-Controller Architecture. In Proceedings of the 32nd International Conference on Software Maintenance and Evolution (ICSME '16), IEEE Computer Society, Washington, DC, USA, to appear.
[6] Maurício Aniche, Christoph Treude, Marco Gerosa. 2016. Developers' Perceptions on Object-Oriented Design and System Architecture. In Proceedings of the 30th Brazilian Symposium on Software Engineering (SBES 2016). To appear.
[7] Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Andrea De Lucia. 2014. Do They Really Smell Bad? A Study on Developers' Perception of Bad Code Smells. In Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution (ICSME '14). IEEE Computer Society, Washington, DC, USA, 101-110.
[8] Foutse Khomh, Massimiliano Di Penta, and Yann-Gael Gueheneuc. 2009. An Exploratory Study of the Impact of Code Smells on Software Change-proneness. In Proceedings of the 16th Working Conference on Reverse Engineering (WCRE '09). IEEE Computer Society, Washington, DC, USA, 75-84.
[9] Marwen Abbes, Foutse Khomh, Yann-Gael Gueheneuc, and Giuliano Antoniol. 2011. An Empirical Study of the Impact of Two Antipatterns, Blob and Spaghetti Code, on Program Comprehension. In Proceedings of the 15th European Conference on Software Maintenance and Reengineering (CSMR '11). IEEE Computer Society, Washington, DC, USA, 181-190.
[10] Foutse Khomh, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol. 2012. An exploratory study of the impact of antipatterns on class change- and fault-proneness. Empirical Software Engineering 17, 3 (June 2012), 243-275.
[11] Maurício Aniche, Christoph Treude, Andy Zaidman, Arie van Deursen, and Marco Gerosa. 2016. SATT: Tailoring Code Metric Thresholds for Different Software Architectures. In Proceedings of the 16th International Working Conference on Source Code Analysis and Manipulation (SCAM '16), IEEE Computer Society, Washington, DC, USA, to appear.

Sunday, September 4, 2016

IEEE Software September/October 2016 Issue

Associate Editor: Brittany Johnson (@brittjaydlf)

The September/October issue is dominated by papers on the software development process and the business behind the software that is produced. One important aspect of software development and, more importantly, the ability to sell software (or rather thrive as a software company) is the ability to innovate. Innovation, by definition, is the ability to come up with new ideas that improve the state of the art. Amazon is a prime example of what innovation can do for a company, ranked #8 on Forbes' 2015 list of the world's most innovative companies AND #29 in the 2015 Fortune 500. And I'm willing to bet just about everyone reading this blog post has purchased at least one item from Amazon.com. Despite having no physical stores, Amazon.com was ranked the #8 retailer of 2016 by the National Retail Federation!

So what's in Amazon's secret sauce? A couple of papers in this issue of IEEE Software may shed some light on how to increase our ability to innovate and turn innovative ideas into action:

  • "How Software Development Group Leaders Influence Team Members' Innovative Behavior" by Fabio Q.B. Silva, Cleviton V.F. Monteiro, Igor Ebrahim dos Santos, and Luiz Fernando Capretz, and
  • "Innovation-Driven Software Development: Leveraging Small Companies' Product-Development Capabilities" by Ricardo Elto-Brun and Miguel-Angel Sicilia

In their article on "How Software Development Group Leaders Influence Team Members' Innovative Behavior", the authors explore the effect of leadership style on a team's ability to innovate, using empirical evidence from existing literature and two industrial case studies. According to their literature review, the most commonly studied leadership styles are transactional, transformational, and charismatic leadership. The authors decided to focus on the former two, noting that these two can affect innovative behavior in various ways. Surprisingly, and not surprisingly, transactional and transformational leadership can be both good and bad for innovation. The surprising part, or the part I myself did not know, was that each type of leadership fosters a complementary type of innovative behavior -- while transformational leadership inspires exploratory innovation (searching for new ways to do things and solve problems), transactional leadership inspires exploitative innovation (refining current methods to gain efficiency). I, like I think most people, think of innovation as more of the former. But really it's a combination of both that breeds true success. Take Amazon...I'm sure they spend time thinking of new ways to do things; however, looking at the consistency in their software and business model over the years, I'm willing to bet they do their fair share of exploitative innovation.
What's not surprising is that innovation requires flexibility, and truly effective leadership comes from knowing the ideal leadership style for the context. Do these types of leaders exist? If so, what do they look like, and how much more successful are they really than the combo leaders this article describes? Chime in if you think your company has dynamic leadership!

The authors of "Innovation-Driven Software Development: Leveraging Small Companies' Product-Development Capabilities" explore innovation in smaller software development teams by developing an innovation activity model. What was interesting about this work, particularly when read after the above article, is that it focuses less on how to foster new ideas and more on the process that takes an innovative idea and makes it a reality -- their innovation activity model is composed of activities, outcomes, tasks, and work products. In fact, the authors stressed that an innovation management model shouldn't focus solely on idea generation; to be truly effective, it should also consider how to diffuse the innovation to real customers and stakeholders. The authors reference the well-known Diffusion of Innovations model, developed by Everett Rogers, which has been used in a variety of contexts to help foster, or explain the lack of, adoption of innovations. As with the previous article, the authors were thorough -- combining literature reviews, interviews, and focus groups to determine objectives that would inform the construction and refinement of their model. One slight disappointment: small companies, with success, eventually become larger companies. Is this growth to be expected? Prepared for? Does the model expand with the company, or is it only useful for small companies? Questions that need answers...future work?

Again, another set of interesting reads in the September/October 2016 issue...don't believe me? You don't have to take my word for it...check it out today!

Sunday, August 28, 2016

Usage, Costs, and Benefits of Continuous Integration in Open-Source Projects

by Michael Hilton, Oregon State University, USA (@michaelhilton)
Associate Editor: Sarah Nadi, University of Alberta, Canada (@sarahnadi)

Continuous integration (CI) systems automate the compilation, building, and testing of software. A recent large survey of software development professionals found that 50% of respondents use CI [1]. Despite its widespread usage, CI has received almost no attention from the research community, leaving many unanswered questions. For example, how widely is CI used in practice, and what are the costs and benefits associated with it? To answer these questions, we studied the usage, costs, and benefits of CI in open-source projects. We examined the 34,000 most popular open-source GitHub projects to determine which CI service(s) they use. From that list, we collected over 1.5 million builds from over 600 of the most popular projects and then surveyed 442 of the developers of these projects. In the following, we answer some of your burning questions about CI.
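As a concrete illustration of how CI usage can be identified at scale: most hosted CI services require a conventionally-named configuration file in the repository root. The sketch below checks for those filenames; it is only an illustration of the idea, and the study's actual detection logic may have differed.

```python
# Map of conventional CI configuration filenames (circa 2016) to services.
# A repository containing one of these files very likely uses that service.
CI_CONFIG_FILES = {
    ".travis.yml": "Travis CI",
    "appveyor.yml": "AppVeyor",
    "circle.yml": "CircleCI",
    "Jenkinsfile": "Jenkins",
    "wercker.yml": "Wercker",
}

def detect_ci_services(repo_files):
    """Return the CI services suggested by a repo's top-level file names."""
    return sorted(CI_CONFIG_FILES[f] for f in repo_files if f in CI_CONFIG_FILES)

print(detect_ci_services(["README.md", ".travis.yml", "setup.py"]))  # ['Travis CI']
```

Scanning 34,000 repositories this way only requires listing each repository's root directory, which keeps the data collection cheap.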

Usage of CI

Is CI actually used?  We found that over 40% of all open-source projects that we examined use CI.  Note that since we cannot identify private CI servers that may be used by some projects, 40% should be considered a floor for CI usage.

What is the most popular CI for open source?  The most commonly used CI service in our data was Travis CI. Over 90% of all open-source projects using CI use Travis CI.

Are the "big names" in open source using CI?  We sorted the projects by popularity and found that of the 500 most popular projects, 70% use CI.  As projects become less popular, the percentage that use CI goes down.

Is CI a passing fad?  We asked developers in our survey if they plan on using CI for their next project.  The top two options, 'Definitely' and 'Most Likely' account for 94% of all our survey respondents.  Even among respondents who are not currently using CI, over 50% said they would use CI for their next project.

Costs of CI

Why then do some projects not use CI? The most common reason reported in our survey is that other developers in a project are not familiar with CI.  The second most popular reason was that the project does not have automated tests.

Will I have to be continuously fixing the configuration files? We found that the median project changes their CI configuration files 12 times over the lifetime of the project, so there is not a lot of churn in CI configurations.

Benefits of CI

So what exactly do developers like about CI? When we asked developers why they use CI, the most common answer was “CI helps us worry less about breaking our builds”. This was reported by 87% of our respondents. The second most common answer was that “CI helps catch bugs earlier”.

Can CI help me release more often? We found that projects with CI do in fact release faster than projects that do not use CI (0.54 releases per month with CI versus 0.24 releases per month without CI).  To control for project type, we also looked only at projects that use CI and compared how often they released before and after introducing CI.  Before CI, they released at a rate of 0.34 releases per month; after introducing CI, the release rate rose to 0.54 releases per month.
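The before/after comparison boils down to counting releases per unit time on each side of the CI adoption date. A minimal sketch, using entirely hypothetical release dates rather than the study's data:

```python
from datetime import date

def releases_per_month(release_dates, start, end):
    """Average releases per 30-day month over the interval [start, end)."""
    months = (end - start).days / 30.0
    count = sum(1 for d in release_dates if start <= d < end)
    return count / months if months > 0 else 0.0

# Hypothetical project: 3 releases in 2013, then CI adopted on 2014-01-01,
# followed by 6 releases during 2014.
releases = [date(2013, m, 1) for m in (2, 5, 9)] + \
           [date(2014, m, 1) for m in range(1, 13, 2)]
ci_adopted = date(2014, 1, 1)

before = releases_per_month(releases, date(2013, 1, 1), ci_adopted)
after = releases_per_month(releases, ci_adopted, date(2015, 1, 1))
print(round(before, 2), round(after, 2))  # 0.25 0.49
```

Comparing each project against its own pre-CI history, as the study does, sidesteps the confound that projects which adopt CI may simply be more active in general.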

Can CI help me spot problems before they happen?  Projects that use CI accept fewer pull requests than projects that do not use CI.  This could be because CI helps identify subtle issues that a quick human review would not catch.

Will CI help me save time? We found that projects with CI accept pull requests (PRs) on average 1.6 hours faster than projects that do not use CI.  This could be because of the time saved by letting the CI review the PR, as opposed to manually reviewing it.
 

Conclusion

CI can save you time, help you release more often, and help you worry less about breaking your builds.  We hope these results motivate developers to continue adopting CI, and also serve as a call to action for the research community to continue investigating this important aspect of the development process.

If you are interested in more details about our results, please read our full paper [2].




[1] Version One. 10th Annual State of Agile Development Survey. https://versionone.com/pdf/VersionOne-10th-Annual-State-of-Agile-Report.pdf, 2016.

[2] M. Hilton, T. Tunnell, K. Huang, D. Marinov, and D. Dig. Usage, Costs, and Benefits of Continuous Integration in Open-Source Projects. Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016). To appear.

Sunday, August 21, 2016

From Aristotle to Ringelmann: Using data science to understand the productivity of software development teams

By: Ingo Scholtes, Chair of Systems Design, ETH Zürich, Switzerland (@ingo_S)
Associate Editor: Bogdan Vasilescu, University of California, Davis, USA (@b_vasilescu)

I am sure that the title of this blog post raises a number of questions: What does the Greek philosopher Aristotle have to do with the 19th century French agricultural engineer Maximilien Ringelmann? And what, if anything, do the two of them have to do with software?

The answers to both questions are related to Aristotle's famous reference to systems where the "whole is greater than the sum of its parts", i.e. complex systems where interactions between elements give rise to emergent or synergetic effects. Teams of software developers who communicate and coordinate to jointly solve complex development tasks can be seen as one example of such a complex (social) system. But what are the emergent effects in such teams? How do they affect productivity, and how can we quantify them?

This is where Maximilien Ringelmann, a French professor of agricultural engineering who would later - astonishingly - become one of the founders of social psychology, enters the story. Around 1913, Ringelmann became interested in how the collective power of draught animals, like horses or oxen pulling carts or plows, changes as increasingly large teams of them are harnessed. He answered the question with a data-driven study: he asked increasingly large teams of his students to jointly pull on a rope, measuring the collective force they were able to exert. He found that the collective force of a team of students was less than the sum of the forces exerted by each team member alone, an effect that later became known as the "Ringelmann effect". One possible explanation for this finding is coordination loss: it becomes increasingly difficult to tightly synchronize actions in an increasingly large team. Moreover, social psychology has generally emphasized motivational effects due to shared responsibility for the collective output of a team, a phenomenon known by the rather unflattering term "social loafing".

Changing our focus from draught oxen to developers, let us now consider how all of this is related to software engineering. Naturally, the question of how the cost and time of software projects scale with the number of developers involved is of major interest in software project management and software economics.

In his 1975 book The Mythical Man-Month, Fred Brooks formulated his famous law of software project management, stating that "adding manpower to a late project makes it later". Just as for the Ringelmann effect, different causes for this have been discussed. First, software development naturally comes with inherently indivisible tasks which cannot easily be distributed among a larger number of developers. Illustrating this issue, Brooks stated that "nine women can't make a baby in one month". Secondly, larger team sizes give rise to an increasing need for coordination and communication that can limit the productivity of team members. And finally, for developers added to a team later, Brooks discussed the "ramp-up" time that is due to the integration and education of new team members. The result of these combined effects is that the individual productivities of developers in smaller teams cannot simply be multiplied to estimate their productivity in a larger team.

While the (empirical) software engineering community has been rather unanimous about the fact that the productivity of individual developers is likely to decrease as teams grow larger, a number of recent works in management science and data science (referenced and summarized in [1]) have questioned this finding in the context of Open Source communities. The argument is that larger team sizes increase the motivation of individuals in Open Source communities, thus giving rise to an "Aristotelian regime" where the whole team indeed produces more than expected based on the sum of its parts. The striking consequence would be that Open Source projects are instances of economies of scale, where the effort of production (in terms of team members involved) scales sublinearly with the size of the project. In contrast, traditional software projects represent diseconomies of scale, i.e. the cost of production increases superlinearly as projects become larger and more complex.

The goal of our study [1] was to contribute to this discussion by means of a large-scale data-driven analysis. Specifically, we studied a set of 58 major Open Source projects on GitHub with a history of more than ten years and a total of more than half a million commits contributed by more than 30,000 developers. The question that we wanted to answer is simple: How does the productivity of software development teams scale with team size? In particular, we were interested in whether Open Source projects indeed represent exceptions to basic software engineering economics, as argued by recent works.

Regarding methodology, answering this question involves two important challenges:
  1. We need a quantitative measure to assess the productivity of a team
  2. We must be able to calculate the size of a development team at any given point in time
In our work, we addressed these challenges as follows. First, in line with a large body of earlier studies and notwithstanding the fact that it necessarily gives rise to a rather limited notion of productivity, we use a proxy measure for productivity based on the amount of source code committed to the project's repository. There are different ways to define such a measure. While a number of previous studies have simply used the number of commits to proxy the amount of code produced, we find that code contributions per commit are so broadly distributed that the commit count cannot simply be used as a measure of productivity; doing so would substantially bias our analysis. To avoid this problem, we use the Levenshtein distance between the code versions in consecutive commits, which quantifies the number of characters edited between consecutive versions of the source code.
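The Levenshtein distance can be computed with the standard dynamic-programming recurrence. The sketch below is illustrative only (the study operates on full repository histories, not the toy strings shown here):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Productivity proxy: characters edited between two consecutive
# versions of a file, rather than a raw commit count.
v1 = "def add(a, b):\n    return a + b\n"
v2 = "def add(a, b, c=0):\n    return a + b + c\n"
print(levenshtein(v1, v2))  # 9
```

The appeal of this measure is that a one-line typo fix and a thousand-line rewrite no longer count equally, as they would under a per-commit count.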

A second non-trivial challenge is to assess the size of a development team in Open Source communities. Most of the time there is no formal notion of a team, so who should be counted as a team member at a given point in time? Again, we address this problem by means of an extensive statistical analysis. Specifically, we analyze the inter-commit times of developers in the commit log. This allows us to define a reasonably-sized time window based on the prediction of whether team members are likely to commit again after a given period of inactivity. This time window can then be used to estimate team sizes in a way that is substantiated by the temporal activity distribution in the data set (see [1] for details).
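Once a window length has been derived, estimating team size reduces to counting the distinct committers active within that window. In this rough sketch the window length and the commit log are hypothetical; the paper derives its window statistically from inter-commit times:

```python
from datetime import datetime, timedelta

def team_size_at(commit_log, when, window_days):
    """Count developers with at least one commit in the window_days
    preceding `when` -- a simple proxy for current team size."""
    window = timedelta(days=window_days)
    return len({author for author, ts in commit_log
                if when - window <= ts <= when})

# Hypothetical commit log: (author, timestamp) pairs.
log = [
    ("alice", datetime(2016, 1, 5)),
    ("bob",   datetime(2016, 2, 20)),
    ("alice", datetime(2016, 3, 1)),
    ("carol", datetime(2015, 6, 1)),   # inactive for months
]
print(team_size_at(log, datetime(2016, 3, 15), window_days=90))  # 2
```

With a window of 90 days, carol's months-old commit no longer counts her as a team member, while alice and bob are both considered active.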

We now have all that we need. To answer our question we only need to plot the average code contribution per team member (measured in terms of the Levenshtein distance) against the size of the development team (calculated as described above). If Ringelmann and Brooks are right, we expect a decreasing trend, indicating that developers in larger development teams tend to produce less. If, on the other hand, studies highlighting synergetic effects in OSS communities are right, we expect an increasing trend, indicating that developers in larger development teams tend to produce more (because they are more motivated). The results across all 58 projects are shown in the following plot.


The clear decreasing trend that can be observed visually shows that Ringelmann and Brooks are seemingly right. We can further use a log-linear regression model to quantify the scaling factors and to assess the robustness of our result. This analysis confirms a strong negative relation, spanning several orders of magnitude in terms of the code contributions. Notably, this negative relation holds both at the aggregate level as well as for each of the studied projects individually.
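The scaling factor in such a log-linear model is just the ordinary least-squares slope in log-log space. The sketch below uses hypothetical data in which per-developer output halves as team size doubles, i.e. output scales as team size to the power -1; the study's actual regression is more elaborate (see [1]):

```python
import math

def loglinear_slope(team_sizes, outputs):
    """Least-squares slope of log(output) against log(team size).
    A negative slope means per-developer output shrinks as the team
    grows -- the signature of the Ringelmann effect."""
    xs = [math.log(t) for t in team_sizes]
    ys = [math.log(o) for o in outputs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical data: average per-developer contribution halves as
# the team doubles, i.e. output ~ team_size ** -1.
sizes = [1, 2, 4, 8, 16]
per_dev = [1000, 500, 250, 125, 62.5]
print(round(loglinear_slope(sizes, per_dev), 2))  # -1.0
```

A slope of 0 would mean per-developer output is independent of team size; a positive slope would indicate the "Aristotelian regime" discussed above.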

While our analysis quantitatively confirms the Ringelmann effect in all of the studied projects, we have not yet addressed why it holds. Unfortunately, it is non-trivial to quantify potential motivational factors that have been discussed in social psychology. But what we can do is to study potential effects which are due to increasing coordination efforts. For this, we again use the time-stamped commit log of projects to infer simple proxies for the coordination structures of a project. Precisely, we construct complex networks based on the co-editing of source code regions by multiple developers. Whenever we detect that a developer A changed a line of source code that was previously edited by a developer B, we draw a link from A to B. The meaning of such a link is that we assume that there is the potential need for developer A to coordinate his or her change with developer B.

The result of this procedure is a set of co-editing networks that can be constructed for different time ranges and projects. We can now study how these networks change as teams increase in size. What we find is that, in line with Brooks' argument about increasing coordination and communication effort, the number of links in the co-editing networks tends to grow super-linearly as teams grow larger. The result is that the coordination overhead for each team member is likely to increase as the team grows, providing an explanation for the decreasing code production. Moreover, by fitting a model that estimates the speed at which co-editing networks grow in different projects, we find a statistically significant relation between the growth dynamics of co-editing links and the scaling factor for the decrease of productivity. This finding indicates that the management of a project and the resulting coordination structures can significantly influence the productivity of team members, reinforcing or mitigating the strength of the Ringelmann effect as the team grows.
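The network construction itself can be sketched in a few lines. This toy version tracks only the last editor of each line number and ignores line renumbering across diffs, which a real implementation would have to handle:

```python
from collections import defaultdict

def coediting_edges(commits):
    """Build directed co-editing links: when developer A changes a line
    last touched by developer B (A != B), add an edge A -> B.
    `commits` is an ordered list of (author, edited_line_numbers)."""
    last_editor = {}          # line number -> last author who touched it
    edges = defaultdict(int)  # (A, B) -> number of co-edits
    for author, lines in commits:
        for line in lines:
            prev = last_editor.get(line)
            if prev is not None and prev != author:
                edges[(author, prev)] += 1
            last_editor[line] = author
    return dict(edges)

# Hypothetical commit sequence on one file.
history = [
    ("alice", {1, 2, 3}),
    ("bob",   {2, 3}),     # bob reworks alice's lines 2 and 3
    ("alice", {3}),        # alice revisits bob's version of line 3
]
print(coediting_edges(history))
```

Each edge A -> B is read as "A potentially had to coordinate with B", so counting edges per developer gives a proxy for individual coordination overhead.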

So, do developers really become more productive as teams grow larger? Does the whole team really produce more than the sum of its team members? Or do we find evidence for Brooks' law and the Ringelmann effect? Based on our large-scale data analysis of more than 580,000 commits by more than 30,000 developers in 58 Open Source Software projects, we can safely conclude that there is a strong Ringelmann effect in all of the studied Open Source projects. As expected based on basic software engineering wisdom, our findings show that developers in larger teams indeed tend to produce less code than developers in smaller teams. Our analysis of time-evolving co-editing networks constructed from the commit log history further suggests that the increasing coordination overhead imposed by larger teams is a factor that drives the decrease of developer productivity as teams grow larger.

In summary, Open Source projects seem to be no magical exceptions to the basic principles of collaborative software engineering. Our study demonstrates how data science and network analysis techniques can provide actionable insights into software engineering processes and project management. It further shows how the application of computational techniques to large amounts of publicly available data on social organizations allows us to study hypotheses relevant to social psychology. As such, it highlights interesting relations between empirical software engineering and computational social science which hold great potential for future work.

[1] Ingo Scholtes, Pavlin Mavrodiev, Frank Schweitzer: From Aristotle to Ringelmann: a large-scale analysis of productivity and coordination in Open Source Software projects, Empirical Software Engineering, Volume 21, Issue 2, pp 642-683, April 2016, available online

Sunday, August 14, 2016

Release management in Open Source projects

By: Martin Michlmayr (@MartinMichlmayr)
Associate editor: Stefano Zacchiroli (@zacchiro)

Open source software is widely used today. While there is not a single development method for open source, many successful open source projects are based on widely distributed development models with many independent contributors working together. Traditionally, distributed software development has often been seen as inefficient due to the high level of communication and coordination required during the software development process. Open source has clearly shown that successful software can be developed in a distributed manner.

The open source community has over time introduced many collaboration systems, such as version control systems and mailing lists, and processes that foster this collaborative development style and improve coordination. In addition to implementing efficient collaboration systems and processes, it has been argued that open source development works because it aims to reduce the level of coordination needed. This is because development is done in parallel streams by independent contributors who work on self-selected tasks. Contributors can work independently and coordination is only required to integrate their work with others.

Relatively little attention has been paid in the literature to release management in open source projects. Release management, which involves the planning and coordination of software releases and the overall management of releases throughout the life cycle, can be studied from many different angles. I investigated release management as part of my PhD from the point of view of coordination theory. If open source works so well because of various mechanisms that reduce the level of coordination required, what implications does this have for release management, which is a time in the development process when everyone needs to come together to align their work?

Complexity of releases

As it turns out, my study on quality problems highlighted that release management can be a very problematic part of producing open source software. Several projects described constant delays with their releases, leading to software which is out-of-date or which has other problems. Fundamentally, release management relies on trust. Contributors have to trust release managers and the deadlines imposed by them, otherwise the deadlines are simply ignored. This leads to a self-fulfilling prophecy: because developers don't believe a release will occur on time, they continue to make changes and add new code, leading to delays with the release. It's very hard to break such a vicious circle.

It's important to consider why creating alignment in open source projects can be a challenge. I identified three factors that made coordination difficult:
  1. Complexity: alignment is harder to achieve in large projects and many successful open source projects have hundreds of developers.
  2. Decentralization: many open source projects are distributed, which can create communication challenges.
  3. Voluntary nature: it's important to emphasize that this does not mean that contributors are unpaid. While some open source contributors are unpaid, increasingly open source development is done by developers employed or sponsored by corporations. The Linux kernel is a good example with developers from hundreds of companies, such as Intel and Red Hat. What I mean by voluntary nature is that the project itself has no control over the contributors. The companies (in the case of paid developers) define what work gets done and unpaid contributors generally "scratch their own itch". What this means is that it's difficult for a release manager or a project leader to tell everyone to align their work at the same time.

Time-based releases

While my research has shown problems with release management in many projects, it has also identified a novel approach to release management employed by an increasing number of projects. Instead of doing releases based on new features and functionality, which has historically been the way releases are done, releases are made based on time. Time-based releases are like a release train: there is a clear timetable by which developers can orient themselves to plan their work. If they want to make the next train (release), they know when they have to be ready.

Time-based releases work particularly well if the releases are predictable (for example, every X months) and frequent. If a release is predictable, developers can plan accordingly. If a release is fairly frequent, it doesn't matter if you miss a release because you can simply target the next one. You can think of a train station with a train that is leaving. Should you run? If you know the next train will leave soon and you can trust that the next train will leave on time, there is no need to hurry — you can avoid a dangerous sprint across the platform. Similarly, if the next release is near and predictable, you don't need to disrupt the project by making last minute changes.
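Part of the coordination value of a predictable timetable is that anyone can compute the next departure for themselves. A minimal sketch, assuming a fixed cadence in days (the dates below are purely illustrative):

```python
from datetime import date, timedelta

def next_release(last_release, cadence_days, today):
    """Next scheduled release under a fixed time-based cadence."""
    nxt = last_release
    while nxt <= today:
        nxt += timedelta(days=cadence_days)
    return nxt

# Roughly six-month (183-day) cadence, as GNOME and Ubuntu follow.
print(next_release(date(2016, 4, 21), 183, today=date(2016, 8, 14)))  # 2016-10-21
```

No release manager needs to announce anything for a developer to know whether their feature will make the next train or should target the one after.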

Additionally, frequent releases give developers practice at making releases. If a project does releases only every few years, it's very hard to make the release process an integral part of the development cycle. When releases are done on a regular basis (say every three or six months), the release process can work like a machine — it becomes part of the process.

Time-based releases are a good mechanism for release management in large open source projects. Speaking in terms of coordination theory, time-based releases decrease the level of coordination required because the predictable timetable allows developers to plan for themselves. The timetable is an important coordination mechanism in its own right.

Looking at various open source projects, time-based release management is implemented in different ways. For example, GNOME and Ubuntu follow a relatively strict frequency of six months. This is frequent, predictable and easy to understand. The Linux kernel employs a 2 week "merge window" in which new features are accepted. This is followed by a number of release candidates with bug fixes (but no new features) until the software is ready for release, typically after 7 or 8 release candidates. Debian follows a model where the freeze date is announced in advance. The time between freeze and release depends on how fast defects get fixed. Debian's freeze dates are more than 2 years apart, which in my opinion leads to challenges because the release process is not performed often enough to become a routine.

Recent changes

There have been many changes recently in the industry and development community that have influenced release management. One change relates to the release frequency, which has been going up in several projects (such as U-Boot, which moved from a release every three months to one every two months in early 2016). This may stem from the need to serve updates to users more frequently because the technology is changing so rapidly. It could also be because the release train is working so well and the cost of doing releases has gone down. Frequent releases lead to a number of questions, though. For example, do users prefer small, incremental updates over larger updates? Furthermore, how can you support old releases, or are users expected to upgrade to the latest version immediately (a model employed by various app stores for mobile devices)?

In addition to more frequent releases, I also observe that some projects have stopped making releases altogether. In a world of Continuous Integration (CI) and Continuous Deployment (CD), does it make sense to offer the latest changes to users directly, since you can ensure that every change has been sufficiently tested? Is there still value in performing separate releases when you have CI/CD?

I believe more research is needed to understand release management in our rapidly changing world, but one thing is clear to me: studying how contributors with different interests come together to produce and release software is fascinating!

References