Sunday, October 16, 2016

JDeodorant: Clone Refactoring Support beyond IDEs

By: Nikolaos Tsantalis, Concordia University, Montreal, Canada (@NikosTsantalis)
Associate Editor: Sonia Haiduc, Florida State University, USA (@soniahaiduc)

Why should I bother to refactor duplicated code in my project?
Duplicated code (also known as code clones) can cause serious trouble for the maintenance of your project. Not all clones are harmful [1] or equally harmful [2]. However, if you find yourself repeatedly applying the same changes (e.g., a bug fix) in multiple places in your code, you should definitely consider refactoring [3] this duplication. Merging such clones into a single copy will make future maintenance faster and less error-prone.

Why is tool support necessary for the refactoring of clones?
Software clones tend to diverge after the initial copy-and-paste, because developers have to adjust the copied code to a new context and/or different requirements. Therefore, clones tend to accumulate non-trivial differences as the project evolves (e.g., different methods being called, or completely different logic in parts of the implemented algorithm). Unfortunately, current state-of-the-art IDEs support the refactoring only of clones with trivial differences (i.e., identical clones with variations in whitespace, layout, comments, and variable identifiers).
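As an illustration, consider the following hypothetical pair of clones (made-up code, not taken from JDeodorant's documentation; the Item type and method names are invented). The two fragments share the same structure, but one calls a different method and contains an extra guard, which is exactly the kind of difference that rename-only IDE support cannot handle:

    import java.util.List;

    class ReportClones {
        interface Item {
            double getPrice();
            double getWeight();
            boolean isShippable();
        }

        // Clone 1: sums the price of every item.
        static double totalPrice(List<Item> items) {
            double sum = 0;
            for (Item item : items) {
                sum += item.getPrice();            // differs: method called
            }
            return sum;
        }

        // Clone 2: same traversal, but with an extra guard and a different call.
        static double totalShippingWeight(List<Item> items) {
            double sum = 0;
            for (Item item : items) {
                if (item.isShippable()) {          // differs: extra statement (a clone "gap")
                    sum += item.getWeight();       // differs: method called
                }
            }
            return sum;
        }
    }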

JDeodorant [4] is an Eclipse plugin that helps developers inspect the differences between clones in Java projects and refactor them safely, when possible. More specifically, JDeodorant applies a sophisticated static source code analysis to examine whether the parameterization of the differences appearing in the clones is free of side effects. We refer to this feature as refactorability analysis [5]. If all examined preconditions pass, JDeodorant can proceed with the refactoring of the clones.

Feature 1: Clone import and clone group exploration
  • JDeodorant can import results from 5 popular clone detection tools, namely CCFinder, ConQAT, NiCad, Deckard, and CloneDR. In addition, JDeodorant can analyze any pair of methods selected by the developer from the Eclipse Package Explorer view.
  • While importing the results, JDeodorant automatically checks the syntactic correctness of the clone fragments and fixes any discrepancies by removing incomplete statements and adding the missing closing brackets to incomplete blocks of code. Additionally, the tool filters out the clones that extend beyond the body of a method (i.e., class-level clones).
  • The imported results are presented to the user in a tree-like view, as shown in Figure 1. The clones are organized into groups based on their similarity (i.e., a clone group contains two or more clone instances).
  • The clone groups are also analyzed to discover subclone relationships between them. Group A is a subclone of group B if every clone instance in A is a subclone (i.e., a partial code fragment) of an instance in B; a minimal sketch of this check follows the list below. The subclone information appears as a link in the last column of the clone group table to help the user navigate between clone groups having such a relationship.
  • By clicking on the “Show only clone groups for the files opened in the editor” checkbox, the user can filter the clone group table to display only the clones appearing in the files he or she is currently working on.
  • All clones are constantly monitored for modifications. If the developer refactors or updates some code associated with the imported clones, the clone group table is automatically updated by disabling the clones affected by the modification (disabled clones appear with strikethrough text), and by re-computing the offsets of other clones belonging to the same modified Java files (shifted clones appear with text highlighted in green). In this way, the user can continue with the inspection and refactoring of other clone groups without having to import new results from external tools. 
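
The subclone check mentioned above boils down to a containment test between code regions. A minimal sketch, assuming hypothetical types that record a clone's file and character offsets (JDeodorant's actual data model is richer than this):

    import java.util.List;

    class CloneInstance {
        final String file;
        final int startOffset;
        final int endOffset;

        CloneInstance(String file, int startOffset, int endOffset) {
            this.file = file;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        // True if this clone's region fully contains the other clone's region.
        boolean contains(CloneInstance other) {
            return file.equals(other.file)
                    && startOffset <= other.startOffset
                    && other.endOffset <= endOffset;
        }
    }

    class SubcloneCheck {
        // Group A is a subclone of group B if every instance of A lies inside
        // some instance of B.
        static boolean isSubcloneOf(List<CloneInstance> groupA, List<CloneInstance> groupB) {
            return groupA.stream()
                    .allMatch(a -> groupB.stream().anyMatch(b -> b.contains(a)));
        }
    }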

Figure 1: Presentation of the imported clone detection results to the user

Feature 2: Clone visualization and refactorability analysis
The user can right-click on any pair of clones from the same clone group, or any pair of methods from the Eclipse Package Explorer and select “Refactor Duplicated Code…” from the popup menu. The outcome of the clone pair analysis is presented to the user as shown in Figure 2.

Figure 2: Clone pair visualization and refactorability analysis

The analyzed clone fragments appear as two side-by-side trees, where each pair of tree nodes in the same row represents a pair of mapped statements in the first and second clone fragment, respectively. The user can inspect the clones in a synchronized manner, by expanding and collapsing the nodes corresponding to control statements (i.e., loops, conditionals, try blocks). The code is highlighted in 3 different colors to help the developer inspect and understand the differences between the clones.
  • Yellow: Represents differences in expressions between matched statements. These expressions are evaluated to the same type, but have a different syntactic structure or identifier.
  • Red: Represents unmapped statements that do not have a matching statement in the other clone fragment (also known as clone gaps).
  • Green: Represents semantically equivalent statements, i.e., statements of different AST types performing exactly the same functionality. In Figure 2, a for loop in the left clone is matched with a while loop in the right clone, whose initializer and updater appear as separate statements.
By hovering over a pair of statements highlighted in yellow, a tooltip appears providing semantic information about the type of each difference based on the program elements (e.g., variables, method calls, literals, class instantiations) appearing in the difference. Currently, JDeodorant supports over 20 difference types, including some more advanced ones, such as the replacement of a direct field access with the corresponding getter method call, and the replacement of a direct field assignment with the corresponding setter method call. In addition, the tooltip may also include information about precondition violations, if the expressions appearing in the differences cannot be safely parameterized.
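
To make the green case above concrete, here is a hypothetical pair of loops (not the code shown in Figure 2) that JDeodorant would report as semantically equivalent: a for loop in one clone, and the corresponding while loop in the other with the initializer and updater written as separate statements:

    import java.util.List;

    class LoopClones {
        static void printAllFor(List<String> names) {
            for (int i = 0; i < names.size(); i++) {
                System.out.println(names.get(i));
            }
        }

        static void printAllWhile(List<String> names) {
            int i = 0;                                // initializer as a separate statement
            while (i < names.size()) {
                System.out.println(names.get(i));
                i++;                                  // updater as a separate statement
            }
        }
    }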

Semantically equivalent differences and renamed variables are not examined against preconditions, since they do not need to be parameterized. JDeodorant automatically detects the local variables that have been consistently renamed between the clone fragments (as shown in the bottom-right side of Figure 2).

Feature 3: Clone refactoring
Based on the location of the clones, JDeodorant determines automatically the best refactoring strategy:

1.  Extract Method (clones belong to the same Java file)
2.  Extract and Pull Up Method (clones have a common superclass)
    a) Introduce Template Method (clones call local methods from the subclasses)
    b) Extract Superclass (clones have an external common superclass, or the common superclass has additional subclasses)
3.  Introduce Utility Method (clones access/call only static fields/methods)
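
A minimal sketch of what strategy 1 (Extract Method) can look like, using made-up code rather than JDeodorant's actual output: the two clones differ only in one expression (the discount rate), which becomes a parameter of the extracted method.

    class Before {
        double memberTotal(double price) {
            double total = price * 0.90;                 // differs only in the rate
            return Math.round(total * 100) / 100.0;
        }

        double employeeTotal(double price) {
            double total = price * 0.80;                 // differs only in the rate
            return Math.round(total * 100) / 100.0;
        }
    }

    class After {
        double memberTotal(double price)   { return discountedTotal(price, 0.90); }
        double employeeTotal(double price) { return discountedTotal(price, 0.80); }

        // The differing ("yellow") expression became a parameter of the extracted method.
        private double discountedTotal(double price, double rate) {
            double total = price * rate;
            return Math.round(total * 100) / 100.0;
        }
    }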

As shown in Figure 3, JDeodorant can generate a detailed preview of the refactoring to be applied, where the developer can inspect all the changes that will take place at a fine-grained level. Finally, the user can undo and redo the applied refactorings, since they are recorded in the change history of Eclipse.

Figure 3: Clone refactoring preview

JDeodorant is an open-source project hosted on GitHub.
Videos demonstrating the use and features of JDeodorant can be found on YouTube.

[1] Cory J. Kapser and Michael W. Godfrey, ""Cloning considered harmful" considered harmful: patterns of cloning in software," Empirical Software Engineering, vol. 13, no. 6, pp. 645-692, December 2008.
[2] Foyzur Rahman, Christian Bird, and Premkumar Devanbu, "Clones: what is that smell?," Empirical Software Engineering, vol. 17, no. 4-5, pp. 503-530, August 2012.
[3] Emerson Murphy-Hill, Don Roberts, Peter Sommerlad, and William F. Opdyke, "Refactoring [Guest editors' introduction]," IEEE Software, vol. 32, no. 6, pp. 27-29, November-December 2015.
[4] Davood Mazinanian, Nikolaos Tsantalis, Raphael Stein, and Zackary Valenta, "JDeodorant: Clone Refactoring," 38th International Conference on Software Engineering (ICSE'2016), Formal Tool Demonstration Session, Austin, Texas, USA, May 14-22, pp. 613-616, 2016.
[5] Nikolaos Tsantalis, Davood Mazinanian, and Giri Panamoottil Krishnan, "Assessing the Refactorability of Software Clones," IEEE Transactions on Software Engineering, vol. 41, no. 11, pp. 1055-1090, November 2015.

Sunday, October 9, 2016

What's in Repeated Requirements Research for Practitioners?

By: Nan Niu, University of Cincinnati
Associate Editor: Mehdi Mirakhorli (@MehdiMirakhorli)

For many things merely said in requirements engineering (RE) research, practitioners may or may not know whether they can apply the research in their current or future projects. For the things said and done (i.e., evaluated by the RE researchers), practitioners can gain a more detailed understanding of the research's scope of applicability. For the same research done repeatedly, practitioners are much better informed: they know under which conditions the research can be applied and under which it cannot, what benefits are expected and how large they are, what limitations there are and how to overcome them, and, more importantly, what stays common and what varies when the research is repeated.

In a replication study, we re-tested the work by Easterbrook and his colleagues [1]. They reported that, when approaching a conceptual modeling problem, it was better to build many fragmentary models representing different perspectives than to attempt to construct a single coherent model. Their case study, illustrated by the following figure, was carried out by two teams using different processes to build i* models for the Kids Help Phone (KHP) organization [1]: The global (G) team worked together whereas the viewpoints (V) team worked individually on separate, loosely coupled, yet overlapping models before explicitly merging their viewpoints together.

The results? The V team gained a richer domain understanding than the G team. The take-away for practitioners? Adopt viewpoints in requirements modeling, especially for multi-stakeholder, socio-technical, large-scale, distributed projects. Well, not so fast. The V team's richer domain understanding was gained, according to [1], at the cost of slowness. That is, the viewpoints process was so slow that no merged model was ever produced. That's why only the model slices (shown in the above figure) were presented to the KHP stakeholders. Viewpoints or not? To practitioners, the results in [1] were mixed at best.

How have things changed since [1]? We took advantage of theoretical replication to improve the study design. Among the improvements, we paid specific attention to the i* modeling tools developed in the past decade. We asked our G and V teams to use OpenOME [2] in constructing their models for the Scholar@UC project [3]. Our results? Not only did our study confirm the deeper domain understanding achieved by the V teams, but the viewpoints modeling was no longer slower. In fact, with OpenOME, the two V teams in our study spent less time generating the final, integrated models than the two G teams.

The take-away from our repeated research? Viewpoints-based requirements modeling is a valuable approach for practitioners to adopt in many domains, such as IoT and smart cities, because the process leads to a better understanding of hidden assumptions, stakeholder disagreements, and new requirements. With the tech transfer of research tools like OpenOME, this more valuable process also becomes faster and more practical.

  1. S. Easterbrook, E. Yu, J. Aranda, Y. Fan, J. Horkoff, M. Leica, and R. Qadir, "Do Viewpoints Lead to Better Conceptual Models? An Exploratory Case Study," in 13th IEEE International Requirements Engineering Conference (RE) Paris, France: IEEE Computer Society, 2005, pp. 199-208.
Interested in our study? We welcome your feedback and invite you to replicate.
  • N. Niu, A. Koshoffer, L. Newman, C. Khatwani, C. Samarasinghe, and J. Savolainen, "Advancing Repeated Research in Requirements Engineering: A Theoretical Replication of Viewpoints Merging," in 24th IEEE International Requirements Engineering Conference (RE) Beijing, China: IEEE Computer Society, 2016, pp. 186-195. (pre-print)
  • Our replication packet hosted by Scholar@UC:

Sunday, September 25, 2016

The Value of Applied Research in Software Engineering

By: David C. Shepherd, ABB Corporate Research, USA (@davidcshepherd)

Associate Editor: Sonia Haiduc, Florida State University, USA (@soniahaiduc)

Jean Yang’s thoughtful post “Why It’s Not Academia’s Job to Produce Code that Ships” posits that academics should (and must) be free to explore ideas with no (immediate) practical relevance. Agreed: these long-term, deep, seemingly impractical research projects often end up producing major advances. However, she goes further. In her defense of fundamental research she claims, “In reality, most research--and much of the research worth doing--is far from being immediately practical.” This subtle disdain for applied research, all too common in academic circles, has deep implications for our field’s impact and ultimately widens the already cavernous gap between academic research and industrial practice. In this example-driven article I’ll discuss why applied research is a key enabler for impact, identify common vehicles for applied research, and argue for better metrics for scientific evaluation that measure both fundamental and applied contributions. Impacting the practice of software engineering is a huge challenge; it will require the best from our fundamental AND applied researchers.

Applied Research Reduces Risk

Software engineering research has produced thousands of novel solutions in the past decade. In many cases these approaches are rigorously evaluated, but only in the lab: not by real developers and not in the field, as that would require the approach to be implemented in a production-quality tool. Thus, companies looking to adopt more effective software engineering techniques face a huge amount of risk. Should they adopt technique A, which had great success in lab evaluations but has never been tested in the field? The answer to this question is almost always “No”. Fortunately, applied research can reduce this risk dramatically, leading to more adoption and impact on industrial practices. Below I briefly discuss two examples.

In 2004, Andrian Marcus first published the concept of searching source code using information retrieval techniques. It is tempting to think that this work should have been immediately transferred to modern development environments. However, when my colleagues and others implemented prototypes of the concept and released them to the public, it became evident that there were many remaining research challenges. We discovered that the query/term mismatch problem is even more pronounced in source code, and thus we developed a source-code-focused autocomplete functionality. We (and others) discovered how difficult it is to split identifiers into words (e.g., ‘OpenFile’ into ‘open’ and ‘file’) due to the many different naming conventions. We found that indexing speed is a major issue, as most searches occur within about 15 seconds of opening a project. By discovering and solving these various applied research challenges, this work is now ready to be transferred with a lowered level of risk.
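
As a taste of the identifier-splitting problem, here is a minimal camel-case and underscore splitter of the kind a code search index needs. It is a hypothetical sketch, not the splitter used in the tools mentioned above, and it ignores the harder conventions (digits, abbreviations, Hungarian notation) that make the problem genuinely difficult:

    import java.util.Arrays;
    import java.util.List;

    class IdentifierSplitter {
        // Splits identifiers such as "OpenFile", "openFile", or "open_file"
        // into lowercase terms: [open, file].
        static List<String> split(String identifier) {
            String spaced = identifier
                    .replaceAll("([a-z0-9])([A-Z])", "$1 $2")     // camelCase boundary
                    .replaceAll("([A-Z]+)([A-Z][a-z])", "$1 $2")  // e.g., XMLParser -> XML Parser
                    .replace('_', ' ');
            return Arrays.asList(spaced.toLowerCase().trim().split("\\s+"));
        }

        public static void main(String[] args) {
            System.out.println(split("OpenFile"));        // [open, file]
            System.out.println(split("XMLHttpRequest"));  // [xml, http, request]
        }
    }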

The Microsoft Research team behind Pex, an automated test generation tool, can also attest to the importance of refining and revising research contributions through real-world usage. Their work actually began as the Spec Explorer project, a model-based testing tool with an embedded model checker, which they quickly discovered was too complex for the average user. They noted, “After observing the difficulties of “[training]” the target tool users, the first author moved on to propose and focus on a more lightweight formal-testing methodology, parameterized unit testing.” Later, when they began releasing Pex as an independent extension for Visual Studio, they soon discovered users had “…strong needs of tool support for mocking (independent of whether Pex.. is used), [and so] the Pex team further invested efforts to develop Moles.” Having refined their approach via user feedback and solved practical blockers by creating Moles (i.e., a mocking framework), they have since shipped a simplified version of Pex with Visual Studio called IntelliTest, which has received many positive reviews.

Vehicles for Applied Research

While I may have convinced you that applied research is a key step in the innovation pipeline, it can be difficult to imagine how applied research can be conducted in today’s academic and economic climate. Below I detail ways in which researchers have used open source software, startups, and product groups as vehicles for driving applied research. Their approaches can serve as a pattern for applied researchers to follow.

Creating or contributing to open source software can be a great vehicle for applied research. It allows you to gather contributions from others and get your work out to users with a minimum of hassle. Perhaps one of the best examples of using open source as a successful means for performing applied research is Terence Parr and his work on ANTLR, a project which he founded in 1988. In the course of his work on this parser generator he has balanced between acquiring users and contributing to theory. In the past few years he has both published at competitive venues (e.g., OOPSLA and PLDI) and supported major products at Twitter, Google, Oracle, and IBM. Impressive. Unfortunately, it appears that his achievements are not appreciated in academia, as he recently tweeted “I wish tenure/promotion committees knew what software was…and valued it for CS professors.”

Startups can be another great vehicle for applied research, fleshing out the ideas started during a thesis or project into a full-fledged technology. The team at AnswerDash provides a great example. Parmit Chilana’s thesis work focused on providing contextual help to website visitors. The concept appeared promising when first published but was untested in the “real world”. The authors, Parmit along with professors Andrew Ko and Jacob Wobbrock, spent significant effort on applied research tasks, such as deploying to live websites and gathering feedback, prior to creating a spinoff. The applied research brought up (and answered) crucial questions, reducing risk and ultimately creating enough value that AnswerDash has received over $5M in funding to date.

Finally, for those working in or even collaborating with a large company, product improvements can be a great vehicle for applied research. In this space there is no shortage of examples. From Pex and Moles (which became IntelliTest) to Excel’s Flash Fill to refactorings in NetBeans, researchers have found it can be very effective to convince an existing product team that your improvements can help their product and then leverage their development power. As Sumit Gulwani of Flash Fill fame says, “If we want to do more, and receive more funding, then we need a system in which we give back to society in a shorter time-frame by solving real problems, and putting innovation into tangible products.”

Rewards for Applied Research

“One of the major reasons I dropped out of my PhD was because I didn’t believe academia could properly value software contributions” – Wes McKinney, author of Pandas package for data analysis

Unfortunately, in spite of the importance of applied research for increasing our field’s impact, there are few, if any, rewards for undertaking this type of work. Tenure committees, conferences, professional societies, and even other researchers do not properly value software contributions. In this section I discuss how we as a community can make changes to continue rewarding fundamental researchers while also rewarding applied researchers.

One of the most important issues to address, for both academic and industrial researchers, is which metrics they are evaluated on. For this matter I hold up ABB Corporate Research as an example of an enlightened approach. When I applied for ABB’s global “Senior Principal Scientist” position, which is awarded annually and is maintained at 5-10% of the overall researcher population, I had to create an application packet similar to a tenure packet. However, unlike an academic tenure packet, I was encouraged to list applied research metrics such as tool downloads, talks at developer conferences, tool usage rates, blog post hits, and ABB internal users. While I did not use this metric at the time, I would also add the amount of collected telemetry data to this list (e.g., collected 1000 hours of activity from 100 users working in the IDE). At ABB Corporate Research these applied metrics are weighted equally with, if not more heavily than, traditional metrics such as citation count.

The CRA also provides guidelines on how to evaluate researchers for promotion and tenure. In 1999 the CRA recommended that, when measuring the impact of a researcher’s work, factors like “…the number of downloads of a (software) artifact, number of users, number of hits on a Web page, etc…” may be used, which, especially for the time, was progressive. Unfortunately, after listing these potential metrics, the guidelines spend two paragraphs discrediting them, stating that “…popularity is not equivalent to impact…” and that “it is possible to write a valuable, widely used piece of software inducing a large number of downloads and not make any academically significant contribution.” In brief, they stop short of recommending the type of metrics that would reward applied researchers.

Second to how researchers are evaluated is how their outputs are evaluated. In this area there have been small changes in how grants are evaluated, yet the impact of these changes has yet to be seen. The NSF, the US’s primary funding agency for computer science research, started allowing researchers to list “Products” instead of “Publications” in their biographical sketch in 2012. This allows researchers to list contributions such as software and data sets in addition to relevant publications. Unfortunately, I have no evidence as to how seriously these non-publication “Products” are considered in evaluating grants (I would love to have comments from those reviewing NSF grants).

While some companies like ABB and institutions like the NSF, and to a lesser extent the CRA, have begun to consider applied metrics, the adoption is not widespread enough to effect sweeping changes. In order to increase the amount of impact our field has on practice, I estimate we would need at least one quarter of our field to be composed of applied researchers. As of today we have less than five percent. The adoption and acceptance of applied metrics may be the single biggest change we could make as a community to increase impact.


Academia’s not-so-subtle disdain for applied research does more than damage a few promising careers; it renders our field’s output useless, destined to collect dust on the shelves of Elsevier. We cannot and should not accept this fate. In this article I have outlined why applied research is valuable (it reduces risk), how applied research can be undertaken in the current climate (e.g., open source, startups, and product contributions), and finally have discussed how to measure applied research’s value (e.g., downloads, hits, and usage data). I hope these discussions can serve as a starting point for chairs, deans, and industrial department managers wishing to encourage applied research.

Wondering how to publish your applied research work at a competitive venue? Consider applying to one of the following… I know one of the PC co-chairs and can assure you that he values applied work. Publish at an industry track: SANER Industrial Track, ICPC Industry Track, FSE Industry Track, or ICSE SEIP.

Monday, September 19, 2016

Comprehensively testing software patches with symbolic execution

Associate Editor: Sarah Nadi, University of Alberta (@sarahnadi)

A large fraction of the costs of maintaining software applications is associated with detecting and fixing errors introduced by patches. Patches are prone to introducing failures [4, 10] and as a result, users are often reluctant to upgrade their software to the most recent version [3], relying instead on older versions which typically expose a reduced set of features and are frequently susceptible to critical bugs and security vulnerabilities. To properly test a patch, each line of code in the patch, and all the new behaviours introduced by the patch, should be covered by at least one test case. However, the large effort involved in coming up with relevant test cases means that such thorough testing rarely happens in practice.

The Software Reliability Group at Imperial College London has invested a significant amount of effort in the last few years on devising techniques and tools for comprehensively testing software patches. The main focus of our work has been on developing dynamic symbolic execution techniques that automatically detect bugs and augment program test suites. In this article, we discuss two interrelated projects, KATCH  [7] and Shadow  [9], whose objectives and relationship are depicted in Figure 1. In a nutshell, KATCH aims to generate program inputs that cover the lines of code added or modified by a patch, while Shadow further modifies existing test inputs in order to trigger the new behaviour introduced by a patch.

It is rather surprising that a lot of code is added or modified in mature software projects without a single test that exercises it [8]. In response to this poor culture of testing software patches, we have designed KATCH, a system whose goal is to automatically generate inputs that exercise the lines of code of a patch. KATCH is based on symbolic execution [2], a program analysis technique that can systematically explore a program’s possible executions. Symbolic execution replaces regular program inputs with symbolic variables that initially represent any possible value. Whenever the program executes a conditional branch instruction that depends on symbolic data, the possibility of following each branch is analysed and execution is forked for each feasible branch.
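
To illustrate the forking, consider the small hypothetical function below (not taken from the benchmarks discussed in this article). With x treated as symbolic, exploration forks at each branch that depends on x, every explored path accumulates a path condition, and solving each path condition yields one concrete test input per path:

    class SymbolicExample {
        static int classify(int x) {
            if (x > 10) {              // fork: {x > 10} vs. {x <= 10}
                if (x % 2 == 0) {      // fork again along the first path
                    return 2;          // path condition: x > 10 && x % 2 == 0   (e.g., x = 12)
                }
                return 1;              // path condition: x > 10 && x % 2 != 0   (e.g., x = 11)
            }
            return 0;                  // path condition: x <= 10                (e.g., x = 0)
        }
    }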

In its standard instantiation, symbolic execution aims to achieve high coverage of the entire program. Instead, the challenge for KATCH is to focus on the patch code in order to quickly generate inputs that cover it. KATCH accomplishes this goal by starting from existing test cases that partially cover or come close to covering the patch (such test cases often exist, so why not make use of them?), and combining symbolic execution with targeted heuristics that aim to direct the exploration toward uncovered parts of the patch.

We evaluated KATCH on 19 programs from the GNU diffutils, binutils and findutils systems, a set of mature and widely-used programs installed on virtually all UNIX-based distributions. We included all the patches written over an extended period of time, approximately six years cumulatively across all programs. Such an unbiased selection is essential if one is to show both the strengths and the weaknesses of a technique.

The results are presented in Table 1, which shows the number of basic blocks across all patches for each application suite, and the fraction of basic blocks exercised by the existing test suite and the existing test suite plus KATCH. It also shows the number of crash bugs revealed by KATCH. The highlights of our results are: (1) KATCH can significantly increase the patch coverage in these applications in a completely automatic fashion, and (2) there are still significant (engineering and fundamental) limitations: for instance, patch coverage for binutils is still at only 33% of basic blocks, although the additional coverage achieved by KATCH was enough to find 14 previously-unknown crash bugs in these programs.

Now consider a hypothetical scenario where KATCH can generate a test input for every single line in a patch. Would that be sufficient to adequately test the patch? Let us illustrate the question using a simple example. Consider the two versions of code below, in which the second version changes only x % 2 to x % 3, and let us assume that this statement can be executed at most once by a deterministic program and that the variable x is referenced only here:
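
(A minimal version of the two fragments, consistent with the discussion that follows; foo() and bar() stand in for whatever the two branches actually do.)

    // Old version:
    if (x % 2 == 0) { foo(); } else { bar(); }

    // New version (the patch changes only the condition):
    if (x % 3 == 0) { foo(); } else { bar(); }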

Let us assume that the developers chose inputs x = 6 and x = 7 to test the patch. Do these inputs comprehensively test the patch? The first reaction might be to say yes, since they cover each side of the branch in both the old and the new version. However, the true answer is that these are bad choices, as each of these inputs follows the same side of the branch in both versions, making the program behave identically before and after the patch. Instead, inputs such as x = 8 or x = 9 are good choices, as they trigger different behaviour at the code level: e.g. x = 8 follows the ‘then’ side in the old version, but the ‘else’ side in the new version.

Shadow symbolic execution is a new technique that assists developers with the generation of such inputs. Our tool Shadow starts with an input that exercises the patch, either constructed manually or synthesised by KATCH, and generates new inputs that trigger different behaviour at the code level in the unpatched and patched versions. Like KATCH, Shadow is based on symbolic execution, this time augmented with the ability to perform a “four-way fork”, as shown in Figure 2. Whenever we reach a branch condition that evaluates to semantically-different expressions in the two versions (say, old in the old version and new in the new version), instead of forking execution into two paths based on the behaviour of the new version (as in standard symbolic execution), we fork into up to four ways. In two of these cases, the two versions behave identically (denoted by same in the figure): both versions take either the then branch (new ∧ old) or the else branch (¬new ∧ ¬old). On the other two paths, the executions of the two versions diverge (denoted by diff in the figure): either the new version takes the then branch and the old version the else branch (new ∧ ¬old), or vice versa (¬new ∧ old). Of course, the last two cases are of interest to Shadow, and when it encounters them, it generates a new input triggering the divergent behaviour between the two versions.
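
Applied to the earlier x % 2 versus x % 3 example, with old = (x % 2 == 0) and new = (x % 3 == 0), the four cases and one witness input each work out as follows:

     new ∧  old : x = 6  ->  both versions take the 'then' branch     (same)
    ¬new ∧ ¬old : x = 7  ->  both versions take the 'else' branch     (same)
    ¬new ∧  old : x = 8  ->  old takes 'then', new takes 'else'       (diff)
     new ∧ ¬old : x = 9  ->  old takes 'else', new takes 'then'       (diff)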

Shadow is not a fully automatic technique, requiring developers to create a single unified program in which the two versions are merged via ‘change’ annotations. For instance, in our example, the two versions would be unified by creating the if statement if change(x % 2, x % 3), in which the first argument represents the code expression from the old version and the second argument the corresponding expression from the new version. While mapping program elements across versions is a difficult task  [5], we discovered that in practice the process can be made sufficiently precise and could be (partially) automated using predefined patterns.
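
The article abbreviates the annotation as change(x % 2, x % 3); a type-correct sketch of such a unified program is shown below, written in Java for consistency with the other sketches in this post. The change helper is a conceptual stand-in, not Shadow's actual annotation mechanism, and foo()/bar() are again placeholders:

    class Unified {
        // Conceptual stand-in for the change annotation: Shadow reasons about
        // both arguments symbolically, whereas a plain concrete run has to
        // pick one version's behaviour (here, the new one).
        static boolean change(boolean oldExpr, boolean newExpr) {
            return newExpr;
        }

        static void patched(int x) {
            if (change(x % 2 == 0, x % 3 == 0)) {   // old condition vs. new condition
                foo();
            } else {
                bar();
            }
        }

        static void foo() { /* 'then' behaviour */ }
        static void bar() { /* 'else' behaviour */ }
    }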

We evaluated Shadow on the 22 patches from the GNU Coreutils programs included in the CoREBench suite of regression bugs. Similar to the benchmarks used to evaluate KATCH, these are mature, widely-used applications available in most UNIX distributions. We chose the CoREBench patches because they are known to introduce bugs, and furthermore, the bug fixes are known.

After applying Shadow, we were able to generate inputs that trigger code-level divergences for all but one patch. We were further able to generate inputs for which the two versions generate different outputs, as well as inputs that abort or trigger memory errors in the new version. Some sample inputs generated by Shadow are shown in Table 2.

While Shadow was not successful in all cases, the results are promising in terms of its ability to find regression bugs and augment the test suite in a meaningful way. Furthermore, even generated inputs exposing expected divergences are great candidates for addition to the program test suites, and can act as good documentation for program changes.

As most readers would agree, continuously changing a program without having a single input that exercises those changes is unsustainable. On the other hand, as all developers know, comprehensively testing program patches is a difficult, tedious, and time-consuming task. Fortunately, program analysis techniques such as symbolic execution are becoming more and more scalable, and can be effectively extended to generate inputs that exercise program changes (as we did in KATCH) and trigger different behaviour across versions (as we did in Shadow). More work is required to scale these approaches to large software systems and integrate them into the development cycle, but initial results are promising. To stimulate research in this space, we make our benchmarks and our systems available for comparison. More details on our techniques can be found in our publications [1, 6, 7, 9].

[1]   C. Cadar and H. Palikareva. Shadow symbolic execution for better testing of evolving software. In ICSE NIER’14, May 2014.
[2]   C. Cadar and K. Sen. Symbolic Execution for Software Testing: Three Decades Later. CACM, 56(2):82–90, 2013.
[3]    O. Crameri, N. Knezevic, D. Kostic, R. Bianchini, and W. Zwaenepoel. Staged deployment in Mirage, an integrated software upgrade testing and distribution system. In SOSP’07, Oct. 2007.
[4]   Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? In ICSE’10, May 2010.
[5]   M. Kim and D. Notkin. Program element matching for multi-version program analyses. In MSR’06, May 2006.
[6]   P. D. Marinescu and C. Cadar. High-coverage symbolic patch testing. In SPIN’12, July 2012.
[7]   P. D. Marinescu and C. Cadar. KATCH: High-coverage testing of software patches. In ESEC/FSE’13, Aug. 2013.
[8]   P. D. Marinescu, P. Hosek, and C. Cadar. Covrig: A framework for the analysis of code, test, and coverage evolution in real software. In ISSTA’14, July 2014.
[9]   H. Palikareva, T. Kuchta, and C. Cadar. Shadow of a doubt: Testing for divergences between software versions. In ICSE’16, May 2016.
[10]   Z. Yin, D. Yuan, Y. Zhou, S. Pasupathy, and L. Bairavasundaram. How do fixes become bugs? In ESEC/FSE’11, Sept. 2011.