Monday, September 19, 2016

Comprehensively testing software patches with symbolic execution

Associate Editor: Sarah Nadi, University of Alberta (@sarahnadi)


A large fraction of the costs of maintaining software applications is associated with detecting and fixing errors introduced by patches. Patches are prone to introducing failures [4, 10] and as a result, users are often reluctant to upgrade their software to the most recent version [3], relying instead on older versions which typically expose a reduced set of features and are frequently susceptible to critical bugs and security vulnerabilities. To properly test a patch, each line of code in the patch, and all the new behaviours introduced by the patch, should be covered by at least one test case. However, the large effort involved in coming up with relevant test cases means that such thorough testing rarely happens in practice.

The Software Reliability Group at Imperial College London has invested a significant amount of effort in the last few years on devising techniques and tools for comprehensively testing software patches. The main focus of our work has been on developing dynamic symbolic execution techniques that automatically detect bugs and augment program test suites. In this article, we discuss two interrelated projects, KATCH  [7] and Shadow  [9], whose objectives and relationship are depicted in Figure 1. In a nutshell, KATCH aims to generate program inputs that cover the lines of code added or modified by a patch, while Shadow further modifies existing test inputs in order to trigger the new behaviour introduced by a patch.






It is rather surprising how much code is added or modified in mature software projects without a single test that exercises it [8]. In response to this poor culture of testing software patches, we have designed KATCH, a system whose goal is to automatically generate inputs that exercise the lines of code of a patch. KATCH is based on symbolic execution [2], a program analysis technique that can systematically explore a program's possible executions. Symbolic execution replaces regular program inputs with symbolic variables that initially represent any possible value. Whenever the program executes a conditional branch instruction that depends on symbolic data, the feasibility of following each branch is analysed and execution is forked for each feasible branch.
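To make this concrete, below is a minimal, hypothetical C example in the style of the KLEE symbolic execution engine (klee_make_symbolic is KLEE's API; everything else is illustrative and not taken from KATCH itself). Marking x as symbolic causes the executor to fork at the branch, exploring the 'then' path under the constraint x > 10 and the 'else' path under x <= 10:

    #include <klee/klee.h>

    int classify(int x) {
        if (x > 10)      /* branch on symbolic data: execution forks here */
            return 1;    /* path 1, explored with constraint x > 10 */
        return 0;        /* path 2, explored with constraint x <= 10 */
    }

    int main(void) {
        int x;
        /* Replace the concrete input with a symbolic variable that can
           initially take any int value. */
        klee_make_symbolic(&x, sizeof(x), "x");
        return classify(x);
    }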

In its standard instantiation, symbolic execution aims to achieve high coverage of the entire program. Instead, the challenge for KATCH is to focus on the patch code in order to quickly generate inputs that cover it. KATCH accomplishes this goal by starting from existing test cases that partially cover or come close to covering the patch (such test cases often exist, so why not make use of them?), and combining symbolic execution with targeted heuristics that aim to direct the exploration toward uncovered parts of the patch.
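As a rough sketch of what directing the exploration toward the patch might look like, the fragment below (plain C, with invented names such as State and distance_to_patch, not KATCH's actual data structures) greedily resumes the pending execution state whose estimated control-flow distance to the uncovered patch code is smallest:

    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative only: a pending symbolic-execution state, with an estimated
       distance (e.g. in control-flow-graph edges) from its current location to
       the nearest uncovered basic block of the patch. */
    typedef struct {
        int id;
        int distance_to_patch;
    } State;

    /* Greedy selection: always resume the state that appears closest to the
       patch, rather than exploring paths in an arbitrary order. */
    static State *pick_next_state(State *states, size_t n) {
        State *best = NULL;
        for (size_t i = 0; i < n; i++)
            if (best == NULL || states[i].distance_to_patch < best->distance_to_patch)
                best = &states[i];
        return best;
    }

    int main(void) {
        State pending[] = { {0, 7}, {1, 2}, {2, 5} };
        State *next = pick_next_state(pending, 3);
        printf("resume state %d\n", next->id);   /* state 1 is closest to the patch */
        return 0;
    }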

We evaluated KATCH on 19 programs from the GNU diffutils, binutils and findutils suites, a set of mature and widely-used programs installed on virtually all UNIX-based distributions. We included all the patches written over an extended period of time, covering a cumulative period of approximately six years across all programs. Such an unbiased selection is essential if one is to show both the strengths and weaknesses of a technique.

The results are presented in Table 1, which shows the number of basic blocks across all patches for each application suite, together with the fraction of those basic blocks exercised by the existing test suite alone and by the existing test suite augmented with KATCH. It also shows the number of crash bugs revealed by KATCH. The highlights of our results are: (1) KATCH can significantly increase patch coverage in these applications in a completely automatic fashion, and (2) there are still significant limitations, both engineering and fundamental: for instance, patch coverage for binutils is still at only 33% of basic blocks, although the additional coverage achieved by KATCH was enough to find 14 previously-unknown crash bugs in these programs.




Now consider a hypothetical scenario in which KATCH can generate a test input for every single line in a patch. Would that be sufficient to adequately test the patch? Let us illustrate the question with a simple example. Consider the two versions of code below, in which the second version changes only x % 2 to x % 3, and let us assume that this statement can be executed at most once by a deterministic program and that the variable x is only referenced here:
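As a minimal sketch consistent with the discussion that follows (the branch tests the remainder against zero; foo and bar stand in for whatever the two sides of the branch actually do), the two versions can be pictured as:

    #include <stdio.h>

    static void foo(void) { puts("then"); }   /* placeholder behaviour */
    static void bar(void) { puts("else"); }   /* placeholder behaviour */

    /* Old version. */
    static void run_old(int x) {
        if (x % 2 == 0) foo(); else bar();
    }

    /* New version: the patch changes only x % 2 to x % 3. */
    static void run_new(int x) {
        if (x % 3 == 0) foo(); else bar();
    }

    int main(void) {
        /* Try the inputs discussed in the text. */
        for (int x = 6; x <= 9; x++) {
            printf("x = %d: old = ", x); run_old(x);
            printf("       new = ");     run_new(x);
        }
        return 0;
    }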


Let us assume that the developers chose inputs x = 6 and x = 7 to test the patch. Do these inputs comprehensively test the patch? The first reaction might be to say yes, since they cover each side of the branch in both the old and the new version. However, the true answer is that these are bad choices, as each of these inputs follows the same side of the branch in both versions, making the program behave identically before and after the patch. Instead, inputs such as x = 8 or x = 9 are good choices, as they trigger different behaviour at the code level: e.g. x = 8 follows the ‘then’ side in the old version, but the ‘else’ side in the new version.


Shadow symbolic execution is a new technique that assists developers with generating such inputs. Our tool Shadow starts with an input that exercises the patch, either constructed manually or synthesised by KATCH, and generates new inputs that trigger different behaviour at the code level in the unpatched and patched versions. Like KATCH, Shadow is based on symbolic execution, this time augmented with the ability to perform a "four-way fork", as shown in Figure 2. Whenever we reach a branch condition that evaluates to semantically-different expressions in the two versions (say, 'old' in the old version and 'new' in the new version), instead of forking execution into two paths based on the behaviour of the new version (as in standard symbolic execution), we fork into up to four paths. On two of these paths, the two versions behave identically (denoted by same in the figure): both versions take either the 'then' branch (new ∧ old) or the 'else' branch (¬new ∧ ¬old). On the other two paths, the executions of the two versions diverge (denoted by diff in the figure): either the new version takes the 'then' branch and the old version the 'else' branch (new ∧ ¬old), or vice versa (¬new ∧ old). Of course, the last two cases are the ones of interest to Shadow, and when it encounters them, it generates a new input triggering the divergent behaviour between the two versions.
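To make the four cases concrete, the fragment below (plain C, purely illustrative) evaluates the old and new branch conditions of the earlier x % 2 / x % 3 example for a range of concrete inputs and classifies each input into one of the four cases; Shadow performs the same case analysis symbolically, by checking the satisfiability of the corresponding path constraints rather than by enumerating inputs:

    #include <stdio.h>
    #include <stdbool.h>

    int main(void) {
        for (int x = 0; x < 12; x++) {
            bool old_cond = (x % 2 == 0);   /* branch condition in the old version */
            bool new_cond = (x % 3 == 0);   /* branch condition in the new version */

            if (new_cond && old_cond)
                printf("x = %2d: same (both versions take the 'then' branch)\n", x);
            else if (!new_cond && !old_cond)
                printf("x = %2d: same (both versions take the 'else' branch)\n", x);
            else if (new_cond && !old_cond)
                printf("x = %2d: diff (new: 'then', old: 'else')\n", x);
            else
                printf("x = %2d: diff (new: 'else', old: 'then')\n", x);
        }
        return 0;
    }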

Shadow is not a fully automatic technique, requiring developers to create a single unified program in which the two versions are merged via ‘change’ annotations. For instance, in our example, the two versions would be unified by creating the if statement if change(x % 2, x % 3), in which the first argument represents the code expression from the old version and the second argument the corresponding expression from the new version. While mapping program elements across versions is a difficult task  [5], we discovered that in practice the process can be made sufficiently precise and could be (partially) automated using predefined patterns.
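As a conceptual illustration only (not Shadow's actual annotation interface), a change(old_expr, new_expr) construct can be approximated by a macro that selects the old or the new expression depending on which version the unified program should behave as; the SHADOW_NEW_VERSION flag below is hypothetical, and Shadow itself interprets both arguments symbolically along the same path:

    #include <stdio.h>

    /* Hypothetical flag selecting which version the unified program behaves as. */
    #ifndef SHADOW_NEW_VERSION
    #define SHADOW_NEW_VERSION 1
    #endif

    #if SHADOW_NEW_VERSION
    #define change(old_expr, new_expr) (new_expr)
    #else
    #define change(old_expr, new_expr) (old_expr)
    #endif

    int main(void) {
        int x = 8;
        /* Unified version of the example patch: the old expression x % 2 and the
           new expression x % 3 are merged into a single annotated statement. */
        if (change(x % 2, x % 3) == 0)
            puts("then");
        else
            puts("else");
        return 0;
    }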

We evaluated Shadow on the 22 patches from the GNU Coreutils programs included in the CoREBench suite of regression bugs. Similar to the benchmarks used to evaluate KATCH, these are mature, widely-used applications available in most UNIX distributions. We chose the CoREBench patches because they are known to introduce bugs, and furthermore, the bug fixes are known.

After applying Shadow, we were able to generate inputs that trigger code-level divergences for all but one patch. We were further able to generate inputs for which the two versions generate different outputs, as well as inputs that abort or trigger memory errors in the new version. Some sample inputs generated by Shadow are shown in Table 2.

While Shadow was not successful in all cases, the results are promising in terms of its ability to find regression bugs and augment the test suite in a meaningful way. Furthermore, even the generated inputs that expose expected divergences are great candidates for addition to the program's test suite, and can act as good documentation of the program changes.





As most readers would agree, continuously changing a program without having a single input that exercises those changes is unsustainable. On the other hand, as all developers know, comprehensively testing program patches is a difficult, tedious, and time-consuming task. Fortunately, program analysis techniques such as symbolic execution are becoming more and more scalable, and can be effectively extended to generate inputs that exercise program changes (as we did in KATCH) and trigger different behaviour across versions (as we did in Shadow). More work is required to scale these approaches to large software systems and integrate them in the development cycle, but initial results are promising. To stimulate research in this space, we make our benchmarks and our systems available for comparison (see https://srg.doc.ic.ac.uk/projects/katch/ and https://srg.doc.ic.ac.uk/projects/shadow/). More details on our techniques can be found in our publications [1, 6, 7, 9].


[1]   C. Cadar and H. Palikareva. Shadow symbolic execution for better testing of evolving software. In ICSE NIER’14, May 2014.
[2]   C. Cadar and K. Sen. Symbolic Execution for Software Testing: Three Decades Later. CACM, 56(2):82–90, 2013.
[3]   O. Crameri, N. Knezevic, D. Kostic, R. Bianchini, and W. Zwaenepoel. Staged deployment in Mirage, an integrated software upgrade testing and distribution system. In SOSP'07, Oct. 2007.
[4]   Z. Gu, E. T. Barr, D. J. Hamilton, and Z. Su. Has the bug really been fixed? In ICSE’10, May 2010.
[5]   M. Kim and D. Notkin. Program element matching for multi-version program analyses. In MSR’06, May 2006.
[6]   P. D. Marinescu and C. Cadar. High-coverage symbolic patch testing. In SPIN’12, July 2012.
[7]   P. D. Marinescu and C. Cadar. KATCH: High-coverage testing of software patches. In ESEC/FSE’13, Aug. 2013.
[8]   P. D. Marinescu, P. Hosek, and C. Cadar. Covrig: A framework for the analysis of code, test, and coverage evolution in real software. In ISSTA’14, July 2014.
[9]   H. Palikareva, T. Kuchta, and C. Cadar. Shadow of a doubt: Testing for divergences between software versions. In ICSE’16, May 2016.
[10]   Z. Yin, D. Yuan, Y. Zhou, S. Pasupathy, and L. Bairavasundaram. How do fixes become bugs? In ESEC/FSE’11, Sept. 2011.










