Since 2008, my colleagues and I have developed and maintained C-Reduce, a tool for programmatically reducing the size of C and C++ files that trigger compiler bugs. C-Reduce also usually does a credible job reducing test cases in languages other than C and C++; we'll return to that topic in Part 2.
Why Reduce Test Cases?
int a[1];
int b;
void c() {
  void *d = b = 0;
  for (;; b++)
    a[b] = (int) d++;
}
LLVM crashes when compiling this code at the -O2 optimization level. The bug report does not contain the original, unreduced test case, but most likely it was larger.
A reduced test case is preferable because:
it usually gets the compiler to misbehave quickly, reducing the number of function calls, memory allocations, etc. that the compiler developer has to step through and reason about while debugging
the reduced file contains little code not directly related to triggering the bug, making it less likely that compiler developers will be distracted by extraneous features of the test case
reduced test cases for the same bug often look similar to each other, whereas this is not normally true for unreduced files that trigger the same bug
there is often little or no discernible similarity between an unreduced source file and its reduced version, making it easier for compiler bugs triggered by proprietary code to be reported externally
The minimum-sized input triggering any particular compiler bug can be found using a trivial algorithm: exhaustive search of text strings of increasing size. This method is, of course, almost always intractable. In practice, test case reduction proceeds in the other direction: starting with a large, failure-inducing test case, incrementally making it smaller until a local minimum is reached.
A Bit of Background
The history of automated test case reduction does not seem to be well documented, but several examples can be found in software testing papers from the 1990s such as Differential Testing for Software and Massive Stochastic Testing of SQL. Test-case reduction was first studied in its own right in 2000 when Hildebrandt and Zeller introduced Delta Debugging: a general-purpose technique for test case reduction. Their algorithm uses a greedy search where a series of variants (our term, not theirs, for partially-reduced test case candidates) is produced by removing chunks of the test case. As reduction progresses, the chunk size is reduced, until it reaches some minimum-sized unit, such as a single line, token, or character. When no minimum-sized chunk can be removed from the test case without breaking the property that it triggers the bug, the Delta Debugger terminates. Almost all subsequent test-case reduction work, including ours, builds upon this work.
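To make the flavor of this approach concrete, here is a minimal sketch, in Python, of greedy chunk-based reduction; it is deliberately simpler than the real Delta Debugging algorithm, and the is_interesting predicate stands in for whatever check confirms that a candidate still triggers the bug:

# Minimal sketch of greedy, chunk-based reduction in the spirit of Delta
# Debugging (much simplified; the real algorithm is more sophisticated).
def reduce_by_chunks(lines, is_interesting):
    chunk = max(1, len(lines) // 2)
    while True:
        progress = False
        i = 0
        while i < len(lines):
            variant = lines[:i] + lines[i + chunk:]    # remove one chunk
            if variant and is_interesting(variant):
                lines = variant         # keep the smaller failure-inducing input
                progress = True         # and retry removal at the same position
            else:
                i += chunk              # dead end; try the next chunk
        if not progress:
            if chunk == 1:
                return lines            # local minimum at single-line granularity
            chunk = max(1, chunk // 2)  # refine the granularity and keep going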
Towards C-Reduce
I became interested in test case reduction when my colleagues and I began to find a lot of bugs in C compilers using random testing. We found so many bugs that reporting them became bottlenecked on reducing the bug-triggering programs. Since I was the one reporting the bugs we found, I was the one who felt the pain of manual test-case reduction, and it quickly got old. I eventually reported around 500 compiler bugs and I could not have done this without first creating C-Reduce.
At the time, the best open-source implementation of Delta Debugging, from UC Berkeley, was line-based and contained a significant innovation over the original algorithm: it could reorganize a file in such a way that all nested curly braces deeper than a configurable level would appear on a single line. Thus, at level zero, entire functions would be placed on the same line, enabling the line-based reducer to remove an entire function at once. At higher nesting levels, functions would be split across lines, enabling finer-grained reduction. This method worked well but the tool ended up being inadequate for my needs: it got stuck at local minima that were often orders of magnitude larger than what could be achieved when reducing by hand.
The limiting factor in the Berkeley Delta tool ("Delta" from now on) was obvious: it was not able to exploit enough of the structure of the file being reduced. For example, it could usually not do much to simplify arithmetic expressions. These sorts of simplifications tend to have a cascading effect: eliminating the last use of a variable allows its definition to be eliminated, etc. The obvious path forward was to write a new tool that solved a reduction problem that Delta could not solve, and then to alternate running this tool and Delta until a global fixpoint was reached. I did this, adding more and more reduction techniques over time. I eventually wrote a line-elimination pass in my new reducer, at which point Delta was subsumed and could be dropped.
We ended up keeping two elements of Delta's design. First, the configurable hierarchical reformatting of a test case based on curly brace nesting. This technique, followed by removing contiguous lines of code, is still one of C-Reduce's most useful first lines of attack on a test case. Second, Delta's mechanism for determining whether a given variant is "interesting." An interesting variant is used as the basis for further reduction steps; an uninteresting variant is a dead end, and is discarded. Delta determined interestingness by invoking a user-supplied program—typically a shell script—whose process exit code determines the interestingness of the current variant. The flexibility afforded by this small element of user extensibility ends up being extremely useful. For example, the interestingness test can discard test cases that trigger certain compiler warnings, it can attempt to disambiguate different crash bugs, etc.
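As an illustration (this is a sketch, not a script that ships with C-Reduce), an interestingness test for a crash bug might look like the following, written here in Python rather than shell; the compiler invocation, file name, and assertion text are placeholders, and an exit code of zero signals "interesting":

#!/usr/bin/env python3
# Illustrative interestingness test for a compiler-crash bug. A variant is
# interesting (exit 0) only if the compiler fails on it *and* prints the
# particular assertion message being isolated, so that unrelated crashes
# are rejected. Compiler, flags, file name, and message are placeholders.
import subprocess, sys

try:
    proc = subprocess.run(
        ["clang", "-O2", "-c", "small.c", "-o", "/dev/null"],
        capture_output=True, text=True, timeout=300)
except subprocess.TimeoutExpired:
    sys.exit(1)                         # a hung compiler is not the bug we want

if proc.returncode != 0 and "Assertion `isValid()' failed" in proc.stderr:
    sys.exit(0)                         # interesting
sys.exit(1)                             # uninteresting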
It is more challenging to reduce test cases that cause the compiler to emit incorrect object code than it is to reduce test cases that merely cause the compiler to crash. C-Reduce itself is agnostic about the character of the bug of interest: we push all of the difficulties in reducing miscompilation triggers into the interestingness test, which should try to answer questions such as:
is the variant well-defined by the C or C++ standard?
does the variant avoid depending on behaviors that are unspecified by the C or C++ standard?
does the buggy compiler turn the variant into an executable?
does this executable terminate within a specified time?
does the reference compiler (assumed to not contain the bug of interest) turn the variant into an executable?
does this executable also terminate within a specified time?
does the behavior of the two executables differ in a way that indicates that a miscompilation occurred?
The variant is interesting if the answer to all of these questions is "yes."
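To show how these questions might map onto an actual test, here is a hedged sketch in Python; the specific tool choices are assumptions made for the example (clang as the buggy compiler, gcc as the reference, the sanitizers as a partial check for undefined behavior) and are not prescribed by C-Reduce:

#!/usr/bin/env python3
# Illustrative interestingness test for a miscompilation bug, following the
# checklist above. Assumptions for the example: clang is the buggy compiler,
# gcc is the reference, and the sanitizers provide a (partial!) check that
# the variant does not execute undefined behavior.
import subprocess, sys

def run(cmd, timeout=60):
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None                     # treat a hang as a failure

def require(proc):
    if proc is None or proc.returncode != 0:
        sys.exit(1)                     # any failed step makes the variant uninteresting
    return proc

# Questions 1-2 (approximately): build with sanitizers and require a clean run.
require(run(["clang", "-O0", "-fsanitize=undefined,address",
             "-fno-sanitize-recover=all", "small.c", "-o", "./check"]))
require(run(["./check"]))

# Questions 3-4: the buggy compiler builds the variant and the result terminates.
require(run(["clang", "-O2", "small.c", "-o", "./buggy"]))
buggy = run(["./buggy"])

# Questions 5-6: likewise for the reference compiler.
require(run(["gcc", "-O2", "small.c", "-o", "./ref"]))
ref = run(["./ref"])

if buggy is None or ref is None:
    sys.exit(1)                         # one of the executables failed to terminate

# Question 7: interesting only if the two executables visibly disagree.
sys.exit(0 if (buggy.stdout, buggy.returncode) != (ref.stdout, ref.returncode) else 1)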
The hardest part of reducing programs that trigger miscompilation bugs is ensuring that variants avoid undefined behaviors (such as invalid pointer operations) and do not rely on unspecified behaviors (such as the order of evaluation of function arguments). A test case that does either of these things is ill-formed and can accomplish nothing beyond annoying compiler developers. Empirically, if undefined behavior is not actively avoided during test-case reduction, C-Reduce will almost certainly introduce it. The practical solution is to use suitable static and dynamic analysis tools to rule out ill-formed variants. Since no single tool exists that detects all undefined and unspecified behaviors in C and C++ programs, in practice we use a hybrid approach involving multiple tools. This approach is not completely satisfying, but it works well enough that C-Reduce can reliably produce useful reduced test cases for miscompilation bugs in C and C++ compilers.
Writing good interestingness tests for miscompilations takes a bit of practice. First, when there are many criteria that must be satisfied for a variant to be interesting, it is useful to minimize the test’s expected-case runtime by asking the quickest and most-likely-to-fail questions first. Second, it is easy to write buggy tests. More than one user has described C-Reduce as being something like the sorcerer's apprentice: it does an excellent job reducing according to the criteria it is given, but if these criteria contain any kind of loophole, C-Reduce is likely to find it. For example, it is easy to accidentally write a test that claims that the empty file is interesting.
From the start, C-Reduce’s main goal was to produce a very small final reduced test case, even when this takes longer than we would like. This is based on the premise that we should burn cycles instead of human time, and that reporting a compiler bug is not usually on the critical path; we can often afford to wait for a better result. The consequences of this decision can be seen in Tables 1 and 2 of this paper, which evaluates several test-case reduction methods: C-Reduce produces the smallest final output, but takes more time to do so.
A Modular, Domain-Independent Reducer Core
Although C-Reduce started out as a pet project solving a specific problem, it evolved into a research project involving a number of my colleagues, whose top-level goal was to produce an effective and usable reducer for C and C++ code as found in the wild. The first research contribution to come out of this effort was a way to achieve a clean mechanism/policy separation in a test case reducer. Previous reduction techniques had all baked specific transformations into the overall search strategy. That approach impedes extensibility, which we found to be crucial. The structure that we ended up with is a small core that invokes a collection of pluggable transformation passes until a global fixpoint is reached.
The API for C-Reduce passes is simple but—like many simple things—required a lot of iterations before it felt finished. It is based on the ideas that transformation passes should be stateless and that every pass should implement a linear sequence of transformations, each of which results in a variant that may or may not be interesting. The interface is as follows:
state new(filename, option)
Return a fresh state object. Each pass uses this state to keep track of where it is in the sequence of transformations that it is capable of performing. These states may contain arbitrary data items; the C-Reduce core treats them as opaque. A typical pass stores some kind of cursor—usually a byte offset, token offset, line number, or position in a tree traversal—in the state object.
The file referred to by filename is logically part of the state object even though it resides in the filesystem instead of in memory. Of course it would not be difficult to pass the contents of the file around as an in-memory object, but this approach would be slow when these objects are large: C-Reduce is frequently invoked on multi-megabyte preprocessed C++ files.
The "option" is used to select among different behaviors implemented by a composite pass.
state advance(filename, option, state)
Return a new state object referring to the next transformation opportunity following the one referenced by the state object passed as a parameter.
result transform(filename, option, state)
Modify the file in-place, selecting the transformation instance referred to by the state object. The result takes one of three values:
OK : the transformation succeeded
STOP : no more transformation instances remain for this pass
ERROR : something went wrong; for example, an external tool crashed, a working file or directory could not be created, etc.
(The API contains one additional method, which checks whether a pass's external dependencies are satisfied, that doesn't matter here.)
Our experience has been that every transformation pass that we wanted has been easy to implement behind this API.
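As a concrete (and purely hypothetical) illustration of the API's shape, a line-removal pass could be sketched in Python as follows, even though the real passes are written in Perl and C++; the state object here is nothing more than the index of the next line to try deleting, and the option parameter is unused:

# Hypothetical sketch of a line-removal pass behind the API above (real
# C-Reduce passes are written in Perl or C++). The state is just a cursor:
# the index of the line that the next transformation will try to delete.
OK, STOP, ERROR = "OK", "STOP", "ERROR"

def new(filename, option):
    return 0                            # start at the first line

def advance(filename, option, state):
    return state + 1                    # next transformation opportunity

def transform(filename, option, state):
    with open(filename) as f:
        lines = f.readlines()
    if state >= len(lines):
        return STOP                     # no transformation instances remain
    del lines[state]                    # apply the transformation in place
    with open(filename, "w") as f:
        f.writelines(lines)
    return OK

Such a pass cooperates naturally with the core loop below: when deleting the line at the cursor yields an interesting variant, the core keeps the smaller file and reuses the same state, which now refers to the following line; when it does not, the core calls advance.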
The C-Reduce core implements this algorithm:
current = original_test_case
do
  size_at_start = size(current)
  foreach (p, option) in pass_list
    state = p::new(current, option)
    do
      variant = current                            // this is a file copy operation
      result = p::transform(variant, option, state)
      if result == ERROR
        report_problem_in_pass(p, option)
      if result == OK
        if is_interesting(variant)
          current = variant                        // also a file copy
        else
          state = p::advance(current, option, state)
    while result == OK
while size(current) < size_at_start
The termination argument for C-Reduce is:
Since the outermost loop requires the size of the test case to decrease monotonically, it can execute at most as many times as there are bytes in the unreduced test case. In practice, it executes far fewer times than this.
The loop over passes terminates because the pass list is immutable after C-Reduce’s initialization phase.
Each iteration of the innermost loop either advances the state object or else (by selecting an interesting variant) removes one transformation opportunity. Either way, the number of transformations remaining in the current pass is decreased by one.
The interestingness test is, at worst, terminated (using OS support for killing a process group) after a configurable timeout.
In practice, the weak link in this argument is the third item, which is vulnerable to bugs in passes. C-Reduce terminates robustly by abandoning passes when they appear to be behaving unreasonably.
The C-Reduce core does not insist that transformations make the test case smaller; in fact, quite a few of its passes can temporarily increase its size. The goal of these passes is to eliminate sources of coupling within the test case, unblocking progress in other passes.
The sequence of transformation passes is carefully orchestrated such that passes that are likely to give the biggest wins—such as those that remove entire functions—run first; otherwise the tool would end up spending days or weeks doing silly things such as trying to shrink numeric constants in a huge source file. Shrinking numbers is useful, and it should be done, but only after many other reduction mechanisms have run to completion.
C-Reduce's collection of cooperating passes, with heavy phase-ordering constraints, is highly reminiscent of how a modern optimizing compiler works. However, only a small proportion of the transformation passes is intended to be semantics-preserving in the sense that a compiler's optimization passes must be. In this domain, we only want to preserve enough semantics that we can probabilistically avoid breaking whatever property makes a test case interesting.
A consequence of writing a modular reducer is that once we came up with the right API for writing passes, we were free to write a lot of passes. My colleagues and I spent several years doing this and we ended up with:
35 passes, implemented in Perl, that include heuristics such as removing lines, removing various kinds of matched delimiters (and perhaps also the text between them), and shrinking integer values
6 passes that invoke external utilities such as unifdef, a partial evaluator for the C preprocessor language, a lexer for C and C++ that supports various token-level reduction transformations, and pretty-printing utilities that make the reduced test case more pleasant to look at
69 passes, implemented in C++, that use LLVM's Clang front end as a library for source-to-source transformation of C and C++ code; these include function inlining, partial template instantiation, scalar replacement of aggregates, copy propagation, and eliminating levels of a class hierarchy.
The actual number of dynamic passes is larger than the total of these numbers since some passes can be invoked in different modes using the "option" parameter mentioned above.
In this piece, we looked at why we had to create C-Reduce and at the modular structure that was key to making it solve the problems that we wanted to solve. In Part 2, I'll describe how C-Reduce improves reduction times using multiple cores and why C-Reduce usually does a good job reducing test cases in languages other than C and C++; finally, I'll discuss a few open research problems in test case reduction.