Monday, April 9, 2018

Efficiently and Automatically Detecting Flaky Tests with DeFlaker

By: Jonathan Bell, George Mason University (@_jon_bell_), Owolabi Legunsen, UIUC, Michael Hilton (@michaelhilton), CMU, Lamyaa Eloussi, UIUC, Tifany Yung, UIUC, and Darko Marinov, UIUC.
Associate editor: Sarah Nadi, University of Alberta (@sarahnadi), Bogdan Vasilescu, Carnegie Mellon University (@b_vasilescu).
Flaky tests are tests that can non-deterministically pass or fail for the same version of the code under test. Therefore, flaky tests can be incredibly frustrating for developers. Ideally, every new test failure would be due to the latest changes that a developer made, and the developer could subsequently focus on debugging these changes. However, because their outcome depends not only on the code, but also on tricky non-determinism (e.g. dependence on external resources or thread scheduling), flaky tests can be very difficult to debug. Moreover, if a developer doesn’t know that a test failure is due to a flaky test (rather than a regression that they introduced), how does the developer know where to start debugging: their recent changes, or the failing test? If the test failure is not due to their recent changes, then should they debug the test failure immediately, or later?
Flaky tests plague both large and small companies. Google reports that 1 in 7 of their tests have some level of flakiness associated with them. A quick search also turns up many StackOverflow discussions and details about flaky tests at Microsoft, ThoughtWorks, SemaphoreCI and LucidChart.

Traditional Approach for Detecting Flaky Tests

Prior to our work, the most effective way to detect flaky tests was to repeatedly rerun failed tests. If some rerun passes, the test is definitely flaky; but if all reruns fail, the status is unknown (it might or might not be flaky). Rerunning failed tests is directly supported by many testing frameworks, including Android, Jenkins, Maven, Spring, FaceBook's Buck and Google TAP. Rerunning every failed test is extremely costly when organizations see hundreds to millions of test failures per day. Even Google, with its vast compute resources, does not rerun all (failing) tests on every commit but reruns only those suspected to be flaky, and only outside of peak test execution times.

Our New Approach to Detecting Flaky Tests

Our approach, DeFlaker, detects flaky tests without re-running them, and imposes only a very modest performance overhead (often less than 5% in our large-scale evaluation). Recall that a test is flaky if it can both pass and fail when it executes the same code, i.e., code that did not change; moreover, a test failure is new if the test passed on the previous version of code but fails in the current version. Between each test suite execution, DeFlaker tracks information about what code has changed (using information from a version control system, like git), what test outcomes have changed, and which of those tests executed any changed code. If a test passed on a prior run, but now fails, and has not executed any changed code, then DeFlaker warns that it is a flaky test failure.
However, on the surface, tracking coverage for detecting flaky tests may seem costly: industry reports suggest that collecting statement coverage is often avoided due to the overhead imposed by coverage tools, a finding echoed by our own empirical evaluation of popular code coverage tools (JaCoCo, Cobertura and Clover).
Our key insight in DeFlaker is that one need not collect coverage of the entire codebase in order to detect flaky tests. Instead, one can collect only the coverage of the changed code, which we call differential coverage. Differential coverage first queries a version-control system (VCS) to detect code changes since the last version. It then analyzes the structure of each changed file to determine where exactly instrumentation needs to be inserted to safely track the execution of each change, allowing it to track coverage of changes much faster than traditional coverage tools. Finally, when tests run, DeFlaker monitors execution of these changes and outputs each test that failed and is likely to be flaky.

Finding Flaky Tests

How useful is DeFlaker to developers? Can it accurately identify test failures that are due to flakiness? We performed an extensive evaluation of DeFlaker by re-running historical builds for 26 open-source Java projects, executing over 47,000 builds. When a test failed, we attempted to diagnose it both with DeFlaker and by rerunning it. We also deployed DeFlaker live on Travis CI, where we integrated DeFlaker into the builds of 96 projects. In total, our evaluation involved executing over five CPU-years of historical builds. We also studied the relative overhead of using DeFlaker compared to a normal build. A complete description of our extensive evaluation is available in our ICSE 2018 paper.
Our primary goal was to evaluate how many flaky tests DeFlaker would find, compared with the traditional (rerun) approach. For each build, whenever a test failed, we re-ran the test using the rerun facility in Maven’s Surefire test runner. We were interested to find that this approach only resulted in 23% of test failures eventually passing (hence, marked as flaky) even if we allowed for up to five reruns of each failed test. On the other hand, DeFlaker marked 95% of test failures as flaky! Given that we are reviewing only code that was committed to version control repositories, we expected that it would be rare to find true test failures (and that most would be flaky).
We found that the strategy by which a test is rerun matters greatly: make a poor choice, and the test will continue to fail for the same reason as the first failure, causing the developer to assume that the failure was a true failure (and not a flaky test). Maven’s flaky test re-runner reran each failed test in the same process as the initial failed execution --- which we found to often result in the test continuing to fail. Hence, to better find flaky test failures, and to understand how to best use reruns to detect flaky tests, we experimented with the following strategies, rerunning failed tests: (1) Surefire: up to five times in the same JVM in which the test ran (Maven’s rerun technique); then, if it still did not pass; (2) Fork: up to five times, with each execution in a clean, new JVM; then, if it still did not pass; (3) Reboot: up to five times, running a mvn clean between tests and rebooting the virtual machine between runs.
As shown in the figure below, we found nearly 5,000 flaky test failures using the most time-consuming rerun strategy (Reboot). DeFlaker found nearly as many of these same flaky tests (96%) with a very low false alarm rate (1.5%), and at a significantly lower cost. It’s also interesting to note that when a rerun strategy worked, it generally worked after a single rerun (few additional tests were detected from additional reruns). We demonstrated that DeFlaker was fast by calculating the time overhead of applying DeFlaker to the ten most recent builds of each of those 26 projects. We compared DeFlaker to several state-of-the-art tools: the regression test selection tool Ekstazi, and the code coverage tools JaCoCo, Cobertura, and Clover. Overall, we found that DeFlaker was very fast, often imposing an overhead of less than 5% — far faster than the other coverage tools that we looked at.
Overall, based on these results, we find DeFlaker to be a more cost-beneficial approach to run before or together with reruns, which allows us to suggest a potentially optimal way to perform test reruns: For projects that have lots of failing tests, DeFlaker can be run on all the tests in the entire test suite, because DeFlaker immediately detects many flaky tests without needing any rerun. In cases where developers do not want to pay this 5% runtime cost to run DeFlaker (perhaps because they have very few failing tests normally), DeFlaker can be run only when rerunning failed tests in a new JVM; if the tests still fail but do not execute any changed code, then reruns can stop without costly reboots.
While DeFlaker’s approach is generic and can apply to nearly any language or testing framework, we implemented our tool for Java, the Maven build system, and two testing frameworks (JUnit and TestNG). DeFlaker is available under an MIT license on GitHub, with binaries published on Maven Central.
More information on DeFlaker (including installation instructions) is available on the project website and in our upcoming ICSE 2018 paper. We hope that our promising results will motivate others to try DeFlaker in their Maven-based Java projects, and to build DeFlaker-like tools for other languages and build systems.

For further reading on flaky tests:

Adriaan Labuschagne, Laura Inozemtseva and Reid Holmes
A study of test suite executions on TravisCI that investigated the number of flaky test failures.
John Micco
A summary of Flaky tests at Google and (as of 2016) the strategies used to manage them.
Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov
A study of the various factors that might cause tests to behave erratically, and what developers do about them.
A description of how the Chromium and WebKit teams triage and manage their flaky test failures.

1 comment:

  1. Interesting.
    The topic of trying to automatically repair flaky tests came up at the recent "Genetic Improvement of Software" Dagstuhl Seminar 18052 and Baudry et al.'s GI 2018 @ ICSE paper