IEEE Software Blog: April 2018

Monday, April 16, 2018

Which design best practices should be taken care of?

by Johannes Bräuer, Reinhold Plösch, Johannes Kepler University Linz, and Matthias Saft, Christian Körner, Corporate Technology Siemens AG
Associate Editor: Christoph Treude (@ctreude)

In the past, software metrics were used to express the compliance of source code with object-oriented design aspects [1], [2]. However, metrics proved too vague to drive concrete design improvements [3], and the idea of identifying code or design smells in source code became established [4].

Despite good progress in localising design flaws based on the identification of design smells, these design smells are still too fine-grained to support an overall design assessment. Consequently, we follow the idea of measuring and assessing the compliance of the source code with object-oriented design principles [5]. To do so, we systematically collected design principles that are applied in practice and then jointly derived more tangible design best practices [6]. These practices have the key advantage of being specific enough (1) to be applied by practitioners and (2) to be identified by an automatic tool. As a result, we developed the static code analysis tool MUSE, which currently contains a set of 67 design best practices (design rules) for the programming languages Java, C# and C++ [7].

Design best practices naturally differ in importance. To determine appropriate importance levels, we conducted a survey to gather data that allows a more differentiated view of the importance of the Java-related design best practices (i.e., a subset of 49 instances).

Survey on the Importance of Design Best Practices

The survey was available from 26th October until 21st November 2016. In total, 214 software professionals (software engineers, architects, consultants, etc.) completed the survey, resulting in an average of 134 opinions for each design best practice. Based on this data we derive a default importance, as depicted in Table 1. For clarity, the arrows indicate design best practices that are close to the next higher (↑) or lower (↓) importance level. Furthermore, we calculated a range based on the standard deviation that allows an increase or decrease of the importance within these borders. This data can be used as a basis to assess quality and to plan quality improvements.

Table 1. Design best practices ordered by importance

Design Best Practice	Default Importance	Importance Range
AvoidDuplicates	very high	very high
AvoidUsingSubtypesInSupertypes	very high	high-very high
AvoidPackageCycles	very high	high-very high
AvoidCommandsInQueryMethods	very high	high-very high
AvoidPublicFields	very high	high-very high
DocumentInterfaces	high ↑	moderate-very high
AvoidLongParameterLists	high ↑	high-very high
UseInterfacesIfPossible	high ↑	moderate-very high
AvoidStronglyCoupledPackages	high	moderate-very high
AvoidNonCohesiveImplementations	high	moderate-very high
AvoidUnusedClasses	high	moderate-very high
DontReturnUninvolvedData	high	moderate-very high
AvoidNonCohesivePackages	high	moderate-very high
DocumentPublicMethods	high	moderate-very high
UseCompositionNotInheritance	high	moderate-very high
DocumentPublicClasses	high	moderate-very high
AvoidPublicStaticFields	high	moderate-very high
AvoidDiamondInheritanceStructuresInterfaces	high	moderate-very high
AvoidLongMethods	high	moderate-very high
AvoidSimilarNamesForDifferentDesignElements	high	moderate-very high
AvoidUnusedAbstractions	high	moderate-very high
CheckUnsuitableFunctionality	high	moderate-very high
AvoidSimilarAbstractions	high	moderate-very high
DocumentPackages	high ↓	low-very high
UseInterfacesAsReturnType	high ↓	low-very high
AvoidUncheckedParametersOfSetters	high ↓	high-very high
AvoidSimilarNamesForSameDesignElements	moderate ↑	moderate-high
CheckObjectInstantiationsByName	moderate ↑	low-high
AvoidRepetitionOfPackageNamesOnAPath	moderate ↑	low-high
ProvideInterfaceForClass	moderate	low-high
AvoidRuntimeTypeIdentification	moderate	low-high
AvoidDirectObjectInstantiations	moderate	low-high
CheckUnusedSupertypes	moderate	low-high
AbstractPackagesShouldNotDependOnOtherPkg	moderate	low-high
DontReturnMutableCollectionsOrArrays	moderate	low-high
AvoidMassiveCommentsInCode	moderate	low-high
AvoidReturningDataFromCommands	moderate	low-high
UseAbstractions	moderate ↓	very low-high
CheckUsageOfNonFullyQualifiedPackageNames	low	low-moderate
AvoidManySetter	low	very low-moderate
AvoidHighNumberOfSubpackages	low	very low-moderate
AvoidConcretePackagesNotUsedFromOtherPkg	low	very low-moderate
AvoidSettersForHeavilyUsedFields	low	very low-moderate
AvoidAbstractClassesWithOneExtension	low	very low-moderate
DontInstantiateImplementationsInClients	low	very low-moderate
AvoidManyGetters	low	very low-moderate
AvoidProtectedFields	low	very low-moderate
CheckDegradedPackageStructure	low	very low-moderate
AvoidManyTinyMethods	low	very low-moderate
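
The derivation of a default importance and an importance range from the survey answers can be sketched roughly as follows. This is only an illustration under stated assumptions: the post does not say how answers were encoded or rounded, so the five-level scale mapped to 1..5, the rounding of the mean, and the "mean plus or minus one standard deviation" range are all hypothetical choices, and the class and method names are invented for this example.

```java
import java.util.List;

final class ImportanceSketch {
    // Assumed five-level ordinal scale, encoded as 1 (very low) .. 5 (very high).
    static final String[] LEVELS = {"very low", "low", "moderate", "high", "very high"};

    // Arithmetic mean of the votes; expects a non-empty list.
    static double mean(List<Integer> votes) {
        double sum = 0;
        for (int v : votes) sum += v;
        return sum / votes.size();
    }

    // Population standard deviation of the votes.
    static double stdDev(List<Integer> votes) {
        double m = mean(votes);
        double squares = 0;
        for (int v : votes) squares += (v - m) * (v - m);
        return Math.sqrt(squares / votes.size());
    }

    // Maps a score on the 1..5 scale to a level name, clamping to the ends of the scale.
    static String level(double score) {
        int idx = (int) Math.round(score) - 1;
        idx = Math.max(0, Math.min(LEVELS.length - 1, idx));
        return LEVELS[idx];
    }

    // Assumption: the default importance is the level closest to the mean vote.
    static String defaultImportance(List<Integer> votes) {
        return level(mean(votes));
    }

    // Assumption: the range spans one standard deviation below and above the mean.
    static String range(List<Integer> votes) {
        double m = mean(votes);
        double sd = stdDev(votes);
        String low = level(m - sd);
        String high = level(m + sd);
        return low.equals(high) ? low : low + "-" + high;
    }
}
```

With these assumptions, unanimous votes collapse the range to a single level, while more spread-out votes widen it, which matches the general shape of the ranges in Table 1.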

Beyond the Result of the Importance Assessment

Based on the survey result, we expanded our research in two directions: we further examined our idea of operationalizing design principles, and we recently proposed a design debt prioritization approach to guide design improvement activities.

While the survey findings revealed evidence of the importance of design best practices, the remaining question was whether the practices assigned to a specific design principle cover essential aspects of that principle or just touch on some minor design concerns. To answer this general question and to identify white spots in operationalizing certain principles, we conducted focus group research for 10 selected principles with 31 software design experts in six focus groups [8]. The result of this investigation showed that our design best practices are capable of measuring and assessing the major aspects of the examined design principles.

In the course of the focus group discussions and in communicating the survey result to practitioners, we identified the need to prioritize design best practice violations not only from the viewpoint of their importance, but also from the viewpoint of a quality state. As a result, we proposed a portfolio-based assessment approach that combines the importance of each design best practice (y-axis in Figure 1) with a quality index (x-axis in Figure 1) derived from a benchmark suite [9], [10]. This combination is presented as a portfolio matrix, as depicted in Figure 1 for the measurement result of a particular open-source project; in total, the 49 design best practices for Java are presented. Taking care of all 49 best practices is time-consuming and could be overwhelming for the project team. Consequently, the portfolio-based assessment approach groups the design best practices into four so-called investment areas, which recommend concrete improvement strategies.

Figure 1: Investment areas of portfolio matrix
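
To make the idea of the portfolio matrix more concrete, the following minimal sketch places a design best practice into one of four areas from its importance and its quality index. The names and boundaries of the actual investment areas are defined in [9], [10] and are not reproduced here; the neutral quadrant labels, the two cut-off parameters, and the class name are assumptions made purely for illustration.

```java
final class PortfolioSketch {
    // Hypothetical quadrant labels; the real investment areas are named in [9], [10].
    enum Area {
        HIGH_IMPORTANCE_LOW_QUALITY,
        HIGH_IMPORTANCE_HIGH_QUALITY,
        LOW_IMPORTANCE_LOW_QUALITY,
        LOW_IMPORTANCE_HIGH_QUALITY
    }

    // Splits the matrix with one cut on each axis (an assumed, simplified partitioning).
    static Area classify(double importance, double qualityIndex,
                         double importanceCut, double qualityCut) {
        boolean important = importance >= importanceCut;
        boolean goodQuality = qualityIndex >= qualityCut;
        if (important) {
            return goodQuality ? Area.HIGH_IMPORTANCE_HIGH_QUALITY
                               : Area.HIGH_IMPORTANCE_LOW_QUALITY;
        }
        return goodQuality ? Area.LOW_IMPORTANCE_HIGH_QUALITY
                           : Area.LOW_IMPORTANCE_LOW_QUALITY;
    }
}
```

In such a split, practices that are important but score poorly on the quality index would be the natural first candidates for improvement work.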

Concluding Remarks

To summarize this blog entry and to answer the question in the title, let’s reconsider the opinions of the 214 survey participants. From them, we derived the importance of the 49 design best practices, of which five are judged to be of very high importance. Specifically, code duplicates (code clones), supertypes using subtypes, package cycles, commands in query methods and public fields are the design concerns considered to be very important. In other words, avoiding violations of these design rules in practice can foster the flexibility, reusability and maintainability of a software product.
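
As a small illustration of two of these concerns, the sketch below contrasts a query method that also changes state with a version that separates the query from the command, while keeping the field private. It reflects a common reading of the rule names AvoidCommandsInQueryMethods and AvoidPublicFields; the exact detection criteria that MUSE applies are not described in this post, and the class is hypothetical.

```java
final class CounterSketch {
    // Kept private instead of being exposed as a public field.
    private int value;

    // Problematic: looks like a query, but also mutates state (a command inside a query method).
    int getAndBump() {
        return value++;
    }

    // Separated alternative: the query has no side effect ...
    int value() {
        return value;
    }

    // ... and the command changes state without returning data.
    void increment() {
        value++;
    }
}
```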

For more details about the conducted survey, we refer interested readers to the research article titled “A Survey on the Importance of Object-oriented Design Best Practices” [11].

References

[1] S. R. Chidamber and C. F. Kemerer, “A metrics suite for object oriented design,” IEEE Trans. Softw. Eng., vol. 20, no. 6, pp. 476–493, Jun. 1994.
[2] J. Bansiya and C. G. Davis, “A hierarchical model for object-oriented design quality assessment,” IEEE Trans. Softw. Eng., vol. 28, no. 1, pp. 4–17, Jan. 2002.
[3] R. Marinescu, “Measurement and quality in object-oriented design,” in Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM), Budapest, Hungary, 2005, pp. 701–704.
[4] R. Marinescu, “Detection strategies: metrics-based rules for detecting design flaws,” in Proceedings of the 20th IEEE International Conference on Software Maintenance, Chicago, IL, USA, 2004, pp. 350–359.
[5] J. Bräuer, “Measuring Object-Oriented Design Principles,” in Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 2015, pp. 882–885.
[6] R. Plösch, J. Bräuer, C. Körner, and M. Saft, “Measuring, Assessing and Improving Software Quality based on Object-Oriented Design Principles,” Open Comput. Sci., vol. 6, no. 1, 2016.
[7] R. Plösch, J. Bräuer, C. Körner, and M. Saft, “MUSE - Framework for Measuring Object-Oriented Design,” J. Object Technol., vol. 15, no. 4, pp. 2:1–29, Aug. 2016.
[8] J. Bräuer, R. Plösch, M. Saft, and C. Körner, “Measuring Object-Oriented Design Principles: The Results of Focus Group-Based Research,” J. Syst. Softw., 2018.
[9] J. Bräuer, M. Saft, R. Plösch, and C. Körner, “Improving Object-oriented Design Quality: A Portfolio- and Measurement-based Approach,” in Proceedings of the 27th International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement (IWSM-Mensura), Gothenburg, Sweden, 2017, pp. 244–254.
[10] J. Bräuer, R. Plösch, M. Saft, and C. Körner, “Design Debt Prioritization - A Design Best Practice-Based Approach,” in Proceedings of the 1st International Conference on Technical Debt (TechDebt), 2018.
[11] J. Bräuer, R. Plösch, M. Saft, and C. Körner, “A Survey on the Importance of Object-Oriented Design Best Practices,” in Proceedings of the 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Vienna, Austria, 2017, pp. 27–34.

Monday, April 9, 2018

Efficiently and Automatically Detecting Flaky Tests with DeFlaker

By: Jonathan Bell, George Mason University (@_jon_bell_), Owolabi Legunsen, UIUC, Michael Hilton (@michaelhilton), CMU, Lamyaa Eloussi, UIUC, Tifany Yung, UIUC, and Darko Marinov, UIUC.

Associate editor: Sarah Nadi, University of Alberta (@sarahnadi), Bogdan Vasilescu, Carnegie Mellon University (@b_vasilescu).

Flaky tests are tests that can non-deterministically pass or fail for the same version of the code under test. Therefore, flaky tests can be incredibly frustrating for developers. Ideally, every new test failure would be due to the latest changes that a developer made, and the developer could subsequently focus on debugging these changes. However, because their outcome depends not only on the code, but also on tricky non-determinism (e.g. dependence on external resources or thread scheduling), flaky tests can be very difficult to debug. Moreover, if a developer doesn’t know that a test failure is due to a flaky test (rather than a regression that they introduced), how does the developer know where to start debugging: their recent changes, or the failing test? If the test failure is not due to their recent changes, then should they debug the test failure immediately, or later?

Flaky tests plague both large and small companies. Google reports that 1 in 7 of their tests have some level of flakiness associated with them. A quick search also turns up many StackOverflow discussions and details about flaky tests at Microsoft, ThoughtWorks, SemaphoreCI and LucidChart.

Traditional Approach for Detecting Flaky Tests

Prior to our work, the most effective way to detect flaky tests was to repeatedly rerun failed tests. If some rerun passes, the test is definitely flaky; but if all reruns fail, the status is unknown (it might or might not be flaky). Rerunning failed tests is directly supported by many testing frameworks, including Android, Jenkins, Maven, Spring, Facebook's Buck and Google TAP. Rerunning every failed test is extremely costly when organizations see hundreds to millions of test failures per day. Even Google, with its vast compute resources, does not rerun all (failing) tests on every commit but reruns only those suspected to be flaky, and only outside of peak test execution times.

Our New Approach to Detecting Flaky Tests

Our approach, DeFlaker, detects flaky tests without re-running them, and imposes only a very modest performance overhead (often less than 5% in our large-scale evaluation). Recall that a test is flaky if it can both pass and fail when it executes the same code, i.e., code that did not change; moreover, a test failure is new if the test passed on the previous version of code but fails in the current version. Between each test suite execution, DeFlaker tracks information about what code has changed (using information from a version control system, like git), what test outcomes have changed, and which of those tests executed any changed code. If a test passed on a prior run, but now fails, and has not executed any changed code, then DeFlaker warns that it is a flaky test failure.
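
The decision rule described above can be written down in a few lines. This is a minimal sketch rather than DeFlaker's implementation: representing covered and changed code as sets of string identifiers, and all names used here, are assumptions for illustration.

```java
import java.util.Collections;
import java.util.Set;

final class FlakyRuleSketch {
    enum Outcome { PASS, FAIL }

    // Reports a likely flaky failure: the test passed on the previous version,
    // fails now, and executed none of the changed code.
    static boolean isLikelyFlaky(Outcome previous, Outcome current,
                                 Set<String> coveredByTest, Set<String> changedCode) {
        boolean newFailure = previous == Outcome.PASS && current == Outcome.FAIL;
        if (!newFailure) {
            return false;
        }
        // If the test touched any changed code, the change itself may explain the failure.
        return Collections.disjoint(coveredByTest, changedCode);
    }
}
```

For example, a test that passed before, fails now, and covered only `A.java:10` while the change touched `B.java:3` would be flagged, whereas the same failure with overlapping coverage would not.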

However, on the surface, tracking coverage for detecting flaky tests may seem costly: industry reports suggest that collecting statement coverage is often avoided due to the overhead imposed by coverage tools, a finding echoed by our own empirical evaluation of popular code coverage tools (JaCoCo, Cobertura and Clover).

Our key insight in DeFlaker is that one need not collect coverage of the entire codebase in order to detect flaky tests. Instead, one can collect only the coverage of the changed code, which we call differential coverage. Differential coverage first queries a version-control system (VCS) to detect code changes since the last version. It then analyzes the structure of each changed file to determine where exactly instrumentation needs to be inserted to safely track the execution of each change, allowing it to track coverage of changes much faster than traditional coverage tools. Finally, when tests run, DeFlaker monitors execution of these changes and outputs each test that failed and is likely to be flaky.
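
The first step of differential coverage, finding out which code changed, can be illustrated with a small parser for unified diff output. This sketch assumes the input is the text produced by `git diff -U0` (no context lines), so that each hunk's `+` range covers exactly the added or modified lines. It does not reproduce DeFlaker's structural analysis of changed files or its instrumentation, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChangedLinesSketch {
    // Hunk header of a unified diff, e.g. "@@ -12,0 +13,2 @@"; a missing count means 1.
    private static final Pattern HUNK =
            Pattern.compile("^@@ -\\d+(?:,\\d+)? \\+(\\d+)(?:,(\\d+))? @@.*");

    // Returns, per file, the line numbers (in the new version) touched by the diff.
    // Simplification: any line starting with "+++ " is treated as a file header.
    static Map<String, List<Integer>> changedLines(String unifiedDiff) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        String currentFile = null;
        for (String line : unifiedDiff.split("\n")) {
            if (line.startsWith("+++ ")) {
                String path = line.substring(4).trim();
                // "/dev/null" marks a deleted file; git prefixes new paths with "b/".
                if (path.equals("/dev/null")) {
                    currentFile = null;
                } else {
                    currentFile = path.startsWith("b/") ? path.substring(2) : path;
                }
                continue;
            }
            Matcher m = HUNK.matcher(line);
            if (currentFile != null && m.matches()) {
                int start = Integer.parseInt(m.group(1));
                int count = m.group(2) == null ? 1 : Integer.parseInt(m.group(2));
                List<Integer> lines = result.computeIfAbsent(currentFile, k -> new ArrayList<>());
                for (int i = 0; i < count; i++) {
                    lines.add(start + i);
                }
            }
        }
        return result;
    }
}
```

A coverage tool built on this idea would then only need to watch the returned lines instead of every statement in the codebase.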

Finding Flaky Tests

How useful is DeFlaker to developers? Can it accurately identify test failures that are due to flakiness? We performed an extensive evaluation of DeFlaker by re-running historical builds for 26 open-source Java projects, executing over 47,000 builds. When a test failed, we attempted to diagnose it both with DeFlaker and by rerunning it. We also deployed DeFlaker live on Travis CI, where we integrated DeFlaker into the builds of 96 projects. In total, our evaluation involved executing over five CPU-years of historical builds. We also studied the relative overhead of using DeFlaker compared to a normal build. A complete description of our extensive evaluation is available in our ICSE 2018 paper.

Our primary goal was to evaluate how many flaky tests DeFlaker would find, compared with the traditional (rerun) approach. For each build, whenever a test failed, we re-ran the test using the rerun facility in Maven’s Surefire test runner. We were interested to find that this approach only resulted in 23% of test failures eventually passing (hence, marked as flaky) even if we allowed for up to five reruns of each failed test. On the other hand, DeFlaker marked 95% of test failures as flaky! Given that we are reviewing only code that was committed to version control repositories, we expected that it would be rare to find true test failures (and that most would be flaky).

We found that the strategy by which a test is rerun matters greatly: make a poor choice, and the test will continue to fail for the same reason as the first failure, causing the developer to assume that the failure was a true failure (and not a flaky test). Maven’s flaky test re-runner reran each failed test in the same process as the initial failed execution, which we found often resulted in the test continuing to fail. Hence, to better find flaky test failures, and to understand how best to use reruns to detect flaky tests, we experimented with the following strategies, each applied only if the previous one did not produce a passing run: (1) Surefire: rerun up to five times in the same JVM in which the test ran (Maven’s rerun technique); (2) Fork: rerun up to five times, with each execution in a clean, new JVM; (3) Reboot: rerun up to five times, running a mvn clean between tests and rebooting the virtual machine between runs.
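
The escalation across rerun strategies can be sketched as a simple loop. The code below is only an outline under assumptions: each strategy is modelled as a callback that runs the test once and reports whether it passed, and the actual work of forking a JVM or rebooting a machine is left out. The class, field, and method names are hypothetical.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

final class RerunSketch {
    // A named rerun strategy; runOnce returns true if the rerun passed.
    static final class Strategy {
        final String name;
        final BooleanSupplier runOnce;

        Strategy(String name, BooleanSupplier runOnce) {
            this.name = name;
            this.runOnce = runOnce;
        }
    }

    // Tries the strategies in order, each up to maxRerunsPerStrategy times.
    // Returns the name of the first strategy under which the test passed
    // (so the failure was flaky), or null if every rerun failed (status unknown).
    static String firstPassingStrategy(List<Strategy> strategies, int maxRerunsPerStrategy) {
        for (Strategy strategy : strategies) {
            for (int attempt = 0; attempt < maxRerunsPerStrategy; attempt++) {
                if (strategy.runOnce.getAsBoolean()) {
                    return strategy.name;
                }
            }
        }
        return null;
    }
}
```

Ordering the list from cheapest to most expensive strategy means the costly ones are only reached for failures that the cheaper ones could not clear.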

As shown in the figure below, we found nearly 5,000 flaky test failures using the most time-consuming rerun strategy (Reboot). DeFlaker found nearly as many of these same flaky tests (96%) with a very low false alarm rate (1.5%), and at a significantly lower cost. It’s also interesting to note that when a rerun strategy worked, it generally worked after a single rerun (few additional tests were detected from additional reruns). We demonstrated that DeFlaker was fast by calculating the time overhead of applying DeFlaker to the ten most recent builds of each of those 26 projects. We compared DeFlaker to several state-of-the-art tools: the regression test selection tool Ekstazi, and the code coverage tools JaCoCo, Cobertura, and Clover. Overall, we found that DeFlaker was very fast, often imposing an overhead of less than 5% — far faster than the other coverage tools that we looked at.

Overall, based on these results, we find DeFlaker to be a cost-effective approach to run before or together with reruns, which allows us to suggest a potentially optimal way to perform test reruns. For projects that have lots of failing tests, DeFlaker can be run on all the tests in the entire test suite, because DeFlaker immediately detects many flaky tests without needing any rerun. In cases where developers do not want to pay this 5% runtime cost to run DeFlaker (perhaps because they normally have very few failing tests), DeFlaker can be run only when rerunning failed tests in a new JVM; if the tests still fail but do not execute any changed code, then reruns can stop without costly reboots.

While DeFlaker’s approach is generic and can apply to nearly any language or testing framework, we implemented our tool for Java, the Maven build system, and two testing frameworks (JUnit and TestNG). DeFlaker is available under an MIT license on GitHub, with binaries published on Maven Central.

More information on DeFlaker (including installation instructions) is available on the project website and in our upcoming ICSE 2018 paper. We hope that our promising results will motivate others to try DeFlaker in their Maven-based Java projects, and to build DeFlaker-like tools for other languages and build systems.

For further reading on flaky tests:

Measuring the cost of regression testing in practice: a study of Java projects using continuous integration (FSE 2017)

Adriaan Labuschagne, Laura Inozemtseva and Reid Holmes

A study of test suite executions on TravisCI that investigated the number of flaky test failures.

Flaky Tests at Google and How We Mitigate Them (Google Testing Blog, 2016)

John Micco

A summary of flaky tests at Google and (as of 2016) the strategies used to manage them.

An Empirical Analysis of Flaky Tests (FSE 2014)

Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov

A study of the various factors that might cause tests to behave erratically, and what developers do about them.

Chromium Project’s Flaky Test Dashboard

A description of how the Chromium and WebKit teams triage and manage their flaky test failures.
