Monday, July 17, 2017

Performance testing in Java-based open source projects

by Cor-Paul Bezemer (@corpaul), Queen's University, Canada
Associate Editor: Zhen Ming (Jack) Jiang, York University, Canada

From a functional perspective, the quality of open source software (OSS) is on par with comparable closed-source software [1]. However, in terms of nonfunctional attributes, such as reliability, scalability, or performance, the quality is less well understood. For example, Heger et al. [2] stated that performance bugs in OSS go undiscovered for a longer time than functional bugs, and that fixing them takes longer.

As many OSS libraries (such as apache/log4j) are used almost ubiquitously across a large span of other OSS or industrial applications, a performance bug in such a library can lead to widespread slowdowns. Hence, it is of utmost importance that the performance of OSS is well-tested.

We studied 111 Java-based open source projects from GitHub to explore to what extent and how OSS developers conduct performance tests. First, we searched for projects that included at least one of the keywords 'bench' or 'perf' in the 'src/test' directory. Second, we manually identified the performance and functional tests inside each project. Third, we identified performance-sensitive projects, which mentioned in the description of their GitHub repository that they are the 'fastest', 'most efficient', etc. For a more thorough description of our data collection process, see our ICPE 2017 paper [3]. In the remainder of this blog post, we highlight the most significant findings of our study.
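The first selection step can be approximated on a checked-out project with a few lines of Java. This is only a sketch of the idea, not the tooling used in the study: the class name `PerfTestCandidateFinder` and its methods are hypothetical, and it assumes that the keywords are matched case-insensitively against file names, which the text above does not specify.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: lists files under src/test whose names hint at performance tests. */
final class PerfTestCandidateFinder {

    // The two keywords named in the study's first selection step.
    private static final String[] KEYWORDS = {"bench", "perf"};

    /** Assumption: a file is a candidate if its lower-cased name contains a keyword. */
    static boolean looksLikePerfTest(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (lower.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    /** Walks projectRoot/src/test and returns all candidate files, sorted by path. */
    static List<Path> findCandidates(Path projectRoot) throws IOException {
        Path testDir = projectRoot.resolve("src").resolve("test");
        if (!Files.isDirectory(testDir)) {
            return Collections.emptyList(); // project has no src/test directory
        }
        try (Stream<Path> files = Files.walk(testDir)) {
            return files
                    .filter(Files::isRegularFile)
                    .filter(p -> looksLikePerfTest(p.getFileName().toString()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Keyword matching of this kind over-approximates: a hypothetical file named `SuperFastTest.java` would match 'perf' by accident. This is one reason why a manual identification step follows the keyword search.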

Finding # 1 - Performance tests are maintained by a single developer or a small group of developers. 
In 50% of the projects, all performance test developers are one or two core developers of the project. In addition, only 44% of the test developers worked on the performance tests as well.

Finding # 2 - Compared to the functional tests, performance tests are small in most projects. 
The median SLOC (source lines of code) in performance tests in the studied projects was 246, while the median SLOC of functional tests was 3980. Interestingly, performance-sensitive projects do not seem to have more or larger performance tests than non-performance-sensitive projects.

Finding # 3 - There is no standard for the organization of performance tests. 
In 52% of the projects, the performance tests are scattered throughout the functional test suite. In 9% of the projects, code comments are used to communicate how a performance test should be executed. For example, the file from the nbronson/snaptree project contains the following comment:
* This is not a regression test, but a micro-benchmark.
* I have run this as follows:
* repeat 5 for f in -client -server;
* do mergeBench dolphin . jr -dsa\
*       -da f;
* done
public class RangeCheckMicroBenchmark {

In four projects, we even observed that code comments were used to communicate the results of a previous performance test run.

Finding # 4 - Most projects have performance smoke tests. 
We identified the following five types of performance tests in the studied projects:
  1. Performance smoke tests: These tests (50% of the projects) typically measure the end-to-end execution time of important functionality of the project.
  2. Microbenchmarks: 32% of the projects use microbenchmarks, which can be considered performance unit tests. Stefan et al. [4] studied microbenchmarks in depth in their ICPE 2017 paper.
  3. One-shot performance tests: These tests (15% of the projects) were meant to be executed once, e.g., to test the fix for a performance bug.
  4. Performance assertions: 5% of the projects try to integrate performance tests in the unit-testing framework, which results in performance assertions. For example, the file from the anthonyu/Kept-Collections project asserts that one bytecode serialization method is at least four times as fast as the alternative.
  5. Implicit performance tests: 5% of the projects do not have dedicated performance tests; their existing tests simply yield a performance metric (e.g., the execution time of the unit test suite).
The different types of tests show that there is a need for performance tests at different levels, ranging from low-level microbenchmarks to higher-level smoke tests.
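To make the idea of a performance assertion (type 4 above) concrete, the following sketch checks that one implementation is at least a given factor faster than another. It is not taken from any of the studied projects: the class `PerformanceAssertionSketch` and all of its method names are hypothetical, and it uses plain Java instead of a unit test framework to stay self-contained.

```java
/** Hypothetical sketch of a relative performance assertion in plain Java. */
final class PerformanceAssertionSketch {

    /** Runs the task the given number of times and returns the total elapsed nanoseconds. */
    static long timeNanos(Runnable task, int repetitions) {
        long start = System.nanoTime();
        for (int i = 0; i < repetitions; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    /** True if the fast measurement is at least 'factor' times shorter than the slow one. */
    static boolean isAtLeastTimesFaster(long fastNanos, long slowNanos, double factor) {
        return fastNanos * factor <= slowNanos;
    }

    /** Times both tasks and throws an AssertionError if the speed-up is below 'factor'. */
    static void assertAtLeastTimesFaster(Runnable fast, Runnable slow,
                                         double factor, int repetitions) {
        // One untimed round per task so that class loading and JIT compilation
        // are less likely to distort the timed round.
        timeNanos(fast, repetitions);
        timeNanos(slow, repetitions);

        long fastNanos = timeNanos(fast, repetitions);
        long slowNanos = timeNanos(slow, repetitions);
        if (!isAtLeastTimesFaster(fastNanos, slowNanos, factor)) {
            throw new AssertionError("expected a speed-up of at least " + factor
                    + "x, but measured " + fastNanos + " ns vs. " + slowNanos + " ns");
        }
    }
}
```

Inside a unit test, a call such as `assertAtLeastTimesFaster(fastImpl, slowImpl, 4.0, 10_000)` would fail the build whenever the measured ratio drops below four. Assertions of this kind are sensitive to measurement noise (JIT compilation, garbage collection, machine load), so the threshold and the number of repetitions need to be chosen conservatively.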

Finding # 5 - Dedicated performance test frameworks are rarely used. 
Only 16% of the studied projects used a dedicated performance test framework, such as JMH or Google Caliper. Most projects use a unit test framework to conduct their performance tests. One possible explanation is that developers are trying hard to integrate their performance tests into the continuous integration processes. 
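When a performance test is written by hand inside a unit test framework, the developer has to take care of measurement concerns that dedicated frameworks such as JMH handle through configurable warm-up and measurement iterations. The following sketch illustrates two of these concerns, warm-up runs and repeated measurements summarized by the median. The class `HandRolledBenchmark` and its methods are hypothetical and are not part of JMH, Caliper, or any studied project.

```java
import java.util.Arrays;

/** Hypothetical hand-rolled measurement helper; not a replacement for a real framework. */
final class HandRolledBenchmark {

    /** Runs untimed warm-up rounds, then returns one elapsed-time sample (ns) per measured run. */
    static long[] measure(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // result discarded: gives the JIT compiler a chance to optimize
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }

    /** Median of the samples; less sensitive to outliers (e.g., a GC pause) than the mean. */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2;
    }
}
```

A test could then compare `median(measure(task, 20, 50))` against a time budget. Even with warm-up and medians, such a helper does not address every pitfall of microbenchmarking on the JVM, which is one argument for a dedicated framework.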

The main takeaway of our study

Our observations imply that developers are currently missing a “killer app” for performance testing, which would likely standardize how performance tests are conducted, in the same way as JUnit standardized unit testing for Java. A ubiquitous performance testing tool will need to support performance tests on different levels of abstraction (smoke tests versus detailed microbenchmarking) and provide strong integration into existing build and CI tools. It will also need to support both extensive testing with rigorous methods and quick-and-dirty tests that pair reasonable expressiveness with being fast to write and maintain, even by developers who are not experts in software performance engineering.


[1] M. Aberdour. Achieving quality in open-source software. IEEE Software. 2007.
[2] C. Heger, J. Happe, and R. Farahbod. Automated Root Cause Isolation of Performance Regressions During Software Development. In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering (ICPE). 2013.
[3] P. Leitner and C.-P. Bezemer. An exploratory study of the state of practice of performance testing in Java-based open source projects. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering (ICPE). 2017. 
[4] P. Stefan, V. Horky, L. Bulej, and P. Tuma. Unit testing performance in Java projects: Are we there yet? In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering (ICPE). 2017.

If you like this article, you might also enjoy reading:

[1] Jose Manuel Redondo, Francisco Ortin. A Comprehensive Evaluation of Common Python Implementations. IEEE Software. 2015.
[2] Yepang Liu, Chang Xu, Shing-Chi Cheung. Diagnosing Energy Efficiency and Performance for Mobile Internetware Applications. IEEE Software. 2015.
[3] Francisco Ortin, Patricia Conde, Daniel Fernández Lanvin, Raúl Izquierdo. The Runtime Performance of invokedynamic: An Evaluation with a Java Library. IEEE Software. 2014.
