Monday, November 12, 2018

Correctness Attraction: A Study of Stability of Software Behavior Under Runtime Perturbation

By: Benjamin Danglot, Philippe Preux, Benoit Baudry, Martin Monperrus (@diversifyprojec, @stamp_project, @martinmonperrus)
Associate Editor: Federica Sarro (@f_sarro)

In the introductory class of statics, a branch of mechanics, one learns that there are two kinds of equilibrium: stable and unstable. Consider Figure 1, where a ball lies respectively in a basin (left) and on the top of a hill (right). The first ball is in a stable equilibrium: one can push it left or right, and it will come back to the same equilibrium point. On the contrary, the second one is in an unstable equilibrium: a perturbation as small as a blow of air directly results in the ball falling away.

Figure 1: The concept of stable and unstable equilibrium in physics motivates us to introduce the concept of “correctness attraction”.

In one of his famous lectures [2], Dijkstra stated a fundamental hypothesis about the nature of software: “the smallest possible perturbations – i.e. changes of a single bit – can have the most drastic consequences.” Viewed from the perspective of statics, this means that Dijkstra considers software as a system that can only be in an unstable equilibrium or, to put it more precisely, that the correctness of a program output is unstable with respect to perturbations.

In our recent work [1], we have performed an original experiment to perturb program executions. Our protocol consists of perturbing the execution of programs according to a perturbation model and observing whether this has an impact on the correctness of the output. We observe two different outcomes: either the perturbation breaks the computation and results in an incorrect output (unstable under perturbation), or the correctness of the output is stable despite the perturbation.

Let’s consider the PONE perturbation model, which consists of incrementing (+1) integer values at runtime. At some point during the execution, we increment an integer value. An equally small perturbation model is MONE, which decrements integers. In Dijkstra’s view, PONE and MONE are the smallest possible perturbations one can apply to integer data.
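To make the protocol concrete, here is a minimal, self-contained sketch of how a PONE-style experiment could look: every integer expression of a small subject program is wrapped in a perturbation point, and one point per run is incremented. The wrapper p(), the subject program, and all names are ours for illustration, not the authors' actual framework.

```java
// Minimal sketch of a PONE-style experiment (illustrative, not the paper's tooling).
public class PerturbationSketch {
    static int target;  // which perturbation point to perturb (0 = none)
    static int count;   // perturbation points reached so far in this run

    static int p(int value) {
        count++;
        return count == target ? value + 1 : value;  // MONE would use value - 1
    }

    // Subject program: maximum of an array, with integer expressions wrapped.
    static int max(int[] a) {
        int m = p(a[0]);
        for (int i = p(1); i < a.length; i++) {
            if (p(a[i]) > m) {
                m = p(a[i]);
            }
        }
        return m;
    }

    // Perturb each point in turn; report how many runs stay correct.
    static String experiment() {
        int[] input = {3, 9, 2, 9};
        target = 0;
        count = 0;
        int reference = max(input);  // unperturbed output: 9
        int correct = 0, total = 0;
        for (int t = 1; ; t++) {
            target = t;
            count = 0;
            int out = max(input);
            if (count < t) break;  // fewer points than t were executed: done
            total++;
            if (out == reference) correct++;
        }
        return correct + "/" + total;
    }

    public static void main(String[] args) {
        System.out.println(experiment() + " perturbed executions still correct");
    }
}
```

Even in this toy program, most perturbations are absorbed: incrementing a non-maximal element, or a value used only in a comparison, leaves the output correct.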

We perform the experiment on 10 Java programs whose sizes range from 42 to 568 lines of code. They all have the property that we can perfectly assess the correctness of their output thanks to a “perfect oracle”. These programs and their oracles are described in Table 1.

Table 1: Dataset of 10 subject programs used in our experiments.

In total, we perturb 2,917,701 separate executions of the ten programs. Among those 2,917,701 perturbed executions, 1,977,199 (67.76%) yield a correct output. This result surprised us, because it goes against the common intuition that programs are brittle. Our literature review revealed that this empirical phenomenon has no name yet.

Correctness attraction

When a perturbation does not break output correctness, we observe “stable correctness”. This is conceptually related to the notions of “stable equilibrium” and “attraction basin” in physics. Due to this conceptual proximity, we name the observed phenomenon “correctness attraction”. As shown in Figure 1, the intuition behind correctness attraction can be conveyed graphically. The correctness attraction basin at the left-hand side of Figure 1 refers to the input points for which a software system eventually reaches the same fixed and correct point under a perturbation model.
If you want to read more about this fascinating phenomenon, we discuss the reasons behind correctness attraction in our paper [1].
This original work opens interesting directions in the area of software reliability. If we engineer techniques to automatically improve correctness attraction, we would obtain zones in the code that can accommodate more perturbations of the runtime state, and those zones could be then deemed “bug absorbing zones”.


[1] B. Danglot, P. Preux, B. Baudry, and M. Monperrus. Correctness Attraction: A Study of Stability of Software Behavior Under Runtime Perturbation. Empirical Software Engineering, 2017.
[2] E. W. Dijkstra. On the cruelty of really teaching computing science. Dec. 1988.

Wednesday, October 31, 2018

Lessons Learned from Bad API Design

By Sven Amann (@svamann), Sarah Nadi (@sarahnadi), and Titus Winters (@TitusWinters)
Associate Editor: Sridhar Chimalakonda (@ChimalakondaSri)

Software libraries and their Application Programming Interfaces (APIs) are essential for proper code reuse, which leads to faster software development and fewer software bugs. However, a poorly designed API can end up causing more harm than good: clients of the API may use certain functionality incorrectly, leading to runtime failures, performance degradations, or security issues.
Over the last couple of decades, several program analysis techniques have been proposed to catch bugs arising from incorrect API usage [1]. Some of these techniques rely on mining patterns of API usage, whether from code or logs, and finding violations of these patterns, while others rely on checking that client code respects explicit annotations in the API code. Some languages come with annotations that allow the API designer to indicate certain restrictions on expected values or behavior (e.g., @NotNull or @Nullable in Java). Such annotations allow static checks of client code to warn the developer, e.g., when a null reference is potentially passed in a place expecting a non-null reference (IntelliJ, for example, has support for this). Unfortunately, such annotations are still not consistently used by API designers. Additionally, the space of possible violations and things that can go wrong is unbounded, so there may be desired checks or API design problems that have no corresponding annotations. That is: even meta-design via introducing annotations cannot solve all design problems. At some level, some design problems can only be identified through experience.

Moreover, the above checks and similar analyses are all a posteriori measures that typically catch only a subset of the problems. In this blog post, we take the position that we should try to build APIs that are hard to misuse in the first place. We provide a set of lessons learned from common examples of problematic API design decisions. These lessons are based on our own experiences, whether as part of industrial work or as part of our research, as well as examples from various sources. This list is by no means comprehensive, but a start for collecting common pitfalls of API design. The idea for this blog post was born out of our discussions during the 2nd International Workshop on API Usage and Evolution (WAPI ’18) in May 2018 in Sweden.

Lessons from Bad API Design

Lesson 1: Avoid the Name/Action Mismatch

Description: Most design guides (rightly) point out that types should be named as nouns (string), mutating/non-const methods on those types should be named like verbs/actions (reset), and const accessors should be named like adjectives (is_blue()) or actions (find, count). Sometimes, we don’t get that right.

C++ Example: In the C++ standard library, all containers (vector, set, list, etc.) have an .empty() method. In English, empty is both an adjective ("The glass is empty") and a verb ("Please empty the trash."). As such, it is entirely possible to encounter code that invokes empty() and does nothing with it.
std::vector<int> v = GetInts();
v.empty();  // Does nothing!
v.clear();  // Does what the author of the previous line probably meant
Java Example: In the Java Class Library (JCL), BigDecimal.setScale(...) does not actually set the scale on the receiver, but instead returns a new object with a different scale.
Another example is the peek method on Java’s stream API. The name suggests that by calling this method, the code peeks at the first element in the stream. However, the call merely returns a new stream with the peeking action attached; nothing happens yet. The peeking only actually occurs when the stream is processed, e.g., by calling collect() or forEach().
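Both Java pitfalls can be reproduced in a few lines, using the standard JCL APIs (variable names are ours):

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NameActionMismatch {
    public static void main(String[] args) {
        BigDecimal d = new BigDecimal("2.5");
        d.setScale(2);                      // does NOT set anything: the result is discarded
        BigDecimal scaled = d.setScale(2);  // what the author probably meant
        System.out.println(d);              // still 2.5
        System.out.println(scaled);         // 2.50

        // peek() only attaches an action; nothing runs until a terminal
        // operation consumes the stream.
        Stream<String> s = Stream.of("a", "b").peek(System.out::println);
        // No output so far; the printing happens inside collect():
        List<String> seen = s.collect(Collectors.toList());
    }
}
```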

Solution: Be very sure about your choices in naming. Make sure to consider all interpretations of your chosen API names. Be extra cautious about possible interpretations from non-native speakers. In the event that your chosen names can be misinterpreted, but changing the name isn’t an option, it may be useful to try to mitigate the problem. In C++, this has been done by adding the [[nodiscard]] attribute to the function. With this attribute, a compilation warning will be issued if the return value of the function is not used in some fashion. It’s likely that any pure function (sqrt(double)) and any side-effect free const method (empty()) should be tagged [[nodiscard]] to help identify bugs in client code.

Lesson 2: Avoid Poor/Unclear Type Invariants

Description: In many respects, type design in an OO language is about identifying the logical state of a type (how you describe the type, or the mental model of a programmer using your type), the physical state of a type (the data members that implement that logical state), and a set of invariants on the logical and physical state.

For instance, a type like std::string is logically “a contiguous buffer of characters of length size(), with at least one ‘\0’ character at position size().” Both these conditions can be considered the invariants of string; in other words, these are conditions that must hold at all times. Internally, a string can be represented in various ways by various standard library implementations, but a programmer does not need to know anything about the representation (the physical state) - the type provides an abstraction.

In order to do this, we rely on mechanisms such as data hiding - making data members private so that users of the type cannot arbitrarily change the internals. This is important: if the various pointers and lengths in a std::string were directly accessible and modifiable, it is only a matter of time before someone accesses something inappropriately and breaks our invariants - perhaps we are no longer null-terminated, or perhaps the allocation for our character buffer has been deleted. However, sometimes the invariants of a type are unclear or “poor”, in the sense that objects of the type may reach logical states in which the invariants do not hold. Especially for value types, having unclear invariants or a vague model of what a specific type actually is leads to types that are unnecessarily hard to use.
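The same discipline applies in any OO language. A tiny Java sketch (class and method names are ours) shows an invariant that survives only because the field is private and all mutation re-establishes it:

```java
// Illustrative sketch: the invariant "size() equals the number of stored
// characters" holds because the buffer is private and only mutated here.
public final class CharBuffer {
    private char[] data = new char[0];

    // All mutation goes through methods that re-establish the invariant.
    public void append(char c) {
        char[] next = new char[data.length + 1];
        System.arraycopy(data, 0, next, 0, data.length);
        next[data.length] = c;
        data = next;
    }

    public int size() { return data.length; }

    public char at(int i) { return data[i]; }
}
```

If data were public, any client could shrink or null out the array and silently break every other method's assumptions.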

C++ Example: Many designers argue against “two step initialization.” That is, types that can be constructed separately from becoming usable.
Foo f;  // Constructed, but not useful yet. Must call Init()
if (!f.Init(10)) throw Error("Failed to initialize!");
f.Run(); // Do whatever we’re supposed to do.
If we consider this from the perspective of invariants, this becomes clearer. In a two-step initialization design, every function needs to be labelled to describe whether it’s allowable to call that function before calling Init(). In the example above, Run() cannot be called unless the object is initialized using Init. As the class gets bigger, merely documenting which methods are safe to call when becomes more complex - and the odds of users doing it wrong go up.
It’s much better when your type has simpler invariants: if it is constructed, it’s valid.
Foo f(10);  // May throw if initialization failed
f.Run();  // Do whatever we’re supposed to do
Unfortunately, two-step initialization designs are still common, and the resulting types are consistently error prone. For instance, the design of std::fstream shows this design smell: the majority of the API for std::fstream cannot be used right after construction; it requires a secondary call to fstream::open().

C++ Example 2: In modern C++, there is also an inverse to two-step initialization. In C++11, the language added support for “move semantics” - an optimization of copying for an object where the source object is known to be unnecessary and can be consumed in the process of moving into the destination object.
std::string s1 = GetString();
std::string s2 = std::move(s1);  // Contents of s1 have been moved into s2
s1.clear();  // s1 still exists, but is in an undefined state.
//You can only safely call methods that have no preconditions/make no assumption about its contents.
In order to avoid adding a “moved-from” state to every type, the language says that a moved-from object is left in a “valid but unspecified” state - the invariants of the type are still upheld, but you cannot reason about the state. This allows the object to be used for operations with no preconditions (you can still call clear() or size() on a moved-from string, but you cannot assume whether or not a call to operator[] is valid).

Moved-from objects and invariants work just like two-step initialization - sometimes people design types where the moved-from state no longer obeys the rest of the type invariants. For instance, a smart pointer that expresses sole ownership (a la std::unique_ptr) and whose content is declared to never be null. When such a type is moved-from (the owned pointer is moved away) the result either violates the sole-ownership invariant or the not-null invariant. Such a NotNullUniquePtr is a common design with broken invariants.

Java Example: Many of the Java cryptography APIs also rely on a two-step initialization. The javax.crypto.Cipher is one such example:
Cipher encryptCipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
encryptCipher.init(Cipher.ENCRYPT_MODE, aesKey);
byte[] cipherText = encryptCipher.doFinal(plainTextByteArray);
Similar to the C++ example above, calling doFinal() without calling init() will not work. Simply calling getInstance() does not guarantee that the invariants of the type hold. Instead, one has to call init() to initialize the cipher and get an actual cipher object that respects the description of the class: “This class provides the functionality of a cryptographic cipher for encryption and decryption”.

Solution: Be clear what your type is, and document its invariants. Make sure that those invariants are as simple as possible and that they actually hold. Don’t add invalid states - neither at construction nor destruction (nor moved-from).
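In Java, one possible remedy for the Cipher case is a thin wrapper (a hypothetical class, not part of the JCE) that folds both steps into the constructor, so any constructed object satisfies the full invariant:

```java
// Hypothetical wrapper: a constructed AesEncryptor is always ready to use,
// because construction performs both getInstance() and init().
import javax.crypto.Cipher;
import javax.crypto.SecretKey;

public final class AesEncryptor {
    private final Cipher cipher;  // invariant: always initialized

    public AesEncryptor(SecretKey key) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key);  // no usable object without this step
        this.cipher = c;
    }

    public byte[] encrypt(byte[] plainText) throws Exception {
        return cipher.doFinal(plainText);
    }
}
```

A client can no longer reach doFinal() on an uninitialized cipher; the broken intermediate state is simply not representable.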

Lesson 3: Avoid Bad Default Values

Description: To give the developers flexibility in which values they want to specify, and to simplify usage, many APIs come with default values for certain parameters or required fields. In Java, this is typically achieved through providing various overloaded methods or the Builder design pattern. While default values make it easier to create the necessary objects, developers may not be aware of what these defaults are. If the defaults chosen in the API’s design are problematic for any reason, then the default behavior of the API becomes problematic.

Java Example: In the Java Cryptography Extension (JCE) library, which contains Java’s default cryptography support, the default value for the encryption mode when using the popular AES cipher in the javax.crypto.Cipher constructor is the Electronic Codebook Mode (ECB), which is insecure [2]. Thus, client developers who rely on the API’s default values end up using insecure encryption in their applications. Code reviewers might easily miss the problem, because the insecure setting is not visible in the client code. API developers cannot easily fix this problem, because changing the default value would break existing client applications that already encrypted values using the (insecure) default value.

Solutions: Choose reasonable default values for your API. If any default value may become unreasonable (or even critical), consider making it an explicit required parameter. This would at least allow the bad default value to be spotted during code review. Clearly document these default values.
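A hypothetical sketch of the "explicit required parameter" solution (all names invented for illustration): the security-critical mode has no default and must be named at every call site, where a reviewer can see it.

```java
// Hypothetical API sketch: no zero-argument factory, so an insecure choice
// like ECB is visible in client code and during code review.
public final class CipherSpec {
    public enum Mode { CBC, GCM, ECB }

    private final Mode mode;

    private CipherSpec(Mode mode) { this.mode = mode; }

    // Callers must name a mode explicitly; there is no hidden default.
    public static CipherSpec of(Mode mode) {
        return new CipherSpec(mode);
    }

    public Mode mode() { return mode; }

    public static void main(String[] args) {
        CipherSpec spec = CipherSpec.of(Mode.GCM);  // the choice is explicit
        System.out.println(spec.mode());
    }
}
```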

Lesson 4: Avoid Stringly-typed Parameters

Description: A parameter is stringly-typed if its type is string but it expects a more specific value that could be more appropriately typed. As a result, client developers cannot make use of many assistance tools, such as static type checkers or completion engines, when programming against the API. This makes the API harder to learn and easier to misuse.

Java Example: In the Java Cryptography Extension (JCE) library, which contains Java’s default cryptography support, all parameters to instantiate a javax.crypto.Cipher are encoded in a single string parameter that expects a value of the form "algorithm/mode/padding", where the last or both of the substrings beginning with "/" may be omitted. It is difficult for client developers to know which values may be passed to this parameter. Moreover, when encountering a valid instantiation, such as:
Cipher cipher = Cipher.getInstance("AES");
a client developer cannot see that the parameter value specifies only one of three actual parameters.

Solution: Use specific types instead of encoding values as strings. Declare separate parameters for separate values, instead of joining multiple values in one string.
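A hypothetical typed alternative (all names are ours): each of the three values gets its own enum, and the transformation string is assembled internally, so the IDE can complete each argument and typos fail to compile.

```java
// Hypothetical sketch of a typed alternative to the single transformation string.
public final class TypedCipherFactory {
    public enum Algorithm { AES, DES }
    public enum BlockMode { ECB, CBC, GCM }
    public enum Padding { NO_PADDING, PKCS5_PADDING }

    // Build the JCE transformation string internally; callers never type it.
    public static String transformation(Algorithm alg, BlockMode mode, Padding pad) {
        String padName = (pad == Padding.PKCS5_PADDING) ? "PKCS5Padding" : "NoPadding";
        return alg + "/" + mode + "/" + padName;
    }

    public static void main(String[] args) {
        System.out.println(
            transformation(Algorithm.AES, BlockMode.CBC, Padding.PKCS5_PADDING));
    }
}
```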

Lesson 5: Avoid Incorrect Placement of Functionality

Description: Placing a method on a class where it is unexpected may lead to clients using it in unintended ways.

Java Example: Java’s primitive-type wrapper java.lang.Boolean has a method getBoolean(String). This method returns true if there is a system property with the given name whose value is "true", and false otherwise. Developers might easily confuse this method with parseBoolean(String), which converts a string ("true"/"false") to the corresponding boolean value.
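The confusion is easy to demonstrate with the standard java.lang APIs:

```java
public class BooleanConfusion {
    public static void main(String[] args) {
        // getBoolean looks up a *system property* named "true" -- there is
        // none, so it returns false despite the argument's spelling.
        System.out.println(Boolean.getBoolean("true"));    // false
        // parseBoolean converts the string itself.
        System.out.println(Boolean.parseBoolean("true"));  // true
    }
}
```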
Solution: Place methods on an API where their functionality is likely to be expected; e.g., a method for querying a system property belongs on the System class.


Poorly designed APIs can be confusing to client developers, and may lead to potential misuse of the API. In the above, we gave some examples of cases we have personally seen in the past. This is by no means a comprehensive list, and we hope that more developers contribute such examples, which can help in creating a set of design principles for API creators.

References

[1] M. P. Robillard, E. Bodden, D. Kawrykow, M. Mezini and T. Ratchford, "Automated API Property Inference Techniques," in IEEE Transactions on Software Engineering, vol. 39, no. 5, pp. 613-637, May 2013. doi: 10.1109/TSE.2012.63
[2] Block Cipher Mode of Operation. Wikipedia.

Monday, August 13, 2018

IEEE Software Blog March/April Issue, Blog, and Radio Summary

The March/April Issue of IEEE Software, as usual, is chock full of interesting articles on challenges and advances in software engineering. The topics in this issue range from the always popular topics of DevOps and security to the related but separate topic of release engineering.

One special addition to this issue is a special thanks to all those that participated in the reviewing efforts in 2017. Of course the reviewers help make IEEE Software the magazine that it is, so thank you from us all!

As with each issue, this issue includes a special focus topic: Release Engineering. The following articles are included in the March/April issue of Software on release engineering:

The articles "The Challenges and Practices of Release Engineering" and "Release Engineering 3.0" set the tone for the focus articles in this issue. The two articles provide some background on release engineering and discuss the state of the art in the field. Each of the other articles takes a deeper dive into more specific aspects of release engineering. For example, in the article "Over-the-Air Updates for Robotic Swarms", the authors present a toolset for sending code updates over-the-air to robot swarms.

One topic that always seems to find its way into each issue of IEEE Software is agile software development. The following articles appeared in this issue on agile development:

In "Practitioners' Agile-Methodology Use and Job Perceptions", the authors report on a survey conducted to better understand practitioner perceptions of agile methodology use. Similarly, "Making Sense of Agile Methods" provides insights into agile methodologies based on personal experiences of the author. 

Wanna know more? Make sure you check out the March/April issue of IEEE Software today!

IEEE Software Blog

The blog was a little light for March and April (as many of us had deadlines we were meeting). But of course we always try to make sure there's some kind of knowledge sharing going on!

For those who are also a little behind on things and want a quick way to catch up on the previous issue (January/February), there's a summary posted on the blog.

In the post "Which design best practices should be taken care of?", the authors report on the results from a survey sent out to learn more about the importance of design best practices. The post also reports on some of what was found to be the more important design concerns, such as code clones and package cycles. For those interested, there's also a reference to the authors' full research article on this work.

In the other April post, titled "Efficiently and Automatically Detecting Flaky Tests with DeFlaker", the authors present a new approach, called DeFlaker, that can be used to detect flaky tests (without having to re-run them!). The post also includes some details on the evaluation of DeFlaker and other relevant resources (such as a link to the project's GitHub page and full publication) for those interested in learning more about this tool.

SE Radio

The SE Radio broadcasts this issue were also a little light, but of course no less interesting! This issue is a technical one, with all discussions focused on various technologies. Nicole Hubbard joined SE Radio host Edaena Salinas to talk about migrating VM infrastructures to Kubernetes -- for those of you (like me) who have no clue what that is, they talk about that too!
Nate Taggart spoke with Kishore Bhatia about going serverless and what exactly that means.
And last but certainly not least in this issue, Péter Budai sat down with Kim Carter to talk about End to End Encryption (E2EE) and when it can (and should) be used.

Also, for those looking for some extracurriculars to fill their free time, SE Radio is looking for a new volunteer host! For more information, see the SE Radio website.

Monday, July 16, 2018

It’s Time For Secure Languages

By: Cristina Cifuentes, Oracle Labs (@criscifuentes)
Associate Editor: Karim Ali, University of Alberta (@karimhamdanali)
Back in 1995, only 25 Common Vulnerabilities and Exposures (CVE) were reported. But by 2017, that number had blown out to more than 10,000, as reported in the National Vulnerability Database (NVD). As the CVE list is a dictionary of publicly disclosed vulnerabilities and exposures, it should give an idea of the current situation.
However, it does not include all data, because some vulnerabilities are never disclosed. For example, they may be found internally and fixed, or they may be in cloud software that is actively upgraded through continuous integration and continuous delivery (CI/CD). Node.js has its own Node Security Platform Advisories system.
A five-year analysis (2013 to 2017) of the labelled data in NVD reveals that three of the top four most common vulnerabilities are issues that can be taken into account in programming language design:
  • 5,899 buffer errors
  • 5,851 injection errors¹
  • 3,106 information leak errors
Combined, these three types of vulnerabilities represent 53% of all labelled exploited vulnerabilities listed in the NVD for that period, and they affect today’s mainstream languages.
We have known about exploitation of buffer errors since the Morris worm exploited a buffer error in the Unix finger server over 25 years ago. SQL injections² and XSS exploits³ have been documented in the literature since 1998 and 2000, respectively. A 2016 study⁴ revealed that the average cost of a data breach is $4 million, for an average of 10,000 records stolen through a vulnerability exploit. Financial loss aside, imagine all the innovation that could happen if more than 50% of vulnerabilities did not exist!
Not only is the percentage of issues high, but data on mainstream programming languages also shows that none of them provide solutions to these three areas at the same time. A few languages provide solutions to buffer errors and/or injection errors, and no language provides a solution to information leaks. In other words, there are no mainstream secure languages that prevent these issues.

Why is this happening?

Let’s be clear about one thing – developers do not write incorrect code because they want to. It happens inadvertently, because our programming languages do not provide the right abstractions to support developers in writing correct code.
Abstractions in programming languages introduce different levels of cognitive load. The easier it is for an abstraction to be understood, the more accepted that abstraction becomes. For example, managed memory is an abstraction that frees the developer from having to manually keep track of memory (both allocation and deallocation). As such, this abstraction is widely used in a variety of today’s programming languages, such as Java, JavaScript, and Python.
At the same time, performance of the developed code is also of interest. If an abstraction introduces a high performance overhead, it makes it hard for that abstraction to be used in practice in some contexts. For example, managed memory is not often used in embedded systems or systems programming due to its performance overhead.
At the root of our problem is the fact that many of our mainstream languages provide unsafe abstractions to developers; namely, manual management of pointers, manual string concatenation and sanitization, and manual tracking of sensitive data. These three abstractions are easy to use and have low performance overhead, but it is hard to write correct code with them. Hence, using them correctly imposes a high cognitive load on developers.
We need to provide safe abstractions to developers; ideally, abstractions that have low cognitive load and low performance overhead. I will briefly review a different abstraction for each of the three vulnerabilities of interest, to show abstractions that could provide a solution in these areas.

1. Avoiding buffer errors through ownership and borrowing

Rust is a systems programming language that runs fast, prevents memory corruption, and guarantees memory and thread safety. Not only does it prevent buffer errors, it prevents various other types of memory corruption, such as null pointer dereferences and use after free. This feature is provided by the introduction of ownership and borrowing into the type system of the language:
  • Ownership is an abstraction used in C++ whereby a resource can have only one owner. Ownership with resource acquisition is initialization (RAII) ensures that whenever an object goes out of scope, its destructor is called and its owned resource is freed. Ownership of a resource is transferred (i.e., moved) through assignments or passing arguments by value. When a resource is moved, the previous owner can no longer access it, thereby preventing dangling pointers.
  • Borrowing is an abstraction that allows a reference to a resource to be made available in a secure way – either through a shared borrow (&T), where the shared reference cannot be mutated, or a mutable borrow (&mut T), where the shared reference cannot be aliased, but not both at the same time. Borrowing allows for data to be used elsewhere in the program without giving up ownership. It prevents use after free and data races.
Ownership and borrowing provide abstractions suitable for memory safety, and prevent buffer errors from happening in the code. Anecdotal evidence seems to suggest that the learning curve to pick up these abstractions takes some time, pointing to a high cognitive load.

2. Avoiding injection errors through taint tracking

Perl is a rapid prototyping language with over 29 years of development. It runs on over 100 platforms, from portables to mainframes. In 1989, Perl 3 introduced the concept of taint mode, to track external input values (which are considered tainted), and to perform runtime taint checks to prevent direct or indirect use of the tainted value in any command that invokes a sub-shell, or any command that modifies files/directories/processes, except for arguments to print and syswrite, symbolic methods and symbolic subreferences, or hash keys. Default tainted values include all command-line arguments, environment variables, locale information, results of some system calls (readdir(), readlink()), etc.
Ruby is a dynamic programming language with a focus on simplicity and productivity. It supports multiple programming paradigms, including functional, object-oriented, imperative, and reflective. Ruby extends Perl’s taint mode to provide more flexibility. Four safe levels are available, of which the first two are as per Perl:
  0. No safety.
  1. Disallows use of tainted data by potentially dangerous operations. This level is the default on Unix systems when running Ruby scripts as setuid.
  2. Prohibits loading of program files from globally writable locations.
  3. All newly created objects are considered tainted.
In Ruby, each object has a tainted flag. There are methods to make an object tainted, check whether the object is tainted, or untaint the object (only for levels 0–2). At runtime, Ruby tracks direct data flows through levels 1–3; it does not track indirect/implicit data flows.
The taint tracking abstraction provides a way to prevent some types of injection errors with low cognitive load on developers. Trade-offs in performance overhead need to be made in order to determine how much data can be tracked, which target locations should be tracked, and whether both direct and indirect uses can be tracked.

3. Avoiding information leaks through faceted values

Jeeves is an experimental academic language for automatically enforcing information flow policies. It is implemented as an embedded DSL in Python. Jeeves makes use of the faceted values abstraction, which is a data type used for sensitive values that stores within it the secret (high-confidentiality) and non-secret (low-confidentiality) values, guarded by a policy, e.g., <s | ns> (p). A developer specifies policies outside the code, and the language runtime enforces the policy by guaranteeing that a secret value may flow to a viewer only if the policies allow the viewer to view secret data.
Many applications today make use of a database. To make the language practical, faceted values need to be introduced into the database when dealing with database-backed applications. A faceted record is one that guards a secret and non-secret pair of values. Jacqueline, a web framework developed to support faceted values in databases, automatically reads and writes meta-data in the database to manage relevant faceted records. The developer can use standard SQL databases through the Jacqueline object relational mapping.
The faceted values abstraction provides a way to prevent information leaks, with low cognitive load on developers but at the expense of performance overhead. This ongoing work has yet to determine the lower bound on performance overhead required to provide direct and indirect tracking of flows of sensitive data.

The future

The previous abstractions illustrate ways to deal with specific types of errors through a programming language abstraction that may be implemented in the language’s type system and/or tracked in its runtime system. These abstractions provide great first steps in exploring the trade-offs between cognitive load and performance in our programming language abstractions, and in creating practical solutions accessible to developers at large.
As a community, we need to step back and think of new abstractions that avoid high cognitive load on developers – and how to overcome any performance implications. Research in this area would allow many new secure languages to be developed, languages that prevent, by construction, the existence of buffer errors, injection errors and information leaks in our software; i.e., over 50% of today’s exploited vulnerabilities. We need to improve our compiler technology, to develop new abstractions, and to cross the boundaries between the different languages used in today’s cloud applications. The right secure abstraction for a database-backed web application may be different from the right secure abstraction for a microservices application.
With over 18.5 million software developers worldwide, security is not just for expert developers. It’s time to design the future of programming language security – it’s time for secure languages.

1. [Injection errors include Cross-Site Scripting (XSS), SQL injection, Code injection, and OS command injection.]
2. [Phrack Magazine, 8(54), article 8, 1998.]
3. [CERT “Malicious HTML Tags”, 2000.]
4. [2016 Ponemon Cost of Data Breach Study.]

Friday, May 4, 2018

Why should start-ups care about technical debt?

By: Eriks Klotins, Blekinge Institute of Technology
Associate Editor: Mehdi Mirakhorli (@MehdiMirakhorli)

We asked 84 start-ups to estimate the levels of technical debt (TD) in their products and to reflect on their software engineering practices. Technical debt is a metaphor for suboptimal solutions arising from a trade-off between time-to-market, resources, and quality. When not addressed, the compound effects of suboptimal solutions hinder further product development and reduce overall quality.
Start-ups are known for their speed in developing innovative products and entering new markets. Technical debt can slow a start-up down and hinder its potential for quickly iterating on ideas or launching modifications for new markets. On the upside, start-ups can leverage technical debt to quickly get a product out to customers without significant upfront investment in product development.

Our data from 84 companies shows a clear association between excessive levels of technical debt and the state of start-ups. We differentiate between active start-ups working on their products, and closed or paused start-ups. The results show that too much technical debt can impair product quality to the extent that further investment in removing it becomes unfeasible. Thus, excessive technical debt can kill both the product and the company. Sustaining high levels of technical debt also harms a team's morale, as a lot of time is spent patching the product. We are not advocating the removal of all technical debt; instead, we advocate more understanding and awareness of it.

1- Learn how to spot technical debt

Technical debt can affect different product artifacts, such as source code, documentation, architecture, and infrastructure.

Code debt, or code smells, corresponds to poorly written code: unnecessary duplication and complexity, long methods/functions, and bad style that reduces readability. Code debt is the easiest kind to spot.
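A tiny, hypothetical illustration of the most common smell above, duplicated logic: the same rule is written twice, so a change must be made in two places and can be missed in one. Extracting it into a single named method pays that debt down.

```java
public class CodeDebtDemo {
    // Smell: the same validation rule is duplicated in two methods.
    static boolean canRegister(String email) {
        return email != null && email.contains("@");
    }
    static boolean canInvite(String email) {
        return email != null && email.contains("@"); // duplicate of canRegister
    }

    // Pay-down: one shared, named rule that both call sites can reuse.
    static boolean isValidEmail(String email) {
        return email != null && email.contains("@");
    }

    public static void main(String[] args) {
        System.out.println(isValidEmail("a@b.com"));      // true
        System.out.println(isValidEmail("not-an-email")); // false
    }
}
```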

Documentation debt refers to shortcomings in distributing knowledge on how to evolve, operate, and maintain the product. For example, poorly documented requirements, outdated architecture drawings, and a lack of instructions about required maintenance actions fall into this category. Documentation can be white-board drawings, notes, information in on-line tools, and formal documents. In start-ups, a lack of documentation is often compensated by implicit knowledge about the product. However, when the team grows, key people leave, or the product is transferred to another team, e.g., in case of an acquisition, the perceived level of documentation debt peaks: the new engineers must spend significant effort learning the product.

Architecture debt concerns the structure of the software and its effects on maintainability and adaptability. Start-ups often use open-source frameworks and components to construct their products. Popular frameworks typically come with their own best practices, and following these practices ensures compatibility, eases upgrades, and makes onboarding of new engineers faster.

Testing debt refers to a lack of test automation, leading to the need to manually test the entire product before every release. Without automation, the effort of regression testing grows with every new feature, supported platform, and configuration. Testing debt may not be a problem while a product is small, but it can become a concern later, as the start-up matures and needs to support an increasing number of features across multiple platforms.

Environmental debt concerns the hardware, supporting applications, and processes relevant to the development, operation, and maintenance of the software product. For example, outdated server software can lead to security vulnerabilities, and a lack of data backup routines, problems in versioning, and shortcomings in defect management may affect the team's ability to create a quality product.

2- Main causes of technical debt

We found that the level of engineering skills and the size of the whole start-up team are the primary causes of excessive technical debt. Inexperienced developers are more likely to introduce technical debt unknowingly. Our analysis shows that a lack of skills contributes to communication issues and shortcomings in distributing relevant information to the team.

Larger teams, of 9 or more people, are more likely to experience skills shortages, face communication issues, introduce code smells, and experience coordination challenges. In small teams of 2–3 engineers, everyone can easily communicate to coordinate their activities. However, with every new team member, coordinating with everyone becomes more difficult, and suboptimal solutions, especially code smells, find their way into the product.
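One standard way to see why coordination degrades with team size (our own illustration, not an analysis from the study): the number of pairwise communication channels grows quadratically, as n(n-1)/2.

```java
public class TeamChannels {
    // Number of pairwise communication channels in a team of n people.
    public static int channels(int teamSize) {
        return teamSize * (teamSize - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(channels(3)); // 3: everyone can stay in sync informally
        System.out.println(channels(9)); // 36: coordination needs explicit practices
    }
}
```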

3- Strategies to address technical debt

1.    As an engineer, be aware of good practices. Knowing the good practices helps you spot bad ones, argue for or against particular solutions, and stay aware of potential negative side effects.

2.    On a team level, run retrospectives and learn how to remove friction from collaboration. With increasing team size, new practices supporting collaboration may be needed. For example, simple practices like daily standups, pair programming, and a task board can make a significant difference in distributing knowledge and improving teamwork. Note that difficulties in communication and coordination are associated with the size of the whole team, not only its engineering part. Thus, everyone in a start-up team must participate in coordination and communication activities.

3.    On an organizational level, anticipate when to leverage technical debt to reach certain goals faster, and when to slow down and refactor. Our results show that most issues are experienced when a start-up attempts to onboard a large number of users and launch customizations for new markets.

Read more in the original paper: 

E. Klotins, M. Unterkalmsteiner, T. Gorschek et al., “Exploration of Technical Debt in Start-ups,” in International Conference on Software Engineering, 2018.

You may also like: 

  1. B. Stopford, K. Wallace and J. Allspaw, "Technical Debt: Challenges and Perspectives," in IEEE Software, vol. 34, no. 4, pp. 79-81, 2017.
  2. E. Wolff and S. Johann, "Technical Debt," in IEEE Software, vol. 32, no. 4, pp. 94-c3, July-Aug. 2015.
  3. C. Giardino, N. Paternoster, M. Unterkalmsteiner, T. Gorschek and P. Abrahamsson, "Software Development in Startup Companies: The Greenfield Startup Model," in IEEE Transactions on Software Engineering, vol. 42, no. 6, pp. 585-604, June 1 2016.

Monday, April 16, 2018

Which design best practices should be taken care of?

by Johannes Bräuer, Reinhold Plösch, Johannes Kepler University Linz, and Matthias Saft, Christian Körner, Corporate Technology Siemens AG
Associate Editor: Christoph Treude (@ctreude)

In the past, software metrics were used to express the compliance of source code with object-oriented design aspects [1], [2]. However, metrics have been found to be too vague to drive concrete design improvements [3], which led to the idea of identifying code or design smells in source code [4].

Despite good progress in localising design flaws based on the identification of design smells, these smells are still too fine-grained to support an overall design assessment. Consequently, we follow the idea of measuring and assessing the compliance of source code with object-oriented design principles [5]. To do so, we systematically collected design principles that are applied in practice and then jointly derived more tangible design best practices [6]. These practices have the key advantage of being specific enough (1) to be applied by practitioners and (2) to be identified by an automatic tool. As a result, we developed the static code analysis tool MUSE, which currently contains a set of 67 design best practices (design rules) for the programming languages Java, C# and C++ [7].

Design best practices naturally differ in importance. To determine appropriate importance levels, we conducted a survey to gather data that allows a more differentiated view of the importance of the Java-related design best practices (a subset of 49 instances).

Survey on the Importance of Design Best Practices

The survey was available from 26th October until 21st November 2016. 214 software professionals (software engineers, architects, consultants, etc.) completed it, yielding an average of 134 opinions for each design best practice. Based on this data we derived a default importance, as depicted in Table 1. For clarification, the arrows indicate design best practices that are close to the next higher (↑) or lower (↓) importance level. Furthermore, we calculated a range, based on the standard deviation, within which the importance may be increased or decreased. This data can be used as a basis to assess quality and to plan quality improvements.
Table 1. Design best practices ordered by importance
Default Importance | Importance Range
(The table body did not survive extraction: each row pairs one of the 49 Java design best practices with a default importance, from very low to very high, with ↑/↓ marking practices close to the next higher or lower level, and an importance range such as high–very high derived from the standard deviation.)

Beyond the Result of the Importance Assessment

Based on the survey result, we expanded our research in two directions. Accordingly, we further examined our idea of operationalizing design principles, and we recently proposed a design debt prioritization approach to guide design improvement activities properly.

While the survey findings revealed evidence of the importance of design best practices, the remaining question was whether the practices assigned to a specific design principle cover the essential aspects of that principle or only touch on minor design concerns. To answer this question and to identify white spots in operationalizing certain principles, we conducted focus group research on 10 selected principles with 31 software design experts in six focus groups [8]. The result of this investigation showed that our design best practices are capable of measuring and assessing the major aspects of the examined design principles.

In the course of the focus group discussions and in communicating the survey results to practitioners, we identified the need to prioritize design best practice violations not only by their importance, but also by a quality state. As a result, we proposed a portfolio-based assessment approach that combines the importance of each design best practice (y-axis in Figure 1) with a quality index (x-axis in Figure 1) derived from a benchmark suite [9], [10]. This combination is presented as a portfolio matrix, as depicted in Figure 1 for the measurement result of a particular open-source project; in total, the 49 design best practices for Java are shown. Taking care of all 49 best practices is time-consuming and could overwhelm the project team. Consequently, the portfolio-based assessment approach groups the design best practices into four so-called investment areas, which recommend concrete improvement strategies.
Figure 1: Investment areas of portfolio matrix
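A rough sketch of the quadrant idea (our own reading of the portfolio matrix; the area names and the 0.5 thresholds are invented for illustration, whereas the paper derives its boundaries from a benchmark suite):

```java
public class PortfolioMatrix {
    enum Area { FIX_FIRST, KEEP_UP, QUESTION, IGNORE_FOR_NOW }

    // importance and qualityIndex are assumed normalized to [0, 1];
    // the 0.5 cut-offs are illustrative only.
    public static Area classify(double importance, double qualityIndex) {
        boolean important = importance >= 0.5;
        boolean goodQuality = qualityIndex >= 0.5;
        if (important && !goodQuality) return Area.FIX_FIRST;   // important, in poor shape
        if (important)                 return Area.KEEP_UP;     // important, in good shape
        if (!goodQuality)              return Area.QUESTION;    // unimportant, in poor shape
        return Area.IGNORE_FOR_NOW;                             // unimportant, in good shape
    }

    public static void main(String[] args) {
        System.out.println(classify(0.9, 0.2)); // FIX_FIRST
        System.out.println(classify(0.2, 0.8)); // IGNORE_FOR_NOW
    }
}
```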
Concluding Remarks

To summarize this blog entry and answer the question in the heading, let us reconsider the opinions of the 214 survey participants. We derived the importance of the 49 design best practices, of which five are judged to be of very high importance: code duplicates (code clones), supertypes using subtypes, package cycles, commands in query methods, and public fields. In other words, avoiding violations of these design rules in practice can enhance and foster the flexibility, reusability and maintainability of a software product.

For more details about the conducted survey, we refer interested readers to the research article titled “A Survey on the Importance of Object-oriented Design Best Practices” [11].


[1] S. R. Chidamber and C. F. Kemerer, “A metrics suite for object oriented design,” IEEE Trans. Softw. Eng., vol. 20, no. 6, pp. 476–493, Jun. 1994.
[2] J. Bansiya and C. G. Davis, “A hierarchical model for object-oriented design quality assessment,” IEEE Trans. Softw. Eng., vol. 28, no. 1, pp. 4–17, Jan. 2002.
[3] R. Marinescu, “Measurement and quality in object-oriented design,” in Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM), Budapest, Hungary, 2005, pp. 701–704.
[4] R. Marinescu, “Detection strategies: metrics-based rules for detecting design flaws,” in Proceedings of the 20th IEEE International Conference on Software Maintenance, Chicago, IL, USA, 2004, pp. 350–359.
[5] J. Bräuer, “Measuring Object-Oriented Design Principles,” in Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 2015, pp. 882–885.
[6] R. Plösch, J. Bräuer, C. Körner, and M. Saft, “Measuring, Assessing and Improving Software Quality based on Object-Oriented Design Principles,” Open Comput. Sci., vol. 6, no. 1, 2016.
[7] R. Plösch, J. Bräuer, C. Körner, and M. Saft, “MUSE - Framework for Measuring Object-Oriented Design,” J. Object Technol., vol. 15, no. 4, p. 2:1-29, Aug. 2016.
[8] J. Bräuer, R. Plösch, M. Saft, and C. Körner, “Measuring Object-Oriented Design Principles: The Results of Focus Group-Based Research,” J. Syst. Softw., 2018.
[9] J. Bräuer, M. Saft, R. Plösch, and C. Körner, “Improving Object-oriented Design Quality: A Portfolio- and Measurement-based Approach,” in Proceedings of the 27th International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement (IWSM-Mensura), Gothenburg, Sweden, 2017, pp. 244–254.
[10] J. Bräuer, R. Plösch, M. Saft, and C. Körner, “Design Debt Prioritization - A Design Best Practice-Based Approach,” in Proceedings of the 1st International Conference on Technical Debt (TechDebt), 2018.
[11] J. Bräuer, R. Plösch, M. Saft, and C. Körner, “A Survey on the Importance of Object-Oriented Design Best Practices,” in Proceedings of the 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Vienna, Austria, 2017, pp. 27–34.

Monday, April 9, 2018

Efficiently and Automatically Detecting Flaky Tests with DeFlaker

By: Jonathan Bell, George Mason University (@_jon_bell_), Owolabi Legunsen, UIUC, Michael Hilton (@michaelhilton), CMU, Lamyaa Eloussi, UIUC, Tifany Yung, UIUC, and Darko Marinov, UIUC.
Associate editor: Sarah Nadi, University of Alberta (@sarahnadi), Bogdan Vasilescu, Carnegie Mellon University (@b_vasilescu).
Flaky tests are tests that can non-deterministically pass or fail for the same version of the code under test. Therefore, flaky tests can be incredibly frustrating for developers. Ideally, every new test failure would be due to the latest changes that a developer made, and the developer could subsequently focus on debugging these changes. However, because their outcome depends not only on the code, but also on tricky non-determinism (e.g. dependence on external resources or thread scheduling), flaky tests can be very difficult to debug. Moreover, if a developer doesn’t know that a test failure is due to a flaky test (rather than a regression that they introduced), how does the developer know where to start debugging: their recent changes, or the failing test? If the test failure is not due to their recent changes, then should they debug the test failure immediately, or later?
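A minimal, deliberately artificial example of such a test (our own; real flakiness usually comes from timing, thread scheduling, or external resources rather than an explicit random source):

```java
import java.util.Random;

// A deliberately flaky "test": its outcome depends on hidden non-determinism,
// even though the code under test never changes between reruns.
public class FlakyDemo {
    static boolean flakyCheck() {
        // Stand-in for non-determinism such as a race or a network timeout.
        return new Random().nextBoolean();
    }

    // Rerunning the identical check yields both outcomes.
    public static boolean observedBothOutcomes(int reruns) {
        boolean sawPass = false, sawFail = false;
        for (int i = 0; i < reruns; i++) {
            if (flakyCheck()) sawPass = true; else sawFail = true;
        }
        return sawPass && sawFail;
    }

    public static void main(String[] args) {
        // With 100 reruns, seeing only one outcome is astronomically unlikely.
        System.out.println(observedBothOutcomes(100)); // virtually always true
    }
}
```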
Flaky tests plague both large and small companies. Google reports that 1 in 7 of their tests have some level of flakiness associated with them. A quick search also turns up many StackOverflow discussions and details about flaky tests at Microsoft, ThoughtWorks, SemaphoreCI and LucidChart.

Traditional Approach for Detecting Flaky Tests

Prior to our work, the most effective way to detect flaky tests was to repeatedly rerun failed tests. If some rerun passes, the test is definitely flaky; but if all reruns fail, the status is unknown (it might or might not be flaky). Rerunning failed tests is directly supported by many testing frameworks, including Android, Jenkins, Maven, Spring, Facebook's Buck and Google TAP. Rerunning every failed test is extremely costly when organizations see hundreds to millions of test failures per day. Even Google, with its vast compute resources, does not rerun all (failing) tests on every commit but reruns only those suspected to be flaky, and only outside of peak test execution times.

Our New Approach to Detecting Flaky Tests

Our approach, DeFlaker, detects flaky tests without re-running them, and imposes only a very modest performance overhead (often less than 5% in our large-scale evaluation). Recall that a test is flaky if it can both pass and fail when it executes the same code, i.e., code that did not change; moreover, a test failure is new if the test passed on the previous version of code but fails in the current version. Between each test suite execution, DeFlaker tracks information about what code has changed (using information from a version control system, like git), what test outcomes have changed, and which of those tests executed any changed code. If a test passed on a prior run, but now fails, and has not executed any changed code, then DeFlaker warns that it is a flaky test failure.
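The decision rule just described can be sketched as follows (an illustration in our own words; `isLikelyFlaky` and the line-level granularity are our simplifications, not DeFlaker's actual implementation):

```java
import java.util.Set;

// Sketch of the core rule: a new failure that executed no changed code
// cannot be explained by the change, so it is flagged as likely flaky.
public class FlakyClassifier {
    public static boolean isLikelyFlaky(boolean passedBefore, boolean failsNow,
                                        Set<String> coveredLines,   // lines this test executed
                                        Set<String> changedLines) { // lines changed per the VCS
        if (!passedBefore || !failsNow) return false; // not a new failure
        for (String line : coveredLines) {
            if (changedLines.contains(line)) return false; // change may be the cause
        }
        return true; // no changed code executed: warn "flaky"
    }

    public static void main(String[] args) {
        Set<String> covered = Set.of("Foo.java:10", "Foo.java:11");
        System.out.println(isLikelyFlaky(true, true, covered,
                Set.of("Bar.java:42")));   // true: failure touched no change
        System.out.println(isLikelyFlaky(true, true, covered,
                Set.of("Foo.java:10")));   // false: failure executed changed code
    }
}
```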
However, on the surface, tracking coverage for detecting flaky tests may seem costly: industry reports suggest that collecting statement coverage is often avoided due to the overhead imposed by coverage tools, a finding echoed by our own empirical evaluation of popular code coverage tools (JaCoCo, Cobertura and Clover).
Our key insight in DeFlaker is that one need not collect coverage of the entire codebase in order to detect flaky tests. Instead, one can collect only the coverage of the changed code, which we call differential coverage. Differential coverage first queries a version-control system (VCS) to detect code changes since the last version. It then analyzes the structure of each changed file to determine where exactly instrumentation needs to be inserted to safely track the execution of each change, allowing it to track coverage of changes much faster than traditional coverage tools. Finally, when tests run, DeFlaker monitors execution of these changes and outputs each test that failed and is likely to be flaky.

Finding Flaky Tests

How useful is DeFlaker to developers? Can it accurately identify test failures that are due to flakiness? We performed an extensive evaluation of DeFlaker by re-running historical builds for 26 open-source Java projects, executing over 47,000 builds. When a test failed, we attempted to diagnose it both with DeFlaker and by rerunning it. We also deployed DeFlaker live on Travis CI, where we integrated DeFlaker into the builds of 96 projects. In total, our evaluation involved executing over five CPU-years of historical builds. We also studied the relative overhead of using DeFlaker compared to a normal build. A complete description of our extensive evaluation is available in our ICSE 2018 paper.
Our primary goal was to evaluate how many flaky tests DeFlaker would find, compared with the traditional (rerun) approach. For each build, whenever a test failed, we re-ran the test using the rerun facility in Maven’s Surefire test runner. We were interested to find that this approach only resulted in 23% of test failures eventually passing (hence, marked as flaky) even if we allowed for up to five reruns of each failed test. On the other hand, DeFlaker marked 95% of test failures as flaky! Given that we are reviewing only code that was committed to version control repositories, we expected that it would be rare to find true test failures (and that most would be flaky).
We found that the strategy by which a test is rerun matters greatly: make a poor choice, and the test will continue to fail for the same reason as the first failure, causing the developer to assume that the failure was a true failure (and not a flaky test). Maven’s flaky test re-runner reran each failed test in the same process as the initial failed execution --- which we found to often result in the test continuing to fail. Hence, to better find flaky test failures, and to understand how to best use reruns to detect flaky tests, we experimented with the following strategies, rerunning failed tests: (1) Surefire: up to five times in the same JVM in which the test ran (Maven’s rerun technique); then, if it still did not pass; (2) Fork: up to five times, with each execution in a clean, new JVM; then, if it still did not pass; (3) Reboot: up to five times, running a mvn clean between tests and rebooting the virtual machine between runs.
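The escalation behind these three strategies can be sketched like this (our own simplification; the strategy names are ours, and the real environment changes (same JVM, fresh JVM, machine reboot) are abstracted behind a plain boolean test here):

```java
import java.util.function.Supplier;

// Sketch of escalating rerun strategies: try the cheapest first,
// escalate only while the test keeps failing.
public class RerunLadder {
    enum Strategy { SAME_JVM, FRESH_JVM, REBOOT }

    // Returns the first strategy under which the test passed, or null if it
    // failed under all of them (then it may be a true failure, not a flaky one).
    public static Strategy rerunUntilPass(Supplier<Boolean> test, int triesPerStrategy) {
        for (Strategy s : Strategy.values()) {
            for (int i = 0; i < triesPerStrategy; i++) {
                if (test.get()) return s;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // A simulated test that only passes on its 7th execution overall:
        Supplier<Boolean> eventuallyPasses = () -> ++calls[0] == 7;
        System.out.println(rerunUntilPass(eventuallyPasses, 5)); // FRESH_JVM
    }
}
```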
As shown in the figure below, we found nearly 5,000 flaky test failures using the most time-consuming rerun strategy (Reboot). DeFlaker found nearly as many of these same flaky tests (96%) with a very low false alarm rate (1.5%), and at a significantly lower cost. It’s also interesting to note that when a rerun strategy worked, it generally worked after a single rerun (few additional tests were detected from additional reruns). We demonstrated that DeFlaker was fast by calculating the time overhead of applying DeFlaker to the ten most recent builds of each of those 26 projects. We compared DeFlaker to several state-of-the-art tools: the regression test selection tool Ekstazi, and the code coverage tools JaCoCo, Cobertura, and Clover. Overall, we found that DeFlaker was very fast, often imposing an overhead of less than 5% — far faster than the other coverage tools that we looked at.
Overall, based on these results, we find DeFlaker to be a more cost-beneficial approach to run before or together with reruns, which allows us to suggest a potentially optimal way to perform test reruns: For projects that have lots of failing tests, DeFlaker can be run on all the tests in the entire test suite, because DeFlaker immediately detects many flaky tests without needing any rerun. In cases where developers do not want to pay this 5% runtime cost to run DeFlaker (perhaps because they have very few failing tests normally), DeFlaker can be run only when rerunning failed tests in a new JVM; if the tests still fail but do not execute any changed code, then reruns can stop without costly reboots.
While DeFlaker’s approach is generic and can apply to nearly any language or testing framework, we implemented our tool for Java, the Maven build system, and two testing frameworks (JUnit and TestNG). DeFlaker is available under an MIT license on GitHub, with binaries published on Maven Central.
More information on DeFlaker (including installation instructions) is available on the project website and in our upcoming ICSE 2018 paper. We hope that our promising results will motivate others to try DeFlaker in their Maven-based Java projects, and to build DeFlaker-like tools for other languages and build systems.

For further reading on flaky tests:

Adriaan Labuschagne, Laura Inozemtseva and Reid Holmes
A study of test suite executions on TravisCI that investigated the number of flaky test failures.
John Micco
A summary of Flaky tests at Google and (as of 2016) the strategies used to manage them.
Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov
A study of the various factors that might cause tests to behave erratically, and what developers do about them.
A description of how the Chromium and WebKit teams triage and manage their flaky test failures.