Monday, August 26, 2019

Where to deploy my API? From Server-Centric to Client-Centric Architectural Styles.

Authors: Javier Berrocal (@jberolm), Jose Garcia-Alonso (@jmgaralo), Juan Manuel Murillo (@juanmamuro)
 Associate Editor: Niko Mäkitalo (@nikkis)

The increase in the capabilities of near-to-the-edge and end devices has allowed developers to consider these devices as possible targets for the deployment architecture of any API. They can better satisfy essential requirements of some APIs, such as responsiveness and location-awareness, but they also present some restrictions regarding resource consumption (battery, data traffic, and so on). Currently, developers have to carefully analyze every architectural style and every available device to better meet the system requirements. Different tools and proposals are available to facilitate both the decision-making process and the application of the selected architecture.

Increase in the capabilities of near-to-the-edge and end devices

During the last few years, the predominant architectural styles for the deployment of APIs have been the Server-Centric or Cloud-Centric styles. In these styles, the most demanding features, such as storing or processing the managed data, are offloaded to cloud environments. These styles provide improved scalability, fault tolerance, and greater control of operational costs. They have allowed developers to implement applications with an excellent response time that can be visualized and used from end devices that, until a few years ago, had very limited capabilities.
In the last few years, the capabilities of end devices and near-to-the-edge devices have increased tremendously. Their storage and computational capabilities have grown to the point where they can execute more complex tasks. This has allowed researchers to develop new paradigms, such as Fog, Mist, or Edge Computing, in which all or part of an application or API is loaded onto end devices or near-to-the-edge devices, reducing the network load and the dependency on the cloud environment and improving response time, for instance by keeping data closer to, or even inside, the targeted device.
Therefore, developers not only have to make decisions about how to develop a specific API but also about which architectural style to apply and where each microservice should be deployed to better meet the system requirements. Initially, one may think that the behaviors implemented with one style cannot be achieved with others. Our experience, however, has shown us that similar behaviors can be obtained with any of them.

Trade-off among capabilities and limitations

At the moment developers select an architectural and deployment style, they have to make a complex trade-off between the system requirements and the capabilities and limitations of the different devices in the environment. Deploying an API following a server-centric style could increase the system's scalability or fault tolerance, but could also negatively affect the operational cost, location-awareness, and responsiveness. Deploying an API on near-to-the-edge devices could positively impact these requirements, but such devices also present some significant limitations. For instance, a constraining factor of these devices is resource consumption (battery, data traffic, etc.). Many of them, such as mobile phones and smart speakers, are battery-powered. Likewise, some of them have to interact using the mobile network, either because they are mobile or because of their specific situation, which entails consuming the data plan. In the deployment of any API, the correct management of these resources is crucial for user satisfaction. It is well known that resource consumption is a factor determining an application's success.
Therefore, all these dimensions have to be taken into account to select the architectural style that best fits the system requirements, trading off the capabilities and limitations of every connected device against those requirements.

Tools helping in the decision-making process

During the last few years, researchers have been working on defining methods for measuring whether the requirements are met, from those directly related to the system behavior to those that impact other non-functional dimensions such as resource consumption.
Some works, both in academia and in industry, focus on analyzing the final developed and deployed application to identify whether those requirements are met. For instance, Giovani et al. presented, some years ago, a monitoring framework able to adjust the resources assigned to the deployed application or to face transient failures by replicating and restarting components to provide resiliency. Leitner et al. proposed a tool to continually re-evaluate the cost of an application in the background. Alternatively, for Android end devices, Batterystats can be used to collect battery data about the consumption of a specific application.
Other proposals, however, focus on evaluating, during the early stages of the development process, how these requirements can be met, helping in the decision-making process of which architectural designs are the most suitable. For instance, some works propose different heuristics to identify where microservices should be deployed to better meet requirements such as responsiveness. Likewise, the authors of this post have been working on defining a conceptual framework to identify, in early development stages, which architectural style is the most appropriate to achieve a reduction in resource consumption.

Generating microservices for different architectures

Once the most appropriate style for achieving the system requirements has been identified, the API has to be developed and deployed.

Currently, the specification and development of microservices are supported by a large number of tools that facilitate the work of the developer. Specifications such as OpenAPI are widely used to detail an API, generate its documentation, perform tests, and even generate the API skeleton. These tools can even generate source code for different technologies such as Node.JS, Kotlin, JAX-RS, etc., but they are focused on deploying the API in a server or cloud environment.
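
To make this concrete, the sketch below shows roughly what one generated operation stub might look like. It is only an illustration: the route, handler name, and use of Python with Flask are our own choices for brevity, whereas the generators mentioned above target Node.JS, Kotlin, or JAX-RS.

# Hypothetical skeleton for an OpenAPI operation such as GET /devices/{deviceId}.
# A generator provides the routing, parameter binding, and serialization
# boilerplate; the developer fills in the business logic.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/devices/<device_id>", methods=["GET"])
def get_device(device_id):
    # TODO: replace this placeholder with the real implementation.
    return jsonify({"id": device_id, "status": "unknown"})

if __name__ == "__main__":
    app.run(port=8080)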

Fewer commercial tools support other architectural styles, such as client-centric architectures or architectures based on fog or mist computing. However, at the research level, more and more proposals detail how to use these specifications to generate the skeleton and to deploy microservices on other elements of the network. For instance, Noura et al. propose the WoTDL2API tool, which automatically generates a running RESTful API, based on the OpenAPI specification and its code-generation toolchain, to control different IoT devices. The generated source code can be deployed on near-to-the-edge devices, such as gateways, routers, and so on.

Likewise, the authors of this post have recently proposed a set of tools to generate and deploy a running RESTful API, based on the OpenAPI standard, on end devices such as smartphones and IoT devices, providing support for client-centric architectures.

Summarizing

Until now, APIs and microservices have been designed and developed to be deployed in a cloud environment because of its computing capabilities, fault tolerance, responsiveness, etc. Nevertheless, the ever-increasing capabilities of other elements in the network have fostered the deployment of these APIs on such devices to better meet requirements such as scalability, interoperability, real-time responsiveness, security, or location-awareness.

However, for developers to apply the best architectural style, they first need tools that help in the decision-making process by providing, for each style, information on the degree of compliance with the requirements; and, secondly, tools that support and reduce the effort required to develop and deploy the microservices for each architectural style. If these tools are not provided, developers tend to use only known architectures and those with greater support.

Monday, August 19, 2019

Studying human values in software engineering

Authors: Emily Winter and Maria Angela Ferrario
Associate Editor: Jinghui Cheng (@JinghuiCheng)

The Values in Computing (ViC) project: http://www.valuesincomputing.org

Why do human values matter for software engineering? Recent years have seen high profile software scandals and malpractices, in which individual privacy and social democracy have been undermined, as in the case of Cambridge Analytica’s use of Facebook data [1] and even human lives lost, as in the case of Boeing 737 [2]. As another recent IEEE Software post puts it, we are heading into an age of considerable ‘values debt’ [3], as the negative societal consequences, both intended and unintended, of our software systems mount up.

There is a pressing need then to understand how human values operate, to develop methods and tools to study them in a software engineering context, and to build on this understanding to consider how SE research might contribute to a more socially responsible software industry.

How do values operate?

We use values research from social psychology as our framework. In particular, we draw on Schwartz's values model, which is based on extensive empirical research spanning the last three decades. Schwartz's work has identified, through survey research, a range of values that are recognised across cultures and that operate relationally. Schwartz's values model operates across two key oppositional axes: self-enhancement vs. self-transcendence; and openness vs. conservation [4].

We also use Maio’s work, which considers values as mental constructs that can be studied at three different levels: the system level (the relationships outlined by Schwartz); the personal level (the different interpretations of values held by individuals); and the instantiation level (how values are expressed through behaviours) [5]. At the system level, for example, a software engineer who is highly concerned about their personal career development (achievement) is – according to Schwartz’s model – less likely to be concerned about the environmental sustainability (universalism) of the systems they are building. At the personal level, software engineers may have different interpretations of high quality code (achievement) - e.g. ‘code that does the job’ vs. ‘elegant code’. At the instantiation level, a concern with privacy may manifest in a development decision not to track user queries.

Understanding values in software engineering

In order to study human values in a software engineering context, we required methods that were relatable and relevant to the software engineering community. We used Q-methodology as our starting point [6]. Q-methodology is a well-established method designed to systematically study subjectivity. It involves participants taking part in a card ranking exercise; they are interviewed about their decisions, and multiple participants' 'sorts' can be statistically analysed. The structured nature of the sorting helped with the systematic articulation and analysis of the qualitative data elicited from participants' narratives; it was also important that the card statements were specific to the software engineering community. We used the newly revised ACM Code of Ethics [7] as a basis, choosing principles that corresponded with Schwartz's values types and filling in any gaps by creating statements in accordance with the missing values. To gain a full understanding of the software engineering context, it was important to consider a wide range of values, not just those considered ethical, so that values trade-offs within complex industry contexts could be fully understood. Power and profit were as important for our study as honesty and the social good.

The role of the researcher in promoting a socially responsible software industry

One of our key findings is that people interpret and act out values in very different ways. Two of our study participants, for example, who both placed the statement 'it is important for me that the public good is the central concern of all professional computing work' in their 'top 3', showed almost opposite understandings of this value. For Laura, the public good was about optimising the user experience: she explained they would 'analyse the data once the user hits our website; we would then optimise off that behaviour'. By contrast, Stuart didn't want to overly 'structure' the experience of users. He explained that an e-commerce site could ask users 'do you want us to try and automate your offers? Yes or no'. He viewed an overly structured web experience as being oppositional to users' freedom of choice.

By simply introducing the Q-Sort to software engineers, we have already encouraged articulation of these differences of interpretation, things that are often taken for granted and rarely explained. Maio and Olson, for example, argue that values often act as truisms, 'widely shared, rarely questioned, and, therefore, relatively bereft of cognitive support' [8]. Carrying out this kind of research may be the first step in encouraging a more values-aware and values-reflective technology industry – in which the taken-for-granted may begin to be reflected upon and articulated. Avenues for future work include identifying opportunities for light-weight interventions that enable values reflection as an integral part of the agile process, for example.

As well as encouraging discussion of values within industry, we (SE researchers and academics) need to foster reflective, critical skills within our students. For example, we used the Q-Sort as a teaching tool in our Software Studio, a 20-week long module for second year Software Engineering undergraduate students. Within this module, students work in small teams to ideate, design and develop a software application. We introduced the Q-Sort to teams early on in the process as a way of encouraging values articulation and prioritisation that would underpin the entire software engineering decision making process. As well as generating discussion, reflection and critical thinking, this led to concrete future design decisions. One team, for example, went on to adopt a ‘most-vulnerable-first’ approach to system design and development for their train journey planning app, prioritizing search needs for people with disabilities, people with young children, and the elderly. In contrast to standalone ethics courses, the Software Studio embedded values and ethical considerations into the module; they were integrated with technical skills, not an optional add-on.

This is one example of teaching practice that supports the Values in Computing mission: that the next generation of computing professionals will be equipped with the technical tools, foundational knowledge, and critical skills necessary to distinguish responsible software engineering decisions from those that are potentially harmful to self and society.

References
  1. The Guardian, Cambridge Analytica Files. Retrieved on 12 August 2019 from https://www.theguardian.com/news/series/cambridge-analytica-files
  2. Helmore, E. (2019) ‘Profit over safety? Boeing under fire over 737 Max crashes as families demand answers’, The Guardian. Retrieved on 12 August 2019 from https://www.theguardian.com/business/2019/jun/17/boeing-737-max-ethiopian-airlines-crash
  3. Hussain, W. (2019) ‘Values debt is eating software’, IEEE Software blog. Retrieved on 12 August 2019 from http://blog.ieeesoftware.org/2019/07/values-debt-is-eating-software.html
  4. Schwartz, S. H. et al. (2012) ‘Refining the theory of basic individual values’. Journal of personality and social psychology 103(4): 663-688
  5. Maio, G. R. (2010) ‘Mental representations of social values’. in Advances in Experimental Social Psychology (Vol 42). Academic Press, pp. 1–43.
  6. Winter, E. et al. (2019) 'Advancing the study of human values in software engineering'. In Proceedings of the 12th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE '19). IEEE Press, pp. 19-26.
  7. ACM (2019) Code of Ethics and Professional Conduct. Retrieved on 12 August 2019 from https://www.acm.org/code-of-ethics
  8. Maio, G. R. and J. M. Olson (1998) ‘Values as truisms: Evidence and implications’. Journal of Personality and Social Psychology, 74(2): 294-311.

Wednesday, August 7, 2019

Design and Evolution of C-Reduce (Part 2)

Associate Editor: Karim Ali (@karimhamdanali)

Part 1 of this series introduced C-Reduce and showed how it combines a domain-independent core with a large collection of domain-specific passes in order to create a highly effective test-case reducer for C and C++ code. This part tells the rest of the story and concludes.

Parallel Test-Case Reduction


C-Reduce's second research contribution is to perform test-case reduction in parallel using multiple cores. The parallelization strategy, based on the observation that most variants are uninteresting, is to speculate along the uninteresting branch of the search tree. Whenever C-Reduce discovers an interesting variant, all outstanding interestingness tests are killed and a new line of speculation is launched. This is the same approach that was subsequently rediscovered by the creator of halfempty.
C-Reduce has a policy choice between taking the first interesting variant returned by any CPU, which provides a bit of extra speedup but makes parallel reduction non-deterministic, or only taking an interesting variant when all interestingness tests earlier in the search tree have reported that their variants are uninteresting. We tested both alternatives and found that the speedup due to non-deterministic choice of variants was minor. Therefore, C-Reduce currently employs the second option, which always follows the same path through the search tree that non-parallel C-Reduce would take. The observed speedup due to parallel C-Reduce is variable, and is highest when the interestingness test is relatively slow. Speedups of 2–3x vs. sequential reduction are common in practice.
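
As a rough illustration of the deterministic policy (a sketch only, not C-Reduce's actual implementation), one parallel step amounts to launching the interestingness tests for a batch of variants concurrently and then committing to the earliest variant, in search order, that turns out to be interesting:

# Speculative parallel interestingness testing with deterministic selection.
# "variants" are ordered as sequential C-Reduce would try them; the earliest
# interesting one wins, so the search path matches the sequential tool.
# (C-Reduce also kills still-running tests; cancel() below only drops tests
# that have not started yet.)
from concurrent.futures import ThreadPoolExecutor

def parallel_step(variants, is_interesting, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(is_interesting, v) for v in variants]
        for variant, future in zip(variants, futures):
            if future.result():
                for later in futures:
                    later.cancel()       # discard speculative work
                return variant           # earliest interesting variant, in order
    return None                          # the whole batch was uninteresting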

Applicability and Moving Towards Domain-Independence

Something that surprised us is how broadly applicable C-Reduce is to test cases in languages other than C and C++. Our users have found it effective in reducing Java, Rust, Julia, Haskell, Racket, Python, SMT-LIB, and a number of other languages. When used in this mode, the highly C/C++-specific C-Reduce passes provide no benefit, but since they fail quickly they don't do much harm. (C-Reduce also has a --not-c command line option that avoids running these passes in the first place.) One might ask why C-Reduce is able to produce good results for languages other than C and C++; the answer appears to be based on the common structural elements found across programming languages in the Algol and Lisp families. In contrast, in our limited experience, C-Reduce does a very poor job reducing test cases in binary formats such as PDF and JPEG. Other reducers, such as the afl-tmin tool that ships with afl-fuzz, work well for binary test cases.

A consequence of the modular structure of C-Reduce is that while its transformation passes are aimed at reducing C and C++ code, the C-Reduce core is completely domain-independent. C-Reduce has been forked to create a highly effective reducer for OpenCL programs. We believe it would be relatively straightforward to do something similar for almost any other programming language simply by tailoring the passes that C-Reduce uses.

Limitations


When used for its intended purpose, when does C-Reduce work badly? We have seen two main classes of failure. First, C-Reduce can be annoyingly slow. This typically happens when the passes early in the phase ordering, which are intended to remove a lot of code quickly, fail to do this. Second, highly templated C++ sometimes forces C-Reduce to terminate with an unacceptably large (say, >1 KB) final result. Of course this is better than nothing, and subsequent manual reduction is usually not too difficult, but it is frustrating to have written 69 different Clang-based passes and to still find effectively irreducible elements in test cases. The only solution—as far as we know—is to strengthen our existing transformation passes and to create more such passes.

A small minority of compiler bugs appears to be impossible to trigger using small test cases. These bugs are exceptions to the small scope hypothesis. They typically stem from resource-full bugs in the compiler. For example, a bug in register spilling code requires the test case to use enough registers that spilling is triggered; a bug in logic for emitting long-offset jumps requires the test case to contain enough code that a long offset is required; etc. These test cases are just difficult to work with, and it is not clear to us that there's anything we can do to make it easier to debug the issues that they trigger.

C-Reduce Design Principles


In summary, C-Reduce was designed and implemented according to the following principles:

  1. Be aggressive: make the final reduced test case as small as possible.
  2. Make the reducer fast, for example using parallelism, careful phase ordering of passes, and avoiding unnecessary I/O traffic, when this can be done without compromising the quality of the final output.
  3. Make it as easy as possible to implement new passes, so that many domain-specific passes can be created.

  4. Keep the C-Reduce core domain-independent.
  5. Focus only on producing potentially-interesting variants, delegating all other criteria to the user-supplied interestingness test.

Directions for Future Test-Case Reduction Research

Although perhaps a few dozen papers have been written about test-case reduction since Hildebrandt and Zeller's initial paper 19 years ago, I believe that this area is under-studied relative to its practical importance. I'll wrap up this article with a collection of research questions suggested by our experience in over a decade of work on creating a highly aggressive reducer for C and C++.


What is the role of domain-specificity in test case reduction? Researchers who cite C-Reduce appear to enjoy pointing out that it is highly domain-specific (nobody seems to notice that the C-Reduce core is domain-independent, and that the pass schedule is easy to modify). The implication is that domain-specific hacks are undesirable and, of course, an argument against such hacks would be forceful if backed up by a test-case reducer that produced smaller final C and C++ code than C-Reduce does, without using domain knowledge. So far, such an argument has not been made.


Is domain knowledge necessary, or can a domain-independent test-case reducer beat C-Reduce at its own game? The most impressive such effort that we are aware of is David MacIver's structureshrink, which uses relatively expensive search techniques to infer structural elements of test cases that can be used to create variants. Anecdotally, we have seen structureshrink produce reduced versions of C++ files that are smaller than we would have guessed was possible without using domain knowledge. Even so, some useful transformations such as function inlining and partial template instantiation seem likely to remain out-of-reach of domain-independent reduction techniques.


What is the role of non-greedy search in test-case reduction? In many cases, the order in which C-Reduce runs its passes has little or no effect on the final, reduced test case. In other words, the search is often diamond-shaped, terminating at the same point regardless of the path taken through the search space. On the other hand, this is not always the case, and when the search is not diamond-shaped, a greedy algorithm like C-Reduce's risks getting stopped at a local minimum that is worse than some other, reachable minimum. The research question is how to get the benefit of non-greedy search algorithms without making test-case reduction too much slower.


What other parallelization methods are there? C-Reduce's parallelization strategy is simple, gives a modest speedup in practice, and always returns the same result as sequential reduction. There must be other parallel test-case reduction strategies that hit other useful points in the design space. This is, of course, related to the previous research question. That is, if certain branches in the search tree can be identified as being worth exploring in both directions, this could be done in parallel.


What is the role of canonicalization in test-case reduction? A perfectly canonicalizing reducer would reduce every program triggering a given bug to the same final test case. This is a very difficult goal, but there are many relatively simple strategies that can be employed to increase the degree of canonicalization, such as putting arithmetic expressions into a canonical form, assigning canonical names to identifiers, etc. C-Reduce has a number of transformations that are aimed at canonicalization rather than reduction. For example, the reduced test case at the top of Part 1 of this piece has four variables a, b, c, and d, which first appear in that order. I believe that more work in this direction would be useful.
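
As a toy example of such a canonicalizing transformation (C-Reduce's real renaming passes are Clang-based and understand C/C++ scoping, so a plain regex over source text is only a sketch), identifiers can be renamed to a, b, c, ... in order of first appearance:

# Rename the given identifiers to single letters in order of first appearance.
import re, string

def canonicalize_names(source, names_to_rename):
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name not in names_to_rename:
            return name
        if name not in mapping:
            mapping[name] = string.ascii_lowercase[len(mapping)]
        return mapping[name]
    return re.sub(r"[A-Za-z_]\w*", rename, source)

print(canonicalize_names("int foo = bar + foo;", {"foo", "bar"}))
# prints: int a = b + a;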

Can we avoid bug hijacking? Test reduction sometimes goes awry when the bug targeted by the reduction is "hijacked" by a different bug. In other words, the reduced test case triggers a different bug than the one triggered by the original. During a compiler fuzzing campaign this may not matter since one fuzzer-generated bug is as good as another, but hijacking can be a problem when the original bug is, for example, blocking compilation of an application of interest. Hijacking is particularly common when the interestingness test looks for a non-specific behavior such as a null pointer dereference. C-Reduce pushes the problem of avoiding hijacking onto the user who can, for example, add code to the interestingness test looking for specific elements in a stack trace. The research question here is whether there are better, more automated ways to prevent bug hijacking.

Obtaining C-Reduce

Binary C-Reduce packages are available as part of many software distributions including Ubuntu, Fedora, FreeBSD, OpenBSD, MacPorts, and Homebrew. Source code can be found at:
https://github.com/csmith-project/creduce
Acknowledgments: C-Reduce was initially my own project, but by lines of code the largest contributor by a factor of two is my former student Yang Chen, now a software engineer at Microsoft. Yang wrote effectively all of the Clang-based source-to-source transformation code, more than 50,000 lines in total. Eric Eide, a research professor in computer science at the University of Utah, is the other major C-Reduce contributor. Our colleagues Pascal Cuoq, Chucky Ellison, and Xuejun Yang also contributed to the project, and we have gratefully received patches from a number of external contributors. Someone created a fun visualization of the part of C-Reduce's history that happened on GitHub.


Finally, I’d like to thank Eric Eide and Yang Chen for reviewing and suggesting improvements to this piece.

Tuesday, July 23, 2019

Values Debt is Eating Software

By: Waqar Hussain (@infinity4upk)
Associate Editor: Muneera Bano (@DrMuneeraBano)


The growing diffusion of software into human societies and its profound impact on individuals calls for a reflective and responsible attitude from its creators. Software engineers often do well when it comes to delivering software functionality, thus creating economic or business value. However, an aspect that hasn't received sufficient attention is human values – which we all would claim to care about. On the positive side, the shift from value-neutral towards value-based software engineering is a sign of growing maturity for software engineering as a discipline [1].

Values such as fairness, sustainability and pleasure are ideals and aspirations that are important to humans. Values transcend situations, guide human decision making, and influence people's evaluation of worldly things [2]. When values are not designed for and delivered in software, there are ramifications. I call value deficiencies or omissions in software Values Debt (VD). Unlike other forms of software debt, which can often be remedied by spending additional cost or effort, high VD renders software unsalvageable. It poses an existential threat to the 'owner' organization - something that happened to Volkswagen and Facebook [3].

VD is present in all kinds of software. For example, mundane software like flight reservation systems has an inherent VD, as it is built on the economic drivers of supply and demand rather than on the values of empathy and compassion. Delta Air Lines' reservation system manifested its VD when it started price gouging people trying to escape from the devastation of Hurricane Irma [4].

More alarmingly, VD is also present in Machine Learning (ML) and Artificial Intelligence (AI) based systems that can easily scale and impact masses of people. These 'intelligent' ML/AI based systems are often trained on biased data and hence spew out biased results, such as longer sentencing for criminals of a particular skin type, denial of loans and jobs based on ethnicity or gender, and so on. With the prevalence of VD in all software, the point of emphasis is: while software may be eating the world, VD is dining on software!

Contrary to the common (practitioners') belief, not all ills of software can be put to rest by adding 'more' code. For example, the tech giant Facebook had to undergo a social audit, rather than a technical overhaul of its platform, to fix privacy, discrimination and other human rights issues.

But why is it so hard to embed even a small subset of human values in software? Firstly, it is due to the subjective nature of values and their imprecise definitions. Software engineers, especially those with little or no background in social sciences, can hardly relate values to their 'day to day' work. Secondly, values are interconnected: implementing one requires deliberately addressing the set of values it is connected to.

Poor implementation of one value, e.g. privacy, might impact other connected values such as perceived autonomy and freedom. Thirdly, prioritizing stakeholder values (individual vs. corporate vs. societal) and aligning them in software delivery is extremely challenging.

But how do software engineers deal with values in practice? Our research shows that there is a general lack of awareness among the creators of software about human values and how they relate to their work [4]. Some avoid dealing with the murky and philosophical area of values and divorce themselves from taking up any responsibility. Others are deterred by the cost and effort required to achieve the desired level of maturity in implementing values and are content with covering enough regulatory requirements just to be compliant. Yet others consider that the delivery of software itself takes care of human values, with an attitude of 'after all, we are solving people's problems and keeping ourselves from doing any harm'.

What can be done to address this? Besides raising awareness of the importance of human values in SE and providing practitioners with the right tools and techniques, more accountability is needed. Codes of professional conduct for software practitioners, and standards to support the development of ethical AI and ML such as the IEEE P7000 suite, are useful. Appealing merely to the good conscience of people, however, will not achieve the desired results. Regulations like the General Data Protection Regulation (GDPR) and others are helpful [5]. Regulations deter technological advancements from deliberately breaching societal values and hold organizations (and even individuals) accountable when software negatively impacts people or the environment.

Where to now? The Values Debt metaphor is an icebreaker to discuss, understand and thereby avoid the implications of human-values-deficient software. Imagine what the world would start to look like if we could engineer in values from the get-go. Could we avoid human rights deficiencies in software? Could we solve failures like the Delta Air ticketing system, labelled as one with "no built-in ethical values"? Could we 'educate' and restrain intelligent algorithms in time to stop them from sending pictures of self-harm to children and avoid future suicides [6]? I believe we can if we reduce software Values Debt.

References

[1] Boehm, B. Value-based software engineering, ACM SIGSOFT Software Engineering Notes, 28(2), 4., 2003
[2] Schwartz, S.H., et al., Refining the theory of basic individual values. Journal of personality and social psychology, 2012. 103(4): p. 663.
[3] Smith D. Mark Zuckerberg vows to fight election meddling in marathon Senate grilling. The Guardian. 2018 Apr;11.
[4] Justin Sablich. 2017. ’Price Gouging’ and Hurricane Irma: What Happened and What to Do. https://www.nytimes.com/2017/09/17/travel/price-gouginghurricane-irma-airlines.html.
[5] Voigt, P. and A. Von dem Bussche, The EU General Data Protection Regulation (GDPR). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 2017.
[6] Crawford, A. Instagram 'helped kill my daughter'. BBC News. Retrieved on 24 July 2019 from https://www.bbc.com/news/av/uk-46966009/instagram-helped-kill-my-daughter, 2019

Tuesday, July 9, 2019

Design and Evolution of C-Reduce (Part 1)

Associate Editor: Karim Ali (@karimhamdanali)

Since 2008, my colleagues and I have developed and maintained C-Reduce, a tool for programmatically reducing the size of C and C++ files that trigger compiler bugs. C-Reduce also usually does a credible job reducing test cases in languages other than C and C++; we'll return to that topic in Part 2.

Why Reduce Test Cases?

Here's a typical C-Reduce output, as found in the LLVM bug database:

int a[1];
int b;
void c() {
  void *d = b = 0;
  for (;; b++)
    a[b] = (int) d++;
}

LLVM crashes when compiling this code at the -O2 optimization level. The bug report does not contain the original, unreduced test case, but most likely it was larger.


A reduced test case is preferable because:
  • it usually gets the compiler to misbehave quickly, reducing the number of function calls, memory allocations, etc. that the compiler developer has to step through and reason about while debugging
  • the reduced file contains little code not directly related to triggering the bug, making it less likely that compiler developers will be distracted by extraneous features of the test case
  • reduced test cases for the same bug often look similar to each other, whereas this is not normally true for unreduced files that trigger the same bug
  • there is often little or no discernible similarity between an unreduced source file and its reduced version, making it easier for compiler bugs triggered by proprietary code to be reported externally

The minimum-sized input triggering any particular compiler bug can be found using a trivial algorithm: exhaustive search of text strings of increasing size. This method is, of course, almost always intractable. In practice, test case reduction proceeds in the other direction: starting with a large, failure-inducing test case, incrementally making it smaller until a local minimum is reached.

A Bit of Background

The history of automated test case reduction does not seem to be well documented, but several examples can be found in software testing papers from the 1990s such as Differential Testing for Software and Massive Stochastic Testing of SQL. Test-case reduction was first studied in its own right in 2000 when Hildebrandt and Zeller introduced Delta Debugging: a general-purpose technique for test case reduction. Their algorithm uses a greedy search where a series of variants (our term, not theirs, for partially-reduced test case candidates) is produced by removing chunks of the test case. As reduction progresses, the chunk size is reduced, until it reaches some minimum-sized unit, such as a single line, token, or character. When no minimum-sized chunk can be removed from the test case without breaking the property that it triggers the bug, the Delta Debugger terminates. Almost all subsequent test-case reduction work, including ours, builds upon this work.
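
To make the shape of that greedy chunk-removal search concrete, here is a rough sketch (simplified, and not the exact ddmin algorithm from the original paper), with triggers_bug standing in for the property that the test case must keep satisfying:

# Greedy chunk-removal sketch: try to delete chunks of lines, halving the
# chunk size when nothing more can be removed, until single-line chunks
# no longer help. triggers_bug(lines) is the user-supplied check.
def reduce_by_chunks(lines, triggers_bug):
    chunk = max(1, len(lines) // 2)
    while True:
        removed_any = False
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and triggers_bug(candidate):
                lines = candidate          # keep the smaller variant
                removed_any = True         # and retry at the same position
            else:
                i += chunk                 # this chunk is needed; move on
        if removed_any:
            continue                       # another sweep at the same chunk size
        if chunk == 1:
            return lines                   # local minimum at line granularity
        chunk = max(1, chunk // 2)         # refine the granularity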

Towards C-Reduce

I became interested in test case reduction when my colleagues and I began to find a lot of bugs in C compilers using random testing. We found so many bugs that reporting them became bottlenecked on reducing the bug-triggering programs. Since I was the one reporting the bugs we found, I was the one who felt the pain of manual test-case reduction, and it quickly got old. I eventually reported around 500 compiler bugs and I could not have done this without first creating C-Reduce.

At the time, the best open-source implementation of Delta Debugging, from UC Berkeley, was line-based and contained a significant innovation over the original algorithm: it could reorganize a file in such a way that all nested curly braces deeper than a configurable level would appear on a single line. Thus, at level zero, entire functions would be placed on the same line, enabling the line-based reducer to remove an entire function at once. At higher nesting levels, functions would be split across lines, enabling finer-grained reduction. This method worked well but the tool ended up being inadequate for my needs: it got stuck at local minima that were often orders of magnitude larger than what could be achieved when reducing by hand.

The limiting factor in the Berkeley Delta tool ("Delta" from now on) was obvious: it was not able to exploit enough of the structure of the file being reduced. For example, it could usually not do much to simplify arithmetic expressions. These sorts of simplifications tend to have a cascading effect: eliminating the last use of a variable allows its definition to be eliminated, etc. The obvious path forward was to write a new tool that solved a reduction problem that Delta could not solve, and then to alternate running this tool and Delta until a global fixpoint was reached. I did this, adding more and more reduction techniques over time. I eventually wrote a line-elimination pass in my new reducer, at which point Delta was subsumed and could be dropped.

We ended up keeping two elements of Delta's design. First, the configurable hierarchical reformatting of a test case based on curly brace nesting. This technique, followed by removing contiguous lines of code, is still one of C-Reduce's most useful first lines of attack on a test case. Second, Delta's mechanism for determining whether a given variant is "interesting." An interesting variant is used as the basis for further reduction steps; an uninteresting variant is a dead end, and is discarded. Delta determined interestingness by invoking a user-supplied program—typically a shell script—whose process exit code determines the interestingness of the current variant. The flexibility afforded by this small element of user extensibility ends up being extremely useful. For example, the interestingness test can discard test cases that trigger certain compiler warnings, it can attempt to disambiguate different crash bugs, etc.
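
As an illustration, a minimal interestingness test for a crash bug might look like the following sketch, written in Python rather than the shell scripts most users write; the file name crash.c, the clang invocation, and the "Assertion" substring are illustrative choices, not part of C-Reduce itself:

#!/usr/bin/env python3
# Exit code 0 means "interesting". Checking for a specific assertion message
# (rather than any non-zero exit) helps disambiguate this crash from unrelated
# ones. A compile that times out raises an exception, so it counts as
# uninteresting.
import subprocess, sys

proc = subprocess.run(
    ["clang", "-O2", "-c", "crash.c", "-o", "/dev/null"],
    capture_output=True, text=True, timeout=60)

interesting = proc.returncode != 0 and "Assertion" in proc.stderr
sys.exit(0 if interesting else 1)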

It is more challenging to reduce test cases that cause the compiler to emit incorrect object code than it is to reduce test cases that merely cause the compiler to crash. C-Reduce itself is agnostic about the character of the bug of interest: we push all of the difficulties in reducing miscompilation triggers into the interestingness test, which should try to answer questions such as:
  • is the variant well-defined by the C or C++ standard?
  • does the variant avoid depending on behaviors that are unspecified by the C or C++ standard?
  • does the buggy compiler turn the variant into an executable?
  • does this executable terminate within a specified time?
  • does the reference compiler (assumed to not contain the bug of interest) turn the variant into an executable?
  • does this executable also terminate within a specified time?
  • does the behavior of the two executables differ in a way that indicates that a miscompilation occurred?

The variant is interesting if the answer to all of these questions is "yes."
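
Sketched in code, such a test might look roughly like this; the compiler names, flags, and timeouts are illustrative, and the undefined/unspecified-behavior checks from the first two questions are omitted here even though, as discussed below, they are essential in practice:

#!/usr/bin/env python3
# The variant is interesting only if it builds and runs under both compilers
# and the two executables produce different output.
import subprocess, sys

def run(cmd, timeout=30):
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None

def build_and_run(compiler, exe):
    build = run([compiler, "-O2", "small.c", "-o", exe])
    if build is None or build.returncode != 0:
        return None                        # did not compile within the time limit
    execution = run(["./" + exe])
    if execution is None or execution.returncode != 0:
        return None                        # did not terminate cleanly within the time limit
    return execution.stdout

buggy = build_and_run("buggy-cc", "buggy.out")   # compiler under test (name illustrative)
reference = build_and_run("cc", "ref.out")       # assumed-correct reference compiler
sys.exit(0 if None not in (buggy, reference) and buggy != reference else 1)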

The hardest part of reducing programs that trigger miscompilation bugs is ensuring that variants avoid undefined behaviors (such as invalid pointer operations) and do not rely on unspecified behaviors (such as the order of evaluation of function arguments). A test case doing one of these things is ill-formed and can accomplish nothing beyond annoying compiler developers. Empirically, if undefined behavior is not actively avoided during test-case reduction, C-Reduce will almost certainly introduce it. The practical solution is to use suitable static and dynamic analysis tools to rule out ill-formed variants. Since no single tool that detects all undefined and unspecified behaviors in C and C++ programs exists, a hybrid approach involving multiple tools is typically used in practice. This approach is not completely satisfying, but it works well enough that C-Reduce can reliably produce useful reduced test cases for miscompilation bugs in C and C++ compilers.

Writing good interestingness tests for miscompilations takes a bit of practice. First, when there are many criteria that must be satisfied for a variant to be interesting, it is useful to minimize the test’s expected-case runtime by asking the quickest and most-likely-to-fail questions first. Second, it is easy to write buggy tests. More than one user has described C-Reduce as being something like the sorcerer's apprentice: it does an excellent job reducing according to the criteria it is given, but if these criteria contain any kind of loophole, C-Reduce is likely to find it. For example, it is easy to accidentally write a test that claims that the empty file is interesting.

From the start, C-Reduce’s main goal was to produce a very small final reduced test case, even when this would take longer than we liked. This is based on the premise that we should burn cycles instead of human time, and that reporting a compiler bug is not usually on the critical path; we can often afford to wait for a better result. The consequences of this decision can be seen in Tables 1 and 2 of this paper that evaluates several test-case reduction methods: C-Reduce produces the smallest final output, but takes more time to do so.

A Modular, Domain-Independent Reducer Core

Although C-Reduce started out as a pet project solving a specific problem, it evolved into a research project involving a number of my colleagues, whose top-level goal was to produce an effective and usable reducer for C and C++ code as found in the wild. The first research contribution to come out of this effort was a way to achieve a clean mechanism/policy separation in a test case reducer. Previous reduction techniques had all baked specific transformations into the overall search strategy. That approach impedes extensibility, which we found to be crucial. The structure that we ended up with is a small core that invokes a collection of pluggable transformation passes until a global fixpoint is reached.

The API for C-Reduce passes is simple but—like many simple things—required a lot of iterations before it felt finished. It is based on the ideas that transformation passes should be stateless and that every pass should implement a linear sequence of transformations, each of which results in a variant that may or may not be interesting. The interface is as follows:


state new(filename, option)

Return a fresh state object. Each pass uses this state to keep track of where it is in the sequence of transformations that it is capable of performing. These states may contain arbitrary data items; the C-Reduce core treats them as opaque. A typical pass stores some kind of cursor—usually a byte offset, token offset, line number, or position in a tree traversal—in the state object.

The file referred to by filename is logically part of the state object even though it resides in the filesystem instead of memory. Of course it would not be difficult to pass the contents of the file around as a memory object but this approach would be slow when these objects are large: C-Reduce is frequently invoked on multi-megabyte preprocessed C++ files.

The "option" is used to select among different behaviors implemented by a composite pass.

state advance(filename, option, state)

Return a new state object referring to the next transformation opportunity following the one referenced by the state object passed as a parameter.

result transform(filename, option, state)

Modify the file in-place, selecting the transformation instance referred to by the state object. The result takes one of three values:
  • OK : the transformation succeeded
  • STOP : no more transformation instances remain for this pass
  • ERROR : something went wrong; for example, an external tool crashed, a working file or directory could not be created, etc.

(The API contains one additional method, which checks whether a pass's external dependencies are satisfied, that doesn't matter here.)

Our experience has been that every transformation pass that we wanted has been easy to implement behind this API.
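
For illustration, here is a toy pass written against this interface in Python (C-Reduce's real passes are implemented in Perl and C++); it tries to delete one line at a time, using the state object as a line-number cursor:

# Illustrative line-deletion pass implementing new/advance/transform.
OK, STOP, ERROR = "OK", "STOP", "ERROR"

def new(filename, option):
    return 0                     # cursor: index of the next line to try deleting

def advance(filename, option, state):
    return state + 1             # skip past a line that turned out to be needed

def transform(filename, option, state):
    with open(filename) as f:
        lines = f.readlines()
    if state >= len(lines):
        return STOP              # no transformation opportunities remain
    del lines[state]             # produce the variant by editing the file in place
    with open(filename, "w") as f:
        f.writelines(lines)
    return OK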

The C-Reduce core implements this algorithm:

current = original_test_case
do
  size_at_start = size(current)
  foreach (p, option) in pass_list
    state = p::new(current, option)
    do
      variant = current         // this is a file copy operation
      result = p::transform(variant, option, state)
      if result == ERROR
        report_problem_in_pass(p, option)
      if result == OK
        if is_interesting(variant)
          current = variant     // also a file copy
        else
          state = p::advance(current, option, state)
    while result == OK
while size(current) < size_at_start

The termination argument for C-Reduce is:
  1. Since the outermost loop requires the size of the test case to decrease monotonically, it can only execute as many times as the size (in bytes) of the unreduced test case. In practice, it executes many fewer times than this.
  2. The loop over passes terminates because the pass list is immutable after C-Reduce’s initialization phase.
  3. Each iteration of the innermost loop either advances the state object or else (by selecting an interesting variant) removes one transformation opportunity. Either way, the number of transformations remaining in the current pass is decreased by one.
  4. The interestingness test is, at worst, terminated (using OS support for killing a process group) after a configurable timeout.

In practice, the weak link in this argument is the third item, which is vulnerable to bugs in passes. C-Reduce terminates robustly by abandoning passes when they appear to be behaving unreasonably.

The C-Reduce core does not insist that transformations make the test case smaller, and in fact quite a few of its passes potentially increase the size of the test case, with the goal of eliminating sources of coupling within the test case, unblocking progress in other passes.

The sequence of transformation passes is carefully orchestrated such that passes that are likely to give the biggest wins—such as those that remove entire functions—run first; otherwise the tool would end up spending days or weeks doing silly things such as trying to shrink numeric constants in a huge source file. Shrinking numbers is useful, and it should be done, but only after many other reduction mechanisms have run to completion.

C-Reduce's collection of cooperating passes, with heavy phase-ordering constraints, is highly reminiscent of how a modern optimizing compiler works. However, only a small proportion of the transformation passes is intended to be semantics-preserving in the sense that a compiler's optimization passes must be. In this domain, we only want to preserve enough semantics that we can probabilistically avoid breaking whatever property makes a test case interesting.

A consequence of writing a modular reducer is that once we came up with the right API for writing passes, we were free to write a lot of passes. My colleagues and I spent several years doing this and we ended up with:
  • 35 passes, implemented in Perl, that include heuristics such as removing lines, removing various kinds of matched delimiters (and perhaps also the text between them), and shrinking integer values
  • 6 passes that invoke external utilities such as unifdef, a partial evaluator for the C preprocessor language, a lexer for C and C++ that supports various token-level reduction transformations, and pretty-printing utilities that make the reduced test case more pleasant to look at
  • 69 passes, implemented in C++, that use LLVM's Clang front end as a library for source-to-source transformation of C and C++ code; these include function inlining, partial template instantiation, scalar replacement of aggregates, copy propagation, and eliminating levels of a class hierarchy.

The actual number of dynamic passes is larger than the total of these numbers since some passes can be invoked in different modes using the "option" parameter mentioned above.

In this piece, we looked at why we had to create C-Reduce and at the modular structure that was key to making it solve the problems that we wanted to solve. In Part 2, I'll describe how C-Reduce improves reduction times using multiple cores and why C-Reduce usually does a good job reducing test cases in languages other than C and C++; finally, I'll discuss a few open research problems in test case reduction.