Sunday, August 21, 2016

From Aristotle to Ringelmann: Using data science to understand the productivity of software development teams

By: Ingo Scholtes, Chair of Systems Design, ETH Zürich, Switzerland (@ingo_S)
Associate Editor: Bogdan Vasilescu, University of California, Davis. USA (@b_vasilescu)

I am sure that the title of this blog post raises a number of questions: What has the Greek philosopher Aristotle to do with the 19th century French agricultural engineer Maximilien Ringelmann? And what, if anything, do the two of them have to do with software?

The answers to both questions are related to Aristotle's famous reference to systems where the "whole is greater than the sum of its parts", i.e. complex systems where interactions between elements give rise to emergent or synergetic effects. Teams of software developers who communicate and coordinate to jointly solve complex development tasks can be seen as one example of such a complex (social) system. But what are the emergent effects in such teams? How do they affect productivity, and how can we quantify them?

This is where Maximilien Ringelmann, a French professor of agricultural engineering who would later - astonishingly - become one of the founders of social psychology, enters the story. Around 1913, Ringelmann became interested in how the collective power of draught animals, like horses or oxen pulling carts or plows, changes as increasingly large teams of them are harnessed. He answered the question with a data-driven study: he asked increasingly large teams of his students to jointly pull on a rope and measured the collective force they were able to exert. He found that the collective force of a team of students was less than the sum of the forces exerted by each team member alone, an effect that later became known as the "Ringelmann effect". One possible explanation for the finding is coordination loss: it becomes increasingly difficult to tightly synchronize actions in an increasingly large team. Moreover, social psychology has generally emphasized motivational effects due to shared responsibility for the collective output of a team, a phenomenon known by the rather unflattering term "social loafing".

Changing our focus from draught oxen to developers, let us now consider how all of this is related to software engineering. Naturally, the question of how the cost and time of software projects scale with the number of developers involved is of major interest in software project management and software economics.

In the 1975 book The Mythical Man-Month, Fred Brooks formulated his famous law of software project management, stating that "adding manpower to a late project makes it later". Just as for the Ringelmann effect, different causes for this have been discussed. First, software development naturally comes with inherently indivisible tasks which cannot easily be distributed among a larger number of developers. Illustrating this issue, Brooks stated that "nine women can't make a baby in one month". Secondly, larger team sizes give rise to an increasing need for coordination and communication that can limit the productivity of team members. And finally, for developers added to a team later, Brooks discussed the "ramp-up" time that is due to the integration and education of new team members. The result of these combined effects is that the individual productivities of developers in smaller teams cannot simply be multiplied to estimate their productivity in a larger team.

While the (empirical) software engineering community has been rather unanimous about the fact that the productivity of individual developers is likely to decrease as teams grow larger, a number of recent works in management science and data science (referenced and summarized in [1]) have questioned this finding in the context of Open Source communities. The argument is that larger team sizes increase the motivation of individuals in Open Source communities, thus giving rise to an "Aristotelian regime" where the whole team indeed produces more than expected based on the sum of its parts. The striking consequence would be that Open Source projects are instances of economies of scale, where the effort of production (in terms of team members involved) scales sublinearly with the size of the project. In contrast, traditional software projects represent diseconomies of scale, i.e. the cost of production increases superlinearly as projects become larger and more complex.

The goal of our study [1] was to contribute to this discussion by means of a large-scale data-driven study. Precisely, we studied a set of 58 major Open Source projects on GitHub with a history of more than ten years and a total of more than half a million commits contributed by more than 30,000 developers. The question we wanted to answer is simple: How does the productivity of software development teams scale with team size? In particular, we were interested in whether Open Source projects indeed represent exceptions to basic software engineering economics, as argued by recent works.

Regarding methodology, answering this question involves two important challenges:
  1. We need a quantitative measure to assess the productivity of a team
  2. We must be able to calculate the size of a development team at any given point in time
In our work, we addressed these challenges as follows. First, in line with a large body of earlier studies, and notwithstanding the fact that it necessarily gives rise to a rather limited notion of productivity, we use a proxy measure for productivity based on the amount of source code committed to the project's repository. There are different ways to define such a measure. While a number of previous studies have simply used the number of commits as a proxy for the amount of code produced, we find that the code contributions per commit are so broadly distributed that commit counts cannot simply be used as a measure of productivity; doing so would substantially bias our analysis. To avoid this problem, we use the Levenshtein distance between the code versions in consecutive commits, which quantifies the number of characters edited between consecutive versions of the source code.
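As a rough illustration (a sketch, not the authors' actual pipeline), the Levenshtein proxy can be computed with the classic dynamic-programming recurrence; the two file versions below are made-up examples:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: the number of character insertions,
    deletions, and substitutions needed to turn string a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Contribution of a commit = edit distance between consecutive versions
before = "def add(a, b):\n    return a + b\n"
after_ = "def add(a, b):\n    # sum two values\n    return a + b\n"
contribution = levenshtein(before, after_)  # here a pure 21-char insertion
```

Unlike a raw commit count, this measure distinguishes a one-character fix from a thousand-line rewrite, which is exactly what the broad contribution distribution demands.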

A second non-trivial challenge is to assess the size of a development team in Open Source communities. Most of the time there is no formal notion of a team, so who should be counted as a team member at a given point in time? Again, we address this problem by means of an extensive statistical analysis. Specifically, we analyze the inter-commit times of developers in the commit log. This allows us to define a reasonably-sized time window based on predicting whether team members are likely to commit again after a given period of inactivity. This time window can then be used to estimate team sizes in a way that is substantiated by the temporal activity distribution in the data set (see [1] for details).
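A minimal sketch of such a window-based estimate is shown below. The 90-day window is an arbitrary placeholder for illustration, whereas the study derives the window statistically from the inter-commit time distribution:

```python
from datetime import datetime, timedelta

def team_size(commit_log, at, window_days=90):
    """Estimate team size at time `at`: the number of distinct authors
    with at least one commit inside the preceding activity window.
    The 90-day default is a placeholder, not the paper's value."""
    window = timedelta(days=window_days)
    return len({author for author, ts in commit_log
                if at - window <= ts <= at})

log = [
    ("alice", datetime(2016, 1, 10)),
    ("bob",   datetime(2016, 2, 1)),
    ("alice", datetime(2016, 3, 15)),
    ("carol", datetime(2015, 6, 1)),   # long inactive: not counted
]
print(team_size(log, datetime(2016, 4, 1)))  # alice and bob → 2
```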

We now have all that we need. To answer our question we only need to plot the average code contribution per team member (measured in terms of the Levenshtein distance) against the size of the development team (calculated as described above). If Ringelmann and Brooks are right, we expect a decreasing trend, indicating that developers in larger development teams tend to produce less. If, on the other hand, the studies highlighting synergetic effects in OSS communities are right, we expect an increasing trend, indicating that developers in larger development teams tend to produce more (because they are more motivated). The results across all 58 projects are shown in the following plot.

The clear decreasing trend that can be observed visually shows that Ringelmann and Brooks are seemingly right. We can further use a log-linear regression model to quantify the scaling factors and to assess the robustness of our result. This analysis confirms a strong negative relation, spanning several orders of magnitude in terms of code contributions. Notably, this negative relation holds both at the aggregate level and for each of the studied projects individually.
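A log-log fit of this kind can be sketched in a few lines. The data below are synthetic, constructed to mimic a Ringelmann regime, and the closed-form least-squares slope stands in for the full regression model of the paper:

```python
import math

def loglog_slope(sizes, outputs):
    """Least-squares slope of log(output) vs. log(team size).
    A negative slope (declining per-capita output) is the signature
    of the Ringelmann effect; a positive slope would indicate synergy."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(o) for o in outputs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic data: per-developer output ~ size^-0.5 (a Ringelmann regime)
sizes = [1, 2, 4, 8, 16, 32]
per_dev = [s ** -0.5 for s in sizes]
print(round(loglog_slope(sizes, per_dev), 3))  # → -0.5
```

The fitted exponent is the "scaling factor" referred to in the text: it summarizes, per project, how quickly average per-developer output falls as the team grows.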

While our analysis quantitatively confirms the Ringelmann effect in all of the studied projects, we have not yet addressed why it holds. Unfortunately, it is non-trivial to quantify the potential motivational factors that have been discussed in social psychology. But what we can do is study potential effects due to increasing coordination effort. For this, we again use the time-stamped commit log of projects to infer simple proxies for the coordination structures of a project. Precisely, we construct complex networks based on the co-editing of source code regions by multiple developers. Whenever we detect that a developer A changed a line of source code that was previously edited by a developer B, we draw a link from A to B. Such a link means that we assume a potential need for developer A to coordinate his or her change with developer B.
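The construction of such a co-editing network can be sketched as follows, assuming a simplified commit log in which each commit lists identifiers of the lines it touched (a stand-in for the paper's actual diff analysis):

```python
from collections import defaultdict

def co_editing_links(commits):
    """Build a directed co-editing network from a chronological list of
    (developer, edited_lines) pairs. A link A -> B means developer A
    changed a line that developer B last edited."""
    last_editor = {}           # line id -> developer who last touched it
    links = defaultdict(int)   # (A, B) -> number of co-edits
    for dev, lines in commits:
        for line in lines:
            prev = last_editor.get(line)
            if prev is not None and prev != dev:
                links[(dev, prev)] += 1
            last_editor[line] = dev
    return dict(links)

history = [
    ("bob",   {"main.c:10", "main.c:11"}),
    ("alice", {"main.c:10"}),              # alice edits bob's line
    ("carol", {"main.c:11", "util.c:3"}),  # carol edits bob's other line
]
print(co_editing_links(history))
# {('alice', 'bob'): 1, ('carol', 'bob'): 1}
```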

The results of this procedure are co-editing networks that can be constructed for different time ranges and projects. We can now study how these networks change as teams increase in size. What we find is that, in line with Brooks' argument on increasing coordination and communication effort, the number of links in the co-editing networks tends to grow super-linearly as teams grow larger. As a result, the coordination overhead for each team member is likely to increase as the team grows, providing an explanation for the decreasing code production. Moreover, by fitting a model that allows us to estimate the speed at which co-editing networks grow in different projects, we find a statistically significant relation between the growth dynamics of co-editing links and the scaling factor for the decrease of productivity. This finding indicates that the management of a project and the resulting coordination structures can significantly influence the productivity of team members, thus amplifying or mitigating the strength of the Ringelmann effect as the team grows.

So, do developers really become more productive as teams grow larger? Does the whole team really produce more than the sum of its team members? Or do we find evidence for Brooks' law and the Ringelmann effect? Based on our large-scale data analysis of more than 580,000 commits by more than 30,000 developers in 58 Open Source Software projects, we can safely conclude that there is a strong Ringelmann effect in all of the studied Open Source projects. As expected based on basic software engineering wisdom, our findings show that developers in larger teams indeed tend to produce less code than developers in smaller teams. Our analysis of time-evolving co-editing networks constructed from the commit log history further suggests that the increasing coordination overhead imposed by larger teams is a factor that drives the decrease of developer productivity as teams grow larger.

In summary, Open Source projects seem to be no magical exceptions from the basic principles of collaborative software engineering. Our study demonstrates how data science and network analysis techniques can provide actionable insights into software engineering processes and project management. It further shows how the application of computational techniques to large amounts of publicly available data on social organizations allows us to study hypotheses relevant to social psychology. As such, it highlights interesting relations between empirical software engineering and computational social science that hold great potential for future work.

[1] Ingo Scholtes, Pavlin Mavrodiev, Frank Schweitzer: From Aristotle to Ringelmann: a large-scale analysis of productivity and coordination in Open Source Software projects, Empirical Software Engineering, Volume 21, Issue 2, pp 642-683, April 2016, available online

Sunday, August 14, 2016

Release management in Open Source projects

By: Martin Michlmayr (@MartinMichlmayr)
Associate editor: Stefano Zacchiroli (@zacchiro)

Open source software is widely used today. While there is not a single development method for open source, many successful open source projects are based on widely distributed development models with many independent contributors working together. Traditionally, distributed software development has often been seen as inefficient due to the high level of communication and coordination required during the software development process. Open source has clearly shown that successful software can be developed in a distributed manner.

The open source community has over time introduced many collaboration systems, such as version control systems and mailing lists, and processes that foster this collaborative development style and improve coordination. In addition to implementing efficient collaboration systems and processes, it has been argued that open source development works because it aims to reduce the level of coordination needed. This is because development is done in parallel streams by independent contributors who work on self-selected tasks. Contributors can work independently and coordination is only required to integrate their work with others.

Relatively little attention has been paid in the literature to release management in open source projects. Release management, which involves the planning and coordination of software releases and the overall management of releases throughout the life cycle, can be studied from many different angles. I investigated release management as part of my PhD from the point of view of coordination theory. If open source works so well because of various mechanisms that reduce the level of coordination required, what implications does this have for release management, which is a time in the development process when everyone needs to come together to align their work?

Complexity of releases

As it turns out, my study on quality problems highlighted that release management can be a very problematic part of open source production. Several projects described constant delays with their releases, leading to software which is out-of-date or has other problems. Fundamentally, release management relies on trust. Contributors have to trust release managers and the deadlines imposed by them; otherwise the deadlines are simply ignored. This leads to a self-fulfilling prophecy: because developers don't believe a release will occur on time, they continue to make changes and add new code, which delays the release further. It's very hard to break such a vicious circle.

It's important to consider why creating alignment in open source projects can be a challenge. I identified three factors that made coordination difficult:
  1. Complexity: alignment is harder to achieve in large projects and many successful open source projects have hundreds of developers.
  2. Decentralization: many open source projects are distributed, which can create communication challenges.
  3. Voluntary nature: it's important to emphasize that this does not mean that contributors are unpaid. While some open source contributors are unpaid, increasingly open source development is done by developers employed or sponsored by corporations. The Linux kernel is a good example with developers from hundreds of companies, such as Intel and Red Hat. What I mean by voluntary nature is that the project itself has no control over the contributors. The companies (in the case of paid developers) define what work gets done and unpaid contributors generally "scratch their own itch". What this means is that it's difficult for a release manager or a project leader to tell everyone to align their work at the same time.

Time-based releases

While my research has shown problems with release management in many projects, it has also identified a novel approach to release management employed by an increasing number of projects. Instead of doing releases based on new features and functionality, which has historically been the way releases are done, releases are made based on time. Time-based releases are like a release train: there is a clear timetable by which developers can orient themselves to plan their work. If they want to make the next train (release), they know when they have to be ready.

Time-based releases work particularly well if the releases are predictable (for example, every X months) and frequent. If a release is predictable, developers can plan accordingly. If a release is fairly frequent, it doesn't matter if you miss a release because you can simply target the next one. You can think of a train station with a train that is leaving. Should you run? If you know the next train will leave soon and you can trust that the next train will leave on time, there is no need to hurry — you can avoid a dangerous sprint across the platform. Similarly, if the next release is near and predictable, you don't need to disrupt the project by making last minute changes.

Additionally, frequent releases give developers practice at making releases. If a project does releases only every few years, it's very hard to make the release process an integral part of the development cycle. When releases are done on a regular basis (say every three or six months), the release process can work like a machine: it becomes part of the process.

Time-based releases are a good mechanism for release management in large open source projects. Speaking in terms of coordination theory, time-based releases decrease the level of coordination required because the predictable timetable allows developers to plan for themselves. The timetable is an important coordination mechanism in its own right.

Looking at various open source projects, time-based release management is implemented in different ways. For example, GNOME and Ubuntu follow a relatively strict frequency of six months. This is frequent, predictable and easy to understand. The Linux kernel employs a 2 week "merge window" in which new features are accepted. This is followed by a number of release candidates with bug fixes (but no new features) until the software is ready for release, typically after 7 or 8 release candidates. Debian follows a model where the freeze date is announced in advance. The time between freeze and release depends on how fast defects get fixed. Debian's freeze dates are more than 2 years apart, which in my opinion leads to challenges because the release process is not performed often enough to become a routine.
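As a toy illustration of how a fixed timetable lets contributors plan for themselves, here is a sketch that computes the next release under a strict cadence. The dates and the six-month cadence are illustrative (loosely modeled on the GNOME/Ubuntu cycle), and the sketch works at month granularity, ignoring day-of-month edge cases:

```python
from datetime import date

def next_release(reference: date, today: date, cadence_months: int = 6) -> date:
    """Next release date under a fixed time-based cadence.
    `reference` is any past release date anchoring the timetable."""
    # whole months elapsed since the reference release (month granularity)
    months = (today.year - reference.year) * 12 + (today.month - reference.month)
    steps = months // cadence_months + 1
    total = reference.month - 1 + steps * cadence_months
    return reference.replace(year=reference.year + total // 12,
                             month=total % 12 + 1)

# Anchored on a hypothetical April 21 release with a six-month cadence:
print(next_release(date(2016, 4, 21), date(2016, 8, 7)))  # → 2016-10-21
```

A developer who misses the October train knows, without asking anyone, that the next one leaves in April; the timetable itself does the coordinating.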

Recent changes

There have been many changes recently in the industry and development community that have influenced release management. One change relates to the release frequency, which has been going up in several projects (such as U-Boot, which moved from a release every three months to one every two months in early 2016). This may stem from the need to serve updates to users more frequently because the technology is changing so rapidly. It could also be because the release train is working so well and the cost of doing releases has gone down. Frequent releases lead to a number of questions, though. For example, do users prefer small, incremental updates over larger updates? Furthermore, how can you support old releases, or are users expected to upgrade to the latest version immediately (a model employed by various app stores for mobile devices)?

In addition to more frequent releases, I also observe that some projects have stopped making releases altogether. In a world of Continuous Integration (CI) and Continuous Deployment (CD), does it make sense to simply offer the latest changes to users, given that every change can be assured to have been sufficiently tested? Is there still value in performing separate releases when you have CI/CD?

I believe more research is needed to understand release management in our rapidly changing world, but one thing is clear to me: studying how contributors with different interests come together to produce and release software is fascinating!


Sunday, August 7, 2016

Architecture-Based Self-Protecting Software Systems

Eric Yuan and Sam Malek,
Software Engineering and Analysis Lab,
University of California, Irvine

Associate Editor: Mehdi Mirakhorli (@MehdiMirakhorli)

Security remains one of the principal concerns for modern software systems. In spite of the significant progress over the past few decades, the challenges posed by security are more prevalent than ever before. As the awareness grows of the limitations of traditional, static security models, current research shifts to dynamic and adaptive approaches, where security threats are detected and mitigated at runtime, namely, self-protection.  Self-protection has been identified by Kephart and Chess [1] as one of the essential traits of self-management for autonomic computing systems.  From a “reactive” perspective, the system automatically defends against malicious attacks or cascading failures, while from a “proactive” perspective, the system anticipates security problems in the future and takes steps to mitigate them.  My systematic survey of this research area [2] shows that although existing research has made significant progress towards autonomic and adaptive security, gaps and challenges remain. Most prominently, self-protection research to-date has primarily focused on specific line(s) of defense (e.g., network, host, or middleware) within a software system.  Such approaches tend to focus on a specific type or category of threats, implement a single strategy or technique, and/or protect a particular component or layer of the system.  In contrast, little research has provided a holistic understanding of overall security posture and concerted defense strategies and tactics.  

In this research project, we are making a case for an architecture-based self-protection (ABSP) approach to address the aforementioned challenges. In ABSP, detection and mitigation of security threats are informed by an architectural representation of the software that is kept in sync with the running system. An architectural focus enables the approach to assess the overall security posture of the system and to achieve defense in depth, as opposed to point solutions that operate at the perimeters. By representing the internal dependencies among the system's constituents, ABSP provides a better platform to address challenging threats such as insider attacks. The architectural representation also allows the system to reason about the impact of a security breach on the system, which would inform the recovery process. 

To prove the feasibility of the ABSP approach, we have designed and implemented an architecture-based, use case-driven framework, dubbed ARchitectural-level Mining Of Undesired behavioR (ARMOUR), that involves mining software component interactions from system execution history and applying the mined architecture model to autonomously identify and mitigate potential malicious behavior.  

The first step towards ABSP is the timely and accurate detection of security compromises and software vulnerabilities at runtime, which is a daunting task in its own right. To that end, the ARMOUR framework starts with monitoring component-based interaction events at runtime, and using machine learning methods to capture a set of probabilistic association rules or patterns that serve as a normal system behavior model. The framework then applies the model with an adaptive detection algorithm to efficiently identify potential malicious events. From the machine learning perspective, we identified and tailored two closely related algorithms, Association Rules Mining and Generalized Sequential Pattern Mining, as the core data mining methods for the ARMOUR framework. My evaluation of both techniques against a real Emergency Deployment System (EDS) has demonstrated very promising results [3,4,5].  In addition to threat detection, the ARMOUR framework also calls for the autonomic assessment of the impact of potential threats on the target system and mitigation of such threats at runtime. In a recent work [6], we have shown how this approach can be achieved through (a) modeling the system using machine-understandable representations, (b) incorporating security objectives as part of the system's architectural properties that can be monitored and reasoned with, and (c) making use of autonomous computing principles and techniques to dynamically adapt the system at runtime in response to security threats, without necessarily modifying any of the individual components. Specifically, we illustrated several architecture-level self-protection patterns that provide reusable detection and mitigation strategies against well-known web application security threats.
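A heavily simplified sketch of the association-rules idea (not the ARMOUR implementation): mine frequent pairs of component-interaction events from observation windows, keep rules whose support and confidence clear a threshold, and treat deviations from those rules as candidate anomalies. The event names and thresholds are made up for illustration:

```python
from collections import Counter
from itertools import combinations

def mine_rules(sessions, min_support=0.5, min_confidence=0.8):
    """Tiny association-rule miner over sets of interaction events
    (one set per observation window). Emits rules A -> B whose support
    and confidence exceed the thresholds; at runtime, a window that
    triggers A without B would be flagged for closer inspection."""
    n = len(sessions)
    item_counts = Counter(e for s in sessions for e in set(s))
    pair_counts = Counter(p for s in sessions
                          for p in combinations(sorted(set(s)), 2))
    rules = {}
    for (a, b), c in pair_counts.items():
        if c / n < min_support:
            continue  # pair too rare to generalize from
        for lhs, rhs in ((a, b), (b, a)):
            conf = c / item_counts[lhs]
            if conf >= min_confidence:
                rules[(lhs, rhs)] = conf
    return rules

# Hypothetical interaction logs: login and db-query always co-occur
sessions = [
    {"login", "db-query"},
    {"login", "db-query", "render"},
    {"login", "db-query"},
    {"render"},
]
rules = mine_rules(sessions)
```

Here `mine_rules` learns that `login` and `db-query` imply each other with full confidence, while the rarer `render` event produces no rule; a window containing `db-query` but no `login` would then stand out as abnormal behavior.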

The high-level architecture of the framework is depicted in the diagram below:

My work outlined in this project makes a convincing case for the hitherto overlooked role of software architecture in software security, especially software self-protection. The ABSP approach complements existing security mechanisms and provides additional defense-in-depth for software systems against ever-increasing security threats. By implementing self-protection as orthogonal architecture concerns, separate from application logic (as shown in the diagram), this approach also allows self-protection mechanisms to evolve independently, to quickly adapt to emerging threats. 

  1. Kephart, J., and Chess, D. The vision of autonomic computing. Computer 36, 1 (Jan. 2003), 41–50.
  2. Yuan, E., Esfahani, N., and Malek, S. A Systematic Survey of Self-Protecting Software Systems. ACM Trans. Auton. Adapt. Syst. (TAAS) 8, 4 (Jan. 2014), 17:1–17:41.
  3. Esfahani, N., Yuan, E., Canavera, K. R., and Malek, S. Inferring software component interaction dependencies for adaptation support. ACM Trans. Auton. Adapt. Syst. (TAAS) 10, 4 (2016), 26.
  4. Yuan, E., Esfahani, N., and Malek, S. Automated Mining of Software Component Interactions for Self-adaptation. In Proceedings of the 9th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (New York, NY, USA, 2014), SEAMS 2014, ACM, pp. 27–36.
  5. Yuan, E., and Malek, S. Mining software component interactions to detect security threats at the architectural level. In Proceedings of the 13th Working IEEE/IFIP Conference on Software Architecture (Venice, Italy, Apr. 2016), WICSA 2016.
  6. Yuan, E., Malek, S., Schmerl, B., Garlan, D., and Gennari, J. Architecture-based Self-protecting Software Systems. In Proceedings of the 9th International ACM Sigsoft Conference on Quality of Software Architectures (New York, NY, USA, 2013), QoSA '13, ACM, pp. 33–42.

Sunday, July 31, 2016

Point-Counterpoint on the length of articles in IEEE Software

By: Diomidis Spinellis (@coolsweng), Christof Ebert (@christofebert)

Point — Diomidis Spinellis

In order to increase the number of articles and theme issues we publish, we will look at limiting the published article length to 3000 words. This will allow us to publish more articles and two themes in one issue. It should also reduce the length of our queue and therefore the time it takes for us to publish articles. We are huge believers in using reader input and data to drive our decisions. Therefore, we're presenting here a point-counterpoint regarding this decision and eagerly await your comments. Furthermore, as an experiment, our Theme Issue Associate Editor, Henry Muccini, will adjust the Call for Papers for two theme issues to include this limit, and we will see how this affects submissions, acceptance rates, and article downloads.

The 3000-word limit may appear to offer too little space to fit all we're asking for. The inspiration behind the idea comes from journals we admire. Consider the reports in Science, one of the most prestigious journals in the world. These are limited to about 2500 words including references, notes, and captions, or about three printed pages. Materials and methods are typically included in online supplementary materials, which also often include information needed to support the paper's conclusions. (See <> for more details.) I have asked Computer Society staff whether we could also provide ancillary material online, and I found this is indeed possible. We call these supplements Web Extras, and we frequently do include them. The staff do need to review them and, if needed, edit them. As an example, here is an abstract and a pointer to online material from an article appearing in our July/August 2015 issue, titled "Team Performance in Software Development: Research Results versus Agile Principles".

“Abstract: This article reviews scientific studies of factors influencing co-located development teams’ performance and proposes five factors that strongly affect performance. In the process, it compares these propositions with the Agile Manifesto’s development principles. The Web extra at details the sources and research methods the authors employed.”

Counterpoint — Christof Ebert

Indeed, all that can be said can be said in brevity. Yet this is challenging given the ambition of our articles and the expectations of our readers. Science is a bit special, but a good role model to aspire to. It covers such a wide variety of fields that it can only survive as an abstracts journal; that is not yet our readers' perception of Software. On the other hand, magazines such as HBR, McKinsey and BCG publish articles which are rather long - and are still read.
We as readers - and leading practitioners - look for material that is down to earth and thus goes beyond hype and marketing: foundations, context, background, industry case studies, and so on. Providing all that in 3000 words is not easy. Your example is an empirical study, which certainly has to point to additional backup data. But does this hold for all types of content in IEEE Software? Stimulating further clicking for more information is appealing, but due to time constraints we rarely actually do it. Many of my industry colleagues these days read articles as PDFs "on the fly" in a true sense, without clicking through to further information.
Your suggested pilot is better than fiddling around with assumptions. Following the expectations we set for our authors, we should avoid judging without data. So the pilot needs to measure:
- if and how people access and read these supplementary materials, compared to the current setting
- how much of the additional material is accessed, and with what click-through rates
- how readers cite this new article format
- how reader and subscriber numbers evolve, etc.
If the pilot provides solid measurements, we can derive conclusions not only on preferred length but also how additional materials are actually used.

Diomidis Responds

I agree that positioning ourselves in a way that best serves our readers is important. Do professionals really need to read detailed methods and statistical analyses? These are mainly required to help reviewers and researchers evaluate the validity of the findings. As Christof correctly points out, the pilot will help us see whether publishing shorter articles is a good move, and the excellent metrics he suggests are what we will need to look at. Let's keep in mind that we may also receive backlash from authors. I happen to believe that shorter articles are more crisp and easier to read, but I understand that others may feel differently.

Christof Responds

Indeed, not all readers need every detail inline. IEEE Software is extremely successful in its current format, balancing strong content with good editing. Every now and then it is necessary to test new schemes; those who don't change will disappear. Having said that, we should definitely go with the pilot and collect its measurements over 6-12 months. With some evaluation, we can then take decisions beyond only looking at size. This will not be a short-term exercise.

Sunday, July 24, 2016

Minor contributors correlate to bugginess. But not when they're code reviewers.

By: Patanamon (Pick) Thongtanunam, Nara Institute of Science and Technology, Japan (@pamon)
Associate Editor: Bogdan Vasilescu, University of California, Davis. USA (@b_vasilescu)

Weak code ownership correlates with poor software quality. Code ownership is a common practice in large, distributed software development teams. It is used to establish a chain of responsibility (who to blame if there is a problem) and simplify management (to whom a task or bug fix should be assigned). A simple intuition for estimating code ownership is that the developer who has written the majority of the code in a module should be its owner. Moreover, prior research found that a module with weak code ownership (i.e., one written by many minor authors) is more likely to have bugs in the future [1].

Nowadays, development practices involve more than just writing code. Tool-based code review has become tightly integrated with the software development cycle. Recent research has found that, in addition to hunting defects, reviewers also help an author improve the code changes [2,3]. Hence, code writing and reviewing activities are orthogonal: a team can have a developer who reviews a lot but writes little, and vice versa.

Does code review activity change what we know about ownership and software quality? This led us to investigate the importance of code review activities for code ownership and software quality [4]. Through an empirical study of Qt and OpenStack systems, we (1) investigated the code authoring and reviewing activities of developers, (2) refined code ownership using code reviewing activities, and (3) studied the relationship between our refined ownership and software quality.

Code reviewers are the majority of contributors in a module

We found that developers who had not previously written any code changes but only reviewed them make up the largest proportion of contributors to a module (67%-86% of contributors at the median). Moreover, 18%-50% of these review-only developers are documented core developers of the Qt and OpenStack projects. These findings suggest that a code ownership estimation that considers only code authoring activities misses many developers who also made reviewing contributions to a module.

Figure 1: Refined code ownership

Many minor authors are actually major reviewers

We observe the amount of code authoring and reviewing contributions that developers made to a module and classify developers into two levels of expertise, i.e., major and minor, in each dimension (Fig 1).

Traditional code ownership (TCO) is derived solely from code authoring activities, while review-specific ownership (RSO) is derived solely from code reviewing activities. The interesting part of Figure 1 is the minor authors, since these developers would traditionally be classified as low-expertise developers.

However, we found that 13%-58% of minor authors are major reviewers, i.e., they actually reviewed many of the code changes to a module. This finding suggests that many developers who make large contributions to a module by reviewing code changes are misclassified as low-expertise developers because of their low code authoring activity.
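To make the two-dimensional classification concrete, here is a minimal sketch of how a module's contributors could be placed into the four quadrants of Figure 1. The 5% ownership threshold is borrowed from prior ownership work and is an assumption for illustration, not necessarily the exact method of the paper; the developer names and counts are invented.

```python
# Hypothetical sketch: classify a module's contributors by authoring vs.
# reviewing expertise, in the spirit of Figure 1. A developer is "major"
# in a dimension if they contributed at least 5% of that activity.

def classify_contributors(authored, reviewed, threshold=0.05):
    """authored/reviewed: dicts mapping developer -> number of changes
    authored or reviewed in the module.
    Returns developer -> (authoring level, reviewing level)."""
    total_authored = sum(authored.values()) or 1
    total_reviewed = sum(reviewed.values()) or 1
    levels = {}
    for dev in set(authored) | set(reviewed):
        author_share = authored.get(dev, 0) / total_authored
        review_share = reviewed.get(dev, 0) / total_reviewed
        levels[dev] = (
            "major author" if author_share >= threshold else "minor author",
            "major reviewer" if review_share >= threshold else "minor reviewer",
        )
    return levels

# bob wrote almost nothing but reviewed most changes: a minor author
# who is actually a major reviewer.
authored = {"alice": 40, "bob": 1}
reviewed = {"bob": 30, "carol": 10}
print(classify_contributors(authored, reviewed))
```

Under this sketch, "bob" lands exactly in the interesting quadrant of Figure 1: minor author, major reviewer.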

Reviewing expertise reverses the relationship between authoring expertise and software quality

We further investigated whether reviewing expertise has an impact on software quality. Hence, we compared the rates of developers at each level of expertise between defective and clean modules. We found that the rates of developers who are both minor authors and minor reviewers are higher in defective modules than in clean modules (the left bean plot in Figure 2). On the other hand, the rates of developers who are minor authors but major reviewers are lower in defective modules than in clean modules (the right bean plot in Figure 2). When we control for several confounding factors using statistical models, the rates of developers who are both minor authors and minor reviewers still share a strong increasing relationship with defect-proneness. These results indicate that reviewing expertise shares a relationship with software quality and can reverse the direction of the association between minor authorship and defect-proneness.

Figure 2: The relationship between minor authors and defect-proneness in Qt version 5.0.0

Practical suggestions

Our findings lead us to believe that code reviewing activity captures an important aspect of code ownership. Therefore, future estimations of code ownership should take code review activity into consideration in order to accurately model the contributions that developers make as software systems evolve. Such code ownership estimations can also be used to chart quality improvement plans. For example, teams should apply additional scrutiny to module contributions from developers who have neither authored nor reviewed many code changes to that module in the past, while a module with many developers who have not authored many code changes should not be considered risky if those developers have reviewed many of the code changes to that module.
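The "additional scrutiny" rule above could be sketched as a simple filter in a review bot. This is an illustrative sketch, not the paper's tooling; the ownership shares, module names, and 5% threshold are invented for the example.

```python
# Hypothetical sketch: flag incoming changes whose contributor is BOTH a
# minor author and a minor reviewer of the touched module; contributions
# from major reviewers are not flagged even if the developer wrote little.

def risky_changes(changes, ownership, threshold=0.05):
    """changes: list of (developer, module) pairs awaiting review.
    ownership: (developer, module) -> (author_share, review_share).
    Returns the subset of changes deserving extra scrutiny."""
    flagged = []
    for dev, module in changes:
        author_share, review_share = ownership.get((dev, module), (0.0, 0.0))
        if author_share < threshold and review_share < threshold:
            flagged.append((dev, module))
    return flagged

ownership = {
    ("bob", "net"): (0.01, 0.40),   # minor author, major reviewer: trusted
    ("dan", "net"): (0.01, 0.01),   # stranger to the module: flag
}
print(risky_changes([("bob", "net"), ("dan", "net")], ownership))
```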


[1] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu, “Don’t Touch My Code! Examining the Effects of Ownership on Software Quality,” in Proceedings of the 8th joint meeting of the European Software Engineering Conference and the International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2011, pp. 4–14.
[2] A. Bacchelli and C. Bird, “Expectations, Outcomes, and Challenges Of Modern Code Review,” in Proceedings of the 35th International Conference on Software Engineering (ICSE), 2013, pp. 712–721.
[3] P. C. Rigby and C. Bird, “Convergent Contemporary Software Peer Review Practices,” in Proceedings of the 9th joint meeting of the European Software Engineering Conference and the International Symposium on the Foundations of Software Engineering (ESEC/FSE), 2013, pp. 202–212.
[4] P. Thongtanunam, S. Mcintosh, A. E. Hassan, and H. Iida, “Revisiting Code Ownership and its Relationship with Software Quality in the Scope of Modern Code Review,” in Proceedings of the 38th International Conference on Software Engineering (ICSE), 2016, pp. 1039–1050.

Sunday, July 17, 2016

Your Local Coffee Shop Performs Resource Scaling

Marios-Eleftherios Fokaefs, York University, Toronto, Canada
Associate Editor: Zhen Ming (Jack) Jiang, York University, Toronto, Canada 

Ever since I moved to Canada about 8 years ago, I have been an avid Starbucks customer, primarily because it was one of the few places where I could find a decent iced coffee. As a Greek, I was bound by destiny and tradition to keep drinking iced coffee (asking for a cold beverage at -35℃ in Alberta rendered baristas speechless, which was funny to watch). When I moved to York University for my postdoc and found the closest Starbucks to initiate my everyday routine, Starbucks happened to launch the "Mobile Order & Pay" feature. At the same time, with the new group at the CERAS lab, I got better acquainted with cloud computing and with concepts such as self-adaptive systems and cloud elasticity. Given these two facts, one morning I was waiting in a rather long line at Starbucks when I noticed one of the employees coming out with a cart full of empty cups and taking orders from the people in the line. I also noticed that baristas are highly efficient and multi-task when crafting the beverages, while customers are rather slow in comparison when ordering or paying. "Here is resource scaling and process adaptation in practice!", I thought. The employees had noticed the delay in ordering, and they decided to speed up the process, parallelize it with paying, and take advantage of the much faster crafting process.

Stepping back a few years, when I started my graduate studies at the University of Alberta, my then supervisor, Dr. Eleni Stroulia, recommended that we read Gregor Hohpe's paper "Your Coffee Shop Doesn't Use Two-Phase Commit" in IEEE Software [1] as an introduction to web services and processes. In that paper, the author draws an analogy between how Starbucks executes orders (the choice of coffee shop is completely coincidental, I swear!) and software processes, fault tolerance, and rollback. In this post, I will make a similar attempt, again using Starbucks as my example, to explain how resource scaling in the cloud works. The post is split into four parts, where I lay out the details on resources and processes (part 1) as employed by the Starbucks system and the equivalent web software system on the cloud, monitoring and analysis of performance metrics (part 2), planning and execution of scaling and adaptive actions (part 3), and the economics of scaling (part 4) in both systems.

1. Processes, Resources and Topologies

Starbucks is primarily a service, which means that at the center of its processes are people and human tasks. The people who participate in the service are the customers, who issue orders and pay, and the employees, who can be divided into tellers and baristas (i.e., the ones who prepare the beverages or other orders). From a systems perspective, the customers provide the input to the service, and the baristas, along with special equipment and raw materials (coffee, milk, etc.), are the resources with which the requests are executed and served.

Figure 1. The Starbucks system and the flow of orders.

Figure 1 shows the overall flow of the Starbucks system. As clients start their interaction with the system, they enter a queue. The first interface of the system is the cash register. At this point, a client may issue an order. The order can be anything or any combination of hot or cold beverages, food and dessert items, or packaged goods. Different orders may require different processes, different equipment, and obviously different preparation times. This is an advantage for the baristas, since they can parallelize several orders and speed up the whole order process. For example, some hot beverages need the espresso machine, the milk steamer, or both, while cold beverages may need the blender. For many of these drinks the sets of required equipment are completely independent, which allows the baristas to execute them simultaneously. The same holds for drinks and food items, as the latter may only need to be heated. Some orders, such as brewed coffee, tea, or packaged items, are so simple that they can be executed on the spot by the cashier; for these, the wait time is almost insignificant. Orders are received and executed on a first-in-first-out basis. However, due to the variation in execution time, a customer may receive their order later than another customer who ordered after them. An interesting characteristic of the Starbucks system concerning its human resources is that relatively little training is involved, so every employee can assume any role at any time with little or no impact on the process. This flexibility allows the system to reassign its resources to address needs as they appear, e.g., assign more baristas or more tellers to take orders, given the equipment restrictions.
Once the order has been placed, payment must be received. As with orders, payment comes in many forms, which affect the time in which it is processed. For example, cash or Starbucks cards are processed quickly, while credit and debit cards take more time, even without the odd failure. In general, execution of an order starts immediately after it is placed, and the payment has usually been processed before the drink is ready. This means that the client may need to wait longer before receiving the end product of the order, depending also on the backlog of orders accumulated by the time the order was placed.

On the Cloud…

Using the Starbucks system as a basis, I will explain in similar terms how software services operate using cloud resources. One basic difference is that such systems do not rely as heavily on humans, as most operations are automatic (or at the very least interactive) and executed by software. Nevertheless, the input can come either from other software systems or from humans (like the Starbucks clients). The difference is that software clients have a much higher capacity than humans for issuing requests, as they have little think time; once a response is received it can be quickly processed and a new request issued immediately. Other than that, the systems are basically similar: requests come in (like orders), the requests may be of different natures, thus requiring different resources and taking variable time to process, and the clients remain in a queue while waiting for their requests to finish.

Figure 2. The topology of a web software system on the cloud and the flow of requests.

Figure 2 shows the equivalent system of a web application deployed on a cloud, along with the flow of request processing. The system we are considering is a simple three-tier architecture, where the clients issue requests through an interface, the requests are dispersed by a load balancer to copies of the application on a number of application servers, where they are processed, and, if needed, a database is accessed to fetch or store data. The load balancer serves as the queue of the Starbucks system. At design time, we may set the load balancer to distribute requests in a generic manner (e.g., round-robin, or by current load) or be more sophisticated and distribute the requests to specific server clusters according to their individual demands for resources (e.g., CPU, memory, disk, etc.). The latter case, which is closer to the Starbucks scenario with multiple types of orders, has interesting and important economic implications, which we will discuss in Part 4. Exactly like the Starbucks system, requests affect each other as they take up resources, which may lead to delays and longer queues.
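The two dispatch policies just mentioned can be sketched side by side. This is an illustrative toy balancer, not any real product's API; the cluster names, server loads, and request demand vector are invented for the example.

```python
# Toy sketch of the two load-balancing policies described above:
# a generic round-robin dispatcher vs. a resource-aware one that routes
# each request to the cluster specialized for its dominant resource demand.
import itertools

class LoadBalancer:
    def __init__(self, clusters):
        self.clusters = clusters  # cluster name -> list of servers
        self._rr = itertools.cycle(
            s for servers in clusters.values() for s in servers)

    def round_robin(self, request):
        return next(self._rr)  # ignores the request's resource profile

    def resource_aware(self, request):
        # route to the cluster matching the request's dominant demand,
        # falling back to the general-purpose cluster
        dominant = max(request["demand"], key=request["demand"].get)
        cluster = self.clusters.get(dominant, self.clusters["general"])
        return min(cluster, key=lambda s: s["load"])  # least-loaded server

lb = LoadBalancer({
    "cpu":     [{"name": "cpu-1", "load": 3}, {"name": "cpu-2", "load": 1}],
    "memory":  [{"name": "mem-1", "load": 0}],
    "general": [{"name": "gen-1", "load": 2}],
})
req = {"demand": {"cpu": 0.8, "memory": 0.1, "disk": 0.1}}
print(lb.resource_aware(req)["name"])  # least-loaded server in the CPU cluster
```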

Unlike the Starbucks system, it is not as easy or seamless to repurpose resources and assign them a different task at runtime; an application server cannot become a database server with the snap of a finger. However, to compensate for this limitation, cloud environments offer additional flexibility in commissioning and decommissioning resources. Since we deal with virtual rather than physical resources on the cloud layer, we can easily boot up a new server and have it working in a matter of minutes or even seconds. When we no longer need it, we can stop it without affecting the functionality or the performance of the overall system. This strategy is not equally possible in the Starbucks system, because we cannot hire or fire people on the spot for a very short period, nor can we call in an employee at a moment's notice during rush hour. We will further discuss this special ability of cloud computing in Part 3.

2. Monitoring and analysis

When considering quality of service, we need to pinpoint the metrics that should be monitored with respect to the system's health, in order to identify any potential performance problems. Performance is crucial for interactive systems, as it will be perceived as quality by the end clients. In the Starbucks system, quality is determined by customers based on the various wait times they have to endure, as shown in Figure 3 by the red-yellow stars. Customers have to wait to order (the original queue), to pay (based on payment processing times), and finally to receive their order, after all the preparation and crafting has finished.

Figure 3. The Starbucks system along with the monitored metrics.

Interestingly enough, order and pay wait times do not depend entirely on the system's response capacity, given the primarily interactive nature of the system. While order wait time is partly waiting in line, a big chunk of it is waiting for the customer to decide what they want to order and then actually order it. For those familiar with the Starbucks menu, you can imagine that this is not always a trivial task. After the order has been placed, the cashier has to work out all the details of the particular order in an interactive manner ("Is 2% milk fine?", "Would you like that sweetened?"). Remarkably, perceived quality is determined not so much by the actual order wait time as by the wait time in the queue. This is because in the queue the customer is inactive, while during ordering there is interaction, which is understandable and acceptable. Pay wait time depends on the particular payment type: credit and debit card payments take longer based on the responsiveness of the card service, not considering any potential failures and retries, while payment with the Starbucks rewards and cash cards is most often the fastest.

The processing and wait times for the back-end processes of the system, i.e., the preparation of the orders, are usually regulated and within constraints imposed by the resources. More often than not, baristas craft beverages more efficiently than customers order them. The performance of the equipment is standardized. Therefore, the overall performance of the process can be improved only by adding more resources, baristas or equipment. However, adding too many resources can actually create problems, as suggested in Fred Brooks' book "The Mythical Man-Month".

On the Cloud…

To a large degree, the quality of any system is perceived based on its responsiveness, i.e., its ability to respond in a timely fashion. The same holds for most software systems on the cloud. As shown in Figure 4 (also as red-yellow stars), the software system's final response time depends on the performance of the individual resources participating in a request, including computation, storage, and network resources. Similarly to the Starbucks system, the client is also responsible for producing some wait time while preparing to issue the request. However, this time is not perceived by the system, since the response time is measured from the moment the request is received, and there is little or no waiting in the queue, as requests are usually processed in parallel by multithreaded applications. Nevertheless, this time, known as think time, is important when considered alongside the response time: if the response time is lower (or even significantly lower) than the think time, it is not perceived as strongly by the customer. In the opposite case, if the think time is low (especially for software clients), even a slight increase in the response time will be noticed by the client. This property is helpful when considering and setting performance goals.
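A toy calculation makes the think-time argument concrete: the same response time can be nearly invisible to a slow human client yet dominate the experience of a fast software client. The numbers below are purely illustrative.

```python
# Toy model of the observation above: what fraction of each request cycle
# the client spends actively waiting for a response. When think time dwarfs
# response time, the wait barely registers; when think time is tiny
# (a software client), the same response time is almost the whole cycle.

def perceived_slowdown(response_time, think_time):
    """Fraction of a request cycle spent waiting on the system."""
    return response_time / (response_time + think_time)

# Human browsing: a 2 s response against 30 s of think time barely registers.
print(round(perceived_slowdown(2.0, 30.0), 2))
# Software client: the same 2 s response dominates a 0.1 s think time.
print(round(perceived_slowdown(2.0, 0.1), 2))
```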

Figure 4. A software system on the cloud along with the monitored metrics.

Concerning the performance of the back-end resources, we have to take into account the demand of the various requests for CPU, memory, disk, network, and so on. These demands can be roughly estimated for each class of requests during the implementation of the application, in the profiling process. Given the performance specifications of the cloud resources and the demands of requests, we can also estimate the overall resource needs of the application and predict its performance under certain workloads. With this knowledge, we can also address any fluctuations in the workload by dynamically allocating or deallocating resources. Unlike the Starbucks system, resources in the cloud can change in number and in size in a more flexible and volatile manner, since we are talking about virtual resources, and there are fewer significant constraints with respect to their number, since they do not affect each other to a large degree. The only constraints are imposed by the underlying hardware, which may or may not lie within the application owner's control; ultimately, how many resources will be allocated is a matter of cost.
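The prediction step described above can be sketched with the standard utilization law from operational analysis: a resource's utilization is the sum, over request classes, of arrival rate times per-request service demand. The demand figures and request classes below are invented for illustration; real numbers would come from profiling.

```python
# Back-of-the-envelope capacity estimate: per-class resource demands
# (seconds of service per request, obtained by profiling) plus an expected
# request mix give each resource's utilization; the resource closest to
# 1.0 is the bottleneck.

demands = {                       # seconds of service time per request
    "browse":   {"cpu": 0.005, "disk": 0.002},
    "checkout": {"cpu": 0.020, "disk": 0.015},
}
workload = {"browse": 80.0, "checkout": 10.0}   # requests per second

def utilizations(demands, workload):
    """Utilization law: U_resource = sum over classes of rate * demand."""
    util = {}
    for cls, rate in workload.items():
        for resource, d in demands[cls].items():
            util[resource] = util.get(resource, 0.0) + rate * d
    return util

util = utilizations(demands, workload)
print(util)   # here the CPU (0.6) is busier than the disk (0.31)
```

If the workload doubles, the CPU saturates first (utilization 1.2 > 1.0), which is exactly the signal for the scaling actions discussed in Part 3.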

3. Scaling Planning and Execution

Having a good understanding of the Starbucks system and its resources, and having set up the monitors for the performance metrics, we can now better understand and identify the motivation behind some of the changes that Starbucks made to its process. Figure 5 shows the changes introduced in the Starbucks system, marked as numbered circles.

Figure 5. The adapted Starbucks system with the changes in numbered circles.

The carefully placed monitors (a.k.a. employees looking at very long lines and disgruntled patrons muttering under their breath about the same problem) revealed that the bottleneck, or at least one of the bottlenecks, of the Starbucks system is the order queue. Therefore, the first adaptive action the employees took was the "cup cart" (Figure 5, change 1), i.e., a small wheeled cart with empty cups of all sizes and types. An employee walks along the line with the cart, marks a cup with each customer's order(s), and then passes the cups down to the baristas. The cart does not accept orders for food, packaged items, or brewed coffee or tea, as these can be served straight at the cashier. With this change, Starbucks effectively separated the order queue from the pay queue and managed to parallelize the two processes. Having placed the order, the customers feel more relaxed, as they know that their order is already being processed while they wait in line to pay. In practice, by the time the customers pay, their order may be ready for pickup, which increases their perception of quality.

The second adaptive change took advantage of novel technology: Starbucks introduced the "Mobile Order & Pay" service through the Starbucks mobile application (Figure 5, change 2). The concept is that a customer can place an order with a specific Starbucks store, pay through their Starbucks account, and then go to pick up the order from the store. In this way, they can completely skip the queue and pick up their order right when they step into the store. The order and pay queues are bypassed entirely, and the only thing that remains is the preparation time. This too becomes manageable, as the application returns an approximate time by which the order will be ready, taking into account average preparation times and customer arrival patterns.

After both these changes, the only wait time that remains is the preparation time. As already mentioned, this time can be reduced, if necessary, by commissioning more resources, human or equipment. However, several restrictions apply to these scaling actions. For example, we cannot dynamically increase the amount of equipment for a few hours and then release whatever we no longer need. The acquisition of equipment is planned based on average customer arrivals, and as a result there are moments when equipment is underutilized and others when it is not enough and wait times temporarily increase for customers. In addition, we cannot increase the number of baristas beyond 2 or 3 at a given time, depending on the size of the store. What Starbucks does with respect to employees is identify specific times during the day or during the year (e.g., Christmas or other holidays) with increased traffic and assign more baristas or cashiers. This strategy has become the norm in almost all services and has been transferred to software services as well.

On the Cloud…

Thanks to the flexibility of cloud computing, there are a number of adaptive changes we can make to address potential performance issues of the deployed system. Figure 6 shows some of these changes, noted in numbered circles. The first change that comes to mind when considering a high response time for a software system on the cloud is to add more resources, mainly virtual machines (Figure 6, change 1). Unlike the Starbucks system, space restrictions are less prominent in cloud systems. Although such restrictions can be imposed by hardware (a physical server cannot accommodate an infinite number of virtual machines), it doesn't always come to that. And when it does, the software can be moved to a public cloud, which usually has much larger capacity than private infrastructure, so space is not an issue. On the other hand, the problem then becomes one of cost: reserving more and more virtual machines translates into more money that needs to be paid to the public cloud provider. This is a consequence we will discuss in the next part, concerning the economics of scaling. Cost is actually the motive for the second change (Figure 6, change 2): removing a virtual machine from a cluster when it is not fully utilized. If the workload can be shared by the remaining resources, the spare virtual machine can be removed to save costs.
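Changes 1 and 2 together amount to a simple threshold-based autoscaling rule, which could be sketched as follows. The 80%/30% thresholds are typical illustrative values, not a recommendation, and the floor prevents the last machine from being removed.

```python
# Sketch of the add/remove decision described above: scale out when the
# cluster runs hot (change 1), scale in when it idles (change 2),
# otherwise hold.

def scaling_action(utilizations, high=0.8, low=0.3, min_vms=1):
    """utilizations: per-VM utilization in [0, 1].
    Returns 'add', 'remove', or 'hold'."""
    avg = sum(utilizations) / len(utilizations)
    if avg > high:
        return "add"        # commission one more VM (change 1)
    if avg < low and len(utilizations) > min_vms:
        return "remove"     # decommission a spare VM (change 2)
    return "hold"

assert scaling_action([0.9, 0.95]) == "add"
assert scaling_action([0.1, 0.2]) == "remove"
assert scaling_action([0.5, 0.6]) == "hold"
```

Real autoscalers add cooldown periods and hysteresis so the cluster does not oscillate between the two actions, but the core decision is this threshold check.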

Figure 6. The adapted software system with the changes as numbered circles.

Another possible change concerns the redistribution of requests to specialized clusters (Figure 6, change 3). We assume that the load balancer of the software system already distributes the incoming requests to specialized clusters according to their demands for specific resources. However, this is done in a very straightforward manner: the balancer sends the CPU-intensive requests to the CPU cluster, the memory-intensive requests to the memory cluster, and so on. If a particular cluster is saturated, the first thought would be to add resources, as in change 1. Alternatively, since the virtual machines in the clusters actually possess all resources (CPU, memory, disk) in different configurations, we can avoid adding unnecessary resources and instead redirect requests from the saturated cluster to one that is less utilized. However, one needs to be careful not to send too many requests to other clusters, to avoid saturating their otherwise limited resources. This action therefore requires changes to the balancer software, which becomes more sophisticated and takes on management responsibilities as well.
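The spill-over logic of change 3 could be sketched as follows. This is a hypothetical routing function, not a real balancer's API; the cluster names, utilizations, and the 85% saturation cutoff are invented for the example.

```python
# Sketch of change 3: prefer the cluster specialized for the request, but
# when it is saturated, spill over to the least-utilized other cluster --
# and only if that cluster itself still has headroom.

def route(request_kind, cluster_util, saturation=0.85):
    """cluster_util: cluster name -> current utilization in [0, 1].
    Returns the cluster the request should be sent to."""
    preferred = request_kind   # e.g., a 'cpu' request prefers the 'cpu' cluster
    if cluster_util[preferred] < saturation:
        return preferred
    # spill over to the least-utilized alternative with spare capacity
    fallback = min((c for c in cluster_util if c != preferred),
                   key=cluster_util.get)
    return fallback if cluster_util[fallback] < saturation else preferred

util = {"cpu": 0.95, "memory": 0.40, "disk": 0.70}
print(route("cpu", util))   # CPU cluster saturated: redirect to memory cluster
```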

Finally, if the bottleneck is in the database requests, we can scale the data layer in a similar manner (Figure 6, change 4). Thanks to recent advances in Big Data and NoSQL database technologies, it is possible to partition data and distribute it across multiple stores, making both reads from and writes to the database faster.

4. On the Economics of Scaling

Of the potential bottlenecks we identified in Part 2, we have seen that Starbucks has paid particular attention to, and applied actions to address, the order and pay queues. Performance improvement aside, looking at the economic aspect of scaling, this focus makes sense for another reason. Waiting in long lines to order may prompt customers to abandon the endeavour and leave the store altogether. This automatically means loss of revenue for Starbucks, but it may also imply a steep drop in the customers' long-term perceived quality, which may prevent them from visiting this particular store, or Starbucks altogether, in the future. On the other hand, once customers have placed their order and are waiting to pay, they are less likely to leave the queue, and even less so after they have paid. Formally, the probability that a client will leave the system prematurely decreases as he or she progresses further through the system. Long pickup times may affect long-term quality, but customers will rarely abandon something they have already paid for. By eliminating the order and/or the pay queue, either with the wheeled cart or with the mobile app, Starbucks minimizes the risk of losing clients who have already entered the system and increases its total expected revenue.
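This argument can be made concrete with a toy revenue model in which each stage (order queue, pay queue, pickup) has its own abandonment probability, decreasing as the customer progresses. All probabilities and the $5 average order value are invented for illustration.

```python
# Toy model of the claim above: removing early, high-abandonment stages
# raises the expected revenue per batch of arriving customers.

def expected_revenue(stages, order_value=5.0, customers=100):
    """stages: per-stage abandonment probabilities, in order
    (e.g., order queue, pay queue, pickup). Revenue is collected only
    from customers who survive every stage."""
    surviving = customers
    for p_abandon in stages:
        surviving *= (1.0 - p_abandon)
    return surviving * order_value

# Walk-in: customers can abandon the order queue, the pay queue, and pickup.
walk_in = expected_revenue([0.15, 0.05, 0.01])
# Mobile Order & Pay: only the low-abandonment pickup stage remains.
mobile = expected_revenue([0.01])
print(walk_in, mobile)   # eliminating the early queues raises expected revenue
```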
The adaptation costs of the changes described above are also of particular interest. The wheeled-cart solution is virtually free, since the cart and the cups already exist, as does the human resource (cashier or barista) at the moment of the change. The redistribution of human resources may affect the rest of the system, but since the order queue has been identified as the current bottleneck, reassigning one employee to this extra task brings more benefit than cost. The mobile app has obvious additional costs (infrastructure, developers, maintenance, and so on), but it is a global solution that can be applied to all (or potentially all) stores. Furthermore, it eliminates two potential bottlenecks, the order and pay queues, and it is a parallel, alternative process that does not affect the other orders to a large degree. Finally, adding and removing resources dynamically and just in time has obvious costs, but more importantly carries high economic risk: reserving equipment for a period of high traffic that turns out to be shorter than expected results in unnecessary costs. Overall, the cost-to-benefit ratio of adding physical resources may be high enough that the store would prefer a few short periods with higher delays over adding resources indiscriminately.

On the Cloud…

Unlike in the Starbucks system, adding and removing cloud resources is a convenient and inexpensive change, considering the low cost of hardware and the fact that cloud computing is an economy of scale. The latter means that, since a physical host can accommodate a large number of virtual machines, the more VMs clients commission, the more the cost spreads across these machines. However, the concept of economies of scale also applies to the software in a negative manner: the fewer requests a VM serves, the more expensive each request becomes. Therefore, it is desirable that our clusters operate close to full capacity, so that their costs spread out over more requests. This is especially prominent in small systems, where the clusters are small and an extra VM increases the average cost per request considerably. As a result, when a small increase in requests triggers a scaling action to preserve performance, we may be reluctant to add new resources and prefer to wait until more requests arrive, at the risk of increasing the system's response time. However, a considerably increased response time may result in dropped requests, similar to Starbucks customers leaving the queue. Both phenomena can result in a significant decrease in the long-term perceived quality of the system. In fact, unlike services like Starbucks, software services usually have written agreements with their clients, known as Service Level Agreements (SLAs), in which they guarantee a maximum response time and a minimum availability rate (i.e., the percentage of served requests out of the total received). Violation of these agreements may even result in financial penalties.
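A back-of-the-envelope calculation shows the negative side of the economy of scale in a small cluster. The $0.10/hour VM price is an invented figure for illustration, not a real provider's rate.

```python
# Illustration of the economics above: with a fixed hourly VM price, the
# cost per request falls as a VM serves more requests, so adding a VM to a
# small cluster at constant load inflates the average cost per request.

def cost_per_request(n_vms, requests_per_hour, vm_price_per_hour=0.10):
    return (n_vms * vm_price_per_hour) / requests_per_hour

small = cost_per_request(n_vms=2, requests_per_hour=1000)
scaled = cost_per_request(n_vms=3, requests_per_hour=1000)  # same load, one extra VM
print(small, scaled)   # the extra VM raises per-request cost by 50%
```

This is why a small system may prefer to ride out a brief load spike, accepting a higher response time, rather than commission a VM it cannot amortize, as long as the SLA is not at risk.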

Concerning the concept of heterogeneous clusters using VMs optimized for specific resources, as in our example software system, the motivation comes from Amazon's pricing policies for virtual resources. Within virtual machines of the same type (general purpose, CPU optimized, memory optimized, and so on), doubling the resources of a VM doubles its cost. However, if we want to increase only one resource per VM, we can commission a specialized VM for a lower cost than a general-purpose VM, which would unnecessarily increase the other resources as well. Although heterogeneous clusters are an optimal solution from a cost and performance perspective, they require additional logic in the load balancer to make sure that requests are distributed according to their needs.

5. Conclusions 

The purpose of this post was to show that resource scaling and dynamic adaptation are a reality not only in software and computer systems, but also in everyday services and processes as simple as ordering a cup of coffee. Crucial components of the adaptation process are monitoring, where we study the performance of our system and identify potential problems; correct planning and execution of the adaptive actions given the available resources; and, eventually, the economic considerations of the whole process.

The inclusion of novel technologies, including cloud and mobile computing, does not diminish the role of humans in the process. On the contrary, scaling and smart solutions can lead not only to better services for customers, whether these are software or human services, but, more importantly, to economic benefits for companies, employees, and clients. Cost savings can allow companies to redistribute the budget towards further improving the service or the quality of work for their employees (increased salaries, better training, etc.). In addition, lower costs can, through market competition, lead to lower service prices to the benefit of the clients. These facts show that during dynamic adaptation, a service should be perceived both as a system and as a product with economic considerations.


[1] Gregor Hohpe. Your Coffee Shop Doesn't Use Two-Phase Commit. IEEE Software. 22(2): 64-66 (2005)

If you like this article, you might also enjoy reading:
  • Panos Louridas. Up in the Air: Moving Your Applications to the Cloud. IEEE Software. 27(4): 6-11 (2010).
  • Leah Riungu-Kalliosaari, Ossi Taipale, Kari Smolander. Testing in the Cloud: Exploring the Practice. IEEE Software. 29(2): 46-51 (2012).
  • Diomidis Spinellis. Developing in the cloud. IEEE Software. 31(2): 41-43 (2014).