Sunday, November 29, 2015

The impact of FOSS on software engineering research

by Jesus M. Gonzalez-Barahona
Associate Editor: Stefano Zacchiroli

Back in 2000, during the Christmas break, I had some time to play. A few days earlier, by chance, I had read "Estimating Linux's Size", a study of the number of lines of code of all the software in Red Hat 6.2, one of the most popular Linux-based distributions of the time. I had noticed that the tool used in the study, sloccount, was FOSS (free, open source software), so I could use it in my own playground. For several days, my laptop was busy downloading the entire source code of Debian GNU/Linux 2.2 (Potato), running sloccount on it, and producing all kinds of numbers, tables, and charts. At that time Debian was one of the largest, if not the largest, coordinated collections of software ever compiled. That fun led to a series of papers, starting with "Counting potatoes: the size of Debian 2.2", and to a pivot in my research career that put me on the track of mining software repositories.
This personal story illustrates two emerging trends that proved to be game-changers: the public availability of information about software development, including the source code, and the availability of FOSS tools to retrieve and analyze that information. Both were enabled by FOSS, but they have worked in very different, yet complementary, ways.

FOSS as a matter of study

Everybody knows nowadays that FOSS projects publicly release the source code they produce. This has tremendously lowered the barriers to studying software products whenever that study requires access to their source code. It is no longer necessary to have special agreements with software companies to do research in this area. For the first time in the history of software, it is possible and relatively easy to run comparative studies on large and very large samples of products.
Of course, the impact of these new opportunities has been gradual. As FOSS has become more relevant, more "interesting", industry-grade products have become available for study. Over time, researchers have developed new techniques and tools, and even a kind of new mindset, for analyzing this new, yet very rich, corpus of cases.
Among FOSS projects, there is a large subset using an open development model. These projects keep public repositories for source code management, issue tracking, mailing lists and forums, etc. For them, not only the source code but also a great deal of very detailed information about the development process is available. From the researcher's point of view, this is a trove of data to explore.
Already in the late 1990s some researchers started to take advantage of this new situation, but it was in the early 2000s that this new approach started to take shape, with works such as "Evolution in open source software: a case study", "Results From Software Engineering Research Into Open Source Development Projects Using Public Data", "A Case Study of Open Source Software Development: The Apache Server", and "Open source software projects as virtual organizations".
Soon, specialized venues emerged that dealt mostly with the study of data from software development repositories. Among them, the International Working Conference on Mining Software Repositories, held annually since 2004, was one of the first and is today probably the best known. During the last decade, this research approach has permeated all major conferences and journals in software engineering.

FOSS tools and public datasets as enablers of research

The availability of FOSS tools has been an enabler in many research fields, and software engineering is no exception. Efficient and mature FOSS databases, analysis tools, etc., allow any researcher, even one with a very modest budget, to carry out ambitious studies on very large datasets. In addition, more and more tools specifically developed to analyze software or software repositories are available as FOSS. It is becoming customary for researchers to publish as FOSS the tools they used to produce their results, letting others modify or adapt them to extend the research. Some of these tools are useful for practitioners too.
For example, several FOSS tool sets exist to retrieve information from software repositories and organize it in databases, ready for analysis, such as MetricsGrimoire. They allow researchers to focus on the analysis of the data rather than on its retrieval, a task which may be very effort-consuming, especially when a large number of projects are involved. Practitioners are using them as well, to track KPIs (Key Performance Indicators) of their projects.
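As a minimal sketch of the kind of retrieval such toolkits automate (this is an illustration, not MetricsGrimoire itself), the following Java program parses the output of git log to count commits per author, the sort of raw record that would normally be stored in a database for later analysis:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.util.HashMap;
  import java.util.Map;

  public class CommitCounter {
    public static void main(String[] args) throws Exception {
      // Ask git for one line per commit, containing only the author name.
      Process git = new ProcessBuilder("git", "log", "--pretty=format:%an")
          .redirectErrorStream(true)
          .start();
      Map<String, Integer> commitsPerAuthor = new HashMap<>();
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(git.getInputStream()))) {
        String author;
        while ((author = reader.readLine()) != null) {
          commitsPerAuthor.merge(author, 1, Integer::sum);
        }
      }
      git.waitFor();
      // A real toolkit would store these records in a database;
      // here we simply print the per-author commit counts.
      commitsPerAuthor.forEach((a, n) -> System.out.println(a + ": " + n));
    }
  }

Run from inside a git working copy, this produces one line per author; real tool sets extend the same idea to issue trackers, mailing lists, and code review systems.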
One step further, in some cases the datasets resulting from large data retrieval efforts are made public, letting researchers jump directly into analysis. This is the case, for example, of the Boa infrastructure, which makes it possible to run queries against hundreds of thousands of software projects very efficiently, or FLOSSMole, a repository of FOSS project data coming from several software development forges.

The impact on software engineering research

These two trends are changing large areas of software engineering research by allowing researchers to produce results in ways closer to the scientific method than before. With relatively little effort, they can start from solid data, which is becoming more and more accessible, and rely on mature, adaptable tools to produce results that take into account a large diversity of cases. Some concrete impacts are:
  • Availability of datasets. Researchers can work on the same dataset with different techniques, which eases the comparison of results and makes it possible to determine to what extent a study advances the state of the art. Public datasets also allow researchers to focus on analysis rather than on the assembly of the dataset itself, which is complex, error-prone, and effort-consuming.
  • Reproducibility of studies. When the tools used by researchers are FOSS and the datasets are public, reproducing previous results becomes possible and even easy. This is already improving the chances of validating results, which is fundamental for advancing on solid ground.
  • Incremental research. It is now much easier to stand on the shoulders of giants by reusing FOSS tools and datasets produced by other research teams. Researchers no longer have to start from scratch; they can incrementally improve on previous results.
In short, FOSS is making empirical software engineering research much easier. Some clear examples are the advances in the detection of code clones, the impact of different release models, and the limits to software evolution. The coming years will show to what extent this will translate into results that improve our knowledge of how software is developed, and improve software development itself.

References

"Estimating Linux's Size", David A. Wheeler, published online, November 6, 2000.
"Counting potatoes: the size of Debian 2.2", González-Barahona, Jesus M.; Pérez-Ortuño, Miguel; de las Heras Quirós, Pedro; Centeno-González, José; Matellán-Olivera, Vicent. Upgrade Magazine, 2(6), 60-66 (2001).
"Evolution in open source software: a case study", Godfrey, M.W.; Qiang Tu. Proceedings od the International Conference on Software Maintenance, 11-14 Oct. 2000, San Jose, CA, USA, 131-142.
"Results From Software Engineering Research Into Open Source Development Projects Using Public Data", Koch, Stefan; Georg Schneider, published online, 2000.
"A Case Study of Open Source Software Development: The Apache Server", Mockus, Audris; Fielding Roy; Herbsleb James.Proceedings of the International Conference on Software Engineering (ICSE 2000), 4-11 Jun. 2000, Limerick, Ireland.
"Open source software projects as virtual organizations: Competency rallying for software development", Crowston, K.; Scozzi, B. Proceedings of IEE Software, Feb. 2002, 149(1), 3-17.

Wednesday, November 25, 2015

Understanding Runtime Value: A Cost/Benefit Approach to Performance Analysis


by David Maplesden, The University of Auckland, Auckland, New Zealand (@dmap_nz)
Associate Editor: Zhen Ming (Jack) Jiang, York University, Toronto, Canada


Many large-scale modern applications suffer from performance problems [1] and engineers go to great lengths searching for optimisation opportunities. Most performance engineering approaches focus on understanding an application's cost (i.e., its use of runtime resources). However, understanding cost alone does not necessarily help find optimisation opportunities. One piece of code may take longer than another simply because it is performing more necessary work. For example, it would be no surprise that a routine that sorted a list of elements took longer than another routine that returned the number of elements in the list. The fact that the costs of the two routines are different does not help us understand which may represent an optimisation opportunity. However, if we had two different routines which output the same results (e.g., two different sorting algorithms), then determining which is the more efficient solution becomes a simple cost comparison.
The key is to understand the value provided by the code. It is then possible to find the superfluous activity that characterises poor performance.
Traditionally it has been left to the engineer to determine the value provided by a piece of code through experience, intuition, or guesswork. However, intuitively divining runtime value is difficult in large-scale applications, which have thousands of methods interacting to produce millions of code paths. Establishing the value provided by each method via manual inspection is not practical at such scale and complexity.
To tackle this challenge we are developing an approach to empirically measure runtime value. We can combine this measure with traditional runtime cost information to quantify the efficiency of each method in an application. This allows us to find the most inefficient methods in an application and analyse them for optimisation opportunities.
Our approach to quantifying value is to measure the amount of data created by a method that becomes visible to the rest of the application, i.e., the data that escapes the context of the method. Our rationale is that the value a method provides can only be imparted through the visible results it creates. Intermediate calculations used to create the data but then discarded do not contribute to this final value. Intuitively, two method calls that produce identical results (given the same arguments) provide the same amount of value, regardless of their internal implementations.
Specifically, we track the number of object field updates that escape their enclosing method. An object field update is any assignment to an object field or array element (e.g., foo.value = 1 or bar[0] = 2). A field update escapes a method if the object it is applied to escapes the method, i.e., if the object is a global (static), a method parameter, or is returned.
For example, consider the time-formatting Java code below:
  public static String formatElapsedTime(long timeInMillis) {
    // Break the elapsed time into hour, minute and second components.
    long hours = timeInMillis / (1000 * 60 * 60);
    long minutes = (timeInMillis / (1000 * 60)) % 60;
    long seconds = (timeInMillis / 1000) % 60;

    final StringBuilder sb = new StringBuilder();
    formatTimePart(sb, hours, "hours");
    formatTimePart(sb, minutes, "minutes");
    formatTimePart(sb, seconds, "seconds");
    return sb.toString();
  }

  public static void formatTimePart(StringBuilder sb, long l, String description) {
    if (l > 0) {
      if (sb.length() > 0) {
        sb.append(' '); // separate this part from any previous one
      }
      sb.append(l);
      sb.append(' ');
      sb.append(description);
    }
  }
The formatTimePart method updates the StringBuilder parameter (via calls to sb.append()) and so it has parameter-escaping field updates. The formatElapsedTime method has no parameter-escaping updates, but it does return a new String value (constructed via sb.toString()) and so has returned field updates. Note that the StringBuilder object sb does not escape formatElapsedTime, so the updates applied to it are actually captured by the method; it is only the subsequently constructed String that escapes. We have found captured writes such as these to be a strong indicator of inefficient method implementations.
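As an illustration of why captured writes can indicate inefficiency (this refactoring sketch is ours, not taken from the paper), an append-style variant that lets the caller supply the output buffer turns the captured writes into parameter-escaping ones and avoids constructing the intermediate String:

  // Hypothetical variant (not part of the original example): the caller
  // supplies the output buffer, so the field updates escape via the
  // parameter and no intermediate String is constructed.
  public static void appendElapsedTime(StringBuilder out, long timeInMillis) {
    long hours = timeInMillis / (1000 * 60 * 60);
    long minutes = (timeInMillis / (1000 * 60)) % 60;
    long seconds = (timeInMillis / 1000) % 60;

    formatTimePart(out, hours, "hours");
    formatTimePart(out, minutes, "minutes");
    formatTimePart(out, seconds, "seconds");
  }

Whether such a refactoring pays off depends on how callers use the result; the point is that the escape behaviour of the writes, not the work itself, changes.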
We have evaluated our approach [2] using the DaCapo benchmark suite, demonstrating that our analysis allows us to quantify the efficiency of the code in each benchmark and find real optimisation opportunities, with improvements of up to 36% in our case studies. For example, we found that over 10% of the runtime activity in the h2 benchmark was incurred by code paths, such as JdbcConnection.checkClosed(), that checked assertions and did not contribute directly to the benchmark result. Many of these checks were repeated unnecessarily, and we were able to refactor and remove them.
Our proposed approach allows the discovery of new optimisation opportunities that are not readily apparent from the original profile data. The results of our experiments and the performance improvements we made in our case studies demonstrate that efficiency analysis is an effective technique that can be used to complement existing performance engineering approaches.

References

  1. G. Xu, N. Mitchell, M. Arnold, A. Rountev, and G. Sevitsky. Software Bloat Analysis: Finding, Removing, and Preventing Performance Problems in Modern Large-Scale Object-Oriented Applications. Proceedings of the FSE/SDP Workshop on the Future of Software Engineering Research, pages 421-425, 2010.
  2. D. Maplesden, E. Tempero, J. Hosking, and J. Grundy. A Cost/Benefit Approach to Performance Analysis. Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering (ICPE), to appear. 2016.


Wednesday, November 18, 2015

Supporting newcomers to Open Source Software

by Igor Steinmacher (@igorsteinmacher)
Associate Editor: Christoph Treude (@ctreude)

Community-based Open Source Software (OSS) projects leverage contributions from geographically distributed volunteers and require a continuous influx of newcomers for their survival, long-term success, and continuity. It is therefore essential to motivate, engage, and retain new developers in a project in order to sustain its developer base. Furthermore, recent studies report that newcomers are needed to replace older members leaving the project and are a potential source of new ideas and work procedures that the project needs [1].

However, new developers face various barriers when attempting to contribute [2]. In general, newcomers are expected to learn the social and technical aspects of a project on their own. Moreover, since delivering a task to an OSS project is usually a long, multi-step process, some newcomers may lose motivation and even give up on contributing if there are too many barriers to overcome. These barriers affect not only those interested in becoming long-term project members, but also those who wish to submit a single contribution (e.g., a bug fix or a new feature). And, as Karl Fogel says in his book Producing Open Source Software: "If the project does not make a good first impression, newcomers rarely give it a second chance."

To better support newcomers, it is necessary to identify and understand the barriers that prevent them from contributing. With a better understanding of these barriers, it is possible to put effort into building or improving tools and processes, ultimately leading to more contributions to the project. We therefore conducted a study to identify the barriers faced by newcomers [2][3]. We collected data from different sources: a systematic literature review; answers to open questions gathered from OSS project contributors; students contributing to OSS projects; and interviews with experienced members and newcomers of OSS projects. Based on the analysis of this data, we organized 58 barriers into a model with 7 categories. Figure 1 depicts the categories and subcategories of this barriers model.

Figure 1. Barriers Model: Categories and Subcategories

Based on the barriers model, we built FLOSScoach, a portal to support the first steps of newcomers to OSS projects. The portal is structured to reflect the categories identified in the barriers model. Each category was mapped onto a portal section which contains information and strategies aimed at supporting newcomers in overcoming the identified barriers. To populate the portal, we collected existing strategies and information from interviews with experienced members and from manual inspection of the projects' web pages.

In the portal, newcomers find information on the skills needed to contribute to a project, a step-by-step contribution flow, the location of features (such as the source code repository, issue tracker, and mailing list), a list of newcomer-friendly tasks (if provided by the project), and tips on how to interact with the community. Preliminary studies have shown that FLOSScoach helps newcomers, guiding them in their first steps and making them more confident in their ability to contribute to a project [4]. When we compared students' performance with and without FLOSScoach, we found a significant drop in self-efficacy among students in the control group (not using FLOSScoach), while the self-efficacy of students using the tool remained at a high level. In addition, by analyzing diaries written during the contribution process, we found evidence that FLOSScoach made newcomers feel oriented and more comfortable with the process, while those without access to FLOSScoach repeatedly reported uncertainty and doubt about how to proceed.

Figure 2. FLOSScoach screen showing the contribution flow (How to Start section)

Identifying and organizing the barriers and developing FLOSScoach are the first steps towards supporting newcomers to OSS projects. A smooth first contribution may increase the total number of successful contributions made by a single contributor and, hopefully, the number of long-term contributors. According to the results of our studies so far, the points that deserve more attention are facilitating local workspace setup and providing ways to find the correct set of artifacts to work on once a task is selected.

You can access newcomer support for a number of OSS projects on the FLOSScoach site at http://www.flosscoach.com.

References


  1. R. E. Kraut, M. Burke, J. Riedl, and P. Resnick, “The Challenges of Dealing with Newcomers”, in Building Successful Online Communities: Evidence-Based Social Design, R. E. Kraut and P. Resnick, Eds. MIT Press, 2012, pp. 179–230.
  2. I. Steinmacher, A. P. Chaves, T. Conte, and M. A. Gerosa, “Preliminary empirical identification of barriers faced by newcomers to Open Source Software projects”, in Proceedings of the 28th Brazilian Symposium on Software Engineering, 2014, pp. 1–10.
  3. I. Steinmacher, T. Conte, M. A. Gerosa, and D. Redmiles, “Social Barriers Faced by Newcomers Placing Their First Contribution in Open Source Software Projects”, in Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 2015, pp. 1379–1392.
  4. I. Steinmacher, I. S. Wiese, T. Conte, and M. A. Gerosa, “Increasing the self-efficacy of newcomers to Open Source Software projects”, in Proceedings of the 29th Brazilian Symposium on Software Engineering, 2015, pp. 160–169.


Monday, November 9, 2015

Kickstarting The IEEE Software Blog

It gives me great pleasure to welcome all of you to the IEEE Software Blog. The goal of the blog is to present recent advances in the different research areas of software engineering via sharp, to-the-point, easily accessible blog posts. Furthermore, we will strive not to use our typical academic jargon, but to distill the important takeaway messages from the research projects we are blogging about.

Since most academic journals are not open access, it is nontrivial for practitioners to get their hands on the latest research, so this blog will discuss some of the great content in IEEE Software. Readers will also be able to discuss each post in the comments section. At the end of the day, we want practitioners to be able to easily access and apply the latest research advancements. Additionally, we will blog about well-informed opinions, new and disruptive ideas, book reviews, and future directions. We will also disseminate the posts via our social media accounts.


To this end, I have assembled a diverse, international team of young, up-and-coming researchers in different software engineering areas:
  • Programming languages and paradigms — Rishabh Singh, MSR, Redmond, USA
  • Mobile applications and systems — Federica Sarro, UCL, London, UK
  • Software engineering processes, models and methods — Abram Hindle, Univ. of Alberta, Edmonton, Canada 
  • Software maintenance — Sonia Haiduc, FSU, Tallahassee, USA
  • Design/Architecture and Requirements — Mehdi Mirakhorli, RIT, Rochester, USA
  • Software testing and quality assurance — William Halfond, USC, Los Angeles, USA
  • Cloud, Distributed and Enterprise Software Systems — Jack Jiang, York University, Toronto, Canada
  • Open source software systems — Stefano Zacchiroli, Université Paris Diderot, Paris, France
  • Mining software repositories — Alberto Bacchelli, TU Delft, Delft, The Netherlands
  • Human Factors — Christoph Treude, IME/USP, São Paulo, Brazil
  • Software release and configuration management — Sarah Nadi, TU Darmstadt, Darmstadt, Germany
  • Social software development — Bogdan Vasilescu, UC Davis, Davis, USA
  • New initiatives — Mei Nagappan, RIT, Rochester, USA 

This team of blog editors will not only blog themselves but will also reach out to researchers and practitioners to solicit articles in their corresponding areas of expertise. The current plan is to have at least 6 blog posts every month, each in a different area of software engineering research. 
So, if you have a new research finding or an opinion about some existing idea, please contact the appropriate blog editor above. Also, if you have feedback on how we can improve the blog, please drop me a note. This blog cannot succeed without your participation!

Mei Nagappan
Editor-in-Chief of the IEEE Software Blog