by Jesus M. Gonzalez-Barahona
Associate Editor: Stefano Zacchiroli
Back in the year 2000, during Christmas, I had some time to play. Some days earlier, by chance, I had read "Estimating Linux's Size", a study on the number of lines of code of all the software in Red Hat 6.2, one of the most popular Linux-based distributions of the time. I had noticed that the tool used in the study, sloccount, was FOSS (free, open source software), so I could use it in my playground. For some days, my laptop was busy downloading the complete source code of Debian GNU/Linux 2.2 (Potato), running sloccount on it, and producing all kinds of numbers, tables, and charts. At that time Debian was one of the largest, if not the largest, coordinated collections of software ever compiled. That fun led to a series of papers, starting with "Counting potatoes: the size of Debian 2.2", and to a pivot in my research career that put me on the track of mining software repositories.
This personal story illustrates two emerging trends that proved to be game-changers: the public availability of information about software development, including source code, and the availability of FOSS tools to retrieve and analyze that information. Both were enabled by FOSS, but they have worked in very different, yet complementary, ways.
FOSS as a matter of study
Everybody knows nowadays that FOSS projects publicly release the source code they produce. This has tremendously lowered the barriers to studying software products whenever that study requires access to their source code. It is no longer necessary to have special agreements with software companies to do research in this area. For the first time in the history of software, it is possible and relatively easy to run comparative studies on large and very large samples of products.

Of course, the impact of these new opportunities has been gradual. As FOSS has become more and more relevant, more "interesting", industry-grade products have become available for study. With time, researchers have developed new techniques and tools, and even a kind of new mindset, for analyzing this new, yet very rich, corpus of cases.
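To give a flavor of the kind of measurement involved, here is a minimal sketch of counting physical lines of code across a source tree. It is not sloccount (which uses per-language heuristics to skip comments and blank lines), and the small extension-to-language map is purely an illustrative assumption:

```python
#!/usr/bin/env python3
"""Minimal sketch of source-code size measurement.

This is NOT sloccount: real tools use per-language heuristics to
skip comments and blank lines. Here we only count physical lines,
grouped by a tiny, illustrative extension-to-language map.
"""
import os
import sys
from collections import Counter

# Illustrative assumption: map a few file extensions to languages.
LANGUAGES = {".c": "C", ".h": "C", ".py": "Python", ".sh": "Shell"}

def count_lines(root):
    """Walk a source tree, returning physical line counts per language."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lang = LANGUAGES.get(os.path.splitext(name)[1])
            if lang is None:
                continue  # not a source file we recognize
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as f:
                counts[lang] += sum(1 for _ in f)
    return counts

if __name__ == "__main__":
    # Usage: python count_lines.py /path/to/source/tree
    for lang, lines in count_lines(sys.argv[1]).most_common():
        print(f"{lang}: {lines} lines")
```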
Among FOSS projects, there is a large subset using an open development model. They keep public repositories for source code management, issue tracking, mailing lists and forums, etc. For these projects, not only the source code but also a great deal of very detailed information about the development process is available. From the researcher's point of view, this is a paradise of data to explore.
Already in the late 1990s, some researchers started to take advantage of this new situation, but it was in the early 2000s that this new approach started to take shape, with works such as "Evolution in open source software: a case study", "Results From Software Engineering Research Into Open Source Development Projects Using Public Data", "A Case Study of Open Source Software Development: The Apache Server", and "Open source software projects as virtual organizations".
Soon, specialized venues dealing mostly with the study of data from software development repositories emerged. Among them, the International Working Conference on Mining Software Repositories, held annually since 2004, was one of the first and is today probably the best known. During the last decade, this research approach has permeated all major conferences and journals in software engineering.
FOSS tools and public datasets as enablers of research
The availability of FOSS tools has been an enabler in many research fields, and software engineering has been no exception. Efficient and mature FOSS databases, analysis tools, and the like allow any researcher, even on a very modest budget, to carry out ambitious studies with very large datasets. In addition, more and more tools specifically developed to analyze software or software repositories are available as FOSS. It is becoming customary for researchers to publish as FOSS the tools they use to produce their results, letting others modify or adapt them to extend the research. Some of these tools are useful for practitioners too.

For example, several FOSS tool sets exist to retrieve information from software repositories and organize it in databases, ready for analysis, such as MetricsGrimoire. They allow researchers to focus on the analysis of the data rather than on its retrieval, a task which may be very effort-consuming, especially when a large number of projects is involved. Practitioners use them as well, to track KPIs (Key Performance Indicators) of their projects.
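As a rough illustration of the retrieve-and-store idea behind such tool sets (this sketch mirrors the concept only; it is not MetricsGrimoire's actual schema or code), one can parse a repository's commit log and load it into a local SQLite database, ready for SQL queries:

```python
#!/usr/bin/env python3
"""Sketch: retrieve commit metadata from a git repository and store it
in SQLite for later analysis. This mirrors the *idea* behind tool sets
such as MetricsGrimoire; it is not their actual schema or code."""
import sqlite3
import subprocess

def fetch_commits(repo_path):
    """Return (hash, author, date) tuples parsed from `git log`."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H|%ae|%aI"],
        capture_output=True, text=True, check=True).stdout
    return [tuple(line.split("|", 2)) for line in out.splitlines()]

def load_db(commits, db_path="commits.db"):
    """Create a minimal commits table and bulk-insert the records."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS commits"
                 " (hash TEXT PRIMARY KEY, author TEXT, date TEXT)")
    conn.executemany("INSERT OR IGNORE INTO commits VALUES (?, ?, ?)",
                     commits)
    conn.commit()
    return conn

if __name__ == "__main__":
    # Run inside any git repository.
    conn = load_db(fetch_commits("."))
    # Example KPI-style query: commits per author, most active first.
    for author, n in conn.execute(
            "SELECT author, COUNT(*) FROM commits"
            " GROUP BY author ORDER BY COUNT(*) DESC LIMIT 10"):
        print(f"{author}: {n} commits")
```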
One step further, in some cases the datasets resulting from large data retrieval efforts are made public, letting researchers jump directly to analyzing them. This is, for example, the case of the Boa infrastructure, which makes it possible to run queries against hundreds of thousands of software projects very efficiently, or of FLOSSmole, a repository of FOSS project data coming from several software development forges.
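Working with such public datasets can then be as simple as loading a dump and querying it. The sketch below assumes a hypothetical CSV dump, projects.csv, with project_name and language columns; actual dump formats vary from dataset to dataset:

```python
#!/usr/bin/env python3
"""Sketch: analyzing a public dataset dump with no retrieval work.
Assumes a hypothetical CSV file `projects.csv` with columns
`project_name` and `language`; real dump formats vary per dataset."""
import csv
from collections import Counter

def language_distribution(csv_path):
    """Count projects per primary language in the dump."""
    counts = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["language"]] += 1
    return counts

if __name__ == "__main__":
    for lang, n in language_distribution("projects.csv").most_common(10):
        print(f"{lang}: {n} projects")
```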
The impact on software engineering research
These two trends are changing large areas of software engineering research, by allowing researchers to produce results in ways closer to the scientific method than before. With relatively little effort, they can start from solid data, which is becoming more and more accessible, and rely on mature, adaptable tools to produce results that take into account a large diversity of cases. Some concrete impacts are:

- Availability of datasets. Researchers can work on the same dataset with different techniques, which eases the comparison of results and makes it possible to determine to what extent a study advances the state of the art. Public datasets also allow researchers to focus on analysis, rather than on assembling the dataset itself, which is complex, error-prone, and effort-consuming.
- Reproducibility of studies. When the tools used by researchers are FOSS, and the datasets are public, reproducing previous results becomes possible and even easy. This is already improving the chances of validating results, which is fundamental to advancing on solid ground.
- Incremental research. It is now much easier to stand on the shoulders of giants by reusing FOSS tools and datasets produced by other research teams. Researchers no longer have to start from scratch; they can incrementally improve previous results.
References
"Estimating Linux's Size", David A. Wheeler, published online, November 6, 2000."Counting potatoes: the size of Debian 2.2", González-Barahona, Jesus M.; Pérez-Ortuño, Miguel; de las Heras Quirós, Pedro; Centeno-González, José; Matellán-Olivera, Vicent. Upgrade Magazine, 2(6), 60-66 (2001).
"Evolution in open source software: a case study", Godfrey, M.W.; Qiang Tu. Proceedings od the International Conference on Software Maintenance, 11-14 Oct. 2000, San Jose, CA, USA, 131-142.
"Results From Software Engineering Research Into Open Source Development Projects Using Public Data", Koch, Stefan; Georg Schneider, published online, 2000.
"A Case Study of Open Source Software Development: The Apache Server", Mockus, Audris; Fielding Roy; Herbsleb James.Proceedings of the International Conference on Software Engineering (ICSE 2000), 4-11 Jun. 2000, Limerick, Ireland.
"Open source software projects as virtual organizations: Competency rallying for software development", Crowston, K.; Scozzi, B. Proceedings of IEE Software, Feb. 2002, 149(1), 3-17.