Monday, February 13, 2017

Network Science Offers New Perspective on Developer Roles in Open Source Software

By: Mitchell Joblin, Siemens AG, Erlangen, Germany (@mitchelljoblin)
Sven Apel, University of Passau, Germany (@SvenApel)
Associate Editor: Bogdan Vasilescu, Carnegie Mellon University, USA (@b_vasilescu)


Software development is a distributed, collaborative effort, wherein different contributors play different roles, e.g., depending on their ability, experience, history, and position within a project. Research at the University of Passau, Siemens Corporate Technology, and the Technical University of Applied Sciences Regensburg provides novel quantitative insights into the properties and relevance of developer roles and their change over time. The insights could provide guidance on creating optimal governance and coordination structures for distributed software projects.

In open source software in particular, two main groups stand out: a small fraction of developers (the core group) responsible for performing the vast majority of work, assisted by a larger group of less active contributors (the peripheral group), involved in smaller or shorter activities. The two groups should be in balance. A large group of peripheral contributors is critical for ensuring software quality (recall Linus’s law "given enough eyeballs, all bugs are shallow"), but this imposes increased coordination effort on the core developers.

Our software analytics framework Codeface has substantially lowered the technical barrier to performing sophisticated, in-depth analyses of large software repositories. As part of a recent empirical study, we provide a developer role classification based on data automatically extracted from software repositories (e.g., version-control systems, mailing lists). Our techniques substantially extend rudimentary indicators based on counting lines of code or number of commits. Although these simple measures are common in research and practice, they carry significant uncertainty regarding their validity when it comes to capturing nuances of developer roles, and they are extremely limited with respect to inter-developer relationships. For example, the bare number of lines of code a developer has contributed tells us very little about how regularly they make contributions, how they are organizationally embedded within the project, or the degree of influence they may have on other developers.
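To make the limitation concrete, here is a minimal sketch of a purely count-based classification. It is not Codeface's implementation: the function name, the sample commit log, and the 80% cut-off are assumptions chosen for illustration. Note that the only input is one author name per commit, so nothing about regularity, organizational embedding, or influence can enter the result.

```python
from collections import Counter

def classify_by_commit_count(commit_authors, core_share=0.8):
    """Label the most active developers as core until they cover
    core_share of all commits (the 0.8 default is an assumed cut-off)."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    roles = {}
    covered = 0
    # Walk developers from most to least active.
    for dev, n in counts.most_common():
        roles[dev] = "core" if covered < core_share * total else "peripheral"
        covered += n
    return roles

# Hypothetical commit log: one author name per commit.
commits = (["alice"] * 50 + ["bob"] * 35 + ["carol"] * 7
           + ["dave"] * 5 + ["erin"] * 3)
roles = classify_by_commit_count(commits)
for dev, role in roles.items():
    print(dev, role)
```

In this toy log, alice and bob together account for 85 of 100 commits and are labeled core; the remaining three developers are labeled peripheral, regardless of whom they work with.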

Developer Networks

A developer network is a relational abstraction of the social and technical activities performed by developers. An example developer network for the open-source project QEMU is shown in Figure 1. Such networks accurately reflect developer perception, and they reveal important functional substructures, or communities, with related tasks and goals [1, 3]. They have also been shown to evolve over time, following a number of fundamental changes in the network’s structural properties [2].

In a recent study [4], we explored the use of developer networks for classifying developer roles. In a nutshell, we found that they yield better developer classifications than basic counts of individual developer activities, as we explain next.

Figure 1: The QEMU developer network emphasizing the community structure. Each node represents a developer and edges indicate interrelated development activity. The uniquely colored boxes enclosing developers represent communities of developers that work on strongly interrelated tasks. The pie chart for each node represents the amount of development activity associated with each community. The expected level of influence for each developer is reflected by the node size.

Empirical Study

As a first step, we performed a large-scale empirical study to obtain a ground truth on classifications for roughly 1000 developers. We used the data to validate (statistically and empirically) that using information from developer networks leads to more accurate results and greater practical insights regarding the differences between core and peripheral developers than simpler approaches (e.g., based on counting commits). As a key result, we found that core developers occupy a distinctly different position in the organizational structure than peripheral developers. The two groups differ in two respects in particular: (1) hierarchy and (2) stability.

Figure 2: Clustering coefficient versus node degree for developers of QEMU. Each point on the scatter plot represents one developer. The linear dependence is indicative of an organizational hierarchy with core developers at the top and peripheral developers underneath.


To detect mathematically whether a hierarchy exists in a developer network, we inspect the dependence between the so-called clustering coefficient and the degree of nodes in the network [5]. The hierarchical relationship for QEMU is shown in Figure 2; there is a clear linear dependence between the log node degree and the log clustering coefficient. Core developers with a high degree (i.e., who coordinate with many other developers) exclusively have a very low clustering coefficient (i.e., their neighbors are only loosely interconnected), which is indicative of a leadership role. In comparison, peripheral developers appear as low-degree nodes with consistently higher clustering coefficients. A developer’s position in the developer network is thus an organizational manifestation of their particular role. This and similar results may hold great potential for practice, for instance, when the agreement between prescribed and actual social organization needs to be ascertained and optimized, or when the organizational structure of a project is modeled on that of established, successful projects.
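The check described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the analysis code used in the study: the toy edge list, the helper names, and the use of an ordinary least-squares fit are all hypothetical choices. The sketch computes each node's degree and local clustering coefficient and then fits a line to the log-log points; a clearly negative slope corresponds to the pattern of high-degree, low-clustering nodes at the top of the hierarchy.

```python
import math

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = list(adj[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return 2.0 * links / (k * (k - 1))

def loglog_slope(adj):
    """Least-squares slope of log(clustering coefficient) vs. log(degree)."""
    points = []
    for node in adj:
        c = local_clustering(adj, node)
        if c > 0:  # log is undefined for zero clustering, so skip those nodes
            points.append((math.log(len(adj[node])), math.log(c)))
    n = len(points)
    if n < 2:
        raise ValueError("need at least two nodes with non-zero clustering")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all nodes have the same degree")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Hypothetical toy network: one hub coordinating three pairs of developers.
edges = [("hub", p) for p in ("p1", "p2", "p3", "p4", "p5", "p6")]
edges += [("p1", "p2"), ("p3", "p4"), ("p5", "p6")]
adj = build_adjacency(edges)
slope = loglog_slope(adj)
print("hub clustering:", local_clustering(adj, "hub"))
print("log-log slope:", slope)
```

In the toy network, the hub has six neighbors of which only three pairs are connected, so its clustering coefficient is low, while each of the other developers has two mutually connected neighbors. Real developer networks are far noisier, so the slope should be read together with a scatter plot such as Figure 2.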


A stable developer is one who maintains consistent participation in the project over a substantial period of time. The results of examining developer stability over one year of development for QEMU are shown in Figure 3. In this figure, the transition probabilities between developer states are shown in the form of a Markov chain. The primary observation is that developers in a core role are substantially less likely to transition to the absent state (i.e., leave the project) or isolated state (i.e., have no neighbors in the developer network by working exclusively on isolated tasks), which is in contrast to developers in a peripheral role. Based on this result, we can conclude that the core developers represent a more stable group than peripheral developers.

Figure 3: The developer-group stability for QEMU shown in the form of a Markov chain. Edges are labeled with the probability that a developer in one state transitions to the next state in the following development cycle. A few less important edges have been omitted for visual clarity.
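Transition probabilities of this kind can be estimated by counting how often developers move from one state to another between consecutive development cycles. The following is a minimal sketch, assuming each developer's history is available as a list of states, one per cycle; the function name and the sample histories are invented for illustration and are not QEMU data.

```python
from collections import Counter, defaultdict

def transition_probabilities(histories):
    """Estimate Markov-chain transition probabilities from per-developer
    state sequences (one state per development cycle)."""
    counts = defaultdict(Counter)
    for states in histories.values():
        # Pair each cycle's state with the state in the following cycle.
        for current, following in zip(states, states[1:]):
            counts[current][following] += 1
    probs = {}
    for state, successors in counts.items():
        total = sum(successors.values())
        probs[state] = {s: n / total for s, n in successors.items()}
    return probs

# Hypothetical histories over four development cycles.
histories = {
    "alice": ["core", "core", "core", "core"],
    "bob": ["peripheral", "core", "core", "peripheral"],
    "carol": ["peripheral", "absent", "absent", "peripheral"],
    "dave": ["isolated", "peripheral", "absent", "absent"],
}
probs = transition_probabilities(histories)
for state, row in probs.items():
    print(state, row)
```

Each row of the result sums to one. In this invented sample, developers in the core state never move directly to the absent state, whereas peripheral developers do so in two of three observed transitions.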

Final Remarks

Developer networks present a unique opportunity to capture insights about developer roles that are not visible in simpler, yet ubiquitously used representations. Inferring and modeling the inter-developer relationships leads to greater accuracy in capturing the real world, and it delivers deeper insights into the organizational structure of a software project. Our data and analysis suggest that peripheral developers are extremely reliant on core developers, especially because peripheral developers tend not to associate with other peripheral developers. If a core developer becomes overwhelmed by the need to oversee the activities of too many peripheral developers, a likely consequence is that effective coordination will not occur. If this is a pervasive phenomenon in a project's organizational structure, it may lead to severe degradation of the software architecture and have a negative impact on the overall source code quality. Our results, which practitioners can use immediately thanks to the associated freely available software framework Codeface, provide novel means for avoiding these traps.


[1] C. Bird, D. Pattison, R. D’Souza, V. Filkov, and P. Devanbu. Latent social structure in open source projects. In Proc. International Symposium on Foundations of Software Engineering, pages 24–35. ACM, 2008.
[2] M. Joblin, S. Apel, and W. Mauerer. Evolutionary trends of developer coordination: A network approach. Empirical Software Engineering, 2017. To appear.
[3] M. Joblin, W. Mauerer, S. Apel, J. Siegmund, and D. Riehle. From developer networks to verified communities: A fine-grained approach. In Proc. International Conference on Software Engineering, pages 563–573. IEEE, 2015.
[4] M. Joblin, S. Apel, C. Hunsen, and W. Mauerer. Classifying Developers into Core and Peripheral: An Empirical Study on Count and Network Metrics. In Proc. International Conference on Software Engineering, 2017. To appear.
[5] E. Ravasz and A.-L. Barabási. Hierarchical organization in complex networks. Physical Review E, 67(2), 2003.
