Monday, February 13, 2017

Network Science Offers New Perspective on Developer Roles in Open Source Software

By: Mitchell JoblinSiemens AG. Erlangen, Germany (@mitchelljoblin)
Sven ApelUniversity of Passau, Germany (@SvenApel) 
Associate Editor: Bogdan Vasilescu, Carnegie Mellon University. USA (@b_vasilescu)


Software development is a distributed, collaborative effort, wherein different contributors play different roles, e.g., depending on their ability, experience, history, and position within a project. Research at the University of Passau, Siemens Corporate Technology, and the Technical University of Applied Sciences Regensburg provides novel quantitative insights into the properties and relevance of developer roles and their change over time. The insights could provide guidance on creating optimal governance and coordination structures for distributed software projects.

In open source software in particular, two main groups stand out: a small fraction of developers (the core group) responsible for performing the vast majority of work, assisted by a larger group of less active contributors (the peripheral group), involved in smaller or shorter activities. The two groups should be in balance. A large group of peripheral contributors is critical for ensuring software quality (recall Linus’s law "given enough eyeballs, all bugs are shallow"), but this imposes increased coordination effort on the core developers.

Our software analytics framework Codeface has substantially lowered the technical barrier to performing sophisticated in-depth analyses on big software repository data. As part of a recent empirical study, we provide developer role classification based on data automatically extracted from software repositories (e.g., version-control systems, mailing lists). Our techniques substantially extend rudimentary indicators based on counting lines of code or number of commits. These simple measures—although they are common in research and practice—carry significant uncertainty regarding their validity when it comes to capturing nuances of developer roles, and they are extremely limited with respect to inter-developer relationships. For example, the bare number of lines of code a developer has contributed tells us very little about how regularly they make contributions, how they are organizationally embedded within the project, or the degree of influence they may have on other developers.

Developer Networks

A developer network is a relational abstraction of the social and technical activities performed by developers. An example developer network for the open-source project QEMU is shown in Figure 1. Such networks are accurate in reflecting developer perception, and they reveal important functional substructure, or communities, with related tasks and goals [1, 3]. It has been shown that they evolve over time according to a number of fundamental changes in the network’s structural properties [2].

In a recent study [4], we explored the use of developer networks for classifying developer roles. In a nutshell, we found that they are superior in obtaining developer classification information compared to basic counts of individual developer activities, which we explain next.

Figure 1. The QEMU developer network emphasizing the community structure. Each node represents a developer and edges indicate interrelated development activity. The uniquely colored boxes enclosing developers represent communities of developers that work on strongly interrelated tasks. The pie chart for each node represents the amount of development activity associate with each community. The expected level of influence for each developer is reflected by the node size.

Empirical Study

As a first step, we performed a large-scale empirical study to obtain a ground truth on classifications for roughly 1000 developers. We used the data to validate (statistically and empirically) that using information from developer networks leads to more accurate results and greater practical insights regarding the differences between core and peripheral developers than simpler approaches (e.g., based on counting commits). As a key result, we found that core developers are positioned in the organizational structure distinctly from peripheral developers. Furthermore, the nature of the core developer group in the organizational structure is distinct from peripheral developers, in particular, in two different ways: (1) hierarchy and (2) stability.

Figure 2: Clustering coefficient versus node degree for developers of QEMU. Each point on the scatter plot represents one developer. The linear dependence is indicative of an organizational hierarchy with core developers at the top and peripheral developers underneath.


To mathematically detect if hierarchy exists in a developer network, we need to inspect the dependence between the so-called clustering coefficient and the degree of nodes in the network [5]. The hierarchical relationship for QEMU is shown in Figure 2; there is an obvious linear dependence between the log node degree and log clustering coefficient. Core developers that exhibit a high degree (i.e., who coordinate with many other developers) are seen to exclusively have a very low clustering coefficient (i.e., have neighbors that are only loosely interconnected) and are indicative of developers in a leadership role. In comparison, peripheral developers are seen as low degree nodes having consistently higher clustering coefficients. A developer’s position in the developer network is an organizational manifestation of their particular role. This and similar results may hold great potential for practice, for instance, when the agreement between prescribed and actual social organization needs to be ascertained and optimized, or when the organizational structure of projects is devised along the lines of established, successful projects.


A stable developer is one that maintains consistent participation in the project over a substantial period of time. The result of examining developer stability over one year of development for QEMU are shown in Figure 3. In this figure, the transition probabilities between developer states are shown in the form of a Markov chain. The primary observation is that developers in a core role are substantially less likely to transition to the absent state (i.e., leave the project) or isolated state (i.e., have no neighbors in the developer network by working exclusively on isolated tasks), which is in contrast to developers in a peripheral role. Based on this result, we can conclude that the core developers represent a more stable group than peripheral developers.

Figure 3: The developer-group stability for QEMU shown in the form of a Markov Chain. Edges are labeled with the probability that a developer in one state transitions to the next state in the following development cycle. A few less important edges have been omitted for visual clarity.

Final Remarks

Developer networks present a unique opportunity to capture insights about developer roles that are not visible in simpler, yet ubiquitously used representations. Inferring and modeling the inter-developer relationships leads to greater accuracy in capturing the real world, and it delivers deeper insights into the organizational structure of a software project. Our data and analysis suggest that peripheral developers are extremely reliant on core developers, especially because peripheral developers tend not to associate with other peripheral developers. If a core developer becomes overwhelmed by the need to oversee the activities of too many peripheral developers, a likely consequence is that effective coordination will not occur. If this is a pervasive phenomenon in a project's organizational structure, it may lead to severe degradation of the software architecture and have a negative impact on the overall source code quality. Our results—which can be immediately used by practitioners thanks to the associated freely available software framework Codeface—provides novel means for avoiding these traps.


[1] C. Bird, D. Pattison, R. D’Souza, V. Filkov, and P. Devanbu. Latent social structure in open source projects. In Proc. International Symposium on Foundations of Software Engineering, pages 24–35. ACM, 2008.
[2] M. Joblin, S. Apel, and W. Mauerer. Evolutionary trends of developer coordination: A network approach. Empirical Software Engineering, 2017. To appear.
[3] M. Joblin, W. Mauerer, S. Apel, J. Siegmund, and D. Riehle. From developer networks to verified communities: A fine-grained approach. In Proc. International Conference on Software Engineering, pages 563– 573. IEEE, 2015.
[4] M. Joblin, S. Apel, C. Hunsen, and W. Mauerer. Classifying Developers into Core and Peripheral: An Empirical Study on Count and Network Metrics. In Proc. International Conference on Software Engineering, 2017. To appear.
[5] E. Ravasz and A.-L. Barabasi. Hierarchical organization in complex networks. Physical Review E, 67(2), 2003.

Sunday, February 5, 2017

Empowering Users to Build IoT Software with a Puzzle-like Environment

by Yijun Yu, Pierre A. Akiki, Arosha K. Bandara
Associate Editor: Christoph Treude (@ctreude)

Jigsaw puzzle pieces serve as building blocks that allow children as well as adults to put together impressive pictures. Software engineers follow a similar approach to build complex programs. Since David Parnas' modularization conceptualization (Parnas, 1972), many kinds of software modules in the form of "building blocks" have been proposed and adopted. These include functions, objects, remote procedures, web services, cloud services, and microservices, to name a few. However, from a developer’s perspective, it is hard to achieve the exact requirements of end-users. Therefore, it would be useful to empower end-users with the capability to adapt a software product to their individual needs (Lapouchnian et al., 2006).

Internet of Things (IoT) devices and services can be configured to work together in many different ways. This configuration can be performed by end-users and is considered an end-user development challenge. Millions of school kids nowadays are taught programming concepts using end-user development environments such as Scratch, which has also been adapted into the Sense environment used to teach entry-level computing at The Open University (Kortuem et al., 2013). Can't we use a similar environment to teach end-users to develop software for IoT?

Akiki's recent research on Visual Simple Transformations (ViSiT) solves this problem by empowering end-users to wire IoT devices and services (Akiki et al., 2017). For example, end-users can use puzzle pieces to implement a transformation (Figure 1) that allows a Microsoft Xbox controller to communicate with a Lego Mindstorms robot. This paradigm is familiar to anyone who is used to programming with an environment like Scratch. Each puzzle piece is a visual block that connects to its neighbouring pieces in a similar way. Intuitively, puzzle pieces bind to each other through predefined sockets. The parameters of these building blocks provide the required concretization that matches with the configuration parameters of IoT devices and services.

Figure 1. An example transformation that connects an Xbox controller to a Lego Mindstorms robot
ViSiT’s underlying service-oriented code implements the transformations as executable workflows and composes them into a holistic application for IoT. Although this work primarily targets end-users, software developers can modify the executable workflows using Cedar Studio. This tool was originally developed as part of Akiki's work on adaptive user interfaces (Akiki et al., 2016), (Akiki et al., 2014).

By empowering end-users and lowering the barrier for software development, the creation of real-life IoT applications such as the robot shown in Figure 2 comes within the reach of a wider audience.

Figure 2. A shooting robot can be controlled using an Xbox controller as a result of applying visual simple transformations
Testing ViSiT with a number of end-users showed that it lowers the acceptance barrier and facilitates connecting IoT devices and services. A video demonstrating ViSiT and its supporting development environment can be viewed at:

With the evolution of such development paradigms and their supporting tools, end-users will have an ever-growing role to play in software development. As the adoption of end-user development environments grows, someday the data learnt from these environments could be used to train artificial intelligence to automatically compose software systems.


Monday, January 30, 2017

The Unified ASAT Visualizer (UAV), a tool for comparing multiple ASATs on your Java projects

Authors: Moritz Beller (@Inventitech), Andy Zaidman (@azaidman), Tim Buckers, Clinton Cao, Michiel Doesburg, Boning Gong, Sunwei Wang
Editors: Nikolaos Tsantalis (@NikosTsantalis), Mei Nagappan (@meinagappan)

TL;DR: You want to compare whether using multiple Automated Static Analysis Tools (ASATs) makes sense for your projects, say whether the additional warnings of using Checkstyle on top of FindBugs are worth the increased maintenance costs. We have implemented a prototypical tool to visualize this in an intuitive manner, called UAV. Go check it out on an example project. Here follows the full story.

UAV in action

When we look at the software engineering landscape today, we see that state-of-the-art projects already make use of Automated Static Analysis Tools (ASATs) in some form or another: Either as traditional tools such as JSLint, PVS-Studio, FindBugs, Google's Error-Prone, as web services such as Coverity Scan, or as compiler-embedded analyses like the ones you get from clang. In fact, ASATs have become so numerous for any given programming language that it is difficult for us as project managers to decide which ones my project the most benefits. While deciding on one tool might still be simple, deciding on an effective combination of tools is even more difficult, even though we know that combining multiple ASATs would unleash their potential [1]. The complementary warnings that different ASATs can find often combine nicely think for example of FindBug's bug-finding capabilities and the code readability-centered CheckStyle.

However, many developers understandably refrain from running more than one ASAT mainly because of two reasons: 
1. It is difficult to compare and understand the strengths of multiple ASATs on your own project. Currently, in most cases, you would have to go through the (possibly) lengthy list of findings which each tool emits. This is tedious work since the warnings are not standardized, so it is difficult to tell if the two tools indeed find different warnings or not. 
2. Sifting through ASAT warnings in general is hard work. You don't want to make your life harder by including more tools than necessary. We know that many ASAT warnings are not important to developers, so finding (or configuring) an additional ASAT that reports only interesting warnings for your project is crucial. However, this is again made difficult because there is no common and automatically applicable classification between the warnings that different ASATs emit. As an end result of these complications, many projects still only employ one ASAT with practically no further customization [1].

With UAV, the Unified ASAT Visualizer, we created an ASAT-comparison tool with an intuitive visualization that enables developers, researchers, and tool creators to compare the complementary strengths and overlaps of different Java ASATs. UAV’s enriched treemap and source code views provide its users with a seamless exploration of the warning distribution from a high-level overview down to the source code. We have evaluated our UAV prototype in a user study with ten second-year Computer Science (CS) students, a visualization expert and tested it on large Java repositories with several thousands of PMD, FindBugs, and Checkstyle warnings.

TS;WM (Too Short; Want more): We have a tool paper that goes into many of UAV's implementation details [2].

[1] Moritz Beller, Radjino Bholanath, Shane McIntosh, Andy Zaidman: Analyzing the State of Static Analysis: A Large-Scale Evaluation in Open Source Software. In 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER), Osaka (Japan), 2016.

[2] Tim Buckers, Clinton Cao, Michiel Doesburg, Boning Gong, Sunwei Wang, Moritz Beller and Andy Zaidman: UAV: Warnings from Multiple Automated Static Tools at a Glance. In 24th IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER), Klagenfurt (Austria), 2017.

Monday, January 16, 2017

IEEE November/December Issue, Blog, SE Radio Summary

November/December Issue

IEEE Software Magazine

The November/December issue of IEEE Software offers a variety of relevant and interesting topics in the software world. From hot topics like crowdsourcing and agile to thought-invoking discussions on how research translates to practice, this issue spans a wide range of topics. Tying together all the articles in this issue is an article on telling the story of computing and the role computer plays in the art of story telling. We as software engineers are artists; specializing in the art of technology and "using our software and our hardware as our brush and our canvas".

Featured in this issue are two articles on the changes in the software world that affect that developer and end user:
  • "A Paradigm Shift for the CAPTCHA Race: Adding Uncertainty to the Process" by Shinil Kwon and Sungdeok Cha, where the authors propose ways to improve CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) challenges for increased human ability and decreased bot ability to solve these challenges; and 
  • "Examining the Rating System Used in Mobile-App Stores" by Israel J. Mojica Ruiz, Meiyappan Nagappan, Bram Adams, Thorsten Berger, Steffen Dienst, and Ahmed E. Hassan,  in which the authors explore how accurately user ratings in app stores maps to actual user satisfaction levels with mobile apps.

A large portion of the papers in this issue discuss the artistry of the software architect and how the role of software architect has been changing, and will continue to change, with the changes in technology:

On one side, as technology changes, the importance of the role of the software architect increases. In "The Changing Role of the Software Architect," Editor in Chief Diomidis Spinellis discusses this phenomena in some detail. As software evolves to play a more ubiquitous role in our lives and store more critical and personal information, the design of our software and systems becomes even more vital to the potential for quality, secure transactions. For example, software architecture plays a direct role in the ability for attackers to find and manipulate attack surfaces, or the places where enemies can target their attacks on a given system. This is such an important topic that research has been devoted to approximating and minimizing attack surfaces [1, 2, 3]. Although approximating an attack surface isn't necessarily an architecture problem, minimizing them is. Having the power to determine the design of a system, especially a critical system, is one that should not be taken lightly.

But as Benjamin Parker warned Spider-man, "with great power comes great responsibility". If software architects are becoming more important to the software development and maintenance process, it naturally follows that responsibilities can, and probably should, change. But how? Articles in this issues make some suggestions. For example, Rainer Weinreich and Iris Groher propose one change to the responsibilities of the software architect in their article "The Architect's Role in Practice. From Decision Maker to Knowledge Manager?". The authors interviewed practitioners to learn about how the role of the architect has transformed. Architects are typically tasked primarily, if not solely, with making decisions regarding the design of the target system. However, they discovered that there are additional responsibilities that come with being a software architect, such as advisor and knowledge manager. All the practitioners the authors interviewed agreed that when it comes to knowledge management it is particularly important to document project-specific decisions. With the changes to the software architect role, there is a growing need for tools and guidelines to support their daily activities. Are we up for the challenge??

IEEE Software Blog

In the past couple months, the IEEE Software Blog covered some interesting and practically relevant topics. New to the blog are postmortems, modeled after Postmortems in, where we give companies an opportunity to discuss what is working and what challenges remain for software developers. December features the company Deducely. Along the same lines, there are blog posts regarding various aspects of the software development process, including using creativity in requirements engineering and how to identify and avoid code smellsAlso featured in the November/December blog entires is a blog on the panel titled "The State of Software Engineering Research", which was held last year at FSE 2016.

SE Radio

Featured for this issue of IEEE Software on SE Radio are topics ranging from soft skills, such as salary negotiation, to hard skills, like site reliability engineering and software estimation. Invited guests include Steve McConnell, Sam Aaron, Josh Doody, Björn Rabenstein, Gil Tene, and Peter Hilton. Also, SE Radio welcomed two new members to the SE Radio team: Marcus Blankenship and Felienne Hermans

[1] Theisen, C., Herzig, K., Morrison, P., Murphy, B., & Williams, L. (2015, May). Approximating attack surfaces with stack traces. In Proceedings of the 37th International Conference on Software Engineering-Volume 2 (pp. 199-208). IEEE Press.
[2]Bartel, A., Klein, J., Le Traon, Y., & Monperrus, M. (2012, September). Automatically securing permission-based software by reducing the attack surface: An application to android. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (pp. 274-277). ACM. 
[3] Manadhata, P. K., & Wing, J. M. (2011). An attack surface metric. IEEE Transactions on Software Engineering37(3), 371-386.

Sunday, January 8, 2017

BarraCUDA in the Cloud

by W.B. Langdon, UCL and Bob Davidson, Microsoft
Associate Editor: Federica Sarro (@f_sarro), University College London, UK 

What is this, flying fishes? Well no. BarraCUDA is the name of a Bioinformatics program and the cloud in question is Microsoft’s Azure, which is in the process of being upgraded with copious nVidia K80 tesla GPUs which support CUDA in instances of virtual machines. BarraCUDA has been around for a few years [1]. It is a port of BWA [2] which takes advantage of the massive parallelism available on graphics hardware (GPUs) to greatly speed up approximate matching of millions of short DNA strings against a reference genome. For example, the human reference genome [3]. Approximate matching is necessary, because of noise but primarily because the medical purpose of many DNA scans is to reveal differences between them and “normal” (i.e. reference) DNA. A typical difference is to substitute one character for another, but tools like BarraCUDA also find matches where a character is inserted and where one is deleted. Although there are many sources of DNA data, BarraCUDA and similar programs are targeted at strings generated by “Next Generation Sequencing” (NGS) machines. These are amazing devices. A top end NGS machine is now capable of generating more than a billion DNA strings, sequences of A, C, G or T letters. Part of the trade-off for this speed is the strings are short (typically a hundred letters long) and noisy. The first step is to find where the short fragments of DNA came from by aligning the strings against a reference genome. To account for the various sources of noise, NGS is usually run with three fold redundancy and sometimes a particularly important part of a person’s genome may be scanned ten or more times. Given multiple alignments to the same part of the reference genome, it becomes possible to look for consistent variations.

BWA, BarraCUDA and Bowtie are members of a family of Bioinformatics tools which have proved successful because they are able to compress the human reference genome into less than 4 gigabytes of RAM, making it possible to run an important part of the DNA analysis tool chain on widely available computers. Indeed in the case of BarraCUDA, GPUs with 4GB are also widely available. Recently BarraCUDA was optimised using genetic improvement[4,5] (see blog posting February 3, 2016).  This updating prompted the question was it possible to use BarraCUDA with epigenetics data.

To grossly oversimplify, whilst (to a good approximation) all the cells in your body contain the same DNA, what makes your cells different from each other is how that DNA is used. It is thought that to a large extent how DNA is enabled and disabled is controlled by epigenetic makers on that DNA itself. These epigenetic markers differ between cells. Indeed the markers change not only between cells but also with the person’s age and factors outside the cell. Since this is not fully understood, the study of epigenetics, particularly how it relates to disease is a very active topic. Much of the Next Generation Sequencing technology can be reused by epigenetics. However when matching epigenetic sequences against a reference, the reference is twice the size of the DNA reference. Fortunately this need has coincided with the launch of GPUs with larger memory (e.g. the Tesla K40 has 12GB). Which in turn has coincided with the introduction of Azure cloud nodes with multiple K40s or K80s. Recently we have been benchmarking [6] BarraCUDA on epigenetics data supplied by Cambridge Epigenetics on Azure nodes.

Data from Nvidia

At 30 Nov 2016 there were 1519 GPU articles in the USA National Library of Medicine
(PubMed). 1221 (80%) since the end of 2009.

[1]  P. Klus et al., BarraCUDA. BMC Res Nts, 5(27), 2012.
[2]  Heng Li, R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform.  Bioinformatics 2009, 25(14):1754-1760.
[3]  Initial sequencing and analysis of the human genome. Nature 409, 6822, (15 Feb 2001), 860–921.
[4]  W.B. Langdon. Genetically improved software. In Amir H. Gandomi et al., editors, Handbook of Genetic Programming Applications, chapter 8, pages 181–220. Springer, 2015.
[5]  W.B. Langdon, Brian Yee Hong Lam, M. Modat, J. Petke, M. Harman. Genetic Improvement of GPU Software. Genetic Programming and Evolvable Machines. Online first.
[6] W.B. Langdon, A. Vilella, Brian Yee Hong Lam, J. Petke and M. Harman, Benchmarking Genetically Improved BarraCUDA on Epigenetic Methylation NGS datasets and nVidia GPUs. In Genetic Improvement 2016, (GECCO 2016) workshop, pages 1131-1132, 20-24 July, Denver.

You might also enjoy reading
Genetic Improvement, IEEE Software blog, February 3, 2016
GPGPUs for bioinformatics, Oxford Protein Informatics Group, April 17, 2013
Genome Sequencing in a Nutshell, Deborah Siegel and Denny Lee, May 24, 2016

Sunday, December 18, 2016

Postmortem: Deducely

by Aswin Vayiravan, Deducely (@Deducely
Editor: Mei Nagappan (@MeiNagappan), University of Waterloo, Canada

Editor's Note: In a new series of posts (modeled after the Postmortems in, we are looking into what worked great and what challenges remain for software developers. We hope to curate many such postmortems in the future! Do email me directly if you have ideas on improving this series of posts.  In the first post of this series, I reached out to a small startup company called Deducely in India.

About Us
Deducely is an AI-powered sales lead generation platform. Usually, software companies have a lead generation team that does a lot of manual work like researching, filtering and qualifying prospects before the sales pitch happens. We have a tool that would remove the burden of repeated manual work in these teams. We use Tensor Flow to learn and track patterns and categorize leads. Also, we use NLTK to learn and extract the required information from unstructured data. We are a small bootstrapped startup headquartered in California but working out of a nondescript village in South India called Thiruparankundram.

What worked great for us?

  1. The SDLC model: Back in university we had meticulously studied the Pros and Cons of various Software Development Lifecycle models like the Waterfall model, Spiral model, and Agile model. etc. However, when I started my career in Freshdesk (A startup back then), it was surprising to find that I couldn’t fit the software development happening there into any one of these theoretical models! It was a hybrid of everything!  Similarly, in our startup Deducely we do not strictly follow any specific model, but the closest model that we could relate to would spiral model. We plan what has to be built, brainstorm the features with our customers and we finally start building one small module at a time, test it, release it and iterate.

  1. Development Platform: We are Linux lovers and to get Linux into our development devices, we use Vagrant. Although this technology might be a tad bit old, we are huge fans of it! It helps us in isolating various development environments. Even if we ruin the configuration of a particular vagrant environment, we can always spin up a new one from a previous snapshot. This gives us the freedom to go and SUDO without caring much about the consequences! We do not use any IDE. Atom is our text editor of choice.

  1. Maintaining the code: Git has become the de-facto version control system for code these days. Its decentralized approach takes some time to master but once you start making Git work for you the benefits it offers is unparalleled. Whilst many bigger companies use Github enterprise version or Atlassian Bitbucket, we went ahead with the Gitlab - an open source, self-hosted version control system with a plethora of features like revision control, continuous integration, a container registry, and an issue tracker.

  1. Tracking the tasks: It all happens with a very simple ToDo board on Trello. We are just two people (Myself and my co-founder Arun Kumar) working full time with a two more remote part-timers. We prioritize tasks and assign it to one person. Before we actually write and integrate different functions, we have a small chat about the function prototype and once the function is coded we run ad-hoc unit tests on the functions. Apart from this, the bugs in our code are tracked using GitLab’s inbuilt bug tracker.

Screen Shot 2016-12-10 at 12.11.29 AM.png
A screenshot of our task board

Screen Shot 2016-12-10 at 12.23.52 AM.png
A screenshot of the issues in one of our repositories

  1. Storing the data: Initially, there was a lot of arguments for using a conventional database like MySQL or PostgreSQL, but we settled with MongoDB because of its loosely typed Schema. All the queries in MongoDB are simple JS function calls and this is a huge advantage for a full stack JS company like ours. Also, say if we had to edit the schema of a MySQL table containing a billion records it would have been a plain disaster, but with MongoDB, we have better control over the schema and data types. Also backing up, and restoring the DB is fairly painless. Plus replication, fault tolerance and disaster recovery are made simple through MongoDBs replica sets. Though the mongoshell is the best way to connect to MongoDB, we prefer robomongo.

Our biggest challenges

  1. Callback Hell: We are Javascript aficionados, especially NodeJS. It gives us the ability to do multiple tasks in parallel instead of waiting for IO. We are a two person company and the NPM registry for NodeJS has a lot of trusted open source third party module that we generously incorporate while development. Also, the community support for NodeJS is very mature. If we face a problem, it can be solved in a matter of a few Google searches. However, writing code free of callback hell is a huge challenge not only with node but with any JS flavor. Callbacks, inside callbacks, makes the code unmaintainable and cluttered. We get around callback hell with a library called Bluebird. It makes code, more maintainable and certain code flows that are easily achievable in other languages aren’t easily possible in JS. Such code flows can be achieved with bluebird.

  1. Sequential execution: We had to scrape data out millions of web pages Javascript gave us a lot of power as it interacted natively with the DOM. However, it had one nasty side effect. In our case, we had to sequentially make millions of HTTP requests. In our initial days, I wrote a snippet to read the list of websites from the database and make those HTTP requests. Since NodeJS is non-blocking controlling the code flow was a huge challenge, and millions of HTTP requests went out but we were never able to get the responses back as these million requests went in one shot! At once! controlling this behaviour of NodeJS was the biggest challenge. JS isn’t very kind to sequential tasks and we work around this restriction via messaging queues - specifically RabbitMQ. It has a powerful API and an easy to use GUI to monitor the status of the queue.

Screen Shot 2016-12-10 at 12.57.51 AM.png

The RabbitMQ dashboard

Data Box
Developer Deducely
Platform Linux
Number of Developers 2 Full time and 2 part time
Length of development 1 year
Lines of Code 10K - 100K

Sunday, December 11, 2016

FSE 2016 Panel: The State of Software Engineering Research

by Matthieu Foucault, Carlene Lebeuf (@CarlyLebeuf), and Margaret-Anne Storey (@margaretstorey)- University of Victoria
Cross Posted from Margaret-Anne Storey's Blog.

The 2016 International Symposium on the Foundations of Software Engineering hosted a panel of prominent software engineering researchers moderated by Margaret-Anne Storey. The slides presented during the panel can be found here
Our panelists:

Tao Xie
University of Illinois at Urbana-Champaign
Tao is an ACM distinguished researcher. His Research focuses on automated software testing, mobile security, and software analytics.

Laurie Williams
North Carolina State University
Laurie is a founder of the Extreme Programming / Agile Conference. Her research focuses on software security, testing, and agile programming.

Peri Tarr
IBM Research
Peri is a principal research staff member at IBM TJ Watson Lab and a technical lead for Cognitive Tools and Methods at IBM. Her research focuses on software composition and aspect oriented software development.

Prem Devanbu
University of California at Davis
Prem started his career as an industrial software developer, then worked at Bell Labs and AT&T Research before beginning to teach at University of California at Davis. His research focuses on empirical software engineering, naturalness of software, and social analytics.

Lionel Briand
University of Luxembourg
Lionel currently leads the Software Verification and Validation Lab at the University of Luxembourg. He strongly advocates that research is practical to industry.
Our panelists were asked to reflect on three questions related to research in software engineering:
  • Do you believe our community as a whole is achieving the right balance of science, engineering, and design in our combined research efforts?
  • What new or existing areas of research do you think our community should pay more attention to?
  • Do you have novel suggestions for how we could improve our research methods to increase the impact of software engineering research in the near and distant future?
Each panelist was asked to briefly present their thoughts on these questions. Then we opened the floor to questions, and the rest of the panel was dedicated to a discussion between panelists and members of the audience. Our summary of the panel discussion focuses on the panelist responses to the three questions posed as well as the themes that emerged from their responses.

Balancing Science and Engineering

A common theme that quickly emerged was the importance of the role of industry in research. To kickstart the group discussion, panelists were asked to reflect on a statement made by Jan Bosch of Chalmers University at a research conference a few weeks earlier:
“Research does not start in universities anymore, it starts in industry.”

A quick show of hands at the conference demonstrated that the majority of people in attendance seemed to agree with Jan Bosch’s claim. Williams also agreed with this statement and expanded it further by stating that “research starts in industry, because that’s the context”. Briand felt that because “we are in a discipline where most of the phenomenon we are studying cannot be reproduced in a lab environment”, as software engineers, “our lab is the industry”.
All panelists agreed that collaborating with practitioners (not only industry, but also open-source communities, governments, etc.) is essential to solve real problems. Williams drew connections between research in software engineering and biology:
“If we try to come up with problems that we think are interesting, that would be similar to a biologist never going outside. We have to go out there and see the problems that they have and then help with it.”
However, even if practitioners are aware of these problems, they may not be able to solve them. Briand mentioned that a lack of expertise and a lack of freedom to look at novel solutions might be to blame. Devanbu observed that researchers have this advantage:
“As a researcher, you can have a broader perspective that spans over several languages, and not only try to generalize observations, but also find effects that are only observable at an ecosystem level. It’s not only a question of freedom, but also of perspective that industrials don’t have because they are not considering different projects at the same time.”
Xie suggested we engage in practitioners in research that is currently outside of their scope:
“If we show [practitioners] things outside of their scope (that in the longer term may be important), they may be more open, and may engage in collaborations with academic researchers, […] along with providing data, problems, or discussions.”
Members of the audience, namely Daniel Jackson (Massachusetts Institute of Technology) and Tom Ball (Microsoft Research), emphasized that a balance is needed, and that looking at basic science should not be left out. Notable examples, such as UNIX, Simula, ALGOL 60, and distributed systems, were not the product of massive empirical studies, but of academic researchers sitting in a room and brainstorming.
The discussion above illustrates the importance of making a conscious effort to reach out to practitioners. However, this is not an easy task and it requires real commitment and patience from researchers, as Peri Tarr mentioned:
“One of the problems that we face all the time as industrial researchers is gaining the trust of the people whose problem we’re going to help them solve […]. It can take months or years to get on the same page with the people who have a problem, to establish that yes, you’re looking for a way to solve their problem that will actually work for them, within, as Lionel points out, their real-world constraints.”
The audience (at FSE and listening to the broadcast) questioned (via Twitter) about our role as researchers and how we collaborate with industry, for example:

More discussion on these questions is needed! We invite you to participate in the blog discussion below.

Paying Attention to Other Areas of Research

In their opening statements, all panelists mentioned other areas of research that our community should look at.
Devanbu mentioned DevOps and IoT as other areas that the SE community tends to neglect. Xie mentioned SE research results that had a broad impact outside of the SE community, such as symbolic execution, delta debugging, or Representational State Transfer (REST). Xie further suggested that we consider more of the societal impact of SE research, advocating for a “bigger social responsibility” for researchers. He referred to the previous day’s keynote from Margaret Burnett about gender inclusiveness of software, and cited David Notkin’s 2013 quote:
“Anybody who thinks that we are just here because we are smart forgets that we’re also privileged, and we have to extend that further. So we have got to educate and help every generation”
Williams addressed the problem of cybersecurity as one of the main challenges for our community, stating that “we haven’t yet provided software engineers the means to write secure code without impacting their own workflow.” She said that software engineering researchers need to “situate [their] work in this world where there is someone working against [them], whether it’s an attacker or someone doing something they aren’t supposed to.” The second research area Williams highlighted was “agile software development on steroids”; the world of continuous integration, continuous deployment, devOps, continuous experimentation, testing in production, etc. We need to explore ways of adopting these practices as well as understand their benefits and the risks they introduce.
Tarr insisted on focusing our research efforts at the intersection of Software Engineering and other “high impact, societally important, value creation areas”, such as health care, environment, cognitive sciences, security and privacy, and education. She said, “In every one of these areas, these people are trying to get new generations of software done, but they don’t know how to do it […]. We desperately need software engineers at the intersection of these areas.” She noted that the traditional areas of software engineering research are now being driven by practitioners and that, as researchers, we are privileged to have the opportunity to take bigger risks that lead to bigger rewards. We “shouldn’t be working in areas where we aren’t afraid to fail”.
Briand considers that, although all topics covered by our community are relevant to practitioners, our “research is largely disconnected from practical engineering needs and priorities” and we “fail to recognize the variations across domains and contexts”. The needs and constraints of people developing software across these varying domains are completely different and “there is no such things as a universal solution to any software engineering problem”. In the domain of software engineering, our working assumptions and contextual factors make a huge difference. Because there is a disconnect from particular needs and priorities, there is a gap in the research literature – “the gap between what I needed and what I could find was significant” – that is too large to deal with.
What are your thoughts on the panelists’ suggestions for future software engineering research directions? Do you agree or disagree? Or do you suggest other areas we should pay attention to, e.g., are there other disciplines we should apply our results to, as the tweet below suggests? Let us know in the discussion below!

Widening Our Vision of What is Research

While discussing whether we need to pay more attention to different areas of research, the panelists were asked to comment on Jane Cleland-Huang’s (University of Notre Dame) tweet regarding fostering more diverse areas of research:

Cleland-Huang commented on her tweet by adding:
“It is easy for us as a community to lock into the same area. For example a lot of people do research that benefits from open-source systems, but other areas [are left out], such as immersive studies in industry, or areas where I do research in, such as requirements and traceability, where datasets are not so available. If we want to make a difference in those areas, what do we, as a community, need to do in terms of the review process and encouraging that kind of research?”
As discussed in the previous sections, if we want to do impactful research, we need to reach out to practitioners and look at the intersection of software engineering and other fields. However, this cannot happen because if, as Xie mentions, we keep a “narrow-minded definition of what is a research contribution”, a large number of papers that may have a high impact on industry will not make it in our venues. Too often our community rejects contributions that look at real-world problems because it’s not research, it’s engineering. A notable example of this is William’s comment regarding her research on agile software development: “My research and the research my lab did was initially rejected by the community because they considered that practitioners shouldn’t do that (using agile methods). But they were doing it, so we have to accept what practitioners are doing on a widespread basis.”
Xie mentioned that “we don’t have enough expertise or experience in the program committees to really judge whether there is a real problem or not.” Tarr furthered this with, “As a community, one of the most important things that we can do is to […] start establishing norms and bars for people who are conducting high risk, important research in important places and are going out into the world to get this information.”
In response to a question posed by Storey regarding how we know when our methods have crossed the line from research to pure engineering, Williams gave the advice that when we are shifting more to the engineering side of things, we should take a step back and reframe the problem in a more scientific way. For example, we can ask “what are the independent variables?” that will allow us to switch to a scientific way of thinking about our research.
This discussion is related to one tweet we received ahead of the panel:

It would seem that our community may need to accept contributions that differ according to their engineering, scientific and design content, but if that is so, do we need to establish different criteria when assessing papers? Jonathan Bell suggested in a tweet that we consider not just evaluation approaches, but also our datasets and tools:

Finally, some discussion that occurred on Twitter suggests we rethink how our community considers negative results and that we look to how other research areas embrace not just positive results:

In summary, we wish to thank the conference organizers for suggesting this panel, and we thank the panelists and the FSE community for participating in this discussion! And we hope to continue the discussion in the comments below!