Sunday, February 26, 2017

Sustainable Software Design

by Martin Robillard, McGill University (@mp_robillard)
Associate Editor: Christoph Treude (@ctreude)

cross-posted from Martin Robillard's blog

There's a lot of interest in understanding software design as an activity. But what happens to the outcome of this activity? At one extreme, what happens on the whiteboard stays on the whiteboard (design is never explicitly captured). At the other extreme, design information is meticulously captured and archived, and then invalidated by the first decent refactoring. So without explicit effort, design knowledge just disappears. The consequence is that, without design information, developers will have to perform ignorant surgery. The temptation for ignorant surgery has to be related to the cost of maintaining accessible design information. When a particular design is expensive to describe and maintain, its description and rationale are at risk of being lost, no matter how awesome the underlying design ideas are.

So why don't we explicitly consider how expensive a particular design decision will be to capture and maintain consistent with the code, before adopting it?

There exist quality models for software design. They include attributes like reusability, flexibility, and understandability. But, as far as I can tell, we don't yet have an attribute that captures how cost-effective it is to describe a set of design decisions over time. That attribute is what I would call sustainability.

There's no techno-fix to make design more sustainable. It's a complex, multi-faceted problem. In this paper I review the areas of software development research and practice that relate to design sustainability and explain why they are not silver bullets. These include:
  • Modularity can contribute to design sustainability when design decisions map to code organization concerns. Unfortunately many design decisions don't have much to do with modular decomposition. There's also the issue that many concerns cannot easily be modularized.
  • The relation between documentation and sustainable design is paradoxical. On one hand, good documentation can help sustain a design over time. On the other, documentation is expensive, which is a direct factor of unsustainability. The idea of fully self-documenting systems is appealing, but impractical. Parnas and Clements compared it to the Philosopher's Stone.
  • Programming language constructs are another tool to manage sustainability. For example, assert statements are a cheap and relatively user-friendly way to capture simple design rules and assumptions. There's also research to develop language support for stating and verifying more complex design-level properties, such as immutability. The related challenge for design sustainability is that language-supported specifiable properties form a closed set of low-level concerns, whereas the set of possible design decisions is open and ranges over different abstraction levels.
  • Design patterns offer a natural map between parts of a system and a set of design rules and even their rationale. The major limitation here is that by definition, patterns are solutions to common problems, whereas there are many idiosyncratic design problems in software projects.
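To illustrate the point about assert statements in the list above, here is a minimal sketch of design rules recorded directly in code. The `Account` class and its rules are hypothetical and chosen only for illustration; they do not come from the paper.

```python
class Account:
    """Toy class (hypothetical) whose design rules are recorded as asserts."""

    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        # Design assumption: callers only ever deposit positive amounts.
        assert amount > 0, "deposit amount must be positive"
        self.balance += amount

    def withdraw(self, amount):
        # Design assumption: callers check for sufficient funds first.
        assert 0 < amount <= self.balance, "caller must check funds before withdrawing"
        self.balance -= amount
        # Design rule: the balance never becomes negative.
        assert self.balance >= 0, "balance invariant violated"
```

The asserts cost almost nothing to write and stay next to the code they describe, but they can only express simple, local rules, which is the limitation noted above. Note also that Python skips assert statements when run with the `-O` flag.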
So, how do we move towards sustainable design? Maybe we can draw a lesson from gardening. At this point in our history we have the technology to grow anything anywhere. But the best way for a garden to stay alive with minimum effort is to select plants that are a good match for a specific environment. Likewise with design decisions. Different projects and systems have characteristics that are the equivalent of different types of soil, luminosity conditions, humidity. So we have to figure out how to select and nurture the design decisions that will thrive in these conditions.


Monday, February 20, 2017

Anticipating Cross-Layer Attacks in Software Systems

By: Eunsuk Kang (eunsuk.kang@berkeley.edu), Aleksandar Milicevic (almili@microsoft.com), and Daniel Jackson (dnj@mit.edu)

Associate Editor: Mehdi Mirakhorli (@MehdiMirakhorli)

Abstraction is one of the most fundamental techniques in software design, but can be a double-edged sword, especially in systems where security is a major concern. Most systems are too complex to reason about all at once, and so developers tend to focus on one aspect of a system at a time, and ignore irrelevant details. However, an attacker need not respect the same abstraction boundaries, and may exploit details across multiple layers of a system.

One well-known example of this type of security risk can be found in OAuth, a popular authorization protocol. Many web-based implementations of OAuth have been shown to be vulnerable to attacks [4], despite the fact that the protocol itself was subjected to rigorous analyses (including formal verification [2][5]). In hindsight, this is not too surprising, since the protocol is a set of abstract rules that omits details about an underlying platform---deliberately so, since it makes the protocol reusable across multiple platforms! At the same time, it is exactly some of these details (e.g., various browser features and interaction with malicious agents on the web) that allowed an attacker to compromise the security guarantees that the protocol is designed to provide.

In general, security is a cross-cutting concern that cannot be easily contained within a single abstraction of a system. A security guarantee established at one layer may no longer hold once the system is elaborated with details during an implementation phase. Currently, it remains the developer’s burden to ensure that the high-level guarantee is preserved in the implementation---a challenging task even for those with security expertise, due to the complexity of modern platforms such as web browsers.

How do we reason about potential attacks across multiple abstractions of a system? Can we anticipate and address such attacks proactively, before building an implementation? How much detail about the underlying platform do we need to include to perform this analysis?

Poirot is a security analysis tool designed to help developers proactively detect what we call cross-layer vulnerabilities [3]. The tool takes three types of inputs: (1) a pair of models that describe a high-level design and a low-level platform, (2) a desired security property (typically expressed over the high-level model), and (3) a representation mapping, which describes how entities from the high-level model are to be represented in terms of their low-level counterparts. Given these inputs, Poirot exhaustively analyzes potential interactions between the two layers and produces scenarios that describe how an attacker may exploit some of these interactions to undermine the security property. The analysis can be carried out incrementally: Starting with an abstract model that represents an initial design of the system, the designer can elaborate a part of the model with a choice of representation, transforming the model into a more detailed one.

Figure 1: Partial mapping from abstract Add to HTTP request.

A key aspect of Poirot is the representation mapping, which allows the developer to specify decisions about how an abstract design is to be implemented using concrete primitives. For example, when designing an online shopping cart, one may define an operation named Add, which corresponds to the action of adding a new item to a customer’s shopping cart. At the abstract design level, this operation contains two arguments, as shown in Figure 1: the identifier of the item to be added (i), and a token that represents a customer’s credential (t). In order to deploy the shopping application onto the HTTP protocol, our developer must eventually decide how the two parameters from Add are to be mapped to their counterparts in a concrete HTTP request.

In an early design stage, however, the developer may possess only partial knowledge about the system, and some of these decisions may be unknown. Poirot allows the representation mapping to be only partially specified, allowing the developer to express her uncertainty about design decisions and systematically explore different candidate mappings. For instance, Figure 1 depicts a partial mapping specification that lists only the origin and path of the Add URL, leaving unspecified how the item ID and token will be transmitted as part of a request; this, naturally, yields a space of possible mappings, each leading to a different implementation of the shopping cart (and each with its own security vulnerabilities).
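The idea that a partial mapping yields a space of complete mappings can be sketched in a few lines. This is only an illustration under stated assumptions: the list of carriers, the origin and path values, and the toy check are all made up here, and none of it reflects Poirot's actual input language or analysis.

```python
from itertools import product

# Hypothetical places where an abstract parameter could travel in an HTTP request.
CARRIERS = ["query", "body", "cookie", "header"]

# The decided part of the mapping: only origin and path are fixed.
# The concrete values are invented for illustration.
PARTIAL_MAPPING = {"origin": "https://shop.example", "path": "/add"}

# Abstract parameters of Add that the partial mapping leaves open.
OPEN_PARAMS = ["item_id", "token"]


def candidate_mappings(partial, open_params, carriers):
    """Yield every complete mapping consistent with the partial one."""
    for choice in product(carriers, repeat=len(open_params)):
        mapping = dict(partial)
        mapping.update(zip(open_params, choice))
        yield mapping


def exposes_token_in_url(mapping):
    """Toy check (an assumption, not Poirot's analysis): token sent in the URL."""
    return mapping["token"] == "query"


all_mappings = list(candidate_mappings(PARTIAL_MAPPING, OPEN_PARAMS, CARRIERS))
flagged = [m for m in all_mappings if exposes_token_in_url(m)]
print(len(all_mappings), "candidate mappings,", len(flagged), "flagged")
```

With two open parameters and four carriers there are 16 candidates. A real analysis would check each one against the security property and the platform models rather than against a single hard-coded rule.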

Another important part of Poirot is the library of domain models that together describe a platform---a collection of generic components, data structures, and libraries that are used to implement an application. In our example, this library would include generic models of various components of the Web, such as a web server, the HTTP protocol, a browser, and its various features (cookie handling, page rendering, scripts, etc.). Once constructed by domain experts, these models should be reusable for analysis of multiple systems in the same domain. Reusability is crucial for Poirot: If each developer had to write these models for every system to be analyzed, it would simply take too much effort! Thanks to our flexible composition mechanism [3], this library is also easily extensible, in that fresh knowledge about a feature or newly discovered vulnerability can be encoded as a separate model and inserted into the library for later use.

Poirot is most effective when applied early in a development process. As a case study, we had an opportunity to work with a startup called HandMe.In, which was building an online system for tracking personal items. In collaboration with the lead developer, we applied Poirot to discover a number of potential security vulnerabilities in the system, resulting in significant changes to the design. In another study, we used Poirot to model and analyze the security of IFTTT, an application that allows an end user to automate web services, and discovered a previously unknown attack that exploited an interaction between the IFTTT protocol and a browser vulnerability called login CSRF. Many of the attacks generated by Poirot exploited system details at multiple levels of abstraction, and would not have been found if the analysis were confined to a single layer.

There are examples of cross-layer vulnerabilities in a number of other domains besides web security. For instance, a program in a high-level language (e.g., Java) may inadvertently expose private data when translated into a low-level representation (bytecode) [1]. We are currently investigating whether Poirot can be applied to these types of domains as well. In addition, we are building an extension that will allow Poirot to not only produce potential vulnerabilities, but also generate a representation mapping that preserves a desired security property across multiple layers, by leveraging techniques from program synthesis.

Read more about Poirot in our FSE paper [3], or try out the tool at https://eskang.github.io/poirot/


References

[1] M. Abadi. Protection in programming language translations. In International Colloquium on Automata, Languages and Programming (ICALP), 1998.
[2] S. Chari, C. S. Jutla, and A. Roy. Universally Composable Security Analysis of OAuth v2.0. IACR Cryptology ePrint Archive, 2011:526, 2011.
[3] E. Kang, A. Milicevic, and D. Jackson. Multi-representational security analysis. In International Symposium on the Foundations of Software Engineering (FSE), 2016.
[4] S. Sun and K. Beznosov. The devil is in the (implementation) details: an empirical analysis of OAuth SSO systems. In ACM Conference on Computer and Communications Security (CCS), 2012.
[5] X. Xu, L. Niu, and B. Meng. Automatic verification of security properties of OAuth 2.0 protocol with cryptoverif in computational model. Information Technology Journal, 12(12):2273, 2013.

Monday, February 13, 2017

Network Science Offers New Perspective on Developer Roles in Open Source Software

By: Mitchell Joblin, Siemens AG, Erlangen, Germany (@mitchelljoblin)
Sven Apel, University of Passau, Germany (@SvenApel)
Associate Editor: Bogdan Vasilescu, Carnegie Mellon University. USA (@b_vasilescu)


Introduction

Software development is a distributed, collaborative effort, wherein different contributors play different roles, e.g., depending on their ability, experience, history, and position within a project. Research at the University of Passau, Siemens Corporate Technology, and the Technical University of Applied Sciences Regensburg provides novel quantitative insights into the properties and relevance of developer roles and their change over time. The insights could provide guidance on creating optimal governance and coordination structures for distributed software projects.

In open source software in particular, two main groups stand out: a small fraction of developers (the core group) responsible for performing the vast majority of work, assisted by a larger group of less active contributors (the peripheral group), involved in smaller or shorter activities. The two groups should be in balance. A large group of peripheral contributors is critical for ensuring software quality (recall Linus’s law "given enough eyeballs, all bugs are shallow"), but this imposes increased coordination effort on the core developers.

Our software analytics framework Codeface has substantially lowered the technical barrier to performing sophisticated in-depth analyses on big software repository data. As part of a recent empirical study, we provide developer role classification based on data automatically extracted from software repositories (e.g., version-control systems, mailing lists). Our techniques substantially extend rudimentary indicators based on counting lines of code or number of commits. These simple measures—although they are common in research and practice—carry significant uncertainty regarding their validity when it comes to capturing nuances of developer roles, and they are extremely limited with respect to inter-developer relationships. For example, the bare number of lines of code a developer has contributed tells us very little about how regularly they make contributions, how they are organizationally embedded within the project, or the degree of influence they may have on other developers.

Developer Networks

A developer network is a relational abstraction of the social and technical activities performed by developers. An example developer network for the open-source project QEMU is shown in Figure 1. Such networks are accurate in reflecting developer perception, and they reveal important functional substructure, or communities, with related tasks and goals [1, 3]. It has been shown that they evolve over time according to a number of fundamental changes in the network’s structural properties [2].

In a recent study [4], we explored the use of developer networks for classifying developer roles. In a nutshell, we found that they are superior in obtaining developer classification information compared to basic counts of individual developer activities, which we explain next.

Figure 1. The QEMU developer network emphasizing the community structure. Each node represents a developer and edges indicate interrelated development activity. The uniquely colored boxes enclosing developers represent communities of developers that work on strongly interrelated tasks. The pie chart for each node represents the amount of development activity associated with each community. The expected level of influence for each developer is reflected by the node size.

Empirical Study

As a first step, we performed a large-scale empirical study to obtain a ground truth on classifications for roughly 1000 developers. We used the data to validate (statistically and empirically) that using information from developer networks leads to more accurate results and greater practical insights regarding the differences between core and peripheral developers than simpler approaches (e.g., based on counting commits). As a key result, we found that core developers are positioned in the organizational structure distinctly from peripheral developers. In particular, the core developer group differs from the peripheral group in two ways: (1) hierarchy and (2) stability.

Figure 2: Clustering coefficient versus node degree for developers of QEMU. Each point on the scatter plot represents one developer. The linear dependence is indicative of an organizational hierarchy with core developers at the top and peripheral developers underneath.

Hierarchy

To mathematically detect if hierarchy exists in a developer network, we need to inspect the dependence between the so-called clustering coefficient and the degree of nodes in the network [5]. The hierarchical relationship for QEMU is shown in Figure 2; there is an obvious linear dependence between the log node degree and log clustering coefficient. Core developers that exhibit a high degree (i.e., who coordinate with many other developers) are seen to exclusively have a very low clustering coefficient (i.e., have neighbors that are only loosely interconnected) and are indicative of developers in a leadership role. In comparison, peripheral developers are seen as low degree nodes having consistently higher clustering coefficients. A developer’s position in the developer network is an organizational manifestation of their particular role. This and similar results may hold great potential for practice, for instance, when the agreement between prescribed and actual social organization needs to be ascertained and optimized, or when the organizational structure of projects is devised along the lines of established, successful projects.
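The hierarchy test described above can be sketched with plain Python. The sketch computes each node's degree and local clustering coefficient, then fits a least-squares slope of log clustering coefficient against log degree. The toy network and all function names are assumptions for illustration; this is not Codeface code, and the study's actual statistical procedure may differ.

```python
from itertools import combinations
from math import log


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_and_clustering(adj):
    """Map each node to (degree, local clustering coefficient)."""
    stats = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            stats[node] = (k, 0.0)
        else:
            # Count the links that exist among the node's neighbours.
            links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
            stats[node] = (k, 2.0 * links / (k * (k - 1)))
    return stats


def loglog_slope(stats):
    """Least-squares slope of log(clustering) versus log(degree)."""
    pts = [(log(k), log(c)) for k, c in stats.values() if k > 0 and c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two usable nodes")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("need at least two distinct degrees")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx


# Toy network (made up): one core developer linked to six peripheral
# developers, who otherwise only work in pairs.
peripherals = ["p1", "p2", "p3", "p4", "p5", "p6"]
edges = [("core", p) for p in peripherals]
edges += [("p1", "p2"), ("p3", "p4"), ("p5", "p6")]

stats = degree_and_clustering(build_adjacency(edges))
slope = loglog_slope(stats)
print(stats["core"], stats["p1"], slope)
```

In this toy network the high-degree node has a low clustering coefficient and the low-degree nodes have a high one, so the fitted slope is negative, which is the qualitative pattern the text associates with hierarchy.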

Stability

A stable developer is one that maintains consistent participation in the project over a substantial period of time. The results of examining developer stability over one year of development for QEMU are shown in Figure 3. In this figure, the transition probabilities between developer states are shown in the form of a Markov chain. The primary observation is that developers in a core role are substantially less likely to transition to the absent state (i.e., leave the project) or isolated state (i.e., have no neighbors in the developer network by working exclusively on isolated tasks), which is in contrast to developers in a peripheral role. Based on this result, we can conclude that the core developers represent a more stable group than peripheral developers.
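A minimal sketch of how such transition probabilities could be estimated from per-developer state histories follows. The developer names, the histories, and the function name are invented for illustration; the study's actual data and estimation procedure are not reproduced here.

```python
from collections import Counter, defaultdict


def transition_probabilities(histories):
    """Estimate Markov-chain transition probabilities from state sequences.

    `histories` maps a developer to the list of states observed in
    consecutive development cycles.
    """
    counts = defaultdict(Counter)
    for states in histories.values():
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    probs = {}
    for current, row in counts.items():
        total = sum(row.values())
        probs[current] = {nxt: n / total for nxt, n in row.items()}
    return probs


# Made-up histories over four development cycles.
histories = {
    "alice": ["core", "core", "core", "core"],
    "bob": ["peripheral", "absent", "peripheral", "isolated"],
    "carol": ["peripheral", "peripheral", "core", "core"],
}

probs = transition_probabilities(histories)
for state, row in probs.items():
    print(state, row)
```

Each row of the result sums to one. Comparing the probability of moving from the core state to the absent or isolated state with the same probability for the peripheral state gives the kind of stability comparison described above.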

Figure 3: The developer-group stability for QEMU shown in the form of a Markov Chain. Edges are labeled with the probability that a developer in one state transitions to the next state in the following development cycle. A few less important edges have been omitted for visual clarity.

Final Remarks

Developer networks present a unique opportunity to capture insights about developer roles that are not visible in simpler, yet ubiquitously used representations. Inferring and modeling the inter-developer relationships leads to greater accuracy in capturing the real world, and it delivers deeper insights into the organizational structure of a software project. Our data and analysis suggest that peripheral developers are extremely reliant on core developers, especially because peripheral developers tend not to associate with other peripheral developers. If a core developer becomes overwhelmed by the need to oversee the activities of too many peripheral developers, a likely consequence is that effective coordination will not occur. If this is a pervasive phenomenon in a project's organizational structure, it may lead to severe degradation of the software architecture and have a negative impact on the overall source code quality. Our results, which can be immediately used by practitioners thanks to the associated freely available software framework Codeface, provide novel means for avoiding these traps.

References

[1] C. Bird, D. Pattison, R. D’Souza, V. Filkov, and P. Devanbu. Latent social structure in open source projects. In Proc. International Symposium on Foundations of Software Engineering, pages 24–35. ACM, 2008.
[2] M. Joblin, S. Apel, and W. Mauerer. Evolutionary trends of developer coordination: A network approach. Empirical Software Engineering, 2017. To appear.
[3] M. Joblin, W. Mauerer, S. Apel, J. Siegmund, and D. Riehle. From developer networks to verified communities: A fine-grained approach. In Proc. International Conference on Software Engineering, pages 563–573. IEEE, 2015.
[4] M. Joblin, S. Apel, C. Hunsen, and W. Mauerer. Classifying Developers into Core and Peripheral: An Empirical Study on Count and Network Metrics. In Proc. International Conference on Software Engineering, 2017. To appear.
[5] E. Ravasz and A.-L. Barabasi. Hierarchical organization in complex networks. Physical Review E, 67(2), 2003.

Sunday, February 5, 2017

Empowering Users to Build IoT Software with a Puzzle-like Environment

by Yijun Yu, Pierre A. Akiki, Arosha K. Bandara
Associate Editor: Christoph Treude (@ctreude)


Jigsaw puzzle pieces serve as building blocks that allow children as well as adults to put together impressive pictures. Software engineers follow a similar approach to build complex programs. Since David Parnas' modularization conceptualization (Parnas, 1972), many kinds of software modules in the form of "building blocks" have been proposed and adopted. These include functions, objects, remote procedures, web services, cloud services, and microservices, to name a few. However, from a developer’s perspective, it is hard to achieve the exact requirements of end-users. Therefore, it would be useful to empower end-users with the capability to adapt a software product to their individual needs (Lapouchnian et al., 2006).

Internet of Things (IoT) devices and services can be configured to work together in many different ways. This configuration can be performed by end-users and is considered an end-user development challenge. Millions of school kids nowadays are taught programming concepts using end-user development environments such as Scratch, which has also been adapted into the Sense environment used to teach entry-level computing at The Open University (Kortuem et al., 2013). Can't we use a similar environment to teach end-users to develop software for IoT?

Akiki's recent research on Visual Simple Transformations (ViSiT) solves this problem by empowering end-users to wire IoT devices and services (Akiki et al., 2017). For example, end-users can use puzzle pieces to implement a transformation (Figure 1) that allows a Microsoft Xbox controller to communicate with a Lego Mindstorms robot. This paradigm is familiar to anyone who is used to programming with an environment like Scratch. Each puzzle piece is a visual block that connects to its neighbouring pieces in a similar way. Intuitively, puzzle pieces bind to each other through predefined sockets. The parameters of these building blocks provide the required concretization that matches with the configuration parameters of IoT devices and services.
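The way puzzle pieces bind through predefined sockets can be sketched as a small pipeline in which each piece declares what it accepts and what it produces. Everything in this sketch is hypothetical: the `Piece` class, the socket names, and the controller and robot values are illustrative assumptions, not ViSiT's actual model or API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Piece:
    """A hypothetical puzzle piece with one input and one output socket."""
    name: str
    accepts: str   # socket type on the input side
    produces: str  # socket type on the output side
    run: Callable[[Any], Any]


def snap_together(pieces):
    """Compose pieces into one transformation, checking that sockets fit."""
    for left, right in zip(pieces, pieces[1:]):
        if left.produces != right.accepts:
            raise TypeError(f"{left.name} -> {right.name}: socket mismatch")

    def transformation(value):
        for piece in pieces:
            value = piece.run(value)
        return value

    return transformation


# Made-up pieces: read a stick position, scale it, build a motor command.
read_stick = Piece("read stick", "controller_event", "axis",
                   lambda event: event["left_stick_y"])
scale = Piece("scale to power", "axis", "power",
              lambda axis: round(axis * 100))
drive = Piece("drive motor", "power", "robot_command",
              lambda power: {"motor": "A", "power": power})

xbox_to_robot = snap_together([read_stick, scale, drive])
command = xbox_to_robot({"left_stick_y": 0.5})
print(command)
```

Pieces whose sockets do not match are rejected when they are snapped together, before any value flows through them, which mirrors the way physical puzzle pieces only connect where their shapes fit.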

Figure 1. An example transformation that connects an Xbox controller to a Lego Mindstorms robot

ViSiT’s underlying service-oriented code implements the transformations as executable workflows and composes them into a holistic application for IoT. Although this work primarily targets end-users, software developers can modify the executable workflows using Cedar Studio. This tool was originally developed as part of Akiki's work on adaptive user interfaces (Akiki et al., 2014; Akiki et al., 2016).

By empowering end-users and lowering the barrier for software development, the creation of real-life IoT applications such as the robot shown in Figure 2 comes within the reach of a wider audience.

Figure 2. A shooting robot can be controlled using an Xbox controller as a result of applying visual simple transformations

Testing ViSiT with a number of end-users showed that it lowers the acceptance barrier and facilitates connecting IoT devices and services. A video demonstrating ViSiT and its supporting development environment can be viewed at: http://bit.ly/ViSiT

With the evolution of such development paradigms and their supporting tools, end-users will have an ever-growing role to play in software development. As the adoption of end-user development environments grows, someday the data learnt from these environments could be used to train artificial intelligence to automatically compose software systems.

References