Sunday, January 31, 2016

Using Biometrics to Assess a Developer's Cognitive and Emotional States

by Sebastian Müller (@smueller9) and Thomas Fritz
Associate Editor: Christoph Treude

What difficulties and emotions do software developers experience during their work? When is a developer "in the flow" and making a lot of progress? So far, little is known about these kinds of questions or about the cognitive and emotional states of developers at work. Researchers have predominantly focused on the tangible output of developers' work, such as the source code, to identify bug-prone parts. Yet, a better understanding of what developers experience in their work would allow us to provide better and instantaneous support to developers, for instance by avoiding costly interruptions when a developer is in the flow, or by intervening when a developer perceives particularly high difficulty and might be close to introducing a bug into the code.

Recent advances in biometric (aka psychophysiological) sensor technology, in combination with research in psychology, show promise that biometric data can fill this gap and provide answers to some of these kinds of questions. For instance, measures of a person's skin conductivity (aka electrodermal activity, EDA)—how much a person is sweating—can tell us something about the person's stress level and cognitive effort, but also about perceived emotions.

In our research, we investigate the use of biometric measurements in the context of software engineering. In particular, we conduct studies ranging from short, controlled lab studies to field studies several weeks long, in which we examine the link between a developer's biometric measures and cognitive and emotional aspects of their work on a change task, such as the perceived difficulty of source code elements or tasks, and the emotions and progress the developer experiences. Our vision is to use biometric measurements to gain a better understanding of what developers experience in their work and to use this improved understanding to better support developers.

To study the potential of biometrics to predict developers' perceived progress and emotions (valence), for example, we conducted a lab study [1] with 17 participants from two companies and one university. For this study, we collected biometric data, such as EDA and the heart rate of participating developers, while they were working on two different change tasks. At the same time, we asked developers at periodic intervals to self-report their perceived progress and emotions, specifically valence and arousal (on two scales from -200 to +200). The following figure highlights some of the collected data, with the biometric EDA and heart rate data at the top and the self-reported valence and progress at the bottom.

The figure also illustrates a visible difference in the biometric data, in particular in the EDA signal, between the first episode, which was rated as medium progress and higher valence, and the last episode, in which the developer reported being stuck and a lower valence. To analyze the data, we performed several steps of data cleaning, noise filtering, and feature extraction before feeding the resulting features into a machine learning classifier. The results of our analysis show that, in most cases, biometric data can be used to distinguish between positive and negative emotions, as well as between phases of low and high progress.
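As a rough sketch of such a pipeline (the window length, features, and classifier below are illustrative assumptions, not the study's actual setup), one could summarize each self-report window with a few statistics and train an off-the-shelf classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def extract_features(eda, hr):
    """Summarize one window of raw EDA and heart-rate samples.
    These five statistics are illustrative; the study used a richer set."""
    return [np.mean(eda), np.std(eda),
            np.max(eda) - np.min(eda),   # peak-to-peak EDA amplitude
            np.mean(hr), np.std(hr)]

# Synthetic windows labeled with self-reported valence (0 = negative, 1 = positive).
rng = np.random.default_rng(0)
X = [extract_features(rng.normal(label, 1, 60), rng.normal(70 + 5 * label, 2, 60))
     for label in (0, 1) for _ in range(50)]
y = [label for label in (0, 1) for _ in range(50)]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0,
                                          stratify=y)
acc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(acc)
```

On real signals, the cleaning and noise-filtering steps that precede feature extraction matter at least as much as the choice of classifier.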

Overall, the results of our three studies ([1], [2], [3]) show that we are able to predict aspects of developers' cognitive and emotional states in real-time using biometrics. The results also show that we are able to make accurate predictions in the field in the common work environment of professional software developers and over multiple days. Not surprisingly, biometric data is very sensitive to the individual and certain biometric measures work better for some than for others. By using an approach that takes into account these individual differences more explicitly, we should be able to further improve our results in the future.

So what can we do with these insights? Similar to other metrics, such as code metrics or interaction metrics, biometrics can be leveraged to provide better support to developers, even in real-time. The results of our latest study [3] show that we can use biometrics to predict where quality concerns are occurring or will occur while developers are still working on the change task, and that the biometrics even outperform more traditional metrics. These insights can now be used to enhance existing tool support, such as the automatic detection of quality concerns, to highlight and rank places in the code that should be reviewed and/or refactored, or even to indicate to developers when they should reconsider a change, preventing them from ever submitting erroneous code to the repository. Similarly, we can use biometrics to provide better interruption management support for developers or to provide recommendations when a developer perceives difficulty, ranging from talking to a coworker to taking a break.

In summary, the results of our studies are promising and provide initial evidence that biometrics can be used to assess cognitive and emotional states of a developer and thus help answer some of the open questions. As one example, our results show that we might be able to support developers when they are experiencing difficulties and help them to fix quality concerns early on. With the rapid and continuous advances in sensor technologies and the increasing pervasiveness of smart wearable devices that already collect biometric data, we might soon be able to leverage this kind of data for each professional software developer and fine-tune our classifiers to further increase their accuracy. At the same time, this opens up many ethical and privacy concerns that will have to be addressed and investigated in further studies.


  1. Sebastian Müller and Thomas Fritz. Stuck and Frustrated or In Flow and Happy: Sensing Developers' Emotions and Progress. In Proceedings of the 37th International Conference on Software Engineering (ICSE 2015), Florence, Italy.
  2. Thomas Fritz, Andrew Begel, Sebastian Müller, Serap Yigit-Elliott, and Manuela Züger. Using Psycho-Physiological Measures to Assess Task Difficulty in Software Development. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014), Hyderabad, India.
  3. Sebastian Müller and Thomas Fritz. Using (Bio)Metrics to Predict Code Quality Online. In Proceedings of the 38th International Conference on Software Engineering (ICSE 2016), Austin, USA.


Wednesday, January 27, 2016

Fast and correct incremental building with pluto

by Sebastian Erdweg, TU Darmstadt, Germany (@seba0_)
Associate Editor: Sarah Nadi, TU Darmstadt, Germany (@sarahnadi)

Build systems are used in all but the smallest software projects to invoke the necessary build tools on the right files in the right order. A build system must satisfy two central properties. First, a build system must guarantee correct builds, which means that generated artifacts consistently reflect the latest source artifacts. Second, a build system must be efficient, which means that it incrementally rechecks and rebuilds as few build steps as possible.
Illustrative example
Consider the following build scenario for a Java tool that uses ANTLR for parsing and preprocessing Java source code.
  1. Checkout source code from tool's remote repository.
  2. Extract Java version targeted by the tool from a configuration file.
  3. Build ANTLR parser for that Java version.
    1. Download ANTLR binary.
    2. Download ANTLR grammar for correct Java version.
    3. Run ANTLR binary to compile the ANTLR grammar into Java code.
    4. Compile Java code.
  4. Compile tool's source code against parser.
  5. Package tool together with parser into JAR file.
There are many source artifacts involved in this build, such as the code in the remote repository, the code in the local working copy, the configuration file, the ANTLR binary, or the ANTLR grammar. Correctness requires that the build system rebuilds the generated artifacts whenever any one of these source artifacts changes. Efficiency requires that the build system reexecutes as few build steps as possible. For example, if the code in the remote repository changes but the Java version in the configuration file remains unchanged, the build system should only execute steps 1, 2, 4, and 5. Step 2 is needed to detect that the Java version indeed did not change. Note that the configuration file may require non-trivial parsing to determine the targeted Java version.
Dynamic dependencies needed!
Almost all existing build systems fail on this example and provide either inefficient or incorrect building. The reason is that step 3 should only be executed if the Java version in the configuration file changes. Therefore, it is only possible to determine whether to execute step 3 after executing steps 1 and 2. For this reason, the build script must install a dynamic dependency on step 3. In contrast to static dependencies that are hard-coded in the build script, a dynamic dependency is installed and checked while the build script runs. Existing build systems only support static dependencies, which they use to construct a schedule for the execution of build steps. Since the build schedule cannot change during build execution, existing build systems either inefficiently execute step 3 even when the Java version does not change, or incorrectly execute step 3 only on the first run, reusing the parser even when the Java version changes. Dynamic dependencies allow us to decide whether to rebuild the parser in step 3 after the execution of step 2 has finished.
Dynamic dependencies are necessary for the example scenario above. But, even when dynamic dependencies are not necessary, it is easier to derive precise dynamic dependencies than it is to safely approximate static dependencies. For example, when compiling the tool's source code in step 4 above, the code repository may contain source files that are not used by the tool's implementation. With static dependencies, we have to explicitly select required source files in the build script or apply a static dependency analyzer to generate this information before the build execution starts. With dynamic dependencies, we can simply start the Java compiler and inspect its verbose output to determine the set of required source files and to install dynamic dependencies on them. In practice, dynamic dependencies make precise dependency analysis a lot easier and yield less overapproximation (fewer superfluous dependencies) and less underapproximation (fewer missing dependencies).
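For instance, a build step can discover its file dependencies only while it runs, by scanning the compiler's verbose output. The `[loading ...]` line format below loosely mimics `javac -verbose` and is an assumption for illustration, not pluto's actual implementation:

```python
import re

def discover_dependencies(verbose_output):
    """Collect the source files a compiler run actually touched, by parsing
    its verbose output. The '[loading ...]' / '[parsing started ...]' line
    shapes are illustrative stand-ins for javac's verbose log."""
    deps = set()
    for line in verbose_output.splitlines():
        m = re.match(r"\[(?:loading|parsing started) (?P<path>[^\]]+)\]", line)
        if m:
            deps.add(m.group("path"))
    return deps

output = """\
[parsing started src/Tool.java]
[loading src/Parser.java]
[loading lib/antlr.jar(org/antlr/Tool.class)]
"""
deps = discover_dependencies(output)
print(sorted(deps))
```

Each discovered path would then be registered as a dynamic file dependency of the compilation step, so the next build re-checks exactly those files.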
Dynamic dependencies in the pluto build system
We have developed a new build system called pluto that supports dynamic build dependencies. In pluto, each build step consists of a build function that can declare dependencies on files and other build steps during its execution through a simple API. When a build step finishes, pluto stores the dependency list of the build step to detect changes of required artifacts in subsequent runs. However, in contrast to other build systems, pluto combines change detection and the execution of build steps into a single traversal of the dependency graph. This traversal only triggers the reexecution of those build steps that (i) are still required and (ii) depend on a file that was changed since the previous build. We have formally verified the correctness and optimality of pluto's incremental rebuilding algorithm.
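The traversal can be sketched as follows. This is a drastic simplification that tracks only file stamps; pluto's real algorithm also handles step outputs, dynamically installed dependencies, and cycles, and this is not its actual API:

```python
class Step:
    """A build step, with the files it reads and the substeps it requires."""
    def __init__(self, name, files, substeps=()):
        self.name, self.files, self.substeps = name, files, substeps

def build(step, mtime, stamps, executed):
    """Single traversal combining change detection and execution: a step
    reruns only if a recorded file stamp changed or it never ran before."""
    for sub in step.substeps:                  # re-check required substeps first
        build(sub, mtime, stamps, executed)
    new = {f: mtime[f] for f in step.files}
    if stamps.get(step.name) != new:
        executed.append(step.name)             # "execute" the step
        stamps[step.name] = new

# First build runs everything; afterwards, touching only the repository
# reruns "checkout" while "parser" (which reads only the config) is skipped.
mtime = {"repo": 1, "config": 1}
parser = Step("parser", ["config"], substeps=[Step("checkout", ["repo"])])
stamps = {}
first, second = [], []
build(parser, mtime, stamps, first)
mtime["repo"] = 2
build(parser, mtime, stamps, second)
print(first, second)   # ['checkout', 'parser'] ['checkout']
```

The point of the single traversal is that change detection and reexecution happen in one pass over the dependency graph, instead of a separate "compute schedule, then run it" phase.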
We have developed pluto as a Java API where users can define build steps by implementing a simple Java interface. For illustration, the script of build step 3 of our example is available online for inspection. Using a full-fledged programming language like Java for build scripts has various positive design implications. First and maybe most importantly, Java makes it possible to debug a build script using any Java IDE. Second, build-script authors can use all features of Java and are not bound to a low-level language such as Bash. Third, our API makes it easy to parameterize build steps over run-time Java values. Parameterized build steps can often be reused across build scripts. Fourth, the type system of Java provides a statically typed interface for using build steps and ensures that build steps are invoked on well-typed arguments only.
In summary, pluto provides the following features:
  • Dynamic dependencies on files and other build steps.
  • A Java API for debuggable and reusable build steps.
  • Dependency injection into build steps to avoid hard-coded dependencies.
  • Semantic file stamps for avoiding rebuilds after irrelevant file changes.
  • Automatic metadependencies that ensure a rebuild after the build script changes.
  • Detection of build cycles and customizable cycle handling.
pluto is fast
We have used pluto to develop build scripts for a few projects so far. In particular, we have migrated the existing Ant build script of the Spoofax IDE to pluto. Using pluto, we achieved speedups of 4x-50x for rebuilding after a typical program change. The speedup depends on how many downstream build steps a file change invalidates. The fine-grained dependency tracking of pluto also revealed hidden dependencies in the original build script that led to incorrect rebuilds. Pluto automatically detects such hidden dependencies and guarantees correct rebuilds in all cases.
We have used pluto to develop numerous reusable build steps that perform fine-grained dependency tracking out of the box. The provided build steps range from Java compilation and JUnit testing to dependency resolution via git and Maven. Our experience shows that it is easy to develop new build steps and to make them reusable.
Expect more to come
We are working hard to make pluto more powerful and easier to use. We expect to provide the first binary release in the near future. In the meantime, you can find the source code of pluto and numerous reusable build steps on GitHub. Feature requests and contributions from the community are more than welcome.

Sunday, January 24, 2016

Diversity in software teams: buzzword or necessity?

Associate Editor: Bogdan Vasilescu, University of California, Davis, USA (@b_vasilescu)
"I've had some bad experiences, particularly based on my gender. [...] a lot of time I'll join an online team and sometimes the other team members are very rude. In the past I have used a fake GitHub handle (my normal GitHub handle is my first name, which is a distinctly female name) so that people would assume I was male. I wouldn't lie if asked directly about my gender, but I know a lot of people assume by default others on the internet are male, and so I use a unisex handle [...] so people assume I am male without asking. I'm not happy with this solution, but sometimes I really want to participate [...] and I don't have the energy to defend myself for being female." [open source developer, USA]
This is not a singular occurrence. Despite its core meritocratic values, the open source "hacker" culture is known to be male-dominated and unfriendly to women [1, p.194]. Some authors go as far as to say that sexist behavior in open source is "as constant as it is extreme" [2]. In open source, women represent only around 10% of all contributors [3], considerably more underrepresented than in the big tech companies; Google, for example, reports 18% women in technical jobs; Microsoft is close, with 16.6%.

Yet, open source software is well recognized for its high quality and is becoming increasingly popular. According to a recent survey, 78% of companies run their operations on open source software; 66% create software for their customers on top of open source software. Surely, open source software teams know what they’re doing. By staying homogeneous, teams limit their members’ potential differences in values, norms, and communication styles; homogeneity also better shields them from stereotyping, cliquishness, and conflict. It becomes much less likely for team members to disagree if they all think (and look) alike, with the majority class being the Western young white male.

"Clones" by Nick Royer (CC BY-SA)
However, teams that inadvertently act like this may be missing out. Increased team diversity brings more varied backgrounds and ideas, which provide a team with access to broader information and enhanced creativity, adaptability, and problem-solving skills. This effect has been documented countless times in "traditional" teams that interact face-to-face. Does it also hold online, in the meritocratic open source world, where "code sees no color or gender" [4]? After all, on the Internet nobody knows you're a dog.

We set out to investigate how diversity affects team performance on GitHub, the nexus of open source development, using a mixture of qualitative and quantitative research methods. First, through a survey with more than 800 respondents [4], we found that individual demographic attributes are surprisingly salient: while developers are, above all, aware of one another’s programming skills, they are also well aware of each other’s gender, real name, and country of residence. Their opinions on diversity are, however, split. Most respondents (62.5%) seem to view diversity as always positive. They acknowledge that diversity in terms of demographics and technical background is often associated with new ideas and approaches to solve problems, access to different networks, lively discussions around issues and pull requests, and ultimately better code. Still, some respondents (30%) report on occasional negative effects due to diversity, such as the developer quoted at the beginning of this blog post, referring to a gender-related incident.

Second, we extracted data from a large sample of 23,000 active projects on GitHub, spanning six years of activity and including a multitude of variables. We focused on two facets of diversity: gender (having a more balanced male-female team; we used people’s names to infer their gender) and tenure (having a mixture of seniors and juniors; we estimated a person’s programming experience from across all their GitHub contributions). We then used regression analysis to model effects associated with diversity on the outputs produced by teams per unit time (we counted the number of commits to a project by team members during a quarter), as a measure of the teams' effectiveness [5]. After controlling for team size, project age, social activity, and other confounds, our models show that both gender and tenure diversity are positive and significant predictors of productivity, together explaining a small but significant fraction of the data variability.
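To illustrate just the modeling step, on synthetic data, with plain least squares standing in for the more careful count-data models such an analysis calls for, and with made-up effect sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
team_size = rng.integers(2, 20, n).astype(float)
gender_div = rng.random(n)      # 0 = homogeneous team, 1 = perfectly balanced
tenure_div = rng.random(n)
# Synthetic outcome with positive diversity effects baked in for illustration.
commits = 5 * team_size + 8 * gender_div + 6 * tenure_div + rng.normal(0, 3, n)

# OLS: commits ~ 1 + team_size + gender_div + tenure_div
X = np.column_stack([np.ones(n), team_size, gender_div, tenure_div])
coef, *_ = np.linalg.lstsq(X, commits, rcond=None)
print(coef[2] > 0, coef[3] > 0)   # both diversity coefficients come out positive
```

The real analysis additionally has to deal with skewed commit counts, correlated predictors, and project-level controls, which is why a simple OLS fit like this is only a stand-in.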

Together, the two analyses paint a complete picture: diversity is more than a buzzword, it’s a necessity! More diverse teams perform better. On a larger, economic and societal scale, these findings also suggest that added investments in educational and professional training efforts and outreach for female programmers will likely result in added overall value.

[1] Turkle, S. The Second Self: Computers and the Human Spirit. MIT Press, 2005
[2] Nafus, D. ‘Patches don’t have gender’: What is not open in open source software. New Media & Society 14, 4 (2012), 669–683.
[3] FLOSS 2013 survey.
[4] Vasilescu, B., Filkov, V., and Serebrenik, A. Perceptions of Diversity on GitHub: A User Survey. In 8th International Workshop on Cooperative and Human Aspects of Software Engineering, CHASE, IEEE (2015), 50–56.
[5] Vasilescu, B., Posnett, D., Ray, B., Brand, M.G.J. van den, Serebrenik, A., Devanbu, P., and Filkov, V. Gender and tenure diversity in GitHub teams. In ACM SIGCHI Conference on Human Factors in Computing Systems, CHI, ACM (2015), 3789–3798.


Wednesday, January 20, 2016

Verifying software process adherence

The author of this blog post is a lecturer, teaching different subjects related to software engineering. While I mostly focus on studying and teaching software evolution and related topics, I have also been supervising capstone projects of our third-year undergraduate students. Back in 2011 we were struggling with a peculiar problem: the students were supposed to develop software according to the European Space Agency guidelines [1], and while the interviews with the students suggested that some groups were adhering to the guidelines better than others, we had no idea how frequently these guidelines were being violated. We needed a way to get insight into the process the students followed when creating software. Moreover, as opposed to the interviews, we needed objective information about the process the students actually followed rather than the process they believed they had followed. This is why we decided to mine the students' software repositories, including their version control system, issue tracker and mail archive.

To understand the software development process, we (1) combined information from the different repositories into one log, and (2) applied process mining, a collection of log analysis techniques originally developed for business process analysis, to that log. The first results turned out to be promising: we discovered, e.g., that one of the six student groups had reused the prototype implementation in direct violation of the European Space Agency guidelines, and that, due to similar violations, the teacher should have considered intervening and correcting the process for three groups out of six [2]. The discrepancy between the prescribed model and the ongoing software development process turned out not to be limited to student projects: e.g., the bug resolution process of GCC, the GNU Compiler Collection, as reflected in its Bugzilla, turned out to be more complex [3] than the bug resolution process prescribed by the Bugzilla Guide [4].
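The first step, combining the repositories into one log, essentially maps heterogeneous events onto a common (timestamp, case, activity) shape and orders them chronologically. A minimal sketch (the field layout is illustrative, not that of our FRASR tool):

```python
from datetime import datetime

def unified_log(commits, issues, mails):
    """Merge version control, issue tracker, and mail archive events into one
    chronological event log; here the 'case' is simply the author."""
    events = [(ts, author, "commit", sha) for ts, author, sha in commits]
    events += [(ts, author, "issue:" + act, None) for ts, author, act in issues]
    events += [(ts, author, "mail", subj) for ts, author, subj in mails]
    return sorted(events)       # process mining tools consume such ordered logs

log = unified_log(
    commits=[(datetime(2011, 3, 1, 10), "anna", "a1f9c")],
    issues=[(datetime(2011, 3, 1, 9), "anna", "opened")],
    mails=[(datetime(2011, 3, 1, 11), "ben", "Re: design")],
)
print([activity for _, _, activity, _ in log])
```

Once the events sit in one ordered log, off-the-shelf process mining tools (such as ProM, used in our work) can discover the actual process model from it.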

The next step was to apply our approach to commercial software development. Upon request of a large multinational company, we analyzed the way their engineers resolve bugs. We observed that, in contrast with the observations for the GCC Bugzilla, the process followed by the company engineers mostly agrees with the prescriptions of the bug tracking software.

Moreover, we observed that the steps related to the organization of the Defect Management Process contribute significantly to the duration of the whole defect resolution process: for instance, the decision of who has to resolve the bug is taken by a special board outside of the development team, and the median time associated with this decision is 21 days. For comparison, the median time of the resolution itself by the developers is 36.4 hours. We also saw that if rework is needed, i.e., the bug was not solved correctly on the first attempt, the reworking activities are time consuming: the median time associated with the first rework investigation is 9.9 days. Based on these observations, a number of improvement actions have been implemented by the company. In the coming years we plan to assess how those improvement actions affected the company's bug resolution process.

Through a series of increasingly complex case studies, ranging from student projects to large commercial software systems, we have seen that applying process mining techniques to software repository data can yield actionable insights into the ongoing software development process.

[1] The latest Software Standard of ESA is ECSS-E-ST-40C. It is the software part of a set of systems engineering standards for Space Engineering by ESA-ESTEC, Requirements & Standards Division, Noordwijk, The Netherlands (2009).
[2] Wouter Poncin, Alexander Serebrenik, Mark van den Brand: Mining student capstone projects with FRASR and ProM. OOPSLA Companion 2011: 87-96.
[3] Wouter Poncin, Alexander Serebrenik, Mark van den Brand: Process Mining Software Repositories. CSMR 2011: 5-14.
[4] The bug resolution process is described in Figure 6.1 of The Bugzilla Guide (2.18.6 Release), Chapter 6 "Using Bugzilla", Section 6.4 "Life Cycle of a Bug". Since then the bug lifecycle has been (a) simplified and (b) made customizable.

About the author: Alexander Serebrenik is an associate professor of software
evolution at Eindhoven University of Technology. He has obtained his Ph.D. in
Computer Science from Katholieke Universiteit Leuven, Belgium (2003) and
M.Sc. in Computer Science from the Hebrew University, Jerusalem, Israel. Dr.
Serebrenik’s areas of expertise include software evolution, maintainability and
reverse engineering, program analysis and transformation, process modeling
and verification. Dr. Serebrenik recently served as the General Chair of the IEEE
International Conference on Software Maintenance 2013, the Program Chair of
the IEEE International Conference on Software Analysis, Evolution, and
Reengineering and as a Program Committee Member of a number of software
engineering conferences. He is currently acting as the chairman of the ICSME
Steering Committee and as a member of the steering committee of SCAM; he also
acts as the online-presence manager of IEEE Software. Serebrenik can be
reached via @aserebrenik.

Sunday, January 10, 2016

Architectures for Massively Multi-User Environments

by Cristina Videira Lopes (@cristalopes)
Associate Editor: Mehdi Mirakhorli (@MehdiMirakhorli)


Multi-user online games are among the most successful interactive, world-scale distributed systems built in the past decade. They come in many flavors; some support small groups of players (e.g. the Call of Duty series, with fewer than 20 players in multiplayer mode), others support massive crowds (e.g. Eve Online, with upwards of 2,000 players engaged in quasi-real-time in some of the epic battles), and many are somewhere in between. The vast majority of these games are commercial; partly because of that, their technical details are kept vague, as the opportunity for player cheats increases when the player base is aware of technical details. Reverse engineering is strictly prohibited. Yet, these systems are fascinating!

From an engineering perspective, they embody almost everything that is interesting in computing systems these days: graphics, networking, content distribution, state synchronization, quasi-real-time responsiveness, physics simulation, operations simulation, AI, end-user programming, security, privacy, server farms, and even, in some cases, dedicated console hardware. From an application perspective, although most of the applications so far have been in the entertainment industry (games), the technology is general enough to be applicable to many other domains, from telemedicine to education to urban planning to teleconferencing. The Internet of Things (IoT), especially when visualizations are involved, is proving to be another incarnation of these systems. Finally, from a philosophical perspective, simulating reality, or environments that go beyond reality, gives valuable insights into the nature of the world and how we perceive it: many of the technical challenges faced in these environments touch on foundational topics such as the nature of effects and observations.

Over the past 8 years, we have been involved in the design and development of one of these 3D multi-user environments, OpenSimulator. OpenSimulator came into existence in 2007 thanks to one of those rare situations in which a commercial game was partially open sourced: Linden Lab’s Second Life, more specifically, the Second Life client. The publication of the client code allowed, and even encouraged, the development of an alternative server-side that was completely independent from Linden Lab’s server side. OpenSimulator includes a clean-room reverse engineering effort that honors the Linden Lab protocol and is capable of providing the same user experience; but it also greatly expands what can be done in those environments, to the point of supporting completely different clients. By now, OpenSimulator is a multi-user server framework that can be customized in a large number of ways. Thousands of people are using it, primarily in the entertainment and education domains. One of the most visible events of this community is the OpenSimulator conference, an annual conference started in 2013, held entirely in an OpenSimulator-based virtual environment hosted at UC Irvine. The conference brings together hundreds of members of the community in order to share experiences and future plans.

This post is not so much about OpenSimulator as it is about the lessons learned with it in the domain of massive 3D multi-user environments. For that, we are going to cover the following parts: (1) basics of 3D environments; (2) basics of networked multi-user environments; (3) dimensions of scalability; (4) some systematic experiments we did; (5) common approaches to managing the load; and (6) common and not-so-common approaches to partitioning the load.

The Basics

Let’s start with the simplest environments of all, the ones that aren’t even networked. A non-networked graphical 3D environment is a program with essentially two main loops: the physics loop and the rendering loop. These two loops don’t necessarily, or even desirably, run at the same rate; they serve different purposes.
  • The physics loop iterates through the objects in the scene and performs physical simulation on them. Physical simulation can be as simple as detecting collisions and simulating gravity or as complex as performing fluid dynamics. Physics loops run at a rate that achieves whatever physical simulation is desired. For example, if it’s as simple as detecting collisions in slow-moving objects, the physics loop can be as slow as 10 frames per second, but if the objects move much faster, then the rate must be much higher, or important events will be missed.
  • The rendering loop takes the objects in the scene, a perspective (i.e. a camera) on them, and renders the 3D scene on the [2D] screen. The refresh rate is largely dependent on the human visual system, so it’s typically above 24 frames per second. For immersive displays such as the Oculus Rift, it is desirable that it is above 60 fps.

In addition to the two main loops, these environments take user input, such as keyboard and mouse events, and react to it by changing the perspective (camera) and/or moving the user’s representation in the environment (aka the avatar).

Also, some of these environments have assorted non-physical simulations, such as artificial intelligence Non-Player Characters (NPCs) and active objects that execute some script. These non-physical simulations can be designed and integrated in a number of ways, but they are often integrated using the basic model of iterations, and given some compute cycles in every frame. Again, these loops may be independent from all the other loops.

(In some games there is only one single loop — a single thread — that allocates compute cycles to the various components of the game. This design makes performance fairly predictable and independent of the number of cores in the users’ machines.)
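To make the rate decoupling concrete, here is a toy main loop; the rates and the frame-counting scheme are illustrative (real engines typically use time accumulators and interpolate between physics steps):

```python
def run(frames, physics_every=6, render_every=2):
    """Tick physics and rendering at different rates off one 60 Hz master
    clock: physics every 6th frame (10 fps), rendering every 2nd (30 fps)."""
    physics_steps = render_steps = 0
    for frame in range(frames):
        if frame % physics_every == 0:
            physics_steps += 1     # collision detection, gravity, ...
        if frame % render_every == 0:
            render_steps += 1      # draw the scene from the camera
    return physics_steps, render_steps

print(run(frames=60))   # one simulated second: (10, 30)
```

A slow-moving world gets away with the 10 fps physics rate shown here; fast-moving objects would need a much higher rate, exactly as described above.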

Networked Multi-User Environments

Multi-user, networked games add complexity to this model, because the state of the world needs to be shared among the many participants, who may be distributed all over the world and who are both observers of, and actors on, the shared environment. Distribution brings with it a third important loop: state updates sent over the network to other users. The rate at which this loop executes depends on the characteristics of the application: for applications where precision of effects is very important, this rate tends to be high, sometimes as high as 100 fps; for applications that can tolerate a fair amount of imprecision, the rate can be lower — as low as 1 fps. State updates often result in bursty network traffic: when the user is not doing anything, no updates are sent, but when the user interacts with the environment, a rapid stream of updates is sent.
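As a toy illustration of that burstiness (the rate cap and event shape are assumptions for illustration), a sender can thin a burst of input events down to the configured update rate, while an idle user generates no traffic at all:

```python
def updates_to_send(events_ms, rate_hz=20):
    """Thin (timestamp_ms, action) user events down to at most `rate_hz`
    network updates per second; with no events there is no traffic."""
    min_gap_ms = 1000 // rate_hz
    sent, last = [], None
    for t, action in events_ms:
        if last is None or t - last >= min_gap_ms:
            sent.append((t, action))
            last = t
    return sent

burst = [(10 * i, "move") for i in range(10)]   # moves every 10 ms for 100 ms
print(len(updates_to_send(burst)))              # thinned to the 20 Hz cap: 2
```

Dropped intermediate positions are the reason receivers must interpolate and smooth remote movement, as discussed below in the requirements.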

The requirements of massive multi-user environments are the following:

  • The world may need to exist and persist independently of users being connected to it.
  • Autonomous agents, such as Non-Player Characters (NPCs), need to run somewhere and their effects need to be observably consistent for all users.
  • Physical and non-physical simulations need to be consistent for all users. For example, when two agents arrive simultaneously at a desired position, one of them must have gotten there first, according to some absolute arbiter.
  • The actions that each user does need to be broadcast to all other users as fast and as accurately as possible.
  • The local effects of the user’s actions cannot wait for global consistency checks, or the perceived responsiveness of the environment will be severely affected.
  • The movement of other agents and non-local NPCs need to be smoothed out in the local rendering, in spite of network jitter and possible low-rate stream of updates for them.
  • When it comes to making important decisions, the software that runs at any user’s computer cannot be trusted.

Networked multi-user environments traditionally fall into two architectural camps: peer to peer and client-server. The first networked games were peer to peer within a local area network; they gave rise to LAN parties, whose popularity peaked in the 1990s. In the early 2000s, peer to peer gaming was all the rage in academic circles. However, around that same time, commercial games quickly steered towards the client-server architecture, for a variety of reasons, the most important being security against cheats and greater control of the game play. In a client-server architecture, the clients and the server all keep copies of the 3D scene, but the server is always authoritative. For example, many of these games have physics simulation both in the clients, for responsiveness, and in the server, for authoritativeness; parallel universes exist, but ultimately the server state prevails.

These days, many hybrid architectures are used. Hybrid architectures run on server farms controlled by game companies or game distributors. These servers can be geographically positioned to provide higher quality of service and/or to distribute the load more evenly. The peering, when it exists, happens between servers within the same domain of trust, and not between the client components that run in the users’ computers.

At their core, networked multi-user environments suffer from the N^2 problem, that is, N event-producing agents (users, NPCs, active objects) that receive events from others can potentially generate N^2 traffic. This quadratic increase is no minor issue, as it can quickly overwhelm the network and the software components that process the events.
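A back-of-the-envelope helper makes the N^2 growth concrete. With N clients each sending U updates per second, the server side receives N*U messages per second and must fan each of them out to the other N-1 clients:

```python
# The N^2 problem in numbers: upstream traffic grows linearly with the
# number of clients, but the downstream fan-out grows quadratically.

def fanout_msgs_per_sec(n_clients, updates_per_sec):
    """Return (upstream, downstream) messages per second at the server."""
    upstream = n_clients * updates_per_sec
    downstream = n_clients * (n_clients - 1) * updates_per_sec
    return upstream, downstream
```

For 100 clients at 5 updates/second that is 500 messages/second in and 49,500 out; doubling the clients roughly quadruples the downstream traffic.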

Dimensions of Scalability

Networked multi-user environments may be scaled up in many dimensions. The number of users, the fidelity of the users’ actions, the size and complexity of the 3D scene, and the number of simulations or simulated entities are some of the most common dimensions over which these environments are scaled up. Each of these dimensions carries with it specific demands. For example, scaling up the number of interacting users can be done without adding any hardware by decreasing the amount of interactivity and responsiveness of the environment, i.e. by decreasing the number of events generated by each user. But that may not be desirable, as it may degrade the user experience. Another example: the advent of cameras and other sensors that can track subtle changes in a person’s physical demeanor and gestures allows the design of remote-controlled avatars using a person’s body, but that carries with it much higher demands in terms of state updates distributed to the other users. A final example: a very large “map” (3D scene) may be required in some applications, but such large maps pose many problems; for starters, distributing them among many users may cause network congestion; then, since computer memory is limited, the user’s computer will necessarily have to break it into smaller chunks for rendering purposes; better representations of 3D objects and some form of interest management help in decreasing the load.

In order to understand the scalability of these systems, it is important to understand what kinds of load we have at hand, and what produces that load. In general, there are two kinds of load: network traffic and processing cycles. Network traffic is directly related to the number of events and the amount of data to be distributed. In turn, this is related to the number of event-generating agents and the rate at which events are generated. Processing cycles are used for rendering, for simulation, for handling the network traffic that comes from other components, and for assorted housekeeping.

Systematic Experiments

While OpenSimulator is a great platform for doing design experimentation with a real system, it is too complex to study the dimensions of scalability at their core. When we started to systematically study these systems, we designed a simplified multi-user game with which we were able to measure load in a highly controlled manner. Let me explain this simple game, and the experiments we ran with it.


The game is a multi-user Jigsaw puzzle. As the picture above shows, pieces of a jigsaw puzzle are scattered around a canvas area. All users connecting to a certain URL see the exact same state of the game, and each one of them is able to move the pieces. As users move pieces, the other users can see those movements in real-time. The game can be accessed, live, here.

The game uses an architecture that we call RCAT (Restful Client-server ArchiTecture), a 4-tier architecture illustrated in the picture below:


At the top, we have the game clients, in our case Web browsers running JavaScript that render the Jigsaw puzzle game and receive user input. These clients connect via WebSockets to frontline proxies. The proxies serve a specific purpose: they handle client traffic in the best possible way. For example, in a globally distributed infrastructure, like some of the commercial games, these proxies would be placed in strategic points of the world (e.g. North America, Asia, Europe, Australia, etc.). They are also the components that implement the network protocol that best serves the interaction between the untrusted clients and the trusted server-side infrastructure; in our case, that was WebSockets. From the proxies down, we have the trusted server side, which consists of one or more application servers (i.e. the game servers) and, at the bottom, the persistent database, which is shared among all application servers. In our case, the application servers all do the same thing, namely enforce the rules of the game, such as making sure no piece is grabbed by more than one user at a time, or checking whether a piece is placed in the correct slot. In Web terms, when application servers are replicated and requests can be forwarded to any of them, this is called horizontal scalability.

In order to make controlled experiments, we implemented N user-emulating bots with an external program. Each bot grabs a piece of the puzzle and “moves it” for a while. The bots were configured to send U position updates per second. Both N and U are configurable parameters in the experiments.
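The behavior of one bot can be sketched as follows. This is an illustrative reconstruction, not our actual bot code: each bot streams U position updates per second along some path (here a circle, which is an arbitrary choice), and the real transport to the proxy is omitted.

```python
import math

# Sketch of a user-emulating bot: it "moves" a puzzle piece and produces
# U position updates per second for a fixed duration. The circular path
# and the radius are illustrative assumptions.

def bot_positions(updates_per_sec, duration_sec, radius=50.0):
    """Return the (t, x, y) updates one bot would send while moving a
    piece along a circle for duration_sec seconds."""
    n = int(updates_per_sec * duration_sec)
    updates = []
    for i in range(n):
        t = i / updates_per_sec
        updates.append((t, radius * math.cos(t), radius * math.sin(t)))
    return updates
```

Running N such bots with update rate U reproduces the configurable (N, U) load of the experiments.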

Experiment 1: Quadratic Behavior

We wanted to map out the network load characteristics at both ends of the architecture. We used a simple setup consisting of one single proxy, one application server and the database server. The proxy and the database were on the same physical server; the application server was on a separate server; the bots were on three other machines. In total, there were five machines, all of them Dell Optiplex 980s with eight 2.8-GHz i7 cores.

At the proxy side, we expected a quadratic increase of traffic with a linear increase of bots. Indeed, that’s exactly what we observed:

[Figure: Experiment 1, network traffic at the proxy]

The red line at the bottom shows the cumulative traffic from the clients to the proxy: as the number of clients increases, that traffic increases linearly. The blue line shows the traffic in the opposite direction, i.e. from the proxy to all the clients. Remember that these systems suffer from an N^2 behavior: for every new client, there is an increase of U updates per second (upstream) that need to be distributed to all the N-1 clients (downstream).

The observed quadratic behavior was expected. It shows, in no uncertain terms, that without optimizations on event streams, scalability of the frontline network load upon increase of the number of users is… a nasty problem.

But this is not the only bottleneck in our architecture. In the most naive implementation of this game, every time the game server receives a position message from a client, it updates in the database the position of the piece, retrieves the list of connected clients, and sends to the proxy, in a single message, 1) the message to be forwarded to the clients, containing the piece’s new position, and 2) the list of clients to send the message to (i.e. everyone currently connected). As the number of clients increases, so does the amount of client-related data that is retrieved from the database. Coupled with an increase in the rate of database lookups, this adds up to another nasty quadratic situation:


Obviously, the naive implementation is terrible; caches need to be used whenever possible. In this case, the game server can keep the list of connected clients in cache, which eliminates the quadratic bandwidth growth altogether. Caches, however, increase the complexity of the game logic, as, in the presence of multiple game servers and proxies, they need to be refreshed every so often.
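The cache fix can be sketched as follows. This is a minimal illustration, not our production code: the game server keeps the list of connected clients in memory and only re-reads it from the database every `ttl` seconds, turning a per-message lookup into an occasional one. The `fetch_clients` callable stands in for the real database query.

```python
import time

# Sketch of the client-list cache: a time-to-live (ttl) bounds how stale
# the cached list can get, which is the refresh problem mentioned above
# when multiple game servers and proxies are involved.

class ClientListCache:
    def __init__(self, fetch_clients, ttl=1.0, clock=time.monotonic):
        self.fetch = fetch_clients
        self.ttl = ttl
        self.clock = clock          # injectable for testing
        self.cached = None
        self.fetched_at = -float("inf")

    def get(self):
        now = self.clock()
        if self.cached is None or now - self.fetched_at > self.ttl:
            self.cached = self.fetch()      # hit the database
            self.fetched_at = now
        return self.cached
```

With this in place, the database traffic per position message no longer grows with the number of connected clients, which removes the quadratic bandwidth term.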

Again, we were not attempting to be smart; we simply wanted to verify the consequences of increasing the number of users in simple, but unsophisticated architectures. Quadratic behavior can emerge in many points of these architectures.

Experiment 2: Effect of Update Rate

In this experiment, we measured the round-trip latency of messages at the bots, and the CPU consumption in all servers (using the getrusage function). We wanted to check how the number of proxies and the number of game servers scale with the number of bots for different position update frequencies. We define the maximum capacity of the system as the highest number of connected clients when the 99th percentile of the latency is below 100 ms.
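The capacity criterion is easy to express with the standard library. This is a sketch of the definition above, not our measurement harness: a client count is within capacity as long as the 99th percentile of observed round-trip latencies stays below 100 ms.

```python
import statistics

# Sketch of the capacity criterion: p99 of round-trip latency < 100 ms.

def p99(latencies_ms):
    # statistics.quantiles with n=100 returns 99 cut points;
    # the last one is the 99th percentile.
    return statistics.quantiles(latencies_ms, n=100)[98]

def within_capacity(latencies_ms, threshold_ms=100.0):
    return p99(latencies_ms) < threshold_ms
```

Using the 99th percentile rather than the mean makes the criterion sensitive to the latency spikes that users actually notice.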

The commodity machines used for the experiments were Dell machines with Core i7-2600 processors, each with four hyper-threaded 3.4-GHz cores. Up to two machines ran proxy instances, up to two others ran game server instances, one other the database, and two others the bots. All machines resided on the same 1-Gbps LAN, and were one hop from each other. Experiment names follow the format “X/Y/Z”, standing for X cores running one proxy each, Y cores running one game server each, and bots sending Z messages per second. For example, in the scenario 2/2/5, there are two proxy instances running on the same machine, two game servers sharing one other machine, and bots on two other machines send five updates per second. In the scenario 4+4/4+4/2.5, there are four proxies running on one machine, four proxies on another, four game servers on a third machine, four game servers on a fourth machine, and the bots send 2.5 updates per second.
Here is the main chart:

[Figure: maximum capacity vs. number of cores, for several update rates]

The first thing shown in this chart is that in order to handle roughly twice the number of users, we need four times the number of cores: see, for example, the blue line, which starts at a maximum capacity of 100 users with one core for the proxy and one core for the game server, and reaches close to 200 users with four proxy cores and four game server cores. A similar trend is seen for the other lines.

The second thing this chart tells us is that doubling the rate of updates decreases the maximum capacity of the system by 20% to 40%. This is not an intuitive result; the next chart sheds some light on it:


This chart illustrates the scenario with eight proxies and eight game servers (4+4/4+4). At 2.5 messages per second, the maximum capacity (229 clients) is reached while the game servers only consume around 250% CPU. Meanwhile, the proxies consume twice as much CPU. In contrast, when clients send 10 messages per second, the game servers consume more CPU than the proxies when reaching the overall system’s maximum capacity (125 clients). In this scenario, it is difficult to know which of the proxies or the game servers is the bottleneck. Yet it seems that handling many slow-paced connections is more costly for the proxies than handling fewer, more active ones. Thus for slow message rates, the proxy is the bottleneck, but for fast message rates, the game server may be the limiting factor. In fact, the game logic of our jigsaw puzzle is very simple (e.g. no collision detection); for more sophisticated games, the game server CPU may increase even faster than shown above.

More details of this study can be found in Chapter 7 of this book.

Load Management

As discussed above, because of their N^2 nature, multi-user environments can quickly become saturated and unresponsive. Over the years, researchers and engineers have devised ways of coping with this problem. We describe some of the most common techniques of managing and decreasing load within the confines of the environment design.

Interest Management. Rather than blindly broadcasting all events to all connected users, game servers can do something smarter, and notify users only about the events that matter to them. For example, if the environment simulates 3D space, not much will be lost if the user sees only events for objects and users within, say, 100m range, because everything else will be very small anyway; alternatively, a combination of size and distance can be used to filter events. Depending on the application, many criteria can be used for managing the interest of users and filtering the events accordingly. Interest management is used pervasively and aggressively in commercial games, as it is the most straightforward technique for cutting down the number of events and the load associated with them.
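The distance-based variant can be sketched in a few lines. This is an illustrative filter, with the 100m range from the example above as the default:

```python
import math

# Sketch of distance-based interest management: an event at a given
# position is forwarded only to users within max_dist of it, cutting
# the fan-out from "all N users" to "nearby users".

def interested_users(event_pos, users, max_dist=100.0):
    """users: dict of name -> (x, y, z) position.
    Returns the names of users close enough to be notified."""
    return [
        name for name, pos in users.items()
        if math.dist(event_pos, pos) <= max_dist
    ]
```

Size-and-distance filtering would replace the distance test with one that also scales the threshold by the object's size.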

Data prioritization. In these systems, some events are critical, while others are not so critical. When bandwidth or computational resources become scarce, only critical messages need to be forwarded.
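A minimal sketch of this idea, with illustrative message tagging (the critical/non-critical flag is an assumption, not any real protocol's field):

```python
# Sketch of data prioritization: when the outgoing budget is exceeded,
# critical messages (e.g. hits, deaths) are kept and non-critical ones
# (e.g. routine position updates) are dropped first.

def prioritize(messages, budget):
    """messages: list of (is_critical, payload) pairs.
    Keep all critical messages, then fill the remaining budget
    with non-critical ones, in order."""
    critical = [m for m in messages if m[0]]
    optional = [m for m in messages if not m[0]]
    return critical + optional[:max(0, budget - len(critical))]
```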

Time dilation. As described earlier, these systems are based on various loops running at specific rates. When the system starts to overload, one possible way of dealing with it gracefully is to slow down the loops incrementally, i.e. to decrease their rate. This makes the perception of time slow down; everything happens as if in slow motion. This technique is used most visibly in Eve Online, but is also used in special circumstances in OpenSimulator.
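The core of the technique fits in one function. This is a sketch under a simple assumption: load is expressed as the actual work per tick divided by the work budget per tick, and once it exceeds 1.0 the tick interval is stretched proportionally so the loop keeps up instead of falling behind.

```python
# Sketch of time dilation: above 100% load, the loop interval grows in
# proportion to the overload, which slows simulated time rather than
# letting the server drop behind real time.

def dilated_tick_interval(base_interval, load_factor):
    """load_factor = work per tick / work budget per tick."""
    return base_interval * max(1.0, load_factor)
```

At load_factor 2.0 a 45 fps loop would run at 22.5 fps, and everything in the world appears to move at half speed.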

Rewriting history. In some of these environments, it is better to be wrong but on time than right but late. This is particularly important when the server starts to be overloaded and latency is high. As such, the clients often react immediately to local events with actions that, upon verification on the server, may need to be rolled back, sometimes producing a visible glitch. For example, if the player shoots another player, the client may instantly decide that it was a hit and start showing a corresponding animation on the player that’s been hit. However, this is just a local estimate of the true sequence of events. Once communicated to the server, and compared to the reported location of the other player, the server may determine that there was no hit. In that case, the local client needs to roll back its decision. Games can minimize these visual glitches by using a variety of special effects such as clouding the spot of the potential hit with fog that makes the outcome unclear for a few milliseconds until the authoritative decision is received from the server.
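The shoot-and-roll-back example can be sketched as client-side state. The class and effect names here are illustrative, not from any real game engine:

```python
# Sketch of "rewriting history": the client applies an optimistic local
# guess immediately and rolls it back if the server's authoritative
# verdict disagrees (the removal is the visible glitch).

class HitPrediction:
    def __init__(self):
        self.effects = []               # animations currently shown

    def local_shot(self, target):
        # React immediately: show the hit before the server confirms.
        self.effects.append(("hit", target))

    def server_verdict(self, target, was_hit):
        # Authoritative answer arrives later; undo the guess if wrong.
        if not was_hit and ("hit", target) in self.effects:
            self.effects.remove(("hit", target))
```

The fog-of-war trick mentioned above amounts to rendering the optimistic effect ambiguously until `server_verdict` arrives.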

Load Partitioning

In massive multi-user environments, however, managing the load is not enough. The workload generated by the users, scripted objects and NPCs is such that it needs to be partitioned among several servers. The following are the most commonly used approaches.

Shards. Rather than trying to scale the games to thousands of users in the same game play, many commercial games cap the number of interacting users to a few players within a space, and provide multiple copies of that same space for different groups of users. These copies are called shards. Typically, each shard has its own dedicated hardware. This practice can be traced back to Ultima Online in 1997; it is used pervasively in commercial games, such as World of Warcraft, Call of Duty and countless others.

Space partitioning (static). One of the most common ways of load partitioning 3D environments, and the oldest too, is to partition the virtual space into regions, or zones, each one assigned to a server. This idea can be traced back to DIVE in the early 90s, and is used in many commercial games, including Second Life and, consequently, OpenSimulator. In OpenSimulator, these regions are squares whose sizes are multiples of 256m. Each server is responsible for simulating the objects and NPCs within its space, and for handling client connections of users in that space. Users can move from region to region following some hand-off protocol.
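The region-to-server mapping can be sketched as follows. The 256m grid matches OpenSimulator's region size; the hash-based server assignment is a hypothetical stand-in for a real deployment's region-to-server table:

```python
# Sketch of static space partitioning: the world is a grid of 256 m
# regions, and each region is owned by exactly one server. Crossing a
# region boundary is what triggers the hand-off protocol.

REGION_SIZE = 256.0

def region_of(x, y):
    """Map a world position to its region's grid coordinates."""
    return (int(x // REGION_SIZE), int(y // REGION_SIZE))

def server_for(x, y, n_servers):
    """Illustrative assignment: every position in the same region maps
    to the same server, regardless of where users actually gather."""
    rx, ry = region_of(x, y)
    return hash((rx, ry)) % n_servers
```

Because the assignment depends only on position, not on user density, a crowd gathering in one region lands entirely on one server, which is exactly the uneven-load problem dynamic partitioning addresses.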

Space partitioning (dynamic). Static space partitioning can result in uneven load distributions, because users tend to gather in groups that are unevenly distributed in space. Dynamic space partitioning techniques aim at addressing this problem by dynamically reconfiguring the portions of virtual space associated with servers, depending on the number of users in the areas. Some of these approaches use Voronoi diagrams, others use microcells that can be dynamically associated with servers. Recent work suggests that the amount of interactions between two dynamically configured regions can result in a high load for adjacent regions, thereby making dynamic space partitioning not suitable for all object or player distributions and behaviors.

Besides these simple approaches to load partitioning, there are others that look at load under different angles. Most of these are, or have been, research experiments.
Project Darkstar, the server supporting OpenWonderland, divides the load by task triggered by user input. Kiwano divides the world by space, and groups the load generated by the users’ input by their proximity in space. Meru partitions the load by both space and content (3D objects). Within the OpenSimulator ecosystem, the DSG architecture partitions the workload by feature, such as physics and script execution. We are currently experimenting with aspect-oriented load partitions, where different aspects of the simulation are assigned to different processes or even servers.

The art of designing scalable distributed virtual environments lies in finding the right load partitions for the purpose for which the environments are being designed. Load partitions carry overhead costs in terms of coordination among servers. In a “good” partition design, the system will scale horizontally, i.e. more load can be handled by adding more servers in as linear a relation as possible; in a “bad” partition design the benefits of load distribution will be dwarfed by the communication overhead among servers.

Final Words

As mentioned in the beginning, these systems are fascinating. Fundamentally, what’s being played in them (besides the games) is the tension between global effects and local observations, in a platform with insurmountable limits such as the speed at which information can spread over the network, the bandwidth of the network and the resource limits of the computing nodes. More than any old distributed system, environments that share a virtual 3D space among people distributed all over the world make these fundamental problems quite tangible. In order to achieve the illusion of a single, consistent universe, we are left having to play with space and time: what runs where (load partitioning) and when (interpolation, delayed execution, time dilation, etc.).

Additional References

"This post includes joint work with Thomas Debeauvais and Arthur Valadares, both graduate students at UC Irvine."
