Sunday, January 10, 2016

Architectures for Massively Multi-User Environments


by  Cristina Videira Lopes (@cristalopes)
Associate Editor: Mehdi Mirakhorli (@MehdiMirakhorli)

[Figure: OSCC'14, the OpenSimulator Community Conference]

Multi-user online games are among the most successful interactive, world-scale distributed systems built in the past decade. They come in many flavors: some support small groups of players (e.g. the Call of Duty series, with fewer than 20 players in multiplayer mode), others support massive crowds (e.g. Eve Online, with upwards of 2,000 players engaged in quasi-real-time in some of its epic battles), and many are somewhere in between. The vast majority of these games are commercial; partly because of that, their technical details are kept vague, since the opportunity for player cheats increases when the player base knows how the system works, and reverse engineering is strictly prohibited. Yet, these systems are fascinating!

From an engineering perspective, they embody almost everything that is interesting in computing systems these days: graphics, networking, content distribution, state synchronization, quasi-real-time responsiveness, physics simulation, operations simulation, AI, end-user programming, security, privacy, server farms, and even, in some cases, dedicated console hardware. From an application perspective, although most of the applications so far have been in the entertainment industry (games), the technology is general enough to be applicable to many other domains, from telemedicine to education to urban planning to teleconferencing. The Internet of Things (IoT), especially when visualizations are involved, is proving to be another incarnation of these systems. Finally, from a philosophical perspective, simulating reality, or environments that go beyond reality, gives valuable insights into the nature of the world and how we perceive it: many of the technical challenges faced in these environments touch on foundational topics such as the nature of effects and observations.

Over the past 8 years, we have been involved in the design and development of one of these 3D multi-user environments, OpenSimulator. OpenSimulator came into existence in 2007 thanks to one of those rare situations in which a commercial game was partially open sourced: Linden Lab’s Second Life, more specifically, the Second Life client. The publication of the client code allowed, and even encouraged, the development of an alternative server side that was completely independent of Linden Lab’s server side. OpenSimulator includes a clean-room reverse engineering effort that honors the Linden Lab protocol and is capable of providing the same user experience; but it also greatly expands what can be done in those environments, to the point of supporting completely different clients. Today, OpenSimulator is a multi-user server framework that can be customized in a large number of ways. Thousands of people are using it, primarily in the entertainment and education domains. One of the most visible events of this community is the OpenSimulator conference, an annual conference started in 2013, held entirely in an OpenSimulator-based virtual environment hosted at UC Irvine. The conference brings together hundreds of members of the community in order to share experiences and future plans.

This post is not so much about OpenSimulator, as it is about the lessons learned with it in the domain of massive 3D multi-user environments. For that, we are going to cover the following parts: (1) the basics of 3D environments; (2) the basics of networked multi-user environments; (3) dimensions of scalability; (4) some systematic experiments we did; (5) common approaches to managing the load; and (6) common and not-so-common approaches to partitioning the load.

The Basics


Let’s start with the absolute simplest environments, the ones that aren’t even networked. A non-networked graphical 3D environment is a program with essentially two main loops: the physics loop and the rendering loop. These two loops don’t necessarily, or even desirably, run at the same rate; they serve different purposes.
  • The physics loop iterates through the objects in the scene and performs physical simulation on them. Physical simulation can be as simple as detecting collisions and simulating gravity or as complex as performing fluid dynamics. Physics loops run at a rate that achieves whatever physical simulation is desired. For example, if it’s as simple as detecting collisions in slow-moving objects, the physics loop can be as slow as 10 frames per second, but if the objects move much faster, then the rate must be much higher, or important events will be missed.
  • The rendering loop takes the objects in the scene and a perspective (i.e. a camera) on them, and renders the 3D scene on the [2D] screen. The refresh rate is largely dictated by the human visual system, so it’s typically above 24 frames per second. For immersive displays such as the Oculus Rift, the rate should be above 60 fps.

In addition to the two main loops, these environments take user input from devices such as the keyboard and the mouse, and react to it by changing the perspective (camera), moving the user’s representation in the environment (the avatar), or both.

Also, some of these environments have assorted non-physical simulations, such as artificial intelligence Non-Player Characters (NPCs) and active objects that execute some script. These non-physical simulations can be designed and integrated in a number of ways, but they are often integrated using the basic model of iterations, and given some compute cycles in every frame. Again, these loops may be independent from all the other loops.

(In some games there is only a single loop, running on a single thread, that allocates compute cycles to the various components of the game. This design makes performance fairly predictable and independent of the number of cores in the users’ machines.)
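A minimal sketch of this loop structure in Python may help; the `scene`, `renderer`, and `input_source` objects are hypothetical placeholders, not any particular engine's API. Physics is stepped at a fixed rate, while rendering runs as often as its frame budget allows:

```python
import time

PHYSICS_DT = 1.0 / 10    # physics stepped at 10 frames per second
RENDER_DT = 1.0 / 30     # target rendering rate of 30 frames per second

def run_game_loop(scene, renderer, input_source):
    """Single-threaded loop: fixed-timestep physics, best-effort rendering."""
    accumulator = 0.0
    previous = time.monotonic()
    while not input_source.quit_requested():
        now = time.monotonic()
        accumulator += now - previous
        previous = now

        scene.apply_input(input_source.poll())     # keyboard/mouse -> camera/avatar

        # Step physics zero or more times with the same fixed timestep, so that
        # collision detection behaves the same regardless of rendering speed.
        while accumulator >= PHYSICS_DT:
            scene.step_physics(PHYSICS_DT)
            accumulator -= PHYSICS_DT

        scene.run_scripts_and_npcs()               # non-physical simulations
        renderer.draw(scene)                       # render the current state

        # Sleep whatever is left of this frame's render budget.
        time.sleep(max(0.0, RENDER_DT - (time.monotonic() - now)))
```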

Networked Multi-User Environments


Multi-user, networked games add complexity to this model, because the state of the world needs to be shared among many participants who may be distributed all over the world and who are both observers of, and actors in, the shared environment. Distribution brings with it a third important loop: state updates sent over the network to other users. The rate at which this loop executes depends on the characteristics of the application: for applications where precision of effects is very important, this rate tends to be high, sometimes as high as 100 updates per second; for applications that can tolerate a fair amount of imprecision, the rate can be as low as 1 update per second. State updates often result in bursty network traffic: when the user is not doing anything, no updates are sent, but when the user interacts with the environment, a rapid stream of updates is sent.
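Here is a minimal sketch of that third loop, assuming a hypothetical `connection` object and an application that caps itself at 10 updates per second. Nothing is sent while the local avatar is idle, which is exactly what makes the traffic bursty:

```python
import json
import time

UPDATE_RATE = 10                # updates per second; application-dependent
UPDATE_DT = 1.0 / UPDATE_RATE

def run_update_loop(local_avatar, connection):
    """Send state updates at most UPDATE_RATE times per second, and only
    when something actually changed."""
    last_sent = None
    while connection.is_open():
        started = time.monotonic()
        state = {"pos": local_avatar.position(), "rot": local_avatar.rotation()}
        if state != last_sent:                    # idle users generate no traffic
            connection.send(json.dumps(state))
            last_sent = state
        time.sleep(max(0.0, UPDATE_DT - (time.monotonic() - started)))
```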

The requirements of massive multi-user environments are the following:

  • The world may need to exist and persist independently of users being connected to it.
  • Autonomous agents, such as Non-Player Characters (NPCs), need to run somewhere and their effects need to be observably consistent for all users.
  • Physical and non-physical simulations need to be consistent for all users. For example, when two agents arrive simultaneously at a desired position, one of them must have gotten there first, according to some absolute arbiter.
  • The actions of each user need to be broadcast to all other users as quickly and as accurately as possible.
  • The local effects of the user’s actions cannot wait for global consistency checks, or the perceived responsiveness of the environment will be severely affected.
  • The movement of other agents and non-local NPCs needs to be smoothed out in the local rendering, in spite of network jitter and the possibly low-rate stream of updates for them (see the interpolation sketch after this list).
  • When it comes to making important decisions, the software that runs at any user’s computer cannot be trusted.

Networked multi-user environments traditionally fall into two architectural camps: peer to peer and client-server. The first networked games were peer to peer within a local area network; they gave rise to LAN parties, whose popularity peaked in the 1990s. In the early 2000s, peer to peer gaming was all the rage in academic circles. However, around that same time, commercial games quickly steered towards the client-server architecture, for a variety of reasons, security/cheats and greater control of the game play being the most important ones. In a client-server architecture, the clients and the server all keep copies of the 3D scene, but the server is always authoritative. For example, many of these games have physics simulation both in the clients, for responsiveness, and in the server, for authoritativeness; parallel universes exist, but ultimately the server state prevails.

These days, many hybrid architectures are used. Hybrid architectures run on server farms controlled by game companies or game distributors. These servers can be geographically positioned to provide higher quality of service and/or to distribute the load more evenly. The peering, when it exists, happens between servers within the same domain of trust, and not between the client components that run in the users’ computers.

At their core, networked multi-user environments suffer from the N^2 problem: N event-producing agents (users, NPCs, active objects) that receive events from one another can potentially generate N^2 traffic. For example, with N agents each producing U updates per second, and each update forwarded to the other N-1 agents, downstream traffic grows as N(N-1)U, i.e. roughly N^2. This quadratic increase is no minor issue, as it can quickly overwhelm both the network and the software components that process the events.

Dimensions of Scalability


Networked multi-user environments may be scaled up in many dimensions. The number of users, the fidelity of the users’ actions, the size and complexity of the 3D scene, and the number of simulations or simulated entities are some of the most common dimensions over which these environments are scaled up. Each of these dimensions carries with it specific demands. For example, scaling up the number of interacting users can be done without adding any hardware by decreasing the amount of interactivity and responsiveness of the environment, i.e. by decreasing the number of events generated by each user. But that may not be desirable, as it may degrade the user experience. Another example: the advent of cameras and other sensors that can track subtle changes in a person’s physical demeanor and gestures allows the design of remote-controlled avatars using a person’s body, but that carries with it much higher demands in terms of state updates distributed to the other users. A final example: a very large “map” (3D scene) may be required in some applications, but such large maps pose many problems; for starters, distributing them among many users may cause network congestion; then, since computer memory is limited, the user’s computer will necessarily have to break it into smaller chunks for rendering purposes; better representations of 3D objects and some form of interest management help in decreasing the load.

In order to understand the scalability of these systems, it is important to understand what kinds of load we have at hand, and what produces that load. In general, there are two kinds of load: network traffic and processing cycles. Network traffic is directly related to the number of events and the amount of data to be distributed. In turn, this is related to the number of event-generating agents and the rate at which events are generated. Processing cycles are used for rendering, for simulation, for handling the network traffic that comes from other components, and for assorted housekeeping.

Systematic Experiments


While OpenSimulator is a great platform for doing design experimentation with a real system, it is too complex to study the dimensions of scalability at their core. When we started to systematically study these systems, we designed a simplified multi-user game with which we were able to measure load in a highly controlled manner. Let me explain this simple game, and the experiments we made with it.

[Figure: the multi-user Jigsaw puzzle game]

The game is a multi-user Jigsaw puzzle. As the picture above shows, pieces of a jigsaw puzzle are scattered around a canvas area. All users connecting to a certain URL see the exact same state of the game, and each one of them is able to move the pieces. As users move pieces, the other users can see those movements in real-time. The game can be accessed, live, here.

The game uses an architecture that we call RCAT (Restful Client-server ArchiTecture), a 4-tier architecture illustrated in the picture below:

[Figure: the RCAT 4-tier architecture]

At the top, we have the game clients, in our case, Web browsers running JavaScript that render the Jigsaw puzzle game and receive user input. These clients connect via WebSockets to frontline proxies. The proxies serve a specific purpose: they handle client traffic in the best possible way. For example, in a globally distributed infrastructure, like some of the commercial games, these proxies would be placed in strategic points of the world (e.g. North America, Asia, Europe, Australia, etc.). Not only that, but they are also the components that implement the network protocol that best serves the interaction between the untrusted clients and the trusted server-side infrastructure. In our case, that was WebSockets. From the proxies down, we have the trusted server-side, which consists of one or more application servers (i.e. the game servers) and, at the bottom, the persistent database, which is shared among all application servers. In our case, the application servers all do the same thing, namely, they enforce the rules of the game, such as making sure no one piece is grabbed by more than one user or checking whether a piece is placed in the correct slot. In Web terms, when application servers are replicated and requests can be forwarded to any of them, this is called horizontal scalability.

In order to make controlled experiments, we implemented N user-emulating bots with an external program. Each bot grabs a piece of the puzzle and “moves it” for a while. The bots were configured to send U position updates per second. Both N and U are configurable parameters in the experiments.
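A sketch of what such a bot might look like follows; the WebSocket endpoint, the message format, and the use of the Python `websockets` library are assumptions for illustration, not the actual experiment code:

```python
import asyncio
import json
import random

import websockets    # third-party library, assumed here for the sketch

N_BOTS = 50          # N: number of emulated users
UPDATE_RATE = 5.0    # U: position updates per second per bot
PROXY_URI = "ws://proxy.example.org:8888/"   # hypothetical proxy endpoint

async def run_bot(bot_id):
    """Grab one piece and keep 'moving' it, U updates per second."""
    async with websockets.connect(PROXY_URI) as ws:
        x, y = 0.0, 0.0
        while True:
            x += random.uniform(-5, 5)
            y += random.uniform(-5, 5)
            await ws.send(json.dumps({"piece_id": bot_id, "x": x, "y": y}))
            await asyncio.sleep(1.0 / UPDATE_RATE)

async def main():
    await asyncio.gather(*(run_bot(i) for i in range(N_BOTS)))

asyncio.run(main())
```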

Experiment 1: Quadratic Behavior


We wanted to map out the network load characteristics at both ends of the architecture. We used a simple setup consisting of a single proxy, one application server, and the database server. The proxy and the database were on the same physical server; the application server was on a separate server; the bots ran on three other machines. In total, there were five machines, all of them Dell OptiPlex 980s with eight 2.8-GHz i7 cores.

At the proxy side, we expected a quadratic increase of traffic with a linear increase of bots. Indeed, that’s exactly what we observed:

[Figure: Experiment 1, network traffic at the proxy as the number of bots increases]

The red line at the bottom shows the cumulative traffic from the clients to the proxy: as the number of clients increases, that traffic increases linearly. The blue line shows the traffic in the opposite direction, i.e. from the proxy to all the clients. Remember that these systems suffer from N^2 behavior: for every new client, there is an increase of U updates per second (upstream) that need to be distributed to all the other N-1 clients (downstream).

The observed quadratic behavior was expected. It shows, in no uncertain terms, that without optimizations on event streams, scalability of the frontline network load upon increase of the number of users is… a nasty problem.

But this is not the only bottleneck in our architecture. In the most naive implementation of this game, every time the game server receives a position message from a client, it updates the position of the piece in the database, retrieves the list of connected clients, and sends to the proxy, in a single message, 1) the message to be forwarded to the clients, containing the piece’s new position, and 2) the list of clients to send it to (i.e. everyone currently connected). As the number of clients increases, so does the amount of client-related data retrieved from the database. Coupled with an increase in the rate of database lookups, this makes for another nasty quadratic situation:

[Figure: Experiment 1, traffic between the game server and the database as the number of bots increases]
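A sketch of that naive handler is shown below; the table names, the `db` helper, and the message format are illustrative. Every position message triggers one database write and one database read whose result grows linearly with the number of connected clients:

```python
import json

def handle_position_message(db, proxy_connection, message):
    """Naive version: one DB write plus one growing DB read per incoming update."""
    update = json.loads(message)

    # 1. Persist the piece's new position.
    db.execute("UPDATE pieces SET x = ?, y = ? WHERE piece_id = ?",
               (update["x"], update["y"], update["piece_id"]))

    # 2. Fetch the full list of connected clients. This result grows linearly
    #    with the number of users, and it is fetched for every single update.
    recipients = [row[0] for row in
                  db.execute("SELECT client_id FROM connected_clients")]

    # 3. Send the proxy one message: the update to forward plus its recipients.
    proxy_connection.send(json.dumps({"recipients": recipients,
                                      "payload": update}))
```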

Obviously, the naive implementation is terrible; caches need to be used whenever possible. In this case, the game server can keep the list of connected clients in cache, which eliminates the quadratic bandwidth growth altogether. Caches, however, increase the complexity of the game logic, as, in the presence of multiple game servers and proxies, they need to be refreshed every so often.
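With a cached client list, the same sketch no longer touches the database for recipients on every update; the cache refresh on connect/disconnect, and its coordination across multiple game servers, are elided here:

```python
import json

connected_clients = set()    # refreshed on connect/disconnect notifications

def handle_position_message_cached(db, proxy_connection, message):
    """Cached version: recipients come from memory, not from the database."""
    update = json.loads(message)
    db.execute("UPDATE pieces SET x = ?, y = ? WHERE piece_id = ?",
               (update["x"], update["y"], update["piece_id"]))
    proxy_connection.send(json.dumps({"recipients": list(connected_clients),
                                      "payload": update}))
```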

Again, we were not attempting to be smart; we simply wanted to verify the consequences of increasing the number of users in simple, unoptimized architectures. Quadratic behavior can emerge at many points in these architectures.

Experiment 2: Effect of Update Rate


In this experiment, we measured the round-trip latency of messages at the bots, and the CPU consumption on all servers (using the getrusage function). We wanted to see how many proxies and game servers are needed to keep up with a growing number of bots, for different position update frequencies. We define the maximum capacity of the system as the highest number of connected clients for which the 99th percentile of the latency stays below 100 ms.
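That capacity criterion can be sketched with Python's standard statistics module, with the latency samples coming from the bots' round-trip measurements:

```python
import statistics

LATENCY_BUDGET_MS = 100.0

def within_capacity(round_trip_latencies_ms):
    """True if the 99th percentile of the measured latencies is under 100 ms."""
    # statistics.quantiles with n=100 returns 99 cut points; index 98 is p99.
    p99 = statistics.quantiles(round_trip_latencies_ms, n=100)[98]
    return p99 < LATENCY_BUDGET_MS
```

The maximum capacity is then the largest number of connected clients for which this check still holds.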

The commodity machines used for the experiments were Dells with Intel Core i7-2600 CPUs, each with four hyper-threaded 3.4-GHz cores. Up to two machines ran proxy instances, up to two others ran game server instances, one other ran the database, and two others ran the bots. All machines resided on the same 1-Gbps LAN and were one hop from each other. Experiment names follow the format “X/Y/Z”, standing for X cores running one proxy each, Y cores running one game server each, and bots sending Z messages per second. For example, in the scenario 2/2/5, there are two proxy instances running on the same machine, two game servers sharing one other machine, and bots on two other machines sending five updates per second. In the scenario 4+4/4+4/2.5, there are four proxies running on one machine, four proxies on another, four game servers on a third machine, four game servers on a fourth machine, and the bots send 2.5 updates per second.
Here is the main chart:

[Figure: Experiment 2, maximum capacity for different numbers of proxies, game servers, and update rates]


The first thing shown in this chart is that in order to handle roughly twice the number of users, we need four times the number of cores — see, for example, the blue line, which starts at a maximum capacity of 100 users for 1 core for the proxy and 1 core for the game server, and reaches close to 200 users with 4 proxies and 4 game servers. A similar trend is seen for the other lines.

The second thing this chart tells us is that doubling the rate of updates decreases the maximum capacity of the system by 20% to 40%. This is not an intuitive result; the next chart sheds some light on it:

[Figure: Experiment 2, CPU consumption of proxies and game servers in the 4+4/4+4 scenarios]


This chart illustrates the scenario with eight proxies and eight game servers (4+4/4+4). At 2.5 messages per second, the maximum capacity (229 clients) is reached while the game servers consume only around 250% CPU; meanwhile, the proxies consume twice as much. In contrast, when clients send 10 messages per second, the game servers consume more CPU than the proxies at the overall system’s maximum capacity (125 clients). In this scenario, it is difficult to know whether the proxies or the game servers are the bottleneck. It seems that handling many slow-paced connections is more costly than handling fewer but more active connections: for slow message rates, the proxy is the bottleneck, while for fast message rates, the game server may be the limiting factor. In fact, the game logic of our jigsaw puzzle is very simple (e.g. no collision detection); for more sophisticated games, the game server CPU may increase even faster than shown above.

More details of this study can be found in Chapter 7 of this book.

Load Management


As discussed above, because of their N^2 nature, multi-user environments can quickly become saturated and unresponsive. Over the years, researchers and engineers have devised ways of coping with this problem. Below we describe some of the most common techniques for managing and decreasing the load within the confines of the environment design.

Interest Management. Rather than blindly broadcasting all events to all connected users, game servers can do something smarter, and notify users only about the events that matter to them. For example, if the environment simulates 3D space, not much will be lost if the user sees only events for objects and users within, say, a 100m range, because everything else will be very small anyway; alternatively, a combination of size and distance can be used to filter events. Depending on the application, many criteria can be used for managing the interest of users and filtering the events accordingly. Interest management is used pervasively and aggressively in commercial games, as it is the most straightforward technique for cutting down the number of events and the load associated with them.
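A minimal sketch of distance-based interest management follows; the 100 m radius and the data structures are illustrative, and real servers use spatial indices rather than a linear scan:

```python
import math

INTEREST_RADIUS = 100.0    # meters; events farther than this are filtered out

def recipients_for_event(event_position, user_positions):
    """Return only the users close enough to care about this event.

    user_positions maps user_id -> (x, y, z). A linear scan is fine for a
    sketch; a real server would use a grid or tree index to avoid O(N) work
    per event.
    """
    return [uid for uid, pos in user_positions.items()
            if math.dist(event_position, pos) <= INTEREST_RADIUS]
```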

Data prioritization. In these systems, some events are critical, while others are not so critical. When bandwidth or computational resources become scarce, only critical messages need to be forwarded.
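Data prioritization can be sketched as a per-client outgoing queue drained under a byte budget, most critical messages first; the priority levels and the budget are made up for illustration:

```python
import heapq

CRITICAL, NORMAL, COSMETIC = 0, 1, 2       # lower value = higher priority

class PrioritizedOutbox:
    """Queues outgoing messages and drains the most important ones first."""

    def __init__(self):
        self._queue = []
        self._counter = 0                   # tie-breaker preserves FIFO order

    def enqueue(self, priority, message):
        heapq.heappush(self._queue, (priority, self._counter, message))
        self._counter += 1

    def drain(self, byte_budget):
        """Return as many messages as fit within this tick's bandwidth budget."""
        sent = []
        while self._queue and len(self._queue[0][2]) <= byte_budget:
            _, _, message = heapq.heappop(self._queue)
            byte_budget -= len(message)
            sent.append(message)
        return sent
```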

Time dilation. As described earlier, these systems are based on various loops running at specific rates. When the system starts to overload, one possible way of dealing with it gracefully is to slow down the loops incrementally, i.e. to decrease their rates. This makes the perception of time slow down; everything happens as if in slow motion. This technique is used most visibly in Eve Online, but it is also used in special circumstances in OpenSimulator.
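A sketch of time dilation as a scaling factor applied to the simulation timestep is shown below; the load measure, the threshold, and the floor value are illustrative, not the actual Eve Online or OpenSimulator mechanism:

```python
def time_dilation_factor(frame_time_used, frame_time_budget):
    """Scale simulated time down when frames take longer than their budget."""
    if frame_time_used <= frame_time_budget:
        return 1.0                              # keep simulating in real time
    # A frame that took twice its budget runs the world at half speed,
    # with a floor so that time never stops completely.
    return max(0.1, frame_time_budget / frame_time_used)

def step_world(scene, physics_dt, frame_time_used, frame_time_budget):
    dilation = time_dilation_factor(frame_time_used, frame_time_budget)
    scene.step_physics(physics_dt * dilation)   # everything moves in slow motion
```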

Rewriting history. In some of these environments, it is better to be wrong but on time than right but late. This is particularly important when the server starts to be overloaded and latency is high. As such, the clients often react immediately to local events with actions that, upon verification on the server, may need to be rolled back, sometimes producing a visible glitch. For example, if the player shoots another player, the client may instantly decide that it was a hit and start showing a corresponding animation on the player that’s been hit. However, this is just a local estimate of the true sequence of events. Once communicated to the server, and compared to the reported location of the other player, the server may determine that there was no hit. In that case, the local client needs to roll back its decision. Games can minimize these visual glitches by using a variety of special effects such as clouding the spot of the potential hit with fog that makes the outcome unclear for a few milliseconds until the authoritative decision is received from the server.
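A sketch of that "wrong but on time" pattern: the client applies a local guess immediately, remembers it, and rolls it back if the server's authoritative verdict disagrees. All names here are illustrative:

```python
pending_predictions = {}    # shot_id -> predicted target, awaiting the server

def on_local_shot(shot_id, target, effects):
    """React immediately: assume a hit without waiting for the server."""
    effects.play_hit_animation(target)
    pending_predictions[shot_id] = target

def on_server_verdict(shot_id, was_hit, effects):
    """The authoritative answer arrives later; roll back if we guessed wrong."""
    target = pending_predictions.pop(shot_id, None)
    if target is not None and not was_hit:
        effects.cancel_hit_animation(target)    # the visible glitch, masked by fog etc.
```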

Load Partitioning


In massive multi-user environments, however, managing the load is not enough. The workload generated by the users, scripted objects and NPCs is such that it needs to be partitioned among several servers. The following are the most commonly used approaches.

Shards. Rather than trying to scale the games to thousands of users in the same game play, many commercial games cap the number of interacting users to a few players within a space, and provide multiple copies of that same space for different groups of users. These copies are called shards. Typically, each shard has its own dedicated hardware. This practice can be traced back to Ultima Online in 1997; it is used pervasively in commercial games, such as World of Warcraft, Call of Duty, and countless others.

Space partitioning (static). One of the most common ways of partitioning the load of 3D environments, and the oldest, is to partition the virtual space into regions, or zones, each one assigned to a server. This idea can be traced back to DIVE in the early 90s, and it is used in many commercial games, including Second Life and, consequently, OpenSimulator. In OpenSimulator, these regions are squares whose sizes are multiples of 256m. Each server is responsible for simulating the objects and NPCs within its space, and for handling the client connections of users in that space. Users can move from region to region following some hand-off protocol.
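A sketch of static space partitioning on a 256 m grid, in the spirit of, but not copied from, OpenSimulator's region layout: a position maps to a region coordinate, and each region coordinate maps to the server responsible for it. The server names are hypothetical.

```python
REGION_SIZE = 256.0    # meters per region side

# Hypothetical static assignment of region coordinates to simulation servers.
region_to_server = {
    (0, 0): "sim-a.example.org",
    (0, 1): "sim-a.example.org",
    (1, 0): "sim-b.example.org",
    (1, 1): "sim-b.example.org",
}

def region_for_position(x, y):
    """Map a world position to the region (and thus server) that simulates it."""
    return (int(x // REGION_SIZE), int(y // REGION_SIZE))

def needs_handoff(old_position, new_position):
    """Crossing a region border triggers the hand-off protocol between servers."""
    return region_for_position(*old_position) != region_for_position(*new_position)
```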

Space partitioning (dynamic). Static space partitioning can result in uneven load distributions, because users tend to gather in groups that are unevenly distributed in space. Dynamic space partitioning techniques aim at addressing this problem by dynamically reconfiguring the portions of virtual space associated with servers, depending on the number of users in the areas. Some of these approaches use Voronoi diagrams, others use microcells that can be dynamically associated with servers. Recent work suggests that the amount of interactions between two dynamically configured regions can result in a high load for adjacent regions, thereby making dynamic space partitioning not suitable for all object or player distributions and behaviors.

Besides these simple approaches to load partitioning, there are others that look at the load from different angles. Most of these are, or have been, research experiments.
Project Darkstar, the server supporting OpenWonderland, divides the load by task triggered by user input. Kiwano divides the world by space, and groups the load generated by the users’ input by their proximity in space. Meru partitions the load by both space and content (3D objects). Within the OpenSimulator ecosystem, the DSG architecture partitions the workload by feature, such as physics and script execution. We are currently experimenting with aspect-oriented load partitions, where different aspects of the simulation are assigned to different processes or even servers.

The art of designing scalable distributed virtual environments lies in finding the right load partitions for the purpose for which the environments are being designed. Load partitions carry overhead costs in terms of coordination among servers. In a “good” partition design, the system will scale horizontally, i.e. more load can be handled by adding more servers in as linear a relation as possible; in a “bad” partition design the benefits of load distribution will be dwarfed by the communication overhead among servers.

Final Words


As mentioned in the beginning, these systems are fascinating. Fundamentally, what is being played out in them (besides the games) is the tension between global effects and local observations, on a platform with insurmountable limits such as the speed at which information can travel over the network, the bandwidth of the network, and the resource limits of the computing nodes. More than most distributed systems, environments that share a virtual 3D space among people distributed all over the world make these fundamental problems quite tangible. In order to achieve the illusion of a single, consistent universe, we are left having to play with space and time: what runs where (load partitioning) and when (interpolation, delayed execution, time dilation, etc.).

Additional References


"This post includes joint work with Thomas Debeauvais and Arthur Valadares, both graduate students at UC Irvine."


4 comments:

  1. My two cents here, as I focused on the bandwidth optimization aspects of multiplayer games in an article and a talk.

    To make a long story short, to use TCP for multiplayer games, adaptive streaming techniques should be used. In other words, you need to make sure that the amount of real-time data sent to synchronize the game world among the clients is governed by the currently available bandwidth and latency for each client. Adaptive streaming techniques include dynamic throttling, conflation and delta delivery.

    If you would like to learn more and possibly provide some feedback, here are the pointers:

    The article is titled "Optimizing Multiplayer 3D Game Synchronization Over the Web" and was republished by Gamasutra and GameDev. Here is the original: http://blog.lightstreamer.com/2013/10/optimizing-multiplayer-3d-game.html

    The talk was given at HTML5DevConf and is titled "More Than Just WebSockets for Real-Time Multiplayer Games and Collaboration". You can find both the video and the slides linked here: http://blog.lightstreamer.com/2013/12/html5devconf-more-than-just-websockets.html

    1. Thanks, Alessandro, for sharing your thoughts and providing the links to the other articles. These are definitely complementary.
