"Desync" is a very hot topic. At best it's a minor annoyance when it occurs and at worst it can cause characters to get killed in situations where they thought there were no monsters around. We have many changes coming that will substantially improve the situation, but would like to also explain how our synchronisation systems work in case you're interested, and to make it clear that game state synchronisation is a problem that all online games need to deal with.
In this article I'm going to try to clearly explain:
- How different types of online games handle latency
- How our system of action prediction works
- Why sync problems occur with this system and how they manifest
- Why desync has to exist and why rubber-banding is good
- Why some other games don't appear to have similar problems
- What we're planning to do to improve synchronisation
How different types of online games handle latency
Any game has calculations that occur to determine the result of actions. In RPGs, these can range from combat calculations (who did what damage) to important economic transactions involving game items. To prevent players cheating, it's important that these calculations are not done on the gamer's computer, because they can easily modify the result of such calculations.
Because of this, all calculations that affect someone's progress must be done on servers that we control. These servers exist all over the world (Texas, Amsterdam, Singapore and Australia), but due to the speed of light and other physical limitations, it's not instant to send or receive data from them. We typically see response times between our players and the servers of around 50-250ms.
All online games have this situation. The server has to dictate whether things happen or not, but there's a 50-250ms delay before data gets to the server and back. There are three ways that games can solve this:
- Trust the client. This means people can cheat, but the results are instant. We will not do this.
- Wait until data arrives back from the server before doing anything. This is a very common strategy in RTS and MOBA games. If you click to move, the unit will only start moving once the server says so, which is 50-250ms later. If you are close to the server, you'll quickly get used to the lag and everything feels pretty good. If you're far away (New Zealand, for example), it feels like you're playing drunk. Every time you issue an order, nothing happens for a quarter of a second. This does not work well for Action RPGs - the immediacy and pace of combat requires that actions start to execute instantly.
- Start predicting the result of the action as though the server said yes, immediately. When the server later gets back to you with a result, factor it in. This is what Action RPGs including Path of Exile do. It means that when you click to move, or click to attack, it occurs instantly and feels great. The problem is what happens when the server decides that the action can't have occurred - that's when the game gets badly out of sync.
Action RPGs have to use the third system (action prediction) to feel responsive. The problem is, the second you start moving, you're implicitly out-of-sync by definition. Your client has drawn the first few frames of movement (to be nice and responsive), but the server has no idea you clicked a button yet until the data arrives. Action prediction is mandatory for this type of game but results in you being slightly out-of-sync almost all of the time. This is generally no problem, but once too many predictions get made based on incorrect data, very bad things happen. The challenge is detecting and correcting the situation before this occurs.
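To make the third approach concrete, here is a minimal Python sketch of client-side prediction with server correction. It is a generic illustration of the technique, not code from the game: the class name `PredictingClient`, the one-dimensional position and the sequence-number scheme are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PredictingClient:
    """Hypothetical client that applies moves instantly and corrects later."""
    position: float = 0.0
    next_seq: int = 0
    # Moves applied locally that the server has not confirmed yet: (seq, delta).
    pending: list = field(default_factory=list)

    def move(self, delta: float) -> int:
        """Apply the move immediately (prediction) and remember it."""
        seq = self.next_seq
        self.next_seq += 1
        self.position += delta
        self.pending.append((seq, delta))
        return seq

    def on_server_state(self, last_seq: int, server_position: float) -> None:
        """Accept the server's position after it processed move `last_seq`.

        The server is authoritative, so the client snaps to its position and
        then replays only the moves the server has not seen yet.
        """
        self.pending = [(s, d) for s, d in self.pending if s > last_seq]
        self.position = server_position
        for _, delta in self.pending:
            self.position += delta
```

If the server agrees with every prediction, the correction is invisible. If it reports that a move was blocked, the client's position jumps when the correction is applied, which is the rubber-banding described later in this article.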
How our system of action prediction works
Let's say you're playing with 200ms round-trip latency and you click a monster that is 2 seconds of travel distance away from you. Assume your attack animation has its contact point after 300ms (which is where damage is dealt).
0ms: You click the monster. Your character starts running towards it on the client.
100ms: Your click arrives at the server. The character there starts running towards the monster also. At this stage your local character is already 5% of the way there.
2000ms: Your character arrives at the monster on the client. It's not there yet on the server. You don't even know if it'll ever arrive for sure (it might get interrupted by an attack still). Your client starts to animate the sword swing.
2100ms: Your character arrives at the monster on the server. The server immediately performs the combat calculation in advance of the contact point and sends the tentative result back to the client.
2200ms: You receive the notification from the server about what type of damage you will deal and roughly how much. Thankfully it arrived before the contact point of the animation! This is not always the case.
2300ms: You hit the contact point on the client. Because you have the damage information in advance, you can draw a pleasing blood splatter, fire effect and so on. This hit has not even occurred yet on the server.
2400ms: You hit the contact point on the server. The damage is locked in and actually applied to the monster. It dies. Experience and item drops are calculated and sent to the client.
2500ms: Your client receives an experience update and the information of what items to show falling to the ground.
Despite the fact that your information is delayed by 100ms, it arrived before the contact point and the only indication of playing under latency that the client noticed was the fact that it took a tenth of a second for the item drops to arrive. At no point in that process was any gameplay calculation compromised in a way that would enable players to cheat the system.
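The timeline above is simple arithmetic on three numbers, and can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes latency is symmetric (one-way delay is half the round trip) and perfectly stable, which is what the example above assumes as well.

```python
def attack_timeline(round_trip_ms: int, travel_ms: int, contact_ms: int):
    """Return (time_ms, side, event) tuples for a click-to-attack sequence."""
    one_way = round_trip_ms // 2  # assumption: symmetric latency
    events = [
        (0, "client", "click; start running"),
        (one_way, "server", "click arrives; start running"),
        (travel_ms, "client", "arrive; start swing animation"),
        (travel_ms + one_way, "server", "arrive; compute tentative damage"),
        (travel_ms + round_trip_ms, "client", "tentative damage received"),
        (travel_ms + contact_ms, "client", "contact point; draw hit effects"),
        (travel_ms + one_way + contact_ms, "server",
         "contact point; damage applied"),
        (travel_ms + round_trip_ms + contact_ms, "client",
         "experience and drops received"),
    ]
    # Sort by time only, so events at the same instant keep their listed order.
    return sorted(events, key=lambda event: event[0])


def result_arrives_in_time(round_trip_ms: int, contact_ms: int) -> bool:
    """True if the tentative damage reaches the client by its contact point."""
    return round_trip_ms <= contact_ms
```

Calling `attack_timeline(200, 2000, 300)` gives the same eight timestamps as the walkthrough, from 0ms to 2500ms. The second function shows why the result arrived in time: the round trip (200ms) is shorter than the delay before the contact point (300ms). With a 400ms round trip it would not be.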
Why sync problems occur with this system and how they manifest
The above example assumes that everything went smoothly. It's entirely possible for the 2 second travel time to be completely different on both ends, or for a lag spike to occur causing the timing to get completely out of sync. If the attack is interrupted on the server before it starts (during movement) but not on the client, then you have a long animation playing that can't be cancelled because the communication time is a significant fraction of the animation's length.
Even if no strange lag occurs, the monsters that are nearby are pathfinding on the client to where they think you are - which by definition is different than on the server because of latency. These entities have to find paths that go around the other monsters, which of course are in subtly different positions on both ends. The differing paths further contribute to the monsters being in the wrong place.
It's worth stressing that in 99% of combat events, everything feels fine. Although the simulation is out-of-sync due to the speed of data transmission, the timing generally works out and monsters who are following weird paths get to you at roughly the right times and in roughly the right places. It's hard to really know that anything's wrong... except when it's horribly wrong.
Unfortunately, when things are very out of sync, players have a pretty bad time. They take damage out of nowhere or find that they're actually trapped between monsters that didn't appear in the right places on their client. We have code to detect these situations and hopefully resync (rubber-band) the entities back into place quickly, but it's often not good enough.
Why desync has to exist and why rubber-banding is good
The key thing to understand is that Action RPGs have to use an action prediction system like this. If they wait for confirmation of every action from the server then it feels terrible to control.
Even if our resyncing code was perfect, there would be situations where the game gets out of sync just because of tiny timing differences. Imagine you're running near a large rock, and you arbitrarily click on the other side of it. Both the client and the server attempt to find the shortest path around the rock. Because your client is ahead of the server by definition (as the movement was processed there approximately 50-250ms earlier, so that it was responsive), there are cases where the client may choose to go a different way around the rock than the server. If you were hit by a monster en-route, then your movement will be interrupted in a different place on both simulations. You are now out of sync. Intelligent resync code would detect this and rubber-band you across the rock back to where you're meant to be.
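The size of the gap created by an interruption can be estimated with a small worked example. This Python sketch is an illustration under stated assumptions, not the game's logic: movement is along a straight line at constant speed, latency is symmetric, the client keeps moving until news of the interruption reaches it, and `rubber_band` with its `tolerance` parameter is a hypothetical stand-in for resync code.

```python
def positions_at_interrupt(speed: float, round_trip_ms: float,
                           interrupt_server_ms: float):
    """Return (client_pos, server_pos) once the client learns of the interrupt.

    Times are measured from the moment of the click on the client.
    """
    one_way = round_trip_ms / 2
    # The server only started moving once the click arrived.
    server_travel_ms = max(0.0, interrupt_server_ms - one_way)
    # The client started at 0 and keeps moving until the news arrives.
    client_travel_ms = interrupt_server_ms + one_way
    return (speed * client_travel_ms / 1000, speed * server_travel_ms / 1000)


def rubber_band(client_pos: float, server_pos: float,
                tolerance: float) -> float:
    """Snap to the server position when the gap exceeds the tolerance."""
    if abs(client_pos - server_pos) > tolerance:
        return server_pos
    return client_pos
```

For a character moving at 10 units per second with a 200ms round trip, interrupted on the server 1000ms after the click, the client ends up at 11.0 and the server at 9.0. Under these assumptions the gap is simply speed multiplied by round-trip time, so faster movement and higher latency both make it worse.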
The key observation here is that improved resync code involves more rubberbanding than we have at the moment. If we do it properly, monsters and players will be corrected to better positions more frequently, to prevent anything getting drastically out of place. Many players interpret the rubber-banding itself as "desync", when in reality it's what is fixing the problem as it is detected. It's not going to be easy explaining that the increased rate of rubber-banding is not only good, but also the ideal solution.
Why some other games appear to not have similar problems
Games using the "wait until server responds" method (RTS and MOBA games) have much higher input latency but don't have the same sync issues that we do. They have their own class of game state synchronisation problems that we thankfully don't have to deal with.
Games using client action prediction like ours run into exactly the same sync issues that we do unless they cheat on certain aspects of the simulation. For example, it's common for Action RPGs to do some combination of the following:
- Entities can hit each other from a long distance away
- There's no chance to hit - all hits occur for sure
- Various speed/collision concessions that make it easy to speedhack and/or walk through monsters with modified clients
- Attack animations cannot be interrupted (i.e. what we treat as Stun).
Unfortunately, we don't want to do any of those things! They each individually ruin part of the hardcore experience: by allowing combat/movement cheats, preventing accuracy from existing as a mechanic, preventing stunlock, preventing people getting blocked in, etc.
Due to the fact that we want to have hardcore game mechanics (i.e. ones where position matters and it's difficult to cheat in PvP), the only option for us has been to put a lot of work into improving our combat simulation and resync code.
What we're planning to do to improve synchronisation
There are a lot of changes that we're experimenting with that may individually improve the synchronisation of the combat simulation (along with their potential drawbacks):
- Have monsters on the client attack your server location rather than client location to reduce entropy. Maybe compromise on them attacking a mid-way point between the two. The drawback here is that it'll look like they are swinging at the air, but they're technically more in sync.
- Display blood and elemental effects at the contact point on the client rather than as damage confirmation. This will mean that combat feels more impactful, but we lose the communicated visual information about whether damage was actually dealt. It could be that this is easier to apply to effects from spells because they generally don't have a hit/miss calculation.
- Resync entities that successfully hit you when nothing is on the client near you. This may actually pull the entity even more out-of-sync if you're in the wrong place yourself.
- Resync everything in an area around a desynced entity. This reduces overall entropy massively but would be pretty jarring.
- Delay actions if the client was ahead on its path. This will solve the case where monsters die before you get to them (if you were out of sync) but technically results in lower combat efficiency for players in these cases.
- Improve the distance-based resyncing that occurs for things that are far away from where they should be. It doesn't currently take movement speed into account properly. This is why Rhoas feel quite out of sync when charging.
- Measure overall entropy around the player and force a resync if it exceeds some threshold. The problem is that by the time the resynced information gets to the client, more actions could have occurred.
- Fix bugs with specific skills that cause them to act differently on the client and server (Whirling Blades, for example, sometimes fails to trigger based on distance on one end).
- Other changes too subtle/difficult to explain clearly here
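As an illustration of the entropy-threshold idea in the list above, here is a minimal Python sketch. The definition of "entropy" used here (the sum of position errors for entities near the player) is an assumption made for the example, as are the function names, the radius and the threshold. It also assumes that some component can see both the client-side and server-side position of each entity.

```python
import math


def local_entropy(player_pos, entities, radius: float) -> float:
    """Sum client/server position errors for entities near the player.

    `entities` is a list of (client_pos, server_pos) pairs of (x, y) tuples.
    Only entities whose server position is within `radius` are counted.
    """
    total = 0.0
    for client_pos, server_pos in entities:
        if math.dist(player_pos, server_pos) <= radius:
            total += math.dist(client_pos, server_pos)
    return total


def needs_resync(player_pos, entities, radius: float,
                 threshold: float) -> bool:
    """True when the measured entropy exceeds the chosen threshold."""
    return local_entropy(player_pos, entities, radius) > threshold
```

A check like this only says that a resync should be sent. It does nothing about the drawback noted in the list: the correction still takes time to reach the client, and more actions can happen in the meantime.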
At this stage it looks like the biggest gains will come from improving the resync code so that it rapidly and reliably resyncs the combat situation if things get too desynchronised. This will mean more rubber-banding (as explained earlier), but will massively reduce deaths that occur from the player not being able to see the true locations of entities.
I explained the above changes with their drawbacks because I want to make it clear that this problem is intrinsically difficult to solve. We're fighting against both the laws of physics (travel speed of data) and the desire to not compromise gameplay mechanics. I have full confidence that we will incrementally deploy changes that substantially improve this situation.
Update (April 20, 2014):
As I post this article to the new Development Manifesto, I couldn't find any way to improve the above explanation - it's still very accurate, and other than a few small edits, I'd like to leave it mostly as-is.
There are some points that I'd like to clarify that have come up since it was initially posted:
- Desync is not affected by stress on our servers (as explained in the above article). Buying more servers doesn't solve the problem. We have enough servers currently. When they're overloaded, they fail to spawn new areas, rather than affecting combat prediction.
- Some players frequently call things "desync" that aren't related to action sync. For example, if there's a lag-spike on the general internet, or if the client has to load an art asset in a hurry (which causes a frame-rate stall), then this is often described as "desync" by some players. Those other problems are issues that require separate solutions. It's relatively frustrating to see someone say "the desync is really bad today" when they mean "Some ISP between me and the server is lagging badly today".
- The reason why people feel that sync is worse in races is because those are the times when they're really pushing their character through small doorways and dense packs of monsters. Races occur on the same servers and, unless league mods are present, under the same game rules as the regular game.
- Suggesting we switch to a synchronous action model like RTS/MOBA games really isn't a good solution. I understand that RTSs/MOBAs don't have desync, but they do have lag after each and every click. When the semantic is that you're ordering a unit around, this lag is understandable, but if you *are* the unit, then it feels really terrible. Right now, our client-prediction allows users in obscure countries to play Path of Exile while under 200+ms of latency. This would be a much worse experience under that alternate action model. In addition, it'd require rewriting most of the game. UPDATE: We re-wrote most of the game and now we support both models simultaneously. Turns out to be okay.
- There are absolutely problems with various skills (Whirling Blades, Cyclone, Brutus' Hook, etc) that make them more desync-prone than others. These are being aggressively investigated and we should have good progress to share on that front soon.
- Other than skill-specific improvements, our experimental changes so far attack the problem by causing more intelligent resyncs. There's a very tough trade-off between small frequent resyncs (which make the game feel jittery but it's kept in sync) vs less-frequent but larger teleports. Users have come to associate a resync with being out of sync, so updating them more frequently makes them feel like the sync is worse rather than better.
To elaborate on the last point and clarify the problem in a nutshell: in order to keep hardcore game mechanics like body-blocking, stunning and missing while also preventing players from manipulating combat results, small amounts of desync will occur naturally. There is no way around this, due to the speed of light. An ideal solution from Grinding Gear Games would be to very rapidly detect and correct those sync problems, putting things back where they should be. We have not yet delivered this solution to our satisfaction. Once we have, though, you may notice periodic resyncs, which may initially feel like you're out of sync all the time. That's because the system will be acknowledging it and correcting it, rather than assuming that it's all going to be fine and letting you end up two rooms away, pinned against a wall.
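The trade-off between small frequent resyncs and large infrequent ones can be shown with a toy Python simulation. It assumes, purely for illustration, that position error grows by a constant amount every tick and that a resync removes all of it. Real drift is nothing like this regular, and the function name and parameters are hypothetical.

```python
def simulate_resyncs(drift_per_tick: float, ticks: int, resync_every: int):
    """Return (number_of_corrections, largest_correction) for a fixed schedule."""
    error = 0.0
    corrections = []
    for tick in range(1, ticks + 1):
        error += drift_per_tick  # assumption: constant drift each tick
        if tick % resync_every == 0:
            corrections.append(error)  # size of the visible rubber-band
            error = 0.0
    return len(corrections), max(corrections, default=0.0)
```

With a drift of 0.5 units per tick over 60 ticks, resyncing every 5 ticks gives 12 corrections of at most 2.5 units, while resyncing every 30 ticks gives 2 corrections of 15 units each. The first schedule rubber-bands six times as often, but the character is never far from where the server thinks it is.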
I'll update this article as we continue to make progress on this area. Thanks for reading this far - you now know more than you ever wanted to about the pains of networked game state synchronisation.