Incident Report On Full Realm Crashes

At 12:36 NZT, 14:21 NZT and 15:55 NZT we had full realm crash events. I would like to talk about what led to these and what we are doing to fix them.

The first crash was at 12:36 NZT. The realm failed to auto recover and we had to manually intervene to restart the realm. Service was restored at 12:51 NZT.

Our server team was investigating almost immediately and the most obvious problem was a series of Out of Memory errors on the Instance Manager, a server that manages the set of game servers that the players play on. We came up with a plan to reduce the database cache size to free up memory for other purposes but unfortunately this would be difficult to do without a full realm restart.

We changed the configuration so that if the realm were to crash again then it would have the correct cache configuration after restarting. We prepared a patch that included some other fixes and began the process of deploying it.

Before we could do this, however, the realm crashed again at 14:21 NZT. The realm again failed to auto recover and we decided to manually deploy the patch as part of the restart. Service was restored at 14:30 NZT. While this was unfortunate, we believed that at least the fix was in place so we wouldn't see another crash.

Another aspect of this incident was that we forgot to deploy the patch for the Steam client. Normally this would happen as part of the automatic deployment system, but because this patch was deployed manually this did not happen. We didn't realise that this had occurred until we started to see user reports on this issue a little while later.

Unfortunately, investigation of the second realm crash showed that there was no out of memory condition. Upon closer examination of the logs, it turned out that a rare condition in the reconnection logic for services in our backend network could lead a service to think that the realm had crashed, causing the service to restart itself. This error had existed for a long time, but a second bug, introduced with Ascendancy, was causing the Instance Manager to disconnect and reconnect approximately once per minute. In addition, the large number of players in Ascendancy made this bug manifest more easily.

The first crash was also caused by this problem, but was masked by the out of memory condition.

As soon as we discovered this, we fixed the issue that caused the Instance Manager's frequent disconnections and put this change in place so that if the realm were to crash again it would be in a better state.

We then began work on a fix for the reconnection issue.

Unfortunately at 15:55 NZT the realm crashed again. The realm again failed to auto recover and required manual intervention. Service was restored at 16:01 NZT.

We have now fixed the reconnection issue but it will require a further restart in order to apply this fix. We are currently planning on applying this in the next patch. It is possible that the realm will crash again before that point, but we believe the chance of this is low because we have fixed the bug that was causing the large numbers of disconnections in the first place.

The initial cause of this incident was a combination of two bugs, one of which is long standing and one of which was more recent. The long-standing reconnection bug would have been very hard to detect in any usual circumstance but the bug with the Instance Manager should have been caught.

If we had inspected the logs of the pre-release versions more carefully then this bug would have been caught. In the future we will be adding a much more stringent requirement for log inspection of pre-release versions of the game before we will allow them to go live.
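As a rough illustration of the kind of check this log inspection could include, here is a minimal sketch that flags any service reconnecting suspiciously often. It is not our actual tooling: the log line format, the `find_flapping_services` name and the thresholds are all assumptions made up for this example.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log line format (an assumption for this sketch):
#   "2016-03-04 12:30:05 instance_manager RECONNECT"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) RECONNECT$")


def find_flapping_services(lines, max_reconnects=3, window=timedelta(minutes=10)):
    """Return the sorted names of services that reconnect more than
    max_reconnects times within any time span of length `window`."""
    events = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events[match.group(2)].append(timestamp)

    flapping = set()
    for service, times in events.items():
        times.sort()
        start = 0
        # Sliding window over the sorted reconnect timestamps.
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 > max_reconnects:
                flapping.add(service)
                break
    return sorted(flapping)
```

With thresholds like these, a service that disconnects and reconnects about once per minute would be flagged within the first few minutes of a pre-release test run, while an occasional isolated reconnect would not.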

Another issue that this situation has brought to light is that our automatic realm recovery code failed in all three crashes. We will need to do a full audit of that process in order to make sure that the realm will recover from any incidents quickly in the future.

Realm crashes do happen, but we should never have a crash that was preventable, and extended realm downtime is totally unacceptable. This isn't the level of service we want associated with Grinding Gear Games. I would like to apologise for these incidents and assure you that we are going to do our best to improve the situation in the future.
Path of Exile II - Game Director
Last edited by Jonathan on Mar 4, 2016, 11:25:09 PM
Last bumped on Mar 9, 2016, 12:14:16 PM
