Incident Report On Full Realm Crashes

At 12:36 NZT, 14:21 NZT and 15:55 NZT we had full realm crash events. I would like to talk about what led to these and what we are doing to fix them.

The first crash was at 12:36 NZT. The realm failed to auto recover and we had to manually intervene to restart the realm. Service was restored at 12:51 NZT.

Our server team was investigating almost immediately and the most obvious problem was a series of Out of Memory errors on the Instance Manager, a server that manages the set of game servers that the players play on. We came up with a plan to reduce the database cache size to free up memory for other purposes but unfortunately this would be difficult to do without a full realm restart.

We changed the configuration so that if the realm were to crash again then it would have the correct cache configuration after restarting. We prepared a patch that included some other fixes and began the process of deploying it.

Before we could do this, however, the realm crashed again at 14:21 NZT. The realm again failed to auto recover and we decided to manually deploy the patch as part of the restart. Service was restored at 14:30 NZT. While this was unfortunate, we believed that at least the fix was in place so we wouldn't see another crash.

Another aspect of this incident was that we forgot to deploy the patch for the Steam client. Normally this would happen as part of the automatic deployment system, but because this patch was deployed manually this did not happen. We didn't realise that this had occurred until we started to see user reports on this issue a little while later.

Unfortunately, investigation of the second realm crash showed that there was no out of memory condition. Upon closer examination of the logs, it turned out that there was a rare condition in the reconnection logic for services in our backend network that could lead a service to think that the realm had crashed, causing the service to restart itself. This error had existed for a long time, but a second bug, introduced with Ascendancy, was causing the Instance Manager to disconnect and reconnect approximately once per minute. In addition, the large number of players in Ascendancy made this bug manifest more easily.
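
To make the distinction concrete, here is a minimal sketch of one way a service can tell an ordinary reconnect apart from a real realm restart. It is not the actual backend code: the `BackendService` class, the `realm_epoch` identifier and the handler names are all hypothetical, and the epoch comparison is an assumed technique chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BackendService:
    """Hypothetical backend service that tracks which realm run it belongs to."""
    realm_epoch: int  # assumed ID that changes only when the realm really restarts
    restarts: int = 0
    connected_peers: set = field(default_factory=set)

    def on_peer_disconnect(self, peer: str) -> None:
        # A dropped connection on its own says nothing about the realm's health.
        self.connected_peers.discard(peer)

    def on_peer_reconnect(self, peer: str, peer_epoch: int) -> None:
        self.connected_peers.add(peer)
        # Restart only if the peer reports a different realm run;
        # a reconnect within the same run is treated as a network blip.
        if peer_epoch != self.realm_epoch:
            self.restart(peer_epoch)

    def restart(self, new_epoch: int) -> None:
        self.restarts += 1
        self.realm_epoch = new_epoch
        self.connected_peers.clear()


service = BackendService(realm_epoch=1)
for _ in range(60):  # roughly an hour of once-per-minute reconnects
    service.on_peer_disconnect("instance_manager")
    service.on_peer_reconnect("instance_manager", peer_epoch=1)
print(service.restarts)  # 0
```

In this sketch, repeated reconnects from the same realm run never trigger a restart; only a changed epoch does.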

The first crash was also caused by this problem, but was masked by the out of memory condition.

As soon as we discovered this, we fixed the issue that caused the Instance Manager's frequent disconnections and put this change in place so that if the realm were to crash again it would be in a better state.

We then began work on a fix for the reconnection issue.

Unfortunately at 15:55 NZT the realm crashed again. The realm again failed to auto recover and required manual intervention. Service was restored at 16:01 NZT.

We have now fixed the reconnection issue but it will require a further restart in order to apply this fix. We are currently planning on applying this in the next patch. It is possible that the realm will crash again before that point, but we believe the chance of this is low because we have fixed the bug that was causing the large numbers of disconnections in the first place.
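
The reasoning here can be illustrated with a little arithmetic. The sketch below assumes each reconnect independently has the same small chance of hitting the rare condition; the per-reconnect probability and the reconnect counts are made-up numbers for illustration, not measurements from the realm.

```python
def crash_probability(p_per_reconnect: float, reconnects: int) -> float:
    """Chance that at least one of `reconnects` independent reconnects hits the bug."""
    return 1.0 - (1.0 - p_per_reconnect) ** reconnects


P = 1e-4  # hypothetical chance that a single reconnect triggers the rare condition

# With the Instance Manager bug: about one reconnect per minute for a day.
frequent = crash_probability(P, 60 * 24)

# With that bug fixed: only a handful of reconnects per day (assumed figure).
occasional = crash_probability(P, 5)

print(f"frequent reconnects: {frequent:.4f}, occasional reconnects: {occasional:.6f}")
```

The rare condition is still present until the next patch, but with far fewer reconnects there are far fewer opportunities for it to occur.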

The initial cause of this incident was a combination of two bugs, one of which is long-standing and one of which was more recent. The long-standing reconnection bug would have been very hard to detect in any usual circumstance but the bug with the Instance Manager should have been caught.

If we had inspected the logs of the pre-release versions more carefully then this bug would have been caught. In the future we will be adding a much more stringent requirement for log inspection of pre-release versions of the game before we will allow them to go live.

Another issue that this situation has brought to light is that our automatic realm recovery code failed, and we will need to do a full audit of that process in order to make sure that the realm will recover quickly from any future incidents.

Realm crashes do happen, but we should never have a crash that was preventable, and extended realm downtime is totally unacceptable. This isn't the level of service we want associated with Grinding Gear Games. I would like to apologise for these incidents and assure you that we are going to do our best to improve the situation in the future.
Path of Exile II - Game Director
Last edited by Jonathan on Mar 4, 2016, 11:25:09 PM
Last bumped on Mar 9, 2016, 12:14:16 PM
This is my first day with the game. It's great. Thanks for the hard work and the interesting read.
I always have mad respect for people who can own up to their mistakes. Well done.
1 bug is usually easy to find.
2+ bugs isn't just 2 times harder, it's 10 times harder, cause they interact, and it's also tough when one was there for a long time, cause you have to re-check some old stuff to grab it.
Now, let's just hope a third one didn't dodge your debugging skills XD
Anyway, gratz =]
PLEASE QUOTE ME IF YOU ARE EXPECTING A REPLY
Last edited by xhul on Mar 5, 2016, 12:02:18 AM
Jonathan, get some low level intern to type this for ya. Your time is better spent elsewhere, not to mention it costs a lot!
IGN: MyOtherRangerDied // DontBeStupidAndDieAgain // GenericSporkerWitch
https://google.com
Thanks for the info. Unexpected to get technical detail about the launch.

The servers weren't down for very long, thanks for the hard work.

Always a pleasure to have these reports, this communication is part of why you guys are so great!
As always, the transparency (and moreover, promptness of it) is appreciated. That IS something that sets GGG apart from the rest.

Granted, I know that you must simplify it for the sake of readability of your entire playerbase (it makes it less transparent, I suppose, if <1% of your players can understand it) but it HAS made me a bit more curious, on at least one case:

Jonathan wrote:
"Our server team was investigating almost immediately and the most obvious problem was a series of Out of Memory errors on the Instance Manager"

On the last day or so of Talisman, I noticed that it SEEMED as if the server was failing to eliminate "closed" instances that had timed out. At the time, I had shrugged it off, and assumed there was nothing awry there. (I was re-running multiple instances of The Ledge, as I did 7 rigwald runs in quick succession; I noticed there were MANY instances with the timer marked "closed")

Is there a chance I was mistaken, and that might've been related? If so, I apologize for not deeming it worth reporting.
Rufalius, hybrid Aura/Arc/Mana Guardian | Hemorae, TS Raider | Wuru, Ele Hit Wand Trickster
Thanks for the in-depth report! It's really incredible that we got this level of detail AND this fast. I guess there's a reason Jonathan's time is the most valuable time :)
Seriously, thank you GGG. It means the world to me when you put all your soul into keeping the game healthy.
Prejudice is a burden that confuses the past, threatens the future, and renders the present inaccessible.
