Incident Report for 2016-06-09 Downtime

At 10:53am NZT a server administrator pushed a configuration change that was supposed to add some new servers into the Path of Exile realm. The configuration change failed to apply, and we quickly noticed that new game instances were failing to start.

By 10:55am NZT the administrator had realised his mistake. The name of the deployed build was incorrectly specified. He quickly created a fixed configuration file and attempted to push this configuration change.

Unfortunately this is when we discovered the true damage that the initial configuration push had caused.

Each Path of Exile build is stored in its own directory on the server. On some of our older servers we actually have builds going all the way back to version 0.10.0. For example, the current build is stored in builds/2.3.0/7/.

Everything belonging to a build (executables, data, configuration, scripts, etc.) is stored in its build directory. In a deploy, the working directory that the servers run from is symlinked to the directory of the correct build.

Because the build name had been specified incorrectly, the reconfiguration had changed the symlinks on every Path of Exile server to point to a non-existent directory.

Because the reconfiguration scripts are also part of the build and accessed through this symlink, when we tried to push a fixed configuration, the reconfiguration scripts were not present and the reconfiguration failed. Without our tools for deploying configuration, we would need to fix the hundreds of servers that make up the Path of Exile realm manually.
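The failure mode described above can be reproduced in a throwaway directory. The following is only a minimal sketch under stated assumptions, not our actual tooling: the `current` link name, the `scripts/reconfigure.sh` path, and the function names are all hypothetical, and it assumes a POSIX system where symlinks can be created freely.

```python
import os
import tempfile


def deploy(root, build_name):
    """Repoint the hypothetical 'current' symlink at builds/<build_name>.

    There is deliberately no validation here, mirroring the mistake in the incident.
    """
    link = os.path.join(root, "current")
    if os.path.islink(link):
        os.remove(link)
    os.symlink(os.path.join(root, "builds", build_name), link)


def reconfigure_script_available(root):
    """The (hypothetical) reconfiguration script is only reachable through the symlink."""
    script = os.path.join(root, "current", "scripts", "reconfigure.sh")
    # os.path.exists follows symlinks, so a dangling link reports False.
    return os.path.exists(script)


def demo_broken_deploy():
    with tempfile.TemporaryDirectory() as root:
        scripts = os.path.join(root, "builds", "2.3.0", "7", "scripts")
        os.makedirs(scripts)
        with open(os.path.join(scripts, "reconfigure.sh"), "w") as f:
            f.write("#!/bin/sh\n")

        deploy(root, os.path.join("2.3.0", "7"))       # correct build name
        before = reconfigure_script_available(root)

        deploy(root, os.path.join("2.3.0", "typo"))    # mistyped build name
        after = reconfigure_script_available(root)
        return before, after
```

Calling `demo_broken_deploy()` shows the script being reachable after the good deploy and unreachable after the bad one, which is why a corrected configuration could not be pushed through the normal tools.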

At 11:00am NZT the administrator realised that the realm would require substantial maintenance to fix, so he shut the realm down fully so that he could work on it.

During the time between 11:00am NZT and 11:15am NZT the administrator worked out a process to redeploy fixed configuration scripts to all the servers in the Path of Exile realm.

At 11:15am NZT he began the process of restoring the Path of Exile realm. At 11:20am NZT the realm was restored and working normally.

The initial cause of this incident was a mistake in a configuration file. While mistakes happen, we didn't have adequate checks in place to prevent this level of catastrophic failure from happening.

We have the ability to roll back configuration changes, but in this case the configuration change broke our ability to do a rollback. Clearly this shouldn't be possible.

We will immediately be adding sanity checks to make sure that the build that is specified is a valid Path of Exile build directory. That will make sure that at the very least our automated configuration tools are not disabled by a mistake like this. We are also considering trying to pull more of the rollback system out of the build itself, so that we are more immune to this kind of problem.
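As a rough illustration of the kind of sanity check described above, here is a minimal sketch. It is written under assumptions and is not the real check: the `current` link name, the `REQUIRED_ENTRIES` marker list, and the function names are hypothetical, and what actually makes a build directory "valid" is not specified in this post. The sketch refuses to touch the symlink unless the target looks like a build, and it swaps the link by renaming a temporary link over the old one, which is atomic on POSIX systems.

```python
import os
import tempfile

# Hypothetical markers of a valid build directory; the real criteria are not stated.
REQUIRED_ENTRIES = ("scripts",)


class InvalidBuildError(Exception):
    """Raised when the requested build directory fails the sanity check."""


def is_valid_build(build_dir):
    """Return True if build_dir exists and contains every required subdirectory."""
    return os.path.isdir(build_dir) and all(
        os.path.isdir(os.path.join(build_dir, entry)) for entry in REQUIRED_ENTRIES
    )


def safe_deploy(root, build_name):
    """Repoint the 'current' symlink only after validating the target build."""
    target = os.path.join(root, "builds", build_name)
    if not is_valid_build(target):
        raise InvalidBuildError("not a valid build directory: " + target)

    link = os.path.join(root, "current")
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # Renaming over the old link replaces it in a single step on POSIX.
    os.replace(tmp_link, link)


def demo_safe_deploy():
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "builds", "2.3.0", "7", "scripts"))
        safe_deploy(root, os.path.join("2.3.0", "7"))
        try:
            safe_deploy(root, os.path.join("2.3.0", "typo"))
            rejected = False
        except InvalidBuildError:
            rejected = True
        # The previous deploy must still be intact after the rejected one.
        still_ok = os.path.isdir(os.path.join(root, "current", "scripts"))
        return rejected, still_ok
```

With a check like this, a mistyped build name is rejected before the symlink changes, so the scripts inside the previously deployed build stay reachable and a rollback remains possible.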

While downtime does happen, extended realm downtime caused by mistakes on our part is unacceptable. I would like to apologise for this incident and assure you that we will do our best to prevent another incident like it in the future.
Path of Exile II - Game Director
Last edited by Bex_GGG on Jun 8, 2016, 10:36:53 PM
Last bumped on Jun 18, 2016, 6:35:32 AM
Thanks for being transparent about the incident that occurred almost two hours ago, Jonathan.
Sometimes you can take the game out of the garage but you can't take the garage out of the game.
- raics, 06.08.2016

Last edited by JohnNamikaze on Jun 8, 2016, 8:35:25 PM
I want a refund
Saved
This explains the site going down mid forum post

Edit: Holy shit you guys practically blew up your servers
Sure, the lab can be hard, but it's pretty easy if you're properly geared, and not terrible at the game.
Last edited by Rowanbladex on Jun 8, 2016, 8:39:12 PM
Thanks for the update. For the most part we understand that mistakes happen. I do applaud the team there for getting on top of issues such as this in such a timely manner, and for informing us of the reasons that caused this.
Beta Member Since 0.9.0 | Current Character : ExExCorpse
Creator of Prismatic Rings AND Unique Thief's Torment Prismatic Ring
----------------------------------------------------------------------------------
The Guide to Loot Filters - Here
Great job on the details.
On another note, will the sanity checks extend to the players or remain focused on the programmers?;)
I, for one, already know that I'm not sane, but I'm sure that some of the others might be caught off guard.
Elder Shaper of Play-Doh
Last edited by jbuehring on Jun 8, 2016, 8:38:35 PM
When the game crashed I had opened an artisan strongbox that dropped 55 mirrors of kalandra. I didn't get a chance to pick them up because the game crashed, I demand you give me my mirrors... lol
Yep, transparency rocks guys... and nothing to worry about :)
None of us are perfect and we all make mistakes!
Mistakes rarely happen on this scale over at GGG. When they do, they're upfront about it, AKA forgiven!
Exiled for Eternity
At around that time the screen froze for me. TCP disconnected in about 0.5 seconds, but I logged back in to find my char in Standard. I don't think anything on screen was able to kill me in the 0.5s time frame, so the server probably didn't get the message :/
Well RIP!

Thx for working so hard on the issues, but I'm probably going to stick with Standard for a while, which is a shame since I enjoy the hardcore environment more.
IGN:
Omegatherion
