Incident Report for 2016-06-09 Downtime

At 10:53am NZT a server administrator pushed a configuration change that was meant to add some new servers to the Path of Exile realm. The change failed to apply, and it was quickly noticed that new game instances were failing to start.

By 10:55am NZT the administrator had realised his mistake. The name of the deployed build was incorrectly specified. He quickly created a fixed configuration file and attempted to push this configuration change.

Unfortunately this is when we discovered the true damage that the initial configuration push had caused.

Each Path of Exile build is stored in its own directory on the server. On some of our older servers we actually have builds going all the way back to version 0.10.0. The current build, for example, is stored in builds/2.3.0/7/.

Everything a build needs (executables, data, configuration, scripts, etc.) is stored in its build directory. In a deploy, the working directory that the servers run from is symlinked to the directory of the correct build.

Because the build name had been specified incorrectly, the reconfiguration had changed the symlinks on every Path of Exile server to point to a non-existent directory.

Because the reconfiguration scripts are also part of the build and accessed through this symlink, when we tried to push a fixed configuration, the reconfiguration scripts were not present and the reconfiguration failed. Without our tools for deploying configuration, we would need to fix the hundreds of servers that make up the Path of Exile realm manually.
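The failure mode described above can be reproduced in miniature. The following is only a sketch under assumed names: the `current` symlink, the `scripts/reconfigure.sh` path and the mistyped build name `2.3.0/77` are hypothetical stand-ins, not our actual layout or tooling. It shows how repointing a symlink without validation leaves it dangling, and how anything reached through that symlink disappears with it.

```python
import os
import tempfile


def point_current_at(root, build_name):
    """Repoint root/current at root/builds/<build_name> with no validation."""
    link = os.path.join(root, "current")  # hypothetical working-directory symlink
    if os.path.islink(link):
        os.remove(link)
    # os.symlink does not check that the target exists.
    os.symlink(os.path.join(root, "builds", build_name), link)
    return link


def reconfigure_script_available(root):
    """True if the (hypothetical) reconfiguration script is reachable via the symlink."""
    return os.path.exists(os.path.join(root, "current", "scripts", "reconfigure.sh"))


def demo():
    with tempfile.TemporaryDirectory() as root:
        # Create one real build that contains its own reconfiguration script.
        scripts = os.path.join(root, "builds", "2.3.0", "7", "scripts")
        os.makedirs(scripts)
        with open(os.path.join(scripts, "reconfigure.sh"), "w") as f:
            f.write("#!/bin/sh\n")

        link = point_current_at(root, os.path.join("2.3.0", "7"))
        before = reconfigure_script_available(root)

        # Mistyped build name: the symlink now points at nothing.
        point_current_at(root, os.path.join("2.3.0", "77"))
        after = reconfigure_script_available(root)
        dangling = os.path.islink(link) and not os.path.exists(link)
        return before, after, dangling
```

Calling `demo()` reports whether the script was reachable before the bad push, whether it is reachable afterwards, and whether the symlink is left dangling. Once the link dangles, the tool needed to repair it can no longer be found through it.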

At 11:00am NZT the administrator realised that the realm would require substantial maintenance to fix, and shut it down fully so that he could work on it.

Between 11:00am and 11:15am NZT the administrator worked out a process to redeploy fixed configuration scripts to all the servers in the Path of Exile realm.

At 11:15am NZT he began the process of restoring the Path of Exile realm. At 11:20am NZT the realm was restored and working normally.

The initial cause of this incident was a mistake in a configuration file. While mistakes happen, we didn't have adequate checks in place to prevent this level of catastrophic failure from happening.

We have the ability to roll back configuration changes, but in this case the configuration change broke our ability to do a rollback. Clearly this shouldn't be possible.

We will immediately be adding sanity checks to make sure that the specified build is a valid Path of Exile build directory. That will ensure that, at the very least, our automated configuration tools are not disabled by a mistake like this. We are also considering pulling more of the rollback system out of the build itself, so that we are more resilient to this kind of problem.
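As a rough illustration of the kind of sanity check described above, the sketch below validates a build directory before the symlink is touched and then swaps the link in a single rename. It is not our deployment tooling: the function names, the `REQUIRED_ENTRIES` list and the `.new` temporary link name are assumptions made for the example, and the atomic swap relies on POSIX rename semantics.

```python
import os
import tempfile

# Hypothetical list of entries a valid build directory must contain.
REQUIRED_ENTRIES = ("scripts",)


def validate_build(builds_root, build_name):
    """Return the resolved build directory, or raise ValueError if it is not usable."""
    root = os.path.realpath(builds_root)
    build_dir = os.path.realpath(os.path.join(builds_root, build_name))
    # Reject empty names and names that escape the builds directory.
    if build_dir == root or os.path.commonpath([root, build_dir]) != root:
        raise ValueError(f"build name {build_name!r} is outside {builds_root}")
    if not os.path.isdir(build_dir):
        raise ValueError(f"build directory does not exist: {build_dir}")
    missing = [e for e in REQUIRED_ENTRIES
               if not os.path.exists(os.path.join(build_dir, e))]
    if missing:
        raise ValueError(f"build {build_name!r} is missing: {missing}")
    return build_dir


def switch_build(builds_root, link_path, build_name):
    """Repoint link_path at the named build, only after validation succeeds."""
    build_dir = validate_build(builds_root, build_name)
    tmp_link = link_path + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(build_dir, tmp_link)
    # Renaming over the old link replaces it in one step on POSIX systems,
    # so the working directory never points at nothing.
    os.replace(tmp_link, link_path)
    return build_dir
```

With a check like this, a mistyped build name raises an error and the existing symlink is left untouched, so the scripts reachable through it remain available for a corrected push or a rollback.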

While downtime does happen, extended realm downtime caused by mistakes on our part is unacceptable. I would like to apologise for this incident and assure you that we will do our best to prevent another incident like it in the future.
Path of Exile - Lead Programmer
Last edited by Bex_GGG on Jun 8, 2016, 10:36:53 PM
Last bumped on Jun 18, 2016, 6:35:32 AM
