3.7.4 Deployment Downtime Incident Report
Today during the deployment of the 3.7.4 patch we encountered problems that led to extended downtime for Path of Exile, as well as the loss of any progress made between our initial deployment of 3.7.4 and the deployment of the fixed version approximately two hours later. I would like to take some time to explain what happened and how we will prevent it in the future.
Timeline
At 12:00 NZT today we took the realm down to deploy 3.7.4. The downtime took approximately 18 minutes and the realm was back up at 12:18 NZT. At 12:25 NZT we were informed by Kakao Games, our Korean publisher, that Korean users who were logging in were being presented with the Create New Account dialog instead of being logged into their existing accounts.
Without knowing what the problem was, my first call was to order the shutdown of the Korean login gateway, to limit any further damage and give us time to investigate.
We fairly quickly determined that the patch contained a database migration to accounts that nobody from the production or server admin teams knew was in the patch. This migration destroyed all Kakao credentials in the database.
The migration system we use has the ability to fairly quickly swap back a specific data table in the case that a database migration causes damage, so we decided to shut the realm down, roll back the account data and redeploy 3.7.3d, preserving player progress made after the deploy, until we could create a fixed version of 3.7.4. We shut the realm down at 12:35 NZT.
By 12:48 NZT we had deployed 3.7.3d with only the account database rolled back.
Soon after that we started getting reports that people had lost fossils. Because in 3.7.3d fossils did not have a stack size, the stack size was lost when a user loaded any stash tab with stacked fossils in it. Unfortunately when I made the call to roll back the accounts database, I failed to consider the situation with fossils.
In order to prevent further damage, the realm was shut down again at 13:00 NZT.
Unfortunately, this type of corruption wouldn't have been possible to fix with post-hoc analysis of logs so we decided to do a full database point-in-time rollback which would lose 29 minutes of player gameplay. I made the call that the economic damage of losing 29 minutes of game time would probably be lower than the economic impact of a lot of players losing a large number of valuable fossils.
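The 29-minute figure can be checked against the timestamps above. The following is a minimal sketch, assuming the lost gameplay corresponds to the two windows in which the realm was open to players before the point-in-time rollback (12:18 to 12:35 and 12:48 to 13:00). The helper name is illustrative only.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Windows (NZT) during which the realm was up and players could make
# progress that the point-in-time rollback later discarded.
open_windows = [("12:18", "12:35"), ("12:48", "13:00")]

lost_minutes = sum(minutes_between(start, end) for start, end in open_windows)
print(lost_minutes)
```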
By 13:05 NZT we started the process of rolling back the database with point-in-time recovery. Rolling the database backwards in this manner is a somewhat slow process, but by 13:45 NZT it was complete.
At this point we realised that the fixed version of 3.7.4 was very close to being ready, so we decided to wait for it to be deployed to the servers rather than deploy 3.7.3d again only to restart the realm again a few minutes later.
At 14:04 NZT we restarted the realm on version 3.7.4 and ran the fixed migration. It completed at 14:15 NZT, at which point we allowed all players back in.
In total this incident led to 123 minutes of effective downtime including the 29 minutes of gameplay that was lost.
Response
There were several mistakes made that led to this incident.
The first one is that a database migration was added to a patch without anyone being informed. Normally, when a database migration is included in a patch, there is special attention applied by QA and the server administrators to make sure that no data is lost and that it doesn't cause any issues. The producer who merged the change was unaware that there was a migration in the change they merged. In order to make sure that these changes get attention, we will add a system to prevent migrations being added without specific acknowledgement by the producer.
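One way such a gate could work is sketched below. This is an illustration under stated assumptions, not a description of the actual system: it assumes migrations live under a `migrations` directory and that the producer's acknowledgements are available as a list of file paths. Both the directory name and the function names are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical location of migration files in the repository.
MIGRATIONS_DIR = "migrations"


def unacknowledged_migrations(changed_files, acknowledged):
    """Return changed migration files that nobody has explicitly acknowledged."""
    acked = set(acknowledged)
    return sorted(
        path
        for path in changed_files
        if MIGRATIONS_DIR in PurePosixPath(path).parts and path not in acked
    )


def can_merge(changed_files, acknowledged):
    """A merge is allowed only when every changed migration is acknowledged."""
    return not unacknowledged_migrations(changed_files, acknowledged)
```

With this shape, a change that touches `migrations/0042_accounts.sql` (a made-up file name) would be blocked until that exact path appears in the acknowledgement list, while changes that touch no migrations pass through untouched.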
The second problem is that our QA processes didn't catch the issue. Even though this area wasn't flagged to QA, the problem was basic enough that we should have discovered it. The QA Quick-Check list did include that Kakao login is functional, but it didn't include a check to make sure that an old account still exists! We will be amending the checklist.
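The missing check amounts to comparing account credentials before and after a migration. Here is a minimal sketch, assuming each snapshot is a mapping from account name to its stored credential, with `None` meaning no credential. The data shape and the function name are assumptions made for illustration.

```python
def accounts_missing_credentials(before, after):
    """Return accounts that had a credential before the migration but not after."""
    return sorted(
        name
        for name, credential in before.items()
        if credential and not after.get(name)
    )
```

A non-empty result on a test realm would indicate that existing accounts can no longer log in, even if creating and logging into a brand-new account still works.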
The third mistake was mine. After I learned about the faulty account migration, I didn't take enough time to consider whether any of the other changes in 3.7.4 could cause issues before making the call to roll back to 3.7.3d. In the future, we will make sure we understand all of the repercussions of a rollback before making this call. Had we done so this time, we would still have had some downtime, but we would have prevented any loss of player gameplay.
This incident is not up to the standard of service that we want to provide to you, and I want to personally apologise for how it was handled. You should expect better from Grinding Gear Games and we will be doing everything we can to avoid this type of thing in the future.