3.7.4 Deployment Downtime Incident Report


Today, during the deployment of the 3.7.4 patch, we encountered problems that led to extended downtime for Path of Exile, as well as the loss of any progress made between our initial deployment of 3.7.4 and the deployment of the fixed version approximately two hours later. I would like to take some time to explain what happened and how we will prevent it in the future.

Timeline

At 12:00 NZT today we took the realm down to deploy 3.7.4. The downtime took approximately 18 minutes and the realm was back up at 12:18 NZT. At 12:25 NZT we were informed by Kakao Games, our Korean publisher, that Korean users who were logging in were being presented with the Create New Account dialog instead of being logged into their existing accounts.

Without knowing what the problem was, my first call was to order the shutdown of the Korean login gateway, both to limit any further damage and to give us time to investigate.

We fairly quickly determined that the patch contained a database migration to accounts that nobody from the production or server admin teams knew was in the patch. This migration destroyed all Kakao credentials in the database.

The migration system we use has the ability to fairly quickly swap back a specific data table in the case that a database migration causes damage, so we decided to shut the realm down, roll back the account data and redeploy 3.7.3d, preserving player progress made after the deploy, until we could create a fixed version of 3.7.4. We shut the realm down at 12:35 NZT.

By 12:48 NZT we had deployed 3.7.3d with only the accounts database rolled back.

Soon after that we started getting reports that people had lost fossils. Because in 3.7.3d fossils did not have a stack size, the stack size was lost when a user loaded any stash tab with stacked fossils in it. Unfortunately when I made the call to roll back the accounts database, I failed to consider the situation with fossils.
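The failure mode described above can be sketched as a load/save round trip through older code. This is only an illustration: the record layout, the field name `stack_size`, and the functions below are hypothetical and are not taken from the actual game server.

```python
from dataclasses import dataclass


@dataclass
class OldFossil:
    # Hypothetical 3.7.3d-style item: it has no notion of a stack size.
    name: str


def load_with_old_version(record: dict) -> OldFossil:
    # The older code only reads the fields it knows about,
    # so any stack size stored by the newer version is ignored.
    return OldFossil(name=record["name"])


def save_with_old_version(item: OldFossil) -> dict:
    # Writing the item back persists only the old fields.
    return {"name": item.name}


# A record as the newer version might have stored it (illustrative values).
stored = {"name": "Pristine Fossil", "stack_size": 17}

# Loading the stash tab and saving it again silently drops the stack size.
round_tripped = save_with_old_version(load_with_old_version(stored))
```

Under these assumptions the stack size is gone from the stored record after a single load, which is why this kind of loss is hard to reconstruct afterwards.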

In order to prevent further damage, the realm was shut down again at 13:00 NZT.

Unfortunately, this type of corruption wouldn't have been possible to fix with post-hoc analysis of logs so we decided to do a full database point-in-time rollback which would lose 29 minutes of player gameplay. I made the call that the economic damage of losing 29 minutes of game time would probably be lower than the economic impact of a lot of players losing a large number of valuable fossils.
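One way to read the 29-minute figure, assuming it counts the two windows in the timeline during which the realm was open to players after the first deploy (12:18 to 12:35 and 12:48 to 13:00), is the following small calculation. The helper name is made up for this sketch.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    # Both arguments are same-day "HH:MM" times in NZT.
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Windows in which players could log in and make progress (from the timeline).
open_windows = [("12:18", "12:35"), ("12:48", "13:00")]

lost_minutes = sum(minutes_between(start, end) for start, end in open_windows)
print(lost_minutes)  # 17 + 12 = 29
```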

By 13:05 NZT we started the process of rolling back the database with point-in-time recovery. Rolling the database backwards in this manner is a somewhat slow process, but by 13:45 NZT it was complete.

At this point we realised that the fixed version of 3.7.4 was very close to being ready, so we decided to wait for it to be deployed to the servers rather than deploy 3.7.3d again only to restart the realm again a few minutes later.

At 14:04 NZT we restarted the realm on version 3.7.4 and the (fixed) migration ran. It completed at 14:15 NZT, at which point we allowed all players back in.

In total this incident led to 123 minutes of effective downtime including the 29 minutes of gameplay that was lost.

Response

There were several mistakes made that led to this incident.

The first one is that a database migration was added to a patch without anyone being informed. Normally, when a database migration is included in a patch, there is special attention applied by QA and the server administrators to make sure that no data is lost and that it doesn't cause any issues. The producer who merged the change was unaware that there was a migration in the change they merged. In order to make sure that these changes get attention, we will add a system to prevent migrations being added without specific acknowledgement by the producer.
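A gate of this kind could look roughly like the sketch below. It is not our actual tooling: the `migrations/` path prefix, the function names, and the idea of passing an explicit acknowledgement list are all assumptions made for illustration.

```python
from typing import Iterable, List

# Hypothetical location of migration files in the repository.
MIGRATION_PREFIX = "migrations/"


def unacknowledged_migrations(changed_files: Iterable[str],
                              acknowledged: Iterable[str]) -> List[str]:
    # Return every changed migration file the producer has not signed off on.
    acked = set(acknowledged)
    return sorted(
        path for path in changed_files
        if path.startswith(MIGRATION_PREFIX) and path not in acked
    )


def can_merge(changed_files: Iterable[str], acknowledged: Iterable[str]) -> bool:
    # The merge is allowed only when no migration slips through unacknowledged.
    return not unacknowledged_migrations(changed_files, acknowledged)
```

For example, `can_merge(["migrations/accounts.sql"], [])` is `False` in this sketch until the producer explicitly lists that file as acknowledged.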

The second problem is that our QA processes didn't catch the issue. Even though this area wasn't flagged to QA, the problem was basic enough that we should have discovered it. The QA Quick-Check list did include that Kakao login is functional, but it didn't include a check to make sure that an old account still exists! We will be amending the checklist.
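The missing check can be sketched as follows: take an account that existed before the migration and confirm it still resolves to the same credential afterwards. The table shape (account name mapped to a credential string) and the two example migrations are hypothetical stand-ins, not the real schema.

```python
from typing import Dict, Optional

# Hypothetical account table: account name -> linked Kakao credential.
AccountTable = Dict[str, Optional[str]]


def existing_account_survives(before: AccountTable,
                              after: AccountTable,
                              account: str) -> bool:
    # True only if a pre-existing account keeps its credential after migrating.
    old_credential = before.get(account)
    if old_credential is None:
        raise ValueError(f"{account!r} is not a usable pre-existing test account")
    return after.get(account) == old_credential


def broken_migration(table: AccountTable) -> AccountTable:
    # Stand-in for a migration that wipes every credential.
    return {name: None for name in table}


def safe_migration(table: AccountTable) -> AccountTable:
    # Stand-in for a migration that leaves credentials untouched.
    return dict(table)
```

A login test against a freshly created account would pass under both migrations here, while this check fails under `broken_migration`, which is the gap in the checklist.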

The third mistake was mine. After I learned about the faulty account migration I didn't take enough time to consider the changes in 3.7.4 to work out if any of them could have caused potential issues before making the call to roll back to 3.7.3d. In the future, we will make sure to understand all of the repercussions of a rollback before making this call. We would still have had some downtime, but we would have prevented any player gameplay loss.

This incident is not up to the standard of service that we want to provide to you, and I want to personally apologise for how it was handled. You should expect better from Grinding Gear Games and we will be doing everything we can to avoid this type of thing in the future.

Thanks,
Jonathan
Posted by Grinding Gear Games
I've waited so long for this, omg
Thanks for the write up. Always interesting to read about what happens behind the scenes.
Crit happens.
I have only been playing Path of Exile since the start of Legion but want to say that the quality of service provided by the team working on this game is incredible. Today's issue was relatively minor in the scheme of things but it's amazing to see acknowledgement of mistakes and great communication with the community. Major props to you and everyone there, Jonathan.
No biggie
SH1FT1 wrote:
"I have only been playing Path of Exile since the start of Legion"
Mad respects for starting in hc, keep it up cowboy
May the Dominus be with you
Last edited by Nubatack on Jul 25, 2019, 12:22:47 AM
Thank you for being so forthcoming about what happened. Many would just coldly apologize and nothing else.

And, even if tiny, insight on how things work is always interesting.
Last edited by Zphyr on Jul 25, 2019, 12:27:09 AM
Another good example of GGG standing up and taking responsibility for things that happen.

Even though nobody likes when stuff like this happens, we (at least I) very much appreciate how open you are about the issues, how they are handled, and what will be done to prevent future similar incidents.

It's right that you are hard on yourselves. It is the only way to meet high standards. But go have a beer or cup of joe. You deserve it.
Great write-up. I like hearing what happens in the background. To me, it makes what little play time we did miss minuscule compared with what could actually have happened if you had not been quick. Keep up the good work and the communication; that is one of the reasons I will be supporting this game in the future.
You all are amazing. Mistakes happen. Thank you for being communicative and honest, and thank you for caring. <3
Next Incident Report blog post from the Devs:

How Legion League turned out to be Cyclone Headhunter League. Lessons learned.
