Incident Report for 26th March Item Corruption

Today, starting at 17:35:15 NZT and lasting around 18 minutes, a lot of strange items were generated in Path of Exile. Some of these items are related to unreleased or experimental content. Others were generated in odd combinations that should never be able to exist. Due to the nature of the economic damage, we are deleting all items that were created during these 18 minutes. In a game where we want players to value items, this kind of economic damage is not acceptable, and I would like to apologise that we let this occur.

The process that led to this situation is rather complex, but I will attempt to explain what happened.

At 16:53:35 NZT we deployed a website-only change that was intended to show the new race season rewards for season seven. The correct procedure would have been to merge the intended website changes into a branch of the existing 1.1.1b build and then deploy that to our web servers. Instead, the build that was actually deployed to the website was the in-development 1.1.1c build. This by itself was not enough to cause any problems, but it did mean that the 1.1.1c build was available (though inactive) on the live production servers.

Chris and Carl, who were responsible for producing this change, considered this a reasonable step because they were unaware of the kinds of problems this extra version could cause.

At 17:35:15 NZT a trivial configuration change was applied to the production realm. This was when the problem started occurring.

One of the known limitations of our deployment system at the moment is that even though we have the ability to deploy a new version to a limited subset of services / servers, the current "live" version is a global setting across all services / servers. This meant that while most of the realm was still running 1.1.1b, the version that our deployment system believed was live was 1.1.1c.

Part of the reconfiguration process is what we call a redeploy. This is basically where we link up the data and configuration of the version that we want to deploy to the location the servers are run in. This is performed during reconfiguration to hook up the new config file and would normally deploy the same version that was already there.

In this case, the deployment system believed that 1.1.1c was the current version, so the redeploy changed the links to this new version. This effectively deployed 1.1.1c early and unintentionally on the game instance servers.
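The interaction between the global "live" setting and the redeploy step can be sketched roughly as follows. This is a simplified illustration only: the `DeploySystem` class, its method names, and the service names are hypothetical and are not our actual deployment tooling.

```python
from dataclasses import dataclass, field


@dataclass
class DeploySystem:
    """Hypothetical model: one global 'live' version shared by every service."""
    live_version: str
    # Which version each service's run location is currently linked to.
    linked: dict = field(default_factory=dict)

    def deploy(self, version, services):
        # Deploying to a subset of services still moves the single global setting.
        self.live_version = version
        for service in services:
            self.linked[service] = version

    def reconfigure(self, services):
        # A redeploy relinks each service to whatever is globally "live".
        for service in services:
            self.linked[service] = self.live_version


system = DeploySystem(live_version="1.1.1b",
                      linked={"web": "1.1.1b", "instance": "1.1.1b"})
system.deploy("1.1.1c", ["web"])   # website-only deploy
system.reconfigure(["instance"])   # later configuration change on game servers
print(system.linked["instance"])   # the instance servers now point at 1.1.1c
```

In this model, a per-service live version would have kept the instance servers on 1.1.1b during the reconfiguration.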

While we knew that this could happen, Thomas (our server admin) thought that the earlier update only contained website changes and was based on 1.1.1b. Were this the case, this process would not have resulted in any problems because the data used by the game servers would not have been any different between the two versions.

Unfortunately, this process resulted in a very bizarre new failure mode that we had not previously considered. When we create new instance servers for players to play on, they are forked off from a so-called "Prespawner" instance that has preloaded most of the game data into memory. This is so that instances can share memory for static assets that are used for the game. Unfortunately, there were some gaps in the data that the prespawner instance was loading, which meant that those assets were loaded on demand if a specific game server needed them.

In this situation, it meant that some of the data tables (the ones that were preloaded) were from version 1.1.1b and the rest were from 1.1.1c. The reason this resulted in the issues that we saw today is that rows in one table reference data in other tables by index. When rows are added to or removed from the middle of a table, the indices of the rows after that point change. This is what made the references go out of sync.

As a result, references to rows were pointing to essentially arbitrary places. Uniques were pointing to the wrong mods. Base types were pointing to the wrong art. Quest rewards were pointing to the wrong base types. Many other tables were similarly broken.
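A minimal sketch of how index-based references break when tables come from different versions is shown below. The table contents, the `resolve` helper, and the choice of which table was preloaded are all invented for illustration; they are not the real game data.

```python
# Hypothetical mods table as it exists in the older version.
mods_b = ["added_fire_damage", "increased_life", "increased_armour"]

# The newer version inserts a row in the middle, shifting every later index.
mods_c = ["added_fire_damage", "experimental_mod",
          "increased_life", "increased_armour"]

# Hypothetical uniques table from the older version.
# Each unique refers to its mods by row index into the mods table.
uniques_b = {"Example Unique": [1, 2]}


def resolve(unique, uniques, mods):
    """Look up a unique's mods by following its row indices."""
    return [mods[i] for i in uniques[unique]]


# Both tables from the same version: references resolve as intended.
print(resolve("Example Unique", uniques_b, mods_b))
# Uniques preloaded from the old version, mods loaded on demand from the new one.
print(resolve("Example Unique", uniques_b, mods_c))
```

The second lookup returns a different set of mods, including a row that only exists in the newer table, even though the uniques table itself has not changed.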

When we got reports of the problem and started to see strange items linked, we knew that something must have gone wrong with the deploy, so we decided to immediately shut down the realm to prevent further damage. This process started at 17:52:20 NZT.

After shutting the realm down, we had a hard decision to make. Do we do a database rollback? There were only 18 minutes of damage, but at the time we had seen only a few strange items and did not understand the extent of the problem.

Rolling back the database would have taken a fair amount of time, during which we would have had downtime. There are also a lot of other issues that take a lot of time to sort out when doing a rollback. For example, new accounts that had been created by users would have been lost. Any purchases of microtransaction points would have been undone, but we would still have the money. If we had done a rollback, we would have needed to manually sort these issues out after the fact.

Therefore, at that time I decided that a rollback would cause more harm than good, and we decided to redeploy back to the known good version 1.1.1b and restart the realm. In retrospect, a rollback would have been preferable as the current situation may still result in a few odd items. (This could come about if someone Orb of Chanced a unique item).

At 17:57:16 NZT the realm was back online and no new broken items could be created.

At this point we had several options for how to fix the problem of all the broken items that were created. One option was to try to remove only the items that showed signs of corruption. The problem with this approach is that it would be very difficult and time consuming to write a verifier that was reliable. We could never really be sure we got everything. We also wanted to destroy all the items before players had time to become attached to them, so speed of development was important. A solution that fully determined whether an item was corrupted may have taken a day of development.

Due to this, we decided to destroy all items that were created during the 18 minutes that the problem was active for. This is effectively the same as if we had done a rollback at the time, but you will still get to keep other forms of progress that are not items gained during that time. You will also get to keep currency items created during that time as they are not affected by the problems in any important way.

This problem was ultimately caused by multiple mistakes and miscommunications. The build for the website update was prepared incorrectly. A configuration change was pushed when the realm was not in the correct state for it. There was a serious flaw in the preload system on the game servers that allowed data from multiple versions to be loaded.

There are a few steps that we will be taking to avoid this kind of problem happening again.

The first is that we will be fixing the deployment system to allow for multiple versions properly. This is already something that we were actively developing, but it would have prevented this problem from occurring. Using the deployment system as it was without proper support for independent versions was dangerous and we paid the price. It will not happen again.

The second is that I will be making sure that the entire data set is loaded at once on the instance prespawner. This also would have prevented the problem by not allowing the data set to get out of sync with itself and should mean that this kind of catastrophic failure mode can't happen again.
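As a rough sketch of the idea behind this second step (not the actual server code), a data set can load every table from a single version up front and refuse any later on-demand load. The `DataSet` class, `fake_load_table`, and the table names below are hypothetical.

```python
class DataSet:
    """Hypothetical in-memory data set, loaded in full before any fork."""

    def __init__(self, version, table_names, load_table):
        self.version = version
        # Load every table up front, all from the same version.
        self._tables = {name: load_table(version, name) for name in table_names}

    def table(self, name):
        # No on-demand fallback: a missing table is an error, not a lazy load.
        if name not in self._tables:
            raise KeyError(f"table {name!r} was not preloaded for {self.version}")
        return self._tables[name]


def fake_load_table(version, name):
    # Stand-in for reading a table from disk; returns a label for illustration.
    return f"{name}@{version}"


data = DataSet("1.1.1b", ["mods", "uniques"], fake_load_table)
print(data.table("mods"))
```

Because nothing is read after construction, relinking the files on disk to another version cannot mix a second version into an already running process.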

I'm very sorry for this issue, and for any items that you may have lost during those 18 minutes of play time. This isn't the level of service that you deserve from us and I hope you will find that we do better in the future.
Path of Exile - Lead Programmer
Last edited by Jonathan on Mar 26, 2014, 5:05:20 AM
Last bumped on Feb 24, 2017, 6:56:32 AM
"
Supervixen wrote:
What of items that were created before the last 18 minutes? A weapon I got hours ago is gone. Hours != 18 minutes?


To clarify, the items deleted are from 17:35:15 NZT till 17:57:16 NZT. This was over 4 hours ago.
Path of Exile - Lead Programmer
"
Trakios wrote:
I have no problem with losing my helm, but the loss of the cosmetic effect was painful.


Please contact support. They will get you your cosmetic effect back.
Path of Exile - Lead Programmer
"
Redblade wrote:
So not only did you shut down the server, which I totally understand, wasting my map. But then you proceed to delete a 4 perfect stat chest that dropped for me in the map you wasted without at the very least refunding open maps at the time of emergency shutdown.

I really hope you add a map refund system on emergency shutdown to your to do list. Losing high level maps during a stint of bad map RNG because of a reboot is beyond irritating.


I'll discuss it with the team. It may be possible to add this.
Lead Developer