Incident Report for 03/11/2015 17:35NZT Outage

I'm sorry for the brief incident report; I wasn't directly involved in fixing this issue.

This morning, we started the five-week Darkshrine events. Player turnout was higher than expected, so we ordered some more servers.

At 5:35pm, a server administrator made an error when configuring the servers. This resulted in an invalid version of the game being deployed to other live servers. Players became unable to create new instances. Our server admins immediately began investigating and had worked out the problem within minutes. Unfortunately, the conflicting game version caused the servers to repeatedly crash, making them essentially unresponsive.

At 6:10pm, we took the entire realm down in order to reconfigure and restart it. Because the servers were unresponsive, it took 45 minutes for us to complete this restart. Several other unforeseen problems (like low drive space on one server and a badly timed daily backup starting) complicated this process. (I got to see my first >2000 load average...)

There were a total of 35 minutes of soft downtime and 45 minutes of hard downtime. This is an unacceptably long period of downtime for a relatively simple mistake. I am very sorry that this occurred.

The additional downtime was due to unforeseen problems, but we will work to make sure that they don't occur again.

Tomorrow, when more senior developers are in the office, we'll review the process and make sure that improvements are made.

The servers are now back up. Once again, sorry for the mistake and lengthy time required to fix it. We hope you continue to enjoy the Darkshrine events.

(Thankfully, the invalid state described above could not cause item corruption.)
Lead Developer
