At 15:55NZT today we experienced a sustained outage of all Path of Exile services for 1 hour and 4 minutes, until 16:59NZT. While downtime can happen from time to time, extended downtime of this nature is not acceptable.
By 15:56NZT our server admin Thomas was alerted to the problem and began attempting to diagnose it. At that time all Path of Exile servers, including hot spares in our server host's Dallas data centre, were unreachable. This data centre is where Path of Exile's core infrastructure is hosted. We notified our server host about the problem.
At 16:19NZT we were notified that there had been a power event in one of the server rooms of that data centre. All servers and network infrastructure in that server room had lost power.
At 16:20NZT power was restored and equipment began to power up again.
At 16:51NZT all of the network infrastructure was back up and our servers were available again. Unfortunately, several of our servers did not come back up, and one of those was a primary database server. At this time we decided to initiate failover of this server to the hot spare.
By 16:57NZT we had prepared the necessary config changes to move to the hot spare. At this time we began starting the realm again.
At 16:59NZT the realm was back up and functioning normally again.
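The timeline above can be broken down into phases to see where the time went. The following is a minimal sketch in Python: the timestamps come from the timeline, while the phase labels and function names are illustrative shorthand and not part of any real tooling.

```python
from datetime import datetime

# Timestamps (NZT) are taken from the timeline above.
# The labels are illustrative shorthand, not official incident terminology.
EVENTS = [
    ("15:55", "outage begins"),
    ("15:56", "server admin alerted"),
    ("16:19", "host reports power event"),
    ("16:20", "power restored"),
    ("16:51", "network back, failover decided"),
    ("16:57", "config changes ready"),
    ("16:59", "realm back up"),
]


def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(events):
    """Pair each event with the next one and measure the gap in minutes."""
    return [
        (a_label, b_label, minutes_between(a_time, b_time))
        for (a_time, a_label), (b_time, b_label) in zip(events, events[1:])
    ]


total_minutes = minutes_between(EVENTS[0][0], EVENTS[-1][0])
failover_minutes = minutes_between("16:51", "16:59")

for a_label, b_label, minutes in phase_durations(EVENTS):
    print(f"{a_label} -> {b_label}: {minutes} min")
print(f"total: {total_minutes} min, failover decision to realm up: {failover_minutes} min")
```

Broken down this way, most of the 64 minutes was spent waiting for the server room to recover, and 8 minutes passed between the decision to fail over and the realm being back up.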
While this downtime was caused by a power incident outside our control, our ability to recover from such events is our responsibility. For an incident like this, we should require at most 5 minutes of downtime while we move to redundant infrastructure.
This incident is our fault because we did not take sufficient steps to isolate our redundant infrastructure. While we had hot spares, we didn't take care to ensure that those hot spares were actually powered by separate power systems.
In addition, once our failover sequence began, it took longer than it should have to actually make the switch.
Over the next few days we will be taking steps to move redundant infrastructure to different locations so that it is completely isolated from failures of this kind. We will also be working on our failover procedures so that they can be performed faster and with fewer steps.
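One way to catch this kind of gap ahead of time is to check an inventory for primaries whose hot spare sits on the same power system. The sketch below is only an illustration of the idea: the `Server` fields, the `power_domain` labels and the sample inventory are all hypothetical and do not describe our actual servers or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    role: str          # "primary" or "spare"
    service: str       # which service the machine backs
    power_domain: str  # hypothetical label for the power system feeding it


def shared_power_pairs(servers):
    """Return (primary, spare) name pairs that back the same service
    and are fed by the same power domain."""
    primaries = [s for s in servers if s.role == "primary"]
    spares = [s for s in servers if s.role == "spare"]
    return [
        (p.name, s.name)
        for p in primaries
        for s in spares
        if p.service == s.service and p.power_domain == s.power_domain
    ]


# Made-up inventory for illustration only.
inventory = [
    Server("db-1", "primary", "database", "room-a"),
    Server("db-2", "spare", "database", "room-a"),
    Server("web-1", "primary", "web", "room-a"),
    Server("web-2", "spare", "web", "room-b"),
]

# Any pair listed here would go down together in a single power event.
print(shared_power_pairs(inventory))
```

In this made-up inventory the database primary and its spare share a power domain, which is the situation we were in today, while the web pair is properly isolated.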
I'm personally very sorry. As the Technical Director of Grinding Gear Games it is my responsibility to ensure that our infrastructure is sufficiently redundant so that when disasters happen, we have the minimum possible disturbance to our users. This is not the level of service that you should expect from our company, and I can assure you that we will be making changes so that an incident like this will not happen again.