We experienced a problem with our database servers today that caused a partial outage of around 30 minutes across the system, with the error message:

Error getting data from store (503)

During this time, the majority of pages in Aspire remained available because our caching system continued to operate throughout the database outage. It is therefore possible that your users did not notice any loss of service; however, some customers experienced a total outage for the duration.

This article explains why the outage occurred and the steps we are putting in place in response.

Our database servers use a multi-site replication strategy to make sure data remains consistent even if a whole data center is permanently lost. Within the primary site, there is a master and slave database – the master receives all of the read and write traffic, and replication occurs to the slave. If the master database is lost, the slave can take over the load until the master is recovered.

In addition, we have a secondary, offsite slave database in a geographically separate data center. Data is also replicated to this database from the primary master/slave setup.
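The post does not name the database engine, so purely as an illustrative sketch – assuming a MySQL-style replication chain and the pymysql client, with hypothetical host names and credentials – a health check across the two slaves in this topology might look something like this:

```python
# Illustrative sketch only -- the post does not name the database engine.
# Assumes a MySQL-style replication chain and the pymysql client; all host
# names and credentials below are hypothetical placeholders.
import pymysql

REPLICAS = {
    "primary-slave": "db-slave.primary.example.internal",
    "offsite-slave": "db-slave.offsite.example.internal",
}

def replication_status(host: str) -> dict:
    """Return the SHOW SLAVE STATUS row for one replica (empty dict if none)."""
    conn = pymysql.connect(host=host, user="monitor", password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            return cur.fetchone() or {}
    finally:
        conn.close()

if __name__ == "__main__":
    for name, host in REPLICAS.items():
        status = replication_status(host)
        running = (status.get("Slave_IO_Running") == "Yes"
                   and status.get("Slave_SQL_Running") == "Yes")
        lag = status.get("Seconds_Behind_Master")
        print(f"{name}: replicating={running}, seconds_behind_master={lag}")
```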

Sometime this morning, this secondary slave database ran out of disk space and triggered our alerting mechanism. At this point the database system as a whole continued to function normally – the secondary slave is a precautionary backup only and does not serve live requests.
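As a minimal sketch of the kind of disk-space check that could raise such an alert (the data directory path and the 10% threshold below are hypothetical, not the actual alerting setup):

```python
# Minimal sketch of a disk-space alert of the kind described above.
# The data directory path and the 10% threshold are hypothetical.
import shutil

DATA_DIR = "/var/lib/mysql"       # hypothetical database data directory
FREE_THRESHOLD = 0.10             # alert when less than 10% of the disk is free

def check_disk(path: str, threshold: float) -> bool:
    """Return True if free space on `path` has fallen below `threshold`."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return free_fraction < threshold

if __name__ == "__main__":
    if check_disk(DATA_DIR, FREE_THRESHOLD):
        # In a real system this would page the on-call team rather than print.
        print(f"ALERT: less than {FREE_THRESHOLD:.0%} disk free on {DATA_DIR}")
```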

The team provisioned more disk space on the slave system and brought it back up. During the time the slave had been down, a large volume of writes had occurred on the primary master/slave setup, so bringing the secondary slave back up generated a large amount of replication traffic that overwhelmed the master and caused it to stop serving live requests – this is what caused the outage customers saw earlier.
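One way to limit this class of problem – sketched below purely as an illustration, again assuming a MySQL-style setup with hypothetical hosts, credentials and thresholds – is to throttle the catch-up: pause replication on the returning slave whenever the master looks busy, and only let it stream its backlog while the master has headroom.

```python
# Illustrative sketch only: throttle a slave's catch-up so it cannot
# overwhelm the master. Assumes MySQL-style replication and pymysql;
# hosts, credentials and thresholds are hypothetical.
import time
import pymysql

MASTER_HOST = "db-master.primary.example.internal"
SLAVE_HOST = "db-slave.offsite.example.internal"
MASTER_BUSY_THRESHOLD = 50   # pause catch-up if this many threads are running
POLL_SECONDS = 30

def connect(host: str):
    return pymysql.connect(host=host, user="monitor", password="secret",
                           cursorclass=pymysql.cursors.DictCursor)

def master_threads_running(conn) -> int:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_running'")
        row = cur.fetchone()
        return int(row["Value"]) if row else 0

def slave_lag(conn):
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone() or {}
        return row.get("Seconds_Behind_Master")

def set_replication(conn, running: bool) -> None:
    with conn.cursor() as cur:
        cur.execute("START SLAVE" if running else "STOP SLAVE")

if __name__ == "__main__":
    master, slave = connect(MASTER_HOST), connect(SLAVE_HOST)
    while True:
        lag = slave_lag(slave)
        if lag == 0:
            print("Slave has caught up; leaving replication running.")
            break
        busy = master_threads_running(master) > MASTER_BUSY_THRESHOLD
        # Only let the slave pull its backlog while the master has headroom.
        set_replication(slave, running=not busy)
        print(f"lag={lag}s, master busy={busy}, "
              f"replication {'paused' if busy else 'running'}")
        time.sleep(POLL_SECONDS)
```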

The team were not alerted that the master had stopped serving live traffic until customers raised support tickets, at which point they took the secondary slave back down to remove the load on the primary, and normal service resumed.

The team will restore the secondary slave out of hours using an alternative mechanism so as not to cause further outages.

Actions

  1. We will review our monitoring system so we can understand why we were not immediately informed of the overwhelmed primary when we brought the secondary slave into service (see the sketch after this list).
  2. We are already working on a new database architecture project. The new architecture uses a different replication strategy by design, which makes situations such as this far less likely to occur.
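On the first action, one kind of check that would catch this failure mode is an end-to-end probe that exercises the database path and alerts on the 503 responses customers were seeing. A minimal sketch, with a hypothetical probe URL and thresholds (not Talis's actual monitoring):

```python
# Illustrative sketch of an end-to-end probe that would flag the master
# failing to serve live requests (customers saw "Error getting data from
# store (503)"). The URL, thresholds and polling interval are hypothetical.
import time
import urllib.error
import urllib.request

PROBE_URL = "https://aspire.example.com/healthcheck"  # hypothetical endpoint
FAILURES_BEFORE_ALERT = 3
POLL_SECONDS = 60

def probe(url: str) -> int:
    """Return the HTTP status of a single probe request (0 if unreachable)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except urllib.error.URLError:
        return 0

if __name__ == "__main__":
    consecutive_failures = 0
    while True:
        status = probe(PROBE_URL)
        failed = status == 0 or status >= 500
        consecutive_failures = consecutive_failures + 1 if failed else 0
        if consecutive_failures >= FAILURES_BEFORE_ALERT:
            # In a real system this would page the on-call team rather than print.
            print(f"ALERT: {consecutive_failures} consecutive failures "
                  f"(last status {status}) from {PROBE_URL}")
        time.sleep(POLL_SECONDS)
```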
We’re very sorry for any inconvenience caused to you and your users. If you have any further questions, please feel free to post them below or raise a support ticket.

The quest for perfect uptime fascinates me – the complexity of it, and the unexpected consequences of trying to reach it.

I recently saw this quote about IT uptime: “The paradox of resilience is that resilience requires complexity, and complexity is the enemy of resilience”, which sums it up nicely. The article linked to here is a good example: Talis, in my view, had done everything right, and I was quite impressed by the set-up described here, yet re-syncing a *second*, off-site slave database took down the live site.

A good, open write-up of what was quite a short bit of downtime.
