The Importance of Testing Backups

Another incident of a tired admin fixing one outage only to cause a bigger one isn't news as such. However, I have to hand it to GitLab for their open honesty about this week's incident.

After a spam storm created serious (4GB) replication lag on the firm's PostgreSQL database cluster, a very tired on-call team member trying to fix the replication deleted the data folder on the active server rather than on the replicating one.

The full incident is documented here

I embrace the honesty they have shown, as it enables the whole community to learn from the incident and offer better services to our clients. This is very much the message of Black Box Thinking by Matthew Syed. Syed describes the difference between closed cultures, where mistakes are hidden, and open, honest cultures, where mistakes are shared and much learning and prevention occurs as a result.

As shown by the support on Twitter, DevOps and cloud reliability engineers agree.

Lessons so far? Test your backups; you never know when you will really need them.

My ethos is that servers are disposable, so I love destroying and rebuilding them to prove that the service can be restored in any disaster recovery situation. This relies on well-designed recovery processes and code, and it shifts the focus from avoiding failure to embracing failure and reducing the mean time to recovery.
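
To make "test your backups" concrete, here is a minimal sketch of a restore test in Python. It is only an illustration under stated assumptions: the backup is a plain zip archive of a directory, and the function names (make_backup, verify_backup, manifest_of) are made up for this post. It is not GitLab's process, and a real PostgreSQL backup would need to be restored into a scratch database and queried rather than checksummed file by file. The idea is the same though: a backup only counts once you have actually restored it somewhere disposable and checked the result.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_of(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def make_backup(source: Path, archive: Path) -> dict[str, str]:
    """Zip every file under source and return the manifest taken at backup time.

    Assumption: the archive path lives outside the source directory.
    """
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in sorted(source.rglob("*")):
            if p.is_file():
                zf.write(p, p.relative_to(source).as_posix())
    return manifest_of(source)


def verify_backup(archive: Path, expected: dict[str, str]) -> list[str]:
    """Restore the archive into a throwaway directory and list any problems."""
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(scratch)
        restored = manifest_of(Path(scratch))
    # Files recorded at backup time that did not come back at all.
    problems = [f"missing: {name}" for name in expected if name not in restored]
    # Files that came back with different contents.
    problems += [
        f"changed: {name}"
        for name, checksum in expected.items()
        if name in restored and restored[name] != checksum
    ]
    return problems
```

An empty list from verify_backup means every file recorded at backup time was restored with identical contents. Run on a schedule, a check like this turns "we have backups" into "we restored one last night and it matched", and timing the restore gives a rough feel for the mean time to recovery.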