February 10, 2016 outages

'(The Outage Log)

Audience: sysadmins

“A captain of a ship, no matter his rank, must follow the book.” —Captain James T. Kirk

I’ve been keeping an "Outages Log" for the last few years to track downtime and causes of failures (sometimes they’re only theories). This has turned out to be an invaluable record, so I’ll share the idea here and encourage you to keep one, too.

The OutagesLog is maintained as a wiki page. It’s helpful is spotting problem patterns, and identifies when something has been neglected. The goal is to never have the same problem twice.

The front-matter is a brainstorm/laundry list of things that can go (or have gone) wrong.

  • Cloudflare hiccups

  • Data center network

  • Failing critical batch job

  • Hardware

  • App bugs (looping retries)

  • DB swap/backup problem

  • Forgetting to restart passengers on deploy

  • DNS confusion/loss

  • Bad upgrade rotation

A list of symptoms of each is also helpful.

Depending on the number of machines you’re managing, it might be feasible to also track system updates and reboots here. Hopefully these are done on a consistent schedule across most machines.

Here is an example set of entries. The entries are organized by year. Each entry contains a date-time-stamp with the number of minutes down/impacted, followed by diagnosis. A more detailed reporting would also contain a link to a fuller description of the outage and planned remedies.

2014
[Jan 13 18:14] 18m: db network failure; VIP disappeared (link to full report)
[Feb 06 22:44] 2m: Two hiccups. DC acknowledged router issue on their end
[Feb 28 12:24] 3h: Wow, DC breaker trip, fries dbs91 controller. Recovery takes many hours before discovering disk will be too slow. Eventual reboots help, but we have to fail over to Houston.