MTTR: lower isn’t always better

It’s a great idea to track your MTTR (Mean Time To Recover) as an operational metric. MTTR is defined as the average interval between onset of a failure and recovery from that failure. We acknowledge that failures are part of the game, so we want our organization to be good at responding quickly to them. It’s intuitive that we’d want our MTTR to trend down.

This is one of those places where our intuition can be misleading.

MTTR is an average over incidents of incident duration. That means that the total amount of downtime gets denominatored out. Consider these two brothers who run different websites:

  • Achenar’s site only had 1 outage in September, and it lasted 60 minutes.
  • Sirius’s site had 120 outages in September, lasting 20 minutes each.

Sirius had 40 times as much downtime as Achenar in the month of September. Sirius’s MTTR, however, was 1/3 that of Achenar: 20 minutes rather than 60 minutes.

Lowering your MTTR is a good strategy in certain situations. But you need to make sure it’s the right strategy. If you don’t look at the whole picture, things like nuisance alarms and insufficient automation can be confounded with the meaning of your MTTR. If you fix a whole bunch of meaningless alerts that always recover quickly without intervention (you know the type), your MTTR goes up!

mttr_vs_incident_count

 

MTTR is useful to track, and it can be useful for decision-making. Just remember: our goal is to minimize downtime and noise, not MTTR. If the path of least resistance to lower downtime and a stronger signal is to respond to incidents quicker, then MTTR is your best friend. But that’s not always true.