MTTR: lower isn’t always better

It’s a great idea to track your MTTR (Mean Time To Recover) as an operational metric. MTTR is defined as the average interval between onset of a failure and recovery from that failure. We acknowledge that failures are part of the game, so we want our organization to be good at responding quickly to them. It’s intuitive that we’d want our MTTR to trend down.

This is one of those places where our intuition can be misleading.

MTTR is an average over incidents of incident duration. That means that the total amount of downtime gets denominatored out. Consider these two brothers who run different websites:

  • Achenar’s site only had 1 outage in September, and it lasted 60 minutes.
  • Sirius’s site had 120 outages in September, lasting 20 minutes each.

Sirius had 40 times as much downtime as Achenar in the month of September. Sirius’s MTTR, however, was 1/3 that of Achenar: 20 minutes rather than 60 minutes.

Lowering your MTTR is a good strategy in certain situations. But you need to make sure it’s the right strategy. If you don’t look at the whole picture, things like nuisance alarms and insufficient automation can be confounded with the meaning of your MTTR. If you fix a whole bunch of meaningless alerts that always recover quickly without intervention (you know the type), your MTTR goes up!

mttr_vs_incident_count

 

MTTR is useful to track, and it can be useful for decision-making. Just remember: our goal is to minimize downtime and noise, not MTTR. If the path of least resistance to lower downtime and a stronger signal is to respond to incidents quicker, then MTTR is your best friend. But that’s not always true.

4 thoughts on “MTTR: lower isn’t always better

  1. Pingback: Reducing Incident Resolution Time - DevOps.comDevOps.com

  2. Hi Dan. This is simply gets out of equation for availabilty: MTBF / (MTBF + MTTR). You can’t focus only on reducing MTTR here — if your MTBF is also low it doesn’t help much. The same if you focus only on increasing MTBF without keeping MTTR low your availability may improve but also may actually drop.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s