Explaining the fire

When your site goes down, it’s all hands on deck. A cross-functional team must assemble fast and pursue an organized response. It feels like fighting a fire. So it’s not surprising that formal IT incident management tends to borrow heavily from the discipline of firefighting.

However, in software incident response, we have a crucial constraint that you won’t find in firefighting. Namely, in order to fix a software product, we first have to understand why it’s broken.

When the firefighters arrive at the blazing building, they don’t need to explain the fire. They need to put it out. It doesn’t matter whether a toaster malfunctioned, or a cat knocked over a candle, or a smoker fell asleep watching The Voice. The immediate job is the same: get people to safety and put out the fire.

But when PagerDuty blows up and we all stumble into the incident call, we need at least a vague hypothesis. Without one, we can’t even start fixing the problem. What should we do? Reboot one of the web servers? Which one? Should we revert the last deploy? Should we scale up the database? Flush the CDN? Open a support ticket with Azure? Just wait?

We can’t act until we have at least some explanation for how the outage came about.

Often, the process of diagnosis – of explaining the failure – takes up the majority of the incident. Diagnosis isn’t easy, especially in a group and under pressure. Important facts go ignored. Hypotheses get forgotten, or remain unchallenged in the face of new information. Action items fall through the cracks. Diagnostic disconnects like these add up to longer outages, noisier public-facing comms, and repeat failures.

And yet, when we look to improve IT incident response, what do we usually focus on? On-call rotations, status page updates, command-and-control structure. Sliding-down-the-firepole, radioing-with-dispatch type stuff.

In software incident response, we need to maintain a coherent diagnostic strategy in the face of scarce information and severe time pressure. This makes us, on one dimension at least, more like doctors than firefighters. This is one of the reasons that engineering teams find immense value in clinical troubleshooting. It brings rigor and transparency to the joint diagnostic effort.

I teach clinical troubleshooting as part of Scientific Incident Response in 4 Days. Check it out.

3 thoughts on “Explaining the fire”

>>But when PagerDuty blows up and we all stumble into the incident call, we need at least a vague hypothesis. Without one, we can’t even start fixing the problem<<

Not entirely. A vague hypothesis is needed to avoid a mitigation step that could aggravate the problem, but good reliability design involves identifying a mitigation that can be applied before you find the root cause.

I would argue that this has to be a conscious trade-off at design time: either the design is highly mitigation-friendly, or it is simpler to implement but harder to mitigate when the system misbehaves.

2024/04/15 at 17:52

I have always thought of this in terms of a first responder. An on-call engineer turns up as a paramedic/firefighter/police officer/mountain rescue. It’s not about diagnosis; it’s always mitigation first: stop the bleeding/put out the fire/make the public safe/get them off the mountain. Once that’s done, you start the forensics. I have had a lot of success starting every incident checklist with a simple reminder to mitigate first.

2024/04/19 at 16:22

Dan Slimmon

I think it’s absolutely correct to mitigate as soon as possible. My contention is mainly that mitigation is contingent on at least some form of diagnosis.

This is true (I imagine) for paramedics too. If you find a patient unconscious and bleeding, you’re going to behave differently than if you encountered them lucid and sitting upright. That’s because you have a diagnosis, or at least a differential diagnosis. In order to mitigate, you need to have some idea (even if it’s wrong) of what is wrong. Which is different from arriving at a fire, where the answer to “what is wrong?” is always “There’s a fire.”

2024/04/23 at 16:15
