Incident SEV scales are a waste of time

Ask an engineering leader about their incident response protocol and they’ll tell you about their severity scale. “The first thing we do is we assign a severity to the incident,” they’ll say, “so the right people will get notified.”

And this is sensible. In order to figure out whom to get involved, decision makers need to know how bad the problem is. If the problem is trivial, a small response will do, and most people can get on with their day. If it’s severe, it’s all hands on deck.

Severity correlates (or at least, it’s easy to imagine it correlating) to financial impact. This makes a SEV scale appealing to management: it takes production incidents, which are so complex as to defy tidy categorization on any dimension, and helps make them legible.

A typical SEV scale looks like this:

  • SEV-3: Impact limited to internal systems.
  • SEV-2: Non-customer-facing problem in production.
  • SEV-1: Service degradation with limited impact in production.
  • SEV-0: Widespread production outage. All hands on deck!

But when you’re organizing an incident response, is severity really what matters?

The Finger of God

SEV scales are rather like the Fujita Scale, which you may recall from the American film masterpiece Twister (1996). The Fujita Scale was invented in 1971 by Ted Fujita (seen here with Whoosh, a miniature tornado he rescued from tornado poachers and raised as his own).

The Fujita scale classifies tornados according to the amount of damage they cause to human-built structures and vegetation:

  • F0 – Light damage: Broken windows; minor roof damage; tree branches broken off.
  • F1 – Moderate damage: Cars pushed off road; major roof damage; significant damage to mobile homes.
  • F2 – Significant damage: Roof loss; collapse of exterior walls; vehicles lifted off ground; large numbers of trees uprooted.
  • F3 – Severe damage: A few parts of buildings left standing; train cars overturned; total loss of vegetation.
  • F4 – Devastating damage: Homes reduced to debris; trains lifted off ground and turned into projectiles.
  • F5 – Incredible damage: Well-anchored homes lifted into the air and obliterated; steel-reinforced structures leveled; “Incredible phenomena can and will occur.”

Or, more colorfully,

BILL: It’s the Fujita scale. It measures a tornado’s intensity by how much it eats.
MELISSA: Eats?
BILL: Destroys.
LAURENCE: That one we encountered back there was a strong F2, possibly an F3.
BELTZER: Maybe we’ll see some 4’s.
HAYNES: That would be sweet!
BILL: 4 is good. 4 will relocate your house very efficiently.
MELISSA: Is there an F5?
(Silence.)
MELISSA: What would that be like?
JASON: The Finger of God.

Fujita scores are not assigned at the moment a tornado is born, nor at the moment it is observed, nor yet the moment it’s reported to the National Weather Service. They’re assigned only after the damage has been assessed. During this assessment, many tools may be brought to bear, including weather radar data, media reports, witness testimony, and expert elicitation. It can take quite a while. It may even be inconclusive: the Enhanced Fujita scale has a special “unknown” score for tornados, which is used for storms that happen to cross an area devoid of buildings or vegetation to “eat.” No matter how immensely powerful a tornado may be, if it causes no surveyable damage, its score is unknown.

Is severity even the point?

In many cases, when we’re just beginning to respond to a software incident, we don’t yet know how bad a problem we’re dealing with. We can try to use the evidence at hand to hazard a guess, but the evidence at hand is likely to be scant, ambiguous, and divorced from context. (I’m by no means the first to notice these issues with SEV scales – see Fred Hebert’s Against Incident Severities and in Favor of Incident Types for an excellent take.)

Maybe all we have is a handful of support tickets. A handful of support tickets could imply a small hiccup, or it could point to the start of a major issue. We don’t know yet.

Or the evidence might be a single alert about a spike in error rate. Are the errors all for a single customer? What’s the user’s experience? What’s a normal error rate?

This uncertainty makes it quite hard to assign an accurate incident severity – at least, without doing additional, time-consuming research. Ultimately, though, what incident response calls for is not a judgement of severity. It’s a judgement of complexity.

The Incident Command System, employed by disaster response agencies such as FEMA and the DHS, provides an illustration of complexity-based scoring. It defines a range of incident types:

  • Type 5 – Complexity: Up to 6 personnel; no written action plan required. Examples: Vehicle fire; injured person; traffic stop.
  • Type 4 – Complexity: “Task Force” or “Strike team” required; multiple kinds of resources required. Examples: Hazmat spill on a roadway; large commercial fire.
  • Type 3 – Complexity: Command staff positions filled; incident typically extends into multiple shifts; may require an incident base. Examples: Tornado damage to small section of city; detonation of a large explosive device; water main break.
  • Type 2 – Complexity: Resources may need to remain on scene several weeks; multiple facilities required; complex aviation operations may be involved. Examples: Railroad car hazmat leak requiring multi-day evacuation; wildfire in a populated area; river flooding affecting an entire town.
  • Type 1 – Complexity: Numerous facilities and resource types required; coordination between federal assets and NGO responders; 1000 or more responders. Examples: Pandemic; Category 4 hurricane; large wind-driven wildfire threatening an entire city.

Complexity-based classification has a key advantage over impact-based classification: by the time you’ve thought enough to know how complex a response you need, you already have the beginning of a plan. In other words, whereas the impact-based Fujita system is suited to analysis, the complexity-based ICS types are suited to action.

A complexity scale for software incidents

Impact-based systems, like SEV scores, do little to support responders at incident time. A complexity-based system is far better. Here’s an example of a complexity-based system for software incidents:

  • C-1: A handful of engineers from a single team. No handoffs. Example: Disk space is filling up on a log server.
  • C-2: Coordination between teams and/or across shifts. Customer support may be involved. Example: Out-of-control database queries causing a performance degradation for some customers.
  • C-3: Coordination among 3 or more teams. Customer-facing status posts may be called for. Deploys may be paused. Examples: Outage of account-administrative functions. Severe delay in outgoing notifications.
  • C-4: Sustained, intense coordination cutting across all engineering team boundaries. Third-party relationships activated. Executives may be involved. Examples: Website down due to unexplained network timeouts. Cloud provider region failure.
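
If you adopt a scale like this, it can live directly in your incident tooling or runbooks. Here’s a minimal sketch in Python – the level names mirror the table above, but the fields and thresholds are hypothetical and would need tuning to your organization:

```python
from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    """Complexity levels from the example scale above."""
    C1 = 1  # handful of engineers from a single team; no handoffs
    C2 = 2  # cross-team or cross-shift coordination; support may be involved
    C3 = 3  # 3+ teams; status posts may be called for; deploys may be paused
    C4 = 4  # all-hands coordination; third parties; executives involved


@dataclass
class ResponsePlan:
    """The rough shape of the response you think you need."""
    teams_involved: int
    spans_shifts: bool
    customer_comms_needed: bool
    third_parties_or_execs: bool


def classify(plan: ResponsePlan) -> Complexity:
    """Map a rough response plan to a complexity level.

    Note that you can't even call this function until you've sketched
    a plan -- which is exactly the point of a complexity-based scale.
    """
    if plan.third_parties_or_execs:
        return Complexity.C4
    if plan.teams_involved >= 3 or plan.customer_comms_needed:
        return Complexity.C3
    if plan.teams_involved >= 2 or plan.spans_shifts:
        return Complexity.C2
    return Complexity.C1


# Example: a cross-team performance degradation, handled within one shift.
print(classify(ResponsePlan(2, False, False, False)))  # Complexity.C2
```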

If you’re in control of your company’s incident response policy, consider whether a severity scale is going to be helpful to responders in the moment. You might come to the conclusion that you don’t need any scale at all! But if you do need to classify ongoing incidents, would you rather use a system that asks “How bad is it?,” or one that asks “What’s our plan?”

Leading incidents when you’re junior

If you’re a junior engineer at a software company, you might be required to be on call for the systems your team owns. Which means you’ll eventually be called upon to lead an incident response. And since incidents don’t care what your org chart looks like, fate may place you in charge of your seniors – even your boss!

That’s a lot of pressure, and they certainly didn’t teach you how to do it in school. You’re still just learning the ropes, and now they expect you to be in charge? During an outage? And tell more senior engineers what to do? It seems wrong and unfair.

But let your inexperience be an advantage!

Incident lead is not a technical role

The incident lead is the person accountable for keeping the response effort moving swiftly forward. That involves a wide variety of activities, of which fixing the problem only represents a subset.

Just like the leader of any team, the incident lead’s main job is to keep all the participants on the same page – in other words, to maintain common ground. It’s common ground that allows a group of individuals to work together as more than just individuals. And you don’t need to be deeply versed in the tech to do that. You just need to ask questions.

Aim to understand the problem just enough to make pretty good decisions. Your decisions don’t have to be perfectly optimal. If the primary SME says something like,

It looks like maybe the Chargeover service is borked.

and you don’t know what the Chargeover service is or why it might be borked: speak up! The primary SME is already deep in the problem space, so they often won’t think to explain what they mean. And chances are you’re not the only one on the call who needs an explanation. As incident lead, it’s up to you to get clarity – not just for yourself, but for the whole group.

As someone who’s new to the tech stack, you’re perfectly placed to ask fundamental questions. So ask. For example:

  • What makes you say the Chargeover service is borked? Did you see a graph or logs or something?
  • I’m not familiar with the Chargeover service – what does it do?
  • Do you have a hypothesis yet about why it’s borked?

You won’t need to ask a bunch of questions in a row. Usually one or two is enough to jolt an SME out of “fixing mode” and into “explaining mode.” Then you can draw out enough information to build a sufficient understanding of your own, and by listening to your conversation, the rest of the call will come away with an improved, shared understanding. The group will develop common ground.

How do you know when your understanding is sufficient? That’s a job for closed-loop communication. As soon as you think you can, repeat back in your own words the following:

  1. The symptoms
  2. The main hypothesis that the SME is entertaining to explain the symptoms
  3. Any other hypotheses in play
  4. What action(s) the SME is planning to take

If you say these things and the SME says, “Yup, that’s right,” then congratulations! You’ve successfully established common ground among incident responders. You’ve done a better, more valuable job than the vast majority of incident leads (even ones who are very experienced engineers). Because you asked fundamental questions and listened.

If you’re looking to raise your incident response game, my 3-part course Leading Incidents is just what you need.

Ask questions first, shoot later

One piece of common-sense advice that you often hear about incident response is,

Fix it first. Ask “why” later.

This chestnut is often deployed to combat what is perceived as excessive investigation. And like most common-sense advice, it’s approximately right in lots of situations. But it misses a crucial point, and at its worst, this attitude perpetuates failure.

Diagnosing and fixing

Incident response comprises two intertwined, but distinct, activities: diagnosing and fixing. This point is illustrated in David Woods’ 1995 paper, Cognitive demands and activities in dynamic fault management: abductive reasoning and disturbance management (which uses the term “response planning” for what I’m calling “fixing”):

Diagnosing and fixing can involve overlapping activities, such that they blend together during incident response. For example, if you have diagnosed a partial outage as resulting from a web server that’s used up all its allotted file handles, you might restart that web server. This would be a “diagnostic intervention,” in that it serves to advance both the fix (if your diagnosis holds water, then restarting the web server will fix the problem) and the diagnosis (if restarting the web server fixes the problem, then you have additional evidence for your diagnosis; if it doesn’t, then you know you need a new diagnosis).
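
To make “diagnostic intervention” concrete: before (or alongside) the restart, you might check whether the server really is near its file-descriptor limit. Here’s a rough sketch, assuming a Linux host where you can read /proc and a hypothetical web-server PID:

```python
import os

def fd_usage(pid: int) -> tuple[int, int]:
    """Return (open_fds, soft_limit) for a process, via /proc (Linux only)."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit = 0
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                soft_limit = int(line.split()[3])  # fourth field is the soft limit
                break
    return open_fds, soft_limit

pid = 12345  # hypothetical web server PID
used, limit = fd_usage(pid)
print(f"{used} of {limit} file descriptors in use")
# If usage is pinned at the limit, the restart is both a fix attempt and a
# test of the diagnosis: if it doesn't help, the hypothesis is weakened.
```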

The fact that fixing and diagnosing often converge to the same actions doesn’t change the fact that these two concurrent activities have different goals. The goal of fixing is to bring the system into line with your mental model of how it’s supposed to function. The goal of diagnosing is to bring your mental model into line with the way the system is actually behaving.

Usually these two goals are aligned with one another, but sometimes they demand different actions. And that’s what opens the door for someone to say, “Let’s fix the problem first and ask questions later.” However, this implies a naïve conception of the incident response process, which I’ll represent schematically here:

In this view, you fix first, then you diagnose – perhaps in a post-mortem or a root-cause analysis. But in a field like ours, in which complex systems are at play, this is simply not how things work. A complex system has infinitely many possible failure modes. Therefore there are infinitely many possible actions that might be necessary to recover from a failure. In order to even attempt a fix, you must always start with some kind of diagnosis.

Sure, sometimes the diagnostic effort might be very brief and straightforward. Suppose you get an alert about some new error happening in production. You immediately recognize the error as resulting from a code change you just deployed, and you revert the change.

Because the diagnosis was so quick, it may feel like you simply fixed the problem as soon as you saw it. But you still undertook a diagnostic process. You saw the alert and developed a hypothesis (“My code change caused these errors”), and that hypothesis turned out to be right. Had you truly done no diagnosis, then you wouldn’t have known what to do. The incident actually looked like this:

Contrast this with another scenario. You get alerted about slow page-loads. Together with a team, you begin to investigate the slowness. But no explanation is forthcoming. It takes an hour of searching logs, reading documentation, and consulting with other teams before you have a satisfactory explanation: a mission-critical cache object has gotten too large to store in the cache, so it has to be fetched from origin on every request. Upon reaching this diagnosis, you immediately know what to do to fix the problem:

During this long diagnostic phase, nobody would have said, “Fix the problem first. Worry about ‘why’ later.” The diagnostic effort was clearly pursued in service of fixing the issue. Whether it takes a split-second or a week, a diagnosis (at least a differential diagnosis) always has to be reached before the problem can be fixed.
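
To make the cache scenario concrete, here’s a toy sketch of that failure mode. The names are hypothetical, but many real caches enforce a per-item size limit (memcached’s default is 1 MB), and a set that exceeds the limit simply doesn’t stick:

```python
MAX_ITEM_BYTES = 1_000_000  # per-item size limit, like memcached's default

class SizeLimitedCache:
    """Toy cache that refuses to store values over the item-size limit."""
    def __init__(self):
        self._store = {}

    def set(self, key: str, value: bytes) -> bool:
        if len(value) > MAX_ITEM_BYTES:
            return False              # too large: not stored
        self._store[key] = value
        return True

    def get(self, key: str):
        return self._store.get(key)

cache = SizeLimitedCache()

def fetch_from_origin(key: str) -> bytes:
    return b"x" * 1_200_000           # the object has grown past the limit

def get_config(key: str = "site-config") -> bytes:
    value = cache.get(key)
    if value is None:                 # misses on *every* request now
        value = fetch_from_origin(key)  # slow origin fetch each time
        cache.set(key, value)           # quietly fails; next request misses again
    return value

get_config()  # page load 1: origin fetch
get_config()  # page load 2: origin fetch again -- hence the slow page loads
```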

These are simple examples. In a more general case, you do some diagnosis, which produces a potential fix. That fix doesn’t work (or only partly works), so diagnosis continues until another potential fix presents itself. And since multiple responders are present on the call, diagnosis doesn’t generally have to totally halt in order for fixes to be pursued:

The “Fix first” shibboleth comes out when someone perceives that there is already a potential fix, but nobody is applying that fix yet. So when you hear it, or you’re tempted to speak it yourself, first ask yourself:

  1. Is there a potential fix on the table?
  2. Is that potential fix worth pursuing immediately?

If the answer to both of these questions is “yes,” then by all means, get on the fix. But don’t halt diagnosis to do so, unless you’re so labor-constrained that you must.

If either question elicits a “no,” then you should talk through your reasoning with the group and make the case for pursuing further diagnosis before taking action.

Fix first, ask questions later.

Ask questions until you can fix.

––

I teach Scientific Incident Response in 4 Days, an in-depth incident response training course for engineering teams.

Garden-path incidents

Barb’s story

It’s 12 noon on a Minneapolis Wednesday, which means Barb can be found at Quang. As the waiter sets down Barb’s usual order (#307, the Bun Chay, extra spicy), Barb’s nostrils catch the heavenly aroma of peanuts and scallions and red chiles. A wave of calm moves through her. Barb pulls her chair forward, cracks apart her wooden chopsticks, and…her pager goes off.

After cursing under her breath, she dutifully reads the message:

Error rate for `environment:production' exceeds 100 msg/s

Welp.

Barb grabs one quick bite of spring roll as she flags down the waiter for a to-go box. Opening Slack on her phone, she declares an incident, joins the Zoom call, and hurries back up Nicollet Ave. and around the corner, toward her apartment.

Five minutes later, finally sitting at her laptop, Barb is pretty sure she knows what the problem is. The error flooding the logs is:

object 'nil' has no method 'reimport!'

That looks like a straightforward app bug to her, and wouldn’t you know it? Right before these errors started cropping up, there was a deploy to the Rails app by a newish dev named Alice, who, according to her Slack profile, is based in Seattle. Barb asks this ‘Alice’ to join the incident Zoom.

– Hi, this is Alice. What’d I do?
– Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. Can you revert that change please?
– Sure, no problem. I’ll put together the revert PR now.

5 minutes later, Alice’s PR is approved. Alice clicks “Merge.” The pair begin the anxious but familiar 15-minute wait for CI to pass, all the while greeting and informing the bewildered latecomers who straggle into the call.

Alice’s story

Alice stares blankly at the white rectangle on her monitor. She spent her first hour getting yesterday’s frontend bugfixes pushed out, and now it’s time to start her PowerPoint. She’ll be working on this PowerPoint for the rest of the morning, probably through lunch, and all afternoon.

Alice shuts her eyes and heaves a dismal sigh. Alice fucking hates PowerPoint. But she can’t put it off anymore. So she dons her headphones, cracks her knuckles, and… gets an urgent Slack message:

Morning, Alice – we’ve got a production incident involving a spike of errors, and it looks like it coincides with a deploy of yours. Can you jump on https://zoom.globocorp.co/z/123456789… when you have a moment please?

As she waits for Zoom to load, Alice feels something almost like relief. At least she doesn’t have to work on that goddamn PowerPoint yet.

– Hi, this is Alice. What’d I do?
– Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. Can you revert that change please?
– Sure, no problem. I’ll put together the revert PR now.

Alice quickly whips up that PR and gets it approved. She spends the next 15 minutes waiting for CI to pass, while absent-mindedly writing the first slide of her PowerPoint. By the time the tests are green, she has typed out and deleted 4 different titles.

The real story

This incident seems to have gone about as well as it could, considering. Alice was on the call within 7 minutes of the alert, and a PR was ready 5 minutes later. It would be great if CI were faster, or even better if CI could be skipped for a revert. They’ll talk about that at the post-mortem.

However, nobody in the call yet knows what really happened. What really happened is this:

    • Alice’s 16:46 UTC deploy was the first to pick up the latest Docker image.
    • The new Docker image includes an update to a software dependency.
    • The updated dependency has a bug that only shows up in production.

    But instead of knowing any of that, Alice and Barb are sitting here for 15 minutes waiting for CI to run, so they can deploy a fix that won’t even work.

    This is a garden-path incident. Barb has what she feels is a strong signal from the telemetry, which points toward a bug in Alice’s code. Alice has what she feels is a strong signal, which is that Barb seems very confident in her conclusion. But they’ve been led up the garden path, and as a consequence, this incident will run longer than it needs to.

    How this could all have been avoided

    Imagine instead, that Barb and Alice are both in the habit of saying their hypotheses out loud.

    When Alice joins the call, Barb instead says:

    – Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. My hypothesis is that your change triggered this spike of errors. Can you revert the change please?

    Instead of letting Alice infer that the cause of the error spike is already known to be her deploy, Barb acknowledges the limits of her certainty. She has a hypothesis, not a definitive diagnosis. This gives Alice the opportunity to respond with something like:

    – Well, are the errors from the backend or the frontend? Because my change was frontend-only.

    And just like that, Alice and Barb have stepped back from the garden path. Instead of waiting around for a useless CI cycle, they can continue straight away with diagnosis.

    Note that, even if Barb doesn’t state her hypothesis, things will still be okay as long as Alice does:

    – Hi, this is Alice. What’d I do?
    – Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. Can you revert that change please?
    – Sure, no problem. I’ll put together the revert PR now. Just to be clear, the hypothesis is that my frontend-only changeset is somehow causing these nil-has-no-method errors in the backend?
    – Uh, did you say frontend-only?

    Again, Barb and Alice have gotten themselves off the garden path. Which means this technique – stating your hypothesis and asking for rule-outs – is something you can do unilaterally starting today to make your team better at troubleshooting.

    Another thing you can do to make your team better at troubleshooting is employ Clinical Troubleshooting against your next head-scratcher of a bug.

    Explaining the fire

    When your site goes down, it’s all hands on deck. A cross-functional team must assemble fast and pursue an organized response. It feels like fighting a fire. So it’s not surprising that formal IT incident management tends to borrow heavily from the discipline of firefighting.

    However, in software incident response, we have a crucial constraint that you won’t find in firefighting. Namely, in order to fix a software product, we first have to understand why it’s broken.

    When the firefighters arrive at the blazing building, they don’t need to explain the fire. They need to put it out. It doesn’t matter whether a toaster malfunctioned, or a cat knocked over a candle, or a smoker fell asleep watching The Voice. The immediate job is the same: get people to safety and put out the fire.

    But when PagerDuty blows up and we all stumble into the incident call, we need at least a vague hypothesis. Without one, we can’t even start fixing the problem. What should we do? Reboot one of the web servers? Which one? Should we revert the last deploy? Should we scale up the database? Flush the CDN? Open a support ticket with Azure? Just wait?

    We can’t act until we have at least some explanation for how the outage came about.

    Often, the process of diagnosis – of explaining the failure – takes up the majority of the incident. Diagnosis isn’t easy, especially in a group and under pressure. Important facts go ignored. Hypotheses get forgotten, or remain unchallenged in the face of new information. Action items fall through the cracks. Diagnostic disconnects like these add up to longer outages, noisier public-facing comms, and repeat failures.

    And yet, when we look to improve IT incident response, what do we usually focus on? On-call rotations, status page updates, command-and-control structure. Sliding-down-the-firepole, radioing-with-dispatch type stuff.

    In software incident response, we need to maintain a coherent diagnostic strategy in the face of scarce information and severe time pressure. This makes us, on one dimension at least, more like doctors than firefighters. This is one of the reasons that engineering teams find immense value in clinical troubleshooting. It brings rigor and transparency to the joint diagnostic effort.

    I teach clinical troubleshooting as part of Scientific Incident Response in 4 Days. Check it out.

    Incident, Inçident, Incidënt

    When you deploy broken code, it may cause an incident. Then you’ll have to declare an incident. And don’t forget to create an incident so customers can stay informed!

    “Incident” can mean a bunch of things, and these meanings are only partly distinguishable through context. When the word “incident” is used ambiguously, confusion can emerge.

    The effort to fix the problem

    In many organizations, the organizational formalism around responding to a problem is called an “incident.” For example,

    At 08:40 UTC, engineer Geordi La Forge declared an incident after observing high API latencies in production. The incident was led by Beverly Crusher, and it was resolved after 32 minutes.

    We’ll call this usage an incident-effort. An incident-effort need not correspond to an outage. You can create one for any problem on which you want multiple sets of eyes. For example, if deployment to the staging environment is borked, or if disk utilization is growing too rapidly on a database server, you might declare an incident(-effort).

    The public thread of communication

    When stakeholders need to be informed of a problem – often, these stakeholders are customers – we’re expected to “create an incident.” This kind of “incident” (let’s call it an incident-post) is usually a post on a publicly-facing website, describing the problem and the current state of the response to it.

    An incident-effort may exist without an incident-post. An incident-post may also exist without an incident-effort (for example, if an incident is used to communicate about a planned maintenance window).

    The problem itself

    (Image: an incident on the DataDog status page.)

    Finally, there is the problem itself: that which is to be fixed, or communicated about.

    There are many problems that don’t merit an incident-effort, let alone an incident-post. We usually just call such problems “bugs” or “risks.”

    When someone says “We’ve had too many incidents this quarter,” they’re probably talking about problems that caused downtime. But they might be talking about incident-posts themselves, depending on how they’re looking at things. It pays to clarify.
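
One way to keep the three meanings from blurring together is to model them as distinct objects in your incident tooling. A minimal sketch, with hypothetical names:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Problem:
    """The thing that's broken (or risky). Many never merit a response."""
    description: str

@dataclass
class IncidentEffort:
    """The organizational response: a lead, responders, a resolution."""
    lead: str
    problem: Optional[Problem] = None
    responders: list = field(default_factory=list)
    resolved: bool = False

@dataclass
class IncidentPost:
    """The public thread of communication, e.g. a status-page post."""
    title: str
    effort: Optional[IncidentEffort] = None  # may exist without an effort
    updates: list = field(default_factory=list)

# A planned maintenance window: an incident-post with no incident-effort.
maintenance = IncidentPost(title="Scheduled database maintenance, 02:00-03:00 UTC")
```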

    Never being one to shy away from an opportunity to make a Venn Diagram, I’ll leave you with a few common phrases around incident management:

    I offer incident response training, remote or on-site. Get in touch.

    Dead air on the incident call

    When troubleshooting a high-impact software failure in a group, you have to be ready for shifts in tenor. One moment there’s a frenzy of coordination, and the next: absolute silence.

    Silence is natural and often useful. But to be an effective incident commander – whose job is to keep the problem-solving effort moving forward – you must develop a keen ear for silence.

    Silence can mean different things to different people in different situations. In this post, I’ll present a few incident scenarios and explore the role of the incident commander in breaking (or simply abiding in) dead air.

    “Any minute now”

    Sylvain (from the [s]upport team) has spun up an incident call after getting 3 consecutive reports of broken CSS on the production app. You are the incident commander, and Oscar (from [o]perations) has immediately volunteered to be the primary investigator. Deanna and Deepak (software [d]evs) are also on the call.

    There’s some ambiguity about whether the CSS issue merits a status page post. Nobody has found a “smoking gun” showing that, for example, 404 errors are happening at an elevated rate. And now Oscar announces, “I’m seeing some log entries from the web server that look a little weird. I’m gonna look at those.” This is the beginning of a 5-minute silence.

    During the silence, Deanna, Deepak, and Sylvain are all waiting, hoping that these log entries that Oscar just noticed turn out to be the smoking gun. They’re putting their eggs in the basket of Oscar’s intuition. Hopefully he’s seen this issue before, and any minute now he’ll say “Okay, I’m pushing a fix.”

    The problem is, it’s equally possible that Oscar has latched onto a red herring (some salient but ultimately irrelevant observation). If there were a conversation in place of silence, then Deanna could be researching the error message, or Deepak might be able to immediately rule out the error as a cause of the broken CSS, or Sylvain could provide a detail from one of the customer tickets that would narrow the scope of the investigation. But instead, everybody’s twiddling their thumbs hoping for Oscar to emerge with a fix.

    An incident commander is responsible for keeping the whole problem-solving effort moving forward. So it’s incumbent on you to interrupt this silence.

    Try drawing more information out of Oscar:

    • “Oscar, do you mind sharing your screen so Deepak and Deanna can see the weird log messages too?”
    • “What’s the error message, Oscar? Can you send a link to a log search?”
    • “Do we know when these log events started? Does that line up with when we started receiving these support tickets, Sylvain?”

    The more we audit each other’s thought processes, the more effective we are at joint problem-solving. An IC must make this happen.

    “LGTM”

    Sylvain has spun up an incident call after getting 3 consecutive reports of broken CSS on the production website. You are the incident commander.

    Oscar has checked a bunch of graph dashboards and hasn’t found any evidence of a widespread system failure. He’s said as much. Now there’s relative silence on the call for five minutes.

    Deanna and Deepak are basically in agreement with Oscar: there’s no evidence of a system health issue. To them, and to Oscar, it’s not really clear how strong a signal Sylvain has. It could just be a coincidence that these three reports all arrived in a row. The engineers on the call are thinking, I guess we’ll keep poking at this, but we’re not even sure this is a real issue. We need more information.

    Sylvain, on the other hand, is positive that something is wrong. Getting 3 support tickets in a row about the same behavior is very strong evidence to him. He’s presented his information to the investigators, and now he’s thinking, Okay, they say it’s not a widespread issue. But I’m sure Oscar is getting to the bottom of it.

    There’s been a common ground breakdown, and as a result, a silence that becomes more and more frustrating.

    As incident commander, you should focus the group’s attention on observable symptoms by asking questions like:

    • “Has anybody been able to reproduce these broken page-loads in a browser? Preferably with Dev Tools turned on?”
    • “Sylvain, I don’t have an intuition for support ticket frequencies. How unusual is it to get 3 reports of the same thing right in a row like this?”
    • “Can we find, in the access logs, just one example of a stylesheet request that returned a non-200 response?” (One way to check is sketched below.)
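
Here’s how that last check might look – a minimal sketch, assuming a combined-format access log at a hypothetical path:

```python
import re

LOG_PATH = "/var/log/nginx/access.log"   # hypothetical path and format
# e.g.: 1.2.3.4 - - [10/Oct/2025:12:00:01 +0000] "GET /assets/app.css HTTP/1.1" 404 153
LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3})')

with open(LOG_PATH) as f:
    for line in f:
        m = LINE_RE.search(line)
        if m and m.group("path").endswith(".css") and m.group("status") != "200":
            print(line.rstrip())   # one concrete failing stylesheet request
            break
    else:
        print("No failing stylesheet requests found in this log.")
```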

    “Let’s see here…”

    Sylvain has spun up an incident call after getting 3 consecutive reports of broken CSS on the production website. You are the incident commander. The investigation has been going along, and Oscar is chasing down a hunch that a particular error message from the web server is related to the stylesheet failures. Deanna is digging into some code to help validate Oscar’s hunch.

    Deepak joins the call. There’s no chatter, as everyone is waiting for Oscar and Deanna to come up with their findings. So Deepak reads the chat scrollback, which takes him about 5 minutes. It’s not until the end of those 5 minutes that Deepak understands what Oscar and Deanna are working on.

    As it happens, Deepak has seen the web server error message in question before. He knows what it means, and he can explain why it’s a red herring. But for the 5 minutes it takes him to get up to speed by reading the chat scrollback, silence persists.

    In order to keep a problem-solving effort moving forward, an incident commander should ensure that every new participant gets up-to-date knowledge of what the group is doing and why. At small scale (less than, say, 10 people on the call), you can do this verbally. For example, you could say to Deepak when he joins the call, “Hi Deepak. Right now, Oscar and Deanna are investigating a web server error message that might be related to failed stylesheet loads. You can see the error message in the chat.”

    When there are more than 10 people, the verbal approach stops working. It becomes necessary to have a shared document of some sort, continuously updated by a “scribe.” It’s not sufficient for this document to be merely a timeline of events: it must highlight the current state of the joint diagnostic effort. I recommend clinical troubleshooting for this.

    “I need 5 minutes”

    When incident response is going right, everybody understands what’s being done by whom, and why. As information comes to light and our strategies evolve, it takes more or less constant communication to maintain this state. That’s why silence on an incident call is so often an indicator of trouble: when there’s silence, communication isn’t happening.

    There is, however, a healthy kind of dead air.

    Sometimes an investigator needs to go silent for a while to chase down a hunch, or collect some data, or research some question. As long as such a silence is negotiated in advance, with a specific time to reconvene, it can serve a crucial purpose. I call this functional dead air.

    It’s the job of the incident commander to ensure that every nontrivial silence is functional. First, communicate what’s being done by whom, and why. Only then, do it.


    I offer incident response training for engineers. Get in touch!