One of the first jobs I took on at HashiCorp was to create a training document for our corps of Incident Commanders. It was a super interesting task, because it gave me an opportunity to synthesize a whole bunch of thoughts I’ve been exposed to during my many years of responding to incidents.
Below is the training document I wrote, mostly unedited. I hope it can be of some use to you, whether you’re defining an incident response policy or just noodling on incident response in general. Enjoy!
So you want to be an Incident Commander.
Our incident response protocol requires that every incident have an Incident Commander. Incident command is a rewarding job. But it’s a job that takes skills that you probably haven’t exercised in your job before – skills that you may not even have recognized as skills!
This document explains what being an Incident Commander entails. It then presents an overview of things that commonly go wrong during an incident, and some strategies for dealing with them.
To be an Incident Commander, you need to do these things:
- Read this document
- Familiarize yourself with the Incident Commander reference sheet (I’ll try to get permission to publish this reference sheet in a future post. It’s basically a step-by-step runbook that walks through some formal procedures)
- Shadow an Incident Commander on a real incident
- Get added to the @inccom Slack group
Once you’ve completed these steps: congratulations! You’re a member of the Incident Command Team. Next time somebody pings the @inccom group, you can go ahead and volunteer. Just remember to refer to the Incident Commander reference sheet, which you can bring up in Slack by typing “start incident” in any channel.
What does an Incident Commander do?
An Incident Commander’s job is to keep the incident moving toward resolution. But an Incident Commander’s job is not to fix the problem.
As Incident Commander, you shouldn’t touch a terminal or search for a graph or kick off a deploy unless you’re absolutely the only person available to do it. This may feel uncomfortable, especially if your background is in engineering. It will probably feel like you’re not doing enough to help. What you need to remember is this: whatever your usual job, when you’re the Incident Commander, your job is to be the Incident Commander.
How are you supposed to keep an incident moving toward resolution without fixing things? Well, incidents move forward when a team of people works together. The Incident Commander’s job is to form that team and keep the team on the same page. It’s a demanding task that will take your full attention, and it comprises three main pieces:
- Populating the incident hierarchy
- Being the ultimate decision maker
- Facilitating the spread of information
We’ll talk about each of these pieces in turn.
Populating the incident hierarchy
In an incident, everybody who’s working on the problem has a specific role. As a corollary, anyone who hasn’t been explicitly assigned to a role should not work on the problem. It’s critical that everyone involved knows what they’re accountable for and who they’re accountable to.
There are four main roles in incident response, which should be filled as soon as possible and remain filled until the incident is closed. As detailed in the Incident Commander reference sheet, the Incident Commander is responsible for keeping an up-to-date record of who holds each main role in the topic of the Slack channel.
The main roles are as follows:
- Incident Commander – you!
- Primary SME (Subject Matter Expert) – in charge of technical investigation into the problem
- External Liaison – in charge of communicating with customers about the incident
- Scribe – responsible for taking notes in Slack about the incident and keeping track of follow-up items
At the very beginning of an incident, you may, as Incident Commander, need to handle more than one of these responsibilities. But at the earliest opportunity, you should assign others to these roles. Or, if it happens that you’re the person with the best shot at fixing the problem, you should assign someone else to be the Incident Commander and make yourself the Primary SME.
When you assign a person to a role, it’s good to be assertive. Instead of asking “Can anyone volunteer to serve as Scribe?” try picking a specific person and saying “<Name>, can you be the Scribe for this incident?”
It’s very common for incident response to require more than just four people. But like we said above, it’s critical that everyone involved know what they’re accountable for and who they’re accountable to. For this reason, whenever someone new wants to start work on the incident, you must either assign them one of the main roles or make them subordinate to someone who already has a role. For example, if the External Liaison has their hands full doing customer communications and someone volunteers to help, you can say something to the effect of “<Name>, you are now a Deputy External Liaison. You report to <External Liaison>. Please confirm.”
Everyone in an incident should look to the Incident Commander for the final word on who’s responsible for what. It’s a part of being the ultimate decision maker.
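To make the bookkeeping concrete, here is a minimal Python sketch of the hierarchy described above. It is an illustration only, not a tool from our actual process: the names IncidentHierarchy, assign, unfilled_main_roles, and channel_topic are hypothetical, as are the people, and the topic format is just one plausible layout. The sketch enforces the rule that anyone outside the four main roles must report to someone who already holds a role.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The four main roles named in this document.
MAIN_ROLES = ["Incident Commander", "Primary SME", "External Liaison", "Scribe"]


@dataclass
class Assignment:
    person: str
    role: str
    reports_to: Optional[str] = None  # role this person is subordinate to, if any


@dataclass
class IncidentHierarchy:
    # Hypothetical helper, not part of any real incident tooling.
    assignments: List[Assignment] = field(default_factory=list)

    def assign(self, person: str, role: str, reports_to: Optional[str] = None) -> Assignment:
        # Anyone outside the main roles must be subordinate to a filled role.
        if role not in MAIN_ROLES and reports_to is None:
            raise ValueError(f"{role!r} is not a main role, so it needs reports_to")
        if reports_to is not None and all(a.role != reports_to for a in self.assignments):
            raise ValueError(f"nobody holds the role {reports_to!r} yet")
        assignment = Assignment(person, role, reports_to)
        self.assignments.append(assignment)
        return assignment

    def unfilled_main_roles(self) -> List[str]:
        filled = {a.role for a in self.assignments}
        return [role for role in MAIN_ROLES if role not in filled]

    def channel_topic(self) -> str:
        # One entry per main role, suitable for pasting into the channel topic.
        parts = []
        for role in MAIN_ROLES:
            holders = [a.person for a in self.assignments if a.role == role]
            parts.append(f"{role}: {', '.join(holders) or 'UNFILLED'}")
        return " | ".join(parts)


hierarchy = IncidentHierarchy()
hierarchy.assign("alice", "Incident Commander")
hierarchy.assign("bob", "Primary SME")
hierarchy.assign("carol", "External Liaison")
hierarchy.assign("dave", "Deputy External Liaison", reports_to="External Liaison")
print(hierarchy.channel_topic())
print(hierarchy.unfilled_main_roles())  # the Scribe role still needs an owner
```

Because channel_topic joins every holder of a role, the same structure also copes with an incident that has more than one Primary SME.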
Flexibility is your prerogative
The predefined incident hierarchy is designed to fit most scenarios. Sometimes, there will be incidents that aren’t well served by it. As the Incident Commander, you have the ability to modify the hierarchy on a case-by-case basis so it best fits the incident at hand. It’s important that your effectiveness is not hampered by a strict adherence to formula. You are empowered to temporarily change the system as you see fit.
For example, it may make sense to have multiple Primary SMEs. If an incident impacts multiple systems, you might want to pick an SME for each one. The procedure for assigning multiple SMEs would be the same as in the case of a single Primary SME, as would their responsibilities. You might want to say, “<Name>, you are now the Primary SME for <Area of Responsibility>. Please confirm.”
Being the ultimate decision maker
When we say the IC is the “ultimate decision maker,” we don’t mean that they’re somehow expected to make better decisions. What we mean is that everyone involved in an incident treats the IC’s decisions as final and binding. What matters is not so much that you always make the correct decision; what matters is that you make a decision.
Having an Incident Commander available to make decisions enables others to act in the way an incident demands. Rather than second-guessing themselves and spending lots of time weighing pros and cons, people can surface important decisions to you. Given the information available, you can then choose what seems like the best path forward. And since your decision is final and binding, the decision itself keeps everyone on the same page.
There’s another, less obvious benefit to the Incident Commander’s role as ultimate decision maker. When people are actively digging into production problems or liaising with customers, they’re constantly building context that others don’t have. In order to have the IC make important decisions, those people have to explain enough of their mental context to make the decision tractable. This is one of the ways we make sure that information gets spread around during an incident.
Facilitating the spread of information
Managing information flow is the single most important responsibility of the Incident Commander.
When we think about information during an incident, we’re usually thinking about the data that comes out of our telemetry systems or the output of commands that we run. That’s the kind of information we tend to spread around most readily. But as the Incident Commander, this concrete, sought-out information is not the only kind of information you need to be concerned about.
There’s also information inside the heads of all the people involved in incident response. Everyone has a different perspective on the incident. And, in general, people don’t know what pieces of the picture they have that others are missing. Therefore, the IC should always be looking for opportunities to get useful context out of people’s heads and into the sphere of shared knowledge.
The key to facilitating the spread of information is to manage signal-to-noise ratio. “Signal” means information that can be used to move incident resolution forward, and “noise” is information that can’t. So when there’s a piece of information that needs to get to somebody, the Incident Commander’s job is to boost the signal and make sure it gets to the right place. Conversely, when there’s an information stream that nobody can use – for example, someone in the channel posting updates on some irrelevant system’s behavior, or someone in the video call asking for duplicate status updates – your job is to suppress that noise.
To put it simply, the IC is responsible for making sure all incident communication channels remain high-signal, low-noise environments.
How incidents get off track
Every incident is different, but it’s useful to know some common ways in which incident response can go astray. When you recognize these anti-patterns, try applying the tactics described below to get the team back on track.
Thematic vagabonding
Perhaps the most common incident response anti-pattern is thematic vagabonding. This is when responders keep moving from one general area of investigation to another. When thematic vagabonding is happening, you’ll notice that:
- Responders look for clues in various places without stating any specific idea about what could be wrong.
- Ideas about the nature of the problem remain vague, like “something could be wrong at the API layer.” There doesn’t seem to be any momentum toward developing those vague ideas into actionable theories.
- It’s hard to follow the Primary SME’s train of thought.
Thematic vagabonding is a source of noise. It generates a lot of information, but that information doesn’t get used in any coherent way.
When you notice thematic vagabonding, a good thing to do is to start asking the Primary SME to elaborate on their motivation for each action they perform. For example, if they say “I’m looking through the database error logs,” you might reply “What did you see that makes you think there’d be database errors?” Challenge them to explain how database issues could cause the problem under investigation, and why database error logs seem likely to lead them to the root cause.
If the thematic vagabonding originates not from the Primary SME but from others, it might be good to redirect those people’s attention to whatever the Primary SME is looking into. For example, if the Primary SME is investigating load balancer anomalies and someone suggests “I’m going to look at recent deploys to see if anything big was changed,” you might say “Before you do that, I want to make sure we’ve got enough eyes on these load balancer anomalies. <Primary SME>, could you use help interpreting the weird log entries you found?”
Tunnel vision
Tunnel vision is, in a way, the opposite of thematic vagabonding. It’s when responders get stuck on a particular idea about what might be wrong, even though that idea is no longer productive. Tunnel vision happens when investigators fail to get a signal that should push them on to the next phase of investigation.
Despite having opposite symptoms from those of thematic vagabonding, tunnel vision can be addressed with a similar approach: asking investigators to elaborate on their motivations. Sometimes simply repeating back their own explanation is all it takes to make them realize that they’re going down a rabbit hole.
Another useful tactic for putting an end to tunnel vision is to get responders to seek out disconfirming evidence. For example, if the Primary SME is stuck on the idea that a particular code change is responsible for the problem under investigation, but that idea doesn’t seem to be bearing fruit, you might ask them “If we wanted to prove that this change was not the cause of the problem we’re seeing, how could we prove that?” By making this conceptual shift, investigators are forced to engage with ideas outside their tunnel, and this will often allow them to start making progress again.
Inconsistent mental models
In order to collaborate effectively, incident responders need to have a shared set of ideas about how the problem under investigation could be caused. These ideas are called hypotheses.
When hypotheses are in short supply or insufficiently communicated, incidents tend to stall out. As Incident Commander, part of your job is to make sure that responders are all on the same page about which hypotheses are being entertained and which hypotheses have already been disproven. In addition, it’s a good idea to keep track of what hypothesis is motivating each investigative action. If you don’t fully understand why the Primary SME is digging into queueing metrics, maybe you should ask them to explain their thought process before continuing.
Sometimes progress on an incident will slow to a halt because there are no clear hypotheses left to investigate. Unless this situation is addressed, the response can devolve into thematic vagabonding or tunnel vision. When you become aware of hypothesis scarcity, it can be useful to call a pause to any active investigation while the group brainstorms new hypotheses. You may get some push-back because it will feel to some like a waste of time. But sometimes, in order to move forward with concrete break-fix work, you need to do some abstract ideation work first.
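If it helps to picture this bookkeeping, here is a rough Python sketch of a hypothesis log. It assumes nothing about our real tooling: HypothesisLog and its methods are hypothetical names, and the example hypothesis is made up. The point is only that every investigative action is filed under the hypothesis that motivates it, and that an empty list of open hypotheses is the cue to pause and brainstorm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    statement: str
    disproven: bool = False
    actions: List[str] = field(default_factory=list)  # work motivated by this idea


class HypothesisLog:
    # Hypothetical helper for illustration only.
    def __init__(self) -> None:
        self.hypotheses: List[Hypothesis] = []

    def propose(self, statement: str) -> Hypothesis:
        hypothesis = Hypothesis(statement)
        self.hypotheses.append(hypothesis)
        return hypothesis

    def record_action(self, hypothesis: Hypothesis, action: str) -> None:
        # Every investigative action is filed under the hypothesis behind it.
        hypothesis.actions.append(action)

    def disprove(self, hypothesis: Hypothesis) -> None:
        hypothesis.disproven = True

    def open_hypotheses(self) -> List[Hypothesis]:
        return [h for h in self.hypotheses if not h.disproven]

    def needs_brainstorm(self) -> bool:
        # Hypothesis scarcity: nothing left that is worth investigating.
        return not self.open_hypotheses()


log = HypothesisLog()
queue = log.propose("Workers are starved because the job queue is backed up")
log.record_action(queue, "Primary SME is checking queueing metrics")
log.disprove(queue)
if log.needs_brainstorm():
    print("No open hypotheses: pause investigation and brainstorm")
```

An action that cannot be attached to any hypothesis in the log is a hint that it is time to ask the responder to explain their thought process.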
Disconnect between IC and Primary SME
The worst incident train wrecks happen when the Incident Commander and the Primary SME get out of sync. Of all the relationships that make up an incident response effort, theirs is the most important. It’s so important, in fact, that we have a process just for ensuring that the relationship between IC and Primary SME stays solid. It’s called the hands-off status update.
At the very beginning of an incident – as soon as IC and Primary SME are both assigned and present in the video call – the IC should ask for a hands-off status update. The “hands off” means that, until the update is over, neither person should be typing or clicking or reading. Both should be focused entirely on communicating with each other.
The hands-off status update consists of five questions:
- Are you ready for a hands-off status update? This question serves as a reminder that the hands-off status update is beginning, and that both the IC and the Primary SME should be focused on it. If the Primary SME says they’re not ready for a hands-off status update, ask them if they can be ready 60 seconds from now.
- What’s your best guess at impact? It won’t always be clear how many customers are affected by the problem under investigation, or how badly the customer experience is disrupted. But it’s always useful for the Primary SME to venture a guess.
- What possible root causes are you thinking about? This question helps obviate thematic vagabonding and tunnel vision. When the Primary SME states their thought process out loud, everyone in the video call – including the Primary SME themself – gets a clearer sense of the path forward.
- What’s your next move? While the possible root causes are still fresh in everyone’s mind, we take an opportunity to establish the next step of problem solving. As the IC, it’s your job to make sure that the Primary SME’s next move makes sense in the context set by the answers to the previous two questions.
- Is there any person you would like brought in? Finally, we give the Primary SME an opportunity to consider whether there are specific individuals whose skills would be useful in moving incident response forward. If they name anyone, you should do your best to bring that person into the incident response channel and the video call and – if possible – assign them an SME role subordinate to the Primary SME.
Once you’re confident that you understand the Primary SME’s answers to all of these questions, the last step of the hands-off status update is to schedule the next hands-off status update. Pick a time between 5 and 20 minutes from now, and tell the Primary SME “I’ll ask you for another hands-off status update in <that many> minutes.” Finally, set a timer to remind yourself. Repeat this cycle until the incident is resolved.
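The cycle above can be sketched in a few lines of Python. This is a hedged illustration rather than anything we run: record_update and next_update_at are hypothetical names, the sample answers are invented, and a real timer would live in whatever reminder tool you prefer. The sketch checks that all five questions got an answer and that the next update lands between 5 and 20 minutes out.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# The five questions of the hands-off status update, in order.
QUESTIONS = [
    "Are you ready for a hands-off status update?",
    "What's your best guess at impact?",
    "What possible root causes are you thinking about?",
    "What's your next move?",
    "Is there any person you would like brought in?",
]

MIN_MINUTES, MAX_MINUTES = 5, 20  # allowed gap before the next update


def next_update_at(now: datetime, minutes: int) -> datetime:
    if not MIN_MINUTES <= minutes <= MAX_MINUTES:
        raise ValueError(f"pick between {MIN_MINUTES} and {MAX_MINUTES} minutes")
    return now + timedelta(minutes=minutes)


def record_update(answers: List[str], now: datetime, minutes: int) -> Dict[str, object]:
    # The update only counts once every question has an answer.
    if len(answers) != len(QUESTIONS) or any(not a.strip() for a in answers):
        raise ValueError("every question needs an answer")
    return {
        "asked_at": now,
        "answers": dict(zip(QUESTIONS, answers)),
        "next_update_at": next_update_at(now, minutes),
    }


# Made-up answers, purely for illustration.
update = record_update(
    [
        "Yes",
        "Roughly a tenth of API requests fail",
        "A bad deploy or a full disk",
        "Roll back the last deploy",
        "Someone from the storage team",
    ],
    now=datetime(2024, 1, 1, 12, 0),
    minutes=10,
)
print(update["next_update_at"])
```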
The Incident Commander reference sheet
To ensure consistent handling of roles and information during incidents, we have defined some standard procedures in the Incident Commander reference sheet. You should review the reference sheet before you sign up to be an Incident Commander. As you read it, remember the three main responsibilities of the IC:
- Populating the incident hierarchy
- Being the ultimate decision maker
- Facilitating the spread of information
Further reading
- 📄 Common Ground and Coordination in Joint Activity (Klein, Feltovich, Bradshaw, Woods 2004). This paper analyzes joint cognition – which is what we do when we work together to resolve incidents – from the perspective of “common ground.” It describes one of the ways in which this “common ground” most frequently falls apart. The authors call this the Fundamental Common Ground Breakdown, and by understanding it and recognizing it, you can become a more effective Incident Commander.
- 🎬 How to Create a Differential Diagnosis. What we do in incident response has a lot in common with what doctors do when they’re trying to make a diagnosis. In both cases, the investigator is faced with a highly complex system, a ticking clock, and a limited arsenal of tactics for obtaining explanations of the system’s behavior. Although this video is targeted at medical students rather than software engineers, Incident Commanders can benefit enormously from learning the principles of differential diagnosis and applying the formalism to incident response.