If you’re a junior engineer at a software company, you might be required to be on call for the systems your team owns. Which means you’ll eventually be called upon to lead an incident response. And since incidents don’t care what your org chart looks like, fate may place you in charge of your seniors. Even your boss!
That’s a lot of pressure, and they certainly didn’t teach you how to do it in school. You’re still just learning the ropes, and now they expect you to be in charge? During an outage? And tell more senior engineers what to do? It seems wrong and unfair.
But let your inexperience be an advantage!
Incident lead is not a technical role
The incident lead is the person accountable for keeping the response effort moving swiftly forward. That involves a wide variety of activities, of which fixing the problem only represents a subset.
Just like the leader of any team, the incident lead’s main job is to keep all the participants on the same page – in other words, to maintain common ground. It’s common ground that allows a group of individuals to work together as more than just individuals. And you don’t need to be deeply versed in the tech to do that. You just need to ask questions.
Aim to understand the problem just enough to make pretty good decisions. Your decisions don’t have to be perfectly optimal. If the primary SME says something like,
It looks like maybe the Chargeover service is borked.
and you don’t know what the Chargeover service is or why it might be borked: speak up! The Primary SME is already deep in the problem space, so they often won’t think to explain what they mean. And chances are you’re not the only one on the call who needs an explanation. As incident lead, it’s up to you to get clarity – not just for yourself, but for the whole group.
As someone who’s new to the tech stack, you’re perfectly placed to ask fundamental questions. So ask. For example:
What makes you say the Chargeover service is borked? Did you see a graph or logs or something?
I’m not familiar with the Chargeover service – what does it do?
Do you have a hypothesis yet about why it’s borked?
You won’t need to ask a bunch of questions right in a row. Usually one or two are sufficient to jolt an SME out of “fixing mode” and into “explaining mode.” Then you can draw out enough information to build a sufficient understanding of your own, and everyone else on the call, just by listening to your conversation, will come away with an improved shared understanding. The group will develop common ground.
How do you know when your understanding is sufficient? That’s a job for closed-loop communication. As soon as you think you can, repeat back in your own words the following:
The symptoms
The main hypothesis that the SME is entertaining to explain the symptoms
Any other hypotheses in play
What action(s) the SME is planning to take
If you say these things and the SME says, “Yup, that’s right,” then congratulations! You’ve successfully established common ground among incident responders. You’ve done a better, more valuable job than the vast majority of incident leads (even ones who are very experienced engineers). Because you asked fundamental questions and listened.
If you’re looking to raise your incident response game, my 3-part course Leading Incidents is just what you need.
It only takes a few off-the-rails incidents in your software career to realize the importance of writing things down. That’s why so many companies’ incident response protocols define a scribe role. The scribe’s job, generally, is to take notes on everything that happens. In other words, the scribe produces an artifact of the response effort.
Scribe is a pretty simple – and therefore often dull – job. Usually, you just listen along and take a series of timestamped notes, like this:
14:56 Incident call convened. Jamie is Incident Commander
14:59 Dan is assigned as scribe
15:00 Jamie posts to status page
15:01 Jamie gets paged about a second service, possibly related
15:06 Taylor joins call, Taylor assigned as Primary Investigator
15:07 Jamie gives status update: two web servers seem to have dropped out of the cluster due to failing health checks; the health checks are failing with the error “connection timed out to redis-main.lan:6379”
This is better than nothing. When new responders join the effort, they can read this timeline to get up to speed. And later, when it’s time to do a post-mortem, these notes can become (the first draft of) the timeline that gets reviewed.
But I teach scribes to create a very different kind of artifact: one that raises up the scribe from essentially a technical stenographer to an active and vital participant in the problem-solving effort.
The decay of understanding
As I’ve noted before on this very blog, if you want to fix a problem in a software system, you first have to build an understanding of the problem. Generally, the problems that can be solved without building understanding have already been eliminated.
Sometimes understanding seems to emerge spontaneously from the facts, like when someone deploys a change and the site goes down. But usually, incident responders have to work together to construct understanding over the course of the response effort. Often this process represents the bulk of the response team’s labor, and consequently, the bulk of the incident’s duration. What’s worse: the whole time you’re trying to build understanding, you have to fight against understanding decay.
As you respond to an incident, your understanding of the situation decays. Because:
You forget things you learned earlier.
The situation changes out from under you.
And furthermore, you’re not just trying to build your own understanding. You’re working as part of a team of responders who need to build a joint understanding in order to collaborate. Joint understanding suffers from the same sources of decay as individual understanding, along with many more:
Any two responders will get exposed to different facets of the problem. You’ll look at a different set of graphs, latch onto different error messages, and zoom in on different parts of a trace.
Two responders may assign different weights to the same piece of evidence. If you’re familiar with subsystem S, and you see evidence that subsystem S is malfunctioning, this will impact your mental model of the situation more heavily than it will impact that of your teammate, who is more familiar with a different part of the stack.
People continuously join and leave the response team. When a participant leaves, she takes her part of the joint understanding with her. When one joins, he needs to spend time “spinning up context” on the effort – and even then, he can at best obtain only an approximation of the understanding shared by the people already on the call.
Miscommunication is common, so even if two responders try to synchronize their understanding, their joint understanding will often end up with significant gaps.
A group’s ability to solve problems depends on joint understanding, and joint understanding decays over time. And in a high-pressure, dynamic situation (like an incident), it can decay fast. Unless a group works continuously to preserve and repair its joint understanding, this decay leads predictably to frustration, confusion, and mistakes. The center cannot hold.
There was an attempt (to preserve and repair joint understanding)
This need to preserve and repair joint understanding is the main reason that incident response demands a scribe. The scribe keeps a written artifact, which responders can refer to when they need to remember things they’ve forgotten, or resolve a disagreement about the facts of the case. This artifact also reduces the understanding decay that results from responders leaving the call, since those who newly join can get up to speed by reading it.
The usual kind of scribe artifact, a timeline of notes, is thus a method of maintaining and repairing understanding. And yet, as a tool for that purpose, consider its defects. The understanding encoded by the timeline is stored in “diff” format: each entry records a change to the picture, never the whole picture. Those who wish to come up to speed with the ongoing effort must start from their own background knowledge and replay the entries one by one, constructing their understanding inductively.
This diff-format characteristic introduces two problems.
Problem 1: the amount of time that it takes to read through the timeline grows linearly with the timeline’s length. Eager new responders are encouraged to spin up on context by reading the timeline (or reading the chat backscroll, which is just another kind of timeline). But as an investigation goes on, the timeline gets longer and longer, making it more and more cumbersome to maintain joint understanding.
Problem 2 is even more serious. Because any two responders start with different background understandings, they will have a tendency to interpret the same facts differently, potentially ending up at quite different understandings. This is the Fundamental Common Ground Breakdown (link to KFBW paper), and it becomes more and more pernicious as the timeline gets longer.
Taken together, these two deficiencies mean the incident investigations that run the longest will also be the hardest to onboard new responders onto.
The Incident Summary
When it’s up to me, I ask the scribe to focus instead on composing an Incident Summary. This is a series of bullet points that lives at the top of the document. For example:
Since 09:04, users are experiencing increased page-load times. Several have filed support tickets.
At 09:04 there was a sharp increase in median web request latency, from ~40ms to ~90ms
From about 08:42 we observed a linearly increasing number of row locks in the database
We think the row locks are causing the high web request latencies
Anita is taking a closer look at the row locks to see where they’re coming from
It’s also possible that the web requests are getting slowed down for some other reason, and since they’re slow, they’re holding locks for longer. So Sigmund is investigating the request traces
or:
The hourly key-value backup job has failed 3 consecutive times (No customer-facing impact)
Starting with the run that kicked off at 18:31, the hourly backup job for the key-value store has been failing to run to completion
The job exits suddenly in the middle of copying files to cold storage. It produces no error message before crashing
Our best guess is that the job is running out of memory
Bradley is checking the server logs for OOMkill messages
The cause could also be something like a network timeout (but we think that would be logged, so maybe not)
The Incident Summary should go in its own section at the very top of the scribe document. It should be composed of 5–8 bullet points for a total of 50–150 words. It should cover (roughly in this order):
The impact of the problem (especially with regard to customer experience)
The most important symptoms that have been observed
Our leading hypothesis to explain the symptoms
What actions are being taken and by whom
At least one alternative hypothesis that hasn’t been ruled out
Why this is so much better
As a tool for maintaining and repairing joint understanding, the Incident Summary has many advantages over the traditional timeline format.
Instead of the current understanding being encoded in “diff” format, it is available to the reader in one quick bite. This eliminates the problem of linearly increasing context spin-up time. It also places bounds on how far any two responders’ individual understandings can drift apart – since each must stay consistent with the Summary.
Finally – and most importantly, if you ask me – it forces the response team to discuss their hypotheses and the limits of their certainty. This results in better plans, which means shorter incidents.
Does this mean incident timelines are deprecated?
I don’t think so. There are still many notes worth taking that won’t end up in the Incident Summary, and it can make perfect sense to keep those notes in a timeline format.
However, I do think that the scribe’s primary focus should be keeping the Incident Summary accurate and succinct. If that focus detracts from the completeness of the timeline-formatted notes further down in the document, so be it. In the presence of time pressure and a shifting knowledge base, the Summary matters more.