Fight understanding decay with a rich Incident Summary

It only takes a few off-the-rails incidents in your software career to realize the importance of writing things down. That’s why so many companies’ incident response protocols define a scribe role. The scribe’s job, generally, is to take notes on everything that happens. In other words, the scribe produces an artifact of the response effort.

Scribe is a pretty simple – and therefore often dull – job. Usually, you just listen along and take a series of timestamped notes, like this:

  • 14:56 Incident call convened. Jamie is Incident Commander
  • 14:59 Dan is assigned as scribe
  • 15:00 Jamie posts to status page
  • 15:01 Jamie gets paged about a second service, possibly related
  • 15:06 Taylor joins call, Taylor assigned as Primary Investigator
  • 15:07 Jamie gives status update: two web servers seems to have dropped out of the cluster due to failing health checks; the health checks are failing with error connection timed out to redis-main.lan:6379

This is better than nothing. When new responders join the effort, they can read this timeline to get up to speed. And later, when it’s time to do a post-mortem, these notes can become (the first draft of) the timeline that gets reviewed.

But I teach scribes to create a very different kind of artifact: one that raises up the scribe from essentially a technical stenographer to an active and vital participant in the problem-solving effort.

The decay of understanding

As I’ve noted before on this very blog, if you want to fix a problem in a software system, you first have to build an understanding of the problem. Generally, the problems that can be solved without building understanding have already been eliminated.

Sometimes understanding seems to emerge spontaneously from the facts, like when someone deploys a change and the site goes down. But usually, incident responders have to work together to construct understanding over the course of the response effort. Often this process represents the bulk of the response team’s labor, and consequently, the bulk of the incident’s duration. What’s worse: the whole time you’re trying to build understanding, you have to fight against understanding decay.

As you respond to an incident, your understanding of the situation decays. Because:

  • You forget things you learned earlier.
  • The situation changes out from under you.

And furthermore, you’re not just trying to build your own understanding. You’re working as part of a team of responders who need to build a joint understanding in order to collaborate. Joint understanding suffers from the same sources of decay as individual understanding, along with many more sources:

  • Any two responders will get exposed to different facets of the problem. You’ll look at a different set of graphs, latch onto different error messages, and zoom in on different parts of a trace.
  • Two responders may assign different weights to the same piece of evidence. If you’re familiar with subsystem S, and you see evidence that subsystem S is malfunctioning, this will impact your mental model of the situation more heavily than it will impact that of your teammate, who is more familiar with a different part of the stack.
  • People continuously join and leave the response team. When a participant leaves, she takes her part of the joint understanding with her. When one joins, he needs to spend time “spinning up context” on the effort – and even then, he can at best obtain only an approximation of the understanding shared by the people already on the call.
  • Miscommunication is common, so even if two responders try to synchronize their understanding, their joint understanding will often end up with significant gaps.

A group’s ability to solve problems depends on joint understanding, and joint understanding decays over time. And in a high-pressure, dynamic situation (like an incident), it can decay fast. Unless a group works continuously to preserve and repair its joint understanding, this decay leads predictably to frustration, confusion, and mistakes. The center cannot hold.

There was an attempt (to preserve and repair joint understanding)

This need to preserve and repair joint understanding is the main reason that incident response demands a scribe. The scribe keeps a written artifact, which responders can refer to when they need to remember things they’ve forgotten, or resolve a disagreement about the facts of the case. This artifact also reduces the understanding decay that results from responders leaving the call, since those who newly join can get up to speed by reading it.

The usual kind of scribe artifact, a timeline of notes, is thus a method of maintaining and repairing understanding. And yet, as a tool for that purpose, consider its defects. The understanding encoded by the timeline is stored in “diff” format. Those who wish to come up to speed with the ongoing effort must, starting with their background knowledge, construct their understanding inductively.

This diff-format characteristic introduces 2 problems.

Problem 1: the amount of time that it takes to read through the timeline grows linearly with the timeline’s length. Eager new responders are encouraged to spin up on context by reading the timeline (or reading the chat backscroll, which is just another kind of timeline). But as an investigation goes on, the timeline gets longer and longer, making it more and more cumbersome to maintain joint understanding.

Problem 2 is even more serious. Because any two responders start with different background understandings, they will have a tendency to interpret the same facts differently, potentially ending up at quite different understandings. This is the Fundamental Common Ground Breakdown (link to KFBW paper), and it becomes more and more pernicious as the timeline gets longer.

Taken together, these two deficiencies mean the incident investigations that run the longest will also be the hardest to onboard new responders onto.

The Incident Summary

When it’s up to me, I ask the scribe to focus instead on composing an incident summary. This is a series of bullet points that lives at the top of the document. For example:

  • Since 09:04, users are experiencing increased page-load times. Several have filed support tickets.
  • At 09:04 there was a sharp increase in median web request latency, from ~40ms to ~90ms
  • From about 08:42 we observed a linearly increasing number of row locks in the database
  • We think the row locks are causing the high web request latencies
  • Anita is taking a closer look at the row locks to see where they’re coming from
  • It’s also possible that the web requests are getting slowed down for some other reason, and since they’re slow, they’re holding locks for longer. So Sigmund is investigating the request traces

or:

  • The hourly key-value backup job has failed 3 consecutive times (No customer-facing impact)
  • Starting with the run that kicked off at 18:31, the hourly backup job for the key-value store has been failing to run to completion
  • The job exits suddenly in the middle of copying files to cold storage. It produces no error message before crashing
  • Our best guess is that the job is running out of memory
  • Bradley is checking the server logs for OOMkill messages
  • The cause could also be something like a network timeout (but we think that would be logged, so maybe not)

The Incident Summary should go in its own section at the very top of the scribe document. It should be composed of 5–8 bullet points for a total of 50–150 words. It should cover (roughly in this order):

  1. The impact of the problem (especially with regard to customer experience)
  2. The most important symptoms that have been observed
  3. Our leading hypothesis to explain the symptoms
  4. What actions are being taken and by whom
  5. At least one alternative hypothesis that hasn’t been ruled out.

Why this is so much better

As a tool for maintaining and repairing joint understanding, the Incident Summary has many advantages over the traditional timeline format.

Instead of the current understanding being encoded in “diff” format, it is available to the reader in one quick bite. This eliminates the problem of linearly-increasing context spin-up time. It also serves to place bounds on how different any two responders’ individual understandings can be – since both must coincide with the Summary.

Finally – and most importantly, if you ask me – it forces the response team to discuss their hypotheses and the limits of their certainty. This results in better plans, which means shorter incidents.

Does this mean incident timelines are deprecated?

I don’t think so. There are still many notes worth taking that won’t end up in the Incident Summary, and it can make perfect sense to keep those notes in a timeline format.

However, I do think that the scribe’s primary focus should be keeping the Incident Summary accurate and succinct. If that focus detracts from the completeness of the timeline-formatted notes further down in the document, so be it. In the presence of time pressure and a shifting knowledge base, the Summary matters more.

You can get Leading Incidents, my 3-part course on incident response, for 30% off ($125 87.50) by joining the D2E mailing list now!

Leave a comment