Post-mortems: content over structure

The value of post-mortems is apparent: failures present opportunities to learn about unexpected behaviors of the system, and learning lets us make improvements to the system’s reliability.

The value of post-mortem documents is much less apparent.

Many R&D orgs will insist that the final draft of a post-mortem document have a particular structure. Common components of this structure include:

  • Start and end time of customer impact
  • Time to detection
  • Time to diagnosis
  • Time to recovery
  • A list of action items, each with a link to a ticket
  • Mitigation action items broken down into categories (e.g. “Prevention,” “Detection,” “Impact”)
  • Specific sections (e.g. “Timeline,” “What went wrong?,” “What can we do better?”)
  • Signoffs/approvals

None of these structural requirements facilitate learning. The benefits of post-mortem analysis come not from the document, but rather from the sense-making processes of which the document is an artifact. In order to understand a given failure, we invent hypotheses, test them against our mental model and the observable facts. In order to choose effective strategies for improvement, we converse and debate. And in order to make any of this matter, we establish accountability for seeing those strategies through.

These social processes are the source of the value of post-mortem analysis. The document is just a souvenir.

But what if you want to do meta-analysis? What if you want to analyze trends in incident characteristics over time, or categorize incidents according to some scheme? Don’t you need structure then?

I suppose you do. But good luck getting any useful information. No matter how much structure you insist on, the data set will be hopelessly noisy. Just try to state a hypothesis that can realistically be tested by a meta-analysis of a single organization’s incident data. I don’t think you can.

But what if structure helps me learn?

If structuring the post-mortem process helps you learn, then by all means! categorize! prompt! But recognize structure as a tool rather than an end in itself. Your learning process may benefit from one kind of structure, while somebody else’s may benefit from a different kind of structure, or less or more structure altogether. But the structure of the learning is what matters; not the structure of the document.

Organizational legibility

If you ask me, it’s all about the bureaucratic drive for legibility. Centralized power craves legibility.

A requirement for post-mortem documents to have a specific, consistent structure doesn’t help us learn or improve. So why do we do it?

Idiosyncratic processes like sense-making and learning are illegible to a command-and-control power structure. They come in diverse and intricate forms, instead of the standardized, codified forms that bureaucracy can parse. In service of legibility, a company’s power structure will insist that the post-mortem process culminate in spreadsheet-ready data items like “customer impact duration,” “time to recovery,” “severity level,” and the like. Centralized power demands these simplifications even if they inhibit learning and improvement. To the bureaucracy, legibility itself is the goal.

As an employee, you probably can’t totally disregard these bureaucratic impositions. But, to the extent you can disregard them, you should. Focus on what really matters: learning and improvement.

Outliers carry information. Don’t leave them on the table

Over a decade ago, I saw this talk by John Rauser. Only recently, though, did I come to realize how incredibly influential this talk has been on my career. Gosh what a great talk! You should watch it.

If you operate a complex system, like a SaaS app, you probably have a dashboard showing a few high-level metrics that summarize the system’s overall state. These metrics (“summary statistics”) are essential. They can reveal many kinds of gross changes (both gross “large scale” and gross “ick”) in the system’s state, over many different time scales. Very useful!

But don’t be misled. Summary statistics reveal certain patterns in the system’s behavior, but they are not identical to the system’s behavior. All summary statistics – yes, even distributions – hide information. They’re lossy. It’s easy to get lulled into the sense that, if an anomaly doesn’t show up in the summary statistics, it doesn’t matter. But a complex system’s behavior is not just curves on a plot. It’s a frothing, many-dimensional vector sum of instant-to-instant interactions.

When you investigate an anomaly in summary statistics, you’re faced with a small number of big facts. Average latency jumped by 20% at such-and-such time. Write IOPS doubled. API server queue depth started rising at some later time. Usually, you “zoom in” from there to find patterns that might explain these changes.

When you instead investigate a specific instance of anomalous behavior, you start with a large number of small facts. A request to such-and-such an endpoint with this-and-that parameter took however many seconds and crashed on line 99 of thing_doer.rb. None of these small facts tell you anything about the system’s overall behavior: this is just a single event among millions or billions or more. But, nevertheless: these small facts can be quite illuminating if you zoom out.

First of all, this probably isn’t the only time a crash like this has ever occurred. Maybe it’s happening multiple times a day. Maybe it happened twice as often this week as it did last week. Maybe it’s happening every time a specific customer makes a specific API request. Maybe that customer is fuming.

And second of all, the reason this event caught our eye in the first place was because it was anomalous. It had some extreme characteristic. Take, for example, a request that was served with very high latency. Perhaps, in the specific anomalous case before us, that extreme latency didn’t cause a problem. But how extreme could it get before it did cause a problem? If it took 20 seconds today, could it take 30 seconds next time? When it hits 30, it’ll time out and throw an error. Or, if multiple requests like this all arrived at the same time, could they exhaust some resource and interfere with other requests?

If the only anomalies you investigate are those that show up in summary statistics, then you’ll only find problems that have already gotten bad enough to move those needles. But if you dig into specific instances of anomalous behavior – “outliers” – then you can often find problems earlier, before they become crises.

5 production surprises worth investigating

As an SRE, I’m a vocal believer in following one’s nose: seeking out surprising phenomena and getting to the bottom of them. By adopting this habit, we can find and fix many classes of problems before they turn into incidents. Over time, this makes things run much smoother.

But what is surprising? How do we decide what to ignore and what to chase down?

Here are 5 kinds of surprise that are worth digging into in production data.

Keep in mind that this list is based on one SRE’s experience. Someone else, with different intuitions about system behavior, would make a different list. And that list would be more valuable to them than this list.

The most important thing is not that we always pick the right surprises to investigate. It’s that we make time, consistently, to seek out and explain surprises.

Anyway. Here’s my list.

1. Long running requests

Pick a time window. Sort all the requests during that window descending by duration. At the top of this list, you’ll find classes of requests that spent extraordinarily long in your system.

Were you expecting to see these specific kinds of requests? If not, then congratulations: you’re surprised. And this is quite likely a surprise worth investigating, because

  1. while these slow requests are running, they’re eating up resources (CPU, memory, sockets, and so on), which gives them an outsized potential to degrade the performance of other, unrelated requests.
  2. without intervention, the underlying problem will probably get worse over time. Why wouldn’t it? It has already gotten this bad. And if it gets bad enough, it will eventually cause customer pain.
  3. many of the phenomena contributing to this slowness probably affect a variety of requests. So, if you can make these requests perform better, you may be able to make other requests perform better too.

We don’t have to rigorously analyze and optimize every slow request that goes through the system. Just by coming to understand the issues that cause our slowest requests to be slow, we often find opportunities to make high-impact improvements and mitigate the risk of ending up on an incident call later.

2. Saturation

Saturation is when units of work are waiting in a queue to be processed – be they web requests, or IP packets, or threads, or what have you.

To the extent we’re trying to serve each request as quickly as possible, saturation anywhere is waste. We shouldn’t expect any saturation. Or, at least, we should strive to thoroughly understand every source of it. If we don’t understand it, it will probably get worse over time.

Find every metric that represents a saturation. Put them all on one dashboard. Look at that dashboard every day.

3. Crashes

Every time an exception bubbles up to the top of the stack, that’s unexpected. If it were expected, then the code would have caught and handled the exception.

A tool like Sentry or Datadog Error Tracking (I mention these specific products only because I know them, not because they’re necessarily the best) lets you classify crashes, and see which ones are new or getting more frequent. Granted: it’s often the case that there are far too many classes of crash to investigate all or even most of them. But this shouldn’t stop us from chasing down some of them, according to our intuition about their relative importance.

We have imperfect information until we get our hands dirty, so we won’t always end up picking the highest-impact investigations to undertake. But fixing crashes is always better than ignoring them, and by dividing up the extant crashes into groups and prioritizing one group at a time, we can create some very useful guarantees.

For example, if we find all the crashes that correspond to deadlocks, and fix those, then we guarantee that our system has no deadlocks. Thereafter, problems caused by deadlocks – which tend to be subtle, elusive, and therefore time-consuming to solve – can no longer quietly sow chaos.

4. Utilization spikes

It’s expected for a utilization metric (for example, network bandwidth or memory utilization) to move around a lot. But it’s not expected for utilization to exhibit large spikes. When this happens, it means some recurring event is using more than its fair share of some resource.

When utilization hits 100%, things will degrade catastrophically and immediately. So if you have utilization spikes representing 20% of capacity, then steady-state utilization only has to be 80% before catastrophic degradations start occurring. Or, if the spikes gets worse, and now they’re 30% of capacity, then 70% capacity becomes the new safe limit. We’ll have to waste at least 30% of our capacity by leaving it idle in order to prevent outages.

So why not understand the spikes now, before they get worse?

5. Correlation between latency and throughput

When we see recurring swings in latency metrics, it’s easy to say to ourselves, “Yeah, well, when we’re under more load, things slow down.” But… why? Systems don’t always behave like that! They do it when there’s a bottleneck. If the system were behaving as expected and no resources were exhausted, then the latency graph would be more or less flat.

In other words, if traffic tends to be higher mid-day on weekdays, and latency also tends to rise mid-day on weekdays, then you’re likely saturating at least one resource. Would you rather find that bottleneck is now, or during an incident?

I hope this list is a good place to start if you’re looking to dig up problems. But, ultimately, you must be the judge of what’s surprising to you. Find something you can’t explain. Explain it. Repeat.