Garden-path incidents

Barb’s story

It’s 12 noon on a Minneapolis Wednesday, which means Barb can be found at Quang. As the waiter sets down Barb’s usual order (#307, the Bun Chay, extra spicy), Barb’s nostrils catch the heavenly aroma of peanuts and scallions and red chiles. A wave of calm moves through her. Barb pulls her chair forward, cracks apart her wooden chopsticks, and…her pager goes off.

After cursing under her breath, she dutifully reads the message:

Error rate for `environment:production` exceeds 100 msg/s

Welp.

Barb grabs one quick bite of spring roll as she flags down the waiter for a to-go box. Opening Slack on her phone, she declares an incident, joins the Zoom call, and hurries back up Nicollet Ave. and around the corner, toward her apartment.

Five minutes later, finally sitting at her laptop, Barb is pretty sure she knows what the problem is. The error flooding the logs is:

object 'nil' has no method 'reimport!'

That looks like a straightforward app bug to her, and wouldn’t you know it? Right before these errors started cropping up, there was a deploy to the Rails app by a newish dev named Alice, who, according to her Slack profile, is based in Seattle. Barb asks this ‘Alice’ to join the incident Zoom.

– Hi, this is Alice. What’d I do?
– Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. Can you revert that change please?
– Sure, no problem. I’ll put together the revert PR now.

5 minutes later, Alice’s PR is approved. Alice clicks “Merge.” The pair begin the anxious but familiar 15-minute wait for CI to pass, all the while greeting and informing the bewildered latecomers who straggle into the call.

Alice’s story

Alice stares blankly at the white rectangle on her monitor. She spent her first hour getting yesterday’s frontend bugfixes pushed out, and now it’s time to start her PowerPoint. She’ll be working on this PowerPoint for the rest of the morning, probably through lunch, and all afternoon.

Alice shuts her eyes and heaves a dismal sigh. Alice fucking hates PowerPoint. But she can’t put it off anymore. So she dons her headphones, cracks her knuckles, and… gets an urgent Slack message:

Morning, Alice – we’ve got a production incident involving a spike of errors, and it looks like it coincides with a deploy of yours. Can you jump on https://zoom.globocorp.co/z/123456789… when you have a moment please?

As she waits for Zoom to load, Alice feels something almost like relief. At least she doesn’t have to work on that goddamn PowerPoint yet.

– Hi, this is Alice. What’d I do?
– Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. Can you revert that change please?
– Sure, no problem. I’ll put together the revert PR now.

Alice quickly whips up that PR and gets it approved. She spends the next 15 minutes waiting for CI to pass, while absent-mindedly writing the first slide of her PowerPoint. By the time the tests are green, she has typed out and deleted 4 different titles.

The real story

This incident seems to have gone about as well as it could, considering. Alice was on the call within 7 minutes of the alert, and a PR was ready 5 minutes later. It would be great if CI were faster, or even better if CI could be skipped for a revert. They’ll talk about that at the post-mortem.

However, nobody in the call yet knows what really happened. What really happened is this:

    • Alice’s 16:46 UTC deploy was the first to pick up the latest Docker image.
    • The new Docker image includes an update to a software dependency.
    • The updated dependency has a bug that only shows up in production.

    But instead of knowing any of that, Alice and Barb are sitting here for 15 minutes waiting for CI to run, so they can deploy a fix that won’t even work.

    This is a garden-path incident. Barb has what she feels is a strong signal from the telemetry, which points toward a bug in Alice’s code. Alice has what she feels is a strong signal, which is that Barb seems very confident in her conclusion. But they’ve been led up the garden path, and as a consequence, this incident will run longer than it needs to.

    How this could all have been avoided

    Imagine instead that Barb and Alice are both in the habit of saying their hypotheses out loud.

    When Alice joins the call, Barb instead says:

    – Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. My hypothesis is that your change triggered this spike of errors. Can you revert the change please?

    Instead of letting Alice infer that the cause of the error spike is already known to be her deploy, Barb acknowledges the limits of her certainty. She has a hypothesis, not a definitive diagnosis. This gives Alice the opportunity to respond with something like:

    – Well, are the errors from the backend or the frontend? Because my change was frontend-only.

    And just like that, Alice and Barb have stepped back from the garden path. Instead of waiting around for a useless CI cycle, they can continue straight away with diagnosis.

    Note that, even if Barb doesn’t state her hypothesis, things will still be okay as long as Alice does:

    – Hi, this is Alice. What’d I do?
    – Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. Can you revert that change please?
    – Sure, no problem. I’ll put together the revert PR now. Just to be clear, the hypothesis is that my frontend-only changeset is somehow causing these nil-has-no-method errors in the backend?
    – Uh, did you say frontend-only?

    Again, Barb and Alice have gotten themselves off the garden path. Which means this technique – stating your hypothesis and asking for rule-outs – is something you can do unilaterally starting today to make your team better at troubleshooting.

    Another thing you can do to make your team better at troubleshooting is employ Clinical Troubleshooting against your next head-scratcher of a bug.

    Explaining the fire

    When your site goes down, it’s all hands on deck. A cross-functional team must assemble fast and pursue an organized response. It feels like fighting a fire. So it’s not surprising that formal IT incident management tends to borrow heavily from the discipline of firefighting.

    However, in software incident response, we have a crucial constraint that you won’t find in firefighting. Namely, in order to fix a software product, we first have to understand why it’s broken.

    When the firefighters arrive at the blazing building, they don’t need to explain the fire. They need to put it out. It doesn’t matter whether a toaster malfunctioned, or a cat knocked over a candle, or a smoker fell asleep watching The Voice. The immediate job is the same: get people to safety and put out the fire.

    But when PagerDuty blows up and we all stumble into the incident call, we need at least a vague hypothesis. Without one, we can’t even start fixing the problem. What should we do? Reboot one of the web servers? Which one? Should we revert the last deploy? Should we scale up the database? Flush the CDN? Open a support ticket with Azure? Just wait?

    We can’t act until we have at least some explanation for how the outage came about.

    Often, the process of diagnosis – of explaining the failure – takes up the majority of the incident. Diagnosis isn’t easy, especially in a group and under pressure. Important facts go ignored. Hypotheses get forgotten, or remain unchallenged in the face of new information. Action items fall through the cracks. Diagnostic disconnects like these add up to longer outages, noisier public-facing comms, and repeat failures.

    And yet, when we look to improve IT incident response, what do we usually focus on? On-call rotations, status page updates, command-and-control structure. Sliding-down-the-firepole, radioing-with-dispatch type stuff.

    In software incident response, we need to maintain a coherent diagnostic strategy in the face of scarce information and severe time pressure. This makes us, on one dimension at least, more like doctors than firefighters. This is one of the reasons that engineering teams find immense value in clinical troubleshooting. It brings rigor and transparency to the joint diagnostic effort.

    I teach clinical troubleshooting as part of Scientific Incident Response in 4 Days. Check it out.

    Incident, Inçident, Incidënt

    When you deploy broken code, it may cause an incident. Then you’ll have to declare an incident. And don’t forget to create an incident so customers can stay informed!

    “Incident” can mean a bunch of things, and these meanings are only partly distinguishable through context. When the word “incident” is used ambiguously, confusion can emerge.

    The effort to fix the problem

    In many organizations, the organized effort to respond to a problem is called an “incident.” For example,

    At 08:40 UTC, engineer Geordi La Forge declared an incident after observing high API latencies in production. The incident was led by Beverly Crusher, and it was resolved after 32 minutes.

    We’ll call this usage an incident-effort. An incident-effort need not correspond to an outage. You can create one for any problem on which you want multiple sets of eyes. For example, if deployment to the staging environment is borked, or if disk utilization is growing too rapidly on a database server, you might declare an incident(-effort).

    The public thread of communication

    When stakeholders need to be informed of a problem – often, these stakeholders are customers – we’re expected to “create an incident.” This kind of “incident” (let’s call it an incident-post) is usually a post on a publicly-facing website, describing the problem and the current state of the response to it.

    An incident-effort may exist without an incident-post. An incident-post may also exist without an incident-effort (for example, if an incident is used to communicate about a planned maintenance window).

    The problem itself

    An incident on the DataDog status page.

    Finally, there is the problem itself: that which is to be fixed, or communicated about.

    There are many problems that don’t merit an incident-effort, let alone an incident-post. We usually just call such problems “bugs” or “risks.”

    When someone says “We’ve had too many incidents this quarter,” they’re probably talking about problems that caused downtime. But they might be talking about incident-posts themselves, depending on how they’re looking at things. It pays to clarify.

    Never being one to shy away from an opportunity to make a Venn Diagram, I’ll leave you with a few common phrases around incident management:

    I offer incident response training, remote or on-site. Get in touch.

    Dead air on the incident call

    When troubleshooting a high-impact software failure in a group, you have to be ready for shifts in tenor. One moment there’s a frenzy of coordination, and the next: absolute silence.

    Silence is natural and often useful. But to be an effective incident commander – whose job is to keep the problem-solving effort moving forward – you must develop a keen ear for silence.

    Silence can mean different things to different people in different situations. In this post, I’ll present a few incident scenarios and explore the role of the incident commander in breaking (or simply abiding in) dead air.

    “Any minute now”

    Sylvain (from the [s]upport team) has spun up an incident call after getting 3 consecutive reports of broken CSS on the production app. You are the incident commander, and Oscar (from [o]perations) has immediately volunteered to be the primary investigator. Deanna and Deepak (software [d]evs) are also on the call.

    There’s some ambiguity about whether the CSS issue merits a status page post. Nobody has found a “smoking gun” showing that, for example, 404 errors are happening at an elevated rate. And now Oscar announces, “I’m seeing some log entries from the web server that look a little weird. I’m gonna look at those.” This is the beginning of a 5-minute silence.

    During the silence, Deanna, Deepak, and Sylvain are all waiting, hoping that these log entries that Oscar just noticed turn out to be the smoking gun. They’re putting their eggs in the basket of Oscar’s intuition. Hopefully he’s seen this issue before, and any minute now he’ll say “Okay, I’m pushing a fix.”

    The problem is, it’s equally possible that Oscar has latched onto a red herring (some salient but ultimately irrelevant observation). If there were a conversation in place of silence, then Deanna could be researching the error message, or Deepak might be able to immediately rule out the error as a cause of the broken CSS, or Sylvain could provide a detail from one of the customer tickets that would narrow the scope of the investigation. But instead, everybody’s twiddling their thumbs hoping for Oscar to emerge with a fix.

    An incident commander is responsible for keeping the whole problem-solving effort moving forward. So it’s incumbent on you to interrupt this silence.

    Try drawing more information out of Oscar:

    • “Oscar, do you mind sharing your screen so Deepak and Deanna can see the weird log messages too?”
    • “What’s the error message, Oscar? Can you send a link to a log search?”
    • “Do we know when these log events started? Does that line up with when we started receiving these support tickets, Sylvain?”

    The more we audit each other’s thought processes, the more effective we are at joint problem-solving. An IC must make this happen.

    “LGTM”

    Sylvain has spun up an incident call after getting 3 consecutive reports of broken CSS on the production website. You are the incident commander.

    Oscar has checked a bunch of graph dashboards and hasn’t found any evidence of a widespread system failure. He’s said as much. Now there’s relative silence on the call for five minutes.

    Deanna and Deepak are basically in agreement with Oscar: there’s no evidence of a system health issue. To them, and to Oscar, it’s not really clear how strong a signal Sylvain has. It could just be a coincidence that these three reports all arrived in a row. The engineers on the call are thinking, I guess we’ll keep poking at this, but we’re not even sure this is a real issue. We need more information.

    Sylvain, on the other hand, is positive that something is wrong. Getting 3 support tickets in a row about the same behavior is very strong evidence to him. He’s presented his information to the investigators, and now he’s thinking, Okay, they say it’s not a widespread issue. But I’m sure Oscar is getting to the bottom of it.

    There’s been a common ground breakdown, and as a result, a silence that becomes more and more frustrating.

    As incident commander, you should focus the group’s attention on observable symptoms by asking questions like:

    • “Has anybody been able to reproduce these broken page-loads in a browser? Preferably with Dev Tools turned on?”
    • “Sylvain, I don’t have an intuition for support ticket frequencies. How unusual is it to get 3 reports of the same thing right in a row like this?”
    • “Can we find, in the access logs, just one example of a stylesheet request that returned a non-200 response?”

    “Let’s see here…”

    Sylvain has spun up an incident call after getting 3 consecutive reports of broken CSS on the production website. You are the incident commander. The investigation has been going along, and Oscar is chasing down a hunch that a particular error message from the web server is related to the stylesheet failures. Deanna is digging into some code to help validate Oscar’s hunch.

    Deepak joins the call. There’s no chatter, as everyone is waiting for Oscar and Deanna to come up with their findings. So Deepak reads the chat scrollback, which takes him about 5 minutes. It’s not until the end of those 5 minutes that Deepak understands what Oscar and Deanna are working on.

    As it happens, Deepak has seen the web server error message in question before. He knows what it means, and he can explain why it’s a red herring. But for the 5 minutes it takes him to get up to speed by reading the chat scrollback, silence persists.

    In order to keep a problem-solving effort moving forward, an incident commander should ensure that every new participant gets up-to-date knowledge of what the group is doing and why. At small scale (less than, say, 10 people on the call), you can do this verbally. For example, you could say to Deepak when he joins the call, “Hi Deepak. Right now, Oscar and Deanna are investigating a web server error message that might be related to failed stylesheet loads. You can see the error message in the chat.”

    When there are more than 10 people, the verbal approach stops working. It becomes necessary to have a shared document of some sort, continuously updated by a “scribe.” It’s not sufficient for this document to be merely a timeline of events: it must highlight the current state of the joint diagnostic effort. I recommend clinical troubleshooting for this.

    “I need 5 minutes”

    When incident response is going right, everybody understands what’s being done by whom, and why. As information comes to light and our strategies evolve, it takes more or less constant communication to maintain this state. That’s why silence on an incident call is so often an indicator of trouble: when there’s silence, communication isn’t happening.

    There is, however, a healthy kind of dead air.

    Sometimes an investigator needs to go silent for a while to chase down a hunch, or collect some data, or research some question. As long as such a silence is negotiated in advance, with a specific time to reconvene, it can serve a crucial purpose. I call this functional dead air.

    It’s the job of the incident commander to ensure that every nontrivial silence is functional. First, communicate what’s being done by whom, and why. Only then, do it.


    I offer incident response training for engineers. Get in touch!

    Clinical troubleshooting: diagnose any production issue, fast.

    Over my career as an SRE, I’ve diagnosed thousands of production issues. When I’m lucky, I have enough information at hand, and enough expertise in the systems involved, to get to the bottom of the problem on my own. But very often I need to bring together a team.

    Troubleshooting with a team unleashes enormous power. Problems that would have taken me days to solve as an individual might take only hours or minutes, thanks to the benefit of pooled knowledge.

    However, collaborative troubleshooting also comes with dangers. Time and again, I’ve seen groups struggle to make progress on an investigation due to miscommunication, misalignment, and confusion. Among other difficulties, the fundamental common ground breakdown can have especially heinous consequences in this context.

    Over the years, I’ve developed a reliable method for harnessing the diagnostic power of groups. My approach is derived from a different field in which groups of experts with various levels of context need to reason together about problems in a complex, dynamic system: medicine.

    I call this method clinical troubleshooting.

    The clinical troubleshooting process

    Although clinical troubleshooting can be useful in solo troubleshooting, it really shines as a group activity. It’s a lightweight structure that always adds value. I recommend reaching for clinical troubleshooting as soon as you need to involve another person in any “why” question about an unexpected behavior of your system.

    Step 0: Get a group together

    Before you start, gather the people you’ll be troubleshooting with. Any synchronous communication channel can work for this: Slack, Zoom, a meeting room; you name it.

    You don’t need a big group. In fact, a small group is best. What matters most is that you bring together diverse perspectives. If you’re a backend engineer, try to pull in a network engineer and a frontend engineer, or a support agent and a sysadmin. Cast a wide net.

    Once you have an initial group, share a blank Google doc with everyone.

    Step 1: Identify symptoms

    Add a Symptoms header to the doc.

    You’re the one bringing the issue, so you must have some observations already. Write those down in a numbered list.

    It’s important that it be a numbered list rather than a bulleted list. As the troubleshooting process goes on, you’re going to want to refer to individual symptoms (and, later, hypotheses and actions). If each symptom has a number and the number never changes, this is a lot easier.

    Ask your collaborators to list symptoms, too. They may have observed some other facet of the problem, or they may think to look at a graph that you don’t know about.

    Here’s what an initial symptom list might look like:

    Symptoms

    1. About 5 times a day, the Storage API sends a spike of 503 responses. Each spike lasts about 500 milliseconds and includes between 200 and 1000 responses (about 0.1 to 0.5% of all responses sent during the interval).
    2. Outside of these spikes, the Storage API has not sent any 503 responses at all in the last 14 days.
    3. The failing requests have short durations, around the same as those of normal requests (mostly under 100 milliseconds).

    (In this and subsequent examples, don’t worry about understanding the exact technical situation. Clinical troubleshooting can be used on problems in any part of any tech stack.)

    All the symptoms on the list should be precise and objective. In other words, if a statement is quantifiable, quantify it. Don’t make suppositions yet about why these phenomena have been observed. That comes next.

    Once you’re all on the same page about what problem you’re investigating, the initial symptom list is done.

    Step 2: Brainstorm hypotheses

    Add a Hypotheses header to the doc. Invite your colleagues to join you in suggesting hypotheses that might explain the symptoms.

    Let the ideas flow, and write them all down. This is where having a diverse set of perspectives in the room really pays off. Your co-investigators will think of hypotheses that would never have occurred to you, and vice versa. The more of these you come up with, the more likely the actual explanation will be on the list.

    A hypothesis can’t be just anything, though. A hypothesis must

    • explain (at least some of) the symptoms,
    • accord with all known facts, and
    • be falsifiable (that is: if it were false, we’d be able somehow to prove it false).

    For example, given the symptoms above, “requests to the storage API are getting queued up behind a long-running query” would not be a sound hypothesis, since it’s inconsistent with Symptom 3. If requests were queued up, we’d expect them to take longer before failing.

    After some discussion, your hypothesis list might look like this:

    Hypotheses

    1. A particular request causes an out-of-memory (OOM) event on a storage server, and all in-flight requests to that server get killed.
    2. A network hiccup causes connections between the load balancer and a storage server to be severed.
    3. Requests to the storage API are getting queued up behind a long-running query
      1. Discarded because inconsistent with Symptom 3
    4. A network hiccup causes connections between storage API servers and a persistence layer node to be severed.

    Again, use a numbered list. If a hypothesis is ruled out or deemed irrelevant, don’t delete it: you don’t want the list numbering to change. Instead, mark it with some formatting. I use strikethrough. Gray works too.

    Step 3: Select actions

    Add an Actions header.

    In a new numbered list, choose one or two actions that will advance the troubleshooting effort. Usually, you should pick actions that will rule out, or “falsify,” one or more of the hypotheses on the table.

    To rule out Hypothesis 2 above, for instance, you could review the logs for one of the error spikes and check whether all the affected requests were associated with the same load balancer or the same storage server. If the requests are distributed across your infrastructure, then Hypothesis 2 is ruled out (and Hypothesis 1 as well, for that matter!).
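
    To make this concrete, here is a minimal sketch of such a rule-out check, assuming the access-log records for one spike have been exported with hypothetical status, load_balancer, and storage_server fields (these names are illustrative, not from the example above):

        from collections import Counter

        def spread_of_spike(records):
            """Count distinct load balancers and storage servers among the 503s in one spike."""
            failed = [r for r in records if r["status"] == 503]
            return (
                Counter(r["load_balancer"] for r in failed),
                Counter(r["storage_server"] for r in failed),
            )

        # Made-up records from a single error spike:
        spike = [
            {"status": 503, "load_balancer": "lb-1", "storage_server": "store-3"},
            {"status": 503, "load_balancer": "lb-2", "storage_server": "store-7"},
            {"status": 200, "load_balancer": "lb-1", "storage_server": "store-3"},
        ]
        balancers, servers = spread_of_spike(spike)
        print(balancers, servers)
        # If the 503s span many balancers and servers, neither a single severed connection
        # (Hypothesis 2) nor a single OOM-ing server (Hypothesis 1) can explain them.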

    When you agree upon actions, it’s best to assign them to individuals. Sometimes an action can be taken right away, and other times it’ll take a while and the group will have to reconvene later. But ownership should never be unclear.

    Ruling out hypotheses isn’t the only purpose of actions in clinical troubleshooting. You can also assign actions that expand the group’s understanding of the problem, in order to generate new symptoms and new hypotheses. These actions can be things like, “Read the documentation on system X’s network configuration,” or “Search for blog posts about error message E.” As long as there’s at least one hypothesis in the running, though, there ought to be at least one action in flight that could falsify it. That’s one of the ways clinical troubleshooting ensures constant progress.

    Steps 4 through N: Cycle back through

    When actions are completed, you get more information for the symptom list. More symptoms suggest new hypotheses. New hypotheses imply further actions. Just keep going through the cycle until you’re satisfied.

    Sometimes you won’t be satisfied until you have a definitive diagnosis: a solid explanation for all the symptoms that’s been proven right. Other times, you’ll be satisfied as soon as you take an action that makes the problem go away, even if there’s still uncertainty about what exactly was going on.

    In any case, clinical troubleshooting will reliably get you to the solution.

    Keep things moving forward

    In the absence of structure, collaborative diagnosis can stall out. Or worse, go backward.

    With clinical troubleshooting, there’s always a next step forward. Teams that practice this method will consistently get to the bottom of technical mysteries, even when strapped for data or under intense pressure. And over time, as this scientific way of thinking becomes a habit, and then a culture, we come to understand the behavior of our system that much better.

    I can teach your team how to do this. Get in touch.

    3 questions that will make you a phenomenal rubber duck

    As a Postgres reliability consultant and SRE, I’ve spent many hours being a rubber duck. Now I outperform even the incisive bath toy.

    “Rubber duck debugging” is a widespread, tongue-in-cheek term for the practice of explaining, out-loud, a difficult problem that you’re stumped on. Often, just by putting our troubles into words, we suddenly discover insights that unlock progress. The person we’re speaking to could just as well be an inanimate object, like a rubber duck. Hence the term.

    Rubber ducks are great, but a human can add even more value. In this article, I’ll share my 3 favorite questions to ask when someone comes to me feeling stumped in a troubleshooting endeavor. These questions work even when you have no particular expertise in the problem domain. Master them, and you’ll quickly start gaining a reputation as the person to talk to when you’re stuck. This is a great reputation to have!

    Question 1: How did you first start investigating this?

    As we investigate a problem, our focus shifts from one thing to another to another. We go down one path and forget about others. We zoom in on details and neglect to zoom back out. It’s easy to lose perspective.

    “How did you first start investigating this?” works well because, through the act of recounting their journey from initial observation to where they are now, your colleague will often regain perspective they’ve lost along the way. And by asking this particular question, you avoid having to suggest that they may have lost perspective – which could make them defensive.

    Even if your colleague hasn’t lost perspective, hearing the story of the investigation so far will help you ask better questions and help them organize their thoughts.

    Question 2: What observations have you made?

    In troubleshooting a complex problem, it’s easy to forget what you already know. As you go along, you make lots of observations, small and large, interesting and boring, relevant and irrelevant. You can’t hold them all in your head.

    When someone’s stuck, it often helps to review their observations. Not theories, not difficulties, not actions: directly observed facts.

    Reviewing observations can help in a few different ways:

    • They may be entertaining a hypothesis that clashes with some previously learned (but since forgotten) fact. If so, they can now go ahead and discard that hypothesis.
    • Juxtaposing two observations may suggest a hypothesis that never occurred to them before, because they never held those two observations in their head simultaneously.
    • Listing out their observations may bring to mind something they haven’t looked at yet.

    As your colleague recounts their observations, write them down in a numbered list. And, if you can, ask clarifying questions. Questions like “Does X always happen concurrently with Y, or only sometimes?” and “How does this differ from the normal behavior?”

    Never underestimate the power of precisely stating the facts.

    Question 3: If your hypothesis were wrong, how could we disprove it?

    This question is my favorite.

    One of the most common ways people get stuck in troubleshooting is tunnel vision. They get a single idea in their head about the cause of the problem, and that becomes all they can think about.

    This question, “If your hypothesis were wrong, how could we disprove it?” flips the script. Instead of racking their brain trying to prove their theory, it gets them thinking about other possibilities. Asking this question can lead to lots of different outcomes, all of which represent progress:

    1. You come up with a way to disprove the hypothesis, and successfully disprove it. This may make your colleague sad for a few hours, but when they come back to the problem, they’ll make huge strides.
    2. You come up with a way to disprove the hypothesis, but fail to disprove it. The hypothesis is thus bolstered, and the next step becomes clear: elaborate a few different versions of it and try to disprove those.
    3. You can’t think of any way to disprove it. This means it’s probably not a hypothesis at all, since it’s not falsifiable. Therefore you must replace it with a new hypothesis. This may feel like a setback, but it’s really the only way forward.

    How it fits together

    Under the hood, these 3 questions are just different ways of invoking hypothetico-deductive reasoning, which I’ve written about previously (see Troubleshooting On A Distributed Team Without Losing Common Ground and You Know Who’s Smart? Friggin’ Doctors, Man.). I don’t know of any better way to achieve consistent problem-solving results in the face of complexity.

    If you’re interested in learning how to apply these techniques in your career or in your organization, I can help. Shoot me an email!

    It’s fine to use names in post-mortems

    Sometimes, during an incident post-mortem, we need to talk about the actions of a specific person. It’s awkward. We want to be blameless, right? If we say, “Steve rebooted the wrong server,” aren’t we undermining the principle of blamelessness? Shouldn’t we say, “An engineer rebooted the wrong server” instead?

    This tension is rooted in a misunderstanding of blamelessness. The purpose of the blameless post-mortem is not to make everyone feel comfortable. Discomfort can be healthy and useful. The purpose of the blameless post-mortem is to let us find explanations deeper than human error.

    If we say, “This outage happened because Steve rebooted the wrong server,” then we make Steve the sole agent in the narrative. By this logic, the only way to avoid more future outages is for Steve to do better (or get fired). This is the kind of shallow analysis that blameless post-mortems help us avoid.

    But it’s okay to say, “Steve rebooted the wrong server,” as long as you don’t blame Steve for the failure. You’re just stating a fact. Everybody knows it was Steve who rebooted the server. Steve sure knows it: he’s been beating himself up about it all weekend.

    But why did Steve reboot the wrong server? Could it have had something to do with the fact that the server’s command prompt contains the opaque EC2 instance ID instead of a meaningful hostname? Was Steve coming off a 70-hour week? Come to think of it, why is it that rebooting that particular server can cause a total failure? If you’re asking questions like these instead of stopping at “Steve fucked up,” then you’re doing blamelessness.

    So just say the name. It’s fine.

    P.S. – obviously this doesn’t apply if you’re in the kind of organization that will retaliate against Steve for making an honest mistake. Don’t sell out your coworkers to bloodthirsty management.

    Incident metrics tell you nothing about reliability

    When an incident response process is created, there arise many voices calling for measurement. They say things like, “As long as we’re creating standards for incidents, let’s track Mean-Time-To-Recovery (MTTR) and Mean-Time-To-Detection (MTTD) and Mean-Time-Between-Failures (MTBF)!”

    Usually there’s no further discussion on this point. Everyone agrees that more data is better (“but let’s make sure we interpret the data correctly, of course!”). As practitioners, it’s exciting to think that these metrics will finally let us demonstrate concrete results for all the effort we put into reliability. Middle management, for its part, looks forward to having a “north star” for reliability work. No more wishy-washy “we believe this project improved system reliability substantially” hand-waving: the initiative either moved the needle on MTBF or it didn’t.

    The assumption underlying all this is that incident data can shed light on a system’s reliability. But this assumption is never warranted. Analysis of your incident data may shed light on your incidents. But incident data really can’t tell you anything about your reliability.

    Picture this. Your team owns a web service. You’ve set up an alert on latency. If the 90th percentile latency exceeds 100 milliseconds, your team gets paged and you start an incident.

    One day, you find out that, really, if that 90th percentile latency even gets above 50 milliseconds, it’s already a problem. Downstream services start getting sluggish, and user experience suffers. “No problem,” you say, lowering the alert threshold to 50 milliseconds.

    Now you’re able to observe more failures. Or, if you prefer: you’ve widened the definition of “failure” to include more events. But the system’s reliability has in no way been altered. And yet, you will have more incidents. Possibly quite a few more! Mean-time-between-failures will drop.
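
    As a toy illustration (numbers invented for this post), the same latency measurements breach a 50-millisecond threshold far more often than a 100-millisecond one, so the stricter alert produces more incidents from identical system behavior:

        import random

        random.seed(1)
        # Hypothetical 90th-percentile latencies (ms) for consecutive evaluation windows.
        p90_samples = [random.gauss(60, 20) for _ in range(1000)]

        breaches_at_100 = sum(1 for p90 in p90_samples if p90 > 100)
        breaches_at_50 = sum(1 for p90 in p90_samples if p90 > 50)

        # The lower threshold "observes" many more failures in the same data,
        # so MTBF drops even though nothing about the system changed.
        print(breaches_at_100, breaches_at_50)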

    Picture this. Every once in a while, the database gets into a “wedged” state, where a bunch of queries are piled up and work grinds to a halt. It’s simple to un-wedge it: just manually kill the query that caused the pileup, and everything else gets flushed through the system. Each time the database gets wedged, the database team gets paged, they start an incident, they find and kill the query, and the incident is over. These incidents happen semi-regularly, but at least they’re over quickly.

    After a while, someone finally implements a fix. Those problematic queries? Gone! No more pointless incidents spun up just to manually clear out a pileup. And, more importantly: no more service degradations resulting from this problem. The reliability of the system has improved!

    So what happens to the incident metrics? MTBF goes up, of course. That checks out: there are fewer incidents now. But what about MTTR? Before the fix, MTTR was low. These query-pileup incidents were frequent and short, so they kept mean-time-to-recovery anchored down. After the fix – which, remember, corresponded to a direct and pure improvement in reliability – mean-time-to-recovery is higher. Whoopsy daisy!
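
    A back-of-the-envelope sketch (with invented numbers) shows the arithmetic behind that flip:

        # Before the fix: many short "wedged database" incidents plus one longer outage.
        recovery_minutes_before = [5, 5, 5, 5, 5, 5, 5, 5, 120]
        # After the fix: the short incidents are gone; only the longer outage remains.
        recovery_minutes_after = [120]

        mttr_before = sum(recovery_minutes_before) / len(recovery_minutes_before)  # ~17.8 minutes
        mttr_after = sum(recovery_minutes_after) / len(recovery_minutes_after)     # 120.0 minutes

        # Fewer failures (MTBF up) and a genuine reliability improvement,
        # yet mean-time-to-recovery looks worse.
        print(mttr_before, mttr_after)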

    Picture this. You have a particular customer whose use-case tests the limits of your product’s capabilities. And furthermore, this customer is, let’s say, somewhat of a squeaky wheel. Every 2 or 3 weeks, they file a support ticket about something-or-other, and the problem is usually tricky enough that an incident gets spun up.

    Eventually this customer has had enough. They leave. This represents both a sizable blow to revenue and a scathing indictment of your product’s reliability at scale. But, on the bright side, both MTTR and MTBF benefit enormously! That’ll look great on the quarterly slide deck.

    When someone says they want to measure incident metrics (MTTR or MTBF or any other one in the family), I always want to ask them, “What’s your model?”

    Many people rush to collect data without first considering whether they can come up with a model powerful enough to interpret the data. If our model is bad, then we’ll make bad decisions even with good data. And if, as is often the case, we have no explicit model at all, then our implicit model is just, “Reliability up, MTBF up, MTTR down.” Which, as we’ve seen, doesn’t match our observations even in straightforward, common scenarios.

    Don’t waste time and energy collecting metrics you’ll never be able to use. Learn as much as you can from the failures themselves, and then move on.