No Observability Without Theory: The Talk

Last month, I had the unadulterated pleasure of presenting “No Observability Without Theory” at Monitorama 2024. If you’ve never been to Monitorama, I can’t recommend it enough. I think it’s the best tech conference, period.

This talk was adapted from an old blog post of mine, but it was a blast turning it into a talk. I got to make up a bunch of nonsense medical jargon, which is one of my favorite things to do. Here are my slides, and the video is below. Enjoy!

Ask questions first, shoot later

One piece of common-sense advice that you often hear about incident response is,

Fix it first. Ask “why” later.

This chestnut is often deployed to combat what is perceived as excessive investigation. And like most common-sense advice, it’s approximately right in lots of situations. But it misses a crucial point, and at its worst, this attitude perpetuates failure.

Diagnosing and fixing

Incident response comprises two intertwined, but distinct, activities: diagnosing and fixing. This point is illustrated in David Woods’ 1995 paper, Cognitive demands and activities in dynamic fault management: abductive reasoning and disturbance management (which uses the term “response planning” for what I’m calling “fixing”):

Diagnosing and fixing can involve overlapping activities, such that they blend together during incident response. For example, if you have diagnosed a partial outage as resulting from a web server that’s used up all its allotted file handles, you might restart that web server. This would be a “diagnostic intervention,” in that it serves to advance both the fix (if your diagnosis holds water, then restarting the web server will fix the problem) and the diagnosis (if restarting the web server fixes the problem, then you have additional evidence for your diagnosis; if it doesn’t, then you know you need a new diagnosis).

The fact that fixing and diagnosing often converge to the same actions doesn’t change the fact that these two concurrent activities have different goals. The goal of fixing is to bring the system into line with your mental model of how it’s supposed to function. The goal of diagnosing is to bring your mental model into line with the way the system is actually behaving.

Usually these two goals are aligned with one another, but sometimes they demand different actions. And that’s what opens the door for someone to say, “Let’s fix the problem first and ask questions later.” However, this implies a naïve conception of the incident response process, which I’ll represent schematically here:

In this view, you fix first, then you diagnose – perhaps in a post-mortem or a root-cause analysis. But in a field like ours, in which complex systems are at play, this is simply not how things work. A complex system has infinitely many possible failure modes. Therefore there are infinitely many possible actions that might be necessary to recover from a failure. In order to even attempt a fix, you must always start with some kind of diagnosis.

Sure, sometimes the diagnostic effort might be very brief and straightforward. Suppose you get an alert about some new error happening in production. You immediately recognize the error as resulting from a code change you just deployed, and you revert the change.

Because the diagnosis was so quick, it may feel like you simply fixed the problem as soon as you saw it. But you still undertook a diagnostic process. You saw the alert and developed a hypothesis (“My code change caused these errors”), and that hypothesis turned out to be right. Had you truly done no diagnosis, then you wouldn’t have known what to do. The incident actually looked like this:

Contrast this with another scenario. You get alerted about slow page-loads. Together with a team, you begin to investigate the slowness. But no explanation is forthcoming. It takes an hour of searching logs, reading documentation, and consulting with other teams before you have a satisfactory explanation: a mission-critical cache object has gotten too large to store in the cache, so it has to be fetched from origin on every request. Upon reaching this diagnosis, you immediately know what to do to fix the problem:
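
(To picture that cache failure in code: here’s a minimal sketch, with a made-up size limit and made-up function names, of how one oversized value quietly turns every request into an origin fetch.)

    # Sketch only: a memcached-style store that rejects items over a size limit.
    MAX_ITEM_SIZE = 1 * 1024 * 1024  # 1 MiB; the limit itself is an assumption

    cache = {}

    def cache_set(key, value):
        if len(value) > MAX_ITEM_SIZE:
            return False              # silently rejected; callers rarely check
        cache[key] = value
        return True

    def get_with_cache(key, fetch_from_origin):
        value = cache.get(key)
        if value is None:
            value = fetch_from_origin(key)  # slow path, now taken on every request
            cache_set(key, value)           # keeps failing once the object outgrows the limit
        return value

Once the object outgrows the limit, every set fails, every get misses, and origin absorbs the full request rate: exactly the slowness the team spent an hour explaining.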

During this long diagnostic phase, nobody would have said, “Fix the problem first. Worry about ‘why’ later.” The diagnostic effort was clearly pursued in service of fixing the issue. Whether it takes a split-second or a week, a diagnosis (at least a differential diagnosis) always has to be reached before the problem can be fixed.

These are simple examples. In a more general case, you do some diagnosis, which produces a potential fix. That fix doesn’t work (or only partly works), so diagnosis continues until another potential fix presents itself. And since multiple responders are present on the call, diagnosis doesn’t generally have to totally halt in order for fixes to be pursued:

The “Fix first” shibboleth comes out when someone perceives that there is already a potential fix, but nobody is applying that fix yet. So when you hear it, or you’re tempted to speak it yourself, first ask yourself:

  1. Is there a potential fix on the table?
  2. Is that potential fix worth pursuing immediately?

If the answer to both of these questions is “yes,” then by all means, get on the fix. But don’t halt diagnosis to do so, unless you’re so labor-constrained that you must.

If either question elicits a “no,” then you should talk through your reasoning with the group and make the case for pursuing further diagnosis before taking action.

Not “fix first, ask questions later.”

Ask questions until you can fix.

––

I teach Scientific Incident Response in 4 Days, an in-depth incident response training course for engineering teams.

Garden-path incidents

Barb’s story

It’s 12 noon on a Minneapolis Wednesday, which means Barb can be found at Quang. As the waiter sets down Barb’s usual order (#307, the Bun Chay, extra spicy), Barb’s nostrils catch the heavenly aroma of peanuts and scallions and red chiles. A wave of calm moves through her. Barb pulls her chair forward, cracks apart her wooden chopsticks, and…her pager goes off.

After cursing under her breath, she dutifully reads the message:

Error rate for `environment:production` exceeds 100 msg/s

Welp.

Barb grabs one quick bite of spring roll as she flags down the waiter for a to-go box. Opening Slack on her phone, she declares an incident, joins the Zoom call, and hurries back up Nicollet Ave. and around the corner, toward her apartment.

Five minutes later, finally sitting at her laptop, Barb is pretty sure she knows what the problem is. The error flooding the logs is:

object 'nil' has no method 'reimport!'

That looks like a straightforward app bug to her, and wouldn’t you know it? Right before these errors started cropping up, there was a deploy to the Rails app by a newish dev named Alice, who, according to her Slack profile, is based in Seattle. Barb asks this ‘Alice’ to join the incident Zoom.

– Hi, this is Alice. What’d I do?
– Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. Can you revert that change please?
– Sure, no problem. I’ll put together the revert PR now.

5 minutes later, Alice’s PR is approved. Alice clicks “Merge.” The pair begin the anxious but familiar 15-minute wait for CI to pass, all the while greeting and informing the bewildered latecomers who straggle into the call.

Alice’s story

Alice stares blankly at the white rectangle on her monitor. She spent her first hour getting yesterday’s frontend bugfixes pushed out, and now it’s time to start her PowerPoint. She’ll be working on this PowerPoint for the rest of the morning, probably through lunch, and all afternoon.

Alice shuts her eyes and heaves a dismal sigh. Alice fucking hates PowerPoint. But she can’t put it off anymore. So she dons her headphones, cracks her knuckles, and… gets an urgent Slack message:

Morning, Alice – we’ve got a production incident involving a spike of errors, and it looks like it coincides with a deploy of yours. Can you jump on https://zoom.globocorp.co/z/123456789… when you have a moment please?

As she waits for Zoom to load, Alice feels something almost like relief. At least she doesn’t have to work on that goddamn PowerPoint yet.

– Hi, this is Alice. What’d I do?
– Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. Can you revert that change please?
– Sure, no problem. I’ll put together the revert PR now.

Alice quickly whips up that PR and gets it approved. She spends the next 15 minutes waiting for CI to pass, while absent-mindedly writing the first slide of her PowerPoint. By the time the tests are green, she has typed out and deleted 4 different titles.

The real story

This incident seems to have gone about as well as it could, considering. Alice was on the call within 7 minutes of the alert, and a PR was ready 5 minutes later. It would be great if CI were faster, or even better if CI could be skipped for a revert. They’ll talk about that at the post-mortem.

However, nobody in the call yet knows what really happened. What really happened is this:

    • Alice’s 16:46 UTC deploy was the first to pick up the latest Docker image.
    • The new Docker image includes an update to a software dependency.
    • The updated dependency has a bug that only shows up in production.

    But instead of knowing any of that, Alice and Barb are sitting here for 15 minutes waiting for CI to run, so they can deploy a fix that won’t even work.

    This is a garden-path incident. Barb has what she feels is a strong signal from the telemetry, which points toward a bug in Alice’s code. Alice has what she feels is a strong signal, which is that Barb seems very confident in her conclusion. But they’ve been led up the garden path, and as a consequence, this incident will run longer than it needs to.

    How this could all have been avoided

    Imagine instead, that Barb and Alice are both in the habit of saying their hypotheses out loud.

    When Alice joins the call, Barb instead says:

    – Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. My hypothesis is that your change triggered this spike of errors. Can you revert the change please?

    Instead of letting Alice infer that the cause of the error spike is already known to be her deploy, Barb acknowledges the limits of her certainty. She has a hypothesis, not a definitive diagnosis. This gives Alice the opportunity to respond with something like:

    – Well, are the errors from the backend or the frontend? Because my change was frontend-only.

    And just like that, Alice and Barb have stepped back from the garden path. Instead of waiting around for a useless CI cycle, they can continue straight away with diagnosis.

    Note that, even if Barb doesn’t state her hypothesis, things will still be okay as long as Alice does:

    – Hi, this is Alice. What’d I do?
    – Thanks for joining, Alice. I’m incident commander for this spike of production errors. It looks like you deployed a change at 16:46 UTC and a bunch of errors started happening. Can you revert that change please?
    – Sure, no problem. I’ll put together the revert PR now. Just to be clear, the hypothesis is that my frontend-only changeset is somehow causing these nil-has-no-method errors in the backend?
    – Uh, did you say frontend-only?

    Again, Barb and Alice have gotten themselves off the garden path. Which means this technique – stating your hypothesis and asking for rule-outs – is something you can do unilaterally starting today to make your team better at troubleshooting.

    Another thing you can do to make your team better at troubleshooting is employ Clinical Troubleshooting against your next head-scratcher of a bug.

    Clinical troubleshooting: diagnose any production issue, fast.

    Over my career as an SRE, I’ve diagnosed thousands of production issues. When I’m lucky, I have enough information at hand, and enough expertise in the systems involved, to get to the bottom of the problem on my own. But very often I need to bring together a team.

    Troubleshooting with a team unleashes enormous power. Problems that would have taken me days to solve as an individual might take only hours or minutes, thanks to the benefit of pooled knowledge.

    However, collaborative troubleshooting also comes with dangers. Time and again, I’ve seen groups struggle to make progress on an investigation due to miscommunication, misalignment, and confusion. Among other difficulties, the fundamental common ground breakdown can have especially heinous consequences in this context.

    Over the years, I’ve developed a reliable method for harnessing the diagnostic power of groups. My approach is derived from a different field in which groups of experts with various levels of context need to reason together about problems in a complex, dynamic system: medicine.

    I call this method clinical troubleshooting.

    The clinical troubleshooting process

    Although clinical troubleshooting can be useful in solo troubleshooting, it really shines as a group activity. It’s a lightweight structure that always adds value. I recommend reaching for clinical troubleshooting as soon as you need to involve another person in any “why” question about an unexpected behavior of your system.

    Step 0: Get a group together

    Before you start, gather the people you’ll be troubleshooting with. Any synchronous communication channel can work for this: Slack, Zoom, a meeting room; you name it.

    You don’t need a big group. In fact, a small group is best. What matters most is that you bring together diverse perspectives. If you’re a backend engineer, try to pull in a network engineer and a frontend engineer, or a support agent and a sysadmin. Cast a wide net.

    Once you have an initial group, share a blank Google doc with everyone.

    Step 1: Identify symptoms

    Add a Symptoms header to the doc.

    You’re the one bringing the issue, so you must have some observations already. Write those down in a numbered list.

    It’s important that it be a numbered list rather than a bulleted list. As the troubleshooting process goes on, you’re going to want to refer to individual symptoms (and, later, hypotheses and actions). If each symptom has a number and the number never changes, this is a lot easier.

    Ask your collaborators to list symptoms, too. They may have observed some other facet of the problem, or they may think to look at a graph that you don’t know about.

    Here’s what an initial symptom list might look like:

    Symptoms

    1. About 5 times a day, the Storage API sends a spike of 503 responses. Each spike lasts about 500 milliseconds and includes between 200 and 1000 responses (about 0.1 to 0.5% of all responses sent during the interval).
    2. Outside of these spikes, the Storage API has not sent any 503 responses at all in the last 14 days.
    3. The failing requests have short durations, around the same as those of normal requests (mostly under 100 milliseconds).

    (In this and subsequent examples, don’t worry about understanding the exact technical situation. Clinical troubleshooting can be used on problems in any part of any tech stack.)

    All the symptoms on the list should be precise and objective. In other words, if a statement is quantifiable, quantify it. Don’t make suppositions yet about why these phenomena have been observed. That comes next.

    Once you’re all on the same page about what problem you’re investigating, the initial symptom list is done.

    Step 2: Brainstorm hypotheses

    Add a Hypotheses header to the doc. Invite your colleagues to join you in suggesting hypotheses that might explain the symptoms.

    Let the ideas flow, and write them all down. This is where having a diverse set of perspectives in the room really pays off. Your co-investigators will think of hypotheses that would never have occurred to you, and vice versa. The more of these you come up with, the more likely the actual explanation will be on the list.

    A hypothesis can’t be just anything, though. A hypothesis must

    • explain (at least some of) the symptoms,
    • accord with all known facts, and
    • be falsifiable (that is: if it were false, we’d be able somehow to prove it false).

    For example, given the symptoms above, “requests to the storage API are getting queued up behind a long-running query” would not be a sound hypothesis, since it’s inconsistent with Symptom 3. If requests were queued up, we’d expect them to take longer before failing.

    After some discussion, your hypothesis list might look like this:

    Hypotheses

    1. A particular request causes an out-of-memory (OOM) event on a storage server, and all in-flight requests to that server get killed.
    2. A network hiccup causes connections between the load balancer and a storage server to be severed.
    3. Requests to the storage API are getting queued up behind a long-running query
      1. Discarded because inconsistent with Symptom 3
    4. A network hiccup causes connections between storage API servers and a persistence layer node to be severed.

    Again, use a numbered list. If a hypothesis is ruled out or deemed irrelevant, don’t delete it: you don’t want the list numbering to change. Instead, mark it in some way with formatting. I use strikethrough. Gray works too.

    Step 3: Select actions

    Add an Actions header.

    In a new numbered list, choose one or two actions that will advance the troubleshooting effort. Usually, you should pick actions that will rule out, or “falsify,” one or more of the hypotheses on the table.

    To rule out Hypothesis 2 above, for instance, you could review the logs for one of the error spikes and check whether all the affected requests were associated with the same load balancer or the same storage server. If the requests are distributed across your infrastructure, then Hypothesis 2 is ruled out (and Hypothesis 1 as well, for that matter!).
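
    If it helps to see that rule-out concretely, here’s a minimal sketch (the field names are invented) of the log crunching involved:

        from collections import Counter

        def spread_across_infrastructure(error_logs):
            """error_logs: parsed 503 entries from one spike, e.g.
            {"load_balancer": "lb-3", "storage_server": "storage-17"}."""
            lbs = Counter(e["load_balancer"] for e in error_logs)
            servers = Counter(e["storage_server"] for e in error_logs)
            # Spread across multiple hosts => Hypotheses 1 and 2 are falsified.
            return len(lbs) > 1 and len(servers) > 1

    If it returns True, Hypotheses 1 and 2 both go into strikethrough; if everything points at one host, you’ve instead sharpened both hypotheses considerably.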

    When you agree upon actions, it’s best to assign them to individuals. Sometimes an action can be taken right away, and other times it’ll take a while and the group will have to reconvene later. But ownership should never be unclear.

    Ruling out hypotheses isn’t the only purpose of actions in clinical troubleshooting. You can also assign actions that expand the group’s understanding of the problem, in order to generate new symptoms and new hypotheses. These actions can be things like, “Read the documentation on system X’s network configuration,” or “Search for blog posts about error message E.” As long as there’s at least one hypothesis in the running, though, there ought to be at least one action in flight that could falsify it. That’s one of the ways clinical troubleshooting ensures constant progress.

    Steps 4 through N: Cycle back through

    When actions are completed, you get more information for the symptom list. More symptoms suggest new hypotheses. New hypotheses imply further actions. Just keep going through the cycle until you’re satisfied.

    Sometimes you won’t be satisfied until you have a definitive diagnosis: a solid explanation for all the symptoms that’s been proven right. Other times, you’ll be satisfied as soon as you take an action that makes the problem go away, even if there’s still uncertainty about what exactly was going on.

    In any case, clinical troubleshooting will reliably get you to the solution.
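
    If you like to think in code, here’s one way to picture the doc’s bookkeeping. This is only my own sketch of the structure; the real artifact is just a Google doc with three numbered lists:

        from dataclasses import dataclass, field

        @dataclass
        class Hypothesis:
            text: str
            ruled_out: bool = False   # marked (struck through), never deleted

        @dataclass
        class CaseDoc:
            symptoms: list = field(default_factory=list)     # numbered, precise, objective
            hypotheses: list = field(default_factory=list)   # numbered, falsifiable
            actions: list = field(default_factory=list)      # numbered, each with an owner

            def open_hypotheses(self):
                # 1-based numbers stay stable even after rule-outs.
                return [(i + 1, h) for i, h in enumerate(self.hypotheses) if not h.ruled_out]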

    Keep things moving forward

    In the absence of structure, collaborative diagnosis can stall out. Or worse, go backward.

    With clinical troubleshooting, there’s always a next step forward. Teams that practice this method will consistently get to the bottom of technical mysteries, even when strapped for data or under intense pressure. And over time, as this scientific way of thinking becomes a habit, and then a culture, we come to understand the behavior of our system that much better.

    I can teach your team how to do this. Get in touch.

    Interviewing engineers for diagnostic skills

    In SaaS, when we’re hiring engineers, we usually imagine that their time will mostly be spent building things. So we never forget to interview for skills at building stuff. Sometimes we ask candidates to write code on the fly. Other times we ask them to whiteboard out a sensible architecture for a new software product. We ask them to tell us a story about a piece of software or infrastructure that they built. All this focus on building things ensures that we’ll end up hiring a candidate who can build things.

    And yet, when you work in SRE or operations, or really any backend engineering role with an on-call rotation, troubleshooting is one of your core responsibilities. Some months, you’ll spend far more time troubleshooting than building. So shouldn’t your troubleshooting skills be evaluated during the interview process?

    When I talk about interviewing candidates for troubleshooting skills, I mean answering questions like:

    • Can they distinguish relevant from irrelevant facts?
    • Do they seek to answer specific questions?
    • Will they keep an open mind about the cause of a problem, rather than jumping to a conclusion?
    • Are they able to express their thought process to a colleague?
    • When they hit a dead end, will they get discouraged? Or will they calmly seek out a different approach?
    • Do they have a strategy?

    Several times throughout my SRE career, I’ve put together interview processes to screen for these skills. The most effective method I’ve found is to create a sort of interactive role-playing game in which the candidate tries to solve a tricky bug in a complex system.

    In this post, I’ll show how I went about creating one of these interview questions and how I evaluated the performance of candidates. I hope you’ll be able to copy my approach and produce your own such interview scenario (please reach out if you try this!). If you do, you’ll raise the bar for diagnostic skills on your team, and your operations will run that much more smoothly.

    The incident

    Around a year into my stint at $lastJob, I was tapped to organize an ongoing investigation into weird, customer-facing, nondeterministic-seeming API timeouts.

    The API in question was essentially a private software package registry. But the authentication logic that made the registry private was not implemented in the registry itself. Instead, calls to this registry were proxied by a broader, public-facing API, which handled auth/auth considerations. Here’s a diagram of the whole situation:

    Timeouts would occur seemingly at random, at a rate of about 1 timeout per 10,000 requests. The content of the request didn’t matter: any given request for the private registry might hang for 10 seconds and then fail. But if you immediately retried the same request, it would succeed.

    I led a team of 4 engineers in a diagnostic effort, and after almost 3 weeks of methodical investigation, we finally got to the bottom of these timeouts. Along the way, we stumbled into several dead-ends and chased a fair number of red herrings.

    I’ll spare you most of the details, but it’s important to note the existence of all these dead-ends. They later served as a signal that this problem was sufficiently complex and counterintuitive to give interview candidates a real workout. When you go searching for an incident to turn into a troubleshooting skills interview, look for investigations with lots of dead-ends.

    Anyway, the root cause (come at me) ended up being as follows. When requests arrive at the private registry, they’re served by one of N identical “backend” processes (in the diagrams below, N=3). A server distributes requests to these backends, always picking an idle backend if one exists:

    If no backend is idle, then the server must queue up the incoming request behind an in-flight request:

    What we found is that, occasionally, a particular script would execute an extremely long-running request against the registry API. This would normally be okay, but once in a blue moon, a few of these long-running requests would happen to hit the same server. This tied up multiple backends, resulting in one or more requests (☆) getting queued behind these pathologically long-running requests (△):
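
    Here’s a toy simulation of that scheduling rule (my own illustration, with made-up request timings), just to show how a handful of pathological requests on one server starve an ordinary one:

        N_BACKENDS = 3

        def simulate(requests):
            """requests: list of (arrival_s, service_s); returns each request's total latency."""
            backend_free_at = [0.0] * N_BACKENDS          # when each backend next goes idle
            latencies = []
            for arrival, service in sorted(requests):
                b = min(range(N_BACKENDS), key=lambda i: backend_free_at[i])
                start = max(arrival, backend_free_at[b])  # waits here if no backend is idle
                backend_free_at[b] = start + service
                latencies.append(backend_free_at[b] - arrival)
            return latencies

        # Three 60-second monsters land on the same server, then a normal 0.1 s request:
        print(simulate([(0.0, 60.0), (0.1, 60.0), (0.2, 60.0), (0.3, 0.1)]))
        # => the last request waits ~60 s, far past any 10-second client timeout

    Nothing about the starved request is slow in itself, which is why immediately retrying it succeeds.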

    Altogether, this made for a very confusing set of observations:

    • Requests passing through the fronting API would occasionally take extra long, regardless of their content.
    • If one of these requests took longer than 10 seconds, a timeout would be reached in the CLI, resulting in a customer-facing error message. In this situation,
      • The request would continue running in the fronting API, and would be logged by that system as a high-latency 200.
      • This same request would be logged as a low-latency 200 at the registry level, since that subsystem would not start its stopwatch until the backend started processing the request.
      • At the load balancer, this request would be logged as a 499, which means “the client closed its connection before we could send a response.”
    • If one of these requests took longer than 60 seconds, though, a load balancer timeout would be reached.
      • The fronting API would log a 499
      • The load balancer would log a 502 Bad Gateway
      • The registry would never log this request at all.

    The confusing nature of these symptoms is what made this issue such a perfect model for our troubleshooting interview.
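
    To keep those layers straight in my own head, I find it helps to write the mapping down as a little function. This is just my reconstruction, using the 10-second CLI timeout and 60-second load balancer timeout described above:

        CLI_TIMEOUT_S = 10   # the client gives up here
        LB_TIMEOUT_S = 60    # the load balancer gives up here

        def logged_at_each_layer(duration_s):
            """What each layer records for a request that runs for duration_s seconds."""
            if duration_s <= CLI_TIMEOUT_S:
                return {"cli": "success", "load_balancer": 200,
                        "fronting_api": 200, "registry": 200}
            if duration_s <= LB_TIMEOUT_S:
                # The client hung up, but the request keeps running and eventually "succeeds".
                return {"cli": "timeout error", "load_balancer": 499,
                        "fronting_api": "high-latency 200", "registry": "low-latency 200"}
            return {"cli": "timeout error", "load_balancer": 502,
                    "fronting_api": 499, "registry": "never logged"}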

    How I turned this into an interview question

    A while after this confusing incident was resolved, I was faced with the prospect of hiring SREs for my team. Since I’m very much a “squeeze the problems out of the system you have” person, rather than a “build new stuff with fewer problems” person, I wanted to make sure our interview process screened for good diagnostic skills.

    So I reviewed the registry timeouts incident. How could I turn that messy, roundabout, real-world investigation into an approachable interview question that would help me gauge a candidate’s troubleshooting skills?

    I changed some names, removed a couple complicating factors that would just muddy the waters, and wrote up a training document for interviewers. The training document included:

    • A description of the hypothetical architecture in which the exercise would take place;
    • A detailed explanation of the issue to be investigated, including discussion of the implications this issue would carry for observable artifacts (e.g. metrics, logs, traces, error messages);
    • A guide to conducting the interview (what to say, what not to say, how to evaluate responses).

    The document’s description of our hypothetical architecture included a bird’s-eye-view diagram of the whole made-up system:

    A service called “Vasa” took the place of the fronting API described earlier. Postgres was included in the diagram despite its irrelevance, introducing a bit of misdirection.

    To stand in for the private registry where all the queueing nonsense happened, I made up a microservice called “Argos.” It’s not shown in the bird’s-eye-view diagram. Instead, once the candidate localized the problem to the Argos service, I’d show them another, more zoomed-in diagram that included details of the Argos architecture.

    One key modification was that I made Argos (the private registry stand-in) into a generic “workspace metadata storage service.” I didn’t want candidates to fixate on the software-registry aspect of the problem, since that wasn’t especially relevant. People understand metadata storage and retrieval, and it’s boring, so they don’t ask many questions about it.

    Conducting the interview

    The actual interview takes the form of a role-playing game, almost D&D-esque. I start with a brief spiel:

    This is going to be a sort of troubleshooting roleplay adventure, in which you play as an SRE investigating a system problem. I'll lay out the system and the problem under investigation, and you'll get to ask questions and take actions as you see fit. As you learn new things, we can take notes in the Google Doc which I'm sharing now. Sound good? Shall we get started?

    After a bit of experience, I added another section to the intro:

    The exercise we're going to do is based on a real-world bug that took a team of 4 people several weeks to figure out. I mention this because I don't want you to get to the end of the interview and think, "Oh no! I didn't solve the problem! I must have failed the interview." It's very common for candidates to run out of time without finding the answer but still do very well. I'm much more interested in how you go about investigating the problem than in how far you get.

    With that, I show them the architecture diagram (above) and give a bird’s-eye-view description of the system they now “own.” Then we get into the problem:

    One day, you get assigned a ticket. According to the ticket, the support team has been getting an increasing number of reports of an issue in which a customer runs a "Get Workspace" command in the CLI, and the CLI hangs for 10 seconds before printing this message and crashing:
    ERROR: timeout waiting for response from server
    Several users have additionally reported that, if they immediately re-run their "Get Workspace" command, it succeeds as normal in under 200 milliseconds.

    “Alright,” I say to the candidate, “so you’re assigned this ticket. What do you do first?”

    From here, different candidates go in all sorts of different directions. Some want to ask more questions of the support team. Some want further clarity on the architecture. Some immediately dive into logs and metrics. I don’t try to guide them. I just make up plausible answers to their questions, given my knowledge of the underlying problem. For example:

    CANDIDATE: Can I ask the support team if there was anything else in common between the customer reports that they received?
    ME: Sure. Support doesn't understand quite what you're asking, but they give you links to all the relevant tickets. What do you look for?
    CANDIDATE: Are all the tickets from the same part of the world, or all from the same time of day or something?
    ME: You don't see any common geographic characteristics. Some of the customers are from North America, some are from Asia, some are from Australia. It does seem, from your limited sample, that most of the reports are from North American business hours, roughly mid-day, on weekdays.

    Even this early in the interview, I can sometimes detect warning signs. If a candidate immediately jumps to a specific hypothesis and gets stuck on it, I’m pretty sure this interview isn’t going to go well. But I’ll let them keep going:

    CANDIDATE: This sounds like a caching issue, because the requests are failing once and then succeeding on the second try. Are there any caches in the system that might be timing out?
    ME: There are many caching layers. There are filesystem caches, of course, and there's the buffer cache in the database. Some data items are cached in Memcache, as you can see in the diagram, and then some of the microservices have caching layers of their own.
    CANDIDATE: Okay, then I guess I'd look at the health of those caching layers.
    ME: Alright. How would you look at their health? What kinds of things would you look for?
    CANDIDATE: Like error messages, high CPU usage, that sort of thing.

    This candidate is not likely to do well. Instead of asking targeted questions to home in on the location of the problem, they’ve leapt to a specific kind of problem (probably one they’ve seen recently in their job) based on very little evidence. Moreover, they’re not looking for the right kind of evidence. They’re looking for evidence that would confirm their suspicion, but when they don’t find any, their suspicion still won’t be ruled out. They could spin their wheels like this for the whole hour. But if that’s what they want to do, I’ll let them.

    Since the problem under investigation is sporadic (about 1 in every 10,000 requests), any approach that involves looking at aggregate system-health stats will reach a dead-end pretty quickly. Even good candidates may initially set out in the system-health direction, but they’ll quickly recognize the dead-end and incorporate this new information into their reasoning. They’ll adapt and find a new strategy.

    One such strategy is to try to find evidence of the failed requests in the system logs. But I make them work for it:

    CANDIDATE: I look in the logs to try to find the requests that are timing out.
    ME: Okay. Which logs do you look in? We've got logs from the load balancer, logs from the Vasa service, logs from the DB…
    CANDIDATE: The ones for the Vasa service.
    ME: The Vasa service has both access logs and error logs. Which one do you want to search?
    CANDIDATE: Let's look at the access logs for now.
    ME: Cool. How do you find the entries you're interested in?
    CANDIDATE: Oh. Hmm. Do we have the client IP addresses for any of the timeouts?
    ME: You search through the support tickets, but you don't find any client IP addresses.
    CANDIDATE: I guess then I would try to find requests for the Get Workspace endpoint that lasted longer than 9.99 seconds.

    When a candidate gives me enough information for a log search, I’ll make up some results, being sure to also document the parameters that were included in the search. For example,

    Searching in the Vasa access logs for Get Workspace requests that lasted longer than 9.99 seconds in the last 24 hours, you find:
    – 1000 total
    – 900: response code 200
    – 100: response code 503

    This is progress, but candidates often get stuck here for a couple reasons. Some get hung up on the request latencies. They’ll look for requests with latencies of exactly 10 seconds (of which there are none), or they’ll look for requests with latencies around 10 seconds (of which there are a few, most of which are unrelated to the problem at hand). In the real-world investigation, we found that the borked requests as logged by Vasa had latencies ranging all the way up to 60 seconds, but because the request timeout built into the CLI was 10 seconds, the client would only hang for 10 seconds.

    Other candidates would get hung up on the response codes. Since customers are experiencing errors, the investigator would assume that the 503 responses in the log search results above are the ones causing the problem. But in reality, it’s the 200s. Due to mismatched timeouts, the stuck requests can linger in Vasa and Argos until well after the client has gone away, ultimately succeeding but without any client to return their 200 result to.

    Once a candidate finds a specific request that they suspect represents an instance of the problem, they usually want to look at a trace. In this situation, I simply draw one by hand using the Insert => Drawing command in the Google Doc. Here’s such a drawing:

    And so the process goes back and forth like this until time runs out. In all the dozens of times I administered this interview, no one ever made it to the end. But, just like I told candidates at the beginning, that’s not what I cared about anyway.

    At the end, I ask the candidate to do one last thing:

    Alright, you've gotten pretty far, but now let's suppose it's time for you to go home. Maybe you're even going on vacation. How will you update the ticket?

    With this prompt, I’m hoping the candidate will describe:

    • the relevant new facts they’ve observed,
    • the explanations they’ve ruled out, and
    • what they recommend doing next.

    A candidate who simply lists all the actions they’ve taken and the results of those actions is missing the point. I want to see that they can synthesize their observations into a coherent and succinct form, so that a colleague with comparable skills can easily pick up where they left off.

    Evaluating results

    The evaluation of candidates’ performance on this interview goes directly back to the troubleshooting skills that I enumerated at the beginning:

    Can they distinguish relevant from irrelevant facts?

    If they spent long periods going down rabbit holes that had nothing to do with the timeouts under investigation, then the answer is “no.”

    Do they seek to answer specific questions?

    When candidates are strong in this area, they’ll reason out loud about their hypotheses. I’ll know why they’re asking for a specific piece of information because they’ll tell me why.

    It’s also pretty easy to tell when a candidate is just throwing stuff at the wall to see if anything sticks, and that’s big points off. This is not how you troubleshoot effectively.

    Will they keep an open mind about the cause of a problem, rather than jumping to a conclusion?

    Often, candidates will fixate on a particular area of the stack. “This really feels like a database problem…” or the like is a common thing to hear. Of course it’s okay – and often good – to have a hunch. But the difference between ineffective and effective troubleshooters is that the ineffective ones will keep trying to prove their hunch right, whereas the effective ones will try to prove it wrong.

    Are they able to express their thought process to a colleague?

    The “update the ticket” question at the end is very helpful here, but I can usually get solid evidence on this well before then. Throughout the exercise, a good troubleshooter will recognize shifts in their thinking and note them aloud. A not-so-good troubleshooter will leave me in the dark.

    When they hit a dead end, will they get discouraged? Or will they calmly seek out a different approach?

    On many occasions, I’ve seen candidates get discouraged at a perceived lack of progress. Usually I try to help them find a different direction to go in, and then they’re able to pick their momentum back up. But some people just shut down and stop playing the game. Others go all the way back to the beginning and start casting about for hints in seemingly random parts of the stack. These are signs that the candidate lacks the doggedness required to consistently troubleshoot difficult problems.

    Do they have a strategy?

    This is what it’s all about. I’m looking to hire someone who makes efficient use of all the resources and information available. Someone who doesn’t just seek out information, but seeks out the particular information that will narrow the search space. It’s only these strategic thinkers who pass this interview, and having subsequently worked with several of them, I can confidently say that this interview technique gets great results.

    3 questions that will make you a phenomenal rubber duck

    As a Postgres reliability consultant and SRE, I’ve spent many hours being a rubber duck. Now I outperform even the incisive bath toy.

    “Rubber duck debugging” is a widespread, tongue-in-cheek term for the practice of explaining, out loud, a difficult problem that you’re stumped on. Often, just by putting our troubles into words, we suddenly discover insights that unlock progress. The person we’re speaking to could just as well be an inanimate object, like a rubber duck. Hence the term.

    Rubber ducks are great, but a human can add even more value. In this article, I’ll share my 3 favorite questions to ask when someone comes to me feeling stumped in a troubleshooting endeavor. These questions work even when you have no particular expertise in the problem domain. Master them, and you’ll quickly start gaining a reputation as the person to talk to when you’re stuck. This is a great reputation to have!

    Question 1: How did you first start investigating this?

    As we investigate a problem, our focus shifts from one thing to another to another. We go down one path and forget about others. We zoom in on details and neglect to zoom back out. It’s easy to lose perspective.

    “How did you first start investigating this?” works well because, through the act of recounting their journey from initial observation to where they are now, your colleague will often regain perspective they’ve lost along the way. And by asking this particular question, you avoid having to suggest that they may have lost perspective – which could make them defensive.

    Even if your colleague hasn’t lost perspective, hearing the story of the investigation so far will help you ask better questions and help them organize their thoughts.

    Question 2: What observations have you made?

    In troubleshooting a complex problem, it’s easy to forget what you already know. As you go along, you make lots of observations, small and large, interesting and boring, relevant and irrelevant. You can’t hold them all in your head.

    When someone’s stuck, it often helps to review their observations. Not theories, not difficulties, not actions: directly observed facts.

    Reviewing observations can help in a few different ways:

    • They may be entertaining a hypothesis that clashes with some previously learned (but since forgotten) fact. If so, they can now go ahead and discard that hypothesis.
    • Juxtaposing two observations may suggest a hypothesis that never occurred to them before, because they never held those two observations in their head simultaneously.
    • Listing out their observations may bring to mind something they haven’t looked at yet.

    As your colleague recounts their observations, write them down in a numbered list. And, if you can, ask clarifying questions. Questions like “Does X always happen concurrently with Y, or only sometimes?” and “How does this differ from the normal behavior?”

    Never underestimate the power of precisely stating the facts.

    Question 3: If your hypothesis were wrong, how could we disprove it?

    This question is my favorite.

    One of the most common ways people get stuck in troubleshooting is tunnel vision. They get a single idea in their head about the cause of the problem, and that becomes all they can think about.

    This question, “If your hypothesis were wrong, how could we disprove it?” flips the script. Instead of racking their brain trying to prove their theory, it gets them thinking about other possibilities. Asking this question can lead to lots of different outcomes, all of which represent progress:

    1. You come up with a way to disprove the hypothesis, and successfully disprove it. This may make your colleague sad for a few hours, but when they come back to the problem, they’ll make huge strides.
    2. You come up with a way to disprove the hypothesis, but fail to disprove it. The hypothesis is thus bolstered, and the next step becomes clear: elaborate a few different versions of it and try to disprove those.
    3. You can’t think of any way to disprove it. This means it’s probably not a hypothesis at all, since it’s not falsifiable. Therefore you must replace it with a new hypothesis. This may feel like a setback, but it’s really the only way forward.

    How it fits together

    Under the hood, these 3 questions are just different ways of invoking hypothetico-deductive reasoning, which I’ve written about previously (see Troubleshooting On A Distributed Team Without Losing Common Ground and You Know Who’s Smart? Friggin’ Doctors, Man.). I don’t know of any better way to achieve consistent problem-solving results in the face of complexity.

    If you’re interested in learning how to apply these techniques in your career or in your organization, I can help. Shoot me an email!

    Outliers carry information. Don’t leave them on the table

    Over a decade ago, I saw this talk by John Rauser. Only recently, though, did I come to realize how incredibly influential this talk has been on my career. Gosh what a great talk! You should watch it.

    If you operate a complex system, like a SaaS app, you probably have a dashboard showing a few high-level metrics that summarize the system’s overall state. These metrics (“summary statistics”) are essential. They can reveal many kinds of gross changes (both gross “large scale” and gross “ick”) in the system’s state, over many different time scales. Very useful!

    But don’t be misled. Summary statistics reveal certain patterns in the system’s behavior, but they are not identical to the system’s behavior. All summary statistics – yes, even distributions – hide information. They’re lossy. It’s easy to get lulled into the sense that, if an anomaly doesn’t show up in the summary statistics, it doesn’t matter. But a complex system’s behavior is not just curves on a plot. It’s a frothing, many-dimensional vector sum of instant-to-instant interactions.
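
    Here’s a tiny illustration of that lossiness, with made-up numbers. Two latency datasets, one of which hides a single 30-second request, look almost indistinguishable to the usual summary statistics:

        import statistics

        steady = [100] * 9_999 + [130]      # latencies in milliseconds
        spiky  = [100] * 9_999 + [30_000]   # one request took 30 seconds

        for name, data in (("steady", steady), ("spiky", spiky)):
            print(name, "median:", statistics.median(data),
                  "mean:", round(statistics.mean(data), 2))
        # The medians are identical and the means differ by about 3%, yet one of
        # these datasets contains a request that blew straight through a 10-second timeout.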

    When you investigate an anomaly in summary statistics, you’re faced with a small number of big facts. Average latency jumped by 20% at such-and-such time. Write IOPS doubled. API server queue depth started rising at some later time. Usually, you “zoom in” from there to find patterns that might explain these changes.

    When you instead investigate a specific instance of anomalous behavior, you start with a large number of small facts. A request to such-and-such an endpoint with this-and-that parameter took however many seconds and crashed on line 99 of thing_doer.rb. None of these small facts tell you anything about the system’s overall behavior: this is just a single event among millions or billions or more. But, nevertheless: these small facts can be quite illuminating if you zoom out.

    First of all, this probably isn’t the only time a crash like this has ever occurred. Maybe it’s happening multiple times a day. Maybe it happened twice as often this week as it did last week. Maybe it’s happening every time a specific customer makes a specific API request. Maybe that customer is fuming.

    And second of all, the reason this event caught our eye in the first place was because it was anomalous. It had some extreme characteristic. Take, for example, a request that was served with very high latency. Perhaps, in the specific anomalous case before us, that extreme latency didn’t cause a problem. But how extreme could it get before it did cause a problem? If it took 20 seconds today, could it take 30 seconds next time? When it hits 30, it’ll time out and throw an error. Or, if multiple requests like this all arrived at the same time, could they exhaust some resource and interfere with other requests?

    If the only anomalies you investigate are those that show up in summary statistics, then you’ll only find problems that have already gotten bad enough to move those needles. But if you dig into specific instances of anomalous behavior – “outliers” – then you can often find problems earlier, before they become crises.

    Descriptive engineering: not just for post-mortems

    In an organization that delivers a software service, almost all R&D time goes toward building stuff. We figure out what the customer needs, we decide how to represent their need as software, and we proceed to build that software. After we repeat this cycle enough times, we find that we’ve accidentally ended up with a complex system.

    Inevitably, by virtue of its complexity, the system exhibits behaviors that we didn’t design. These behaviors are surprises, or – often – problems. Slowdowns, race conditions, crashes, and so on. Things that we, as the designers, didn’t anticipate, either because we failed to consider the full range of potential interactions between system components, or because the system was exposed to novel and unpredictable inputs (i.e. traffic patterns). Surprises emerge continuously, and most couldn’t have been predicted a priori from knowledge of the system’s design.

    R&D teams, therefore, must practice 2 distinct flavors of engineering. Prescriptive engineering is when you say, “What are we going to build, and how?”, and then you execute your plan. Teams with strong prescriptive engineering capabilities can deliver high-quality features fast. And that is, of course, indispensable.

    But prescriptive engineering is not enough. As surprises emerge, we need to spot them, understand them, and explain them. We need to practice descriptive engineering.

    Descriptive engineering is usually an afterthought

    Most engineers rarely engage with production surprises.

    We’re called upon to exercise descriptive engineering only in the wake of a catastrophe or a near-catastrophe. Catastrophic events bring attention to the ways in which our expectations about the system’s behavior have fallen short. We’re asked to figure out what went wrong and make sure it doesn’t happen again. And, when that’s done, to put the issue behind us so we can get back to the real work.

    In fact, descriptive engineering outside the context of a catastrophe is unheard of in most places. Management tends to see all descriptive engineering as rework: a waste of time that could have been avoided had we just designed our system with more forethought in the first place.

    On the contrary. To quote the late, lamented Dr. Richard Cook:

    The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these [flaws] are individually insufficient to cause failure they are regarded as minor factors during operations. … The failures change constantly because of changing technology, work organization, and efforts to eradicate failures.

    How Complex Systems Fail, #4

    A complex system’s problems are constantly shifting, recombining, and popping into and out of existence. Therefore, descriptive engineering – far from rework – is a fundamental necessity. Over time, the behavior of the system diverges more and more from our expectations. Descriptive engineering is how we bring our expectations back in line with reality.

    In other words: our understanding of a complex system is subject to constant entropic decay, and descriptive engineering closes an anti-entropy feedback loop.

    Where descriptive engineering lives

    Descriptive engineering is the anti-entropy that keeps our shared mental model of the system from diverging too far from reality. As such, no organization would get very far without exercising some form of it.

    But, since descriptive engineering effort is so often perceived as waste, it rarely develops a nucleus. Instead, it arises in a panic, proceeds in a hurry, and gets abandoned half-done. It comes in many forms, including:

    • handling support tickets
    • incident response
    • debugging a broken deploy
    • performance analysis

    In sum: the contexts in which we do descriptive engineering tend to be those in which something is broken and needs to be fixed. The understanding is subservient to the fix, and once the fix is deployed, there’s no longer a need for descriptive engineering.

    Moreover, since descriptive engineering usually calls for knowledge of the moment-to-moment interactions between subsystems in production, and between the overall system and the outside world, this work has a habit of being siphoned away from developers toward operators. This siphoning effect is self-reinforcing: the team that most often practices descriptive engineering will become the team with the most skill at it, so they’ll get assigned more of it.

    This is a shame. By adopting the attitude that descriptive engineering need only occur in response to catastrophe, we deny ourselves opportunities to address surprises before they blow up. We’re stuck waiting for random, high-profile failures to shock us into action.

    What else can we do?

    Instead of doing descriptive engineering only in response to failures, we must make it an everyday practice. To quote Dr. Cook again,

    Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure. Put another way, there are many more failure opportunities than overt system accidents.

    How Complex Systems Fail, #3

    We won’t ever know in advance which of the many small failures latent in the system will align to create an accident. But if we cultivate an active and constant descriptive engineering practice, we can try to make smart bets and fix small problems before they align to cause big problems.

    What would a proactive descriptive engineering practice look like, concretely? One can imagine it in many forms:

    • A dedicated team of SREs.
    • A permanent cross-functional team composed of engineers familiar with many different parts of the stack.
    • A cultural expectation that all engineers spend some amount of their time on descriptive engineering and share their results.
    • A permanent core team of SREs, joined by a rotating crew of other engineers. Incidentally, this describes the experimental team I’m currently leading IRL, which is called Production Engineering.

    I have a strong preference for models that distribute descriptive engineering responsibility across many teams. If the raison d’être of descriptive engineering is to maintain parity between our expectations of system behavior and reality, then it makes sense to spread that activity as broadly as possible among the people whose expectations get encoded into the product.

    In any case, however we organize the effort, the main activities of descriptive engineering will look much the same. We delve into the data to find surprises. We pick some of these surprises to investigate. We feed the results of our investigations back into the development pipeline. And we do this over and over.

    It may not always be glamorous, but it sure beats the never-ending breakdown.

    Big Problems and Small Problems under load

    In a complex system, like a SaaS product, you have countless interdependent things all happening at every moment. On a gross scale: means of production go in; value comes out. But under the hood, all sorts of interactions are going on, at various rates and latencies, with various exit conditions, stacked to various heights of indirection. And the rules for these interactions change with every deploy.

    As load gets added to the system, you start to experience 🏋️‍♀️🔥🪦 load problems. The system just can’t keep up with traffic, and it starts to break down under various circumstances, in all kinds of more or less spectacular ways.

    At the onset of load problems, it feels like everything is breaking at once. Like some threshold was breached and we’ve entered a new domain. We always had problems, but now we have a new Big Problem.

    Big Problems, though, are always composed of smaller problems. Big Problems conditioned by load are not atomic novel phenomena. There’s a knot of interdependent processes, and when this knot gets pulled tight by load, it binds up.

    Maybe there was a code deploy 6 weeks ago that inadvertently made this API endpoint’s median latency 50% higher. And maybe that API endpoint holds open an AMQP connection until it finishes. And maybe some customer just spun up 500 instances of a container that hits the endpoint. Now all these tangled Small Problems constrict, resources are exhausted, and the system stops producing.
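
    To put rough numbers on that knot (every figure below is invented for illustration), Little’s law says the number of AMQP connections held open is just arrival rate times latency:

        containers = 500                    # the customer's new containers
        reqs_per_container_per_s = 1.0      # assumed request rate per container
        old_latency_s = 0.200
        new_latency_s = 0.300               # the 50% regression from the old deploy
        pool_limit = 128                    # hypothetical AMQP connection pool size

        arrival_rate = containers * reqs_per_container_per_s   # 500 req/s
        held_before = arrival_rate * old_latency_s             # connections held open before
        held_after = arrival_rate * new_latency_s              # connections held open after

        print(held_before, held_after, held_after > pool_limit)  # -> 100.0 150.0 True

    Each Small Problem is invisible on its own; together they’re the difference between sitting comfortably under the connection pool limit and exhausting it.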

    It’s just these sorts of interdependency knots that cause an increase in the frequency of customer-facing problems as load rises.

    We rely heavily on Big Problems to tell us which Small Problems need fixing. “How did this failure happen? It happened because of this and this and this contributing factor. Which of these can we fix?” And of course the results of this sort of analysis are useful. The Small Problems we identify this way have a crucial unifying feature: we know, because we’ve seen it happen, that they can each be contributing factors to Big Problems.

    But this is not the only way to identify Small Problems. We can go digging in our data and find them, endeavoring to solve them before they get a chance to cause a Big Problem.

    Okay, maybe we can’t prevent every Big Problem this way. And without the gating effect of post-mortem analysis, we’ll likely spend some of our time fixing Small Problems that never would’ve amounted to anything.

    But even if we’re just poking around in the dark, we’ll probably end up preventing some Big Problems, right? And Big Problems are expensive. I think, on balance, it’s worth the effort.

    Somebody ought to be getting paid to be Maxwell’s Demon. Ideally, nearly everybody in the org ought to be spending some of their time investigating mysteries. It ought to be an explicit part of the job.