Post-mortems: content over structure

The value of post-mortems is apparent: failures present opportunities to learn about unexpected behaviors of the system, and learning lets us make improvements to the system’s reliability.

The value of post-mortem documents is much less apparent.

Many R&D orgs will insist that the final draft of a post-mortem document have a particular structure. Common components of this structure include:

  • Start and end time of customer impact
  • Time to detection
  • Time to diagnosis
  • Time to recovery
  • A list of action items, each with a link to a ticket
  • Mitigation action items broken down into categories (e.g. “Prevention,” “Detection,” “Impact”)
  • Specific sections (e.g. “Timeline,” “What went wrong?,” “What can we do better?”)
  • Signoffs/approvals

None of these structural requirements facilitate learning. The benefits of post-mortem analysis come not from the document, but rather from the sense-making processes of which the document is an artifact. In order to understand a given failure, we invent hypotheses and test them against our mental model and the observable facts. In order to choose effective strategies for improvement, we converse and debate. And in order to make any of this matter, we establish accountability for seeing those strategies through.

These social processes are the source of the value of post-mortem analysis. The document is just a souvenir.

But what if you want to do meta-analysis? What if you want to analyze trends in incident characteristics over time, or categorize incidents according to some scheme? Don’t you need structure then?

I suppose you do. But good luck getting any useful information. No matter how much structure you insist on, the data set will be hopelessly noisy. Just try to state a hypothesis that can realistically be tested by a meta-analysis of a single organization’s incident data. I don’t think you can.

But what if structure helps me learn?

If structuring the post-mortem process helps you learn, then by all means: categorize! prompt! But recognize structure as a tool rather than an end in itself. Your learning process may benefit from one kind of structure, while somebody else’s may benefit from a different kind, or from more or less structure altogether. It’s the structure of the learning that matters, not the structure of the document.

Organizational legibility

If you ask me, it’s all about the bureaucratic drive for legibility. Centralized power craves legibility.

A requirement for post-mortem documents to have a specific, consistent structure doesn’t help us learn or improve. So why do we do it?

Idiosyncratic processes like sense-making and learning are illegible to a command-and-control power structure. They come in diverse and intricate forms, instead of the standardized, codified forms that bureaucracy can parse. In service of legibility, a company’s power structure will insist that the post-mortem process culminate in spreadsheet-ready data items like “customer impact duration,” “time to recovery,” “severity level,” and the like. Centralized power demands these simplifications even if they inhibit learning and improvement. To the bureaucracy, legibility itself is the goal.

As an employee, you probably can’t totally disregard these bureaucratic impositions. But, to the extent you can disregard them, you should. Focus on what really matters: learning and improvement.

5 production surprises worth investigating

As an SRE, I’m a vocal believer in following one’s nose: seeking out surprising phenomena and getting to the bottom of them. By adopting this habit, we can find and fix many classes of problems before they turn into incidents. Over time, this makes things run much smoother.

But what is surprising? How do we decide what to ignore and what to chase down?

Here are 5 kinds of surprise that are worth digging into in production data.

Keep in mind that this list is based on one SRE’s experience. Someone else, with different intuitions about system behavior, would make a different list. And that list would be more valuable to them than this list.

The most important thing is not that we always pick the right surprises to investigate. It’s that we make time, consistently, to seek out and explain surprises.

Anyway. Here’s my list.

1. Long running requests

Pick a time window. Sort all the requests during that window descending by duration. At the top of this list, you’ll find classes of requests that spent an extraordinarily long time in your system.
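
Here’s a minimal sketch of what I mean, assuming your load balancer writes one JSON access log entry per line with a duration field (the field names are made up; substitute whatever your logs actually contain):

    import json

    def slowest_requests(log_lines, top_n=20):
        """Return the top_n slowest requests from a pile of access log lines.

        Assumes one JSON object per line with 'method', 'path', and 'duration_ms'
        fields -- adjust for whatever your load balancer actually emits.
        """
        entries = [json.loads(line) for line in log_lines]
        entries.sort(key=lambda e: e["duration_ms"], reverse=True)
        return [(e["duration_ms"], e["method"], e["path"]) for e in entries[:top_n]]

    with open("access.log") as f:
        for duration_ms, method, path in slowest_requests(f):
            print(f"{duration_ms:>8.0f}ms  {method} {path}")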

Were you expecting to see these specific kinds of requests? If not, then congratulations: you’re surprised. And this is quite likely a surprise worth investigating, because

  1. while these slow requests are running, they’re eating up resources (CPU, memory, sockets, and so on), which gives them an outsized potential to degrade the performance of other, unrelated requests.
  2. without intervention, the underlying problem will probably get worse over time. Why wouldn’t it? It has already gotten this bad. And if it gets bad enough, it will eventually cause customer pain.
  3. many of the phenomena contributing to this slowness probably affect a variety of requests. So, if you can make these requests perform better, you may be able to make other requests perform better too.

We don’t have to rigorously analyze and optimize every slow request that goes through the system. Just by coming to understand the issues that cause our slowest requests to be slow, we often find opportunities to make high-impact improvements and mitigate the risk of ending up on an incident call later.

2. Saturation

Saturation is when units of work are waiting in a queue to be processed – be they web requests, or IP packets, or threads, or what have you.

To the extent we’re trying to serve each request as quickly as possible, saturation anywhere is waste. We shouldn’t expect any saturation. Or, at least, we should strive to thoroughly understand every source of it. If we don’t understand it, it will probably get worse over time.

Find every metric that represents a saturation. Put them all on one dashboard. Look at that dashboard every day.
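
What counts as a saturation metric depends on your stack, but the names usually give it away. Here’s a minimal sketch of how you might trawl a metric catalog for candidates; the name patterns and example metrics are guesses, so adjust them for your own conventions:

    import re

    # Names that usually indicate work waiting in a queue somewhere.
    SATURATION_HINTS = re.compile(
        r"queue|backlog|pending|wait|in_flight|runnable",
        re.IGNORECASE,
    )

    def likely_saturation_metrics(metric_names):
        """Filter a metric catalog down to the names that smell like saturation."""
        return sorted(name for name in metric_names if SATURATION_HINTS.search(name))

    catalog = [
        "nginx.requests.queued",
        "db.connection_pool.wait_ms",
        "app.worker.backlog_depth",
        "host.cpu.user_pct",
    ]
    print(likely_saturation_metrics(catalog))  # everything but the CPU gauge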

3. Crashes

Every time an exception bubbles up to the top of the stack, that’s unexpected. If it were expected, then the code would have caught and handled the exception.

A tool like Sentry or Datadog Error Tracking (I mention these specific products only because I know them, not because they’re necessarily the best) lets you classify crashes, and see which ones are new or getting more frequent. Granted: it’s often the case that there are far too many classes of crash to investigate all or even most of them. But this shouldn’t stop us from chasing down some of them, according to our intuition about their relative importance.
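
Even without such a tool, a rough classification goes a long way. Here’s a minimal sketch, assuming you can already extract an exception class name for each crash from your logs:

    from collections import Counter

    def crash_trends(last_week_classes, this_week_classes):
        """Compare crash counts by exception class across two periods.

        Each argument is an iterable of exception class names, one per crash --
        however you extract those from your logs is up to you.
        """
        before, after = Counter(last_week_classes), Counter(this_week_classes)
        new = {cls: n for cls, n in after.items() if cls not in before}
        growing = {
            cls: (before[cls], n)
            for cls, n in after.items()
            if cls in before and n > 2 * before[cls]
        }
        return new, growing

    new, growing = crash_trends(
        ["TimeoutError", "KeyError"],
        ["TimeoutError"] * 5 + ["DeadlockDetected"] * 3,
    )
    print("new:", new)          # {'DeadlockDetected': 3}
    print("growing:", growing)  # {'TimeoutError': (1, 5)}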

We have imperfect information until we get our hands dirty, so we won’t always end up picking the highest-impact investigations to undertake. But fixing crashes is always better than ignoring them, and by dividing up the extant crashes into groups and prioritizing one group at a time, we can create some very useful guarantees.

For example, if we find all the crashes that correspond to deadlocks, and fix those, then we guarantee that our system has no deadlocks. Thereafter, problems caused by deadlocks – which tend to be subtle, elusive, and therefore time-consuming to solve – can no longer quietly sow chaos.

4. Utilization spikes

It’s expected for a utilization metric (for example, network bandwidth or memory utilization) to move around a lot. But it’s not expected for utilization to exhibit large spikes. When this happens, it means some recurring event is using more than its fair share of some resource.

When utilization hits 100%, things will degrade catastrophically and immediately. So if you have utilization spikes representing 20% of capacity, then steady-state utilization only has to reach 80% before catastrophic degradations start occurring. Or, if the spikes get worse, and now they’re 30% of capacity, then 70% capacity becomes the new safe limit. We’ll have to waste at least 30% of our capacity by leaving it idle in order to prevent outages.
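
The headroom arithmetic is simple enough to script. A toy sketch, treating the median of a utilization series as the steady-state baseline and the worst excursion above it as the spike (a crude decomposition, but it makes the arithmetic concrete):

    def safe_steady_state(utilization_pct):
        """Given a utilization series (percent), estimate the steady-state ceiling
        you can afford before a spike would push you to 100%."""
        baseline = sorted(utilization_pct)[len(utilization_pct) // 2]  # median-ish
        spike = max(utilization_pct) - baseline
        return 100 - spike

    # Baseline around 60%, with recurring spikes about 20 points above it:
    print(safe_steady_state([60, 61, 60, 59, 80, 60, 62, 60]))  # -> 80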

So why not understand the spikes now, before they get worse?

5. Correlation between latency and throughput

When we see recurring swings in latency metrics, it’s easy to say to ourselves, “Yeah, well, when we’re under more load, things slow down.” But… why? Systems don’t always behave like that! They do it when there’s a bottleneck. If the system were behaving as expected and no resources were exhausted, then the latency graph would be more or less flat.

In other words, if traffic tends to be higher mid-day on weekdays, and latency also tends to rise mid-day on weekdays, then you’re likely saturating at least one resource. Would you rather find that bottleneck now, or during an incident?
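
If you want a quick sanity check, you can compute the correlation directly. A minimal sketch, assuming you can export hourly throughput and p90 latency series from your metrics system (the numbers below are made up):

    import statistics

    def pearson(xs, ys):
        """Plain Pearson correlation -- no dependencies needed."""
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    # Hourly request counts and p90 latencies (ms):
    throughput = [120, 340, 910, 1500, 1480, 900, 300, 110]
    p90_latency = [45, 48, 80, 140, 130, 85, 50, 44]

    r = pearson(throughput, p90_latency)
    print(f"correlation: {r:.2f}")  # values near 1.0 suggest a bottleneck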

I hope this list is a good place to start if you’re looking to dig up problems. But, ultimately, you must be the judge of what’s surprising to you. Find something you can’t explain. Explain it. Repeat.

Platform teams don’t need to act like companies

Lately you see a lot of software company R&D teams organized around internal products. The Search Team provides a Search service and its “customers” are the teams whose code consumes that service. The Developer Productivity Team’s product is a suite of tools for managing local development environments and running tests, and its “customers” are the developers who use those tools. And so on.

The idea behind this organizational trick seems to be that product-oriented teams will act like a company. They’ll make better strategic decisions because their goals will be aligned with those of their internal customers.

This is pretty silly.

If a company satisfies the needs of its customers and potential customers, it can grow and thrive and put resources into attracting more revenue. If a company fails to satisfy customer needs, it loses out to competitors. As a company changes its strategy, the market provides feedback. These dynamics don’t exist for an internal team – especially a platform (or “infrastructure,” or “ops”) team.

When asked to think like a company in this way, a platform team will define their product as something like, “a reliable, performant, and secure platform for running software products.” Fine – but think of all the things that a company can do that you can’t do:

  • You can’t attract new customers
  • You can’t pivot to a different product
  • You can’t focus on a different market segment
  • You can’t sacrifice some set of customers to attract another

If you try to “think like a company” as a platform team, you will try to do what successful tech companies do: launch attractive features. But you can waste a lot of time that way, because the most important feature of a software platform is boringness.

If the platform team must act like a company, then it should act like a struggling company that’s desperate to keep its last few enterprise customers. This implies a very different strategy than that of a startup. Change as little as possible and put the bulk of your energy into efforts that increase boringness: documenting, simplifying, and fixing. Don’t build anything exciting if you can avoid it, and if you must build something, build it in service of boringness.

Fix tomorrow’s problems by fixing today’s problems

A bug in our deployment system causes O(N²) latency with respect to the number of deploys that have been performed. At first, it’s too minuscule to notice. But the average deploy latency grows over time. Eventually, deploys start randomly timing out. The deploy pipeline grinds to a halt, and it becomes an emergency.

Or maybe, if we think critically about the deploy latency time series soon enough, it might be obvious well in advance that something’s getting worse. We can fix this problem before it becomes a crisis. But in order to see it, we have to look. And we have to give ourselves time to go down the rabbit hole.

An API server has an edge case that leads to unconstrained memory usage. At first, this edge case only gets hit occasionally, and the API server’s memory usage stays well below capacity. But, as usage patterns evolve, we start to hit this bug more frequently, with larger and larger allocations of memory. For a while, we’re still below the OOMkill threshold. Once we start hitting that threshold, things get ugly. If we still continue to ignore it, then eventually, things will get so ugly that we’ll have to drop what we’re doing and fix this bug.

We had opportunities to see this coming. Depending on our willingness to dig in and investigate surprising phenomena, we could have discovered this problem when the OOMkills started, or even before they started – when these blips were just spikes on a memory graph.

A cron job runs every 30 minutes, and while it runs, it holds a database lock. When it’s first deployed, the cron job doesn’t have much to do, so it runs fast, and no one suffers. Over months, though, the cron job grows sluggish. It just has more work to do. Query pileups start to occur every 30 minutes. We start seeing significant impact on the latency of our application. And, one day, there’s an outage.

We’ll wish we’d dug in back when it was just a slow cron job. Or even when it was just query spikes.

You can prevent many things from turning into fires, but you need space. Space to be curious, to investigate, to explain your findings to yourself and others.

Suppose you spent a week looking for trouble like this, and you only happened to find 1 issue out of these 3. That’s still great, right? Compared to the cost of letting it become a disruption later?

When a system fails, it’s silly to blame practitioners for not seeing the signs. But that doesn’t mean we shouldn’t put in a serious effort to see the signs. If you give yourself space to follow the Huh!, you get opportunities to fix problems before they get worse.

Huh! as a signal

Every time our system fails, and we go to analyze the failure, we find ourselves saying things like “We didn’t know X was happening,” “we didn’t know Y could happen,” and so on. And it’s true: we didn’t know those things.

We can never predict with certainty what the next system failure will be. But we can predict, because painful experience has taught us, that some or all of the causes of that failure will be surprising.

We can use that!

When we go looking at data (and by “data” I mostly mean logs, traces, metrics, and so on, but data can be many things), sometimes we see something weird, and we go like, Huh!. That Huh! is a signal. If we follow that Huh! – get to the bottom of it, figure it out, make it not surprising anymore – two things happen. First, we get a chance to correct a latent problem which might some day contribute to a failure. And second, we make our mental model that much better.

Of course, any individual Huh! could turn out to be nothing. Perhaps there’s a bug. Perhaps circumstances have shifted, and our expectations no longer line up with reality. Or perhaps it’s just a monitoring blip. We won’t know until we run it down.

But, whatever the shortcomings of any particular investigation, a habit of investigating surprises has many attractive qualities. The main one is that we get to fix problems before those problems get worse, start bouncing off other problems, and cause fires. In other words: our system runs smoother. Consider what that’s worth.

Upstream-local queueing

One day not long ago, as I was looking for trouble in a production system’s telemetry, I came across a puzzling phenomenon. I was examining the load balancer access logs for a particular API endpoint – an endpoint that does essentially nothing and should always return a 200 response within a handful of milliseconds. I saw this:

Metric                     Value    My reaction
10th percentile latency    10ms     Okay,
Median latency             11ms     sure,
75th percentile latency    14ms     fair enough,
90th percentile latency    160ms    mm-h– wait,
99th percentile latency    700ms    What??

“What gives?” I wondered aloud, cocking my head. “What could this request possibly be doing for 700 milliseconds? It has nothing to do.” That’s how I learned about an issue I’m calling upstream-local queueing. It’s a mostly stack-agnostic performance problem, and boy am I glad I found it early, because it has some dire scaling implications.

The problem

I’ll spare you a recapitulation of the head-scratching and data fumbling that ensued. Upstream-local queueing is when an upstream – an individual server tasked with responding to requests – is asked to handle more requests than it can process concurrently. For example, suppose you have a cluster of load balancers, all of which independently distribute requests to upstreams. Each upstream has a maximum concurrency of 3.

Upstream-local queueing happens when, due to random chance, a particular upstream has 3 requests in flight, and happens to receive a 4th. The upstream can’t immediately start processing that 4th request, so it waits in a queue until some capacity frees up. And in the meantime, maybe a 5th request arrives. And so on.

So, even though the rest of the upstream cluster might have plenty of idle capacity available, these requests end up sitting around twiddling their thumbs, accumulating latency.

You’ll notice that I haven’t mentioned any particular technologies or load balancing algorithms yet. That’s because upstream-local queueing is a remarkably general phenomenon. The following system characteristics are sufficient for this problem to emerge:

  1. The upstreams have finite capacity to handle concurrent requests.
  2. The load balancer’s decisions about where to send each request are independent from each other.

Many systems satisfy these criteria.

It’s hard to observe

Upstream-local queueing can be tricky to observe directly. A queue can appear and disappear on any given upstream within a matter of milliseconds: far shorter than the time scales on which metrics tend to be collected. ULQ’s contribution to request latency therefore appears to be randomly distributed, and mostly 0.

Furthermore, at least in my case, the stack is not instrumented well for this. Upstream-local queueing occurs in somewhat of a black box. It’s an open-source black box, but due to the design of the component that’s handling these requests within the upstream, it’s non-trivial to observe the impact of ULQ.

Further furthermore, the severity of ULQ-caused latency is coupled to utilization, which in most real-world systems is constantly changing. And, furthestmore, unless you’re already in deep trouble, ULQ’s impact tends to be below the noise floor for all but the most painstaking measurement techniques.

The most painstaking measurement techniques

When I first set out to examine the upstream-local queueing phenomenon, I took a highly manual approach:

  1. Pick an upstream
  2. Search for load balancer access logs corresponding to requests that were sent to that upstream
  3. Dump a CSV of those log entries
  4. Run a script against the CSV that uses the timestamps and durations to reconstruct the number of requests in-flight to that upstream from instant to instant.
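
Step 4 is the fiddly part. Here’s a minimal sketch of that kind of reconstruction (not the exact script I used; the column names are assumptions):

    import csv

    def in_flight_timeline(csv_path):
        """Reconstruct how many requests were in flight on one upstream, instant to instant.

        Assumes each CSV row has 'start_ts' (epoch seconds) and 'duration_ms'
        columns -- substitute whatever your load balancer logs actually contain.
        """
        events = []  # (timestamp, +1 at request start, -1 at request end)
        with open(csv_path) as f:
            for row in csv.DictReader(f):
                start = float(row["start_ts"])
                end = start + float(row["duration_ms"]) / 1000.0
                events.append((start, 1))
                events.append((end, -1))

        timeline, in_flight = [], 0
        for ts, delta in sorted(events):
            in_flight += delta
            timeline.append((ts, in_flight))
        return timeline

    # Any instant where the count exceeds the upstream's concurrency limit is time
    # that some request spent waiting in the upstream-local queue.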

This was a pain in the ass. But I only had to do it 2 or 3 times before I determined that, yes: this was indeed happening, and it was causing nontrivial latency.

At this point, I was sure that I had found a big problem. But I couldn’t just go fix it. I needed to convince my colleagues. In spite of this problem’s recalcitrance to direct observation, I needed a clear and compelling demonstration.

A computational model

Lucky for me, queueing systems are easy to model!

I spent a day or two building a computational model of the behavior of an upstream under load. The model is on my GitHub. I won’t bore you with the details, but essentially, requests arrive at the upstream at a set interval, and each request takes a random amount of time to execute. If there are more than 12 requests in flight, further requests are queued until slots free up. We add up the number of microseconds spent queued versus in flight, and voilà: a working model that largely agrees with our real-world observations.
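
The real model is in that repo; here’s a minimal sketch of the same idea, just to show how little machinery it takes. (I’ve assumed exponentially distributed service times, which the real model doesn’t necessarily use; the exact numbers depend on the distribution you pick.)

    import random

    def simulate_upstream(arrival_interval_ms, mean_service_ms,
                          n_requests=50_000, max_concurrency=12):
        """Simulate one upstream: requests arrive at a fixed interval, take a random
        amount of time to serve, and queue (FIFO) whenever max_concurrency requests
        are already in flight. Returns the average milliseconds spent queued."""
        completions = []      # completion times of requests already admitted
        total_queued_ms = 0.0

        for i in range(n_requests):
            arrival = i * arrival_interval_ms
            completions = [t for t in completions if t > arrival]  # drop finished work
            if len(completions) < max_concurrency:
                start = arrival                                # a slot is free right now
            else:
                start = sorted(completions)[-max_concurrency]  # wait for a slot to open
            total_queued_ms += start - arrival
            completions.append(start + random.expovariate(1.0 / mean_service_ms))

        return total_queued_ms / n_requests

    # Mean service time 120ms with 12 slots means capacity is one request per 10ms.
    for interval_ms in (25, 17, 13, 11):  # roughly 40%, 60%, 77%, 91% utilization
        print(interval_ms, round(simulate_upstream(interval_ms, 120), 2))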

Here’s what the model told me:

In the graph above, each point represents a run of the simulation with a different average request rate. As you can see, the average number of milliseconds spent by requests in the upstream-local queue is tightly correlated to utilization, and it grows more or less exponentially.

This is a huge problem! As more capacity is used, requests experience, on average, exponentially more latency:

% capacity used    Average latency due to ULQ
40%                0.29ms
50%                2.22ms
60%                11.4ms
70%                38.4ms
80%                127ms
90%                461ms
95%                1372ms

And remember: this is just on average. 90th- and 99th-percentile latencies can climb to unacceptable levels far sooner.

What’s worse, ULQ adds roughly the same latency to every request, regardless of how much work that request actually has to do. If the average added latency is, say, 10ms, then a request that would normally take 1000ms will instead take 1010ms: a slowdown of 1%. But a request that would normally take 5ms will take 15ms on average: three times as long. This means more requests sitting around in your stack eating up resources for no good reason. It also means that, if clients of your service tend to make many individual requests in sequence (like a web browser, for example), overall user experience can suffer drastically before the problem even looks that bad.

What to do about it

As I said before, this is a quite general problem. Switching web servers won’t fix it, nor will scaling up. Switching from random to round-robin load balancing, or vice versa, won’t fix it either. There are 3 classes of solution.

The first class of solution is the dumbest. But hey, maybe you need your upstream-local queueing problem fixed now and you don’t have time to be smart. In this case, here’s what you do: set a threshold, and meet it by keeping enough of your capacity idle. Referring to the table above, if we decided on a threshold of 11ms average ULQ latency, then we’d need to keep at least 40% of our capacity idle at all times.

I told you it was dumb. But it is easy. The other two solutions are less easy.

The second solution is to reduce your application’s latency variance. If some requests take 10 milliseconds and others take 30000, then upstream-local queueing rears its ugly head. If, instead, all your requests take between 30 and 35 milliseconds (or between 3 and 3.5 seconds, for that matter), its effect is much less pronounced. By hacking away at the long tail of your latency distribution, you may be able to push the worst effects of ULQ further to the right-hand-side of the graph. But, at the end of the day, exponential growth is exponential growth. It’s not really a fix.

The best thing you can do, of course, is use a more sophisticated load balancing algorithm. This requires load balancing software that supports one. If, for example, you use a least-outstanding-requests algorithm, then upstream-local queueing simply won’t occur until you’ve exhausted all of your upstream capacity. It ceases to be a scaling problem.

How to tell how bad ULQ is in your stack

For a quick and dirty answer to the question “How much latency is ULQ contributing in my system?” you can make a simple graph dashboard. Take the 90th percentile latency as measured by the load balancer, and subtract the 90th percentile latency as measured by the upstream.
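
In script form, the whole calculation is one subtraction per time bucket. The series below are made up; export real per-minute p90s from wherever you keep them:

    # p90 latency per minute, measured at the load balancer and at the upstreams:
    lb_p90_ms       = [24, 25, 31, 60, 140, 95, 33, 26]
    upstream_p90_ms = [22, 23, 26, 30,  34, 31, 27, 24]

    ulq_proxy_ms = [lb - up for lb, up in zip(lb_p90_ms, upstream_p90_ms)]
    print(ulq_proxy_ms)  # if this curve tracks your throughput curve, suspect ULQ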

If these curves grow and shrink along with your throughput, you probably have an upstream-local queueing problem. And if the peaks are getting higher, that problem is getting worse.

The numbers resulting from this calculation are not a rigorous measurement of anything in particular. You can’t really add or subtract percentiles. But it’s often a very easy calculation to do, and as long as you don’t make inferences based on the values of the numbers – just the shapes of the curves – you can get some quick confidence this way before you proceed with a deeper investigation. And then you can fix it.

Descriptive engineering: not just for post-mortems

In an organization that delivers a software service, almost all R&D time goes toward building stuff. We figure out what the customer needs, we decide how to represent their need as software, and we proceed to build that software. After we repeat this cycle enough times, we find that we’ve accidentally ended up with a complex system.

Inevitably, by virtue of its complexity, the system exhibits behaviors that we didn’t design. These behaviors are surprises, or – often – problems. Slowdowns, race conditions, crashes, and so on. Things that we, as the designers, didn’t anticipate, either because we failed to consider the full range of potential interactions between system components, or because the system was exposed to novel and unpredictable inputs (i.e. traffic patterns). Surprises emerge continuously, and most couldn’t have been predicted a priori from knowledge of the system’s design.

R&D teams, therefore, must practice 2 distinct flavors of engineering. Prescriptive engineering is when you say, “What are we going to build, and how?”, and then you execute your plan. Teams with strong prescriptive engineering capabilities can deliver high-quality features fast. And that is, of course, indispensable.

But prescriptive engineering is not enough. As surprises emerge, we need to spot them, understand them, and explain them. We need to practice descriptive engineering.

Descriptive engineering is usually an afterthought

Most engineers rarely engage with production surprises.

We’re called upon to exercise descriptive engineering only in the wake of a catastrophe or a near-catastrophe. Catastrophic events bring attention to the ways in which our expectations about the system’s behavior have fallen short. We’re asked to figure out what went wrong and make sure it doesn’t happen again. And, when that’s done, to put the issue behind us so we can get back to the real work.

In fact, descriptive engineering outside the context of a catastrophe is unheard of in most places. Management tends to see all descriptive engineering as rework: a waste of time that could have been avoided had we just designed our system with more forethought in the first place.

On the contrary. To quote the late, lamented Dr. Richard Cook:

The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these [flaws] are individually insufficient to cause failure they are regarded as minor factors during operations. … The failures change constantly because of changing technology, work organization, and efforts to eradicate failures.

How Complex Systems Fail, #4

A complex system’s problems are constantly shifting, recombining, and popping into and out of existence. Therefore, descriptive engineering – far from rework – is a fundamental necessity. Over time, the behavior of the system diverges more and more from our expectations. Descriptive engineering is how we bring our expectations back in line with reality.

In other words: our understanding of a complex system is subject to constant entropic decay, and descriptive engineering closes an anti-entropy feedback loop.

Where descriptive engineering lives

Descriptive engineering is the anti-entropy that keeps our shared mental model of the system from diverging too far from reality. As such, no organization would get very far without exercising some form of it.

But, since descriptive engineering effort is so often perceived as waste, it rarely develops a nucleus. Instead, it arises in a panic, proceeds in a hurry, and gets abandoned half-done. It comes in many forms, including:

  • handling support tickets
  • incident response
  • debugging a broken deploy
  • performance analysis

In sum: the contexts in which we do descriptive engineering tend to be those in which something is broken and needs to be fixed. The understanding is subservient to the fix, and once the fix is deployed, there’s no longer a need for descriptive engineering.

Moreover, since descriptive engineering usually calls for knowledge of the moment-to-moment interactions between subsystems in production, and between the overall system and the outside world, this work has a habit of being siphoned away from developers toward operators. This siphoning effect is self-reinforcing: the team that most often practices descriptive engineering will become the team with the most skill at it, so they’ll get assigned more of it.

This is a shame. By adopting the attitude that descriptive engineering need only occur in response to catastrophe, we deny ourselves opportunities to address surprises before they blow up. We’re stuck waiting for random, high-profile failures to shock us into action.

What else can we do?

Instead of doing descriptive engineering only in response to failures, we must make it an everyday practice. To quote Dr. Cook again,

Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure. Put another way, there are many more failure opportunities than overt system accidents.

How Complex Systems Fail, #3

We won’t ever know in advance which of the many small failures latent in the system will align to create an accident. But if we cultivate an active and constant descriptive engineering practice, we can try to make smart bets and fix small problems before they align to cause big problems.

What would a proactive descriptive engineering practice look like, concretely? One can imagine it in many forms:

  • A dedicated team of SREs.
  • A permanent cross-functional team composed of engineers familiar with many different parts of the stack.
  • A cultural expectation that all engineers spend some amount of their time on descriptive engineering and share their results.
  • A permanent core team of SREs, joined by a rotating crew of other engineers. Incidentally, this describes the experimental team I’m currently leading IRL, which is called Production Engineering.

I have a strong preference for models that distribute descriptive engineering responsibility across many teams. If the raison d’être of descriptive engineering is to maintain parity between our expectations of system behavior and reality, then it makes sense to spread that activity as broadly as possible among the people whose expectations get encoded into the product.

In any case, however we organize the effort, the main activities of descriptive engineering will look much the same. We delve into the data to find surprises. We pick some of these surprises to investigate. We feed the results of our investigations back into the development pipeline. And we do this over and over.

It may not always be glamorous, but it sure beats the never-ending breakdown.

Troubleshooting On A Distributed Team Without Losing Common Ground

I work on a team that fixes complex systems under time pressure. My teammates have different skill sets, different priorities, and different levels of expertise. But we all have to troubleshoot and solve problems together.

This is really hard to do effectively. Fortunately for us in the relatively new domain of DevOps, situations like ours have been studied extensively in the last couple decades. We can use the results of this research to inform our own processes and automation for troubleshooting.

One of the most important concepts to emerge from recent teamwork research, common ground, helps us understand why collaborative troubleshooting breaks down over time. This breakdown leads to wasted effort and mistakes, even if the team maintains constant communication in a chat room. But if we extend ChatOps by drawing on some ideas from medical diagnosis, we can make troubleshooting way easier without losing the benefits of fluid team conversation.

Common Ground

Ergonomics researchers D.D. Woods and Gary Klein (the latter of whom I wrote about in What makes an expert an expert?) published a phenomenally insightful paper in 2004 called Common Ground and Coordination in Joint Activity. In it, they describe a particular kind of failure that occurs when people engage in joint cognition: the Fundamental Common Ground Breakdown. Once you learn about the Fundamental Common Ground Breakdown, you see it everywhere. Here’s how the Woods/Klein paper describes the FCGB:

  • Party A believes that Party B possesses some knowledge
  • Party B doesn’t have this knowledge, and doesn’t know they’re supposed to have it.
  • Therefore, they don’t request it.
  • This lack of a request confirms to Party A that Party B has the knowledge.

When this happens, Party A and Party B lose common ground, which Woods & Klein define as “pertinent knowledge, beliefs and assumptions that are shared among the involved parties.” The two parties start making incorrect assumptions about each other’s knowledge and beliefs, which causes their common ground to break down further and further. Eventually they reach a coordination surprise, which forces them to re-synchronize their understanding of the coordinated activity:

[Figure from Common Ground and Coordination in Joint Activity: the cycle of common ground breakdown and coordination surprise]

Seriously, the FCGB is everywhere. Check out the paper.

I’m especially interested in one particular area where an understanding of common ground can help us do better teamwork: joint troubleshooting.

Common Ground Breakdown in Chatroom Troubleshooting

Everybody’s into ChatOps these days, and I totally get it. When a critical system is broken, it’s super useful to get everybody in the same room and hash it out. ChatOps allows everybody to track progress, coordinate activities, and share results. And it also helps to have lots of different roles represented in the room:

  • Operations folks, to provide insight into the differences between the system’s normal behavior and its current state
  • Software engineers, who bring detailed knowledge of the ways subsystems are supposed to work
  • Account managers and product managers and support reps: not just for their ability to translate technical jargon into the customer’s language for status reporting, but also because their understanding of customer needs can help establish the right priorities
  • Q.A. engineers, who can rule out certain paths of investigation early with their intuition for the ways in which subsystems tend to fail

The process of communicating across role boundaries isn’t just overhead: it helps us refine our own understanding, look for extra evidence, and empathize with each other’s perspectives.

But ChatOps still offers a lot of opportunities for common ground breakdown. The FCGB can occur whenever different people interpret the same facts in different ways. Interpretations can differ for many different reasons:

  • Some people have less technical fluency in the system than others. A statement like “OOM killer just killed Cassandra on db014” might change an ops engineer’s whole understanding of the problem, but such a shift could fly under the radar of, say, a support engineer.
  • Some people are multitasking. They may have a stake in the troubleshooting effort but be unable to internalize every detail from the chat room in real time.
  • Some people are co-located. They find it easier to discuss the problem using mouth words or by physically showing each other graphs, thereby adjusting their own shared understanding without transmitting these adjustments to the rest of the team.
  • Some people enter the conversation late, or leave for a while and come back. These people will miss common ground changes that happen during their absence.

These FCGB opportunities all become more pronounced as the troubleshooting drags on and folks become tired, bored, and confused. And when somebody says they’ve lost track of common ground, what do we do? Usually one of two things: we provide a summary of recent events and let the person ask questions until they feel comfortable, or we tell them to read the backlog.

The Q&A approach has serious drawbacks. First of all, it requires somebody knowledgeable to stop what they’re doing and summarize the situation. If people are frequently leaving and entering the chat room, you end up with a big distraction. Second of all, it leaves lots of room for important information to get missed. The Fundamental Common Ground Breakdown happens when somebody doesn’t know what to ask, so fixing it with a Q&A session is kind of silly.

The other way people catch up with the troubleshooting effort is by reading the backlog. This is even more inefficient than Q&A. Here’s the kind of stuff you have to dig through when you’re reading a chat backlog:

[Screenshot: 18 messages of a typical troubleshooting chat backlog]

There’s a lot to unpack there – and that’s just 18 messages! Imagine piecing together a troubleshooting effort that’s gone on for hours, or days. It would take forever, and you’d still make a lot of mistakes. It’s just not a good way to preserve common ground.

So what do we need?

Differential Diagnosis as an Engine of Common Ground

I’ve blogged before about how much I love differential diagnosis. It’s a formalism that doctors use to keep the diagnostic process moving in the right direction. I’ve used it many times in ops since I learned about it. It’s incredibly useful.

In differential diagnosis, you get together with your team in front of a whiteboard – making sure to bring together people from a wide variety of roles – and you go through a cycle of 3 steps:

  1. Identify symptoms. Write down all the anomalies you’ve seen. Don’t try to connect the dots just yet; just write down your observations.
  2. Generate hypotheses. Brainstorm explanations for the symptoms you’ve observed. This is where it really helps to have a good cross-section of roles represented. The more diverse the ideas you write down, the better.
  3. Test hypotheses. Now that you have a list of things that might be causing the problem, you start narrowing down that list by coming up with a test that will prove or disprove a certain hypothesis.

Once you’re done with step #3, you can cross out a hypothesis or two. Then you head back to step #1 and repeat the cycle until the problem is identified.

A big part of the power of differential diagnosis is that it’s written down. Anybody can walk into the room, read the whiteboard, and understand the state of the collaborative effort. It cuts down on redundant Q&A, because the most salient information is summarized on the board. It eliminates inefficient chat log reading – the chat log is still there, but you use it to search for specific pieces of information instead of reading it like a novel. But, most importantly, differential diagnosis cuts down on fundamental common ground breakdowns, because everybody has agreed to accept what’s on the whiteboard as the canonical state of troubleshooting.

Integrating Differential Diagnosis with ChatOps

We don’t want to lose the off-the-cuff, conversational nature of ChatOps. But we need a structured source of truth to provide a point-in-time understanding of the effort. And we (read: I) don’t want to write a whole damn software project to make that happen.

My proposal is this: use Trello for differential diagnosis, and integrate it with the chat through a Hubot plugin. I haven’t written this plugin yet, but it shouldn’t take long (I’ll probably fork hubot-trello and start from there). That way people could update the list of symptoms, hypotheses, and tests on the fly, and they’d always have a central source of common ground to refer to.

In the system I envision, the chat room conversation would be peppered with statements like:

Geordi: hubot symptom warp engine going full speed, but ship not moving

Hubot: Created (symp0): warp engine going full speed, but ship not moving

Beverly: hubot falsify hypo1

Hubot: Falsified (hypo1): feedback loop between graviton emitter and graviton roaster

Geordi: hubot finish test1

Hubot: Marked (test1) finished: reboot the quantum phase allometer

And the resulting differential diagnosis board, containing the agreed-upon state of the troubleshooting effort, might look like this example, with cards labeled to indicate that they’re no longer in play.

What do you think?

Let me know if your organization already has something like this, or has tried a formal differential diagnosis approach before. I’d love to read some observations about your team’s process in the comments. Also, VictorOps has a pretty neat suite of tools that approaches what I have in mind, but I still think a more conceptually structured (not to mention free) solution could be very useful.

Automation is most effective when it’s a team player. By using automation to preserve common ground, we can solve problems faster and more thoroughly, with less frustration and less waste. And that all sounds pretty good to me.

What makes an expert an expert?

Maybe you’ve been here:

You get a phone call in the middle of the night. The new sysadmin (whom you hired straight out of college) is flipping all of her shits because web app performance has degraded beyond the alert threshold. She’s been clicking through page after page of graphs, checking application logs all the way up and down the stack, and just generally cussing up a storm because she can’t find the source of the issue. You open your laptop, navigate straight to overall performance graphs, drill down to database graphs, see a pattern that looks like mutex contention, log in to the database, find the offending queries, and report them to the on-call dev. You do all this in a matter of minutes.

Or here:

You’re trying to teach your dad to play Mario Kart. It’s like “Okay, go forward… no, forward… you have to press the gas – no, that’s fire – press the gas button… it’s the A button… the blue one… Yeah, there you go, okay, you’re going forward now… so… so go around the corner… why’d you stop? Dad… it’s like driving a car, you can’t turn if you’re stopped… so remember, gas is A… which is the blue one…”

Why is it so hard for experts to understand the novice experience? Well, in his book Sources of Power, decision-making researcher Gary Klein presents some really interesting theories about what makes experts experts. His theories give us insight into the communication barriers between novices and experts, which can make us better teachers and better learners.

Mental Simulation

Klein arrived at his decision-making model, the recognition-primed decision model, by interviewing hundreds of experts over several years. According to his research, experts in a huge variety of fields rely on mental simulation. In Sources of Power, he defines mental simulation as:

the ability to imagine people and objects consciously and to transform those people and objects through several transitions, finally picturing them in a different way than at the start.

Klein has never studied sysadmins, but when I read about his model I recognized it immediately. This is what we do when we’re trying to reason out how a problem got started, and it’s also how we figure out how to fix it. In our head, we have a model of the system in which the problem lives. Our model consists of some set of moving parts that go through transitions from one state to another.

If you and your friend are trying to figure out how to get a couch around a corner in your stairwell, your moving parts are the couch, your body, and your friend’s body. If you’re trying to figure out how a database table got corrupted, your moving parts might be the web app, the database’s storage engine, and the file system buffer. You envision a series of transitions from one state to the next. If those transitions don’t get you from the initial state to the final state then you tweak your simulation and try again until you get a solution.

Here’s the thing, though: we’re people. Our brains have a severely limited amount of working memory. In his interviews with experts about their decision making processes, Klein found that there was a pretty hard upper limit on the complexity of our mental simulations:

  • 3 moving parts
  • 6 transitions

That’s about all we get, regardless of our experience or intelligence. So how do experts mentally simulate so much more effectively than novices?

Abstractions

As we gain experience in a domain, we start to see how the pieces fit together. As we notice more and more causal patterns, we build a mental bank of abstractions. An abstraction is a kind of abbreviation that stands in for a set of transitions or moving parts that usually functions as a whole. It’s like the keyboard of a piano: when the piano’s working correctly, we don’t have to think about the Rube Goldberg-esque series of yanks and shoves going on inside it; we press a key, and the corresponding note comes out.

Experts have access to a huge mental bank of abstractions. Novices don’t yet. This makes experts more efficient at creating mental simulations.

When you’re first learning to drive a car, you have to do everything step by step. You don’t have the abstraction bank of an experienced driver. When the driving instructor tells you to back out of a parking space, your procedure looks something like this:

  • Make sure foot is on brake pedal
  • Shift into reverse
  • Release brake enough to get rolling
  • Turn steering wheel (which direction is it when I’m in reverse?)
  • Put foot back on brake pedal
  • Shift into drive

It’s a choppy, nerve-racking sequence of individual steps. But once you practice this a dozen times or so, you start to build some useful abstractions. Your procedure for backing out of a parking space becomes more like:

  • Go backward (you no longer think about how you need to brake, shift, and release the brake)
  • Get facing the right direction
  • Go forward

Once you’ve done it a hundred times, it’s just one step: “Back out of the parking space.”

Now if you recall that problem solving involves mental simulations with at most 3 moving parts and 6 transitions, you’ll see why abstractions are so critical to the making of an expert. Whereas a novice requires several transitions to represent a process, an expert might only need one. The right choice of abstraction allows the expert to hold a much richer simulation in mind, which improves their effectiveness in predicting outcomes and diagnosing problems.

Counterfactuals

Klein highlights another important difference between experts and novices: experts can readily process counterfactuals: explanations and predictions that are inconsistent with the data. This is how experts are able to improvise in unexpected situations.

Imagine that you’re troubleshooting a spate of improper 403 responses from a web app that you admin. You expect that the permissions on some cache directory got borked in the last deploy, so you log in to one of the web servers and tail the access log to see which requests in particular are generating 403s. But you can’t find a single log entry with a 403 error code! You refresh the app a few times in your browser, and sure enough you get a 403 response. But the log file still shows 200 after 200. What’s going on?

If you were a novice, you might just say “That’s impossible” and throw up your hands. But an experienced sysadmin could imagine any number of plausible scenarios to accommodate this counterfactual:

  • You logged in to staging instead of production
  • The 403s are only coming from one of the web servers, and it’s not the one you logged in to
  • 403s are being generated by the load balancer before the requests ever make it to the web servers
  • What you’re looking at in your browser is actually a 200 response with a body that says “403 Forbidden”

Why are experts able to adjust so fluidly to counterfactuals while novices aren’t?

It comes back to abstractions. When experts see something that doesn’t match expectations, they can easily recognize which abstraction is leaking. They understand what’s going on inside the piano, so when they expect a tink but hear a plunk, they can seamlessly jump to a lower level of abstraction and generate a new mental simulation that explains the discrepancy.

Empathizing with novices

By understanding a little about the relationship between abstractions and expertise, we can teach ourselves to see problems from a novice’s perspective. Rather than getting frustrated and taking over, we can try some different strategies:

  1. Tell stories. When Gary Klein and his research team want to understand an expert’s thought process, they don’t use questionnaires or ask the expert to make a flow chart or anything artificial like that. The most effective way to get inside an expert’s thought process is to listen to their stories. So when you’re teaching a novice how to reason about a system, try thinking of an interesting and surprising troubleshooting experience you’ve had with that system before, and tell that story.
  2. Use the Socratic method. Novices need practice at juggling abstractions and digesting counterfactuals. When a novice is describing their mental model of a problem or a potential path forward, ask a hypothetical question or two and watch the gears turn. Questions like “You saw Q happen because of P, but what are some ways we could’ve gotten to Q without P?” or “You expect that changing A will have an effect on B, but what would it mean if you changed A and there was no effect on B?” will challenge the novice to bounce between different layers of abstraction like an expert does.
  3. Remember: your boss may be a novice. Take a moment to look around your org chart and find the nearest novice; it may be above you. Even if your boss used to do your job, they’re a manager now. They may be rusty at dealing with the abstractions you use every day. When your boss is asking for a situation report or an explanation for some decision you made, keep in mind the power of narratives and counterfactuals.

You Know Who’s Smart? Friggin’ Doctors, Man.

Inspired by Steve Bennett’s talk at Velocity 2012 (slides here. I swear it’s a great talk; I didn’t just think he was smart because he’s British), I’ve been trying lately to apply medicine’s differential diagnosis approach to my ops problem solving.

If you’ve ever seen an episode of “House M.D,” you’ll recognize the approach right away.

Problem-Based Learning

Since my girlfriend (partner/common-law fiancée/non-Platonic ladyperson/whatever) is a veterinary student, I end up hearing a lot about medical reasoning. One of her classes in first year was “Problem-Based Learning,” or as I called it, “House D.V.M.”. The format of this class should sound familiar to anyone who’s worked in ops, or dev, or the middle bit of any Venn diagram thereof.

You walk in on Monday and grab a worksheet. This worksheet describes the symptoms of some cat or pug or gila monster or headcrab that was recently treated in the hospital. Your homework: figure out what might be wrong with the animal, and recommend a course of treatment and testing.

On Tuesday, you’re given worksheet number 2. It says what a real vet did, given Monday’s info, and then it lists the results of the tests that the vet ordered. So the process starts over: your homework is to infer from the test results what could be wrong with the animal, and then figure out what tests or treatments to administer next.

This process repeats until Friday, by which point you’ve hopefully figured out what the hell.

When I heard this, I thought it was all very cool. But I didn’t pick up on the parallels with my own work, which are staggering. And what really should have caught my attention, in retrospect, is that this was a course they were taking. They’re teaching a deductive process!

Can We Formalize It? Yes We Can!

In tech, our egos often impede learning. We’re smart and we’ve built a unique, intricate system that nobody else understands as well as we do. “Procedures” and “methodologies” disgust us: it’s just so enterprisey to imagine that any one framework could be applied to the novel, cutting-edge complexities we’re grokking with our enormous hacker brains.

Give it a rest. Humans have been teaching each other how to troubleshoot esoteric problems in complex systems for friggin millennia. That’s what medicine is.

When faced with a challenging issue to troubleshoot, doctors will turn to a deductive process called “differential diagnosis.” I’m not going to describe it in that much detail; if you want more, then tell Steve Bennett to write a book. Or watch a few episodes of House. But basically the process goes like this:

  • Write down what you know: the symptoms.
  • Brainstorm possible causes (“differentials”) for these symptoms.
  • Figure out a test that will rule out (“falsify”) some of the differentials, and perform the test.
  • If you end up falsifying all your differentials, then clearly you didn’t brainstorm hard enough. Revisit your assumptions and come up with more ideas.

This simple process keeps you moving forward without getting lost in your own creativity.

Mnemonics As Brainstorming Aids

The brainstorming step of this deductive process (“writing down your differentials”) is critical. Write down whatever leaps to mind.

Doctors have mnemonic devices to help cover all the bases here. One of the most popular is VINDICATE (Vascular/Inflammatory/Neoplastic/Degenerative/Idiopathic/Congenital/Autoimmune/Traumatic/Endocrine). They go through this list and ask “Could it be something in this category?” The list covers all the systems in the body, so if the doctor seriously considers each of the letters, they’ll usually come up with the right differential (although they may not know it yet).

Vets have a slightly different go-to mnemonic when listing differentials: DAMNIT. There are several different meanings for each letter, but the gist of it is Degenerative, Anomalous, Metabolic, Nutritional, Inflammatory, Traumatic. Besides being a mild oath (my second-favorite kind of oath), this device has the advantage of putting more focus on the trouble’s mode of operation, rather than its location.

These mnemonics are super useful to doctors, and it’s not that hard to come up with your own version. Bennett suggests CASHWOUND (see his slides to find out why).

No Seriously, Try It. It’s Great.

The other day, we were looking at our contribution dashboard and we noticed this (artist’s rendering):

[Graph: a brief dip in donations]

That dip in donations lasted about 10 minutes, and we found it extremely disturbing. So we piled into a conference room with a clean whiteboard, and we started writing down differentials.

  A. Firewall glitch between card processors and Internet
  B. Database failure causing donation pages not to load
  C. Failures from the third-party payment gateway
  D. Long response times from the payment gateway
  E. Errors in our payment-processing application
  F. DNS lookup failures for the payment gateway

Admittedly this is not a very long list, and we could’ve brainstormed better. But anyway, we started trying to pick apart the hypotheses.

We began with a prognostic approach. That means we judged hypothesis (B) to be the most terrifying, so we investigated it first. We checked out the web access logs and found that donation pages had been loading just fine for our users. Phew.

The next hypotheses to test were (C) and (D). Here we had switched to a probabilistic approach — we’d seen this payment gateway fail before, so why shouldn’t it happen again? To test this hypothesis, we checked two sources: our own application’s logs (which would report gateway failures), and Twitter search. Neither turned up anything promising. So now we had these differentials (including a new one devised by my boss, who had wandered in):

  A. Firewall glitch between card processors and Internet
  B. Database failure causing donation pages not to load
  C. Failures from the third-party payment gateway
  D. Long response times from the payment gateway
  E. Errors in our payment-processing application
  F. DNS lookup failures for the payment gateway
  G. Users were redirected to a different site

(E) is pretty severe (if not particularly likely, since we hadn’t deployed the payment-processing code recently), so we investigated that next. No joy — the application’s logs were clean. Next up was (A), but it proved false as well, since we found no errors or abnormal behavior in the firewall logs.

So all we had left was (F) and (G). Finally we were able to determine that a client was A/B testing the donation page by randomly redirecting half of the traffic with JavaScript. So everything was fine.

Throughout this process, I found that the differential diagnosis technique helped focus the team. Nobody stepped on each other’s toes, we were constantly making progress, and nobody had the feeling of groping in the dark that one can get when one troubleshoots without a method.

Try it out some time!