Post-mortems: content over structure

The value of post-mortems is apparent: failures present opportunities to learn about unexpected behaviors of the system, and learning lets us make improvements to the system’s reliability.

The value of post-mortem documents is much less apparent.

Many R&D orgs will insist that the final draft of a post-mortem document have a particular structure. Common components of this structure include:

  • Start and end time of customer impact
  • Time to detection
  • Time to diagnosis
  • Time to recovery
  • A list of action items, each with a link to a ticket
  • Mitigation action items broken down into categories (e.g. “Prevention,” “Detection,” “Impact”)
  • Specific sections (e.g. “Timeline,” “What went wrong?,” “What can we do better?”)
  • Signoffs/approvals

None of these structural requirements facilitate learning. The benefits of post-mortem analysis come not from the document, but rather from the sense-making processes of which the document is an artifact. In order to understand a given failure, we invent hypotheses and test them against our mental model and the observable facts. In order to choose effective strategies for improvement, we converse and debate. And in order to make any of this matter, we establish accountability for seeing those strategies through.

These social processes are the source of the value of post-mortem analysis. The document is just a souvenir.

But what if you want to do meta-analysis? What if you want to analyze trends in incident characteristics over time, or categorize incidents according to some scheme? Don’t you need structure then?

I suppose you do. But good luck getting any useful information. No matter how much structure you insist on, the data set will be hopelessly noisy. Just try to state a hypothesis that can realistically be tested by a meta-analysis of a single organization’s incident data. I don’t think you can.

But what if structure helps me learn?

If structuring the post-mortem process helps you learn, then by all means! categorize! prompt! But recognize structure as a tool rather than an end in itself. Your learning process may benefit from one kind of structure, while somebody else’s may benefit from a different kind, or from more or less structure altogether. The structure of the learning is what matters, not the structure of the document.

Organizational legibility

A requirement for post-mortem documents to have a specific, consistent structure doesn’t help us learn or improve. So why do we do it?

If you ask me, it’s all about the bureaucratic drive for legibility. Centralized power craves legibility.

Idiosyncratic processes like sense-making and learning are illegible to a command-and-control power structure. They come in diverse and intricate forms, instead of the standardized, codified forms that bureaucracy can parse. In service of legibility, a company’s power structure will insist that the post-mortem process culminate in spreadsheet-ready data items like “customer impact duration,” “time to recovery,” “severity level,” and the like. Centralized power demands these simplifications even if they inhibit learning and improvement. To the bureaucracy, legibility itself is the goal.

As an employee, you probably can’t totally disregard these bureaucratic impositions. But, to the extent you can disregard them, you should. Focus on what really matters: learning and improvement.

Outliers carry information. Don’t leave them on the table

Over a decade ago, I saw this talk by John Rauser. Only recently, though, did I come to realize how incredibly influential this talk has been on my career. Gosh what a great talk! You should watch it.

If you operate a complex system, like a SaaS app, you probably have a dashboard showing a few high-level metrics that summarize the system’s overall state. These metrics (“summary statistics”) are essential. They can reveal many kinds of gross changes (both gross “large scale” and gross “ick”) in the system’s state, over many different time scales. Very useful!

But don’t be misled. Summary statistics reveal certain patterns in the system’s behavior, but they are not identical to the system’s behavior. All summary statistics – yes, even distributions – hide information. They’re lossy. It’s easy to get lulled into the sense that, if an anomaly doesn’t show up in the summary statistics, it doesn’t matter. But a complex system’s behavior is not just curves on a plot. It’s a frothing, many-dimensional vector sum of instant-to-instant interactions.

When you investigate an anomaly in summary statistics, you’re faced with a small number of big facts. Average latency jumped by 20% at such-and-such time. Write IOPS doubled. API server queue depth started rising at some later time. Usually, you “zoom in” from there to find patterns that might explain these changes.

When you instead investigate a specific instance of anomalous behavior, you start with a large number of small facts. A request to such-and-such an endpoint with this-and-that parameter took however many seconds and crashed on line 99 of thing_doer.rb. None of these small facts tell you anything about the system’s overall behavior: this is just a single event among millions or billions or more. But, nevertheless: these small facts can be quite illuminating if you zoom out.

First of all, this probably isn’t the only time a crash like this has ever occurred. Maybe it’s happening multiple times a day. Maybe it happened twice as often this week as it did last week. Maybe it’s happening every time a specific customer makes a specific API request. Maybe that customer is fuming.

And second of all, the reason this event caught our eye in the first place was because it was anomalous. It had some extreme characteristic. Take, for example, a request that was served with very high latency. Perhaps, in the specific anomalous case before us, that extreme latency didn’t cause a problem. But how extreme could it get before it did cause a problem? If it took 20 seconds today, could it take 30 seconds next time? When it hits 30, it’ll time out and throw an error. Or, if multiple requests like this all arrived at the same time, could they exhaust some resource and interfere with other requests?

If the only anomalies you investigate are those that show up in summary statistics, then you’ll only find problems that have already gotten bad enough to move those needles. But if you dig into specific instances of anomalous behavior – “outliers” – then you can often find problems earlier, before they become crises.

5 production surprises worth investigating

As an SRE, I’m a vocal believer in following one’s nose: seeking out surprising phenomena and getting to the bottom of them. By adopting this habit, we can find and fix many classes of problems before they turn into incidents. Over time, this makes things run much smoother.

But what is surprising? How do we decide what to ignore and what to chase down?

Here are 5 kinds of surprise that are worth digging into in production data.

Keep in mind that this list is based on one SRE’s experience. Someone else, with different intuitions about system behavior, would make a different list. And that list would be more valuable to them than this list.

The most important thing is not that we always pick the right surprises to investigate. It’s that we make time, consistently, to seek out and explain surprises.

Anyway. Here’s my list.

1. Long running requests

Pick a time window. Sort all the requests during that window descending by duration. At the top of this list, you’ll find classes of requests that spent an extraordinarily long time in your system.
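As a concrete illustration, here’s a minimal Go sketch of that sorting step. It’s not from any particular tool; it assumes you’ve exported the window’s requests as headerless CSV rows of method,path,duration_ms, so adapt the column handling to whatever your log pipeline actually emits.

    package main

    import (
        "encoding/csv"
        "fmt"
        "os"
        "sort"
        "strconv"
    )

    func main() {
        // Read access-log rows of the form: method,path,duration_ms
        rows, err := csv.NewReader(os.Stdin).ReadAll()
        if err != nil {
            panic(err)
        }
        // Sort descending by duration, so the slowest requests rise to the top.
        sort.Slice(rows, func(i, j int) bool {
            di, _ := strconv.ParseFloat(rows[i][2], 64)
            dj, _ := strconv.ParseFloat(rows[j][2], 64)
            return di > dj
        })
        n := 20
        if len(rows) < n {
            n = len(rows)
        }
        for _, r := range rows[:n] {
            fmt.Printf("%8sms  %s %s\n", r[2], r[0], r[1])
        }
    }

Pipe the exported CSV into it and the 20 slowest requests come out on top.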

Were you expecting to see these specific kinds of requests? If not, then congratulations: you’re surprised. And this is quite likely a surprise worth investigating, because

  1. while these slow requests are running, they’re eating up resources (CPU, memory, sockets, and so on), which gives them an outsized potential to degrade the performance of other, unrelated requests.
  2. without intervention, the underlying problem will probably get worse over time. Why wouldn’t it? It has already gotten this bad. And if it gets bad enough, it will eventually cause customer pain.
  3. many of the phenomena contributing to this slowness probably affect a variety of requests. So, if you can make these requests perform better, you may be able to make other requests perform better too.

We don’t have to rigorously analyze and optimize every slow request that goes through the system. Just by coming to understand the issues that cause our slowest requests to be slow, we often find opportunities to make high-impact improvements and mitigate the risk of ending up on an incident call later.

2. Saturation

Saturation is when units of work are waiting in a queue to be processed – be they web requests, or IP packets, or threads, or what have you.

To the extent we’re trying to serve each request as quickly as possible, saturation anywhere is waste. We shouldn’t expect any saturation. Or, at least, we should strive to thoroughly understand every source of it. If we don’t understand it, it will probably get worse over time.

Find every metric that represents a saturation. Put them all on one dashboard. Look at that dashboard every day.

3. Crashes

Every time an exception bubbles up to the top of the stack, that’s unexpected. If it were expected, then the code would have caught and handled the exception.

A tool like Sentry or Datadog Error Tracking (I mention these specific products only because I know them, not because they’re necessarily the best) lets you classify crashes, and see which ones are new or getting more frequent. Granted: it’s often the case that there are far too many classes of crash to investigate all or even most of them. But this shouldn’t stop us from chasing down some of them, according to our intuition about their relative importance.

We have imperfect information until we get our hands dirty, so we won’t always end up picking the highest-impact investigations to undertake. But fixing crashes is always better than ignoring them, and by dividing up the extant crashes into groups and prioritizing one group at a time, we can create some very useful guarantees.

For example, if we find all the crashes that correspond to deadlocks, and fix those, then we guarantee that our system has no deadlocks. Thereafter, problems caused by deadlocks – which tend to be subtle, elusive, and therefore time-consuming to solve – can no longer quietly sow chaos.

4. Utilization spikes

It’s expected for a utilization metric (for example, network bandwidth or memory utilization) to move around a lot. But it’s not expected for utilization to exhibit large spikes. When this happens, it means some recurring event is using more than its fair share of some resource.

When utilization hits 100%, things will degrade catastrophically and immediately. So if you have utilization spikes representing 20% of capacity, then steady-state utilization only has to be 80% before catastrophic degradations start occurring. Or, if the spikes get worse, and now they’re 30% of capacity, then 70% becomes the new safe limit. We’ll have to waste at least 30% of our capacity by leaving it idle in order to prevent outages.

So why not understand the spikes now, before they get worse?

5. Correlation between latency and throughput

When we see recurring swings in latency metrics, it’s easy to say to ourselves, “Yeah, well, when we’re under more load, things slow down.” But… why? Systems don’t always behave like that! They do it when there’s a bottleneck. If the system were behaving as expected and no resources were exhausted, then the latency graph would be more or less flat.

In other words, if traffic tends to be higher mid-day on weekdays, and latency also tends to rise mid-day on weekdays, then you’re likely saturating at least one resource. Would you rather find that bottleneck now, or during an incident?

I hope this list is a good place to start if you’re looking to dig up problems. But, ultimately, you must be the judge of what’s surprising to you. Find something you can’t explain. Explain it. Repeat.

Platform teams don’t need to act like companies

Lately you see a lot of software company R&D teams organized around internal products. The Search Team provides a Search service and its “customers” are the teams whose code consumes that service. The Developer Productivity Team’s product is a suite of tools for managing local development environments and running tests, and its “customers” are the developers who use those tools. And so on.

The idea behind this organizational trick seems to be that product-oriented teams will act like a company. They’ll make better strategic decisions because their goals will be aligned with those of their internal customers.

This is pretty silly.

If a company satisfies the needs of its customers and potential customers, it can grow and thrive and put resources into attracting more revenue. If a company fails to satisfy customer needs, it loses out to competitors. As a company changes its strategy, the market provides feedback. These dynamics don’t exist for an internal team – especially a platform (or “infrastructure,” or “ops”) team.

When asked to think like a company in this way, a platform team will define their product as something like, “a reliable, performant, and secure platform for running software products.” Fine – but think of all the things that a company can do that you can’t do:

  • You can’t attract new customers
  • You can’t pivot to a different product
  • You can’t focus on a different market segment
  • You can’t sacrifice some set of customers to attract another

If you try to “think like a company” as a platform team, you will try to do what successful tech companies do: launch attractive features. But you can waste a lot of time that way, because the most important feature of a software platform is boringness.

If the platform team must act like a company, then it should act like a struggling company that’s desperate to keep its last few enterprise customers. This implies a very different strategy than that of a startup. Change as little as possible and put the bulk of your energy into efforts that increase boringness: documenting, simplifying, and fixing. Don’t build anything exciting if you can avoid it, and if you must build something, build it in service of boringness.

Fix tomorrow’s problems by fixing today’s problems

A bug in our deployment system causes O(N²) latency with respect to the number of deploys that have been performed. At first, it’s too minuscule to notice. But the average deploy latency grows over time. Eventually, deploys start randomly timing out. The deploy pipeline grinds to a halt, and it becomes an emergency.

Or maybe, if we think critically about the deploy latency time series soon enough, it might be obvious well in advance that something’s getting worse. We can fix this problem before it becomes a crisis. But in order to see it, we have to look. And we have to give ourselves time to go down the rabbit hole.

An API server has an edge case that leads to unconstrained memory usage. At first, this edge case only gets hit occasionally, and the API server’s memory usage stays well below capacity. But, as usage patterns evolve, we start to hit this bug more frequently, with larger and larger allocations of memory. For a while, we’re still below the OOMkill threshold. Once we start hitting that threshold, things get ugly. If we still continue to ignore it, then eventually, things will get so ugly that we’ll have to drop what we’re doing and fix this bug.

We had opportunities to see this coming. Depending on our willingness to dig in and investigate surprising phenomena, we could have discovered this problem when the OOMkills started, or even before they started – when these blips were just spikes on a memory graph.

A cron job runs every 30 minutes, and while it runs, it holds a database lock. When it’s first deployed, the cron job doesn’t have much to do, so it runs fast, and no one suffers. Over months, though, the cron job grows sluggish. It just has more work to do. Query pileups start to occur every 30 minutes. We start seeing significant impact on the latency of our application. And, one day, there’s an outage.

We’ll wish we’d dug in back when it was just a slow cron job. Or even when it was just query spikes.

You can prevent many things from turning into fires, but you need space. Space to be curious, to investigate, to explain your findings to yourself and others.

Suppose you spent a week looking for trouble like this, and you only happened to find 1 issue out of these 3. That’s still great, right? Compared to the cost of letting it become a disruption later?

When a system fails, it’s silly to blame practitioners for not seeing the signs. But that doesn’t mean we shouldn’t put in a serious effort to see the signs. If you give yourself space to follow the Huh!, you get opportunities to fix problems before they get worse.

Huh! as a signal

Every time our system fails, and we go to analyze the failure, we find ourselves saying things like “We didn’t know X was happening,” “we didn’t know Y could happen,” and so on. And it’s true: we didn’t know those things.

We can never predict with certainty what the next system failure will be. But we can predict, because painful experience has taught us, that some or all of the causes of that failure will be surprising.

We can use that!

When we go looking at data (and by “data” I mostly mean logs, traces, metrics, and so on, but data can be many things), sometimes we see something weird, and we go like, Huh!. That Huh! is a signal. If we follow that Huh! – get to the bottom of it, figure it out, make it not surprising anymore – two things happen. First, we get a chance to correct a latent problem which might some day contribute to a failure. And second, we make our mental model that much better.

Of course, any individual Huh! could turn out to be nothing. Perhaps there’s a bug. Perhaps circumstances have shifted, and our expectations no longer line up with reality. Or perhaps it’s just a monitoring blip. We won’t know until we run it down.

But, whatever the shortcomings of any particular investigation, a habit of investigating surprises has many attractive qualities. The main one is that we get to fix problems before those problems get worse, start bouncing off other problems, and cause fires. In other words: our system runs smoother. Consider what that’s worth.

Upstream-local queueing

One day not long ago, as I was looking for trouble in a production system’s telemetry, I came across a puzzling phenomenon. I was examining the load balancer access logs for a particular API endpoint – an endpoint that does essentially nothing and should always return a 200 response within a handful of milliseconds. I saw this:

Metric                      Value    My reaction
10th percentile latency     10ms     Okay,
Median latency              11ms     sure,
75th percentile latency     14ms     fair enough,
90th percentile latency     160ms    mm-h– wait,
99th percentile latency     700ms    What??

“What gives?” I wondered aloud, cocking my head. “What could this request possibly be doing for 700 milliseconds? It has nothing to do.” That’s how I learned about an issue I’m calling upstream-local queueing. It’s a mostly stack-agnostic performance problem, and boy am I glad I found it early, because it has some dire scaling implications.

The problem

I’ll spare you a recapitulation of the head-scratching and data fumbling that ensued. Upstream-local queueing is when an upstream – an individual server tasked with responding to requests – is sent more requests than it can handle concurrently. For example, suppose you have a cluster of load balancers, all of which independently distribute requests to upstreams. Each upstream has a maximum concurrency of 3.

Upstream-local queueing happens when, due to random chance, a particular upstream has 3 requests in flight, and happens to receive a 4th. The upstream can’t immediately start processing that 4th request, so it waits in a queue until some capacity frees up. And in the meantime, maybe a 5th request arrives. And so on.

So, even though the rest of the upstream cluster might have plenty of idle capacity available, these requests end up sitting around twiddling their thumbs, accumulating latency.

You’ll notice that I haven’t mentioned any particular technologies or load balancing algorithms yet. That’s because upstream-local queueing is a remarkably general phenomenon. The following system characteristics are sufficient for this problem to emerge:

  1. The upstreams have finite capacity to handle concurrent requests.
  2. The load balancer’s decisions about where to send each request are independent from each other.

Many systems satisfy these criteria.

It’s hard to observe

Upstream-local queueing can be tricky to observe directly. A queue can appear and disappear on any given upstream within a matter of milliseconds: far shorter than the time scales on which metrics tend to be collected. ULQ’s contribution to request latency therefore appears to be randomly distributed, and mostly 0.

Furthermore, at least in my case, the stack is not instrumented well for this. Upstream-local queueing occurs in somewhat of a black box. It’s an open-source black box, but due to the design of the component that’s handling these requests within the upstream, it’s non-trivial to observe the impact of ULQ.

Further furthermore, the severity of ULQ-caused latency is coupled to utilization, which in most real-world systems is constantly changing. And, furthestmore, unless you’re already in deep trouble, ULQ’s impact tends to be below the noise floor for all but the most painstaking measurement techniques.

The most painstaking measurement techniques

When I first set out to examine the upstream-local queueing phenomenon, I took a highly manual approach:

  1. Pick an upstream
  2. Search for load balancer access logs corresponding to requests that were sent to that upstream
  3. Dump a CSV of those log entries
  4. Run a script against the CSV that uses the timestamps and durations to reconstruct the number of requests in flight to that upstream from instant to instant (a rough sketch of this step follows).
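Here’s a rough sketch of that reconstruction step – not the actual script, just the idea: turn each request into a +1 event at its start timestamp and a −1 event at start plus duration, then replay the events in order. The timestamps and durations below are made up.

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    type event struct {
        at    time.Time
        delta int // +1 when a request arrives, -1 when it finishes
    }

    func main() {
        // Hypothetical log entries for one upstream: each request's start time and duration.
        now := time.Now()
        starts := []time.Time{now, now.Add(2 * time.Millisecond), now.Add(5 * time.Millisecond)}
        durations := []time.Duration{20 * time.Millisecond, 10 * time.Millisecond, 3 * time.Millisecond}

        // Turn each request into a pair of events, then replay them in time order,
        // counting completions before arrivals that share the same instant.
        events := make([]event, 0, 2*len(starts))
        for i, s := range starts {
            events = append(events, event{s, +1}, event{s.Add(durations[i]), -1})
        }
        sort.Slice(events, func(i, j int) bool {
            if events[i].at.Equal(events[j].at) {
                return events[i].delta < events[j].delta
            }
            return events[i].at.Before(events[j].at)
        })

        inFlight := 0
        for _, e := range events {
            inFlight += e.delta
            fmt.Printf("%s  in-flight: %d\n", e.at.Format("15:04:05.000"), inFlight)
        }
    }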

This was a pain in the ass. But I only had to do it 2 or 3 times before I determined that, yes: this was indeed happening, and it was causing nontrivial latency.

At this point, I was sure that I had found a big problem. But I couldn’t just go fix it. I needed to convince my colleagues. In spite of this problem’s recalcitrance to direct observation, I needed a clear and compelling demonstration.

A computational model

Lucky for me, queueing systems are easy to model!

I spent a day or two building a computational model of the behavior of an upstream under load. The model is on my GitHub. I won’t bore you with the details, but essentially, requests arrive at the upstream at a set interval, and each request takes a random amount of time to execute. If there are already 12 requests in flight, further requests are queued until a slot frees up. We add up the number of microseconds spent queued versus in flight, and voilà: a working model that largely agrees with our real-world observations.
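To give a flavor of what such a model looks like, here’s a stripped-down sketch – not the actual model from my repo. The 50ms mean service time and exponentially distributed durations are assumptions; the structure is just fixed-interval arrivals, a concurrency limit of 12, and a tally of time spent queued.

    package main

    import (
        "fmt"
        "math/rand"
        "sort"
        "time"
    )

    const (
        concurrency = 12                    // max requests in flight per upstream
        meanService = 50 * time.Millisecond // assumed average request duration
        numRequests = 100000
    )

    // simulate sends numRequests at a fixed interval and returns the average
    // time each request spent waiting in the upstream-local queue.
    func simulate(arrivalInterval time.Duration) time.Duration {
        // busyUntil holds the completion time of each of the 12 slots.
        busyUntil := make([]time.Duration, concurrency)
        var totalQueued time.Duration

        for i := 0; i < numRequests; i++ {
            arrival := time.Duration(i) * arrivalInterval
            service := time.Duration(rand.ExpFloat64() * float64(meanService))

            // The request takes whichever slot frees up soonest.
            sort.Slice(busyUntil, func(a, b int) bool { return busyUntil[a] < busyUntil[b] })
            start := arrival
            if busyUntil[0] > arrival {
                start = busyUntil[0]           // all slots busy: wait in the queue
                totalQueued += start - arrival // time spent in upstream-local queue
            }
            busyUntil[0] = start + service
        }
        return totalQueued / numRequests
    }

    func main() {
        // Sweep utilization by varying the arrival rate; each run is one data point.
        for _, util := range []float64{0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95} {
            // At utilization u, requests arrive every meanService/(concurrency*u).
            interval := time.Duration(float64(meanService) / (concurrency * util))
            fmt.Printf("utilization %.0f%%: average ULQ latency %v\n", util*100, simulate(interval))
        }
    }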

Here’s what the model told me:

In the graph above, each point represents a run of the simulation with a different average request rate. As you can see, the average number of milliseconds spent by requests in the upstream-local queue is tightly correlated to utilization, and it grows more or less exponentially.

This is a huge problem! As more capacity is used, requests experience, on average, exponentially more latency:

% capacity used    Average latency due to ULQ
40%                0.29ms
50%                2.22ms
60%                11.4ms
70%                38.4ms
80%                127ms
90%                461ms
95%                1372ms

And remember: this is just on average. 90th- and 99th-percentile latencies can climb to unacceptable levels far sooner.

What’s worse, ULQ adds roughly the same latency to every request, regardless of how long that request would otherwise take. If the average added latency is, say, 10ms, then a request that would normally take 1000ms will instead take 1010ms, for a slowdown of 1%. But a request that would normally take 5ms will take on average 15ms: a 200% slowdown. This means more requests sitting around in your stack eating up resources for no good reason. It also means that, if clients of your service tend to make many requests in sequence (as a web browser does), overall user experience can suffer drastically before the problem even looks that bad.

What to do about it

As I said before, this is a quite general problem. Switching web servers won’t fix it, nor will scaling up. Switching from random to round-robin load balancing, or vice versa, will not fix it. There are 3 classes of solution.

The first class of solution is the dumbest. But hey, maybe you need your upstream-local queueing problem fixed now and you don’t have time to be smart. In this case, here’s what you do: set a threshold, and meet it by keeping enough of your capacity idle. Referring to the table above, if we decided on a threshold of 11ms average ULQ latency, then we’d need to keep at least 40% of our capacity idle at all times.

I told you it was dumb. But it is easy. The other two solutions are less easy.

The second solution is to reduce your application’s latency variance. If some requests take 10 milliseconds and others take 30000, then upstream-local queueing rears its ugly head. If, instead, all your requests take between 30 and 35 milliseconds (or between 3 and 3.5 seconds, for that matter), its effect is much less pronounced. By hacking away at the long tail of your latency distribution, you may be able to push the worst effects of ULQ further to the right-hand-side of the graph. But, at the end of the day, exponential growth is exponential growth. It’s not really a fix.

The best thing you can do, of course, is use a more sophisticated load balancing algorithm. This requires load balancing software that supports one. If, for example, you use a least-outstanding-requests algorithm, then upstream-local queueing simply won’t occur until you’ve exhausted all of your upstream capacity. It ceases to be a scaling problem.

How to tell how bad ULQ is in your stack

For a quick and dirty answer to the question “How much latency is ULQ contributing in my system?” you can make a simple graph dashboard. Take the 90th percentile latency as measured by the load balancer, and subtract the 90th percentile latency as measured by the upstream.
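If you’d rather compute the number directly than eyeball a dashboard, here’s a toy sketch of the same arithmetic. The sample latencies are made up; in practice you’d pull the two distributions for the same time window out of your telemetry.

    package main

    import (
        "fmt"
        "math"
        "sort"
    )

    // p90 returns an approximate 90th-percentile value (nearest-rank method).
    // Assumes samples is non-empty.
    func p90(samples []float64) float64 {
        sorted := append([]float64(nil), samples...)
        sort.Float64s(sorted)
        rank := int(math.Ceil(0.9*float64(len(sorted)))) - 1
        return sorted[rank]
    }

    func main() {
        // Made-up latencies (ms) for the same window, from the two vantage points.
        atLoadBalancer := []float64{11, 12, 14, 15, 18, 25, 40, 90, 160, 700}
        atUpstream := []float64{10, 11, 12, 13, 14, 15, 16, 18, 20, 22}
        fmt.Printf("rough ULQ signal: %.0fms\n", p90(atLoadBalancer)-p90(atUpstream))
    }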

If these curves grow and shrink along with your throughput, you probably have an upstream-local queueing problem. And if the peaks are getting higher, that problem is getting worse.

The numbers resulting from this calculation are not a rigorous measurement of anything in particular. You can’t really add or subtract percentiles. But it’s often a very easy calculation to do, and as long as you don’t make inferences based on the values of the numbers – just the shapes of the curves – you can get some quick confidence this way before you proceed with a deeper investigation. And then you can fix it.

Don’t build Cities of Nowhere

I just finished reading Seeing Like A State: How Certain Schemes to Improve the Human Condition Have Failed by James C. Scott (full text online). I highly recommend it. Through examples ranging from Soviet collectivization to the construction of Brasilia, the book argues that grand, centralized planning efforts in the high modernist tradition are all doomed to failure. One simply can’t substitute pure reason – no matter how beautiful and internally consistent – for local human decision-making informed by direct experience.

To take one striking anecdote, Le Corbusier spent some time lobbying Soviet intelligentsia to implement his redesign of Moscow. However:

Stalin’s commissars found his plans for Moscow as well as his project for the Palace of Soviets too radical. The Soviet modernist El Lissitzky attacked Le Corbusier’s Moscow as a “city of nowhere, … [a city] that is neither capitalist, nor proletarian, nor socialist, … a city on paper, extraneous to living nature, located in a desert through which not even a river must be allowed to pass (since a curve would contradict the style).” As if to confirm El Lissitzky’s charge that he had designed a “city of nowhere,” Le Corbusier recycled his design virtually intact—aside from removing all references to Moscow—and presented it as La ville radieuse, suitable for central Paris.

Seeing Like A State – James C. Scott
[Image: Le Corbusier's vision for the ideal Moscow – symmetric, vertical, orderly.]
[Image: A map of actual Moscow – asymmetric, spread out across either side of the Moskva, disorderly.]

In Scott’s book, this pattern plays out over and over. Planners, relying heavily on what they imagine are universal principles, produce designs for human life that are nevertheless completely at odds with how humans actually live and work. These designed spaces possess a symmetric, holistic beauty which blinds their creators to the needs of the infinitely complex human communities that are meant to populate them. The planned city ultimately fails to improve the human condition, since improving the human condition is one of the many considerations which must bow to the planner’s aesthetic.

Toward the end of the book – although this is only a short passage and certainly not the thrust – Scott gives 4 rules of thumb for development planning. Building up a SaaS product is clearly different in many ways (not least of which is the stakes) from planning human development. But the parallels got me thinking in engineering terms, and I find that these rules also work quite well as rules of thumb for making changes to a complex software system. By following them, we can mostly avoid wasting effort on huge endeavors that end up being Cities of Nowhere.

1. Take small steps

In an experimental approach to social change, presume that we cannot know the consequences of our interventions in advance. Given this postulate of ignorance, prefer wherever possible to take a small step, stand back, observe, and then plan the next small move.

In software, taking small steps is a challenge of discipline. We work with pure thought-stuff. In principle, we can build whatever we can imagine, so it’s always tempting to solve more of the problem.

But taking small steps has by now become the common wisdom in our industry. Single-feature pull requests are encouraged over massive, multifaceted ones. We roll out features to small groups of users before ramping up. Prototypes and MVPs abound.

Where we still have much to learn from Scott is the “stand back, observe” part. Often, we’re tempted to simply let the machine do the observing for us: if there’s anything wrong with our change, the integration tests will fail, or the deploy will fail, or we’ll get an alert. While such automated signals are indispensable, they’re not sufficient. To understand the real-world effects of our small changes, we have to exercise the further discipline of curiosity. With our particular change in mind, we have to search diligently for evidence of its effects, both intended and unintended, direct and indirect. Observability is not enough – we must actively observe.

2. Favor reversibility

Prefer interventions that can easily be undone if they turn out to be mistakes. Irreversible interventions have irreversible consequences. Interventions into ecosystems require particular care in this respect, given our great ignorance about how they interact. Aldo Leopold captured the spirit of caution required: “The first rule of intelligent tinkering is to keep all the parts.”

It’s pretty clear how this reversibility consideration applies to deploying software and infrastructure. Most changes should be trivially reversible by “rolling back” the deploy. Where this is impossible (such as in certain classes of database migrations and infrastructure changes), we come up with more case-specific back-out plans, or we end up inventing reversible patterns despite ourselves. This amounts to an implicit recognition that our changes can always have unexpected consequences. Which is good!

But, in a socio-technical system, the technology isn’t the only thing that gets altered over time. We must also favor reversibility with respect to the social elements – with respect to procedures, policies, and organizational structures.

One pattern I like for this is an experiment ledger. As a team, you keep a running register (e.g. in a spreadsheet) of the different experiments you’re trying. These can be anything from a new recurring meeting to a new on-call rotation to a rearrangement of your kanban columns. Each experiment in the ledger has one or more check-in dates, when the team will discuss the results of the experiment and decide whether to keep going or abandon the course.

Of course, for many reasons, not every change can be reversible. Not least because even after you reverse something, the taste stays in people’s mouths. But taken together with the rest of Scott’s advice, reversibility is a sensible attribute to strive for.

3. Plan on surprises

Choose plans that allow the largest accommodation to the unforeseen. In agricultural schemes this may mean choosing and preparing land so that it can grow any of several crops. In planning housing, it would mean “designing in” flexibility for accommodating changes in family structures or living styles. In a factory it may mean selecting a location, layout, or piece of machinery that allows for new processes, materials, or product lines down the road.

No matter how much time and sweat you put into the design of a system – no matter how much of the problem you try to solve a priori – there will always be surprises. It’s just the nature of a complex system, and even more so for a system with inputs you can’t control (e.g. customer traffic patterns).

Therefore, watch carefully for both expected and unexpected results. That’s what “plan on surprises” means to me: make small, reversible changes, and in the meantime look closely for new unexpected behaviors that you can investigate and understand. This will give you much more insight into your system’s abilities and constraints than any application of pure thought.

4. Plan on human inventiveness

Always plan under the assumption that those who become involved in the project later will have or will develop the experience and insight to improve on the design.

Write with clarity and humility on the motivations for your designs. Explain what you did and what you chose not to do, and why. The reasons for a particular design are never self-evident, no matter what cosmic beauty they may have in your head.

Taken together, Scott’s rules sketch out a pragmatic philosophy for managing the evolution of complex systems. Instead of grand redesigns that attempt to solve all problems at once, we should prefer targeted, reversible changes. We should change significant things about the system only when we can fully explain why it’s necessary, and afterward we should exercise diligence and curiosity in making sure we understand what we changed.

Descriptive engineering: not just for post-mortems

In an organization that delivers a software service, almost all R&D time goes toward building stuff. We figure out what the customer needs, we decide how to represent their need as software, and we proceed to build that software. After we repeat this cycle enough times, we find that we’ve accidentally ended up with a complex system.

Inevitably, by virtue of its complexity, the system exhibits behaviors that we didn’t design. These behaviors are surprises, or – often – problems. Slowdowns, race conditions, crashes, and so on. Things that we, as the designers, didn’t anticipate, either because we failed to consider the full range of potential interactions between system components, or because the system was exposed to novel and unpredictable inputs (i.e. traffic patterns). Surprises emerge continuously, and most couldn’t have been predicted a priori from knowledge of the system’s design.

R&D teams, therefore, must practice 2 distinct flavors of engineering. Prescriptive engineering is when you say, “What are we going to build, and how?”, and then you execute your plan. Teams with strong prescriptive engineering capabilities can deliver high-quality features fast. And that is, of course, indispensable.

But prescriptive engineering is not enough. As surprises emerge, we need to spot them, understand them, and explain them. We need to practice descriptive engineering.

Descriptive engineering is usually an afterthought

Most engineers rarely engage with production surprises.

We’re called upon to exercise descriptive engineering only in the wake of a catastrophe or a near-catastrophe. Catastrophic events bring attention to the ways in which our expectations about the system’s behavior have fallen short. We’re asked to figure out what went wrong and make sure it doesn’t happen again. And, when that’s done, to put the issue behind us so we can get back to the real work.

In fact, descriptive engineering outside the context of a catastrophe is unheard of most places. Management tends to see all descriptive engineering as rework: a waste of time that could have been avoided had we just designed our system with more forethought in the first place.

On the contrary. To quote the late, lamented Dr. Richard Cook:

The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these [flaws] are individually insufficient to cause failure they are regarded as minor factors during operations. … The failures change constantly because of changing technology, work organization, and efforts to eradicate failures.

How Complex Systems Fail, #4

A complex system’s problems are constantly shifting, recombining, and popping into and out of existence. Therefore, descriptive engineering – far from rework – is a fundamental necessity. Over time, the behavior of the system diverges more and more from our expectations. Descriptive engineering is how we bring our expectations back in line with reality.

In other words: our understanding of a complex system is subject to constant entropic decay, and descriptive engineering closes an anti-entropy feedback loop.

Where descriptive engineering lives

Descriptive engineering is the anti-entropy that keeps our shared mental model of the system from diverging too far from reality. As such, no organization would get very far without exercising some form of it.

But, since descriptive engineering effort is so often perceived as waste, it rarely develops a nucleus. Instead, it arises in a panic, proceeds in a hurry, and gets abandoned half-done. It comes in many forms, including:

  • handling support tickets
  • incident response
  • debugging a broken deploy
  • performance analysis

In sum: the contexts in which we do descriptive engineering tend to be those in which something is broken and needs to be fixed. The understanding is subservient to the fix, and once the fix is deployed, there’s no longer a need for descriptive engineering.

Moreover, since descriptive engineering usually calls for knowledge of the moment-to-moment interactions between subsystems in production, and between the overall system and the outside world, this work has a habit of being siphoned away from developers toward operators. This siphoning effect is self-reinforcing: the team that most often practices descriptive engineering will become the team with the most skill at it, so they’ll get assigned more of it.

This is a shame. By adopting the attitude that descriptive engineering need only occur in response to catastrophe, we deny ourselves opportunities to address surprises before they blow up. We’re stuck waiting for random, high-profile failures to shock us into action.

What else can we do?

Instead of doing descriptive engineering only in response to failures, we must make it an everyday practice. To quote Dr. Cook again,

Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure. Put another way, there are many more failure opportunities than overt system accidents.

How Complex Systems Fail, #3

We won’t ever know in advance which of the many small failures latent in the system will align to create an accident. But if we cultivate an active and constant descriptive engineering practice, we can try to make smart bets and fix small problems before they align to cause big problems.

What would a proactive descriptive engineering practice look like, concretely? One can imagine it in many forms:

  • A dedicated team of SREs.
  • A permanent cross-functional team composed of engineers familiar with many different parts of the stack.
  • A cultural expectation that all engineers spend some amount of their time on descriptive engineering and share their results.
  • A permanent core team of SREs, joined by a rotating crew of other engineers. Incidentally, this describes the experimental team I’m currently leading IRL, which is called Production Engineering.

I have a strong preference for models that distribute descriptive engineering responsibility across many teams. If the raison d’être of descriptive engineering is to maintain parity between our expectations of system behavior and reality, then it makes sense to spread that activity as broadly as possible among the people whose expectations get encoded into the product.

In any case, however we organize the effort, the main activities of descriptive engineering will look much the same. We delve into the data to find surprises. We pick some of these surprises to investigate. We feed the results of our investigations back into the development pipeline. And we do this over and over.

It may not always be glamorous, but it sure beats the never-ending breakdown.

I used a spreadsheet to save myself 12 hours a year grocery shopping

I do most of the grocery shopping for my family. Which is, of course, a privilege. But it’s a time-consuming privilege. From the moment I enter the supermarket to the moment I enter the checkout line, it’s about 30 minutes on average. Multiply that by 49 grocery trips a year, and you get 24.5 hours spent in the supermarket annually.

That’s 24.5 hours a year when I’d rather be doing basically anything else. The whole experience of grocery shopping grates on my nerves: jockeying for position with all the other shopping carts, getting the goddamn cleanup robot in my way, listening to the inane patter of the pre-recorded PA messages. It’s miserable.

All considered, I don’t mind doing it. My family needs to eat, and this is part of the cost of eating.

Buuut…

Why does it take me so long?

When I finally reach the end of the slog and look in my cart, I can’t help but think, there’s no way this should’ve taken me 30 minutes. There’s just not that much stuff.

So I started paying close attention to how I spend my time. Ideally, the process would look like this:

  • Walk to where the milk is
  • Put milk in the cart
  • Walk to where the eggs are
  • Put eggs in the cart
  • Walk to where the bread is
  • … and so on.

In reality, though, the process is much more like this:

  • Look at the list and decide what to get first
  • Walk in the general direction of that item
  • Hopefully find the item there, and put it in the cart
  • Take out my phone and check it off the list
  • Look at the list again and guess which of the remaining items is closest
  • Walk in the general direction of that, hoping I’ll see the right overhead sign and not miss the aisle

This process involves tremendously more context switching than the ideal. Because of deficiencies in my brain, I can’t remember more than 1 or 2 items at a time, and every context switch entails a risk of forgetting what I was in the middle of doing. Compounding this problem, my grocery list isn’t sorted in any particular order. I’m at the mercy of my incomplete knowledge of supermarket geography: if I miss an item because I walked past the aisle it’s in, I have to waste time walking back to that aisle.

This close examination of my time expenditure got me thinking: how much of that 30 minutes could be optimized away?

Better grocery shopping through data

It became clear that, if I wanted to spend less time in the supermarket, my best bet was to calculate the most efficient travel path through the store. Backtracking and searching were huge sources of waste. Instead, I wanted to take a single pass through the store, grabbing everything I needed as I went.

The first solution that popped into my imagination was to write a script that could take my shopping list, compare it to data about the supermarket’s layout, and produce step-by-step instructions. The instructions would look something like this:

  1. Starting from the store entrance,
  2. Go up Aisle 24. Find bread on the right side.
  3. Toward the back, find milk on the left side.
  4. Head to Aisle 14. Find yogurt against the back wall.
  5. Go down Aisle 12. Find coffee on the right side.

I implemented this as a simple Go program called grocery-run. This script contains a hard-coded shopping list, store layout, and arrangement of items. The simple routing algorithm traverses the aisles in order, walking down any aisle that contains an item on the shopping list, and checking that item off. It keeps track of which direction I’m walking so as to order the items appropriately and predict which side of the aisle the item will be found on. The output of grocery-run is a step-by-step list of directions along the lines of the example above.
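To give a sense of the routing logic, here’s a heavily simplified sketch – not the actual grocery-run source, and the aisle numbers and items are made up. It only handles the “visit aisles in order, skip empty ones, alternate walking direction” part.

    package main

    import "fmt"

    func main() {
        // Made-up store data: which aisle each item lives in, with aisles listed
        // in the order I encounter them walking from the entrance.
        aisleOrder := []int{24, 23, 14, 12}
        itemsByAisle := map[int][]string{
            24: {"bread", "milk"},
            14: {"yogurt"},
            12: {"coffee"},
        }
        shoppingList := map[string]bool{"bread": true, "milk": true, "coffee": true}

        goingUp := true // flip walking direction each time I traverse an aisle
        for _, aisle := range aisleOrder {
            var wanted []string
            for _, item := range itemsByAisle[aisle] {
                if shoppingList[item] {
                    wanted = append(wanted, item)
                }
            }
            if len(wanted) == 0 {
                continue // skip aisles with nothing on the list
            }
            direction := "down"
            if goingUp {
                direction = "up"
            }
            fmt.Printf("Go %s Aisle %d:\n", direction, aisle)
            for _, item := range wanted {
                fmt.Printf("  - grab %s\n", item)
                delete(shoppingList, item) // check the item off
            }
            goingUp = !goingUp
        }
    }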

This served decently for a few weeks. Each weekend before going to the supermarket, I would transfer my shopping list from the place where my family keeps it (Trello) into the code. Then I’d run the script and transfer the output to written form in a notebook. I was of course losing time with all these transfers, but my main goal at the time was just to validate the hypothesis that a path-finding algorithm like this could save me time in the long run. If that hypothesis turned out to be correct, then from there I could set my mind to optimizing away much of the overhead of using the script.

At the supermarket, I’d follow the instructions on the paper, noting the locations of any items that weren’t yet in my data set for later entry. This data recording too was a drain on my time, but – I reasoned – one that would disappear as my data set grew over multiple trips.

The initial results were encouraging! Even with the extra time spent noting item locations, I was spending the same amount of shopping time per trip. And I was also learning some important things:

  • The left/right information wasn’t especially useful. Since I was spending no time mucking with my phone (as I was before, when I would refer to the list on Trello many times over the course of a trip), my eyes were free to scan the shelves on both sides of an aisle.
  • The front-of-store/back-of-store information wasn’t that useful either. Because I always knew which item was next, I would see the item as I reached it. Then I could use my own judgement about whether to continue walking to the end of the aisle or turn back.
  • Time spent noting item locations was indeed decreasing: from week to week, the store wasn’t getting rearranged much.
  • A paper shopping list was far more efficient than using my phone. First of all, I didn’t have to go through the distracting and time-consuming exercise of taking my phone out of my pocket, unlocking it, scanning the list, and putting it back. And moreover: since the order of the paper list was aligned with my path through the supermarket, I didn’t even need to spend time or attention checking things off. Everything before the current item had already been grabbed, and everything after it remained.

The next iteration

This grocery-run script served decently, but it ended up being a stepping stone to an even simpler solution. Since I didn’t need the left/right or front-of-store/back-of-store data, it turned out I could replace my whole script with a spreadsheet. When you can replace a script with a spreadsheet, it’s almost always a win.

Here’s the spreadsheet. It does basically the same thing as the grocery-run script. There’s one sheet for the current shopping list, one for the store layout, and one for the arrangement of items within the store.

This spreadsheet makes use of a feature of Google Sheets that I hadn’t leveraged before: VLOOKUP. This function lets you construct a lookup table in one part of your spreadsheet, which can be referenced from another part. I think of this by analogy to JOINs in SQL:

    SELECT shopping_list.item
    FROM shopping_list
    JOIN store_arrangement USING (item)
    JOIN store_layout USING (aisle)
    ORDER BY store_layout.visit_order;

My workflow with this spreadsheet is more or less the same as before:

  1. Transfer shopping list from Trello to spreadsheet
  2. Sort shopping list sheet by the “visit order” column (this step corresponds roughly to running the grocery-run script)
  3. Transfer the resulting list to paper

The final shopping list is a short handwritten list in which the left column represents the aisle, with empty spaces indicating repetition.

Before and after

Before this intervention, my grocery shopping process was circuitous and wasteful.

Now it’s much more streamlined. Much less backtracking, no more futzing with my phone, no more checking items off the list and getting distracted in the meantime.

All told, this saves me about 15 minutes a week of shopping time. Over the course of the year, that’s

(15m) * (49 weeks a year, adjusting for vacation) = 12.25h

I like to think of this as about 1 book: due to this optimization, I get to read about 1 extra book a year, or do something else of equivalent value. Plus, I get to spend 12 hours less each year listening to that insipid Stop & Shop PA loop. I’ll take it!