3 questions that will make you a phenomenal rubber duck

As a Postgres reliability consultant and SRE, I’ve spent many hours being a rubber duck. Now I outperform even the incisive bath toy.

“Rubber duck debugging” is a widespread, tongue-in-cheek term for the practice of explaining, out loud, a difficult problem that you’re stumped on. Often, just by putting our troubles into words, we suddenly discover insights that unlock progress. The person we’re speaking to could just as well be an inanimate object, like a rubber duck. Hence the term.

Rubber ducks are great, but a human can add even more value. In this article, I’ll share my 3 favorite questions to ask when someone comes to me feeling stumped in a troubleshooting endeavor. These questions work even when you have no particular expertise in the problem domain. Master them, and you’ll quickly start gaining a reputation as the person to talk to when you’re stuck. This is a great reputation to have!

Question 1: How did you first start investigating this?

As we investigate a problem, our focus shifts from one thing to another to another. We go down one path and forget about others. We zoom in on details and neglect to zoom back out. It’s easy to lose perspective.

“How did you first start investigating this?” works well because, through the act of recounting their journey from initial observation to where they are now, your colleague will often regain perspective they’ve lost along the way. And by asking this particular question, you avoid having to suggest that they may have lost perspective – which could make them defensive.

Even if your colleague hasn’t lost perspective, hearing the story of the investigation so far will help you ask better questions and help them organize their thoughts.

Question 2: What observations have you made?

In troubleshooting a complex problem, it’s easy to forget what you already know. As you go along, you make lots of observations, small and large, interesting and boring, relevant and irrelevant. You can’t hold them all in your head.

When someone’s stuck, it often helps to review their observations. Not theories, not difficulties, not actions: directly observed facts.

Reviewing observations can help in a few different ways:

  • They may be entertaining a hypothesis that clashes with some previously learned (but since forgotten) fact. If so, they can now go ahead and discard that hypothesis.
  • Juxtaposing two observations may suggest a hypothesis that never occurred to them before, because they never held those two observations in their head simultaneously.
  • Listing out their observations may bring to mind something they haven’t looked at yet.

As your colleague recounts their observations, write them down in a numbered list. And, if you can, ask clarifying questions. Questions like “Does X always happen concurrently with Y, or only sometimes?” and “How does this differ from the normal behavior?”

Never underestimate the power of precisely stating the facts.

Question 3: If your hypothesis were wrong, how could we disprove it?

This question is my favorite.

One of the most common ways people get stuck in troubleshooting is tunnel vision. They get a single idea in their head about the cause of the problem, and that becomes all they can think about.

This question, “If your hypothesis were wrong, how could we disprove it?”, flips the script. Instead of racking their brain trying to prove their theory, it gets them thinking about other possibilities. Asking this question can lead to lots of different outcomes, all of which represent progress:

  1. You come up with a way to disprove the hypothesis, and successfully disprove it. This may make your colleague sad for a few hours, but when they come back to the problem, they’ll make huge strides.
  2. You come up with a way to disprove the hypothesis, but fail to disprove it. The hypothesis is thus bolstered, and the next step becomes clear: elaborate a few different versions of it and try to disprove those.
  3. You can’t think of any way to disprove it. This means it’s probably not a hypothesis at all, since it’s not falsifiable. Therefore you must replace it with a new hypothesis. This may feel like a setback, but it’s really the only way forward.

How it fits together

Under the hood, these 3 questions are just different ways of invoking hypothetico-deductive reasoning, which I’ve written about previously (see Troubleshooting On A Distributed Team Without Losing Common Ground and You Know Who’s Smart? Friggin’ Doctors, Man.). I don’t know of any better way to achieve consistent problem-solving results in the face of complexity.

If you’re interested in learning how to apply these techniques in your career or in your organization, I can help. Shoot me an email!

Why transaction order matters, even if you’re only reading

There are 4 isolation levels defined by the SQL standard, and Postgres supports them through the SET TRANSACTION statement. They are:

  • Read uncommitted: It’s possible to read data written by another concurrent uncommitted transaction (this is called a “dirty read”). (Note: this isolation level is not really implemented by Postgres)
  • Read committed: Dirty reads are not possible.
  • Repeatable read: “Nonrepeatable reads” are not possible. A nonrepeatable read is when you do a read once, then you do the same read again and see that data has been modified by another concurrent transaction.
  • Serializable: The strongest isolation level, this guarantees that repeating a read within a transaction will not only always return the same data for each row, but also the same set of rows. And it further guarantees that the end result of any set of concurrent transactions will be a possible outcome of having run those same transactions in serial, in some order.

This last guarantee is one against serialization anomalies. A serialization anomaly is any sequence of events that produces a result that would be impossible if those transactions ran one-at-a-time (“in serial”). This guarantee is tricky to reason about, but the Postgres docs have a great example to demonstrate it.
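
For reference, the isolation level is chosen per transaction, either on BEGIN or with SET TRANSACTION as the first statement inside the transaction. For example:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... statements ...
COMMIT;

-- or, equivalently:
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... statements ...
COMMIT;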

A simple serialization anomaly

The example from the docs goes like this. You have the following table, mytab:

 class | value
-------+-------
     1 |    10
     1 |    20
     2 |   100
     2 |   200

And the following two transactions run concurrently:

(Time flows downward; each step is labeled with the transaction that runs it.)

Transaction A: BEGIN;
Transaction B: BEGIN;
Transaction A: SELECT SUM(value) FROM mytab WHERE class = 1;
               -- The result is 30.
Transaction B: SELECT SUM(value) FROM mytab WHERE class = 2;
               -- The result is 300.
Transaction A: INSERT INTO mytab (class, value) VALUES (2, 30);
Transaction B: INSERT INTO mytab (class, value) VALUES (1, 300);
Transaction A: COMMIT;
Transaction B: COMMIT;

This is a serialization anomaly, since no serial ordering of A and B could produce the same end result:

  • A => B: Transaction B would compute and write 330, not 300.
  • B => A: Transaction A would compute and write 330, not 30.

The explanation is pretty straightforward once you walk through it. Each transaction writes data that the other reads in order to determine what data to write. So you end up with a situation that could never occur if the transactions had been run in serial.

But I also noticed this passage in the docs:

A READ ONLY transaction may be able to release its SIRead locks before completion, if it detects that no conflicts can still occur which could lead to a serialization anomaly. In fact, READ ONLY transactions will often be able to establish that fact at startup and avoid taking any predicate locks. If you explicitly request a SERIALIZABLE READ ONLY DEFERRABLE transaction, it will block until it can establish this fact.

This surprised me because, intuitively, I wouldn’t think that a READ ONLY transaction could participate in a serialization anomaly. How can it be that just reading data concurrently with other transactions could cause a result inconsistent with serial execution? Such a transaction can’t even lock rows. I racked my brains trying to think of such a situation, but came up empty.

A serialization anomaly from a READ ONLY transaction

Lucky for me, back in about 2000, somebody else racked their brains about it too, and produced this 3-page paper describing just such a situation. Right near the beginning, I read:

It has been widely assumed that, under SI [a.k.a. SERIALIZABLE transaction isolation mode], read-only transactions always execute serializably provided the concurrent update transactions are serializable. The reason for this is that all SI reads return values from a single instant of time when all committed transactions have completed their writes and no writes of non-committed transactions are visible. This seems to imply that read-only transactions will not read anomalous results so long as the update transactions with which they execute do not write such results.

“Yeah, exactly!” I thought. But it turns out to be possible, and here’s how it works.

Suppose we have the following table, representing bank account balances.

              Table "public.accounts"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | not null |
value | integer | | not null |
Indexes:
"accounts_pkey" PRIMARY KEY, btree (id)

And suppose further that the following two accounts exist, with zero balance:

# SELECT * FROM accounts;
 id | value
----+-------
  1 |     0
  2 |     0

These two accounts are owned by the same customer, and the sum of their balances must never go negative. If it does go negative, a $1 overdraft fee is assessed. Now consider the following 3 DB transactions:

  • T1: Deposits $20 into account 2.
  • T2: Withdraws $10 from account 1.
  • T3: Retrieves the balances of both accounts and prints them out.

Let’s walk through these transactions and see how a serialization anomaly can emerge:

T1: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
T2: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
T3: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
T2: SELECT id, value FROM accounts WHERE id IN (1, 2);
    -- T2 needs the balances of both accounts in order to know whether their sum is negative. Both are 0.
T1: SELECT value FROM accounts WHERE id = 2;
    -- Still 0.
T1: UPDATE accounts SET value = 20 WHERE id = 2;
    -- Deposits $20 into account 2.
T1: COMMIT;
    -- T2 is now working with an outdated value.
T3: SELECT id, value FROM accounts WHERE id IN (1, 2);
    -- T3 gets values (0, 20).
T3: COMMIT;
    -- T3 sends this result to the printer.
T2: UPDATE accounts SET value = -11 WHERE id = 1;
    -- Withdraws $10 from account 1, charging a $1 overdraft fee.
T2: COMMIT;

When T2 tries to commit, it fails with the following error:

ERROR:  could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during write.
HINT: The transaction might succeed if retried.

Postgres has identified a potential serialization anomaly and prevented it by aborting T2. It’s a serialization anomaly because, if T2 were allowed to commit, then the customer would be charged an overdraft fee for withdrawing $10 despite having a print-out showing a total balance of $20. This could never happen if these transactions were run one-at-a-time in any order:

  • T1 => T2 => T3: print-out has a total balance of $10, and no overdraft fee is charged.
  • T1 => T3 => T2: print-out shows $20, and no overdraft fee is charged.
  • T2 => T1 => T3: print-out shows $9 ($-11 + $20).
  • T2 => T3 => T1: print-out shows $-11.
  • T3 => T1 => T2: print-out shows $0.
  • T3 => T2 => T1: print-out shows $0.

And, as we can also see, this serialization anomaly couldn’t occur if T3 were absent. Either of the possible orderings of T1 and T2 produces an internally consistent result.

Avoiding this problem

If you’re not using the Serializable isolation level, then instead of an error at COMMIT time, you just get a bug. And a scrap of paper that the SEC is going to have some embarrassing questions about.

So you can either use Serializable, or stick with a lower isolation level such as Read Committed and introduce explicit locking. For example, if T2 were to exclusively lock both account rows when reading them, by using SELECT FOR UPDATE, then T1’s update of account 2 would block until T2 committed. This would eliminate the bug.
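
Here’s a minimal sketch of what the Read Committed plus explicit-locking version of T2 might look like (the fee decision itself stays in the application, as in the example above):

BEGIN;  -- default Read Committed
SELECT id, value
FROM accounts
WHERE id IN (1, 2)
FOR UPDATE;  -- locks both rows, so T1's UPDATE of account 2 blocks until we commit
-- application code checks whether the resulting sum goes negative and whether to charge the fee
UPDATE accounts SET value = value - 10 WHERE id = 1;
COMMIT;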

Another fix, using the Repeatable Read isolation level, would be to add an additional row containing the sum of balances for each group of accounts, and require that any transaction updating one of those balances also update the sum row. Whichever transaction tried to update that row last would get an error and be forced to retry the transaction, thus eliminating the bug.
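
A minimal sketch of that approach, using a hypothetical account_groups(group_id, total) table to hold each group’s running sum (the table and names here are illustrative, not from the original example):

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
UPDATE accounts SET value = value - 10 WHERE id = 1;
-- Every balance-changing transaction must also update its group's sum row. If a concurrent
-- transaction has already updated and committed that row, this statement fails with
-- "could not serialize access due to concurrent update", and we retry the whole transaction.
UPDATE account_groups SET total = total - 10 WHERE group_id = 1;
COMMIT;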

If you can, I recommend starting with Serializable, and only switching to a lower isolation level on a transaction-by-transaction basis when the performance cost of serializability checks becomes too high.

Concurrent locks and MultiXacts in Postgres

Pretty recently, I was troubleshooting a performance issue in a production Rails app backed by Postgres. There was this one class of query that would get slower and slower over the course of about an hour. The exact pathology is a tale for another time, but the investigation led me down the rabbit hole of a particular data structure in Postgres internals called a MultiXact.

Let’s look at what a MultiXact is, why it exists, and under what circumstances one comes into existence.

What the hell is a MultiXact?

[Image: a two-panel meme based on The Fifth Element. First panel: Leeloo holds up her Multipass, captioned “Leeloo Dallas MultiXact.” Second panel: an exasperated Korben says, “Yeah she knows it’s a MultiXact.”]

In Postgres, it’s possible for two threads (or indeed, even three!) to lock the same row at the same time, as long as their locks don’t conflict with one another. The rules for what kinds of locks conflict with what other kinds of locks are admirably documented.

When you lock a row in Postgres, what you do is, you put your transaction ID in the xmax field of the corresponding tuple in storage (more on xmax on the Cybertec blog). This way, anyone who comes along looking for this row will know that you already have it locked. The latecomer can then wait for the lock:

Transaction 768: SELECT * FROM foo WHERE x = 9 FOR UPDATE;
                 -- Transaction 768 now has a row lock. The row’s xmax field contains the value 768.
Transaction 769: SELECT * FROM foo WHERE x = 9 FOR UPDATE;
                 -- Transaction 769 retrieves the current row, sees that Transaction 768 already holds
                 -- a lock that conflicts with the lock it wants, and waits for Transaction 768 to be over.

But what if two threads both want to lock the same row simultaneously? For example:

Transaction 772: SELECT * FROM foo WHERE x = 9 FOR SHARE;
                 -- Afterward, transaction 772 has a row lock. The row’s xmax field contains the value 772.
Transaction 773: SELECT * FROM foo WHERE x = 9 FOR SHARE;
                 -- What happens now?

Transaction 773 can’t just write its transaction ID into the xmax field. That would amount to preempting Transaction 772’s lock… which would defeat the whole point of locking. To solve this problem, Postgres creates a MultiXact. A MultiXact essentially bundles together some set of transactions so that those transactions can all lock the same row at the same time. Instead of a transaction ID, a new MultiXact ID is written to the row’s xmax.

Transaction 772: SELECT * FROM foo WHERE x = 9 FOR SHARE;
Transaction 773: SELECT * FROM foo WHERE x = 9 FOR SHARE;
                 -- Now both transactions have the row locked. The row’s xmax field is set to 14, which
                 -- is a MultiXact ID. MultiXact 14 refers to transactions 772 and 773 by their
                 -- transaction IDs.
Transaction 773: COMMIT;
                 -- Transaction 773 is now over, but the row’s xmax value is still 14. Since MultiXacts
                 -- are immutable, MultiXact 14 still refers to the now-defunct Transaction 773 as well
                 -- as the ongoing Transaction 772.
Transaction 772: COMMIT;
                 -- With both transactions over, there are no remaining locks active on the row. Its
                 -- xmax value is still 14, and will remain 14 until another thread locks the row or
                 -- the table is vacuumed.
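
If you want to see this for yourself, xmax is exposed as a system column on every table, and the pgrowlocks contrib extension reports, for each currently locked row, whether the locker is a single transaction or a MultiXact and which lock modes are held. A quick sketch against the example table foo:

SELECT xmax, * FROM foo WHERE x = 9;

CREATE EXTENSION IF NOT EXISTS pgrowlocks;
-- one row per currently locked tuple, including a "multi" flag and the member transaction IDs
SELECT * FROM pgrowlocks('foo');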

It bears repeating that MultiXacts are immutable. If transactions 104 and 108 both have row R locked as part of MultiXact 19, and transaction 117 locks row R too, transaction 117 can’t just join MultiXact 19. Instead, a new MultiXact with ID 20 is created, which contains 104, 108, and 117.

This means that, every time an additional transaction wants to lock a row, Postgres has to write the whole new MultiXact to a buffer. For large MultiXacts, the time-cost of all this reading and writing can become quite significant. Especially since access to the underlying data region is subject to a set of global mutexes.

Counterintuitively, a MultiXact doesn’t always involve multiple transactions.

When I first encountered MultiXacts in production, it was extra confusing, because I was seeing MultiXacts get created within a single transaction. “What gives?” I thought! “They’re supposed to be multi.”

Turns out, we were doing something like this:

BEGIN;
SELECT * FROM queue_jobs
WHERE id = 4
FOR SHARE;
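-- the top-level transaction now holds a FOR SHARE lock on this row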

SAVEPOINT foo;

SELECT *
FROM queue_jobs
WHERE id = 4
FOR UPDATE;
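-- the stronger FOR UPDATE lock is taken after the savepoint; as described below, the row's
-- xmax now holds a MultiXact ID even though only one top-level transaction is involved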

A savepoint doesn’t create a whole new transaction, but it does open a subtransaction, and Postgres needs to keep track of the fact that the FOR UPDATE lock was taken after the savepoint, so that the lock can be relinquished by a subsequent ROLLBACK TO foo command. As a result, a new MultiXact gets created and its ID gets placed in the row’s xmax field.

I hope I’ve contributed to your understanding of MultiXacts in Postgres! It’s the sort of thing where you don’t ever need to think or know about it, until suddenly you do.

Squeeze the hell out of the system you have

About a year ago, I raised a red flag with colleagues and managers about Postgres performance. Our database was struggling to keep up with the load generated by our monolithic SaaS application. CPU utilization was riding between 60 and 80%, and at least once it spiked to 100%, causing a brief outage.

Now, we had been kicking the can down the road with respect to Postgres capacity for a long time. When the database looked too busy, we’d replace it with a bigger instance and move on. This saved us a lot of time and allowed us to focus on other things, like building features, which was great.

But this time, it wasn’t possible to scale the DB server vertically: we were already on the biggest instance. And we were about to overload that instance.

Lots of schemes were floated. Foremost among them:

  • Shard writes. Spin up a cluster of independent databases, and write data to one or the other according to some partitioning strategy.
  • Do micro-services. Split up the monolith into multiple interconnected services, each with its own data store that could be scaled on its own terms.

Both of these options are cool! A strong case can be made for either one on its merits. With write sharding, we could potentially increase our capacity by 2 or even 3 orders of magnitude. With micro-services, we’d be free to use “the right tool for the job,” picking data stores optimized to the requirements of each service workload. Either branch of the skill tree would offer exciting options for fault tolerance and operational resilience.

Either way, everyone had to agree: we’d outgrown our old, naïve implementation. Onward and upward! We can do hard things!

In situations like this, presented with a dazzling array of next-generation architecture options that can be built to last us through the decade, it’s easy to forget what our goal was: to get database performance under control.

Complexity costs attention.

Sometimes, leaps in complexity must be made. It’s generally a good problem to have. If enough demand is being placed on your system to render obsolete your existing technology, then even more growth is probably on the horizon! If you can just put in the investment and build the more advanced architecture now, then you’ll be looking at a bright future of unconstrained year-over-year success.

But don’t just consider the implementation cost. The real cost of increased complexity – often the much larger cost – is attention.

If you decide to shard across databases, then not only must you pay the money-, time-, and opportunity cost of building out the new architecture: you must also take the new complexity into account in every subsequent technical decision. Want to shard writes? Fine, but this complicates every future decision about backups, monitoring, migrations, the ORM, and network topology (just to name a few). And don’t get me started on micro-services.

Just think about how massive these costs are. How much feature delivery will have to be delayed or foregone to support the additional architectural complexity?

Always squeeze first

We should always put off significant complexity increases as long as possible.

When complexity leaps are on the table, there’s usually also an opportunity to squeeze some extra juice out of the system you have. By tweaking the workload, tuning performance, or supplementing the system in some way, you may be able to add months or even years of runway. When viable, these options are always preferable to building out a next-gen system.

Let’s return to the example of the overloaded Postgres instance. In that case, what we ended up doing was twofold:

  1. Two engineers (me and my colleague Ted – but mostly Ted) spent about 3 months working primarily on database performance issues. There was no silver bullet. We used our telemetry to identify heavy queries, dug into the (Rails) codebase to understand where they were coming from, and optimized or eliminated them. We also tuned a lot of Postgres settings.
  2. Two more engineers cut a path through the codebase to run certain expensive read-only queries on a replica DB. This effort bore fruit around the same time as (1), when we offloaded our single most frequent query (a SELECT triggered by polling web clients).

These two efforts together reduced the maximum weekly CPU usage on the database from 90% to 30%.

Now we can sleep at night. We have a huge amount of room to grow, both in terms of CPU headroom and our ability to shed load from the primary. And furthermore, since our work touched many parts of the codebase and demanded collaboration with lots of different devs, we now have a strong distributed knowledge base about the existing system. We’re well positioned to squeeze it even more if need be.

This doesn’t mean complexity is bad

Of course, I’m not saying complexity is bad. It’s necessary. Some day we’ll reach a fundamental limit of our database architecture, and before that day arrives, we’ll need to make a jump in complexity.

But until then, because we squeezed first, we get to keep working with the most boring system possible. This is by far the cheaper and more practical option.

Don’t fix it just because it’s technical debt.

Many kinds of problems get classified as technical debt. Most generally it’s problems where:

  • In the past, we made a decision to borrow against our future productivity in order to achieve earlier product delivery;
  • Because of that past decision, we must now make recurring interest payments in the form of wasted effort.

Engineers hate waste. Waste makes us want to barf. So it’s obvious to us that technical debt must be faced head-on and paid down. We therefore propose to management, “Let’s take technical debt more seriously.” Management usually replies one of two ways:

  1. No. We have to keep setting aggressive feature deadlines. We don’t have the luxury of looking back.
  2. Sure! How much of your time do you want to spend on tech debt reduction? How about 20%?

At least #1 is internally consistent.

#2 is scatter-brained. Why should we only spend part of our time doing work that maximizes value, and the rest of our time doing other, less optimal work?

A realistic manager would say, “If you have a proposed improvement that’ll reduce wasted effort, then make a case for the soundness of that proposal as an investment.” If we can save 32 labor-hours over the next 3 years, but it’ll only cost 8 labor-hours to do it, then maybe let’s do it. If it’ll cost 40 labor-hours, then you should just keep on wasting that effort.

Reducing wasted effort may feel like an obvious good, but reducing wasted effort isn’t the goal of a company. Making money is.

It’s fine to use names in post-mortems

Sometimes, during an incident post-mortem, we need to talk about the actions of a specific person. It’s awkward. We want to be blameless, right? If we say, “Steve rebooted the wrong server,” aren’t we undermining the principle of blamelessness? Shouldn’t we say, “An engineer rebooted the wrong server” instead?

This tension is rooted in a misunderstanding of blamelessness. The purpose of the blameless post-mortem is not to make everyone feel comfortable. Discomfort can be healthy and useful. The purpose of the blameless post-mortem is to let us find explanations deeper than human error.

If we say, “This outage happened because Steve rebooted the wrong server,” then we make Steve the sole agent in the narrative. By this logic, the only way to avoid more future outages is for Steve to do better (or get fired). This is the kind of shallow analysis that blameless post-mortems help us avoid.

But it’s okay to say, “Steve rebooted the wrong server,” as long as you don’t blame Steve for the failure. You’re just stating a fact. Everybody knows it was Steve who rebooted the server. Steve sure knows it: he’s been beating himself up about it all weekend.

But why did Steve reboot the wrong server? Could it have had something to do with the fact that the server’s command prompt contains the opaque EC2 instance ID instead of a meaningful hostname? Was Steve coming off a 70-hour week? Come to think of it, why is it that rebooting that particular server can cause a total failure? If you’re asking questions like these instead of stopping at “Steve fucked up,” then you’re doing blamelessness.

So just say the name. It’s fine.

P.S. – obviously this doesn’t apply if you’re in the kind of organization that will retaliate against Steve for making an honest mistake. Don’t sell out your coworkers to bloodthirsty management.

Incident metrics tell you nothing about reliability

When an incident response process is created, there arise many voices calling for measurement. “As long as we’re creating standards for incidents,” they say, “let’s track Mean Time To Recovery (MTTR), Mean Time To Detection (MTTD), and Mean Time Between Failures (MTBF)!”

Usually there’s no further discussion on this point. Everyone agrees that more data is better (“but let’s make sure we interpret the data correctly, of course!”). As practitioners, it’s exciting to think that these metrics will finally let us demonstrate concrete results for all the effort we put into reliability. Middle management, for its part, looks forward to having a “north star” for reliability work. No more wishy-washy “we believe this project improved system reliability substantially” hand-waving: the initiative either moved the needle on MTBF or it didn’t.

The assumption underlying all this is that incident data can shed light on a system’s reliability. But this assumption is never warranted. Analysis of your incident data may shed light on your incidents. But incident data really can’t tell you anything about your reliability.

Picture this. Your team owns a web service. You’ve set up an alert on latency. If the 90th percentile latency exceeds 100 milliseconds, your team gets paged and you start an incident.

One day, you find out that, really, if that 90th percentile latency even gets above 50 milliseconds, it’s already a problem. Downstream services start getting sluggish, and user experience suffers. “No problem,” you say, lowering the alert threshold to 50 milliseconds.

Now you’re able to observe more failures. Or, if you prefer: you’ve widened the definition of “failure” to include more events. But the system’s reliability has in no way been altered. And yet, you will have more incidents. Possibly quite a few more! Mean-time-between-failures will drop.

Picture this. Every once in a while, the database gets into a “wedged” state, where a bunch of queries are piled up and work grinds to a halt. It’s simple to un-wedge it: just manually kill the query that caused the pileup, and everything else gets flushed through the system. Each time the database gets wedged, the database team gets paged, they start an incident, they find and kill the query, and the incident is over. These incidents happen semi-regularly, but at least they’re over quickly.

After a while, someone finally implements a fix. Those problematic queries? Gone! No more pointless incidents spun up just to manually clear out a pileup. And, more importantly: no more service degradations resulting from this problem. The reliability of the system has improved!

So what happens to the incident metrics? MTBF goes up, of course. That checks out: there are fewer incidents now. But what about MTTR? Before the fix, MTTR was low. These query-pileup incidents were frequent and short, so they kept mean-time-to-recovery anchored down. After the fix – which, remember, corresponded to a direct and pure improvement in reliability – mean-time-to-recovery is higher. Whoopsy daisy!

Picture this. You have a particular customer whose use-case tests the limits of your product’s capabilities. And furthermore, this customer is, let’s say, somewhat of a squeaky wheel. Every 2 or 3 weeks, they file a support ticket about something-or-other, and the problem is usually tricky enough that an incident gets spun up.

Eventually this customer has had enough. They leave. This represents both a sizable blow to revenue and a scathing indictment of your product’s reliability at scale. But, on the bright side, both MTTR and MTBF benefit enormously! That’ll look great on the quarterly slide deck.

When someone says they want to measure incident metrics (MTTR or MTBF or any other one in the family), I always want to ask them, “What’s your model?”

Many people rush to collect data without first considering whether they can come up with a model powerful enough to interpret the data. If our model is bad, then we’ll make bad decisions even with good data. And if, as is often the case, we have no explicit model at all, then our implicit model is just, “Reliability up, MTBF up, MTTR down.” Which, as we’ve seen, doesn’t match our observations even in straightforward, common scenarios.

Don’t waste time and energy collecting metrics you’ll never be able to use. Learn as much as you can from the failures themselves, and then move on.

Post-mortems: content over structure

The value of post-mortems is apparent: failures present opportunities to learn about unexpected behaviors of the system, and learning lets us make improvements to the system’s reliability.

The value of post-mortem documents is much less apparent.

Many R&D orgs will insist that the final draft of a post-mortem document have a particular structure. Common components of this structure include:

  • Start and end time of customer impact
  • Time to detection
  • Time to diagnosis
  • Time to recovery
  • A list of action items, each with a link to a ticket
  • Mitigation action items broken down into categories (e.g. “Prevention,” “Detection,” “Impact”)
  • Specific sections (e.g. “Timeline,” “What went wrong?,” “What can we do better?”)
  • Signoffs/approvals

None of these structural requirements facilitate learning. The benefits of post-mortem analysis come not from the document, but rather from the sense-making processes of which the document is an artifact. In order to understand a given failure, we invent hypotheses and test them against our mental model and the observable facts. In order to choose effective strategies for improvement, we converse and debate. And in order to make any of this matter, we establish accountability for seeing those strategies through.

These social processes are the source of the value of post-mortem analysis. The document is just a souvenir.

But what if you want to do meta-analysis? What if you want to analyze trends in incident characteristics over time, or categorize incidents according to some scheme? Don’t you need structure then?

I suppose you do. But good luck getting any useful information. No matter how much structure you insist on, the data set will be hopelessly noisy. Just try to state a hypothesis that can realistically be tested by a meta-analysis of a single organization’s incident data. I don’t think you can.

But what if structure helps me learn?

If structuring the post-mortem process helps you learn, then by all means! categorize! prompt! But recognize structure as a tool rather than an end in itself. Your learning process may benefit from one kind of structure, while somebody else’s may benefit from a different kind of structure, or less or more structure altogether. But the structure of the learning is what matters; not the structure of the document.

Organizational legibility

A requirement for post-mortem documents to have a specific, consistent structure doesn’t help us learn or improve. So why do we do it?

If you ask me, it’s all about the bureaucratic drive for legibility. Centralized power craves legibility.

Idiosyncratic processes like sense-making and learning are illegible to a command-and-control power structure. They come in diverse and intricate forms, instead of the standardized, codified forms that bureaucracy can parse. In service of legibility, a company’s power structure will insist that the post-mortem process culminate in spreadsheet-ready data items like “customer impact duration,” “time to recovery,” “severity level,” and the like. Centralized power demands these simplifications even if they inhibit learning and improvement. To the bureaucracy, legibility itself is the goal.

As an employee, you probably can’t totally disregard these bureaucratic impositions. But, to the extent you can disregard them, you should. Focus on what really matters: learning and improvement.

Outliers carry information. Don’t leave them on the table

Over a decade ago, I saw this talk by John Rauser. Only recently, though, did I come to realize how incredibly influential this talk has been on my career. Gosh what a great talk! You should watch it.

If you operate a complex system, like a SaaS app, you probably have a dashboard showing a few high-level metrics that summarize the system’s overall state. These metrics (“summary statistics”) are essential. They can reveal many kinds of gross changes (both gross “large scale” and gross “ick”) in the system’s state, over many different time scales. Very useful!

But don’t be misled. Summary statistics reveal certain patterns in the system’s behavior, but they are not identical to the system’s behavior. All summary statistics – yes, even distributions – hide information. They’re lossy. It’s easy to get lulled into the sense that, if an anomaly doesn’t show up in the summary statistics, it doesn’t matter. But a complex system’s behavior is not just curves on a plot. It’s a frothing, many-dimensional vector sum of instant-to-instant interactions.

When you investigate an anomaly in summary statistics, you’re faced with a small number of big facts. Average latency jumped by 20% at such-and-such time. Write IOPS doubled. API server queue depth started rising at some later time. Usually, you “zoom in” from there to find patterns that might explain these changes.

When you instead investigate a specific instance of anomalous behavior, you start with a large number of small facts. A request to such-and-such an endpoint with this-and-that parameter took however many seconds and crashed on line 99 of thing_doer.rb. None of these small facts tell you anything about the system’s overall behavior: this is just a single event among millions or billions or more. But, nevertheless: these small facts can be quite illuminating if you zoom out.

First of all, this probably isn’t the only time a crash like this has ever occurred. Maybe it’s happening multiple times a day. Maybe it happened twice as often this week as it did last week. Maybe it’s happening every time a specific customer makes a specific API request. Maybe that customer is fuming.

And second of all, the reason this event caught our eye in the first place was because it was anomalous. It had some extreme characteristic. Take, for example, a request that was served with very high latency. Perhaps, in the specific anomalous case before us, that extreme latency didn’t cause a problem. But how extreme could it get before it did cause a problem? If it took 20 seconds today, could it take 30 seconds next time? When it hits 30, it’ll time out and throw an error. Or, if multiple requests like this all arrived at the same time, could they exhaust some resource and interfere with other requests?

If the only anomalies you investigate are those that show up in summary statistics, then you’ll only find problems that have already gotten bad enough to move those needles. But if you dig into specific instances of anomalous behavior – “outliers” – then you can often find problems earlier, before they become crises.

5 production surprises worth investigating

As an SRE, I’m a vocal believer in following one’s nose: seeking out surprising phenomena and getting to the bottom of them. By adopting this habit, we can find and fix many classes of problems before they turn into incidents. Over time, this makes things run much smoother.

But what is surprising? How do we decide what to ignore and what to chase down?

Here are 5 kinds of surprise that are worth digging into in production data.

Keep in mind that this list is based on one SRE’s experience. Someone else, with different intuitions about system behavior, would make a different list. And that list would be more valuable to them than this list.

The most important thing is not that we always pick the right surprises to investigate. It’s that we make time, consistently, to seek out and explain surprises.

Anyway. Here’s my list.

1. Long running requests

Pick a time window. Sort all the requests during that window descending by duration. At the top of this list, you’ll find classes of requests that spent extraordinarily long in your system.
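
If the requests in question are database queries, pg_stat_statements can give you a rough version of this list (a sketch; it assumes the extension is enabled, and the column names are the Postgres 13+ ones):

SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY max_exec_time DESC   -- the statements with the worst single executions
LIMIT 20;
-- note: these stats are cumulative since the last reset, not scoped to a time window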

Were you expecting to see these specific kinds of requests? If not, then congratulations: you’re surprised. And this is quite likely a surprise worth investigating, because

  1. while these slow requests are running, they’re eating up resources (CPU, memory, sockets, and so on), which gives them an outsized potential to degrade the performance of other, unrelated requests.
  2. without intervention, the underlying problem will probably get worse over time. Why wouldn’t it? It has already gotten this bad. And if it gets bad enough, it will eventually cause customer pain.
  3. many of the phenomena contributing to this slowness probably affect a variety of requests. So, if you can make these requests perform better, you may be able to make other requests perform better too.

We don’t have to rigorously analyze and optimize every slow request that goes through the system. Just by coming to understand the issues that cause our slowest requests to be slow, we often find opportunities to make high-impact improvements and mitigate the risk of ending up on an incident call later.

2. Saturation

Saturation is when units of work are waiting in a queue to be processed – be they web requests, or IP packets, or threads, or what have you.

To the extent we’re trying to serve each request as quickly as possible, saturation anywhere is waste. We shouldn’t expect any saturation. Or, at least, we should strive to thoroughly understand every source of it. If we don’t understand it, it will probably get worse over time.

Find every metric that represents a saturation. Put them all on one dashboard. Look at that dashboard every day.
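
To take one Postgres-flavored example of such a metric: backends waiting on locks are units of work sitting in a queue (a sketch):

SELECT count(*) AS backends_waiting_on_locks
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';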

3. Crashes

Every time an exception bubbles up to the top of the stack, that’s unexpected. If it were expected, then the code would have caught and handled the exception.

A tool like Sentry or Datadog Error Tracking (I mention these specific products only because I know them, not because they’re necessarily the best) lets you classify crashes, and see which ones are new or getting more frequent. Granted: it’s often the case that there are far too many classes of crash to investigate all or even most of them. But this shouldn’t stop us from chasing down some of them, according to our intuition about their relative importance.

We have imperfect information until we get our hands dirty, so we won’t always end up picking the highest-impact investigations to undertake. But fixing crashes is always better than ignoring them, and by dividing up the extant crashes into groups and prioritizing one group at a time, we can create some very useful guarantees.

For example, if we find all the crashes that correspond to deadlocks, and fix those, then we guarantee that our system has no deadlocks. Thereafter, problems caused by deadlocks – which tend to be subtle, elusive, and therefore time-consuming to solve – can no longer quietly sow chaos.

4. Utilization spikes

It’s expected for a utilization metric (for example, network bandwidth or memory utilization) to move around a lot. But it’s not expected for utilization to exhibit large spikes. When this happens, it means some recurring event is using more than its fair share of some resource.

When utilization hits 100%, things will degrade catastrophically and immediately. So if you have utilization spikes representing 20% of capacity, then steady-state utilization only has to be 80% before catastrophic degradations start occurring. Or, if the spikes get worse, and now they’re 30% of capacity, then 70% capacity becomes the new safe limit. We’ll have to waste at least 30% of our capacity by leaving it idle in order to prevent outages.

So why not understand the spikes now, before they get worse?

5. Correlation between latency and throughput

When we see recurring swings in latency metrics, it’s easy to say to ourselves, “Yeah, well, when we’re under more load, things slow down.” But… why? Systems don’t always behave like that! They do it when there’s a bottleneck. If the system were behaving as expected and no resources were exhausted, then the latency graph would be more or less flat.

In other words, if traffic tends to be higher mid-day on weekdays, and latency also tends to rise mid-day on weekdays, then you’re likely saturating at least one resource. Would you rather find that bottleneck now, or during an incident?
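
One way to check for this, sketched against a hypothetical request_log(ts, duration_ms) table (the table and columns are illustrative):

WITH hourly AS (
  SELECT date_trunc('hour', ts) AS hour,
         count(*) AS requests,
         percentile_cont(0.9) WITHIN GROUP (ORDER BY duration_ms) AS p90_ms
  FROM request_log
  GROUP BY 1
)
SELECT corr(p90_ms, requests) AS latency_vs_throughput
FROM hourly;
-- a correlation well above zero is a hint that some resource saturates under load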

I hope this list is a good place to start if you’re looking to dig up problems. But, ultimately, you must be the judge of what’s surprising to you. Find something you can’t explain. Explain it. Repeat.