No observability without theory

Imagine you’re an extremely bad doctor. Actually, chances are you don’t even have to imagine. Most people are extremely bad doctors.

But imagine you’re a bad doctor with a breathtakingly thorough knowledge of the human body. You can recite chapter and verse of your anatomy and physiology textbooks, and you’re always up to date on the most important research going on in your field. So what makes you a bad doctor? Well, you never order tests for your patients.

What good does your virtually limitless store of medical knowledge do you? None at all. Without data from real tests, you’ll almost never pick the right interventions for your patients. Every choice you make will be a guess.

There’s another way to be an extremely bad doctor, though. Imagine you don’t really know anything about how the human body works. But you do have access to lots of fancy testing equipment. When a patient comes in complaining of abdominal pain and nausea, you order as many tests as you can think of, hoping that one of them will tell you what’s up.

This rarely works. Most tests just give you a bunch of numbers. Some of those numbers may be outside of normal ranges, but without a coherent understanding of how people’s bodies behave, you have no way to put those numbers into context with each other. They’re just data – not information.

In medicine, data is useless without theory, and theory is useless without data. Why would we expect things to be any different in software?

Observability as signal and theory

The word “observability” gets thrown around a lot, especially in DevOps and SRE circles. Everybody wants to build observable systems, then make their systems more observable, and then get some observability into their observability so they can observe while they observe.

But when we look for concrete things we can do to increase observability, it almost always comes down to adding data. More metrics, more logs, more spans, more alerts. Always more. This makes us like the doctor with all the tests in the world but no bigger picture to fit their tests results into.

Observability is not just data. Observability comprises two interrelated and necessary properties: signal and theory. The relationship between these two properties is as follows:

• Signal emerges from data when we interpret it within our theory about the system’s behavior.
• Theory reacts to signal, changing and adapting as we use it to process new information.

In other words, you can’t have observability without both a rich vein of data and a theory within which that data can be refined into signal. Not enough data and your theory can’t do its job; not enough theory and your data is meaningless. Theory is the alchemy that turns data into knowledge.

What does this mean concretely?

It’s all well and good to have a definition of observability that looks nice on a cocktail napkin. But what can we do with it? How does this help us be better at our job?

The main takeaway from the understanding that observability consists of a relationship between data and theory, rather than simply a surfeit of the former, is this: a system’s observability may be constrained by deficiencies in either the data stream or our theory. This insight allows us to make better decisions when promoting observability.

Making better graph dashboards

However many graphs it contains, a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study. We can facilitate this through many interventions, a few of which are to:

• Add a note panel to the top of every dashboard which give an overview of how that dashboard’s graphs are expected to relate to one another.
• Add links to dashboards for upstream and downstream services, so that data on the dashboard can be interpreted in a meaningful context.
• When building a dashboard, start with a set of questions you want to answer about a system’s behavior, and then choose where and how to add instrumentation; not the other way around.

Alerts are another form of data that we tend to care about. And like all data, they can only be transmogrified into signal by being interpreted within a theory. To guide this transmogrification, we can:

• Present alerts along with links to corresponding runbooks or graph dashboards.
• Document a set of alerts that, according to our theory, provides sufficient coverage of the health of the system.
• Delete any alerts whose relevance to our theory can’t be explained succinctly.

Engaging in more effective incident response

When there’s an urgent issue with a system, an intuitive understanding of the system’s behavior is indispensable to the problem solving process. That means we depend on the system’s observability. The incident response team’s common ground is their theory of the system’s behavior – in order to make troubleshooting observations meaningful, that theory needs to be kept up to date with the data.

To maintain common ground over the course of incident response, we can:

• Engage in a regular, structured sync conversation about the meaning of new data and the next steps.
• Seek out data only when you can explicitly state how the data will relate to our theory (e.g. “I’m going to compare these new log entries with the contents of such-and-such database table because I think the latest deploy might have caused an inconsistency”).
• Maintain an up-to-date, explicit record of the current state of problem solving, and treat it as the ultimate source of truth.

Delivering meaning

Data is just data until theory makes it signal.

The next time you need to build an observable system, or make a system more observable, take the time to consider not just what data the system produces, but how to surface a coherent theory of the system’s workings. Remember that observability is about delivering meaning, not just data.

Ever since the late 2000s, I’ve been implementing “alert review” processes on ops teams. As a team, we go through all the alerts we’ve received in the last week, and identify those that are bad.

I use a simple framework wherein a good alert has the following properties:

• Actionable: Indicates a problem for which the recipient is well placed to take immediate corrective action.
• Investigable: (yes, I made this word up) Indicates a problem whose solution is not yet known by the organization.

These properties can be present or absent independently, so this framework identifies four types of alerts:

Actionability

Actionability has been widely touted as a necessary condition for good alerts, and for good reason. Why send an alert about something you can’t fix? Just to make the recipient anxious?

Referring to my definition of an actionable alert:

Indicates a problem for which the recipient is well placed to take immediate corrective action.

we can see three main ways in which non-actionability shows up:

• Someone is well placed to take immediate corrective action, but not the recipient. For example, an alert that indicates a problem with the Maple Syrup service instead pages the Butter team, which can’t do anything about it. In cases like these, the fix is often simple: change the alert’s recipient. Sometimes, though, you’ll first have to refactor your config to let it distinguish between Maple Syrup problems and Butter problems.
• There is an action to take, but it can’t be taken immediately. For example, Apache needs to be restarted, but the recipient of the alert isn’t sure whether this will cause an outage. This type of non-actionable alert often calls for either improved documentation (e.g. a “runbook” indicating steps to perform). Another example might be a disk space alert that has been slowly climbing for a while and just crossed a threshold: action can’t be taken immediately, because the team needs to agree upon an action to take.
• There is no action to take. For example, “CPU utilization” or “Packet loss.” These are your classic FYI alerts. Instead of alerting, these things should appear on a dashboard for use in troubleshooting when a problem is already known to exist.

Investigability

An alert is non-investigable if its implications are obvious at first glance. Here are the two most common types of non-investigable alerts:

• “Chief O’Brien” alerts. If you look at your phone and instantly know the commands you have to run to fix it, that’s a “Chief O’Brien” alert. There’s no need to bother a human to fix the issue; the response should be automated.
• Redundant alerts. Sometimes you get an alert for increased error rates from one of your services, and by the time you get up and get to your laptop, you’ve gotten 8 more similar alerts. The first one might well have been a perfectly good alert, but the other 8 are likely non-investigable. Whatever you learn in investigating the first one will apply to the rest in exactly the same way. The correct response to alerts like these is to use dependencies or grouping.

What to do with this framework

Like I said, I like to have my team go through all the alerts they’ve received in the last week and discuss them. Imagine a spreadsheet where each alert is a row and there are columns labeled Actionable? and Investigable?

Actually, don’t bother. I imagined one for you:

This actionability/investigability framework helps the team identify bad alerts and agree on the precise nature of their badness. And as a bonus, the data in these spreadsheets can shine a light on trends in alert quality over time.

I’ve had a lot of success with this framework throughout the years, and I’d like to hear how it works for others. Let me know if you try it out, or if you have a different model for addressing the same sorts of questions!

MTTR: lower isn’t always better

It’s a great idea to track your MTTR (Mean Time To Recover) as an operational metric. MTTR is defined as the average interval between onset of a failure and recovery from that failure. We acknowledge that failures are part of the game, so we want our organization to be good at responding quickly to them. It’s intuitive that we’d want our MTTR to trend down.

This is one of those places where our intuition can be misleading.

MTTR is an average over incidents of incident duration. That means that the total amount of downtime gets denominatored out. Consider these two brothers who run different websites:

• Achenar’s site only had 1 outage in September, and it lasted 60 minutes.
• Sirius’s site had 120 outages in September, lasting 20 minutes each.

Sirius had 40 times as much downtime as Achenar in the month of September. Sirius’s MTTR, however, was 1/3 that of Achenar: 20 minutes rather than 60 minutes.

Lowering your MTTR is a good strategy in certain situations. But you need to make sure it’s the right strategy. If you don’t look at the whole picture, things like nuisance alarms and insufficient automation can be confounded with the meaning of your MTTR. If you fix a whole bunch of meaningless alerts that always recover quickly without intervention (you know the type), your MTTR goes up!

MTTR is useful to track, and it can be useful for decision-making. Just remember: our goal is to minimize downtime and noise, not MTTR. If the path of least resistance to lower downtime and a stronger signal is to respond to incidents quicker, then MTTR is your best friend. But that’s not always true.

Why I only page on downtime. ONLY.

I often meet with skepticism when I say that server monitoring systems should only page when a service stops doing its work. It’s one of the suggestions I made in my Smoke Alarms & Car Alarms talk at Monitorama this year. I don’t page on high CPU usage, or rapidly-growing RAM usage, or anything like that. Skeptics usually ask some variation on:

If you only alert on things that are already broken, won’t you miss opportunities to fix things before they break?

The answer is a clear and unapologetic yes! Sometimes that will happen.

It’s easy to be certain that a service is down: just check whether its work is still getting done. It’s even pretty easy to detect a performance degradation, as long as you have clearly defined what constitutes acceptable performance. But it’s orders of magnitude more difficult to reliably predict that a service will go down soon without human intervention.

We like to operate our systems at the very edge of their capacity. This is true not only in tech, but in all sectors: from medicine to energy to transportation. And it makes sense: we bought a certain amount of capacity: why would we waste any? But a side effect of this insatiable lust for capacity is that it makes the line between working and not working extremely subtle. As Mark Burgess points out in his thought-provoking In Search of Certainty, this is a consequence of nonlinear dynamics (or “chaos theory“), and our systems are vulnerable to it as long as we operate them so close to an unstable region.

But we really really want to predict failures! It’s tempting to try and develop increasingly complex models of our nonlinear systems, aiming for perfect failure prediction. Unfortunately, since these systems are almost always operating under an unpredictable workload, we end up having to couple these models tightly to our implementation: number of threads, number of servers, network link speed, JVM heap size, and so on.

This is just like overfitting a regression in statistics: it may work incredibly well for the conditions that you sampled to build your model, but it will fail as soon as new conditions are introduced. In short, predictive models for nonlinear systems are fragile. So fragile that they’re not worth the effort to build.

Instead of trying to buck the unbuckable (which is a bucking waste of time), we should seek to capture every failure and let our system learn from it. We should make systems that are aware of their own performance and the status of their own monitors. That way we can build feedback loops and self-healing into them: a strategy that won’t crumble when the implementation or the workload takes a sharp left.

Car alarms and smoke alarms: the tradeoff between sensitivity and specificity

Wouldn’t you like to live in a world where your monitoring systems only alerted when things were actually broken? And wouldn’t it be great if, in that world, your alerts would always fire if things were broken?

Well so would everybody else. But we don’t live in that world. When we choose a threshold for alerting, we usually have to make a tradeoff between the chance of getting a false positive (an alert that fires when nothing is wrong) and the chance of getting a false negative (an alert that doesn’t fire when something is wrong).

Take the load average on an app server for example: if it’s above 100, then your service is probably broken. But there’s still a chance that the waiting processes aren’t blocking your mission-critical code paths. If you page somebody on this threshold, there’s always a chance that you’ll be waking that person up in the middle of the night for no good reason. However, if you raise the threshold to 200 to get rid of such spurious alerts, you’re making it more likely that a pathologically high load average will go unnoticed.

When presented with this tradeoff, the path of least resistance is to say “Let’s just keep the threshold lower. We’d rather get woken up when there’s nothing broken than sleep through a real problem.” And I can sympathize with that attitude. Undetected outages are embarrassing and harmful to your reputation. Surely it’s preferable to deal with a few late-night fire drills.

It’s a trap.

In the long run, false positives can — and will often — hurt you more than false negatives. Let’s learn about the base rate fallacy.

The base rate fallacy

Suppose you have a service that works fine most of the time, but breaks occasionally. It’s not trivial to determine whether the service is working, but you can write a probe that’ll detect its state correctly 99% of the time:

• If the service is working, there’s a 1% chance that your probe will say it’s broken
• If the service is broken, there’s a 1% chance that your probe will say it’s working

Naïvely, you might expect this probe to be a decent check of the service’s health. If it goes off, you’ve got a pretty good chance that the service is broken, right?

No. Bad. Wrong. This is what logicians and statisticians call the “base rate fallacy.” Your expectation hinges on the assumption that the service is only working half the time. In reality, if the service is any good, it works almost all the time. Let’s say the service is functional 99.9% of the time. If we assume the service just fails randomly the other 0.1% of the time, we can calculate the true-positive rate:

$\begin{array}{rcl} \text{TPR} & = & \text{(prob. of service failure)}*\text{(prob. of detecting a failure)} \\ & = & (0.001) * (0.99) \\ & = & 0.00099 \\ & = & 0.099\% \end{array}$

That is to say, about 1 in 1000 of all tests will run during a failure and detect that failure correctly. We can also calculate the false-positive rate:

$\begin{array}{rcl} \text{FPR} & = & \text{(prob. of service non-failure)}*\text{(prob. of detecting failure anyway)} \\ & = & (1-0.001)*(1-0.99) \\ & = & 0.0099 \\ & = & 0.99\% \end{array}$

So almost 1 test in 100 will run when the service is not broken, but will report that it’s broken anyway.

You should already be feeling anxious.

With these numbers, we can calculate what the medical field calls the probe’s positive predictive value: the probability that, if a given test produces a positive result, it’s a true positive. For our purposes this is the probability that, if we just got paged, something’s actually broken.

$\begin{array}{rcl} \text{(Positive predictive value)} & = & \frac{\text{TPR}}{\text{TPR} + \text{FPR}} \\ & = & \frac{0.00099}{0.00099 + 0.0099} \\ & = & 0.091 \\ & = & 9.1\% \end{array}$

Bad news. There’s no hand-waving here. If you get alerted by this probe, there’s only a 9.1% chance that something’s actually wrong.

Car alarms and smoke alarms

When you hear a car alarm going off, do you run to the window and start looking for car thieves? Do you call 9-1-1? Do you even notice car alarms anymore?

Car alarms have a very low positive predictive value. They go off for so many spurious reasons: glitchy electronics, drunk people leaning on the hood, accidental pressing of the panic button. And as a result of this low PPV, car alarms are much less useful as theft deterrents than they could be.

Now think about smoke alarms. People trust smoke alarms. When a smoke alarm goes off in a school or an office building, everybody stops what they’re doing and walks outside in an orderly fashion. Why? Because when smoke alarms go off (and there’s no drill scheduled), it frequently means there’s actual smoke somewhere.

This is not to say that smoke alarms have a perfect PPV, of course, as anybody who’s lost half an hour of their time to a false positive will tell you. But their PPV is high enough that people still pay attention to them.

We should strive to make our alerts more like smoke alarms than car alarms.

Sensitivity and specificity

Let’s go back to our example: probing a service that works 99.9% of the time. There’s some jargon for the tradeoff we’re looking at. It’s the tradeoff between the sensitivity of our test (the probability of detecting a real problem if there is one) and its specificity (the probability that we won’t detect a problem if there isn’t one).

Every time we set a monitoring threshold, we have to balance sensitivity and specificity. And one of the first questions we should ask ourselves is: “How high does our specificity have to be in order to get a decent positive predictive value?” It just takes some simple algebra to figure this out. We start with the PPV formula we used before, enjargoned below:

$\begin{array}{rcl} \text{PPV} & = & \frac{\text{TPR}}{\text{TPR}+\text{FPR}} \\ & = & \frac{\text{(prob. of failure)}\cdot\text{(sensitivity)}}{\text{(prob. of failure)}\cdot\text{(sensitivity)} + (1 - \text{(prob. of failure)})\cdot(1 - \text{(specificity)})} \end{array}$

To make this math a little more readable, let’s let p = PPV, f = the probability of service failure, a = sensitivity, and b = specificity. And let’s solve for b.

$\begin{array}{rcl} p & = & \frac{fa}{fa + (1-f)*(1-b)} \\ fa + (1-f)(1-b) & = & \frac{fa}{p} \\ 1-b & = & \frac{\frac{fa}{p} - fa}{1-f} \\ b & = & 1 - \frac{\frac{fa}{p} - fa}{1-f} \end{array}$

So, sticking with the parameters of our initial example (0.1% probability of service failure, 99% sensitivity) and deciding that we want a positive predictive value of at least 90% (so that 9 out of 10 alerts will mean something’s actually broken), we end up with

$\begin{array}{rcl} \text{Specificity} & = & 1 - \frac{\frac{0.001*0.99}{0.9} - (0.001 * 0.99)}{(1 - 0.001)} \\ & = & 0.9999 \\ & = & 99.99\% \end{array}$

The necessary specificity is about 99.99% — that’s way higher than the sensitivity of 99%! In order to get a probe that detects failures in this service with sufficient reliability, you need to be 100 times less likely to falsely detect a failure than you are to miss a positive!

So listen.

You’ll often be tempted to favor high sensitivity at the cost of specificity, and sometimes that’s the right choice. Just be careful: avoid the base rate fallacy by remembering that your false-positive rate needs to be much smaller than your failure rate if you want your test to have a decent positive predictive value.