No observability without theory

Imagine you’re an extremely bad doctor. Actually, chances are you don’t even have to imagine. Most people are extremely bad doctors.

[Image: a dog dressed as a vet. Caption: I love dogs, but they make bad doctors.]

But imagine you’re a bad doctor with a breathtakingly thorough knowledge of the human body. You can recite chapter and verse of your anatomy and physiology textbooks, and you’re always up to date on the most important research going on in your field. So what makes you a bad doctor? Well, you never order tests for your patients.

What good does your virtually limitless store of medical knowledge do you? None at all. Without data from real tests, you’ll almost never pick the right interventions for your patients. Every choice you make will be a guess.

There’s another way to be an extremely bad doctor, though. Imagine you don’t really know anything about how the human body works. But you do have access to lots of fancy testing equipment. When a patient comes in complaining of abdominal pain and nausea, you order as many tests as you can think of, hoping that one of them will tell you what’s up.

This rarely works. Most tests just give you a bunch of numbers. Some of those numbers may be outside of normal ranges, but without a coherent understanding of how people’s bodies behave, you have no way to put those numbers into context with each other. They’re just data – not information.

In medicine, data is useless without theory, and theory is useless without data. Why would we expect things to be any different in software?

Observability as signal and theory

The word “observability” gets thrown around a lot, especially in DevOps and SRE circles. Everybody wants to build observable systems, then make their systems more observable, and then get some observability into their observability so they can observe while they observe.

But when we look for concrete things we can do to increase observability, it almost always comes down to adding data. More metrics, more logs, more spans, more alerts. Always more. This makes us like the doctor with all the tests in the world but no bigger picture to fit their test results into.

Observability is not just data. Observability comprises two interrelated and necessary properties: signal and theory. The relationship between these two properties is as follows:

    • Signal emerges from data when we interpret it within our theory about the system’s behavior.
    • Theory reacts to signal, changing and adapting as we use it to process new information.

In other words, you can’t have observability without both a rich vein of data and a theory within which that data is interpretable as signal. Not enough data and your theory can’t do its job; not enough theory and your data is meaningless. Theory is the alchemy that turns data into knowledge.
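
As a rough illustration of that relationship, here is a minimal Python sketch. Everything in it is made up for the example: the metric names (requests_per_sec, jobs_done_per_sec, queue_depth), the Theory and Expectation classes, and the threshold of 100 are all assumptions, not part of any real monitoring tool. The only point is that the same numbers yield no signal until there is an explicit expectation to interpret them against.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical raw data: a snapshot of metrics with no interpretation attached.
Sample = Dict[str, float]


@dataclass
class Expectation:
    """One claim our theory makes about how the system behaves."""
    description: str
    holds: Callable[[Sample], bool]


@dataclass
class Theory:
    expectations: List[Expectation]

    def interpret(self, sample: Sample) -> List[str]:
        """Turn data into signal: list every expectation the sample violates."""
        return [e.description for e in self.expectations if not e.holds(sample)]

    def revise(self, expectation: Expectation) -> None:
        """Theory reacts to signal: record a new expectation as we learn."""
        self.expectations.append(expectation)


# Assumed example: if workers keep up with requests, the queue should be shallow.
theory = Theory([
    Expectation(
        "queue stays shallow while workers keep up with request rate",
        lambda s: s["queue_depth"] < 100
        or s["jobs_done_per_sec"] < s["requests_per_sec"],
    ),
])

sample = {"requests_per_sec": 50.0, "jobs_done_per_sec": 80.0, "queue_depth": 900.0}
signal = theory.interpret(sample)
print(signal)
```

With an empty Theory, the same sample produces an empty list: plenty of data, no signal.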

What does this mean concretely?

It’s all well and good to have a definition of observability that looks nice on a cocktail napkin. But what can we do with it? How does this help us be better at our job?

The main takeaway from the understanding that observability consists of a relationship between data and theory, rather than simply a surfeit of the former, is this: a system’s observability may be constrained by deficiencies in either the data stream or our theory. This insight allows us to make better decisions when promoting observability.

Making better graph dashboards

However many graphs it contains, a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study. We can facilitate this through many interventions, a few of which are to:

    • Add a note panel to the top of every dashboard that gives an overview of how that dashboard’s graphs are expected to relate to one another.
    • Add links to dashboards for upstream and downstream services, so that data on the dashboard can be interpreted in a meaningful context.
    • When building a dashboard, start with a set of questions you want to answer about a system’s behavior, and then choose where and how to add instrumentation, not the other way around.
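
One way to picture these interventions is to treat a dashboard as plain data that carries its theory along with its graphs. The sketch below is hypothetical throughout: the Dashboard and Graph classes, the service and metric names, and the link paths are invented for illustration and do not correspond to any particular dashboarding tool’s schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Graph:
    title: str
    metric: str    # hypothetical metric name
    question: str  # the question about system behavior this graph answers


@dataclass
class Dashboard:
    service: str
    note: str  # overview panel shown at the top of the dashboard
    upstream: List[str] = field(default_factory=list)    # links to upstream dashboards
    downstream: List[str] = field(default_factory=list)  # links to downstream dashboards
    graphs: List[Graph] = field(default_factory=list)

    def unexplained_graphs(self) -> List[str]:
        """Titles of graphs that do not say which question they answer."""
        return [g.title for g in self.graphs if not g.question.strip()]


dashboard = Dashboard(
    service="checkout",
    note="Latency should rise with queue depth; error rate should stay flat.",
    upstream=["/dashboards/api-gateway"],
    downstream=["/dashboards/payments"],
    graphs=[
        Graph("p99 latency", "checkout.latency.p99",
              "Are users waiting longer than usual?"),
        Graph("CPU", "checkout.cpu", ""),  # added without a question in mind
    ],
)

print(json.dumps(asdict(dashboard), indent=2))
print(dashboard.unexplained_graphs())
```

A graph that shows up in unexplained_graphs() is a hint that instrumentation came first and the question never arrived.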

Making better alerts

Alerts are another form of data that we tend to care about. And like all data, they can only be transmogrified into signal by being interpreted within a theory. To guide this transmogrification, we can:

    • Present alerts along with links to corresponding runbooks or graph dashboards.
    • Document a set of alerts that, according to our theory, provides sufficient coverage of the health of the system.
    • Delete any alerts whose relevance to our theory can’t be explained succinctly.
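
A small audit script can make these rules checkable. The following is only a sketch under stated assumptions: the Alert class, the example alert names, the example.com runbook URLs, and the 30-word cutoff standing in for “succinctly” are all hypothetical choices made for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    name: str
    runbook_url: str  # link presented alongside the alert
    rationale: str    # why this alert matters according to our theory


MAX_RATIONALE_WORDS = 30  # arbitrary stand-in for "succinctly"


def audit(alerts: List[Alert]) -> Dict[str, List[str]]:
    """Flag alerts with no link, and alerts we cannot explain briefly."""
    report: Dict[str, List[str]] = {"missing_link": [], "delete_candidates": []}
    for alert in alerts:
        words = len(alert.rationale.split())
        if not alert.runbook_url:
            report["missing_link"].append(alert.name)
        if words == 0 or words > MAX_RATIONALE_WORDS:
            report["delete_candidates"].append(alert.name)
    return report


alerts = [
    Alert("HighErrorRate", "https://runbooks.example.com/high-error-rate",
          "Error rate above 1% means users are seeing failed checkouts."),
    Alert("DiskIOSpike", "",
          "Disk stalls delay writes to the order log."),
    Alert("CpuAbove80", "https://runbooks.example.com/cpu", ""),
]

report = audit(alerts)
print(report)
```

The alerts that pass the audit, together with their rationales, double as the documented set that is supposed to cover the system’s health.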

Engaging in more effective incident response

When there’s an urgent issue with a system, an intuitive understanding of the system’s behavior is indispensable to the problem-solving process. That means we depend on the system’s observability. The incident response team’s common ground is their theory of the system’s behavior – in order to make troubleshooting observations meaningful, that theory needs to be kept up to date with the data.

To maintain common ground over the course of incident response, we can:

    • Engage in a regular, structured sync conversation about the meaning of new data and the next steps.
    • Seek out data only when you can explicitly state how the data will relate to our theory (e.g. “I’m going to compare these new log entries with the contents of such-and-such database table because I think the latest deploy might have caused an inconsistency”).
    • Maintain an up-to-date, explicit record of the current state of problem solving, and treat it as the ultimate source of truth.
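
The explicit record could be as simple as a shared document, but a sketch in code shows the shape of the idea: no investigation gets logged without a stated hypothesis. The IncidentRecord and Entry classes, the responder name, and the example entries below are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Entry:
    at: datetime
    author: str
    action: str        # what data we are going to look at
    hypothesis: str    # how that data relates to our current theory
    finding: str = ""  # filled in once the data has been examined


@dataclass
class IncidentRecord:
    summary: str
    entries: List[Entry] = field(default_factory=list)

    def investigate(self, author: str, action: str, hypothesis: str) -> Entry:
        """Log a data-gathering step; refuse it if no hypothesis is stated."""
        if not hypothesis.strip():
            raise ValueError("state how this data relates to the theory first")
        entry = Entry(datetime.now(timezone.utc), author, action, hypothesis)
        self.entries.append(entry)
        return entry

    def open_questions(self) -> List[str]:
        """Investigations still waiting on a finding, for the next sync."""
        return [e.action for e in self.entries if not e.finding]


record = IncidentRecord("Checkout errors after the latest deploy")
entry = record.investigate(
    "responder-1",
    "compare new log entries with the orders table",
    "the latest deploy might have caused an inconsistency",
)
print(record.open_questions())

entry.finding = "log entries and table rows agree; deploy looks unrelated"
print(record.open_questions())
```

Calling investigate() with an empty hypothesis raises a ValueError, which is the code equivalent of someone in the sync asking “what do you expect that to tell us?”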

Delivering meaning

Data is just data until theory makes it signal.

The next time you need to build an observable system, or make a system more observable, take the time to consider not just what data the system produces, but how to surface a coherent theory of the system’s workings. Remember that observability is about delivering meaning, not just data.

One thought on “No observability without theory”

  1. Rainer

    Dan, thanks a lot for the great blog post. I really like the way that you make the point that data is just data until theory makes it a signal.
    However, I would like to add two things to the definition of a good doctor. A good doctor not only knows the theory and runs real tests; he is also a master at using his tools to cure the patient. You may understand everything about how the body works, and you may run the right tests to support your theory, but if you are not a master at treating the problem, it does not help. Think about a doctor performing neurosurgery. If the doctor is not very skilled at the surgery, the patient will probably die.
    The second thing I would like to add is the doctor’s ability to explain to the patient the problem he found and how to cure it. If the patient does not understand the doctor’s explanation, he might lose his trust and go to some other doctor.
    To relate that back to observability in software: if you find the problem but can’t actually solve it, then you will still have the problem, as when you identify a performance problem in a piece of software but don’t know how to speed it up. And when you have found the problem, verified it with data, and know how to solve it, but can’t explain it to the people who need to give you the go-ahead to solve it in your proposed way, then the problem will persist, too.
