Imagine you’re an extremely bad doctor. Actually, chances are you don’t even have to imagine. Most people are extremely bad doctors.
But imagine you’re a bad doctor with a breathtakingly thorough knowledge of the human body. You can recite chapter and verse of your anatomy and physiology textbooks, and you’re always up to date on the most important research going on in your field. So what makes you a bad doctor? Well, you never order tests for your patients.
What good does your virtually limitless store of medical knowledge do you? None at all. Without data from real tests, you’ll almost never pick the right interventions for your patients. Every choice you make will be a guess.
There’s another way to be an extremely bad doctor, though. Imagine you don’t really know anything about how the human body works. But you do have access to lots of fancy testing equipment. When a patient comes in complaining of abdominal pain and nausea, you order as many tests as you can think of, hoping that one of them will tell you what’s up.
This rarely works. Most tests just give you a bunch of numbers. Some of those numbers may be outside of normal ranges, but without a coherent understanding of how people’s bodies behave, you have no way to put those numbers into context with each other. They’re just data – not information.
In medicine, data is useless without theory, and theory is useless without data. Why would we expect things to be any different in software?
Observability as signal and theory
The word “observability” gets thrown around a lot, especially in DevOps and SRE circles. Everybody wants to build observable systems, then make their systems more observable, and then get some observability into their observability so they can observe while they observe.
But when we look for concrete things we can do to increase observability, it almost always comes down to adding data. More metrics, more logs, more spans, more alerts. Always more. This makes us like the doctor with all the tests in the world but no bigger picture to fit their tests results into.
Observability is not just data. Observability comprises two interrelated and necessary properties: signal and theory. The relationship between these two properties is as follows:
- Signal emerges from data when we interpret it within our theory about the system’s behavior.
- Theory reacts to signal, changing and adapting as we use it to process new information.
In other words, you can’t have observability without both a rich vein of data and a theory within which that data can be refined into signal. Not enough data and your theory can’t do its job; not enough theory and your data is meaningless. Theory is the alchemy that turns data into knowledge.
What does this mean concretely?
It’s all well and good to have a definition of observability that looks nice on a cocktail napkin. But what can we do with it? How does this help us be better at our job?
The main takeaway from the understanding that observability consists of a relationship between data and theory, rather than simply a surfeit of the former, is this: a system’s observability may be constrained by deficiencies in either the data stream or our theory. This insight allows us to make better decisions when promoting observability.
Making better graph dashboards
However many graphs it contains, a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study. We can facilitate this through many interventions, a few of which are to:
- Add a note panel to the top of every dashboard which give an overview of how that dashboard’s graphs are expected to relate to one another.
- Add links to dashboards for upstream and downstream services, so that data on the dashboard can be interpreted in a meaningful context.
- When building a dashboard, start with a set of questions you want to answer about a system’s behavior, and then choose where and how to add instrumentation; not the other way around.
Making better alerts
Alerts are another form of data that we tend to care about. And like all data, they can only be transmogrified into signal by being interpreted within a theory. To guide this transmogrification, we can:
- Present alerts along with links to corresponding runbooks or graph dashboards.
- Document a set of alerts that, according to our theory, provides sufficient coverage of the health of the system.
- Delete any alerts whose relevance to our theory can’t be explained succinctly.
Engaging in more effective incident response
When there’s an urgent issue with a system, an intuitive understanding of the system’s behavior is indispensable to the problem solving process. That means we depend on the system’s observability. The incident response team’s common ground is their theory of the system’s behavior – in order to make troubleshooting observations meaningful, that theory needs to be kept up to date with the data.
To maintain common ground over the course of incident response, we can:
- Engage in a regular, structured sync conversation about the meaning of new data and the next steps.
- Seek out data only when you can explicitly state how the data will relate to our theory (e.g. “I’m going to compare these new log entries with the contents of such-and-such database table because I think the latest deploy might have caused an inconsistency”).
- Maintain an up-to-date, explicit record of the current state of problem solving, and treat it as the ultimate source of truth.
Data is just data until theory makes it signal.
The next time you need to build an observable system, or make a system more observable, take the time to consider not just what data the system produces, but how to surface a coherent theory of the system’s workings. Remember that observability is about delivering meaning, not just data.