Over a decade ago, I saw this talk by John Rauser. Only recently, though, did I come to realize how incredibly influential this talk has been on my career. Gosh what a great talk! You should watch it.
If you operate a complex system, like a SaaS app, you probably have a dashboard showing a few high-level metrics that summarize the system’s overall state. These metrics (“summary statistics”) are essential. They can reveal many kinds of gross changes (both gross “large scale” and gross “ick”) in the system’s state, over many different time scales. Very useful!
But don’t be misled. Summary statistics reveal certain patterns in the system’s behavior, but they are not identical to the system’s behavior. All summary statistics – yes, even distributions – hide information. They’re lossy. It’s easy to get lulled into the sense that, if an anomaly doesn’t show up in the summary statistics, it doesn’t matter. But a complex system’s behavior is not just curves on a plot. It’s a frothing, many-dimensional vector sum of instant-to-instant interactions.
When you investigate an anomaly in summary statistics, you’re faced with a small number of big facts. Average latency jumped by 20% at such-and-such time. Write IOPS doubled. API server queue depth started rising at some later time. Usually, you “zoom in” from there to find patterns that might explain these changes.
When you instead investigate a specific instance of anomalous behavior, you start with a large number of small facts. A request to such-and-such an endpoint with this-and-that parameter took however many seconds and crashed on line 99 of thing_doer.rb. None of these small facts tell you anything about the system’s overall behavior: this is just a single event among millions or billions or more. But, nevertheless: these small facts can be quite illuminating if you zoom out.
First of all, this probably isn’t the only time a crash like this has ever occurred. Maybe it’s happening multiple times a day. Maybe it happened twice as often this week as it did last week. Maybe it’s happening every time a specific customer makes a specific API request. Maybe that customer is fuming.
And second of all, the reason this event caught our eye in the first place was that it was anomalous. It had some extreme characteristic. Take, for example, a request that was served with very high latency. Perhaps, in the specific anomalous case before us, that extreme latency didn’t cause a problem. But how extreme could it get before it did cause a problem? If it took 20 seconds today, could it take 30 seconds next time? When it hits 30, it’ll time out and throw an error. Or, if multiple requests like this all arrived at the same time, could they exhaust some resource and interfere with other requests?
If the only anomalies you investigate are those that show up in summary statistics, then you’ll only find problems that have already gotten bad enough to move those needles. But if you dig into specific instances of anomalous behavior – “outliers” – then you can often find problems earlier, before they become crises.
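Here's a rough sketch of how that plays out, using made-up numbers: 100,000 ordinary requests plus three slow ones that are creeping toward the 30-second timeout. The `summarize` and `near_timeout` helpers are hypothetical names, not part of any real monitoring tool.

```python
import statistics

TIMEOUT_SECONDS = 30.0  # the timeout from the example above


def summarize(latencies):
    """Return the kind of numbers a dashboard would typically show."""
    ordered = sorted(latencies)
    p99_index = int(0.99 * (len(ordered) - 1))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
    }


def near_timeout(latencies, fraction=0.5):
    """Return individual requests that used at least `fraction` of the timeout."""
    return [x for x in latencies if x >= fraction * TIMEOUT_SECONDS]


# Hypothetical data: lots of ordinary requests, then three very slow ones.
baseline = [0.2] * 100_000
with_outliers = baseline + [20.0, 22.0, 25.0]

before = summarize(baseline)
after = summarize(with_outliers)
outliers = near_timeout(with_outliers)

print("before:", before)
print("after: ", after)
print("requests worth a closer look:", outliers)
```

With this data the median and p99 don't move at all, and the mean shifts by well under a millisecond. The three requests only show up if you go looking for individual events.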
In an organization that delivers a software service, almost all R&D time goes toward building stuff. We figure out what the customer needs, we decide how to represent their need as software, and we proceed to build that software. After we repeat this cycle enough times, we find that we’ve accidentally ended up with a complex system.
Inevitably, by virtue of its complexity, the system exhibits behaviors that we didn’t design. These behaviors are surprises, or – often – problems. Slowdowns, race conditions, crashes, and so on. Things that we, as the designers, didn’t anticipate, either because we failed to consider the full range of potential interactions between system components, or because the system was exposed to novel and unpredictable inputs (i.e. traffic patterns). Surprises emerge continuously, and most couldn’t have been predicted a priori from knowledge of the system’s design.
R&D teams, therefore, must practice two distinct flavors of engineering. Prescriptive engineering is when you say, “What are we going to build, and how?”, and then you execute your plan. Teams with strong prescriptive engineering capabilities can deliver high-quality features fast. And that is, of course, indispensable.
But prescriptive engineering is not enough. As surprises emerge, we need to spot them, understand them, and explain them. We need to practice descriptive engineering.
Descriptive engineering is usually an afterthought
Most engineers rarely engage with production surprises.
We’re called upon to exercise descriptive engineering only in the wake of a catastrophe or a near-catastrophe. Catastrophic events bring attention to the ways in which our expectations about the system’s behavior have fallen short. We’re asked to figure out what went wrong and make sure it doesn’t happen again. And, when that’s done, to put the issue behind us so we can get back to the real work.
In fact, descriptive engineering outside the context of a catastrophe is unheard of in most places. Management tends to see all descriptive engineering as rework: a waste of time that could have been avoided had we just designed our system with more forethought in the first place. But, as Dr. Richard Cook writes in How Complex Systems Fail:
The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these [flaws] are individually insufficient to cause failure they are regarded as minor factors during operations. … The failures change constantly because of changing technology, work organization, and efforts to eradicate failures.
A complex system’s problems are constantly shifting, recombining, and popping into and out of existence. Therefore, descriptive engineering – far from rework – is a fundamental necessity. Over time, the behavior of the system diverges more and more from our expectations. Descriptive engineering is how we bring our expectations back in line with reality.
In other words: our understanding of a complex system is subject to constant entropic decay, and descriptive engineering closes an anti-entropy feedback loop.
Where descriptive engineering lives
Descriptive engineering is the anti-entropy that keeps our shared mental model of the system from diverging too far from reality. As such, no organization would get very far without exercising some form of it.
But, since descriptive engineering effort is so often perceived as waste, it rarely develops a nucleus. Instead, it arises in a panic, proceeds in a hurry, and gets abandoned half-done. It comes in many forms, including:
handling support tickets
debugging a broken deploy
In sum: the contexts in which we do descriptive engineering tend to be those in which something is broken and needs to be fixed. The understanding is subservient to the fix, and once the fix is deployed, there’s no longer a need for descriptive engineering.
Moreover, since descriptive engineering usually calls for knowledge of the moment-to-moment interactions between subsystems in production, and between the overall system and the outside world, this work has a habit of being siphoned away from developers toward operators. This siphoning effect is self-reinforcing: the team that most often practices descriptive engineering will become the team with the most skill at it, so they’ll get assigned more of it.
This is a shame. By adopting the attitude that descriptive engineering need only occur in response to catastrophe, we deny ourselves opportunities to address surprises before they blow up. We’re stuck waiting for random, high-profile failures to shock us into action.
What else can we do?
Instead of doing descriptive engineering only in response to failures, we must make it an everyday practice. To quote Dr. Cook again,
Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure. Put another way, there are many more failure opportunities than overt system accidents.
What would a proactive descriptive engineering practice look like, concretely? One can imagine it in many forms:
A dedicated team of SREs.
A permanent cross-functional team composed of engineers familiar with many different parts of the stack.
A cultural expectation that all engineers spend some amount of their time on descriptive engineering and share their results.
A permanent core team of SREs, joined by a rotating crew of other engineers. Incidentally, this describes the experimental team I’m currently leading IRL, which is called Production Engineering.
I have a strong preference for models that distribute descriptive engineering responsibility across many teams. If the raison d’être of descriptive engineering is to maintain parity between our expectations of system behavior and reality, then it makes sense to spread that activity as broadly as possible among the people whose expectations get encoded into the product.
In any case, however we organize the effort, the main activities of descriptive engineering will look much the same. We delve into the data to find surprises. We pick some of these surprises to investigate. We feed the results of our investigations back into the development pipeline. And we do this over and over.
It may not always be glamorous, but it sure beats the never-ending breakdown.
In a complex system, like a SaaS product, you have countless interdependent things all happening at every moment. On a gross scale: means of production go in; value comes out. But under the hood, all sorts of interactions are going on, at various rates and latencies, with various exit conditions, stacked to various heights of indirection. And the rules for these interactions change with every deploy.
As load gets added to the system, you start to experience 🏋️♀️🔥🪦 load problems. The system just can’t keep up with traffic, and it starts to break down under various circumstances, in all kinds of different more- or less-spectacular ways.
At the onset of load problems, it feels like everything is breaking at once. Like some threshold was breached and we’ve entered a new domain. We always had problems, but now we have a new Big Problem.
Big Problems, though, are always composed of smaller problems. Big Problems conditioned by load are not atomic novel phenomena. There’s a knot of interdependent processes, and when this knot gets pulled tight by load, it binds up.
Maybe there was a code deploy 6 weeks ago that inadvertently made this API endpoint’s median latency 50% higher. And maybe that API endpoint holds open an AMQP connection until it finishes. And maybe some customer just spun up 500 instances of a container that hits the endpoint. Now all these tangled Small Problems constrict, resources are exhausted, and the system stops producing.
It’s just these sorts of interdependency knots that cause an increase in the frequency of customer-facing problems as load rises.
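To see how a knot like that can bind, here's a back-of-the-envelope sketch using Little's law (average concurrency = arrival rate × average time in the system). Every number in it is an assumption I made up for illustration: the connection limit, the baseline traffic, the latencies, and the per-container request rate. It also treats the latency figure as a mean rather than a median, which is what Little's law calls for.

```python
def connections_in_use(requests_per_second, latency_ms):
    """Little's law: average concurrency = arrival rate * average time in system."""
    return requests_per_second * latency_ms / 1000


CONNECTION_LIMIT = 200                      # hypothetical AMQP connection limit
NORMAL_RATE = 100                           # hypothetical steady traffic (req/s)
OLD_LATENCY_MS = 400                        # hypothetical latency before the deploy
NEW_LATENCY_MS = OLD_LATENCY_MS * 3 // 2    # 50% higher after the deploy
BURST_RATE = 500 * 0.5                      # 500 containers, assumed 0.5 req/s each

scenarios = {
    "before deploy": connections_in_use(NORMAL_RATE, OLD_LATENCY_MS),
    "deploy only": connections_in_use(NORMAL_RATE, NEW_LATENCY_MS),
    "burst only": connections_in_use(NORMAL_RATE + BURST_RATE, OLD_LATENCY_MS),
    "deploy and burst": connections_in_use(NORMAL_RATE + BURST_RATE, NEW_LATENCY_MS),
}

for name, used in scenarios.items():
    status = "EXHAUSTED" if used > CONNECTION_LIMIT else "ok"
    print(f"{name:18} {used:6.0f} connections  {status}")
```

Under these assumptions, neither Small Problem exhausts the connection pool by itself. Only the combination does, which is the sense in which the knot binds up.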
We rely heavily on Big Problems to tell us which Small Problems need fixing. “How did this failure happen? It happened because of this and this and this contributing factor. Which of these can we fix?” And of course the results of this sort of analysis are useful. The Small Problems we identify this way have a crucial unifying feature: we know, because we’ve seen it happen, that they can each be contributing factors to Big Problems.
But this is not the only way to identify Small Problems. We can go digging in our data and find them, endeavoring to solve them before they get a chance to cause a Big Problem.
Okay, maybe we can’t prevent every Big Problem this way. And without the gating effect of post-mortem analysis, we’ll likely spend some of our time fixing Small Problems that never would’ve amounted to anything.
But even if we’re just poking around in the dark, we’ll probably end up preventing some Big Problems, right? And Big Problems are expensive. I think, on balance, it’s worth the effort.
Somebody ought to be getting paid to be Maxwell’s Demon. Ideally, nearly everybody in the org ought to be spending some of their time investigating mysteries. It ought to be an explicit part of the job.
Take a nontrivial software system and put it on the internet. Problems will emerge. Some problems will be serious; others less so. We won’t notice most of them.
A software system in production is a bucket filled with fluid. Each particle of the fluid is a discrete problem. The problems bounce around and collide with each other and do all kinds of stochastic stuff from moment to moment.
At the very bottom of the bucket are problems so minute that they can hardly be called problems at all. They have low energy. They don’t interact much with each other or with anything else.
Higher up, you find higher-and-higher-energy particles. Problems that cause small hiccups, or sporadic bouts of sluggishness.
Somewhere near the top, there’s a threshold. When a problem gets enough energy to cross this threshold, we passively notice it. Maybe it causes an outage, or maybe it just causes a false positive alert. Maybe a support ticket gets filed. Maybe it’s just a weird spike in a graph. However we perceive it, we’re forced to take it seriously.
What happens before a particle jumps this energy threshold?
Perhaps the problem is entirely novel – no part of it existed before now. A code deploy with a totally self-contained bug. A DoS attack. If it’s something like that: oh well.
But more often, a problem we just perceived has been acted upon by a more gradual process. Problems bounce around in the bucket, and occasionally they bounce into each other and you get a problem with higher energy than before. Or circumstances shift, and a problem that was once no big deal becomes a big deal. Over time, particles that started in the middle – or even at the bottom – can work their way up to the passive perception line.
If problems usually hang out below the perception threshold for a while before they cross it, then we can take advantage of that in two ways. One way is to lower the threshold for passive perception. Raise the sensitivity of our monitors without sacrificing specificity. This is hard, but worthwhile.
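Why is this hard? Here's a small sketch of the tradeoff, assuming a simple threshold monitor and a hypothetical history of queue-depth readings that somebody labeled after the fact. Sensitivity is the fraction of real problems the monitor catches; specificity is the fraction of non-problems it correctly stays quiet about.

```python
def sensitivity_and_specificity(samples, threshold):
    """samples: list of (metric_value, was_real_problem) pairs.

    The monitor "fires" when metric_value >= threshold.
    """
    tp = sum(1 for value, real in samples if real and value >= threshold)
    fn = sum(1 for value, real in samples if real and value < threshold)
    tn = sum(1 for value, real in samples if not real and value < threshold)
    fp = sum(1 for value, real in samples if not real and value >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical history: queue depth readings, labeled after the fact.
history = [
    (5, False), (8, False), (12, False), (15, False), (22, False),
    (18, True), (25, True), (40, True), (60, True),
]

results = {t: sensitivity_and_specificity(history, t) for t in (50, 20, 10)}
for threshold, (sens, spec) in results.items():
    print(f"threshold {threshold:2}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With this made-up data, simply dropping the threshold buys sensitivity at the direct expense of specificity. Getting both usually means finding a better signal, not just a lower number.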
The other way to take advantage of the fluid-like behavior of problems is to spend energy finding and fixing problems before they boil. I call this the Maxwell’s demon approach. You go looking for trouble. You poke around in dashboards and traces and logs, find things that look weird, turn them around in your hands until you understand them, and ultimately fix them. Maybe you have a ticket backlog of possible problems you’ve found, and it’s somebody’s job to burn down that backlog. Ideally it’s the job of a team using a shared-context system like differential diagnosis.
If you make it somebody’s job to be Maxwell’s demon, you can find and fix all sorts of problems before they become bigger problems. If you don’t make it someone’s job, then no problem will get taken seriously until it’s an outage.
Imagine you’re an extremely bad doctor. Actually, chances are you don’t even have to imagine. Most people are extremely bad doctors.
But imagine you’re a bad doctor with a breathtakingly thorough knowledge of the human body. You can recite chapter and verse of your anatomy and physiology textbooks, and you’re always up to date on the most important research going on in your field. So what makes you a bad doctor? Well, you never order tests for your patients.
What good does your virtually limitless store of medical knowledge do you? None at all. Without data from real tests, you’ll almost never pick the right interventions for your patients. Every choice you make will be a guess.
There’s another way to be an extremely bad doctor, though. Imagine you don’t really know anything about how the human body works. But you do have access to lots of fancy testing equipment. When a patient comes in complaining of abdominal pain and nausea, you order as many tests as you can think of, hoping that one of them will tell you what’s up.
This rarely works. Most tests just give you a bunch of numbers. Some of those numbers may be outside of normal ranges, but without a coherent understanding of how people’s bodies behave, you have no way to put those numbers into context with each other. They’re just data – not information.
In medicine, data is useless without theory, and theory is useless without data. Why would we expect things to be any different in software?
Observability as signal and theory
The word “observability” gets thrown around a lot, especially in DevOps and SRE circles. Everybody wants to build observable systems, then make their systems more observable, and then get some observability into their observability so they can observe while they observe.
But when we look for concrete things we can do to increase observability, it almost always comes down to adding data. More metrics, more logs, more spans, more alerts. Always more. This makes us like the doctor with all the tests in the world but no bigger picture to fit their test results into.
Observability is not just data. Observability comprises two interrelated and necessary properties: signal and theory. The relationship between these two properties is as follows:
Signal emerges from data when we interpret it within our theory about the system’s behavior.
Theory reacts to signal, changing and adapting as we use it to process new information.
In other words, you can’t have observability without both a rich vein of data and a theory within which that data can be refined into signal. Not enough data and your theory can’t do its job; not enough theory and your data is meaningless. Theory is the alchemy that turns data into knowledge.
What does this mean concretely?
It’s all well and good to have a definition of observability that looks nice on a cocktail napkin. But what can we do with it? How does this help us be better at our job?
The main takeaway from the understanding that observability consists of a relationship between data and theory, rather than simply a surfeit of the former, is this: a system’s observability may be constrained by deficiencies in either the data stream or our theory. This insight allows us to make better decisions when promoting observability.
Making better graph dashboards
However many graphs it contains, a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study. We can facilitate this through many interventions, a few of which are to:
Add a note panel to the top of every dashboard which gives an overview of how that dashboard’s graphs are expected to relate to one another.
Add links to dashboards for upstream and downstream services, so that data on the dashboard can be interpreted in a meaningful context.
When building a dashboard, start with a set of questions you want to answer about a system’s behavior, and then choose where and how to add instrumentation; not the other way around.
Making better alerts
Alerts are another form of data that we tend to care about. And like all data, they can only be transmogrified into signal by being interpreted within a theory. To guide this transmogrification, we can:
Present alerts along with links to corresponding runbooks or graph dashboards.
Document a set of alerts that, according to our theory, provides sufficient coverage of the health of the system.
Delete any alerts whose relevance to our theory can’t be explained succinctly.
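One way to put that list into practice is a periodic review that flags alerts missing a link or a short rationale. This is only a sketch: the `Alert` shape, the `review` function, the example alert names, the URL, and the 40-word cutoff for "succinctly" are all assumptions of mine, not features of any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    runbook_url: str = ""   # link presented alongside the alert
    rationale: str = ""     # one sentence tying the alert to our theory


MAX_RATIONALE_WORDS = 40    # arbitrary stand-in for "succinctly"


def review(alerts):
    """Split alerts into keepers and candidates for deletion, with reasons."""
    keep, reconsider = [], {}
    for alert in alerts:
        problems = []
        if not alert.runbook_url:
            problems.append("no runbook or dashboard link")
        words = len(alert.rationale.split())
        if words == 0:
            problems.append("no stated rationale")
        elif words > MAX_RATIONALE_WORDS:
            problems.append("rationale is not succinct")
        if problems:
            reconsider[alert.name] = problems
        else:
            keep.append(alert.name)
    return keep, reconsider


# Hypothetical alerts for illustration.
alerts = [
    Alert(
        name="api_queue_depth_high",
        runbook_url="https://wiki.example.com/runbooks/api-queue-depth",
        rationale="Rising queue depth means API servers are falling behind traffic.",
    ),
    Alert(name="disk_inode_weirdness"),
]

keep, reconsider = review(alerts)
print("keep:", keep)
print("reconsider:", reconsider)
```

The check can't tell you whether a rationale is actually true. It just makes the absence of one visible, which is usually where the conversation needs to start.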
Engaging in more effective incident response
When there’s an urgent issue with a system, an intuitive understanding of the system’s behavior is indispensable to the problem solving process. That means we depend on the system’s observability. The incident response team’s common ground is their theory of the system’s behavior – in order to make troubleshooting observations meaningful, that theory needs to be kept up to date with the data.
To maintain common ground over the course of incident response, we can:
Engage in a regular, structured sync conversation about the meaning of new data and the next steps.
Seek out data only when you can explicitly state how the data will relate to our theory (e.g. “I’m going to compare these new log entries with the contents of such-and-such database table because I think the latest deploy might have caused an inconsistency”).
Maintain an up-to-date, explicit record of the current state of problem solving, and treat it as the ultimate source of truth.
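As a rough illustration of that last pair of points, here's a sketch of an explicit incident record that refuses to log a data-gathering action unless somebody states how it relates to the theory. The `IncidentRecord` class, its fields, and the example entries (including "alice") are hypothetical. A shared doc or a pinned chat message could serve the same purpose.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentRecord:
    """A shared, explicit record of where problem solving currently stands."""
    symptoms: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)
    actions: list = field(default_factory=list)   # (who, what, why) tuples

    def plan_action(self, who, what, why):
        """Record an action only if its relation to the theory is stated."""
        if not why.strip():
            raise ValueError("say how this data relates to a hypothesis first")
        self.actions.append((who, what, why))

    def summary(self):
        """Text that someone joining late can read instead of the chat scrollback."""
        lines = ["Symptoms:"]
        lines += [f"  - {s}" for s in self.symptoms]
        lines.append("Hypotheses:")
        lines += [f"  - {h}" for h in self.hypotheses]
        lines.append("Actions in progress:")
        lines += [f"  - {who}: {what} (because {why})"
                  for who, what, why in self.actions]
        return "\n".join(lines)


record = IncidentRecord()
record.symptoms.append("API latency elevated since the latest deploy")
record.hypotheses.append("the latest deploy caused an inconsistency in a database table")
record.plan_action(
    who="alice",
    what="compare new log entries with the table contents",
    why="a mismatch would support the deploy-inconsistency hypothesis",
)
print(record.summary())
```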
Data is just data until theory makes it signal.
The next time you need to build an observable system, or make a system more observable, take the time to consider not just what data the system produces, but how to surface a coherent theory of the system’s workings. Remember that observability is about delivering meaning, not just data.
I was pretty happy with my talk, especially the Star Trek: The Next Generation vignette in the middle. It was a lot of ideas to pack into a single talk, but I think a lot of people got the point. However, I did give a really unsatisfactory answer (30m46s) to the first question I received. The question was:
In the differential diagnosis steps, you listed performing tests to falsify assumptions. Are you borrowing that from medicine? In tech are we only trying to falsify assumptions, or are we sometimes trying to validate them?
I didn’t have a real answer at the time, so I spouted some bullshit and moved on. But it’s a good question, and I’ve thought more about it, and I’ve come up with two (related) answers: a common-sense answer and a pretentious philosophical answer.
The Common Sense Answer
My favorite thing about differential diagnosis is that it keeps the problem-solving effort moving. There’s always something to do. If you’re out of hypotheses, you come up with new ones. If you finish a test, you update the symptoms list. It may not always be easy to make progress, but you always have a direction to go, and everybody stays on the same page.
But when you seek to confirm your hypotheses, rather than to falsify others, it’s easy to fall victim to tunnel vision. That’s when you fixate on a single idea about what could be wrong with the system. That single idea is all you can see, as if you’re looking at it through a tunnel whose walls block everything else from view.
Tunnel vision takes that benefit of differential diagnosis – the constant presence of a path forward – and negates it. You keep running tests to try to confirm your hypothesis, but you may never prove it. You may just keep getting test results that are consistent with what you believe, but that are also consistent with an infinite number of hypotheses you haven’t thought of.
A focus on falsification instead of verification can be seen as a guard against tunnel vision. You can’t get stuck on a single hypothesis if you’re constrained to falsify other ones. The more alternate hypotheses you manage to falsify, the more confident you get that you should be treating for the hypotheses that might still be right.
Now, of course, there are times when it’s possible to verify your hunch. If you have a highly specific test for a problem, then by all means try it. But in general it’s helpful to focus on knocking down hypotheses rather than propping them up.
The Pretentious Philosophical Answer
Karl Popper’s The Logic Of Scientific Discovery, published in 1959 but based on his earlier book Logik der Forschung from 1934, makes a then-controversial [now widely accepted (but not universally accepted, because philosophers make cats look like sheep, herdability-wise)] claim. I’ll paraphrase the claim like so:
Science does not produce knowledge by generalizing from individual experiences to theories. Rather, science is founded on the establishment of theories that prohibit classes of events, such that the reproducible occurrence of such events may falsify the theory.
Popper was primarily arguing against a school of thought called logical positivism, whose subscribers assert that a statement is meaningful if and only if it is empirically testable. But what matters to our understanding of differential diagnosis isn’t so much Popper’s absolutely brutal takedown of logical positivism (and damn is it brutal), as it is his arguments in favor of falsifiability as the central criterion of science.
I find one particular argument enlightening on the topic of falsification in differential diagnosis. It hinges on the concept of self-contradictory statements.
There’s an important logical precept named – a little hyperbolically – the Principle of Explosion. It asserts that any statement that contradicts itself (for example, “my eyes are brown and my eyes are not brown”) implies all possible statements. In other words: if you assume that a statement and its negation are both true, then you can deduce any other statement you like. Here’s how:
Assume that the following two statements are true:
“All cats are assholes”
“There exists at least one cat that is not an asshole”
Therefore the statement “Either all cats are assholes, or 9/11 was an inside job” (we’ll call this Statement A) is true, since the part about the asshole cats is true.
However, if the statement “there exists at least one cat that is not an asshole” is true too (which we’ve assumed it is) and 9/11 were not an inside job, then Statement A would be false, since neither of its two parts would be true.
So the only way left for Statement A to be true is for “9/11 was an inside job” to be a true statement. Therefore, 9/11 was an inside job.
Wake up, sheeple.
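If you don't trust the cats, you can check the argument mechanically. The sketch below brute-forces truth tables: an argument is valid if no assignment of true/false makes every premise true and the conclusion false. Here P stands for "all cats are assholes" (so "there exists at least one cat that is not an asshole" is not-P), and Q stands for any statement whatsoever. The `is_valid` helper is my own illustration.

```python
from itertools import product


def is_valid(premises, conclusion, variables=2):
    """Valid means: no truth assignment makes every premise true
    while making the conclusion false."""
    for values in product([True, False], repeat=variables):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False
    return True


contradiction = [lambda p, q: p, lambda p, q: not p]

# The two steps used in the proof above:
step_one = is_valid([lambda p, q: p], lambda p, q: p or q)
step_two = is_valid([lambda p, q: p or q, lambda p, q: not p], lambda p, q: q)

# The overall result: from P and not-P, both Q and not-Q follow.
explodes_to_q = is_valid(contradiction, lambda p, q: q)
explodes_to_not_q = is_valid(contradiction, lambda p, q: not q)

# Sanity check: without the contradiction, Q does not follow from P alone.
p_alone_gives_q = is_valid([lambda p, q: p], lambda p, q: q)

print(step_one, step_two, explodes_to_q, explodes_to_not_q, p_alone_gives_q)
```

The contradiction "proves" Q and not-Q equally well, which is the whole problem: premises that can never all be true make every argument vacuously valid.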
The Principle of Explosion is the crux of one of Popper’s most convincing arguments against the Principle of Induction as the basis for scientific knowledge.
It was assumed by many philosophers of science before Popper that science relied on some undefined Principle of Induction which allowed one to generalize from a finite list of experiences to a general rule about the universe. For example, the Principle of Induction would allow one to reason from enough statements like “I dropped a ball and it fell” and “My friend dropped a wrench and it fell” to the general rule “When things are dropped, they fall.” But Popper argued against the existence of the Principle of Induction. In particular, he pointed out that:
If there were some way to prove a general rule by demonstrating the truth of a finite number of examples of its consequences, then we would be able to deduce anything from such a set of true statements.
Right? By the Principle of Explosion, a self-contradictory statement implies the truth of all statements. If we accepted the Principle of Induction, then the same evidence that proves “When things are dropped, they fall” would also prove “All cats are assholes and there exists at least one cat that is not an asshole,” which would prove every statement we can imagine.
So what does this have to do with falsification in differential diagnosis? Well, imagine you’ve come up with these hypotheses to explain some API slowness you’re troubleshooting:
Hypothesis Alpha: contention on the table cache is too high, so extra latency is introduced for each new table opened
Hypothesis Bravo: we’re hitting our IOPS limit on the EBS volume attached to the database server
There are many test results that would be compatible with Hypothesis Alpha. But unless you craft your tests very carefully, those same results will also be compatible with Hypothesis Bravo. Without a highly specific test for table cache contention, you can’t prove Hypothesis Alpha through a series of observations that agree with it.
What you can do, however, is try to quickly falsify Hypothesis Bravo by checking some graphs against some AWS configuration data. And if you do that, then Hypothesis Alpha is your best remaining guess. Now you can start treating for table cache contention on the one hand, and on the other, attempting the more time-consuming process (especially if it’s correct!) of falsifying Hypothesis Alpha.
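Here's a toy version of that reasoning. Each hypothesis lists what it predicts for certain checks, and a hypothesis that says nothing about a check is compatible with any result. The check names and the observed values are made up for illustration.

```python
def surviving(hypotheses, observations):
    """Drop every hypothesis that predicted something the observations contradict."""
    survivors = []
    for name, predicted in hypotheses.items():
        contradicted = any(
            check in predicted and predicted[check] != result
            for check, result in observations.items()
        )
        if not contradicted:
            survivors.append(name)
    return survivors


hypotheses = {
    "Alpha: table cache contention": {
        "API latency is elevated": True,
    },
    "Bravo: EBS volume at its IOPS limit": {
        "API latency is elevated": True,
        "volume IOPS are near the configured limit": True,
    },
}

# An observation that agrees with both hypotheses eliminates neither.
after_latency_check = surviving(hypotheses, {"API latency is elevated": True})

# An observation Bravo forbids knocks Bravo out and leaves Alpha standing.
after_iops_check = surviving(hypotheses, {
    "API latency is elevated": True,
    "volume IOPS are near the configured limit": False,
})

print(after_latency_check)
print(after_iops_check)
```

Note that Alpha hasn't been proven by any of this. It has survived a check that could have helped a rival, which is all falsification promises.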
Isn’t this kind of abstract?
Haha OMG yes. It’s the most abstract. But that doesn’t mean it’s not a useful idea.
If it’s your job to troubleshoot problems, you know that tunnel vision is very real. If you focus on generating alternate hypotheses and falsifying them, you can resist tunnel vision’s allure.
I work on a team that fixes complex systems under time pressure. My teammates have different skill sets, different priorities, and different levels of expertise. But we all have to troubleshoot and solve problems together.
This is really hard to do effectively. Fortunately for us in the relatively new domain of DevOps, situations like ours have been studied extensively in the last couple decades. We can use the results of this research to inform our own processes and automation for troubleshooting.
One of the most important concepts to emerge from recent teamwork research, common ground, helps us understand why collaborative troubleshooting breaks down over time. This breakdown leads to wasted effort and mistakes, even if the team maintains constant communication in a chat room. But if we extend ChatOps by drawing on some ideas from medical diagnosis, we can make troubleshooting way easier without losing the benefits of fluid team conversation.
Ergonomics researchers D.D. Woods and Gary Klein (the latter of whom I wrote about in What makes an expert an expert?) published a phenomenally insightful paper in 2004 called Common Ground and Coordination in Joint Activity. In it, they describe a particular kind of failure that occurs when people engage in joint cognition: the Fundamental Common Ground Breakdown. Once you learn about the Fundamental Common Ground Breakdown, you see it everywhere. Here’s how the Woods/Klein paper describes the FCGB:
Party A believes that Party B possesses some knowledge
Party B doesn’t have this knowledge, and doesn’t know he is supposed to have it.
Therefore, he or she doesn’t request it.
This lack of a request confirms to Party A that Party B has the knowledge.
When this happens, Party A and Party B lose common ground, which Woods & Klein define as “pertinent knowledge, beliefs and assumptions that are shared among the involved parties.” The two parties start making incorrect assumptions about each other’s knowledge and beliefs, which causes their common ground to break down further and further. Eventually they reach a coordination surprise, which forces them to re-synchronize their understanding of the coordinated activity.
Seriously, the FCGB is everywhere. Check out the paper.
I’m especially interested in one particular area where an understanding of common ground can help us do better teamwork: joint troubleshooting.
Common Ground Breakdown in Chatroom Troubleshooting
Everybody’s into ChatOps these days, and I totally get it. When a critical system is broken, it’s super useful to get everybody in the same room and hash it out. ChatOps allows everybody to track progress, coordinate activities, and share results. And it also helps to have lots of different roles represented in the room:
Operations folks, to provide insight into the differences between the system’s normal behavior and its current state
Software engineers, who bring detailed knowledge of the ways subsystems are supposed to work
Account managers and product managers and support reps: not just for their ability to translate technical jargon into the customer’s language for status reporting, but also because their understanding of customer needs can help establish the right priorities
Q.A. engineers, who can rule out certain paths of investigation early with their intuition for the ways in which subsystems tend to fail
The process of communicating across role boundaries isn’t just overhead: it helps us refine our own understanding, look for extra evidence, and empathize with each other’s perspectives.
But ChatOps still offers a lot of opportunities for common ground breakdown. The FCGB can occur whenever different people interpret the same facts in different ways. Interpretations can differ for many different reasons:
Some people have less technical fluency in the system than others. A statement like “OOM killer just killed Cassandra on db014” might change an ops engineer’s whole understanding of the problem, but such a shift could fly under the radar of, say, a support engineer.
Some people are multitasking. They may have a stake in the troubleshooting effort but be unable to internalize every detail from the chat room in real time.
Some people are co-located. They find it easier to discuss the problem using mouth words or by physically showing each other graphs, thereby adjusting their own shared understanding without transmitting these adjustments to the rest of the team.
Some people enter the conversation late, or leave for a while and come back. These people will miss common ground changes that happen during their absence.
These FCGB opportunities all become more pronounced as the troubleshooting drags on and folks become tired, bored, and confused. And when somebody says they’ve lost track of common ground, what do we do? One of two things: we provide a summary of recent events and let the person ask questions until they feel comfortable, or we tell them to read the backlog.
The Q&A approach has serious drawbacks. First of all, it requires somebody knowledgeable to stop what they’re doing and summarize the situation. If people are frequently leaving and entering the chat room, you end up with a big distraction. Second of all, it leaves lots of room for important information to get missed. The Fundamental Common Ground Breakdown happens when somebody doesn’t know what to ask, so fixing it with a Q&A session is kind of silly.
The other way people catch up with the troubleshooting effort is by reading the backlog. This is even more inefficient than Q&A. Here’s the kind of stuff you have to dig through when you’re reading a chat backlog:
There’s a lot to unpack there – and that’s just 18 messages! Imagine piecing together a troubleshooting effort that’s gone on for hours, or days. It would take forever, and you’d still make a lot of mistakes. It’s just not a good way to preserve common ground.
So what do we need?
Differential Diagnosis as an Engine of Common Ground
I’ve blogged before about how much I love differential diagnosis. It’s a formalism that doctors use to keep the diagnostic process moving in the right direction. I’ve used it many times in ops since I learned about it. It’s incredibly useful.
In differential diagnosis, you get together with your team in front of a whiteboard – making sure to bring together people from a wide variety of roles – and you go through a cycle of 3 steps:
Identify symptoms. Write down all the anomalies you’ve seen. Don’t try to connect the dots just yet; just write down your observations.
Generate hypotheses. Brainstorm explanations for the symptoms you’ve observed. This is where it really helps to have a good cross-section of roles represented. The more diverse the ideas you write down, the better.
Test hypotheses. Now that you have a list of things that might be causing the problem, you start narrowing down that list by coming up with a test that will prove or disprove a certain hypothesis.
Once you’re done with step #3, you can cross out a hypothesis or two. Then you head back to step #1 and repeat the cycle until the problem is identified.
A big part of the power of differential diagnosis is that it’s written down. Anybody can walk into the room, read the whiteboard, and understand the state of the collaborative effort. It cuts down on redundant Q&A, because the most salient information is summarized on the board. It eliminates inefficient chat log reading – the chat log is still there, but you use it to search for specific pieces of information instead of reading it like a novel. But, most importantly, differential diagnosis cuts down on fundamental common ground breakdowns, because everybody has agreed to accept what’s on the whiteboard as the canonical state of troubleshooting.
Integrating Differential Diagnosis with ChatOps
We don’t want to lose the off-the-cuff, conversational nature of ChatOps. But we need a structured source of truth to provide a point-in-time understanding of the effort. And we (read: I) don’t want to write a whole damn software project to make that happen.
My proposal is this: use Trello for differential diagnosis, and integrate it with the chat through a Hubot plugin. I haven’t written this plugin yet, but it shouldn’t take long (I’ll probably fork hubot-trello and start from there). That way people could update the list of symptoms, hypotheses, and tests on the fly, and they’d always have a central source of common ground to refer to.
In the system I envision, the chat room conversation would be peppered with statements like:
Geordi: hubot symptom warp engine going full speed, but ship not moving
Hubot: Created (symp0): warp engine going full speed, but ship not moving
Beverly: hubot falsify hypo1
Hubot: Falsified (hypo1): feedback loop between graviton emitter and graviton roaster
Geordi: hubot finish test1
Hubot: Marked (test1) finished: reboot the quantum phase allometer
And the resulting differential diagnosis board, containing the agreed-upon state of the troubleshooting effort, might look like this example, with cards labeled to indicate that they’re no longer in play.
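To make the idea concrete, here is a minimal sketch of the bookkeeping such a plugin would need. It is plain Python rather than a real Hubot script, and it doesn’t touch the Trello API at all. The `symptom`, `falsify`, and `finish` commands come from the chat example above; the `hypothesis` and `test` creation commands, the `Board` and `Card` classes, and the `handle` function are all names I made up for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical card kinds and ID prefixes, mirroring the chat example.
PREFIXES = {"symptom": "symp", "hypothesis": "hypo", "test": "test"}


@dataclass
class Card:
    card_id: str
    text: str
    status: str = "open"  # "open", "falsified", or "finished"


@dataclass
class Board:
    cards: Dict[str, Card] = field(default_factory=dict)
    counters: Dict[str, int] = field(
        default_factory=lambda: {prefix: 0 for prefix in PREFIXES.values()}
    )

    def create(self, kind: str, text: str) -> str:
        """Add a new card and return the bot's confirmation message."""
        prefix = PREFIXES[kind]
        card_id = f"{prefix}{self.counters[prefix]}"
        self.counters[prefix] += 1
        self.cards[card_id] = Card(card_id, text)
        return f"Created ({card_id}): {text}"

    def mark(self, card_id: str, status: str) -> str:
        """Take a card out of play without deleting it from the board."""
        card = self.cards.get(card_id)
        if card is None:
            return f"No such card: {card_id}"
        card.status = status
        if status == "falsified":
            return f"Falsified ({card_id}): {card.text}"
        return f"Marked ({card_id}) {status}: {card.text}"

    def in_play(self) -> List[Card]:
        """Cards that still describe the agreed-upon state of the effort."""
        return [card for card in self.cards.values() if card.status == "open"]


COMMAND = re.compile(r"^hubot\s+(\w+)\s+(.+)$")


def handle(board: Board, message: str) -> Optional[str]:
    """Return the bot's reply, or None for ordinary conversation."""
    match = COMMAND.match(message.strip())
    if match is None:
        return None
    verb, arg = match.group(1), match.group(2).strip()
    if verb in PREFIXES:
        return board.create(verb, arg)
    if verb == "falsify":
        return board.mark(arg, "falsified")
    if verb == "finish":
        return board.mark(arg, "finished")
    return f"Unknown command: {verb}"
```

Feeding each chat line through `handle` leaves ordinary conversation alone and keeps the board current. Falsified and finished cards stay on the board with a changed status, so a latecomer can still see what has already been ruled out.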
What do you think?
Let me know if your organization already has something like this, or has tried a formal differential diagnosis approach before. I’d love to read some observations about your team’s process in the comments. Also, VictorOps has a pretty neat suite of tools that approaches what I have in mind, but I still think a more conceptually structured (not to mention free) solution could be very useful.
Automation is most effective when it’s a team player. By using automation to preserve common ground, we can solve problems faster and more thoroughly, with less frustration and less waste. And that all sounds pretty good to me.
You get a phone call in the middle of the night. The new sysadmin (whom you hired straight out of college) is flipping all of her shits because web app performance has degraded beyond the alert threshold. She’s been clicking through page after page of graphs, checking application logs all the way up and down the stack, and just generally cussing up a storm because she can’t find the source of the issue. You open your laptop, navigate straight to overall performance graphs, drill down to database graphs, see a pattern that looks like mutex contention, log in to the database, find the offending queries, and report them to the on-call dev. You do all this in a matter of minutes.
You’re trying to teach your dad to play Mario Kart. It’s like “Okay, go forward… no, forward… you have to press the gas – no, that’s fire – press the gas button… it’s the A button… the blue one… Yeah, there you go, okay, you’re going forward now… so… so go around the corner… why’d you stop? Dad… it’s like driving a car, you can’t turn if you’re stopped… so remember, gas is A… which is the blue one…”
Why is it so hard for experts to understand the novice experience? Well, in his book Sources of Power, decision-making researcher Gary Klein presents some really interesting theories about what makes experts experts. His theories give us insight into the communication barriers between novices and experts, which can make us better teachers and better learners.
Klein arrived at his decision-making model, the recognition-primed decision model, by interviewing hundreds of experts over several years. According to his research, experts in a huge variety of fields rely on mental simulation. In Sources of Power, he defines mental simulation as:
the ability to imagine people and objects consciously and to transform those people and objects through several transitions, finally picturing them in a different way than at the start.
Klein has never studied sysadmins, but when I read about his model I recognized it immediately. This is what we do when we’re trying to reason out how a problem got started, and it’s also how we figure out how to fix it. In our head, we have a model of the system in which the problem lives. Our model consists of some set of moving parts that go through transitions from one state to another.
If you and your friend are trying to figure out how to get a couch around a corner in your stairwell, your moving parts are the couch, your body, and your friend’s body. If you’re trying to figure out how a database table got corrupted, your moving parts might be the web app, the database’s storage engine, and the file system buffer. You envision a series of transitions from one state to the next. If those transitions don’t get you from the initial state to the final state then you tweak your simulation and try again until you get a solution.
Here’s the thing, though: we’re people. Our brains have a severely limited amount of working memory. In his interviews with experts about their decision-making processes, Klein found that there was a pretty hard upper limit on the complexity of our mental simulations:
3 moving parts
6 transitions
That’s about all we get, regardless of our experience or intelligence. So how do experts mentally simulate so much more effectively than novices?
As we gain experience in a domain, we start to see how the pieces fit together. As we notice more and more causal patterns, we build a mental bank of abstractions. An abstraction is a kind of abbreviation that stands in for a set of transitions or moving parts that usually functions as a whole. It’s like the keyboard of a piano: when the piano’s working correctly, we don’t have to think about the Rube Goldberg-esque series of yanks and shoves going on inside it; we press a key, and the corresponding note comes out.
Experts have access to a huge mental bank of abstractions. Novices don’t yet. This makes experts more efficient at creating mental simulations.
When you’re first learning to drive a car, you have to do everything step by step. You don’t have the abstraction bank of an experienced driver. When the driving instructor tells you to back out of a parking space, your procedure looks something like this:
Make sure foot is on brake pedal
Shift into reverse
Release brake enough to get rolling
Turn steering wheel (which direction is it when I’m in reverse?)
Put foot back on brake pedal
Shift into drive
It’s a choppy, nerve-racking sequence of individual steps. But once you practice this a dozen times or so, you start to build some useful abstractions. Your procedure for backing out of a parking space becomes more like:
Go backward (you no longer think about how you need to brake, shift, and release the brake)
Get facing the right direction
Once you’ve done it a hundred times, it’s just one step: “Back out of the parking space.”
Now if you recall that problem solving involves mental simulations with at most 3 moving parts and 6 transitions, you’ll see why abstractions are so critical to the making of an expert. Whereas a novice requires several transitions to represent a process, an expert might only need one. The right choice of abstraction allows the expert to hold a much richer simulation in mind, which improves their effectiveness in predicting outcomes and diagnosing problems.
Klein highlights another important difference between experts and novices: experts can readily process counterfactuals, meaning explanations and predictions that are inconsistent with the data. This is how experts are able to improvise in unexpected situations.
Imagine that you’re troubleshooting a spate of improper 403 responses from a web app that you admin. You expect that the permissions on some cache directory got borked in the last deploy, so you log in to one of the web servers and tail the access log to see which requests in particular are generating 403s. But you can’t find a single log entry with a 403 error code! You refresh the app a few times in your browser, and sure enough you get a 403 response. But the log file still shows 200 after 200. What’s going on?
If you were a novice, you might just say “That’s impossible” and throw up your hands. But an experienced sysadmin could imagine any number of plausible scenarios to accommodate this counterfactual:
You logged in to staging instead of production
The 403s are only coming from one of the web servers, and it’s not the one you logged in to
403s are being generated by the load balancer before the requests ever make it to the web servers
What you’re looking at in your browser is actually a 200 response with a body that says “403 Forbidden”
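A couple of these scenarios can be checked mechanically. Here is a rough sketch, assuming the access log puts the status code right after the quoted request line (as in the Common Log Format). The function names and the wording of the explanations are mine, not anything from Klein.

```python
import re
from collections import Counter

# Assumes the status code directly follows the quoted request line,
# e.g. ... "GET / HTTP/1.1" 200 512
STATUS = re.compile(r'" (\d{3}) ')


def count_statuses(log_lines):
    """Tally HTTP status codes found in access log lines."""
    counts = Counter()
    for line in log_lines:
        match = STATUS.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def is_soft_403(status, body):
    """True for a 200 response whose body merely says '403 Forbidden'."""
    return status == 200 and "403 Forbidden" in body


def surviving_explanations(log_counts, browser_status, browser_body):
    """List the scenarios that the evidence has not ruled out."""
    candidates = []
    if is_soft_403(browser_status, browser_body):
        candidates.append("200 response whose body says 403 Forbidden")
    if browser_status == 403 and log_counts[403] == 0:
        # A real 403 that this log never saw: the request went elsewhere.
        candidates += [
            "logged in to the wrong environment",
            "403s come from a different web server",
            "403s generated upstream, e.g. by the load balancer",
        ]
    return candidates
```

The code doesn’t pick a winner. It just narrows the list, which is roughly what the expert does in their head before deciding where to look next.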
Why are experts able to adjust so fluidly to counterfactuals while novices aren’t?
It comes back to abstractions. When experts see something that doesn’t match expectations, they can easily recognize which abstraction is leaking. They understand what’s going on inside the piano, so when they expect a tink but hear a plunk, they can seamlessly jump to a lower level of abstraction and generate a new mental simulation that explains the discrepancy.
Empathizing with novices
By understanding a little about the relationship between abstractions and expertise, we can teach ourselves to see problems from a novice’s perspective. Rather than getting frustrated and taking over, we can try some different strategies:
Tell stories. When Gary Klein and his research team want to understand an expert’s thought process, they don’t use questionnaires or ask the expert to make a flow chart or anything artificial like that. The most effective way to get inside an expert’s thought process is to listen to their stories. So when you’re teaching a novice how to reason about a system, try thinking of an interesting and surprising troubleshooting experience you’ve had with that system before, and tell that story.
Use the Socratic method. Novices need practice at juggling abstractions and digesting counterfactuals. When a novice is describing their mental model of a problem or a potential path forward, ask a hypothetical question or two and watch the gears turn. Questions like “You saw Q happen because of P, but what are some ways we could’ve gotten to Q without P?” or “You expect that changing A will have an effect on B, but what would it mean if you changed A and there was no effect on B?” will challenge the novice to bounce between different layers of abstraction like an expert does.
Remember: your boss may be a novice. Take a moment to look around your org chart and find the nearest novice; it may be above you. Even if your boss used to do your job, they’re a manager now. They may be rusty at dealing with the abstractions you use every day. When your boss is asking for a situation report or an explanation for some decision you made, keep in mind the power of narratives and counterfactuals.
It’s a great idea to track your MTTR (Mean Time To Recover) as an operational metric. MTTR is defined as the average interval between onset of a failure and recovery from that failure. We acknowledge that failures are part of the game, so we want our organization to be good at responding quickly to them. It’s intuitive that we’d want our MTTR to trend down.
This is one of those places where our intuition can be misleading.
MTTR is the average of incident duration over incidents. That means that the total amount of downtime gets denominatored out. Consider these two brothers who run different websites:
Achenar’s site only had 1 outage in September, and it lasted 60 minutes.
Sirius’s site had 120 outages in September, lasting 20 minutes each.
Sirius had 40 times as much downtime as Achenar in the month of September. Sirius’s MTTR, however, was 1/3 that of Achenar: 20 minutes rather than 60 minutes.
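If you’d like to check the arithmetic, here’s a quick sketch using the numbers above. The `summarize` helper is just a name I picked.

```python
from statistics import mean

# Outage durations in minutes for September, from the example above.
achenar = [60]
sirius = [20] * 120


def summarize(outages):
    """Return MTTR, total downtime, and incident count for a list of outages."""
    return {
        "mttr": mean(outages),
        "downtime": sum(outages),
        "count": len(outages),
    }


a, s = summarize(achenar), summarize(sirius)
print(f"Achenar: MTTR {a['mttr']} min, downtime {a['downtime']} min")
print(f"Sirius:  MTTR {s['mttr']} min, downtime {s['downtime']} min")
print(f"Downtime ratio: {s['downtime'] / a['downtime']:.0f}x")
```

Sirius comes out at 2,400 minutes of downtime against Achenar’s 60, even though his MTTR is the lower of the two.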
Lowering your MTTR is a good strategy in certain situations. But you need to make sure it’s the right strategy. If you don’t look at the whole picture, things like nuisance alarms and insufficient automation can confound the meaning of your MTTR. If you fix a whole bunch of meaningless alerts that always recover quickly without intervention (you know the type), your MTTR goes up!
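Here’s a small sketch of that effect. The durations are made up for illustration: three real incidents, plus twenty nuisance alerts that each clear themselves in two minutes.

```python
from statistics import mean

# Hypothetical incident durations in minutes (not real data).
real_incidents = [45, 60, 90]
nuisance_alerts = [2] * 20

before = real_incidents + nuisance_alerts  # everything that fires today
after = real_incidents                     # nuisance alerts fixed for good

mttr_before, mttr_after = mean(before), mean(after)
downtime_before, downtime_after = sum(before), sum(after)

print(f"MTTR:     {mttr_before:.1f} -> {mttr_after:.1f} min")
print(f"Downtime: {downtime_before} -> {downtime_after} min")
```

Total incident time drops and the noise is gone, but the average jumps, because the short alerts that were dragging it down are no longer in the denominator.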
MTTR is useful to track, and it can be useful for decision-making. Just remember: our goal is to minimize downtime and noise, not MTTR. If the path of least resistance to lower downtime and a stronger signal is to respond to incidents quicker, then MTTR is your best friend. But that’s not always true.
Inspired by Steve Bennett’s talk at Velocity 2012 (slides here. I swear it’s a great talk; I didn’t just think he was smart because he’s British), I’ve been trying lately to apply medicine’s differential diagnosis approach to my ops problem solving.
If you’ve ever seen an episode of “House M.D.,” you’ll recognize the approach right away.
Since my girlfriend (partner/common-law fiancée/non-Platonic ladyperson/whatever) is a veterinary student, I end up hearing a lot about medical reasoning. One of her classes in first year was “Problem-Based Learning,” or as I called it, “House D.V.M.”. The format of this class should sound familiar to anyone who’s worked in ops, or dev, or the middle bit of any Venn diagram thereof.
You walk in on Monday and grab a worksheet. This worksheet describes the symptoms of some cat or pug or gila monster or headcrab that was recently treated in the hospital. Your homework: figure out what might be wrong with the animal, and recommend a course of treatment and testing.
On Tuesday, you’re given worksheet number 2. It says what a real vet did, given Monday’s info, and then it lists the results of the tests that the vet ordered. So the process starts over: your homework is to infer from the test results what could be wrong with the animal, and then figure out what tests or treatments to administer next.
This process repeats until Friday, by which point you’ve hopefully figured out what the hell.
When I heard this, I thought it was all very cool. But I didn’t pick up on the parallels with my own work, which are staggering. And what really should have caught my attention, in retrospect, is that this was a course they were taking. They’re teaching a deductive process!
Can We Formalize It? Yes We Can!
In tech, our egos often impede learning. We’re smart and we’ve built a unique, intricate system that nobody else understands as well as we do. “Procedures” and “methodologies” disgust us: it’s just so enterprisey to imagine that any one framework could be applied to the novel, cutting-edge complexities we’re grokking with our enormous hacker brains.
Give it a rest. Humans have been teaching each other how to troubleshoot esoteric problems in complex systems for friggin millennia. That’s what medicine is.
When faced with a challenging issue to troubleshoot, doctors will turn to a deductive process called “differential diagnosis.” I’m not going to describe it in that much detail; if you want more, then tell Steve Bennett to write a book. Or watch a few episodes of House. But basically the process goes like this:
Write down what you know: the symptoms.
Brainstorm possible causes (“differentials”) for these symptoms.
Figure out a test that will rule out (“falsify”) some of the differentials, and perform the test.
If you end up falsifying all your differentials, then clearly you didn’t brainstorm hard enough. Revisit your assumptions and come up with more ideas.
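The steps above can be sketched as a tiny loop. This is only an illustration: each differential is paired with a test that returns True when the evidence falsifies it, and both the names and the tests are hypothetical stand-ins for whatever checks you’d really run.

```python
from typing import Callable, Dict, List

# Maps each differential to a test that returns True when the evidence
# falsifies it. Names and tests are hypothetical.
Differentials = Dict[str, Callable[[], bool]]


def surviving(differentials: Differentials) -> List[str]:
    """Run every test and keep the differentials that were not falsified."""
    return [name for name, falsified in differentials.items() if not falsified()]


def next_step(differentials: Differentials) -> str:
    """Say what the process calls for after a round of testing."""
    remaining = surviving(differentials)
    if not remaining:
        return "All differentials falsified: revisit assumptions and brainstorm more"
    if len(remaining) == 1:
        return f"Leading differential: {remaining[0]}"
    return "Still in play: " + ", ".join(remaining)


differentials = {
    "disk full": lambda: True,    # test ruled this one out
    "bad deploy": lambda: False,  # test could not rule this one out
}
print(next_step(differentials))
```

In real life you’d run one cheap test at a time rather than all of them at once, but the bookkeeping is the same: falsify, cross off, and go back to brainstorming when the list runs dry.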
This simple process keeps you moving forward without getting lost in your own creativity.
Mnemonics As Brainstorming Aids
The brainstorming step of this deductive process (“writing down your differentials”) is critical. Write down whatever leaps to mind.
Doctors have mnemonic devices to help cover all the bases here. One of the most popular is VINDICATE (Vascular/Inflammatory/Neoplastic/Degenerative/Idiopathic/Congenital/Autoimmune/Traumatic/Endocrine). They go through this list and ask “Could it be something in this category?” The list covers all the systems in the body, so if the doctor seriously considers each of the letters, they’ll usually come up with the right differential (although they may not know it yet).
Vets have a slightly different go-to mnemonic when listing differentials: DAMNIT. There are several different meanings for each letter, but the gist of it is Degenerative, Anomalous, Metabolic, Nutritional, Inflammatory, Traumatic. Besides being a mild oath (my second-favorite kind of oath), this device has the advantage of putting more focus on the trouble’s mode of operation, rather than its location.
These mnemonics are super useful to doctors, and it’s not that hard to come up with your own version. Bennett suggests CASHWOUND (see his slides to find out why).
No Seriously, Try It. It’s Great.
The other day, we were looking at our contribution dashboard and we noticed this (artist’s rendering):
That dip in donations lasted about 10 minutes, and we found it extremely disturbing. So we piled into a conference room with a clean whiteboard, and we started writing down differentials.
A. Firewall glitch between card processors and Internet
B. Database failure causing donation pages not to load
C. Failures from the third-party payment gateway
D. Long response times from the payment gateway
E. Errors in our payment-processing application
F. DNS lookup failures for the payment gateway
Admittedly this is not a very long list, and we could’ve brainstormed better. But anyway, we started trying to pick apart the hypotheses.
We began with a prognostic approach. That means we judged hypothesis (B) to be the most terrifying, so we investigated it first. We checked out the web access logs and found that donation pages had been loading just fine for our users. Phew.
The next hypotheses to test were (C) and (D). Here we had switched to a probabilistic approach: we’d seen this payment gateway fail before, so why shouldn’t it happen again? To test these hypotheses, we checked two sources: our own application’s logs (which would report gateway failures), and Twitter search. Neither turned up anything promising. So now we had these differentials (including a new one devised by my boss, who had wandered in):
A. Firewall glitch between card processors and Internet
B. Database failure causing donation pages not to load
C. Failures from the third-party payment gateway
D. Long response times from the payment gateway
E. Errors in our payment-processing application
F. DNS lookup failures for the payment gateway
G. Users were redirected to a different site
(E) is pretty severe (if not particularly likely, since we hadn’t deployed the payment-processing code recently), so we investigated that next. No joy — the application’s logs were clean. Next up was (A), but it proved false as well, since we found no errors or abnormal behavior in the firewall logs.
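The two ways of picking what to test next, prognostic (scariest first) and probabilistic (likeliest first), amount to sorting the same list by different keys. Here’s a sketch of that. The severity and likelihood numbers are invented after the fact to match the order we actually followed; we never scored anything on the whiteboard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Differential:
    label: str
    description: str
    severity: int      # hypothetical 1-5 score: how bad if true
    likelihood: float  # hypothetical 0-1 guess: how probable


def prognostic_order(diffs: List[Differential]) -> List[Differential]:
    """Most terrifying first."""
    return sorted(diffs, key=lambda d: d.severity, reverse=True)


def probabilistic_order(diffs: List[Differential]) -> List[Differential]:
    """Most likely first."""
    return sorted(diffs, key=lambda d: d.likelihood, reverse=True)


# Scores below are illustrative guesses, not measurements.
board = [
    Differential("A", "Firewall glitch", 3, 0.10),
    Differential("B", "Database failure", 5, 0.10),
    Differential("C", "Payment gateway failures", 3, 0.40),
    Differential("D", "Payment gateway slow", 2, 0.30),
    Differential("E", "Errors in our payment app", 4, 0.05),
    Differential("F", "DNS lookup failures", 2, 0.10),
]

print([d.label for d in prognostic_order(board)][:2])
print([d.label for d in probabilistic_order(board)][:2])
```

Neither ordering is the right one. The useful part is saying out loud which one you’re using, so everyone at the whiteboard knows why a given hypothesis is next.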
Throughout this process, I found that the differential diagnosis technique helped focus the team. Nobody stepped on each other’s toes, we were constantly making progress, and nobody had the feeling of groping in the dark that one can get when one troubleshoots without a method.