Falsifiability: why you rule things out, not in

This June, I had the honor of speaking at O’Reilly Velocity 2016 in Santa Clara. My topic was Troubleshooting Without Losing Common Ground, which I’ve written about and written about before that too.

I was pretty happy with my talk, especially the Star Trek: The Next Generation vignette in the middle. It was a lot of ideas to pack into a single talk, but I think a lot of people got the point. However, I did give a really unsatisfactory answer (30m46s) to the first question I received. The question was:

In the differential diagnosis steps, you listed performing tests to falsify assumptions. Are you borrowing that from medicine? In tech are we only trying to falsify assumptions, or are we sometimes trying to validate them?

I didn’t have a real answer at the time, so I spouted some bullshit and moved on. But it’s a good question, and I’ve thought more about it, and I’ve come up with two (related) answers: a common-sense answer and a pretentious philosophical answer.

The Common Sense Answer

My favorite thing about differential diagnosis is that it keeps the problem-solving effort moving. There’s always something to do. If you’re out of hypotheses, you come up with new ones. If you finish a test, you update the symptoms list. It may not always be easy to make progress, but you always have a direction to go, and everybody stays on the same page.

But when you seek to confirm your hypotheses, rather than to falsify others, it’s easy to fall victim to tunnel vision. That’s when you fixate on a single idea about what could be wrong with the system. That single idea is all you can see, as if you’re looking at it through a tunnel whose walls block everything else from view.

Tunnel vision takes that benefit of differential diagnosis – the constant presence of a path forward – and negates it. You keep running tests to try to confirm your hypothesis, but you may never prove it. You may just keep getting tests results that are consistent with what you believe, but that are also consistent with an infinite number of hypotheses you haven’t thought of.

A focus on falsification instead of verification can be seen as a guard against tunnel vision. You can’t get stuck on a single hypothesis if you’re constrained to falsify other ones. The more alternate hypotheses you manage to falsify, the more confident you get that you should be treating for the hypotheses that might still be right.

Now, of course, there are times when it’s possible to verify your hunch. If you have a highly specific test for a problem, then by all means try it. But in general it’s helpful to focus on knocking down hypotheses rather than propping them up.

The Pretentious Philosophical Answer

I just finished Karl Popper’s ridiculously influential book The Logic of Scientific Discovery. If you can stomach a dense philosophical tract, I would highly recommend it.

karl-popper-4.jpg
Karl “Choke Right On It, Logical Positivism” Popper

Published in 1959 – but based on Popper’s earlier book Logik der Forschung from 1934 – The Logic Of Scientific Discovery makes a then-controversial [now widely accepted (but not universally accepted, because philosophers make cats look like sheep, herdability-wise)] claim. I’ll paraphrase the claim like so:

Science does not produce knowledge by generalizing from individual experiences to theories. Rather, science is founded on the establishment of theories that prohibit classes of events, such that the reproducible occurrence of such events may falsify the theory.

Popper was primarily arguing against a school of thought called logical positivism, whose subscribers assert that a statement is meaningful if and only if it is empirically testable. But what matters to our understanding of differential diagnosis isn’t so much Popper’s absolutely brutal takedown of logical positivism (and damn is it brutal), as it is his arguments in favor of falsifiability as the central criterion of science.

I find one particular argument enlightening on the topic of falsification in differential diagnosis. It hinges on the concept of self-contradictory statements.

There’s an important logical precept named – a little hyperbolically – the Principle of Explosion. It asserts that any statement that contradicts itself (for example, “my eyes are brown and my eyes are not brown”) implies all possible statements. In other words: if you assume that a statement and its negation are both true, then you can deduce any other statement you like. Here’s how:

  1. Assume that the following two statements are true:
    1. “All cats are assholes”
    2. “There exists at least one cat that is not an asshole”
  2. Therefore the statement “Either all cats are assholes, or 9/11 was an inside job” (we’ll call this Statement A) is true, since the part about the asshole cats is true.
  3. However, if the statement “there exists at least one cat that is not an asshole” is true too (which we’ve assumed it is) and 9/11 were not an inside job, then Statement A would be false, since neither of its two parts would be true.
  4. So the only way left for Statement A to be true is for “9/11 was an inside job” to be a true statement. Therefore, 9/11 was an inside job.
  5. Wake up, sheeple.

plhpico

The Principle of Explosion is the crux of one of Popper’s most convincing arguments against the Principle of Induction as the basis for scientific knowledge.

It was assumed by many philosophers of science before Popper that science relied on some undefined Principle of Induction which allowed one to generalize from a finite list of experiences to a general rule about the universe. For example, the Principle of Induction would allow one to deduce from enough statements like “I dropped a ball and it fell” and “My friend dropped a wrench and it fell” to “When things are dropped, they fall.” But Popper argued against the existence of the Principle of Induction. In particular, he pointed out that:

If there were some way to prove a general rule by demonstrating the truth of a finite number of examples of its consequences, then we would be able to deduce anything from such a set of true statements.

Right? By the Principle of Explosion, a self-contradictory statement implies the truth of all statements. If we accepted the Principle of Induction, then the same evidence that proves “When things are dropped, they fall” would also prove “All cats are assholes and there exists at least one cat that is not an asshole,” which would prove every statement we can imagine.

So what does this have to do with falsification in differential diagnosis? Well, imagine you’ve come up with these hypotheses to explain some API slowness you’re troubleshooting:

Hypothesis Alpha: contention on the table cache is too high, so extra latency is introduced for each new table opened

Hypothesis Bravo: we’re hitting our IOPS limit on the EBS volume attached to the database server

There are many test results that would be compatible with Hypothesis Alpha. But unless you craft your tests very carefully, those same results will also be compatible with Hypothesis Bravo. Without a highly specific test for table cache contention, you can’t prove Hypothesis Alpha through a series of observations that agree with it.

What you can do, however, is try to quickly falsify Hypothesis Bravo by checking some graphs against some AWS configuration data. And if you do that, then Hypothesis Alpha is the your best remaining guess. Now you can start treating for table cache contention on the one hand, and attempting the more time-consuming process (especially if it’s correct!) of falsifying Hypothesis Alpha.

Isn’t this kind of abstract?

Haha OMG yes. It’s the most abstract. But that doesn’t mean it’s not a useful idea.

If it’s your job to troubleshoot problems, you know that tunnel vision is very real. If you focus on generating alternate hypotheses and falsifying them, you can resist tunnel vision’s allure.

Troubleshooting On A Distributed Team Without Losing Common Ground

I work on a team that fixes complex systems under time pressure. My teammates have different skill sets, different priorities, and different levels of expertise. But we all have to troubleshoot and solve problems together.

This is really hard to do effectively. Fortunately for us in the relatively new domain of DevOps, situations like ours have been studied extensively in the last couple decades. We can use the results of this research to inform our own processes and automation for troubleshooting.

One of the most important concepts to emerge from recent teamwork research, common ground, helps us understand why collaborative troubleshooting breaks down over time. This breakdown leads to wasted effort and mistakes, even if the team maintains constant communication in a chat room. But if we extend ChatOps by drawing on some ideas from medical diagnosis, we can make troubleshooting way easier without losing the benefits of fluid team conversation.

Common Ground

Ergonomics researchers D.D. Woods and Gary Klein (the latter of whom I wrote about in What makes an expert an expert?) published a phenomenally insightful paper in 2004 called Common Ground and Coordination in Joint Activity. In it, they describe a particular kind of failure that occurs when people engage in joint cognition: the Fundamental Common Ground Breakdown. Once you learn about the Fundamental Common Ground Breakdown, you see it everywhere. Here’s how the Woods/Klein paper describes the FCGB:

  • Party A believes that Party B possesses some knowledge
  • Party B doesn’t have this knowledge, and doesn’t know he is supposed to have it.
  • Therefore, he or she doesn’t request it.
  • This lack of a request confirms to Party A that Party B has the knowledge.

When this happens, Party A and Party B lose common ground, which Woods & Klein define as “pertinent knowledge, beliefs and assumptions that are shared among the involved parties.” The two parties start making incorrect assumptions about each other’s knowledge and beliefs, which causes their common ground to break down further and further. Eventually they reach a coordination surprise, which forces them to re-synchronize their understanding of the coordinated activity:

csel_eng_ohio-state_edu_woods_distributed_CG_final_pdf

Seriously, the FCGB is everywhere. Check out the paper.

I’m especially interested in one particular area where an understanding of common ground can help us do better teamwork: joint troubleshooting.

Common Ground Breakdown in Chatroom Troubleshooting

Everybody’s into ChatOps these days, and I totally get it. When a critical system is broken, it’s super useful to get everybody in the same room and hash it out. ChatOps allows everybody to track progress, coordinate activities, and share results. And it also helps to have lots of different roles represented in the room:

  • Operations folks, to provide insight into the differences between the system’s normal behavior and its current state
  • Software engineers, who bring detailed knowledge of the ways subsystems are supposed to work
  • Account managers and product managers and support reps: not just for their ability to translate technical jargon into the customer’s language for status reporting, but also because their understanding of customer needs can help establish the right priorities
  • Q.A. engineers, who can rule out certain paths of investigation early with their intuition for the ways in which subsystems tend to fail

The process of communicating across role boundaries isn’t just overhead: it helps us refine our own understanding, look for extra evidence, and empathize with each other’s perspectives.

But ChatOps still offers a lot of opportunities for common ground breakdown. The FCGB can occur whenever different people interpret the same facts in different ways. Interpretations can differ for many different reasons:

  • Some people have less technical fluency in the system than others. A statement like “OOM killer just killed Cassandra on db014” might change an ops engineer’s whole understanding of the problem, but such a shift could fly under the radar of, say, a support engineer.
  • Some people are multitasking. They may have a stake in the troubleshooting effort but be unable to internalize every detail from the chat room in real time.
  • Some people are co-located. They find it easier to discuss the problem using mouth words or by physically showing each other graphs, thereby adjusting their own shared understanding without transmitting these adjustments to the rest of the team.
  • Some people enter the conversation late, or leave for a while and come back. These people will miss common ground changes that happen during their absence.

These FCGB opportunities all become more pronounced as the troubleshooting drags on and folks become tired, bored, and confused. And when somebody says they’ve lost track of common ground, what do we do? Two main things: we provide a summary of recent events and let the person ask questions until they feel comfortable; or we tell them to read the backlog.

The Q&A approach has serious drawbacks. First of all, it requires somebody knowledgeable to stop what they’re doing and summarize the situation. If people are frequently leaving and entering the chat room, you end up with a big distraction. Second of all, it leaves lots of room for important information to get missed. The Fundamental Common Ground Breakdown happens when somebody doesn’t know what to ask, so fixing it with a Q&A session is kind of silly.

The other way people catch up with the troubleshooting effort is by reading the backlog. This is even more inefficient than Q&A. Here’s the kind of stuff you have to dig through when you’re reading a chat backlog:

tng-hipchat

There’s a lot to unpack there – and that’s just 18 messages! Imagine piecing together a troubleshooting effort that’s gone on for hours, or days. It would take forever, and you’d still make a lot of mistakes. It’s just not a good way to preserve common ground.

So what do we need?

Differential Diagnosis as an Engine of Common Ground

I’ve blogged before about how much I love differential diagnosis. It’s a formalism that doctors use to keep the diagnostic process moving in the right direction. I’ve used it many times in ops since I learned about it. It’s incredibly useful.

In differential diagnosis, you get together with your team in front of a whiteboard – making sure to bring together people from a wide variety of roles – and you go through a cycle of 3 steps:

  1. Identify symptoms. Write down all the anomalies you’ve seen. Don’t try to connect the dots just yet; just write down your observations.
  2. Generate hypotheses. Brainstorm explanations for the symptoms you’ve observed. This is where it really helps to have a good cross-section of roles represented. The more diverse the ideas you write down, the better.
  3. Test hypotheses. Now that you have a list of things that might be causing the problem, you start narrowing down that list by coming up with a test that will prove or disprove a certain hypothesis.

Once you’re done with step #3, you can cross out a hypothesis or two. Then you head back to step #1 and repeat the cycle until the problem is identified.

A big part of the power of differential diagnosis is that it’s written down. Anybody can walk into the room, read the whiteboard, and understand the state of the collaborative effort. It cuts down on redundant Q&A, because the most salient information is summarized on the board. It eliminates inefficient chat log reading – the chat log is still there, but you use it to search for specific pieces of information instead of reading it like a novel. But, most importantly, differential diagnosis cuts down on fundamental common ground breakdowns, because everybody has agreed to accept what’s on the whiteboard as the canonical state of troubleshooting.

Integrating Differential Diagnosis with ChatOps

We don’t want to lose the off-the-cuff, conversational nature of ChatOps. But we need a structured source of truth to provide a point-in-time understanding of the effort. And we (read: I) don’t want to write a whole damn software project to make that happen.

My proposal is this: use Trello for differential diagnosis, and integrate it with the chat through a Hubot plugin. I haven’t written this plugin yet, but it shouldn’t take long (I’ll probably fork hubot-trello and start from there). That way people could update the list of symptoms, hypotheses, and tests on the fly, and they’d always have a central source of common ground to refer to.

In the system I envision, the chat room conversation would be peppered with statements like:

Geordi: hubot symptom warp engine going full speed, but ship not moving

Hubot: Created (symp0): warp engine going full speed, but ship not moving

Beverly: hubot falsify hypo1

Hubot: Falsified (hypo1): feedback loop between graviton emitter and graviton roaster

Geordi: hubot finish test1

Hubot: Marked (test1) finished: reboot the quantum phase allometer

And the resulting differential diagnosis board, containing the agreed-upon state of the troubleshooting effort, might look like this example, with cards labeled to indicate that they’re no longer in play.

What do you think?

Let me know if your organization already has something like this, or has tried a formal differential diagnosis approach before. I’d love to read some observations about your team’s process in the comments. Also, VictorOps has a pretty neat suite of tools that approaches what I have in mind, but I still think a more conceptually structured (not to mention free) solution could be very useful.

Automation is most effective when it’s a team player. By using automation to preserve common ground, we can solve problems faster and more thoroughly, with less frustration and less waste. And that all sounds pretty good to me.

You Know Who’s Smart? Friggin’ Doctors, Man.

Inspired by Steve Bennett‘s talk at Velocity 2012 (slides here. I swear it’s a great talk; I didn’t just think he was smart because he’s British), I’ve been trying lately to apply medicine’s differential diagnosis approach to my ops problem solving.

If you’ve ever seen an episode of “House M.D,” you’ll recognize the approach right away.

Problem-Based Learning

Since my girlfriend (partner/common-law fiancée/non-Platonic ladyperson/whatever) is a veterinary student, I end up hearing a lot about medical reasoning. One of her classes in first year was “Problem-Based Learning,” or as I called it, “House D.V.M.”. The format of this class should sound familiar to anyone who’s worked in ops, or dev, or the middle bit of any Venn diagram thereof.

You walk in on Monday and grab a worksheet. This worksheet describes the symptoms of some cat or pug or gila monster or headcrab that was recently treated in the hospital. Your homework: figure out what might be wrong with the animal, and recommend a course of treatment and testing.

On Tuesday, you’re given worksheet number 2. It says what a real vet did, given Monday’s info, and then it lists the results of the tests that the vet ordered. So the process starts over: your homework is to infer from the test results what could be wrong with the animal, and then figure out what tests or treatments to administer next.

This process repeats until  Friday, by which point you’ve hopefully figured out what the hell.

When I heard this, I thought it was all very cool. But I didn’t pick up on the parallels with my own work, which are staggering. And what really should have caught my attention, in retrospect, is that this was a course they were taking. They’re teaching a deductive process!

Can We Formalize It? Yes We Can!

In tech, our egos often impede learning. We’re smart and we’ve built a unique, intricate system that nobody else understands as well as we do. “Procedures” and “methodologies” disgust us: it’s just so enterprisey to imagine that any one framework could be applied to the novel, cutting-edge complexities we’re grokking with our enormous hacker brains.

Give it a rest. Humans have been teaching each other how to troubleshoot esoteric problems in complex systems for friggin millennia. That’s what medicine is.

When faced with a challenging issue to troubleshoot, doctors will turn to a deductive process called “differential diagnosis.” I’m not going to describe it in that much detail; if you want more, then tell Steve Bennett to write a book. Or watch a few episodes of House. But basically the process goes like this:

  • Write down what you know: the symptoms.
  • Brainstorm possible causes (“differentials”) for these symptoms.
  • Figure out a test that will rule out (“falsify”) some of the differentials, and perform the test.
  • If you end up falsifying all your differentials, then clearly you didn’t brainstorm hard enough. Revisit your assumptions and come up with more ideas.

This simple process keeps you moving forward without getting lost in your own creativity.

Mnemonics As Brainstorming Aids

The brainstorming step of this deductive process (“writing down your differentials”) is critical. Write down whatever leaps to mind.

Doctors have mnemonic devices to help cover all the bases here. One of the most popular is VINDICATE (Vascular/Inflammatory/Neoplastic/Degenerative/Idiopathic/Congenital/Autoimmune/ Traumatic/Endocrine). They go through this list and ask “Could it be something in this category?” The list covers all the systems in the body, so if the doctor seriously considers each of the letters, they’ll usually come up with the right differential (although they may not know it yet).

Vets have a slightly different go-to mnemonic when listing differentials: DAMNIT. There are several different meanings for each letter, but the gist of it is Degenerative, Anomalous, Metabolic, Nutritional, Inflammatory, Traumatic. Besides being a mild oath (my second-favorite kind of oath), this device has the advantage of putting more focus on the trouble’s mode of operation, rather than its location.

These mnemonics are super useful to doctors, and it’s not that hard to come up with your own version. Bennett suggests CASHWOUND (see his slides to find out why).

No Seriously, Try It. It’s Great.

The other day, we were looking at our contribution dashboard and we noticed this (artist’s rendering):

Brief dip in donations
Brief dip in donations

 

That dip in donations lasted about 10 minutes, and we found it extremely disturbing. So we piled into a conference room with a clean whiteboard, and we started writing down differentials.

A. Firewall glitch between card processors and Internet

B. Database failure causing donation pages not to load

C. Failures from the third-party payment gateway

D. Long response times from the payment gateway

E. Errors in our payment-processing application

F. DNS lookup failures for the payment gateway

Admittedly this is not a very long list, and we could’ve brainstormed better. But anyway, we started trying to pick apart the hypotheses.

We began with a prognostic approach. That means we judged hypothesis (B) to be the most terrifying, so we investigated it first. We checked out the web access logs and found that donation pages had been loading just fine for our users. Phew.

The next hypotheses to test were (C) and (D). Here we had switched to a probabilistic approach — we’d seen this payment gateway fail before, so why shouldn’t it happen again? To test this hypothesis, we checked two sources: our own application’s logs (which would report gateway failures), and Twitter search. Neither turned up anything promising. So now we had these differentials (including a new one devised by my boss, who had wandered in):

A. Firewall glitch between card processors and Internet

B. Database failure causing donation pages not to load

C. Failures from the third-party payment gateway

D. Long response times from the payment gateway

E. Errors in our payment-processing application

F. DNS lookup failures for the payment gateway

G. Users were redirected to a different site

(E) is pretty severe (if not particularly likely, since we hadn’t deployed the payment-processing code recently), so we investigated that next. No joy — the application’s logs were clean. Next up was (A), but it proved false as well, since we found no errors or abnormal behavior in the firewall logs.

So all we had left was (F) and (G). Finally we were able to determine that a client was A/B testing the donation page by randomly redirecting half of the traffic with Javascript. So everything was fine.

Throughout this process, I found that the differential diagnosis technique helped focus the team. Nobody stepped on each other’s toes, we were constantly making progress, and nobody had the feeling of groping in the dark that one can get when one troubleshoots without a method.

Try it out some time!