You Know Who’s Smart? Friggin’ Doctors, Man.

Inspired by Steve Bennett‘s talk at Velocity 2012 (slides here. I swear it’s a great talk; I didn’t just think he was smart because he’s British), I’ve been trying lately to apply medicine’s differential diagnosis approach to my ops problem solving.

If you’ve ever seen an episode of “House M.D,” you’ll recognize the approach right away.

Problem-Based Learning

Since my girlfriend (partner/common-law fiancée/non-Platonic ladyperson/whatever) is a veterinary student, I end up hearing a lot about medical reasoning. One of her classes in first year was “Problem-Based Learning,” or as I called it, “House D.V.M.”. The format of this class should sound familiar to anyone who’s worked in ops, or dev, or the middle bit of any Venn diagram thereof.

You walk in on Monday and grab a worksheet. This worksheet describes the symptoms of some cat or pug or gila monster or headcrab that was recently treated in the hospital. Your homework: figure out what might be wrong with the animal, and recommend a course of treatment and testing.

On Tuesday, you’re given worksheet number 2. It says what a real vet did, given Monday’s info, and then it lists the results of the tests that the vet ordered. So the process starts over: your homework is to infer from the test results what could be wrong with the animal, and then figure out what tests or treatments to administer next.

This process repeats until  Friday, by which point you’ve hopefully figured out what the hell.

When I heard this, I thought it was all very cool. But I didn’t pick up on the parallels with my own work, which are staggering. And what really should have caught my attention, in retrospect, is that this was a course they were taking. They’re teaching a deductive process!

Can We Formalize It? Yes We Can!

In tech, our egos often impede learning. We’re smart and we’ve built a unique, intricate system that nobody else understands as well as we do. “Procedures” and “methodologies” disgust us: it’s just so enterprisey to imagine that any one framework could be applied to the novel, cutting-edge complexities we’re grokking with our enormous hacker brains.

Give it a rest. Humans have been teaching each other how to troubleshoot esoteric problems in complex systems for friggin millennia. That’s what medicine is.

When faced with a challenging issue to troubleshoot, doctors will turn to a deductive process called “differential diagnosis.” I’m not going to describe it in that much detail; if you want more, then tell Steve Bennett to write a book. Or watch a few episodes of House. But basically the process goes like this:

  • Write down what you know: the symptoms.
  • Brainstorm possible causes (“differentials”) for these symptoms.
  • Figure out a test that will rule out (“falsify”) some of the differentials, and perform the test.
  • If you end up falsifying all your differentials, then clearly you didn’t brainstorm hard enough. Revisit your assumptions and come up with more ideas.

This simple process keeps you moving forward without getting lost in your own creativity.

Mnemonics As Brainstorming Aids

The brainstorming step of this deductive process (“writing down your differentials”) is critical. Write down whatever leaps to mind.

Doctors have mnemonic devices to help cover all the bases here. One of the most popular is VINDICATE (Vascular/Inflammatory/Neoplastic/Degenerative/Idiopathic/Congenital/Autoimmune/ Traumatic/Endocrine). They go through this list and ask “Could it be something in this category?” The list covers all the systems in the body, so if the doctor seriously considers each of the letters, they’ll usually come up with the right differential (although they may not know it yet).

Vets have a slightly different go-to mnemonic when listing differentials: DAMNIT. There are several different meanings for each letter, but the gist of it is Degenerative, Anomalous, Metabolic, Nutritional, Inflammatory, Traumatic. Besides being a mild oath (my second-favorite kind of oath), this device has the advantage of putting more focus on the trouble’s mode of operation, rather than its location.

These mnemonics are super useful to doctors, and it’s not that hard to come up with your own version. Bennett suggests CASHWOUND (see his slides to find out why).

No Seriously, Try It. It’s Great.

The other day, we were looking at our contribution dashboard and we noticed this (artist’s rendering):

Brief dip in donations
Brief dip in donations


That dip in donations lasted about 10 minutes, and we found it extremely disturbing. So we piled into a conference room with a clean whiteboard, and we started writing down differentials.

A. Firewall glitch between card processors and Internet

B. Database failure causing donation pages not to load

C. Failures from the third-party payment gateway

D. Long response times from the payment gateway

E. Errors in our payment-processing application

F. DNS lookup failures for the payment gateway

Admittedly this is not a very long list, and we could’ve brainstormed better. But anyway, we started trying to pick apart the hypotheses.

We began with a prognostic approach. That means we judged hypothesis (B) to be the most terrifying, so we investigated it first. We checked out the web access logs and found that donation pages had been loading just fine for our users. Phew.

The next hypotheses to test were (C) and (D). Here we had switched to a probabilistic approach — we’d seen this payment gateway fail before, so why shouldn’t it happen again? To test this hypothesis, we checked two sources: our own application’s logs (which would report gateway failures), and Twitter search. Neither turned up anything promising. So now we had these differentials (including a new one devised by my boss, who had wandered in):

A. Firewall glitch between card processors and Internet

B. Database failure causing donation pages not to load

C. Failures from the third-party payment gateway

D. Long response times from the payment gateway

E. Errors in our payment-processing application

F. DNS lookup failures for the payment gateway

G. Users were redirected to a different site

(E) is pretty severe (if not particularly likely, since we hadn’t deployed the payment-processing code recently), so we investigated that next. No joy — the application’s logs were clean. Next up was (A), but it proved false as well, since we found no errors or abnormal behavior in the firewall logs.

So all we had left was (F) and (G). Finally we were able to determine that a client was A/B testing the donation page by randomly redirecting half of the traffic with Javascript. So everything was fine.

Throughout this process, I found that the differential diagnosis technique helped focus the team. Nobody stepped on each other’s toes, we were constantly making progress, and nobody had the feeling of groping in the dark that one can get when one troubleshoots without a method.

Try it out some time!