Quirks are bugs

“Stop Expecting That.”

When you use a program a lot, you start to notice its quirks. If you’re a programmer yourself, you start to develop theories about why the quirks exist, and how you’d fix them if you had the time or the source. If you’re not a programmer, you just shrug and work around the quirks.

I review about 400 virtual flash cards a day in studying for Jeopardy, so I’ve really started to pick up on the quirks of the flash card software I use. One quirk in particular really bothered me: the documentation, along with the first-tier support team, claims that when cards come up for review they will be presented in a random order. But I’ve noticed that, far from being truly random, the program presents cards in bunches of 50: old cards in the first bunch, then newer and newer bunches of cards. By the time I get to my last 50 cards of the day, they’re all less than 2 weeks old.

So I submitted a bug report, complete with scatterplot demonstrating this clear pattern. I explained “I would expect the cards to be shuffled evenly, but that doesn’t appear to be the case.” And do you know what the lead developer of the project told me?

“Stop expecting that.”

Not in so many words, of course, but there you have it. The problem was not in the software; it was in my expectations.

It’s a common reaction among software developers. We think “Look, that’s just the way it works. I understand why it works that way and I can explain it to you. So, you see, it’s not really a bug.” And as frustrating as this attitude is, I can’t say I’m immune to it myself. I’m in ops, so the users of my software are usually highly technical. I can easily make them understand why a weird thing keeps happening, and they can figure out how to work around the quirk. But the “stop expecting that” attitude is wrong, and it hurts everyone’s productivity, and it makes software worse. We have to consciously reject it.

Quirks are bugs.

A bug is when the program doesn’t work the way the programmer expects.

A quirk is when the program doesn’t work the way the user expects.

What’s the difference, really? Especially in the open-source world, where every user is a potential developer, and all your developers are users?

Quirks and bugs can both be worked around, but a workaround requires the user to learn arbitrary procedures which aren’t directly useful, and which aren’t connected in any meaningful way to his mental model of the software.

Quirks and bugs both make software less useful. They make users less productive. Neglected, they necessitate a sort of oral tradition — not dissimilar from superstition — in which users pass the proper set of incantations from generation to generation. Software shouldn’t be like that.

Quirks and bugs both drive users away.

Why should we treat them differently?

Stop “Stop Expecting That”ing

I’ve made some resolutions that I hope will gradually erase the distinction in my mind between quirks and bugs.

When I hear that a user encountered an unexpected condition in my software, I will ask myself how they developed their incorrect expectation. As they’ve used the program, has it guided them toward a flawed understanding? Or have I just exposed some internal detail that should be covered up?

If I find myself explaining to a user how my software works under the hood, I will redirect my spiel toward a search for ways to abstract away those implementation details instead of requiring the user to understand them.

If users are frequently confused about a particular feature, I’ll take a step back and examine the differences between my mental model of the software and the users’ mental model of it. I’ll then adjust one or both in order to bring them into congruence.

Anything that makes me a stronger force multiplier is worth doing.

A data set analyzed by an idiot, full of noise and worry, signifying nothing

Ever since I moved to Minneapolis and started working at Exosite, I’ve been using R almost daily to get to the bottom of ops mysteries. Sometimes it pays off and sometimes it doesn’t, but it’s always interesting.

Here’s an enigma I recently had the pleasure to rub some R on.


One fine day, after a release of Portals (our webapp that lets you work with data from your cloud-enabled devices), I noticed something odd in the server metrics. The 90th-percentile response time had increased by a factor of four!


This would be alarming by itself, but here’s what made it interesting. We have two different ways of reporting the response time of our application: the Apache response time metric from the access logs (shown above), and a metric generated by the PHP application itself, through statsd. A timer starts as soon as the PHP script begins running, and its value is reported to statsd when the script finishes. And the PHP metric, for the same time period, looked like this:


This does not follow the same pattern. Sure, there’s some variability, but they’re both response time: shouldn’t they track each other more closely? Let’s look at the 90th-percentile times for Apache (blue) and PHP (red) on a common scale:


Oh whoops — I should have warned you not to be drinking water when you looked at that graph. Sorry if you ruined your keyboard. Apache’s 90th-percentile response time is way higher than PHP’s. Like way higher. What gives?

Let’s summarize what we know so far:

  • 90th-percentile response times, as reported by Apache, have climbed above 5 seconds, which is much higher than normal.
  • PHP response times have not experienced any such climb, so something Apache-specific is causing this behavior.
  • Median values were not affected for either metric, so this issue is only affecting a particular subset of traffic.

Now it’s time to do some R.

Munging is my middle name

This disagreement between Apache response times and PHP response times is really intriguing, and might hint us toward the origin of the issue, so let’s dig into it.

The first thing to do is pull some access log entries into R. We aggregate and parse our logs with Logstash, so each entry is a JSON blob like this (irrelevant bits snipped for brevity):

    {
        "@fields": {
            "agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.37 Safari/537.36",
            "bytes": 2660,
            "clientip": "XXX.XXX.XXX.XXX",
            "hostheader": "portals.exosite.com",
            "overall_time": 0.097,
            "request": "/manage/devices",
            "response": "200",
            "servtime": 106908
        },
        "@timestamp": "2013-08-22T14:51:32.000Z",
        "@type": "apache_access"
    }

The servtime field is the Apache response time in microseconds, and the overall_time field is the PHP response time in seconds. I’ve got a script that goes and grabs a time-slice of logs from the log server and converts them to CSV format, and the result looks like this:


Which is perfect for slurping into R like so:

> reqdata <- read.csv("/tmp/reqdata.csv", header=TRUE, as.is=TRUE)
> # Parse the timestamps with Hadley's lubridate package
> library(lubridate)
> reqdata$X.timestamp <- parse_date_time(reqdata$X.timestamp, "ymdHMs")
> # Remember, Apache response times are in microseconds, so we want to scale them to seconds
> reqdata$servtime <- reqdata$servtime / 10^6

And now we have a data frame called reqdata with a row for each log entry:

> head(reqdata)
X.timestamp overall_time servtime hostheader
1 2013-09-13 13:00:49 0.083 0.093408 foo.exosite.com
2 2013-09-13 13:00:48 0.173 0.185900 foo.exosite.com
3 2013-09-13 13:00:46 0.094 0.104675 bar.exosite.com
4 2013-09-13 13:00:46 0.122 0.131222 foo.exosite.com
5 2013-09-13 13:00:46 0.132 0.141991 bar.exosite.com
6 2013-09-13 13:00:46 0.096 0.106194 baz.exosite.com

The plot coagulates

Now that we have a nicely formatted data frame (we would also have accepted a pretty flower from your sweetie), we can use ggplot to check it out.

To get a handle on the Apache/PHP response time dichotomy, let’s plot one versus the other:

> p <- ggplot(reqdata, aes(overall_time, servtime))
> p + geom_point() +
    # A blue line showing the 1-to-1 relationship that we'd naïvely expect
    geom_abline(intercept=0, slope=1, color=I('blue')) +
    ggtitle('Apache response time versus PHP execution time for individual requests') +
    ylab('Apache response time') +
    xlab('PHP execution time')

Apache vs PHP time

So we can see here that most of the requests are doing what we expect: getting sent to PHP as soon as they come in, and getting sent back to the client as soon as they’re done. That’s what the blue line indicates: it’s a line of slope 1.

But then we’ve got this big near-vertical chunk of weird. Those are requests where PHP finished quickly (consistently under 200 milliseconds) but Apache took much longer to serve the request (up to 35 seconds). What is Apache doing for so long? Why can’t it just serve content like it’s supposed to?

We can get a big clue if we color the points by Host header (the domain requested by the browser):

p + geom_point(aes(color=hostheader))

Apache vs PHP time

Now we’re getting somewhere! Look at all that green. That means that this issue is limited to a particular domain: baz.exosite.com.

The home stretch

Okay, so now we know:

  • Requests for baz.exosite.com are sitting around in Apache for up to 35 seconds before being delivered
  • Our PHP executes very quickly for this domain — it’s only Apache that’s sucking up the time

Given that this traffic is limited to a particular domain, I’m getting the suspicion that there are other homogeneous things about it. And we can confirm this (pretend I’d pulled in the user-agent and client IP address from the logs with my initial import):

> # Filter down our dataset to just the baz.exosite.com requests
> baz <- subset(reqdata, hostheader=='baz.exosite.com')
> unique(baz$useragent)
[1] "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.37 Safari/537.36"

There’s only one browser hitting this domain. Is there only one IP, too? (These IPs have been changed to protect the irrelevant.)

> unique(baz$clientip)
[1] "" "" "" ""

Now let’s skip to the most neato part. I plotted Apache response time versus time over the course of 2 hours, faceted by client IP address. It looks like this:

ggplot(baz, aes(X.timestamp, servtime)) + geom_point(aes(color=clientip)) + facet_grid(clientip ~ .)

Apache vs PHP time

So what are we looking at? Here’s what I think.

Only one person, let’s call him François, uses this domain. François has a Portal with several auto-refreshing widgets on it. Also, François is on a train, and he’s using a wireless access point to browse the internet.

François has his Portal open in a tab that he’s forgotten about. It’s dutifully refreshing his widgets every few seconds, but as the train moves, François’s access point keeps switching towers. It starts getting a weak signal, it switches towers, François’s IP address therefore changes, and repeat.

PHP only logs how long it took to execute a script on the server, but Apache logs how long it takes for the client to acknowledge receipt of the final TCP segment in the response. Since François is on an ever-changing series of unreliable networks, he’s dropping packets left and right. Therefore, Apache is logging response times all over the map.


Alas, this information is useless. Sometimes that happens: sometimes you dig and dig and learn and learn and prepare your beautiful data, only to find at the last minute that it was all for naught.

But at least you learned something.

Howto: Setting up a proxy server that requires users to log in via Google Apps

I’ve spent way too much time on this in the last couple days, and the solution ended up being far too simple. I’ll spare you my agony by sticking to a straightforward howto format.


I just set up Kibana, but of course I wanted my users to connect over HTTPS and with proper authentication. Well, I thought, we have Google Apps here at Exosite — how hard could it be to set up a gateway that uses OpenID to authenticate you with Google Apps before passing you through to the content?

The answer: it’s super easy to set up, but it’s kind of tricky to figure out at first. Luckily for you, I’ve already done the figuring out.

Step 1: Install the things

The following is all I need to install on Debian 7:

apt-get install apache2 libapache2-mod-auth-openid

Step 2: Write your Apache config

This example runs an HTTPS server with the hostname “coolstuff.danslimmon.com”, and it assumes you have an SSL cert for that domain in /etc/ssl/certs. It forces the user to log into Google Apps with an email address ending in “@danslimmon.com”, and then proxies their request to another web server listening on localhost port 5601.

If you don’t have an SSL cert and you don’t mind having your users get SSL validation warnings in their browsers, you can use the default Snakeoil certs.

Put this file in “/etc/apache2/sites-available/openid”:

<VirtualHost *:443>
    ServerName coolstuff.danslimmon.com
    ProxyPass / http://localhost:5601/

    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/coolstuff.danslimmon.com.crt
    SSLCertificateKeyFile /etc/ssl/private/coolstuff.danslimmon.com.key
    SSLProxyEngine        on
    SSLProxyCACertificatePath /etc/ssl/certs

    # Require a danslimmon.com email address via Google's OpenID
    <Location />
        AuthType OpenID
        Require valid-user
        # Require Google's OpenID endpoint
        AuthOpenIDSingleIdP https://www.google.com/accounts/o8/id
        # Require an @danslimmon.com email address
        AuthOpenIDAXRequire email \
            http://axschema.org/contact/email @danslimmon\.com$
    </Location>

    BrowserMatch "MSIE [2-6]" \
        nokeepalive ssl-unclean-shutdown \
        downgrade-1.0 force-response-1.0
    # MSIE 7 and newer should be able to use keepalive
    BrowserMatch "MSIE [17-9]" ssl-unclean-shutdown
</VirtualHost>

Step 3: Enable the site

a2enmod ssl
a2enmod proxy
a2enmod proxy_http
a2enmod authopenid
a2ensite openid
service apache2 restart

That should do it!

Pulling data from OpenTSDB into R

Sometimes you want to jam data from OpenTSDB into R, so you can do neat things like parameterize it by time. Well, I wrote an R function that does just that!

Here it is on gist.

Here’s a usage example:


I considered turning this into a CRAN package, but then I was like “naw.” There’s not very much to it, and writing CRAN packages seems like a big pain. Correct me if I’m wrong.

Anyway, enjoy!

Parameterizing metrics by time through Graphite abuse

Everybody has a metric-spewing system, like StatsD, and a graphing system, like Graphite. Graphite makes time-series plots. Time-series plots are great! But they’re not everything.

I like to know how my system responds to a given stimulus. I like to characterize my product by asking questions like:

  • When CPU usage rises, what happens to page load time?
  • How many concurrent users does it take before we start seeing decreased clicks?
  • What’s the relationship between cache hit rate and conversion rate?

In each of these cases, we’re comparing (as Theo Schlossnagle is so fond of saying) a system metric to a business metric. A system metric is something like CPU usage, or logged-in users, or cache hit rate. A business metric is something that relates more directly to your bottom line, like page load time, click rate, or conversion rate.

Time series plots aren’t great at answering these kinds of questions. Take the first question for example: “When CPU usage rises, what happens to page load time?” Sure, you can use Graphite to superimpose the day’s load time plot over the CPU usage plot, like so:

From this chart you can see that CPU usage and page load time do both tend to be higher in the afternoon. But you’re only looking at one day’s worth of data, and you don’t know:

  • Whether this relationship holds on other days
  • How strongly correlated the two quantities are
  • What happens at lower/higher CPU usages than were observed today

To answer questions like these, what you want is a plot of page load time versus CPU usage, with time parameterized away. That is, for each moment in time, you want to plot a point for that moment. Like so:

This scatterplot tells you a lot more about your system’s characteristic response to rising CPU usage. You can easily see:

  • As CPU usage gets higher, page load time generally increases, but not linearly.
  • For a given CPU usage, there’s a line (yellow) below which page load time will not go.
  • Above a certain CPU usage (red), the relationship between load time and CPU usage becomes less strong (as evidenced by the spreading-out of the band of data points toward the right-hand side of the plot)

Time-parameterized plots like these are a great way to get to know your system. If you make an architectural change and you find that the shape of this plot has changed significantly, then you can learn a lot about the effect your change had.

But sadly, I haven’t been able to find an open-source tool that makes these plots easy to generate. So I’ll show you one somewhat ugly, but still not too time-consuming, method I devised. The gist is this:

  1. Configure a graphite instance to store every data point for the metrics in which you’re interested.
  2. Run a script to download the data and parameterize it with respect to time.
  3. Mess with it in R

Configure Graphite

The first thing we’re going to do is set up Graphite to store high-resolution data for the metrics we want to plot, going back 30 days.

“But Dan! Why do you need Graphite? Couldn’t you just have statsd write these metrics out to MongoDB or write a custom backend to print them to a flat file?”

Sure I could, hypothetical question lady, but one nice thing about Graphite is its suite of built-in data transformation functions. If I wanted, for example, to make a parameterized plot of the sum of some metric gathered from multiple sources, I could just use Graphite’s sumSeries() function, rather than having to go and find the matching data points and add them all up myself.

I’m going to assume you already have a Graphite instance set up. If I were doing this for real in production, I’d use a separate Graphite instance. But I set up a fresh disposable one for the purposes of this post, and put the following in /opt/graphite/conf/storage-schemas.conf:

[nconn]
pattern = ^stats\.gauges\.nconn$
retentions = 10s:30d

[resptime]
pattern = ^stats\.timers\.resptime\..*
retentions = 10s:30d

[default]
pattern = .*
retentions = 10s:10s

This basically says: keep every data point for 30 days for stats.gauges.nconn (number of concurrent connections) and stats.timers.resptime (response times of API requests), and keep only the last 10 seconds of everything else.

Get parametric data

I wrote a script to print out a table of data points parameterized by time. Here it is: https://gist.github.com/danslimmon/5320247

Play with it in R

Now we can load this data into R:

data <- read.table("/path/to/output.tsv")

We can get a scatterplot immediately:

plot(data$resptime_90p ~ data$nconn)

Screen Shot 2013-04-05 at 5.21.09 PM

There’s a lot of black here, which may be hiding some behavior. Maybe we’ll get a clearer picture by looking at a histogram of the data from each quartile (of nconn):

# Split the graphing area into four sections
par(mfrow=c(4,1))
# Get quantile values
quants <- quantile(data$nconn, c(.25,.5,.75))
# Draw the histograms
hist(data$resptime_90p[data$nconn <= quants[1]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q1", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[1] & data$nconn <= quants[2]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q2", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[2] & data$nconn <= quants[3]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q3", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[3]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q4", xlab="resptime_90p")

Which comes out like this:

Screen Shot 2013-04-05 at 7.05.07 PM

This shows us pretty intuitively how response time splays out as the number of concurrent connections rises.

Putting a number on scalability

If you believe — as I do — that everything worth talking about is measurable, then it probably irritates you sometimes when you hear everybody talk about scalability. We all recognize that we want to build systems that can adapt to changing capacity requirements, sure, but how do you measure how scalable your system is? How do you know when you’ve improved its scalability? How do you know how much you’ve improved it?

I propose that we pick a definition of scalability that lets us attach a number to it. When you say a system is highly scalable what you probably mean is that when you need to add capacity, you can do so without spending too much time or money. So here’s how we should define scalability:

Scalability = d(capacity)/d(money spent)

In case you’re rusty on calculus, this means scalability is the derivative of capacity with respect to money spent. And even if you’re even rustier, think about it as “how much capacity you get for a dollar.”

I’ve got an example that should make it more intuitive.

Calculating the scalability of a web cluster

Suppose you have a cluster of web servers whose capacity you measure in requests per second. A plot of that value over the course of a year might look like this:

Screen Shot 2013-03-28 at 8.28.45 PM

Similarly, you could plot the total amount of money (including time spent on labor, if you have to do anything manually), cumulatively spent on web cluster capacity. You probably don’t have this metric, and neither do I (yet), but bear with me.

Screen Shot 2013-03-28 at 8.14.13 PM

If you have both of these quantities available, then you can take your capacity datapoints and plot them against money datapoints from the same time. In math, we call this parameterizing by time.

Screen Shot 2013-03-28 at 8.30.04 PM

And finally we can get at d(Capacity)/d(Money). For discrete quantities like these, it suffices to plot the ratios between the increments by which capacity and money change. You can easily get at those values with R’s diff() function. Here’s what ours look like:

Screen Shot 2013-03-28 at 8.34.25 PM

So our scalability has increased since last May. In fact, we can say that it has increased by at least 0.1 req/s/$.
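The diff-and-divide arithmetic is tiny. Here’s a sketch of it in Python with made-up capacity and spending series (the plots above came from R’s diff(); this is just the same computation spelled out):

```python
# Cumulative capacity (req/s) and cumulative money spent ($), sampled at the
# same points in time. These numbers are invented for illustration.
capacity = [100, 150, 210, 300, 420]
money = [1000, 1400, 1850, 2400, 3000]

# Increment-by-increment changes, like R's diff()
d_capacity = [b - a for a, b in zip(capacity, capacity[1:])]
d_money = [b - a for a, b in zip(money, money[1:])]

# Scalability at each step: capacity gained per dollar spent
scalability = [dc / dm for dc, dm in zip(d_capacity, d_money)]
print(scalability)  # req/s per dollar at each spending increment
```

With these toy numbers the ratios rise over time, which is what “our scalability has increased” looks like in this framing.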

Think about it

I’m sure this is not a super practical calculation for most ops teams. But I do think it’s a useful way to think about scalability. Since I started thinking about scalability in this way, it’s felt more like a real thing. And honestly there’s not that much standing in the way of calculating this number for real, so maybe I’ll start doing it.

What do you think? Do you have a better definition for scalability that yields a quantity?

Calculating the expectation value of knowing Jeopardy! answers

Jeopardy players all know the most common response in the history of the show: “What is Australia?” It’s appeared over 220 times.

But, if you’re looking for the best topics to study as a potential Jeopardy contestant, the number of times a clue has appeared isn’t necessarily what you’re interested in. Suppose clues with the response “What is Australia?” are only ever worth $200, and clues with the response “What is Burundi?” are only ever worth $2000. In that case, as long as the probability of Burundi appearing in a given show is more than 1/10 the probability of Australia appearing, you’re better off, moneywise, brushing up on your Burundi facts.

This is the utility of the expectation value. It’s the probability of a given event multiplied by the value of that event happening. Here’s another way to think of it: if you played a chance game a million times (in our example, this would mean playing a million rounds of Jeopardy), the expectation value of betting on a given outcome (in our example, of studying a given country) is the average amount of money you’d win.

I want to be on Jeopardy, so to help myself prioritize what facts to learn, I calculated the expectation values of knowing every distinct response ever used in Jeopardy (courtesy of http://www.j-archive.com). Here’s my method:

  • The probability of a response appearing in a given game is the number of times that response has ever appeared, divided by the total number of clues in history, times 60 (Final Jeopardy is ignored). NOTE: See the comments for a discussion of whether this method is valid. Answer seems to be “pretty much, because the probability of any given answer appearing in a show is so miniscule.”
  • The value of giving a correct response is adjusted for modern clue values ($200 to $1000 in the Jeopardy round, $400 to $2000 in the Double Jeopardy round)
  • We add up all the adjusted values of a response’s appearance and divide by the number of occurrences to get that response’s average value, and then we multiply by its probability of appearance.
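The method above boils down to a few lines of code. Here’s a Python sketch with a three-clue toy history standing in for the real j-archive dump (the clue list and values are invented for illustration):

```python
from collections import defaultdict

# Toy clue history as (response, modern-adjusted dollar value) pairs.
# Invented for illustration; the real input would be the full j-archive data.
clues = [
    ("What is Australia?", 200),
    ("What is Australia?", 400),
    ("What is Burundi?", 2000),
]

CLUES_PER_GAME = 60  # Final Jeopardy ignored, per the method above
total_clues = len(clues)

stats = defaultdict(lambda: [0, 0])  # response -> [count, total adjusted value]
for response, value in clues:
    stats[response][0] += 1
    stats[response][1] += value

expectation = {}
for response, (count, total_value) in stats.items():
    # Expected appearances per game, times average adjusted value
    appearance_rate = count / total_clues * CLUES_PER_GAME
    avg_value = total_value / count
    expectation[response] = appearance_rate * avg_value

ranked = sorted(expectation, key=expectation.get, reverse=True)
print(ranked)
```

In this toy history, Burundi appears a third as often as Australia but is worth far more per appearance, so it comes out on top by expectation value.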

It ends up being a pretty different list! Here are the top 1000 Jeopardy! answers by expectation value: Link to Gist

Australia is still number one, but compare with the top 1000 by count: Link to Gist. There are 228 entries on each list that are missing from the other, and the order of the shared items is very different, especially further down the list.

If you’re going to study for Jeopardy, studying things in order of decreasing expectation value strikes me as more intelligent than studying in order of decreasing historical count. What do you think?

Car alarms and smoke alarms: the tradeoff between sensitivity and specificity

Wouldn’t you like to live in a world where your monitoring systems only alerted when things were actually broken? And wouldn’t it be great if, in that world, your alerts would always fire if things were broken?

Well so would everybody else. But we don’t live in that world. When we choose a threshold for alerting, we usually have to make a tradeoff between the chance of getting a false positive (an alert that fires when nothing is wrong) and the chance of getting a false negative (an alert that doesn’t fire when something is wrong).

Take the load average on an app server for example: if it’s above 100, then your service is probably broken. But there’s still a chance that the waiting processes aren’t blocking your mission-critical code paths. If you page somebody on this threshold, there’s always a chance that you’ll be waking that person up in the middle of the night for no good reason. However, if you raise the threshold to 200 to get rid of such spurious alerts, you’re making it more likely that a pathologically high load average will go unnoticed.

When presented with this tradeoff, the path of least resistance is to say “Let’s just keep the threshold lower. We’d rather get woken up when there’s nothing broken than sleep through a real problem.” And I can sympathize with that attitude. Undetected outages are embarrassing and harmful to your reputation. Surely it’s preferable to deal with a few late-night fire drills.

It’s a trap.

In the long run, false positives can, and often will, hurt you more than false negatives. Let’s learn about the base rate fallacy.

The base rate fallacy

Suppose you have a service that works fine most of the time, but breaks occasionally. It’s not trivial to determine whether the service is working, but you can write a probe that’ll detect its state correctly 99% of the time:

  • If the service is working, there’s a 1% chance that your probe will say it’s broken
  • If the service is broken, there’s a 1% chance that your probe will say it’s working

Naïvely, you might expect this probe to be a decent check of the service’s health. If it goes off, you’ve got a pretty good chance that the service is broken, right?

No. Bad. Wrong. This is what logicians and statisticians call the “base rate fallacy.” Your expectation hinges on the assumption that the service is only working half the time. In reality, if the service is any good, it works almost all the time. Let’s say the service is functional 99.9% of the time. If we assume the service just fails randomly the other 0.1% of the time, we can calculate the true-positive rate:

\begin{array}{rcl} \text{TPR} & = & \text{(prob. of service failure)}*\text{(prob. of detecting a failure)} \\ & = & (0.001) * (0.99) \\ & = & 0.00099 \\ & = & 0.099\% \end{array}

That is to say, about 1 in 1000 of all tests will run during a failure and detect that failure correctly. We can also calculate the false-positive rate:

\begin{array}{rcl} \text{FPR} & = & \text{(prob. of service non-failure)}*\text{(prob. of detecting failure anyway)} \\ & = & (1-0.001)*(1-0.99) \\ & = & 0.00999 \\ & = & 0.999\% \end{array}

So almost 1 test in 100 will run when the service is not broken, but will report that it’s broken anyway.

You should already be feeling anxious.

With these numbers, we can calculate what the medical field calls the probe’s positive predictive value: the probability that, if a given test produces a positive result, it’s a true positive. For our purposes this is the probability that, if we just got paged, something’s actually broken.

\begin{array}{rcl} \text{(Positive predictive value)} & = & \frac{\text{TPR}}{\text{TPR} + \text{FPR}} \\ & = & \frac{0.00099}{0.00099 + 0.00999} \\ & = & 0.0902 \\ & = & 9.0\% \end{array}

Bad news. There’s no hand-waving here. If you get alerted by this probe, there’s only a 9% chance that something’s actually wrong.
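You can check the arithmetic yourself in a few lines (Python here, just to verify the numbers):

```python
# Probabilities from the example above
p_failure = 0.001      # service is broken 0.1% of the time
sensitivity = 0.99     # P(probe says broken | actually broken)
specificity = 0.99     # P(probe says fine   | actually fine)

# Fraction of all probe runs that are true/false positives
tpr = p_failure * sensitivity              # detected real failures
fpr = (1 - p_failure) * (1 - specificity)  # false alarms

ppv = tpr / (tpr + fpr)
print(f"PPV = {ppv:.1%}")  # roughly 9%: most pages are false alarms
```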

Car alarms and smoke alarms

When you hear a car alarm going off, do you run to the window and start looking for car thieves? Do you call 9-1-1? Do you even notice car alarms anymore?

Car alarms have a very low positive predictive value. They go off for so many spurious reasons: glitchy electronics, drunk people leaning on the hood, accidental pressing of the panic button. And as a result of this low PPV, car alarms are much less useful as theft deterrents than they could be.

Now think about smoke alarms. People trust smoke alarms. When a smoke alarm goes off in a school or an office building, everybody stops what they’re doing and walks outside in an orderly fashion. Why? Because when smoke alarms go off (and there’s no drill scheduled), it frequently means there’s actual smoke somewhere.

This is not to say that smoke alarms have a perfect PPV, of course, as anybody who’s lost half an hour of their time to a false positive will tell you. But their PPV is high enough that people still pay attention to them.

We should strive to make our alerts more like smoke alarms than car alarms.

Sensitivity and specificity

Let’s go back to our example: probing a service that works 99.9% of the time. There’s some jargon for the tradeoff we’re looking at. It’s the tradeoff between the sensitivity of our test (the probability of detecting a real problem if there is one) and its specificity (the probability that we won’t detect a problem if there isn’t one).

Every time we set a monitoring threshold, we have to balance sensitivity and specificity. And one of the first questions we should ask ourselves is: “How high does our specificity have to be in order to get a decent positive predictive value?” It just takes some simple algebra to figure this out. We start with the PPV formula we used before, enjargoned below:

\begin{array}{rcl} \text{PPV} & = & \frac{\text{TPR}}{\text{TPR}+\text{FPR}} \\ & = & \frac{\text{(prob. of failure)}\cdot\text{(sensitivity)}}{\text{(prob. of failure)}\cdot\text{(sensitivity)} + (1 - \text{(prob. of failure)})\cdot(1 - \text{(specificity)})} \end{array}

To make this math a little more readable, let’s let p = PPV, f = the probability of service failure, a = sensitivity, and b = specificity. And let’s solve for b.

\begin{array}{rcl} p & = & \frac{fa}{fa + (1-f)*(1-b)} \\ fa + (1-f)(1-b) & = & \frac{fa}{p} \\ 1-b & = & \frac{\frac{fa}{p} - fa}{1-f} \\ b & = & 1 - \frac{\frac{fa}{p} - fa}{1-f} \end{array}

So, sticking with the parameters of our initial example (0.1% probability of service failure, 99% sensitivity) and deciding that we want a positive predictive value of at least 90% (so that 9 out of 10 alerts will mean something’s actually broken), we end up with

\begin{array}{rcl} \text{Specificity} & = & 1 - \frac{\frac{0.001\cdot 0.99}{0.9} - (0.001\cdot 0.99)}{1 - 0.001} \\ & \approx & 0.9999 \\ & = & 99.99\% \end{array}

The necessary specificity is about 99.99% — that’s way higher than the sensitivity of 99%! In order to get a probe that detects failures in this service with a decent positive predictive value, your false-positive rate (0.01%) has to be 100 times lower than your false-negative rate (1%)!
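If you want to sanity-check these numbers yourself, here’s a quick Python sketch (the function names are mine, not from any monitoring tool) that computes PPV from the failure rate, sensitivity, and specificity, and solves for the specificity you’d need to hit a target PPV:

```python
def ppv(failure_rate, sensitivity, specificity):
    """Positive predictive value: P(real failure | alert fired)."""
    true_alarms = failure_rate * sensitivity
    false_alarms = (1 - failure_rate) * (1 - specificity)
    return true_alarms / (true_alarms + false_alarms)

def required_specificity(failure_rate, sensitivity, target_ppv):
    """Solve ppv(...) = target_ppv for specificity (the formula derived above)."""
    fa = failure_rate * sensitivity
    return 1 - (fa / target_ppv - fa) / (1 - failure_rate)

# Our example: 0.1% failure rate, 99% sensitivity, and we want PPV >= 90%.
spec = required_specificity(0.001, 0.99, 0.9)
print(f"required specificity: {spec:.6f}")  # about 0.999890, i.e. ~99.99%
print(f"check: PPV at that specificity = {ppv(0.001, 0.99, spec):.3f}")  # 0.900
```

Plugging the answer back into `ppv()` confirms the algebra: at 99.99% specificity, nine out of ten alerts mean a real failure.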

So listen.

You’ll often be tempted to favor high sensitivity at the cost of specificity, and sometimes that’s the right choice. Just be careful: avoid the base rate fallacy by remembering that your false-positive rate needs to be much smaller than your failure rate if you want your test to have a decent positive predictive value.

The Universe is not listening

One of the oldest forms of superstition is the belief that particular words have a magical effect. Spells, incantations, Kotodama, “om”; it goes on and on. The word “abracadabra” is thought to have come from the ancient Aramaic for “I have created through speech” [srsly].

We really seem to wish the Universe were listening to us.

And it doesn’t even have to be a benevolent force listening to our words. Think about the old “Bloody Mary” urban legend, or the taboos around speaking the names of Yahweh and Voldemort. Or the ancient Greek horror stories about mortals insulting the gods.

Now I’m not anti-folklore, but I do think that mixing superstition with engineering is a terrible idea. It’s corrosive to the data-oriented devops culture that we should all be trying to build.

The “Now You’ve Done It!” Fallacy

You’ve probably had or heard this conversation before:

ALTHEA: Hey, if you think about it, we haven’t had an outage for quite some time.

REGINALD: Althea, don’t say that! Now we’re going to have one.

Reginald is an asshole, and he’s hurting your team. Don’t be like Reginald.

Let’s think about what happened in this exchange. First, Althea made an observation: that it’s been abnormally long since the site went down. Bangin’, right? If true, this observation could have very interesting implications. The last thing you should do is ignore it. And if it’s part of a trend toward higher uptimes altogether, you should be trying to understand where that trend comes from, so you can keep it going.

Then Reginald was all like “Hey Althea, you need to stop commenting on good things that happen. We only talk about our failures on this team, because talking about our successes summons vengeful spirits to corrupt our hard disks and steal our packets.”

Sure, you should be skeptical of Althea’s statement, and ideally you’re keeping track of the frequency of your outages so you can confirm it. But Reginald’s attitude actively discourages that kind of rational inquiry.

And I’ve got another bone to pick with Reginald. He just told Althea that our job, in ops, is to live in constant terror of the next disaster. He’s saying that downtime just happens and that, when we have a good run of uptime, it’s just a fluke. But uptime doesn’t just happen: we work at it, we push for it, and we should celebrate when we achieve it.

Dispelling Superstition

A superstition is the belief that one event causes another without any physical process linking the two. Crack-stepping leads to broken maternal spines; a Pennsylvanian groundhog seeing his shadow leads to aberrations in global weather patterns; you know the type. And Reginald’s warning to Althea is nothing more than superstition: talking about good uptime, he says, causes downtime.

Reginald probably doesn’t even really believe this, but that doesn’t matter. He’s damaged the team’s culture by chiding Althea, since she’ll be less likely to vocalize her observations in the future.

Superstitions emerge from confirmation bias, which is very powerful. In the past, Reginald has probably suffered through one or two outages that occurred soon after somebody made a comment about good uptime. When the outage started, he immediately remembered the comment, and went “A-ha! I knew this would happen.” So now it doesn’t matter how many outages happen in the absence of such a comment, because in those cases he’s not thinking about his superstition. Likewise, it doesn’t matter how many of those comments get made nowhere near an outage. The only feedback Reginald’s superstition gets is positive. He’s like one of B.F. Skinner’s superstitious pigeons.

So how do you fight superstition? With empiricism, of course!

Track your MTBF: mean time between failures. Whenever a failure event happens, give it a rough severity rating and a timestamp and stick it in a spreadsheet. Get into this habit, and soon you’ll have a solid dataset that you can average by week or by month and show to Reginald (and Althea). If you’ve been working hard to improve stability and your MTBF has increased, then you should talk about that and you should feel good about it. You can use this data to erase Reginald’s counterproductive attitude and prevent it from harming your team’s culture any further.
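As a minimal sketch of what that habit buys you, here’s a Python example that computes MTBF from a list of failure timestamps (the timestamps here are made up for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical failure log: one timestamp per outage, oldest first.
failures = [
    datetime(2024, 1, 1, 3, 0),
    datetime(2024, 1, 8, 3, 0),
    datetime(2024, 1, 22, 3, 0),
]

# MTBF is just the average gap between consecutive failures.
gaps = [later - earlier for earlier, later in zip(failures, failures[1:])]
mtbf = sum(gaps, timedelta()) / len(gaps)
print(f"MTBF: {mtbf.total_seconds() / 3600:.0f} hours")  # 252 hours
```

Average this by week or month and you’ve got the chart to show Reginald.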

You Know Who’s Smart? Friggin’ Doctors, Man.

Inspired by Steve Bennett’s talk at Velocity 2012 (slides here; I swear it’s a great talk, and I didn’t just think he was smart because he’s British), I’ve been trying lately to apply medicine’s differential diagnosis approach to my ops problem solving.

If you’ve ever seen an episode of “House, M.D.,” you’ll recognize the approach right away.

Problem-Based Learning

Since my girlfriend (partner/common-law fiancée/non-Platonic ladyperson/whatever) is a veterinary student, I end up hearing a lot about medical reasoning. One of her classes in first year was “Problem-Based Learning,” or as I called it, “House D.V.M.”. The format of this class should sound familiar to anyone who’s worked in ops, or dev, or the middle bit of any Venn diagram thereof.

You walk in on Monday and grab a worksheet. This worksheet describes the symptoms of some cat or pug or gila monster or headcrab that was recently treated in the hospital. Your homework: figure out what might be wrong with the animal, and recommend a course of treatment and testing.

On Tuesday, you’re given worksheet number 2. It says what a real vet did, given Monday’s info, and then it lists the results of the tests that the vet ordered. So the process starts over: your homework is to infer from the test results what could be wrong with the animal, and then figure out what tests or treatments to administer next.

This process repeats until Friday, by which point you’ve hopefully figured out what the hell.

When I heard this, I thought it was all very cool. But I didn’t pick up on the parallels with my own work, which are staggering. And what really should have caught my attention, in retrospect, is that this was a course they were taking. They’re teaching a deductive process!

Can We Formalize It? Yes We Can!

In tech, our egos often impede learning. We’re smart and we’ve built a unique, intricate system that nobody else understands as well as we do. “Procedures” and “methodologies” disgust us: it’s just so enterprisey to imagine that any one framework could be applied to the novel, cutting-edge complexities we’re grokking with our enormous hacker brains.

Give it a rest. Humans have been teaching each other how to troubleshoot esoteric problems in complex systems for friggin’ millennia. That’s what medicine is.

When faced with a challenging issue to troubleshoot, doctors will turn to a deductive process called “differential diagnosis.” I’m not going to describe it in that much detail; if you want more, then tell Steve Bennett to write a book. Or watch a few episodes of House. But basically the process goes like this:

  • Write down what you know: the symptoms.
  • Brainstorm possible causes (“differentials”) for these symptoms.
  • Figure out a test that will rule out (“falsify”) some of the differentials, and perform the test.
  • If you end up falsifying all your differentials, then clearly you didn’t brainstorm hard enough. Revisit your assumptions and come up with more ideas.

This simple process keeps you moving forward without getting lost in your own creativity.
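The loop above is simple enough to sketch in code. Here’s a toy Python version; the hypothesis names and test outcomes are invented for illustration, not from any real incident tooling:

```python
def diagnose(differentials, run_test):
    """differentials: hypothesis names from the brainstorming step.
    run_test(h): True if hypothesis h survives its test, False if falsified."""
    remaining = list(differentials)
    for hypothesis in list(remaining):
        if not run_test(hypothesis):
            remaining.remove(hypothesis)  # falsified: cross it off the board
    if not remaining:
        raise RuntimeError("All differentials falsified; brainstorm harder.")
    return remaining

# Made-up test outcomes: only the gateway-latency hypothesis survives.
results = {"db failure": False, "gateway latency": True, "dns failure": False}
print(diagnose(list(results), lambda h: results[h]))  # ['gateway latency']
```

The `RuntimeError` branch is the key bit: running out of differentials doesn’t mean you’re done, it means your brainstorm was too narrow.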

Mnemonics As Brainstorming Aids

The brainstorming step of this deductive process (“writing down your differentials”) is critical. Write down whatever leaps to mind.

Doctors have mnemonic devices to help cover all the bases here. One of the most popular is VINDICATE (Vascular/Inflammatory/Neoplastic/Degenerative/Idiopathic/Congenital/Autoimmune/Traumatic/Endocrine). They go through this list and ask “Could it be something in this category?” The list covers all the systems in the body, so if the doctor seriously considers each of the letters, they’ll usually come up with the right differential (although they may not know it yet).

Vets have a slightly different go-to mnemonic when listing differentials: DAMNIT. There are several different meanings for each letter, but the gist of it is Degenerative, Anomalous, Metabolic, Nutritional, Inflammatory, Traumatic. Besides being a mild oath (my second-favorite kind of oath), this device has the advantage of putting more focus on the trouble’s mode of operation, rather than its location.

These mnemonics are super useful to doctors, and it’s not that hard to come up with your own version. Bennett suggests CASHWOUND (see his slides to find out why).

No Seriously, Try It. It’s Great.

The other day, we were looking at our contribution dashboard and we noticed this (artist’s rendering):

Brief dip in donations


That dip in donations lasted about 10 minutes, and we found it extremely disturbing. So we piled into a conference room with a clean whiteboard, and we started writing down differentials.

A. Firewall glitch between card processors and Internet

B. Database failure causing donation pages not to load

C. Failures from the third-party payment gateway

D. Long response times from the payment gateway

E. Errors in our payment-processing application

F. DNS lookup failures for the payment gateway

Admittedly this is not a very long list, and we could’ve brainstormed better. But anyway, we started trying to pick apart the hypotheses.

We began with a prognostic approach. That means we judged hypothesis (B) to be the most terrifying, so we investigated it first. We checked out the web access logs and found that donation pages had been loading just fine for our users. Phew.

The next hypotheses to test were (C) and (D). Here we had switched to a probabilistic approach — we’d seen this payment gateway fail before, so why shouldn’t it happen again? To test this hypothesis, we checked two sources: our own application’s logs (which would report gateway failures), and Twitter search. Neither turned up anything promising. So now we had these differentials (including a new one devised by my boss, who had wandered in):

A. Firewall glitch between card processors and Internet

B. Database failure causing donation pages not to load (ruled out)

C. Failures from the third-party payment gateway (ruled out)

D. Long response times from the payment gateway (ruled out)

E. Errors in our payment-processing application

F. DNS lookup failures for the payment gateway

G. Users were redirected to a different site

(E) is pretty severe (if not particularly likely, since we hadn’t deployed the payment-processing code recently), so we investigated that next. No joy — the application’s logs were clean. Next up was (A), but it proved false as well, since we found no errors or abnormal behavior in the firewall logs.

So all we had left were (F) and (G). Finally we were able to determine that a client was A/B testing the donation page by randomly redirecting half of the traffic with JavaScript. So everything was fine.

Throughout this process, I found that the differential diagnosis technique helped focus the team. Nobody stepped on each other’s toes, we were constantly making progress, and nobody had the feeling of groping in the dark that one can get when one troubleshoots without a method.

Try it out some time!