A data set analyzed by an idiot, full of noise and worry, signifying nothing

Ever since I moved to Minneapolis and started working at Exosite, I’ve been using R almost daily to get to the bottom of ops mysteries. Sometimes it pays off and sometimes it doesn’t, but it’s always interesting.

Here’s an enigma I recently had the pleasure to rub some R on.

Huh…

One fine day, after a release of Portals (our webapp that lets you work with data from your cloud-enabled devices), I noticed something odd in the server metrics. The 90th-percentile response time had increased by a factor of four!

[Graph: Apache 90th-percentile response time, showing the 4x jump ("Awry")]

This would be alarming by itself, but here’s what made it interesting. We have two different ways of reporting the response time of our application. We have the Apache response time metric from the access logs (which is shown above), but we also have a metric generated by the PHP application itself, through statsd. A timer is started as soon as the PHP script begins running, and its value is reported to statsd as a timer metric when the script finishes. And the PHP metric, for the same time period, looked like this:

[Graph: PHP 90th-percentile response time for the same period ("Agley")]

This does not follow the same pattern. Sure, there’s some variability, but they’re both measuring response time: shouldn’t they track each other more closely? Let’s look at the 90th-percentile times for Apache (blue) and PHP (red) on a common scale:

[Graph: Apache (blue) and PHP (red) 90th-percentile response times on a common scale ("Awhat")]

Oh whoops — I should have warned you not to be drinking water when you looked at that graph. Sorry if you ruined your keyboard. Apache’s 90th-percentile response time is way higher than PHP’s. Like way higher. What gives?

Let’s summarize what we know so far:

  • 90th-percentile response times, as reported by Apache, have climbed above 5 seconds, which is much higher than normal.
  • PHP response times have not experienced any such climb, so something Apache-specific is causing this behavior.
  • Median values were not affected for either metric, so this issue is only affecting a particular subset of traffic.

Now it’s time to do some R.

Munging is my middle name

This disagreement between Apache response times and PHP response times is really intriguing, and might point us toward the origin of the issue, so let’s dig into it.

The first thing to do is pull some access log entries into R. We aggregate and parse our logs with Logstash, so each entry is a JSON blob like this (irrelevant bits snipped for brevity):

{
    "@fields": {
        "agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.37 Safari/537.36",
        "bytes": 2660,
        "clientip": "XXX.XXX.XXX.XXX",
        "hostheader": "portals.exosite.com",
        "overall_time": 0.097,
        "request": "/manage/devices",
        "response": "200",
        "servtime": 106908
    },
    "@timestamp": "2013-08-22T14:51:32.000Z",
    "@type": "apache_access"
}

The servtime field is the Apache response time in microseconds, and the overall_time field is the PHP response time in seconds. I’ve got a script that goes and grabs a time-slice of logs from the log server and converts them to CSV format, and the result looks like this:

"@timestamp","overall_time","servtime","hostheader"
"2013-08-21T13:00:49.000Z",0.083,93408,"foo.exosite.com"
"2013-08-21T13:00:48.000Z",0.173,185900,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.094,104675,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.122,131222,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.132,141991,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.096,106194,"baz.exosite.com"
"2013-08-21T13:00:46.000Z",0.136,146550,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.154,163942,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.174,184448,"bar.exosite.com"

That CSV is perfect for slurping into R like so:

> reqdata <- read.csv("/tmp/reqdata.csv", header=TRUE, as.is=TRUE)
> # Parse the timestamps with Hadley's lubridate package
> library(lubridate)
> reqdata$X.timestamp <- parse_date_time(reqdata$X.timestamp, "ymdHMs")
> # Remember, Apache response times are in microseconds, so we want to scale them to seconds
> reqdata$servtime <- reqdata$servtime / 10^6

And now we have a data frame called reqdata with a row for each log entry:

> head(reqdata)
X.timestamp overall_time servtime hostheader
1 2013-08-21 13:00:49 0.083 0.093408 foo.exosite.com
2 2013-08-21 13:00:48 0.173 0.185900 foo.exosite.com
3 2013-08-21 13:00:46 0.094 0.104675 bar.exosite.com
4 2013-08-21 13:00:46 0.122 0.131222 foo.exosite.com
5 2013-08-21 13:00:46 0.132 0.141991 bar.exosite.com
6 2013-08-21 13:00:46 0.096 0.106194 baz.exosite.com

The plot coagulates

Now that we have a nicely formatted data frame (we would also have accepted a pretty flower from your sweetie), we can use ggplot to check it out.

To get a handle on the Apache/PHP response time dichotomy, let’s plot one versus the other:

> library(ggplot2)
> p <- ggplot(reqdata, aes(overall_time, servtime))
> p + geom_point() +
    # A blue line showing the 1-to-1 relationship that we'd naïvely expect
    geom_abline(intercept=0, slope=1, color=I('blue')) +
    ggtitle('Apache response time versus PHP execution time for individual requests') +
    ylab('Apache response time (seconds)') +
    xlab('PHP execution time (seconds)')

[Scatterplot: Apache response time versus PHP execution time for individual requests]

So we can see here that most of the requests are doing what we expect: getting sent to PHP as soon as they come in, and getting sent back to the client as soon as they’re done. That’s what the blue line indicates: it’s a line of slope 1.

But then we’ve got this big near-vertical chunk of weird. Those are requests where PHP finished quickly (consistently under 200 milliseconds) but Apache took much longer to serve the request (up to 35 seconds). What is Apache doing for so long? Why can’t it just serve content like it’s supposed to?

We can get a big clue if we color the points by Host header (the domain requested by the browser):

> p + geom_point(aes(color=hostheader))

[Scatterplot: Apache response time versus PHP execution time, colored by Host header]

Now we’re getting somewhere! Look at all that green. That means that this issue is limited to a particular domain: baz.exosite.com.

The home stretch

Okay, so now we know:

  • Requests for baz.exosite.com are sitting around in Apache for up to 35 seconds before being delivered
  • Our PHP executes very quickly for this domain — it’s only Apache that’s sucking up the time

Given that this traffic is limited to a particular domain, I’m getting the suspicion that it’s homogeneous in other ways too. And we can confirm this (pretend I’d pulled in the user-agent and client IP address from the logs with my initial import):

> # Filter our dataset down to just the baz.exosite.com requests
> baz <- subset(reqdata, hostheader=='baz.exosite.com')
> unique(baz$useragent)
[1] "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.37 Safari/537.36"

There’s only one browser hitting this domain. Is there only one IP too? (These IPs have been changed to protect the irrelevant)

> unique(baz$clientip)
[1] "257.8.55.154" "257.28.192.38" "257.81.101.4" "257.119.178.94"

Now let’s skip to the most neato part. I plotted Apache response time versus time over the course of 2 hours, faceted by client IP address. It looks like this:

> ggplot(baz, aes(X.timestamp, servtime)) + geom_point(aes(color=clientip)) + facet_grid(clientip ~ .)

[Scatterplot: Apache response time over the two-hour window, faceted by client IP]

So what are we looking at? Here’s what I think.

Only one person, let’s call him François, uses this domain. François has a Portal with several auto-refreshing widgets on it. Also, François is on a train, and he’s using a wireless access point to browse the internet.

François has his Portal open in a tab that he’s forgotten about. It’s dutifully refreshing his widgets every few seconds, but as the train moves, François’s access point keeps switching towers. It starts getting a weak signal, it switches towers, François’s IP address therefore changes, and repeat.

PHP only logs how long it took to execute the script on the server, but Apache’s response time also includes the time spent writing the response back to the client over the network. Since François is on an ever-changing series of unreliable networks, he’s dropping packets left and right, and Apache is logging response times all over the map.

Alas

Alas, this information is useless. Sometimes that happens: sometimes you dig and dig and learn and learn and prepare your beautiful data, only to find at the last minute that it was all for naught.

But at least you learned something.

Howto: Setting up a proxy server that requires users to log in via Google Apps

I’ve spent way too much time on this in the last couple days, and the solution ended up being far too simple. I’ll spare you my agony by sticking to a straightforward howto format.

Motivation

I just set up Kibana, but of course I wanted my users to connect over HTTPS and with proper authentication. Well, I thought, we have Google Apps here at Exosite — how hard could it be to set up a gateway that uses OpenID to authenticate you with Google Apps before passing you through to the content?

The answer: it’s super easy to set up, but it’s kind of tricky to figure out at first. Luckily for you, I’ve already done the figuring out.

Step 1: Install the things

The following is all I need to install on Debian 7:

apt-get install apache2 libapache2-mod-auth-openid

Step 2: Write your Apache config

This example runs an HTTPS server with the hostname “coolstuff.danslimmon.com”, and it assumes you have an SSL cert for that domain in /etc/ssl/certs. It forces the user to log into Google Apps with an email address ending in “@danslimmon.com”, and then proxies their request to another web server listening on localhost port 5601.

If you don’t have an SSL cert and you don’t mind having your users get SSL validation warnings in their browsers, you can use the default Snakeoil certs that Debian’s ssl-cert package generates (/etc/ssl/certs/ssl-cert-snakeoil.pem and /etc/ssl/private/ssl-cert-snakeoil.key).

Put this file in “/etc/apache2/sites-available/openid”:

<VirtualHost *:443>
    ServerName coolstuff.danslimmon.com
    ProxyPass / http://localhost:5601/

    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/coolstuff.danslimmon.com.crt
    SSLCertificateKeyFile /etc/ssl/private/coolstuff.danslimmon.com.key
    SSLProxyEngine        on
    SSLProxyCACertificatePath /etc/ssl/certs

    # Require a danslimmon.com email address via Google's OpenID
    <Location />
        AuthType OpenID
        Require valid-user
        # Require Google's OpenID endpoint
        AuthOpenIDSingleIdP https://www.google.com/accounts/o8/id
        # Require an @danslimmon.com email address
        AuthOpenIDAXRequire email \
            http://axschema.org/contact/email @danslimmon\.com$
    </Location>

    BrowserMatch "MSIE [2-6]" \
        nokeepalive ssl-unclean-shutdown \
        downgrade-1.0 force-response-1.0
    # MSIE 7 and newer should be able to use keepalive
    BrowserMatch "MSIE [17-9]" ssl-unclean-shutdown
</VirtualHost>

Step 3: Enable the site

a2enmod ssl
a2enmod proxy
a2enmod proxy_http
a2enmod authopenid
a2ensite openid
service apache2 restart

That should do it!

Pulling data from OpenTSDB into R

Sometimes you want to jam data from OpenTSDB into R, so you can do neat things like parameterize it by time. Well, I wrote an R function that does just that!

Here it is on gist.

Here’s a usage example:

read.opentsdb("http://localhost:44242",
metrics=c("sum:dummy.linear","sum:dummy.random.unif{blah=*}"),
start="4h-ago")

I considered turning this into a CRAN package, but then I was like “naw.” There’s not very much to it, and writing CRAN packages seems like a big pain. Correct me if I’m wrong.

Anyway, enjoy!

Parameterizing metrics by time through Graphite abuse

Everybody has a metric-spewing system, like StatsD, and a graphing system, like Graphite. Graphite makes time-series plots. Time-series plots are great! But they’re not everything.

I like to know how my system responds to a given stimulus. I like to characterize my product by asking questions like:

  • When CPU usage rises, what happens to page load time?
  • How many concurrent users does it take before we start seeing decreased clicks?
  • What’s the relationship between cache hit rate and conversion rate?

In each of these cases, we’re comparing (as Theo Schlossnagle is so fond of saying) a system metric to a business metric. A system metric is something like CPU usage, or logged-in users, or cache hit rate. A business metric is something that relates more directly to your bottom-line, like page load time, click rate, or conversion rate.

Time series plots aren’t great at answering these kinds of questions. Take the first question for example: “When CPU usage rises, what happens to page load time?” Sure, you can use Graphite to superimpose the day’s load time plot over the CPU usage plot, like so:

From this chart you can see that CPU usage and page load time do both tend to be higher in the afternoon. But you’re only looking at one day’s worth of data, and you don’t know:

  • Whether this relationship holds on other days
  • How strongly correlated the two quantities are
  • What happens at lower/higher CPU usages than were observed today

To answer questions like these, what you want is a plot of page load time versus CPU usage, with time parameterized away. That is, for each moment in time, you plot a single point whose x-value is the CPU usage at that moment and whose y-value is the page load time at that moment. Like so:

This scatterplot tells you a lot more about your system’s characteristic response to rising CPU usage. You can easily see:

  • As CPU usage gets higher, page load time generally increases, but not linearly.
  • For a given CPU usage, there’s a line (yellow) below which page load time will not go.
  • Above a certain CPU usage (red), the relationship between load time and CPU usage becomes less strong (as evidenced by the spreading-out of the band of data points toward the right-hand side of the plot).

Time-parameterized plots like these are a great way to get to know your system. If you make an architectural change and you find that the shape of this plot has changed significantly, then you can learn a lot about the effect your change had.

But sadly, I haven’t been able to find an open-source tool that makes these plots easy to generate. So I’ll show you one somewhat ugly, but still not too time-consuming, method I devised. The gist is this:

  1. Configure a graphite instance to store every data point for the metrics in which you’re interested.
  2. Run a script to download the data and parameterize it with respect to time.
  3. Mess with it in R

Configure Graphite

The first thing we’re going to do is set up Graphite to store high-resolution data for the metrics we want to plot, going back 30 days.

“But Dan! Why do you need Graphite? Couldn’t you just have statsd write these metrics out to MongoDB or write a custom backend to print them to a flat file?”

Sure I could, hypothetical question lady, but one nice thing about Graphite is its suite of built-in data transformation functions. If I wanted, for example, to make a parameterized plot of the sum of some metric gathered from multiple sources, I could just use Graphite’s sumSeries() function, rather than having to go and find the matching data points and add them all up myself.

I’m going to assume you already have a Graphite instance set up. If I were doing this for real in production, I’d use a separate Graphite instance. But I set up a fresh disposable one for the purposes of this post, and put the following in /opt/graphite/conf/storage-schemas.conf:

[keep_nconn]
pattern = ^stats\.gauges\.nconn$
retentions = 10s:30d

[keep_resptime]
pattern = ^stats\.timers\.resptime\..*
retentions = 10s:30d

[default_blackhole]
pattern = .*
retentions = 10s:10s

This basically says: keep every data point for 30 days for stats.gauges.nconn (number of concurrent connections) and stats.timers.resptime (response times of API requests), and discard everything else.

Get parametric data

I wrote a script to print out a table of data points parameterized by time. Here it is: https://gist.github.com/danslimmon/5320247
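
If you’re curious what that might look like, here’s a rough R sketch of the same idea (it’s not the gist’s script): pull each metric from Graphite’s render API in CSV format and join the two series on timestamp. The port and the exact statsd timer name (upper_90) are assumptions; adjust them for your setup.

# Fetch one Graphite target as (timestamp, value) via the render API's CSV output
fetch_metric <- function(target, from="-30d",
                         base="http://localhost:8080/render") {
  url <- sprintf("%s?target=%s&from=%s&format=csv",
                 base, URLencode(target, reserved=TRUE), from)
  # Graphite's CSV output has no header row: target, timestamp, value
  df <- read.csv(url, header=FALSE,
                 col.names=c("target", "timestamp", "value"), as.is=TRUE)
  df[, c("timestamp", "value")]
}

nconn    <- fetch_metric("stats.gauges.nconn")
resptime <- fetch_metric("stats.timers.resptime.upper_90")

# Parameterize time away: keep one row per timestamp that both series share
data <- merge(nconn, resptime, by="timestamp")
names(data) <- c("timestamp", "nconn", "resptime_90p")

write.table(data, "/tmp/output.tsv", sep="\t", row.names=FALSE)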

Play with it in R

Now we can load this data into R:

data <- read.table("/path/to/output.tsv", header=TRUE)

We can get a scatterplot immediately:

plot(data$resptime_90p ~ data$nconn)

[Scatterplot: resptime_90p versus nconn]

There’s a lot of black here, which may be hiding some behavior. Maybe we’ll get a clearer picture by looking at a histogram of the data from each quartile (of nconn):

# Split the graphing area into four sections
par(mfrow=c(2,2))
# Get the quartile boundaries of nconn
quants <- quantile(data$nconn, c(.25,.5,.75))
# Draw a histogram of response times for each quartile of nconn
hist(data$resptime_90p[data$nconn <= quants[1]],
     xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q1", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[1] & data$nconn <= quants[2]],
     xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q2", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[2] & data$nconn <= quants[3]],
     xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q3", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[3]],
     xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q4", xlab="resptime_90p")

Which comes out like this:

[Four histograms: resptime_90p distribution within each quartile of nconn]

This shows us pretty intuitively how response time splays out as the number of concurrent connections rises.

Calculating the expectation value of knowing Jeopardy! answers

Jeopardy players all know the most common response in the history of the show: “What is Australia?” It’s appeared over 220 times.

But, if you’re looking for the best topics to study as a potential Jeopardy contestant, the number of times a clue has appeared isn’t necessarily what you’re interested in. Suppose clues with the response “What is Australia?” are only ever worth $200, and clues with the response “What is Burundi?” are only ever worth $2000. In that case, as long as the probability of Burundi appearing in a given show is more than 1/10 the probability of Australia appearing, you’re better off, moneywise, brushing up on your Burundi facts.

This is the utility of the expectation value: the probability of a given event multiplied by the payoff if that event happens. Here’s another way to think of it: if you played a chance game a million times (in our example, this would mean playing a million rounds of Jeopardy), the expectation value of betting on a given outcome (in our example, of studying a given country) is the average amount of money you’d win per game. If “What is Burundi?” showed up in 1 game out of 200 and were worth $2,000 when it did, knowing your Burundi facts would be worth $2,000 / 200 = $10 per game.

I want to be on Jeopardy, so to help myself prioritize what facts to learn, I calculated the expectation values of knowing every distinct response ever used in Jeopardy (courtesy of http://www.j-archive.com). Here’s my method:

  • The probability of a response appearing in a given game is the number of times that response has ever appeared, divided by the total number of clues in history, then multiplied by 60 (roughly the number of non-Final clues in a game; Final Jeopardy is ignored). NOTE: See the comments for a discussion of whether this method is valid. The answer seems to be “pretty much, because the probability of any given answer appearing in a show is so minuscule.”
  • The value of giving a correct response is adjusted for modern clue values ($200 to $1000 in the Jeopardy round, $400 to $2000 in the Double Jeopardy round)
  • We add up all the adjusted values of a response’s appearances and divide by the number of occurrences to get that response’s average value, then multiply by its probability of appearing in a game (see the sketch below).
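
Here’s a minimal R sketch of that calculation. The clues data frame and its column names are hypothetical stand-ins for the munged j-archive data, but the arithmetic is the same.

# Assumed input: one row per clue in history, with columns
#   response       - the correct response, e.g. "What is Australia?"
#   adjusted_value - the clue's value adjusted to modern dollar amounts
n_clues  <- nrow(clues)   # total clues in history
per_game <- 60            # non-Final clues per game

# Count and average the adjusted value of each distinct response
by_response <- aggregate(adjusted_value ~ response, data=clues,
                         FUN=function(v) c(count=length(v), avg=mean(v)))
counts <- by_response$adjusted_value[, "count"]
avgs   <- by_response$adjusted_value[, "avg"]

# P(response appears in a given game) ~= (count / total clues) * 60
p_appear    <- counts / n_clues * per_game
expectation <- p_appear * avgs

ranked <- data.frame(response=by_response$response, expectation=expectation)
head(ranked[order(-ranked$expectation), ], 10)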

It ends up being a pretty different list! Here are the top 1000 Jeopardy! answers by expectation value: Link to Gist

Australia is still number one, but compare with the top 1000 by count: Link to Gist. There are 228 entries on each list that are missing from the other, and the order of the shared items is very different, especially further down the list.

If you’re going to study for Jeopardy, studying things in order of decreasing expectation value strikes me as more intelligent than studying in order of decreasing historical count. What do you think?

The Universe is not listening

One of the oldest forms of superstition is the belief that particular words have a magical effect. Spells, incantations, Kotodama, “om”; it goes on and on. The word “abracadabra” is thought to have come from the ancient Aramaic for “I have created through speech” [srsly].

We really seem to wish the Universe were listening to us.

And it doesn’t even have to be a benevolent force listening to our words. Think about the old “Bloody Mary” urban legend, about the names of Yahweh and Voldemort. Even the ancient Greek horror stories about insulting the gods.

Now I’m not anti-folklore, but I do think that mixing superstition with engineering is a terrible idea. It’s corrosive to the data-oriented devops culture that we should all be trying to build.

The “Now You’ve Done It!” Fallacy

You’ve probably had or heard this conversation before:

ALTHEA: Hey, if you think about it, we haven’t had an outage for quite some time.

REGINALD: Althea, don’t say that! Now we’re going to have one.

Reginald is an asshole, and he’s hurting your team. Don’t be like Reginald.

Let’s think about what happened in this exchange. First, Althea made an observation: that it’s been abnormally long since the site went down. Bangin’, right? If true, this observation could have very interesting implications. The last thing you should do is ignore it. And if it’s part of a trend toward higher uptimes altogether, you should be trying to understand where that trend comes from, so you can keep it going.

Then Reginald was all like “Hey Althea, you need to stop commenting on good things that happen. We only talk about our failures on this team, because talking about our successes summons vengeful spirits to corrupt our hard disks and steal our packets.”

Sure, you should be skeptical of Althea’s statement, and ideally you’re keeping track of the frequency of your outages so you can confirm it. But Reginald’s attitude actively discourages rational inquiry.

And I’ve got another bone to pick with Reginald. He just told Althea that our job, in ops, is to live in constant terror of the next disaster. He’s saying that downtime just happens and that, when we have a good run of uptime, it’s just a fluke. But uptime doesn’t just happen: we work at it, we push for it, and we should celebrate when we achieve it.

Dispelling Superstition

A superstition is the belief that one event causes another without any physical process linking the two. Crack-stepping leads to broken maternal spines; a Pennsylvanian groundhog seeing his shadow leads to aberrations in global weather patterns; you know. And Reginald’s warning to Althea is nothing more than superstition: talking about good uptime, he says, causes downtime.

Reginald probably doesn’t even really believe this, but that doesn’t matter. He’s damaged the team’s culture by chiding Althea, since she’ll be less likely to vocalize her observations in the future.

Superstitions emerge from confirmation bias, which is very powerful. In the past, Reginald has probably suffered through one or two outages that occurred soon after somebody made a comment about good uptime. When the outage started, he immediately remembered the comment, and went “A-ha! I knew this would happen.” So now it doesn’t matter how many outages happen in the absence of such a comment, because in those cases he’s not thinking about his superstition. Likewise, it doesn’t matter how many of those comments get made nowhere near an outage. The only feedback Reginald’s superstition gets is positive. He’s like one of B.F. Skinner’s pigeons.

So how do you fight superstition? With empiricism, of course!

Track your MTBF: mean time between failures. Whenever a failure event happens, give it a rough severity rating and a timestamp and stick it in a spreadsheet. Get into this habit, and soon you’ll have a solid dataset that you can average by week or by month and show to Reginald (and Althea). If you’ve been working hard to improve stability and your MTBF has increased, then you should talk about that and you should feel good about it. You can use this data to erase Reginald’s counterproductive attitude and prevent it from harming your team’s culture any further.
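
If a spreadsheet isn’t your thing, the bookkeeping is tiny in R too. Here’s a sketch, assuming a hypothetical failures.csv with a timestamp column (in “YYYY-MM-DD HH:MM:SS” form) and a severity column:

# Load the failure log and sort it by time
failures <- read.csv("failures.csv", as.is=TRUE)
failures$timestamp <- as.POSIXct(failures$timestamp, tz="UTC")
failures <- failures[order(failures$timestamp), ]

# Time between consecutive failures, in hours
tbf <- diff(as.numeric(failures$timestamp)) / 3600

# Mean time between failures, grouped by the month in which each gap ended
month <- format(failures$timestamp[-1], "%Y-%m")
aggregate(tbf, by=list(month=month), FUN=mean)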