Pulling data from OpenTSDB into R

Sometimes you want to jam data from OpenTSDB into R, so you can do neat things like parameterize it by time. Well, I wrote an R function that does just that!

Here it is on gist.

Here’s a usage example:

read.opentsdb("http://localhost:44242",
metrics=c("sum:dummy.linear","sum:dummy.random.unif{blah=*}"),
start="4h-ago")

I considered turning this into a CRAN package, but then I was like “naw.” There’s not very much to it, and writing CRAN packages seems like a big pain. Correct me if I’m wrong.

Anyway, enjoy!

Parameterizing metrics by time through Graphite abuse

Everybody has a metric-spewing system, like StatsD, and a graphing system, like Graphite. Graphite makes time-series plots. Time-series plots are great! But they’re not everything.

I like to know how my system responds to a given stimulus. I like to characterize my product by asking questions like:

  • When CPU usage rises, what happens to page load time?
  • How many concurrent users does it take before we start seeing decreased clicks?
  • What’s the relationship between cache hit rate and conversion rate?

In each of these cases, we’re comparing (as Theo Schlossnagle is so fond of saying) a system metric to a business metric. A system metric is something like CPU usage, or logged-in users, or cache hit rate. A business metric is something that relates more directly to your bottom-line, like page load time, click rate, or conversion rate.

Time series plots aren’t great at answering these kinds of questions. Take the first question for example: “When CPU usage rises, what happens to page load time?” Sure, you can use Graphite to superimpose the day’s load time plot over the CPU usage plot, like so:

From this chart you can see that CPU usage and page load time do both tend to be higher in the afternoon. But you’re only looking at one day’s worth of data, and you don’t know:

  • Whether this relationship holds on other days
  • How strongly correlated the two quantities are
  • What happens at lower/higher CPU usages than were observed today

To answer questions like these, what you want is a plot of page load time versus CPU usage, with time parameterized away. That is, for each moment in time, you plot a single point whose x-coordinate is the CPU usage at that moment and whose y-coordinate is the page load time at that same moment. Like so:

This scatterplot tells you a lot more about your system’s characteristic response to rising CPU usage. You can easily see:

  • As CPU usage gets higher, page load time generally increases, but not linearly.
  • For a given CPU usage, there’s a line (yellow) below which page load time will not go.
  • Above a certain CPU usage (red), the relationship between load time and CPU usage becomes weaker, as evidenced by the spreading-out of the band of data points toward the right-hand side of the plot.

Time-parameterized plots like these are a great way to get to know your system. If you make an architectural change and you find that the shape of this plot has changed significantly, then you can learn a lot about the effect your change had.

But sadly, I haven’t been able to find an open-source tool that makes these plots easy to generate. So I’ll show you one somewhat ugly, but still not too time-consuming, method I devised. The gist is this:

  1. Configure a Graphite instance to store every data point for the metrics in which you're interested.
  2. Run a script to download the data and parameterize it with respect to time.
  3. Mess with it in R.

Configure Graphite

The first thing we’re going to do is set up Graphite to store high-resolution data for the metrics we want to plot, going back 30 days.

“But Dan! Why do you need Graphite? Couldn’t you just have statsd write these metrics out to MongoDB or write a custom backend to print them to a flat file?”

Sure I could, hypothetical question lady, but one nice thing about Graphite is its suite of built-in data transformation functions. If I wanted, for example, to make a parameterized plot of the sum of some metric gathered from multiple sources, I could just use Graphite's sumSeries() function, rather than having to go and find the matching data points and add them all up myself.

I’m going to assume you already have a Graphite instance set up. If I were doing this for real in production, I’d use a separate Graphite instance. But I set up a fresh disposable one for the purposes of this post, and put the following in /opt/graphite/conf/storage-schemas.conf:

[keep_nconn]
pattern = ^stats\.gauges\.nconn$
retentions = 10s:30d

[keep_resptime]
pattern = ^stats\.timers\.resptime\..*
retentions = 10s:30d

[default_blackhole]
pattern = .*
retentions = 10s:10s

This basically says: keep every data point (at 10-second resolution, matching StatsD's default flush interval) for 30 days for stats.gauges.nconn (number of concurrent connections) and stats.timers.resptime (response times of API requests), and effectively discard everything else.

Get parametric data

I wrote a script to print out a table of data points parameterized by time. Here it is: https://gist.github.com/danslimmon/5320247
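The core idea is simple: pull each metric out of Graphite's render API and join the two series on timestamp, so that each output row is one moment in time. A rough R sketch of that idea — not the actual script — assuming Graphite's render API with format=csv ("target,datetime,value" lines) and a made-up name for the timer sub-metric:

# Not the actual script from the gist -- just a sketch of the idea.
# Assumes Graphite's render API with format=csv and a hypothetical
# timer sub-metric name.
fetch_metric <- function(graphite_url, target, from="-30d") {
  u <- paste0(graphite_url, "/render?target=", URLencode(target, reserved=TRUE),
              "&from=", from, "&format=csv")
  df <- read.csv(url(u), header=FALSE,
                 col.names=c("target", "time", "value"),
                 stringsAsFactors=FALSE)
  df[, c("time", "value")]
}

graphite <- "http://localhost"   # wherever your graphite-web lives
nconn    <- fetch_metric(graphite, "stats.gauges.nconn")
resptime <- fetch_metric(graphite, "stats.timers.resptime.upper_90")

# Joining on timestamp is what parameterizes away time: each row is one
# moment, with both metrics' values at that moment.
data <- merge(nconn, resptime, by="time")
names(data) <- c("time", "nconn", "resptime_90p")
data <- na.omit(data)   # drop intervals where either metric is missing
write.table(data, "output.tsv", sep="\t", row.names=FALSE)

The resulting TSV has one row per 10-second interval, with both metrics' values for that interval.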

Play with it in R

Now we can load this data into R:

data <- read.table("/path/to/output.tsv", header=TRUE)  # assuming the script wrote a header row naming the columns

We can get a scatterplot immediately:

plot(data$resptime_90p ~ data$nconn)

[Scatterplot of resptime_90p against nconn]

There’s a lot of black here, which may be hiding some behavior. Maybe we’ll get a clearer picture by looking at a histogram of the data from each quartile (of nconn):

# Split the graphing area into four sections
par(mfrow=c(2,2))
# Get quantile values
quants <- quantile(data$nconn, c(.25,.5,.75))
# Draw the histograms
hist(data$resptime_90p[data$nconn <= quants[1]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q1", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[1] & data$nconn <= quants[2]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20),  main="Q2", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[2] & data$nconn <= quants[3]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20),  main="Q3", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[3]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q4", xlab="resptime_90p")

Which comes out like this:

[Four histograms of resptime_90p, one per quartile of nconn]

This shows us pretty intuitively how response time splays out as the number of concurrent connections rises.

Calculating the expectation value of knowing Jeopardy! answers

Jeopardy players all know the most common response in the history of the show: “What is Australia?” It’s appeared over 220 times.

But, if you’re looking for the best topics to study as a potential Jeopardy contestant, the number of times a clue has appeared isn’t necessarily what you’re interested in. Suppose clues with the response “What is Australia?” are only ever worth $200, and clues with the response “What is Burundi?” are only ever worth $2000. In that case, as long as the probability of Burundi appearing in a given show is more than 1/10 the probability of Australia appearing, you’re better off, moneywise, brushing up on your Burundi facts.

This is the utility of the expectation value. It’s the probability of a given event multiplied by the value of that event happening. Here’s another way to think of it: if you played a chance game a million times (in our example, this would mean playing a million rounds of Jeopardy), the expectation value of betting on a given outcome (in our example, of studying a given country) is the average amount of money you’d win.

I want to be on Jeopardy, so to help myself prioritize what facts to learn, I calculated the expectation values of knowing every distinct response ever used in Jeopardy (courtesy of http://www.j-archive.com). Here’s my method:

  • The probability of a response appearing in a given game is the number of times that response has ever appeared, divided by the total number of clues in history, then multiplied by 60 (the number of non-Final clues in a game; Final Jeopardy is ignored). NOTE: See the comments for a discussion of whether this method is valid. The answer seems to be "pretty much, because the probability of any given answer appearing in a show is so minuscule."
  • The value of giving a correct response is adjusted to modern clue values ($200 to $1000 in the Jeopardy round, $400 to $2000 in the Double Jeopardy round).
  • We add up all the adjusted values of a response's appearances, divide by the number of appearances to get that response's average value, and then multiply that average by its probability of appearing in a game (a rough sketch of this calculation follows this list).
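In R terms, the calculation might look roughly like this. This isn't my actual code; it assumes a hypothetical data frame clues with one row per clue and columns response and adjusted_value (the clue's value adjusted to modern amounts):

# Sketch only -- assumes a hypothetical data frame `clues` with one row per
# clue, a `response` column, and an `adjusted_value` column (modern dollars).
clues_per_game <- 60   # 30 clues per round, 2 rounds; Final Jeopardy ignored
total_clues <- nrow(clues)

counts     <- tapply(clues$adjusted_value, clues$response, length)
avg_values <- tapply(clues$adjusted_value, clues$response, mean)

# P(response appears in a given game) ~ count * 60 / total clues in history
p_game <- counts * clues_per_game / total_clues
expectation <- p_game * avg_values

# Top responses by expectation value
head(sort(expectation, decreasing=TRUE), 10)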

It ends up being a pretty different list! Here are the top 1000 Jeopardy! answers by expectation value: Link to Gist

Australia is still number one, but compare with the top 1000 by count: Link to Gist. There are 228 entries on each list that are missing from the other, and the order of the shared items is very different, especially further down the list.

If you’re going to study for Jeopardy, studying things in order of decreasing expectation value strikes me as more intelligent than studying in order of decreasing historical count. What do you think?

The Universe is not listening

One of the oldest forms of superstition is the belief that particular words have a magical effect. Spells, incantations, Kotodama, “om”; it goes on and on. The word “abracadabra” is thought to have come from the ancient Aramaic for “I have created through speech” [srsly].

We really seem to wish the Universe were listening to us.

And it doesn’t even have to be a benevolent force listening to our words. Think about the old “Bloody Mary” urban legend, about the names of Yahweh and Voldemort. Even the ancient Greek horror stories about insulting the gods.

Now I’m not anti-folklore, but I do think that mixing superstition with engineering is a terrible idea. It’s corrosive to the data-oriented devops culture that we should all be trying to build.

The “Now You’ve Done It!” Fallacy

You’ve probably had or heard this conversation before:

ALTHEA: Hey, if you think about it, we haven’t had an outage for quite some time.

REGINALD: Althea, don’t say that! Now we’re going to have one.

Reginald is an asshole, and he’s hurting your team. Don’t be like Reginald.

Let’s think about what happened in this exchange. First, Althea made an observation: that it’s been abnormally long since the site went down. Bangin’, right? If true, this observation could have very interesting implications. The last thing you should do is ignore it. And if it’s part of a trend toward higher uptimes altogether, you should be trying to understand where that trend comes from, so you can keep it going.

Then Reginald was all like “Hey Althea, you need to stop commenting on good things that happen. We only talk about our failures on this team, because talking about our successes summons vengeful spirits to corrupt our hard disks and steal our packets.”

Sure you should be skeptical of Althea’s statement, and ideally you’re keeping track of the frequency of your outages so you can confirm it. But Reginald’s attitude actively discourages rational inquiry.

And I’ve got another bone to pick with Reginald. He just told Althea that our job, in ops, is to live in constant terror of the next disaster. He’s saying that downtime just happens and that, when we have a good run of uptime, it’s just a fluke. But uptime doesn’t just happen: we work at it, we push for it, and we should celebrate when we achieve it.

Dispelling Superstition

A superstition is the belief that one event causes another without any physical process linking the two events. Crack-stepping leads to broken maternal spines; a Pennsylvanian groundhog seeing his shadow leads to aberrations in global weather patterns; you know. And Reginald's warning to Althea is nothing more than superstition: talking about good uptime, he says, causes downtime.

Reginald probably doesn’t even really believe this, but that doesn’t matter. He’s damaged the team’s culture by chiding Althea, since she’ll be less likely to vocalize her observations in the future.

Superstitions emerge from confirmation bias, which is very powerful. In the past, Reginald has probably suffered through one or two outages that occurred soon after somebody made a comment about good uptime. When the outage started, he immediately remembered the comment, and went "A-ha! I knew this would happen." So now it doesn't matter how many outages happen in the absence of such a comment, because in those cases he's not thinking about his superstition. Likewise, it doesn't matter how many of those comments get made nowhere near an outage. The only feedback Reginald's superstition gets is positive. He's like one of B.F. Skinner's pigeons.

So how do you fight superstition? With empiricism, of course!

Track your MTBF: mean time between failures. Whenever a failure event happens, give it a rough severity rating and a timestamp and stick it in a spreadsheet. Get into this habit, and soon you’ll have a solid dataset that you can average by week or by month and show to Reginald (and Althea). If you’ve been working hard to improve stability and your MTBF has increased, then you should talk about that and you should feel good about it. You can use this data to erase Reginald’s counterproductive attitude and prevent it from harming your team’s culture any further.
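If it helps to have something concrete to start from, here's a minimal sketch of that calculation, assuming a hypothetical failures.csv with timestamp and severity columns:

# Minimal sketch, assuming a hypothetical failures.csv with columns
# `timestamp` (e.g. "2013-04-05 17:21:00") and `severity`.
failures <- read.csv("failures.csv", stringsAsFactors=FALSE)
failures$timestamp <- as.POSIXct(failures$timestamp)
failures <- failures[order(failures$timestamp),]

# Hours between consecutive failures
gaps <- diff(as.numeric(failures$timestamp)) / 3600

mean(gaps)                                                   # overall MTBF
tapply(gaps, format(failures$timestamp[-1], "%Y-%m"), mean)  # MTBF by month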