Wherein I Rant About Fourier Analysis

Monitorama PDX was a fantastic conference. Lots of engineers, lots of diversity, not a lot of bullshit. Jason Dixon and his minions are some top-notch conference-runners.

As someone who loves to see math everywhere, I was absolutely psyched at the mathiness of the talks at Monitorama 2014. I mean, damn. Here, watch my favorite talk: Noah Kantrowitz’s.

I studied physics in college, and I worked in computational research, so the Fourier Transform was a huge deal for me. In his talk, Noah gives some really interesting takes on the application of digital signal processing techniques to ops. I came home inspired by this talk and immediately started trying my hand at this stuff.

What Fourier analysis is

“FOUR-ee-ay” or if you want to be French about it “FOO-ree-ay” with a hint of phlegm on the “r”.

Fourier analysis is used in all sorts of fields that study waves. I learned about it when I was studying physics in college, but it’s most notably popular in digital signal processing.

It’s a thing you can do to waves. Let’s take sound waves, for instance. When you digitize a sound wave, you _sample_ it at a certain frequency: some number of times every second, you write down the strength of the wave. For sound, this sampling frequency is usually 44100 Hz (44,100 times a second).

Why 44100 Hz? Well, the highest pitch that can be heard by the human ear is around 20000 Hz, and you can only reconstitute frequencies from a digital signal at up to half the sampling rate. We don’t need to capture frequencies we can’t hear, so we don’t need to sample any faster than twice our top hearing frequency.
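Don't take my word for the half-the-sampling-rate rule. Here's a quick sanity check you can run yourself (my own toy example, not anything from a talk): sample a tone that's above the Nyquist frequency and watch the FFT report it at the wrong, "aliased" frequency.

# Sample a 30 kHz tone at 44.1 kHz. Nyquist is 22050 Hz, so a 30 kHz tone
# can't be represented faithfully: at these sample points it's
# indistinguishable from a 44100 - 30000 = 14100 Hz tone.
fs = 44100
t = seq(0, 1 - 1/fs, 1/fs)   # exactly one second of samples
x = cos(2 * pi * 30000 * t)
magnitudes = Mod(fft(x))
freqs = (seq_along(x) - 1) * fs / length(x)
# The strongest frequency below Nyquist:
freqs[which.max(magnitudes[freqs <= fs/2])]
# [1] 14100   (not 30000)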

Now what Fourier analysis lets you do is look at a given wave and determine the frequencies that were superimposed to create it. Take the example of a pure middle C (this is R code):

library(ggplot2)
signal = data.frame(t=seq(0, .1, 1/44100))
signal$a = cos(signal$t * 261.625 * 2 * pi)
qplot(data=signal, t, a, geom='line', color=I('blue'), size=I(2)) +
    ylab('Amplitude') + xlab('Time (seconds)') + ggtitle('Middle C')

Image

Pass this through a Fourier transform and you get:

fourier = data.frame(v=fft(signal$a))
# The first value of fourier$v will be a (zero) DC
# component we don't care about:
fourier = tail(fourier, nrow(fourier) - 1)
# Also, a Fourier transform contains imaginary values, which
# I'm going to ignore for the sake of this example:
fourier$v = Re(fourier$v)
# These are the frequencies represented by each value in the
# Fourier transform:
fourier$f = seq(10, 44100, 10)
# And anything over half our sampling frequency is gonna be
# garbage:
fourier = subset(fourier, f <= 22050)

qplot(data=fourier, f, v, geom='line', color=I('red'), size=I(2)) +
    ylab('Fourier Transform') + xlab('Frequency (Hz)') +
    ggtitle('Middle C Fourier Transform') + coord_cartesian(xlim=c(0,400))

Image

As you can see, there’s a spike at 261.625 Hz, which is the frequency of middle C. Why does it gradually go up, and then go negative and asymptotically come back to 0? That has to do with windowing, but let’s not worry about it. It’s an artifact of this being a numerical approximation of a Fourier Transform, rather than an analytical solution.

You can do Fourier analysis with equal success on a composition of frequencies, like a chord. Here’s a C7 chord, which consists of four notes:

signal$a = cos(signal$t * 261.625 * 2 * pi) + # C
    cos(signal$t * 329.63 * 2 * pi) + # E
    cos(signal$t * 392.00 * 2 * pi) + # G
    cos(signal$t * 466.16 * 2 * pi)   # B flat
qplot(data=signal, t, a, geom='line', color=I('blue'), size=I(2)) +
    ylab('Amplitude') + xlab('Time (seconds)') + ggtitle('C7 chord')

Image

Looking at that mess, you probably wouldn’t guess that it was a C7 chord. You probably wouldn’t even guess that it’s composed of exactly four pure tones. But Fourier analysis makes this very clear:

fourier = data.frame(v=fft(signal$a))
fourier = tail(fourier, nrow(fourier) - 1)
fourier$v = Re(fourier$v)
fourier$f = seq(10, 44100, 10)
fourier = subset(fourier, f <= 22050)

qplot(data=fourier, f, v, geom='line', color=I('red'), size=I(2)) +
    ylab('Fourier Transform') +
    xlab('Frequency (Hz)') +
    ggtitle('C7 Chord Fourier Transform') +
    coord_cartesian(xlim=c(0,400))

Image

And there are our four peaks, right at the frequencies of the four notes in a C7 chord!

Straight-up Fourier analysis on server metrics

Naturally, when I heard all these Monitorama speakers mention Fourier transforms, I got super psyched. It’s an extremely versatile technique, and I was sure that I was about to get some amazing results.

It’s been kinda disappointing.

By default, a Graphite server samples your metrics (in a manner of speaking) once every 10 seconds. That’s a sampling frequency of 0.1 Hz. So we have a Nyquist Frequency (the maximum frequency at which we can resolve signals with a Fourier transform) of half that: 0.05 Hz.

So, if our goal is to look at a Fourier transform and see important information jump out at us, we have to be looking for oscillations that occur three times a minute or less. I don’t know about you, but I find that outages and performance anomalies rarely show up as oscillations like that. And when they do, you’re going to notice them before you do a Fourier transform.

Usually we get spikes or step functions instead, which bleed into wide ranges of nearby frequencies and end up being much clearer in amplitude-space than in Fourier space. Take this example of some shit hitting some fans:

Image

If we were trying to get information from this metric with Fourier transforms, we’d be interested in the Fourier transform before and after the fan got shitty. But those transforms are much less useful than the amplitude-space data:

Image
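If you want to see that frequency-bleeding effect without any production data, here's a toy example (mine, not this metric): take a clean step function (flat, then a sudden, persistent jump) and look at its transform. The energy smears across a wide band of frequencies instead of forming a tidy spike.

library(ggplot2)
# A toy "step" metric: flat at 1, then a jump to 5 that persists.
step_signal = c(rep(1, 500), rep(5, 500))
step_fourier = data.frame(bin=1:499, v=Mod(fft(step_signal))[2:500])
qplot(data=step_fourier, bin, v, geom='line', color=I('red'), size=I(2)) +
    ylab('|Fourier Transform|') + xlab('Frequency bin') +
    ggtitle('Fourier transform of a step function')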

I haven’t been able to find the value in automatically applying Fourier transforms to server metrics. It’s a good technique for finding oscillating components of a messy signal, but unless you know that that’s what you’re looking for, I don’t think you’ll get much else out of them.

What about low-pass filters?

A low-pass filter removes the high-frequency components from a signal; one common way to build one is with a Fourier transform. One of my favorite takeaways from that Noah Kantrowitz talk was this: Nagios’s flapping detection mechanism is a low-pass filter.

If you want to alert when a threshold is exceeded — but not every time your metric goes above and below that threshold in a short period of time — you can run your metric through a low-pass filter. The high-frequency, less-valuable data will go away, and you’ll be left with a more stable signal to check against your threshold.
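To make that concrete, here's a minimal sketch of the idea in R on synthetic data (this is just the idea; it's not how Nagios actually implements flap detection): zero out the high-frequency components with an FFT, transform back, and check your threshold against the smoothed signal.

# A sketch of FFT-based low-pass filtering. x is a vector of metric samples;
# keep_below is how many low-frequency components to keep.
lowpass = function(x, keep_below) {
  transformed = fft(x)
  n = length(x)
  # Keep the DC component and the lowest frequencies, plus their mirror
  # images at the top of the spectrum (the signal is real-valued):
  keep = c(1:keep_below, (n - keep_below + 2):n)
  transformed[-keep] = 0
  # fft(..., inverse=TRUE) is unnormalized, hence the division by n.
  Re(fft(transformed, inverse=TRUE)) / n
}

# A metric that flaps across a threshold of 10:
flappy = 10 + rnorm(360, sd=2)
smoothed = lowpass(flappy, keep_below=5)
sum(diff(flappy > 10) != 0)    # crosses the threshold constantly
sum(diff(smoothed > 10) != 0)  # the smoothed signal crosses it far less often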

I haven’t tried this method of flap detection, but I suspect that the low-sampling-frequency problem would make it significantly less useful than one might hope. If you’ve seen Fourier analysis applied as a flap detection algorithm, I’d love to see it. I would eat my words, and they’d be delicious.

I hope I’m wrong

If somebody can show me a useful application of Fourier analysis to server monitoring, I will freak out with happiness. I love the concept. But until I see a concrete example of Fourier analysis doing something that couldn’t be done effectively with a much simpler algorithm, I’m skeptical.

Addendum

As Abe Stanway points out, Fourier analysis is a great tool to have in your modeling toolbox. It excels at finding seasonal (meaning periodic) oscillations in your data. Also, Abe and the Skyline team are working on adding seasonality detection to Skyline, which might use Fourier analysis to determine whether seasonal components should be used.

Theo Schlossnagle coyly suggests that Circonus uses Fourier analysis in a similar manner.

Devops needs feminism

UPDATE 2021 Oct 4 – Well hey, look at that! The clock on the wall says 2021! Time to check in on some of the old posts and think about how well they’ve aged. This will surely be uplifting! ~later~ Oh no

So, these ideas evidently meant something to me at the time I wrote them, or I wouldn’t have published them. But hell if I can really make heads or tails of this post anymore. If this idea occurred to me today, I would reject it. So, please take it with a grain of yeesh 🙏

I just returned to Minneapolis from Velocity NY bursting with ideas as always. The program was saturated with fantastic speakers, like my new ops crush Ilya Grigorik of Google. And my favorite part, as always, was the hallway track. I met dozens of brilliant, inspiring engineers. Allspaw, Souders, and Nash really know how to throw a conference.

One exhilarating thing about Velocity is the focus on culture as a driving force for business. Everybody’s in introspection mode, ready to break down their organizational culture and work new ideas into it. It reminds me of artificial competence in genetic engineering. It’s a joy to experience.

But despite all this wonderful cultural introspection, y’know what word you don’t hear? Y’know what drags up awkward silences and sometimes downright reactionary vitriol?

Feminism.

As long as we’re putting our tech culture under the microscope, why don’t we swap in a feminist lens? If you question any random geek at Velocity about gender, you can bet they’ll say “Women are just as good as men at ops,” or “I work with a female engineer, and she’s really smart!” But as soon as you say “feminism,” the barriers go up. It’s like packet loss: the crowd only hears part of what you’re saying, and they assume that there’s nothing else to hear.

We need to build feminism into our organizations, and Velocity would be a great venue for that. I’m just one engineer, and I’m not by any means a feminism expert, but I do think I can shed some light on the most common wrongnesses uttered by engineers when feminism is placed on the table.

Feminism != “Girls are better than boys”

Mention feminism to a random engineer, and you’re likely to hear some variation on:

I’m against all bias! We don’t need feminism, we just need to treat each other equally.

Feminism is often portrayed as the belief that women are superior, or that men should be punished for the inequality they’ve created. Feminism is often portrayed as man-hating.

Feminism is not that. Everyone defines it differently, but I like the definition at the Geek Feminism Wiki:

Feminism is a movement which seeks respect and equality for women both under law and culturally.

Equality. Everyone who’s not an asshole wants it, but we don’t have it yet. That’s why we need a framework in which to analyze our shortcomings, conscious and unconscious. Feminism can be that framework.

Imagine hearing an engineer say this:

Our product should perform optimally! We don’t need metrics, we just need to build a system that performs well.

Would this not be face-palmingly absurd? Of course it would. Metrics let you define your goals, demonstrate the value of your goals, and check how well you’re doing. Metrics show you where you’re losing milliseconds. Metrics are the compass and map with which you navigate the dungeon of performance.

Feminism is to equality as metrics are to performance. Without a framework for self-examination, all the best intentions in the world won’t get you any closer to an equality culture.

Wanting equality isn’t enough

When feminism comes up, you might hear yourself say something like this:

I already treat female engineers equally. Good engineers are good engineers, no matter their gender.

Hey great! The intention to treat others equally is a necessary condition for a culture of equality. But it’s not a sufficient condition.

This is akin to saying:

I’m really into performance, so our site is as fast as it can be.

You might be a performance juggernaut, but you’re just one engineer. You’re responsible for one cross-section of the product. First of all, one person doesn’t constitute a self-improving or even a self-sustaining performance culture. And even more crucially, there are performance mistakes you don’t even know you’re making!

Promoting equality in your organization requires a cultural shift, just like promoting performance. Cultural shifts happen through discourse and introspection and goal-setting — not wishing. That’s why we need to look to feminism.

If you start actively working to attack inequality in your organization, I guarantee you’ll realize you were already a feminist.

Feminism doesn’t require you to be ashamed of yourself

When your heart’s in the right place and you’re constantly examining your own actions and your organization’s, you start to notice bias and prejudice in more and more places. Most disturbingly, you notice it in yourself.

Biases are baked right into ourselves and our culture. They’re so deeply ingrained that we often don’t see or hear them anymore. Think anti-patterns and the broken windows theory. When we do notice our biases, it’s horrifying. We feel ashamed and we want to sweep them under the rug.

Seth Walker of Etsy gave an excellent talk at Velocity NY entitled “A Public Commitment to Performance.” It’s about how, rather than keeping their performance shortcomings private until everything’s fixed, Etsy makes public blog posts detailing their current performance challenges and recent performance improvements. This way, everyone at the company knows that there will be public eyes on any performance enhancement they make. It promotes a culture of excitement about improvements, rather than one of shame about failures.

When you notice biases in your organization — and moreover when others notice them — don’t hide them. Talk about them, analyze them, and figure out how to fix them. That’s the productive thing to do with software bugs and performance bottlenecks, so why not inequality?

Where to go from here

I’m kind of a feminism noob, but that won’t stop me from exploring it and talking about it. It shouldn’t stop you either. Geek Feminism is a good jumping-off point if you want to learn about feminism, and they also have a blog. @OnlyGirlInTech is a good Twitter account. I know there’s other stuff out there, so if you’ve got something relevant,  jam it in the comment section!

EDIT on 2013-10-21: Here are some links provided in the comments by Alexis Finch (thanks, Alexis Finch!)

Ada Initiative – focused on OpenSource, working to create allies as well as support women directly
http://adainitiative.org/what-we-do/workshops-and-training/

Girls Who Code – working with high school girls to teach them the skills and provide inspiration to join the tech fields
http://www.girlswhocode.com/

LadyBits – adding women’s voices to the media, covering tech and science [w/ a few men writing as well]
https://medium.com/ladybits-on-medium

Reductress – satire addressing the absurdity of women’s portrayal in the media [The Onion, feminized]
http://www.reductress.com/top-five-lip-glosses-paid-tell/

WomenWhoCode & LadiesWhoCode & PyLadies – if you want to find an expert engineer who happens to also be of the female persuasion [to speak at a conference, or to join your team] these are places to find seasoned tech folks, as well as for those new to tech to get started learning, with chapters worldwide.
http://www.meetup.com/Women-Who-Code-SF/ & https://twitter.com/WomenWhoCode
http://www.ladieswhocode.com/ & https://twitter.com/ladieswhocode
http://www.pyladies.com/ https://twitter.com/pyladies

Making a quick data visualization web-app with Shiny

Lately we’ve been getting concerned about our PHP error log. You know the story: errors start popping up, but they’re not causing problems in production, so you move on with your busy life. But you know in your heart of hearts that you should really be fixing the error.

The time has come for us to prune those errors, and I thought the first step should be, as always, to look at the data. Since it’s really the PHP developers who will know what to do with it, I thought it might be useful to make my analysis interactive. Enter Shiny: a web app framework that lets users interact directly with your data.

The first step was to massage my log data into a CSV that looks like this:

"date","error.id","error.count","access.count"
"2013-06-04","inc/foo/mario/journey.php:700",5,308733
"2013-06-04","inc/foo/mario/xenu.php:498",1,308733
"2013-06-04","inc/bar/mario/larp.php:363",14,308733
"2013-06-04","inc/nico.php:1859",3,308733
"2013-06-04","inc/spoot/heehaw.php:728",5,308733
"2013-06-04","inc/spoot/heehaw.php:735",5,308733
"2013-06-04","inc/spoot/heehaw.php:736",5,308733
"2013-06-04","inc/spoot/heehaw.php:737",5,308733
"2013-06-04","inc/spoot/heehaw.php:739",5,308733

For each date, error.id indicates the file and line on which the error occurred, error.count is how many times that error occurred on that date, and access.count is the total number of hits our app received on that date. With me so far?

Now I install Shiny (sure, this is artifice — I already had Shiny installed — but let’s pretend) at the R console like so:

install.packages('devtools')
library(devtools)
install_github('shiny', 'rstudio')
library(shiny)

And from the shell, I start a project:

mkdir portalserr
cd portalserr
cp /tmp/portalserr.csv .

Defining the UI

Now I write my app. I know what I want it to look like, so I’ll start with ui.R. Going through that bit by bit:

shinyUI(pageWithSidebar(
  headerPanel("PHP errors by time"),

I’m telling Shiny how to lay out my UI. I want a sidebar with form controls, and a header that describes the app.

  sidebarPanel(
    checkboxGroupInput("errors_shown", "Most common errors:", c(
      "davidbowie.php:50"="lib/exosite/robot/davidbowie.php:50",
      "heehaw.php:728"="inc/spoot/heehaw.php:728",
      …
      "llamas-10.php:84"="inc/widgets/llamas-10.php:84"
    )
  )),

Now we put a bunch of checkboxes on my sidebar. The first argument to checkboxGroupInput() gives the checkbox group a name. This is how server.R will refer to the checkbox contents. You’ll see.

The second argument is a label for the form control, and the last argument is a named vector (in non-R parlance, an associative array or a hash) defining the checkboxes themselves. The keys (like davidbowie.php:50) will be the labels visible in the browser, and the values are the strings that server.R will receive when the corresponding box is checked.

  mainPanel(
    plotOutput("freqPlot")
  )

We’re finished describing the sidebar, so now we describe the main section of the page. It will contain only one thing: a plot called “freqPlot”.

And that’s it for the UI! But it needs something to talk to.

Defining the server

The server goes in — surprise — server.R. Let’s walk through that.

library(ggplot2)

logfreq <- read.csv('portalserr.csv')
logfreq$date <- as.POSIXct(logfreq$date)
logfreq$perthou <- logfreq$error.count / logfreq$access.count * 10^3

We load the CSV into a data frame called logfreq and translate all the strings in the date column into POSIXct objects so that they’ll plot right.

Then we generate the perthou column, which contains the number of occurrences of a given error on a given day, per thousand requests that occurred that day.

shinyServer(function(input, output) {
  output$freqPlot <- renderPlot({

Okay now we start to see the magic that makes Shiny so easy to use: reactivity. We start declaring the server application with shinyServer(), which we pass a callback. That callback will be passed the input and output parameters.

input is a list-like object containing the values of all the inputs we defined in ui.R. Whenever the user messes with those checkboxes, the reactive blocks (what does that mean? I’ll tell you in a bit) of our callback will be re-run, and the names of any checked boxes will be in input$errors_shown.

Similarly, output is where you put the stuff you want to send back to the UI, like freqPlot.

But the coolest part of this excerpt is the last bit: renderPlot({. That curly-bracket there means that what follows is an expression: a literal block of R code that can be evaluated later. Shiny uses expressions in a very clever way: it determines which expressions depend on which input elements, and when the user messes with inputs Shiny reevaluates only the expressions that depend on the inputs that were touched! That way, if you have a complicated analysis that can be broken down into independent subroutines, you don’t have to re-run the whole thing every time a single parameter changes.

     lf.filtered <- subset(logfreq, error.id %in% input$errors_shown)

      p <- ggplot(lf.filtered) +
        geom_point(aes(date, perthou, color=error.id), size=3) +
        geom_line(aes(date, perthou, color=error.id, group=error.id), size=2) +
        expand_limits(ymin=0) +
        theme(legend.position='left') +
        ggtitle('Errors per thousand requests') +
        ylab('Errors per thousand requests') +
        xlab('Date')
      print(p)
  })
})

This logic will be reevaluated every time our checkboxes are touched. It filters the logfreq data frame down to just the errors whose boxes are checked, then makes a plot with ggplot2 and sends it to the UI.

And we’re done.
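One aside before we run it: the real payoff of reactivity shows up when you split an expensive step from a cheap one. This isn’t code from the app above, just a hypothetical sketch with a made-up input$log_scale checkbox: the expensive part goes in a reactive() expression, the cheap plotting part in renderPlot(), and Shiny caches the reactive expression so it only re-runs when the inputs it reads actually change.

library(shiny)
library(ggplot2)

shinyServer(function(input, output) {
  # Re-runs only when input$errors_shown changes:
  filtered <- reactive({
    subset(logfreq, error.id %in% input$errors_shown)
  })

  # Re-runs when either the filtered data or the (hypothetical)
  # input$log_scale checkbox changes. Toggling log_scale alone reuses
  # the cached subset() result above.
  output$freqPlot <- renderPlot({
    p <- ggplot(filtered()) +
      geom_line(aes(date, perthou, color=error.id), size=2)
    if (input$log_scale) p <- p + scale_y_log10()
    print(p)
  })
})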

Running it

From the R console, we do this:

> runApp('/path/to/portalserr')

Listening on port 3087

This automatically opens up http://localhost:3087 in a browser and presents us with our shiny new… uh… Shiny app:

Why don’t we do it in production?

Running Shiny apps straight from the R console is fine for sharing them around the office, but if you need a more robust production environment for Shiny apps (e.g. if you want to share them with the whole company or with the public), you’ll probably want to use shiny-server. If you’re putting your app behind an SSL-enabled proxy server, use the latest HEAD from Github since it contains this change.

Go forth and visualize!

Quirks are bugs

“Stop Expecting That.”

When you use a program a lot, you start to notice its quirks. If you’re a programmer yourself, you start to develop theories about why the quirks exist, and how you’d fix them if you had the time or the source. If you’re not a programmer, you just shrug and work around the quirks.

I review about 400 virtual flash cards a day in studying for Jeopardy, so I’ve really started to pick up on the quirks of the flash card software I use. One quirk in particular really bothered me: the documentation, along with the first-tier support team, claims that when cards come up for review they will be presented in a random order. But I’ve noticed that, far from being truly random, the program presents cards in bunches of 50: old cards in the first bunch, then newer and newer bunches of cards. By the time I get to my last 50 cards of the day, they’re all less than 2 weeks old.

So I submitted a bug report, complete with scatterplot demonstrating this clear pattern. I explained “I would expect the cards to be shuffled evenly, but that doesn’t appear to be the case.” And do you know what the lead developer of the project told me?

“Stop expecting that.”

Not in so many words, of course, but there you have it. The problem was not in the software; it was in my expectations.

It’s a common reaction among software developers. We think “Look, that’s just the way it works. I understand why it works that way and I can explain it to you. So, you see, it’s not really a bug.” And as frustrating as this attitude is, I can’t say I’m immune to it myself. I’m in ops, so the users of my software are usually highly technical. I can easily make them understand why a weird thing keeps happening, and they can figure out how to work around the quirk. But the “stop expecting that” attitude is wrong, and it hurts everyone’s productivity, and it makes software worse. We have to consciously reject it.

Quirks are bugs.

A bug is when the program doesn’t work the way the programmer expects.

A quirk is when the program doesn’t work the way the user expects.

What’s the difference, really? Especially in the open-source world, where every user is a potential developer, and all your developers are users?

Quirks and bugs can both be worked around, but a workaround requires the user to learn arbitrary procedures which aren’t directly useful, and which aren’t connected in any meaningful way to his mental model of the software.

Quirks and bugs both make software less useful. They make users less productive. Neglected, they necessitate a sort of oral tradition — not dissimilar from superstition — in which users pass the proper set of incantations from generation to generation. Software shouldn’t be like that.

Quirks and bugs both drive users away.

Why should we treat them differently?

Stop “Stop Expecting That”ing

I’ve made some resolutions that I hope will gradually erase the distinction in my mind between quirks and bugs.

When I hear that a user encountered an unexpected condition in my software, I will ask myself how they developed their incorrect expectation. As they’ve used the program, has it guided them toward a flawed understanding? Or have I just exposed some internal detail that should be covered up?

If I find myself explaining to a user how my software works under the hood, I will redirect my spiel toward a search for ways to abstract away those implementation details instead of requiring the user to understand them.

If users are frequently confused about a particular feature, I’ll take a step back and examine the differences between my mental model of the software and the users’ mental model of it. I’ll then adjust one or both in order to bring them into congruence.

Anything that makes me a stronger force multiplier is worth doing.

A data set analyzed by an idiot, full of noise and worry, signifying nothing

Ever since I moved to Minneapolis and started working at Exosite, I’ve been using R almost daily to get to the bottom of ops mysteries. Sometimes it pays off and sometimes it doesn’t, but it’s always interesting.

Here’s an enigma I recently had the pleasure to rub some R on.

Huh…

One fine day, after a release of Portals (our webapp that lets you work with data from your cloud-enabled devices), I noticed something odd in the server metrics. The 90th-percentile response time had increased by a factor of four!

Awry

This would be alarming by itself, but here’s what made it interesting. We have two different ways of reporting the response time of our application. We have the Apache response time metric from the access logs (which is shown above), but we also have a metric generated by the PHP application itself, through statsd. A timer is initialized as soon as the PHP script starts running, and its value is reported as a timer when it’s done. And the PHP metric, for the same time period, looked like this:

Agley

This does not follow the same pattern. Sure there’s some variability, but they’re both response time: shouldn’t they follow each other more closely? And let’s look at the 90th-percentile times for Apache (blue) and PHP (red) on a common scale:

Awhat

Oh whoops — I should have warned you not to be drinking water when you looked at that graph. Sorry if you ruined your keyboard. Apache’s 90th-percentile response time is way higher than PHP’s. Like way higher. What gives?

Let’s summarize what we know so far:

  • 90th-percentile response times, as reported by Apache, have climbed above 5 seconds, which is much higher than normal.
  • PHP response times have not experienced any such climb, so something Apache-specific is causing this behavior.
  • Median values were not affected for either metric, so this issue is only affecting a particular subset of traffic.

Now it’s time to do some R.

Munging is my middle name

This disagreement between Apache response times and PHP response times is really intriguing, and might hint us toward the origin of the issue, so let’s dig into it.

The first thing to do is pull some access log entries into R. We aggregate and parse our logs with Logstash, so each entry is a JSON blob like this (irrelevant bits snipped for brevity):

{
    "@fields": {
        "agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.37 Safari/537.36",
        "bytes": 2660,
        "clientip": "XXX.XXX.XXX.XXX",
        "hostheader": "portals.exosite.com",
        "overall_time": 0.097,
        "request": "/manage/devices",
        "response": "200",
        "servtime": 106908
    },
    "@timestamp": "2013-08-22T14:51:32.000Z",
    "@type": "apache_access"
}

The servtime field is the Apache response time in microseconds, and the overall_time field is the PHP response time in seconds. I’ve got a script that goes and grabs a time-slice of logs from the log server and converts them to CSV format, and the result looks like this:

"@timestamp","overall_time","servtime","hostheader"
"2013-08-21T13:00:49.000Z",0.083,93408,"foo.exosite.com"
"2013-08-21T13:00:48.000Z",0.173,185900,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.094,104675,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.122,131222,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.132,141991,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.096,106194,"baz.exosite.com"
"2013-08-21T13:00:46.000Z",0.136,146550,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.154,163942,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.174,184448,"bar.exosite.com"

Which is perfect for slurping into R like so:

> reqdata <- read.csv("/tmp/reqdata.csv", header=TRUE, as.is=TRUE)
> # Parse the timestamps with Hadley's lubridate package
> library(lubridate)
> reqdata$X.timestamp <- parse_date_time(reqdata$X.timestamp, "ymdHMs")
> # Remember, Apache response times are in microseconds, so we want to scale them to seconds
> reqdata$servtime <- reqdata$servtime / 10^6

And now we have a data frame called reqdata with a row for each log entry:

> head(reqdata)
X.timestamp overall_time servtime hostheader
1 2013-09-13 13:00:49 0.083 0.093408 foo.exosite.com
2 2013-09-13 13:00:48 0.173 0.185900 foo.exosite.com
3 2013-09-13 13:00:46 0.094 0.104675 bar.exosite.com
4 2013-09-13 13:00:46 0.122 0.131222 foo.exosite.com
5 2013-09-13 13:00:46 0.132 0.141991 bar.exosite.com
6 2013-09-13 13:00:46 0.096 0.106194 baz.exosite.com

The plot coagulates

Now that we have a nicely formatted data frame (we would also have accepted a pretty flower from your sweetie), we can use ggplot to check it out.

To get a handle on the Apache/PHP response time dichotomy, let’s plot one versus the other:

> library(ggplot2)
> p <- ggplot(reqdata, aes(overall_time, servtime))
> p + geom_point() +
    # A blue line showing the 1-to-1 relationship that we'd naïvely expect
    geom_abline(intercept=0, slope=1, color=I('blue')) +

    ggtitle('Apache response time versus PHP execution time for individual requests') +
    ylab('Apache response time') +
    xlab('PHP execution time')

Apache vs PHP time

So we can see here that most of the requests are doing what we expect: getting sent to PHP as soon as they come in, and getting sent back to the client as soon as they’re done. That’s what the blue line indicates: it’s a line of slope 1.

But then we’ve got this big near-vertical chunk of weird. Those are requests where PHP finished quickly (consistently under 200 milliseconds) but Apache took much longer to serve the request (up to 35 seconds). What is Apache doing for so long? Why can’t it just serve content like it’s supposed to?

We can get a big clue if we color the points by Host header (the domain requested by the browser):

p + geom_point(aes(color=hostheader))

Apache vs PHP time

Now we’re getting somewhere! Look at all that green. That means that this issue is limited to a particular domain: baz.exosite.com.

The home stretch

Okay, so now we know:

  • Requests for baz.exosite.com are sitting around in Apache for up to 35 seconds before being delivered
  • Our PHP executes very quickly for this domain — it’s only Apache that’s sucking up the time

Given that this traffic is limited to a particular domain, I’m getting the suspicion that there are other homogeneous things about it. And we can confirm this (pretend I’d pulled in the user-agent and client IP address from the logs with my initial import):

> # Filter down our dataset to just the baz.exosite.com requests
> baz <- subset(reqdata, hostheader=='baz.exosite.com')
> unique(baz$useragent)
[1] "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.37 Safari/537.36"

There’s only one browser hitting this domain. Is there only one IP too? (These IPs have been changed to protect the irrelevant)

> unique(baz$clientip)
[1] "257.8.55.154" "257.28.192.38" "257.81.101.4" "257.119.178.94"

Now let’s skip to the most neato part. I plotted Apache response time versus time over the course of 2 hours, faceted by client IP address. It looks like this:

ggplot(baz, aes(X.timestamp, servtime)) + geom_point(aes(color=clientip)) + facet_grid(clientip ~ .)

Apache vs PHP time

So what are we looking at? Here’s what I think.

Only one person, let’s call him François, uses this domain. François has a Portal with several auto-refreshing widgets on it. Also, François is on a train, and he’s using a wireless access point to browse the internet.

François has his Portal open in a tab that he’s forgotten about. It’s dutifully refreshing his widgets every few seconds, but as the train moves, François’s access point keeps switching towers. It starts getting a weak signal, it switches towers, François’s IP address therefore changes, and repeat.

PHP only logs how long it took to execute a script on the server, but Apache logs how long it takes for the client to acknowledge receipt of the final TCP segment in the response. Since François is on an ever-changing series of unreliable networks, he’s dropping packets left and right. Therefore, Apache is logging response times all over the map.

Alas

Alas, this information is useless. Sometimes that happens: sometimes you dig and dig and learn and learn and prepare your beautiful data, only to find at the last minute that it was all for naught.

But at least you learned something.

Howto: Setting up a proxy server that requires users to log in via Google Apps

I’ve spent way too much time on this in the last couple days, and the solution ended up being far too simple. I’ll spare you my agony by sticking to a straightforward howto format.

Motivation

I just set up Kibana, but of course I wanted my users to connect over HTTPS and with proper authentication. Well, I thought, we have Google Apps here at Exosite — how hard could it be to set up a gateway that uses OpenID to authenticate you with Google Apps before passing you through to the content?

The answer: it’s super easy to set up, but it’s kind of tricky to figure out at first. Luckily for you, I’ve already done the figuring out.

Step 1: Install the things

The following is all I need to install on Debian 7:

apt-get install apache2 libapache2-mod-auth-openid

Step 2: Write your Apache config

This example runs an HTTPS server with the hostname “coolstuff.danslimmon.com”, and it assumes you have an SSL cert for that domain in /etc/ssl/certs. It forces the user to log into Google Apps with an email address ending in “@danslimmon.com”, and then proxies their request to another web server listening on localhost port 5601.

If you don’t have an SSL cert and you don’t mind having your users get SSL validation warnings in their browsers, you can use the default Snakeoil certs.
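(If you do go the snakeoil route, the stock certs on Debian come from the ssl-cert package, and, if memory serves, you’d point the cert directives below at paths like these:)

    SSLCertificateFile    /etc/ssl/certs/ssl-cert-snakeoil.pem
    SSLCertificateKeyFile /etc/ssl/private/ssl-cert-snakeoil.key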

Put this file in “/etc/apache2/sites-available/openid”:

<VirtualHost *:443>
    ServerName coolstuff.danslimmon.com
    ProxyPass / http://localhost:5601/

    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/coolstuff.danslimmon.com.crt
    SSLCertificateKeyFile /etc/ssl/private/coolstuff.danslimmon.com.key
    SSLProxyEngine        on
    SSLProxyCACertificatePath /etc/ssl/certs

    # Require a danslimmon.com email address via Google's OpenID
    <Location />
        AuthType OpenID
        Require valid-user
        # Require Google's OpenID endpoint
        AuthOpenIDSingleIdP https://www.google.com/accounts/o8/id
        # Require an @danslimmon.com email address
        AuthOpenIDAXRequire email \
            http://axschema.org/contact/email @danslimmon\.com$
    </Location>

    BrowserMatch "MSIE [2-6]" \
        nokeepalive ssl-unclean-shutdown \
        downgrade-1.0 force-response-1.0
    # MSIE 7 and newer should be able to use keepalive
    BrowserMatch "MSIE [17-9]" ssl-unclean-shutdown
</VirtualHost>

Step 3: Enable the site

a2enmod ssl
a2enmod proxy
a2enmod proxy_http
a2enmod authopenid
a2ensite openid
service apache2 restart

That should do it!

Pulling data from OpenTSDB into R

Sometimes you want to jam data from OpenTSDB into R, so you can do neat things like parameterize it by time. Well, I wrote an R function that does just that!

Here it is on gist.

Here’s a usage example:

read.opentsdb("http://localhost:44242",
metrics=c("sum:dummy.linear","sum:dummy.random.unif{blah=*}"),
start="4h-ago")

I considered turning this into a CRAN package, but then I was like “naw.” There’s not very much to it, and writing CRAN packages seems like a big pain. Correct me if I’m wrong.

Anyway, enjoy!

Parameterizing metrics by time through Graphite abuse

Everybody has a metric-spewing system, like StatsD, and a graphing system, like Graphite. Graphite makes time-series plots. Time-series plots are great! But they’re not everything.

I like to know how my system responds to a given stimulus. I like to characterize my product by asking questions like:

  • When CPU usage rises, what happens to page load time?
  • How many concurrent users does it take before we start seeing decreased clicks?
  • What’s the relationship between cache hit rate and conversion rate?

In each of these cases, we’re comparing (as Theo Schlossnagle is so fond of saying) a system metric to a business metric. A system metric is something like CPU usage, or logged-in users, or cache hit rate. A business metric is something that relates more directly to your bottom-line, like page load time, click rate, or conversion rate.

Time series plots aren’t great at answering these kinds of questions. Take the first question for example: “When CPU usage rises, what happens to page load time?” Sure, you can use Graphite to superimpose the day’s load time plot over the CPU usage plot, like so:

From this chart you can see that CPU usage and page load time do both tend to be higher in the afternoon. But you’re only looking at one day’s worth of data, and you don’t know:

  • Whether this relationship holds on other days
  • How strongly correlated the two quantities are
  • What happens at lower/higher CPU usages than were observed today

To answer questions like these, what you want is a plot of page load time versus CPU usage, with time parameterized away. That is, for each moment in time, you want to plot a point for that moment. Like so:

This scatterplot tells you a lot more about your system’s characteristic response to rising CPU usage. You can easily see:

  • As CPU usage gets higher, page load time generally increases, but not linearly.
  • For a given CPU usage, there’s a line (yellow) below which page load time will not go.
  • Above a certain CPU usage (red), the relationship between load time and CPU usage becomes less strong (as evidenced by the spreading-out of the band of data points toward the right-hand side of the plot)

Time-parameterized plots like these are a great way to get to know your system. If you make an architectural change and you find that the shape of this plot has changed significantly, then you can learn a lot about the effect your change had.

But sadly, I haven’t been able to find an open-source tool that makes these plots easy to generate. So I’ll show you one somewhat ugly, but still not too time-consuming, method I devised. The gist is this:

  1. Configure a graphite instance to store every data point for the metrics in which you’re interested.
  2. Run a script to download the data and parameterize it with respect to time.
  3. Mess with it in R

Configure Graphite

The first thing we’re going to do is set up Graphite to store high-resolution data for the metrics we want to plot, going back 30 days.

“But Dan! Why do you need Graphite? Couldn’t you just have statsd write these metrics out to MongoDB or write a custom backend to print them to a flat file?”

Sure I could, hypothetical question lady, but one nice thing about Graphite is its suite of built-in data transformation functions. If I wanted, for example, to make a parameterized plot of the sum of some metric gathered from multiple sources, I could just use Graphite’s sumSeries() function, rather than having to go and find the matching data points and add them all up myself.

I’m going to assume you already have a Graphite instance set up. If I were doing this for real in production, I’d use a separate Graphite instance. But I set up a fresh disposable one for the purposes of this post, and put the following in /opt/graphite/conf/storage-schemas.conf:

[keep_nconn]
pattern = ^stats\.gauges\.nconn$
retentions = 10s:30d

[keep_resptime]
pattern = ^stats\.timers\.resptime\..*
retentions = 10s:30d

[default_blackhole]
pattern = .*
retentions = 10s:10s

This basically says: keep every data point for 30 days for stats.gauges.nconn (number of concurrent connections) and stats.timers.resptime (response times of API requests), and discard everything else.

Get parametric data

I wrote a script to print out a table of data points parameterized by time. Here it is: https://gist.github.com/danslimmon/5320247

Play with it in R

Now we can load this data into R:

data <- read.table("/path/to/output.tsv")

We can get a scatterplot immediately:

plot(data$resptime_90p ~ data$nconn)

Screen Shot 2013-04-05 at 5.21.09 PM

There’s a lot of black here, which may be hiding some behavior. Maybe we’ll get a clearer picture by looking at a histogram of the data from each quartile (of nconn):

# Split the graphing area into four sections
par(mfrow=c(2,2))
# Get quantile values
quants <- quantile(data$nconn, c(.25,.5,.75))
# Draw the histograms
hist(data$resptime_90p[data$nconn <= quants[1]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q1", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[1] & data$nconn <= quants[2]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20),  main="Q2", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[2] & data$nconn <= quants[3]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20),  main="Q3", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn >= quants[3]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q4", xlab="resptime_90p")

Which comes out like this:

Screen Shot 2013-04-05 at 7.05.07 PM

This shows us pretty intuitively how response time splays out as the number of concurrent connections rises.
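As an aside, ggplot2 (which I lean on everywhere else on this blog) will do the quartile-splitting and binning for you. Something like this gets the same four-panel picture, assuming the same column names as above:

library(ggplot2)
# Label each observation with its nconn quartile, then facet on it.
data$quartile <- cut(data$nconn, breaks=quantile(data$nconn, 0:4/4),
                     labels=paste0("Q", 1:4), include.lowest=TRUE)
ggplot(data, aes(resptime_90p)) +
  geom_histogram(binwidth=20) +
  facet_wrap(~ quartile) +
  coord_cartesian(xlim=c(0, 200))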

Calculating the expectation value of knowing Jeopardy! answers

Jeopardy players all know the most common response in the history of the show: “What is Australia?” It’s appeared over 220 times.

But, if you’re looking for the best topics to study as a potential Jeopardy contestant, the number of times a clue has appeared isn’t necessarily what you’re interested in. Suppose clues with the response “What is Australia?” are only ever worth $200, and clues with the response “What is Burundi?” are only ever worth $2000. In that case, as long as the probability of Burundi appearing in a given show is more than 1/10 the probability of Australia appearing, you’re better off, moneywise, brushing up on your Burundi facts.

This is the utility of the expectation value. It’s the probability of a given event multiplied by the value of that event happening. Here’s another way to think of it: if you played a chance game a million times (in our example, this would mean playing a million rounds of Jeopardy), the expectation value of betting on a given outcome (in our example, of studying a given country) is the average amount of money you’d win.

I want to be on Jeopardy, so to help myself prioritize what facts to learn, I calculated the expectation values of knowing every distinct response ever used in Jeopardy (courtesy of http://www.j-archive.com). Here’s my method:

  • The probability of a response appearing in a given game is the number of times that response has ever appeared, divided by the total number of clues in history, times 60 (Final Jeopardy is ignored). NOTE: See the comments for a discussion of whether this method is valid. Answer seems to be “pretty much, because the probability of any given answer appearing in a show is so miniscule.”
  • The value of giving a correct response is adjusted for modern clue values ($200 to $1000 in the Jeopardy round, $400 to $2000 in the Double Jeopardy round)
  • We add up all the adjusted values of a response’s appearance and divide by the number of occurrences to get that response’s average value, and then we multiply by its probability of appearance (there’s a rough sketch of this calculation in code just below).
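Here’s roughly what that calculation looks like in R. The column names are hypothetical (the real data came from scraping j-archive, and I’m not reproducing that here): assume a data frame clues with one row per clue ever aired, a response column, and a value column already adjusted to modern dollar values.

# Expectation value of knowing each response, per the method above.
total_clues <- nrow(clues)
expectation <- aggregate(value ~ response, data=clues, FUN=function(v) {
  avg_value <- mean(v)                        # average adjusted value
  p_appear  <- length(v) / total_clues * 60   # P(appears in a 60-clue game)
  p_appear * avg_value
})
names(expectation)[2] <- "expectation"
# The best responses to study first:
head(expectation[order(-expectation$expectation), ], 10)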

It ends up being a pretty different list! Here are the top 1000 Jeopardy! answers by expectation value: Link to Gist

Australia is still number one, but compare with the top 1000 by count: Link to Gist. There are 228 entries on each list that are missing from the other, and the order of the shared items is very different, especially further down the list.

If you’re going to study for Jeopardy, studying things in order of decreasing expectation value strikes me as more intelligent than studying in order of decreasing historical count. What do you think?

The Universe is not listening

One of the oldest forms of superstition is the belief that particular words have a magical effect. Spells, incantations, Kotodama, “om”; it goes on and on. The word “abracadabra” is thought to have come from the ancient Aramaic for “I have created through speech” [srsly].

We really seem to wish the Universe were listening to us.

And it doesn’t even have to be a benevolent force listening to our words. Think about the old “Bloody Mary” urban legend, about the names of Yahweh and Voldemort. Even the ancient Greek horror stories about insulting the gods.

Now I’m not anti-folklore, but I do think that mixing superstition with engineering is a terrible idea. It’s corrosive to the data-oriented devops culture that we should all be trying to build.

The “Now You’ve Done It!” Fallacy

You’ve probably had or heard this conversation before:

ALTHEA: Hey, if you think about it, we haven’t had an outage for quite some time.

REGINALD: Althea, don’t say that! Now we’re going to have one.

Reginald is an asshole, and he’s hurting your team. Don’t be like Reginald.

Let’s think about what happened in this exchange. First, Althea made an observation: that it’s been abnormally long since the site went down. Bangin’, right? If true, this observation could have very interesting implications. The last thing you should do is ignore it. And if it’s part of a trend toward higher uptimes altogether, you should be trying to understand where that trend comes from, so you can keep it going.

Then Reginald was all like “Hey Althea, you need to stop commenting on good things that happen. We only talk about our failures on this team, because talking about our successes summons vengeful spirits to corrupt our hard disks and steal our packets.”

Sure you should be skeptical of Althea’s statement, and ideally you’re keeping track of the frequency of your outages so you can confirm it. But Reginald’s attitude actively discourages rational inquiry.

And I’ve got another bone to pick with Reginald. He just told Althea that our job, in ops, is to live in constant terror of the next disaster. He’s saying that downtime just happens and that, when we have a good run of uptime, it’s just a fluke. But uptime doesn’t just happen: we work at it, we push for it, and we should celebrate when we achieve it.

Dispelling Superstition

A superstition is the belief that one event causes another without any physical process linking the two events. Crack-stepping leads to broken maternal spines; a Pennsylvanian groundhog seeing his shadow leads to aberrations in global weather patterns; you know. And Reginald’s warning to Althea is nothing more than superstition: talking about good uptime, he says, causes downtime.

Reginald probably doesn’t even really believe this, but that doesn’t matter. He’s damaged the team’s culture by chiding Althea, since she’ll be less likely to vocalize her observations in the future.

Superstitions emerge from confirmation bias, which is very powerful. In the past, Reginald has probably suffered through one or two outages that occurred soon after somebody made a comment about good uptime. When the outage started, he immediately remembered the comment, and went “A-ha! I knew this would happen.” So now it doesn’t matter how many outages happen in the absence of such a comment, because in those cases he’s not thinking about his superstition. Likewise, it doesn’t matter how many of those comments get made nowhere near an outage. The only feedback Reginald’s superstition gets is positive. He’s like one of B.F. Skinner’s pigeons.

So how do you fight superstition? With empiricism, of course!

Track your MTBF: mean time between failures. Whenever a failure event happens, give it a rough severity rating and a timestamp and stick it in a spreadsheet. Get into this habit, and soon you’ll have a solid dataset that you can average by week or by month and show to Reginald (and Althea). If you’ve been working hard to improve stability and your MTBF has increased, then you should talk about that and you should feel good about it. You can use this data to erase Reginald’s counterproductive attitude and prevent it from harming your team’s culture any further.
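If you want something concrete to start with, here’s a small sketch in R (a spreadsheet works just as well; the file and column names here are made up):

# failures.csv: one row per failure event, with a timestamp and a rough
# severity rating, e.g.
#   timestamp,severity
#   2013-01-04 03:22:00,2
#   2013-01-19 14:05:00,1
failures <- read.csv("failures.csv", as.is=TRUE)
failures$timestamp <- as.POSIXct(failures$timestamp)
failures <- failures[order(failures$timestamp), ]

# Mean time between failures, in days, across the whole dataset:
mean(as.numeric(diff(failures$timestamp), units="days"))

# MTBF month over month, to show Reginald the trend:
gaps <- data.frame(month=format(failures$timestamp[-1], "%Y-%m"),
                   gap_days=as.numeric(diff(failures$timestamp), units="days"))
aggregate(gap_days ~ month, data=gaps, FUN=mean)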