Lecture: Queueing theory on a cocktail napkin

On 2025/06/252025/07/03 By Dan SlimmonIn Monitoring, Problem-Solving, Statistics and probabilityLeave a comment

I couldn’t be prouder of “Queueing Theory on a Cocktail Napkin,” this lecture I gave at Antithesis in DC on June 10.

When I started to apply just a little queueing theory, it absolutely revolutionized my practice. If you’re responsible for the health of a software service, watch this talk.

Incident metrics tell you nothing about reliability

On 2023/04/042023/04/04 By Dan SlimmonIn UncategorizedLeave a comment

When an incident response process is created, there arise many voices calling for measurement. “As long as we’re creating standards for incidents, let’s track Mean-Time-To-Recovery (MTTR) and Mean-Time-To-Detection (MTTD) and Mean-time-Between-Failures (MTBF)!” they say things like.

Usually there’s no further discussion on this point. Everyone agrees that more data is better (“but let’s make sure we interpret the data correctly, of course!”). As practitioners, it’s exciting to think that these metrics will finally let us demonstrate concrete results for all the effort we put into reliability. Middle management, for its part, looks forward to having a “north star” for reliability work. No more wishy-washy “we believe this project improved system reliability substantially” hand-waving: the initiative either moved the needle on MTBF or it didn’t.

The assumption underlying all this is that incident data can shed light on a system’s reliability. But this assumption is never warranted. Analysis of your incident data may shed light on your incidents. But incident data really can’t tell you anything about your reliability.

Picture this. Your team owns a web service. You’ve set up an alert on latency. If the 90th percentile latency exceeds 100 milliseconds, your team gets paged and you start an incident.

One day, you find out that, really, if that 90th percentile latency even gets above 50 milliseconds, it’s already a problem. Downstream services start getting sluggish, and user experience suffers. “No problem,” you say, lowering the alert threshold to 50 milliseconds.

Now you’re able to observe more failures. Or, if you prefer: you’ve widened the definition of “failure” to include more events. But the system’s reliability has in no way been altered. And yet, you will have more incidents. Possibly quite a few more! Mean-time-between-failures will drop.

Picture this. Every once in a while, the database gets into a “wedged” state, where a bunch of queries are piled up and work grinds to a halt. It’s simple to un-wedge it: just manually kill the query that caused the pileup, and everything else gets flushed through the system. Each time the database gets wedged, the database team gets paged, they start an incident, they find and kill the query, and the incident is over. These incidents happen semi-regularly, but at least they’re over quickly.

After a while, someone finally implements a fix. Those problematic queries? Gone! No more pointless incidents spun up just to manually clear out a pileup. And, more importantly: no more service degradations resulting from this problem. The reliability of the system has improved!

So what happens to the incident metrics? MTBF goes up, of course. That checks out: there are fewer incidents now. But what about MTTR? Before the fix, MTTR was low. These query-pileup incidents were frequent and short, so they kept mean-time-to-recovery anchored down. After the fix – which, remember, corresponded to a direct and pure improvement in reliability – mean-time-to-recovery is higher. Whoopsy daisy!

Picture this. You have a particular customer whose use-case tests the limits of your product’s capabilities. And furthermore, this customer is, let’s say, somewhat of a squeaky wheel. Every 2 or 3 weeks, they file a support ticket about something-or-other, and the problem is usually tricky enough that an incident gets spun up.

Eventually this customer has had enough. They leave. This represents both a sizable blow to revenue and a scathing indictment of your product’s reliability at scale. But, on the bright side, both MTTR and MTBF benefit enormously! That’ll look great on the quarterly slide deck.

When someone says they want to measure incident metrics (MTTR or MTBF or any other one in the family), I always want to ask them, “What’s your model?”

Many people rush to collect data without first considering whether they can come up with a model powerful enough to interpret the data. If our model is bad, then we’ll make bad decisions even with good data. And if, as is often the case, we have no explicit model at all, then our implicit model is just, “Reliability up, MTBF up, MTTR down.” Which, as we’ve seen, doesn’t match our observations even in straightforward, common scenarios.

Don’t waste time and energy collecting metrics you’ll never be able to use. Learn as much as you can from the failures themselves, and then move on.

Outliers carry information. Don’t leave them on the table

On 2023/03/16 By Dan SlimmonIn Descriptive engineering, Monitoring, Problem-Solving3 Comments

Over a decade ago, I saw this talk by John Rauser. Only recently, though, did I come to realize how incredibly influential this talk has been on my career. Gosh what a great talk! You should watch it.

If you operate a complex system, like a SaaS app, you probably have a dashboard showing a few high-level metrics that summarize the system’s overall state. These metrics (“summary statistics”) are essential. They can reveal many kinds of gross changes (both gross “large scale” and gross “ick”) in the system’s state, over many different time scales. Very useful!

But don’t be misled. Summary statistics reveal certain patterns in the system’s behavior, but they are not identical to the system’s behavior. All summary statistics – yes, even distributions – hide information. They’re lossy. It’s easy to get lulled into the sense that, if an anomaly doesn’t show up in the summary statistics, it doesn’t matter. But a complex system’s behavior is not just curves on a plot. It’s a frothing, many-dimensional vector sum of instant-to-instant interactions.

When you investigate an anomaly in summary statistics, you’re faced with a small number of big facts. Average latency jumped by 20% at such-and-such time. Write IOPS doubled. API server queue depth started rising at some later time. Usually, you “zoom in” from there to find patterns that might explain these changes.

When you instead investigate a specific instance of anomalous behavior, you start with a large number of small facts. A request to such-and-such an endpoint with this-and-that parameter took however many seconds and crashed on line 99 of thing_doer.rb. None of these small facts tell you anything about the system’s overall behavior: this is just a single event among millions or billions or more. But, nevertheless: these small facts can be quite illuminating if you zoom out.

First of all, this probably isn’t the only time a crash like this has ever occurred. Maybe it’s happening multiple times a day. Maybe it happened twice as often this week as it did last week. Maybe it’s happening every time a specific customer makes a specific API request. Maybe that customer is fuming.

And second of all, the reason this event caught our eye in the first place was because it was anomalous. It had some extreme characteristic. Take, for example, a request that was served with very high latency. Perhaps, in the specific anomalous case before us, that extreme latency didn’t cause a problem. But how extreme could it get before it did cause a problem? If it took 20 seconds today, could it take 30 seconds next time? When it hits 30, it’ll time out and throw an error. Or, if multiple requests like this all arrived at the same time, could they exhaust some resource and interfere with other requests?

If the only anomalies you investigate are those that show up in summary statistics, then you’ll only find problems that have already gotten bad enough to move those needles. But if you dig into specific instances of anomalous behavior – “outliers” – then you can often find problems earlier, before they become crises.

Huh! as a signal

On 2023/02/062023/02/06 By Dan SlimmonIn Descriptive engineering3 Comments

Every time our system fails, and we go to analyze the failure, we find ourselves saying things like “We didn’t know X was happening,” “we didn’t know Y could happen,” and so on. And it’s true: we didn’t know those things.

We can never predict with certainty what the next system failure will be. But we can predict, because painful experience has taught us, that some or all of the causes of that failure will be surprising.

We can use that!

When we go looking at data (and by “data” I mostly mean logs, traces, metrics, and so on, but data can be many things), sometimes we see something weird, and we go like, Huh!. That Huh! is a signal. If we follow that Huh! – get to the bottom of it, figure it out, make it not surprising anymore – two things happen. First, we get a chance to correct a latent problem which might some day contribute to a failure. And second, we make our mental model that much better.

Of course, any individual Huh! could turn out to be nothing. Perhaps there’s a bug. Perhaps circumstances have shifted, and our expectations no longer line up with reality. Or perhaps it’s just a monitoring blip. We won’t know until we run it down.

But, whatever the shortcomings of any particular investigation, a habit of investigating surprises has many attractive qualities. The main one is that we get to fix problems before those problems get worse, start bouncing off other problems, and cause fires. In other words: our system runs smoother. Consider what that’s worth.

Pulling data from OpenTSDB into R

On 2013/05/072013/05/07 By Dan SlimmonIn Uncategorized4 Comments

Sometimes you want to jam data from OpenTSDB into R, so you can do neat things like parameterize it by time. Well, I wrote an R function that does just that!

Here it is on gist.

Here’s a usage example:

read.opentsdb("http://localhost:44242",
    metrics=c("sum:dummy.linear","sum:dummy.random.unif{blah=*}"),
    start="4h-ago")

I considered turning this into a CRAN package, but then I was like “naw.” There’s not very much to it, and writing CRAN packages seems like a big pain. Correct me if I’m wrong.

Anyway, enjoy!

Parameterizing metrics by time through Graphite abuse

On 2013/04/05 By Dan SlimmonIn Uncategorized1 Comment

Everybody has a metric-spewing system, like StatsD, and a graphing system, like Graphite. Graphite makes time-series plots. Time-series plots are great! But they’re not everything.

I like to know how my system responds to a given stimulus. I like to characterize my product by asking questions like:

When CPU usage rises, what happens to page load time?
How many concurrent users does it take before we start seeing decreased clicks?
What’s the relationship between cache hit rate and conversion rate?

In each of these cases, we’re comparing (as Theo Schlossnagle is so fond of saying) a system metric to a business metric. A system metric is something like CPU usage, or logged-in users, or cache hit rate. A business metric is something that relates more directly to your bottom-line, like page load time, click rate, or conversion rate.

Time series plots aren’t great at answering these kinds of questions. Take the first question for example: “When CPU usage rises, what happens to page load time?” Sure, you can use Graphite to superimpose the day’s load time plot over the CPU usage plot, like so:

From this chart you can see that CPU usage and page load time do both tend to be higher in the afternoon. But you’re only looking at one day’s worth of data, and you don’t know:

Whether this relationship holds on other days
How strongly correlated the two quantities are
What happens at lower/higher CPU usages than were observed today

To answer questions like these, what you want is a plot of page load time versus CPU usage, with time parameterized away. That is, for each moment in time, you want to plot a point for that moment. Like so:

This scatterplot tells you a lot more about your system’s characteristic response to rising CPU usage. You can easily see:

As CPU usage gets higher, page load time generally increases, but not linearly.
For a given CPU usage, there’s a line (yellow) below which page load time will not go.
Above a certain CPU usage (red), the relationship between load time and CPU usage becomes less strong (as evidenced by the spreading-out of the band of data points toward the right-hand side of the plot)

Time-parameterized plots like these are a great way to get to know your system. If you make an architectural change and you find that the shape of this plot has changed significantly, then you can learn a lot about the effect your change had.

But sadly, I haven’t been able to find an open-source tool that makes these plots easy to generate. So I’ll show you one somewhat ugly, but still not too time-consuming, method I devised. The gist is this:

Configure a graphite instance to store every data point for the metrics in which you’re interested.
Run a script to download the data and parameterize it with respect to time.
Mess with it in R

Configure Graphite

The first thing we’re going to do is set up Graphite to store high-resolution data for the metrics we want to plot, going back 30 days.

“But Dan! Why do you need Graphite? Couldn’t you just have statsd write these metrics out to MongoDB or write a custom backend to print them to a flat file?”

Sure I could, hypothetical question lady, but one nice thing about Graphite is its suite of built-in data transformation functions. If I wanted, for example, to make a parameterized plot of the sum of some metric gathered from multiple sources, I could just use Graphite’s SUM() function, rather than having to go and find the matching data points and add them all up myself.

I’m going to assume you already have a Graphite instance set up. If I were doing this for real in production, I’d use a separate Graphite instance. But I set up a fresh disposable one for the purposes of this post, and put the following in /opt/graphite/conf/storage-schemas.conf:

[keep_nconn]
pattern = ^stats\.gauges\.nconn$
retentions = 10s:30d

[keep_resptime]
pattern = ^stats\.timers\.resptime\..*
retentions = 10s:30d

[default_blackhole]
pattern = .*
retentions = 10s:10s

This basically says: keep every data point for 30 days for stats.gauges.nconn (number of concurrent connections) and stats.timers.resptime (response times of API requests), and discard everything else.

Get parametric data

I wrote a script to print out a table of data points parameterized by time. Here it is: https://gist.github.com/danslimmon/5320247

Play with it in R

Now we can load this data into R:

data <- read.table("/path/to/output.tsv")

We can get a scatterplot immediately:

plot(data$resptime_p90 ~ data$nconn)

There’s a lot of black here, which may be hiding some behavior. Maybe we’ll get a clearer picture by looking at a histogram of the data from each quartile (of nconn):

# Split the graphing area into four sections
par(mfrow=c(2,2))
# Get quantile values
quants <- quantile(data$nconn, c(.25,.5,.75))
# Draw the histograms
hist(data$resptime_90p[data$nconn <= quants[1]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q1", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[1] & data$nconn <= quants[2]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20),  main="Q2", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[2] & data$nconn <= quants[3]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20),  main="Q3", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn >= quants[3]], xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q4", xlab="resptime_90p")

Which comes out like this:

This shows us pretty intuitively how response time splays out as the number of concurrent connections rises.