Ever since the late 2000s, I’ve been implementing “alert review” processes on ops teams. As a team, we go through all the alerts we’ve received in the last week, and identify those that are bad.
But what makes an alert bad? How do we distinguish the good ones from the bad?
I use a simple framework wherein a good alert has the following properties:
Actionable: Indicates a problem for which the recipient is well placed to take immediate corrective action.
Investigable: (yes, I made this word up) Indicates a problem whose solution is not yet known by the organization.
These properties can be present or absent independently, so this framework identifies four types of alerts:
Actionability has been widely touted as a necessary condition for good alerts, and for good reason. Why send an alert about something you can’t fix? Just to make the recipient anxious?
Referring to my definition of an actionable alert:
Indicates a problem for which the recipient is well placed to take immediate corrective action.
we can see three main ways in which non-actionability shows up:
Someone is well placed to take immediate corrective action, but not the recipient. For example, an alert that indicates a problem with the Maple Syrup service instead pages the Butter team, which can’t do anything about it. In cases like these, the fix is often simple: change the alert’s recipient. Sometimes, though, you’ll first have to refactor your config to let it distinguish between Maple Syrup problems and Butter problems.
There is an action to take, but it can’t be taken immediately. For example, Apache needs to be restarted, but the recipient of the alert isn’t sure whether this will cause an outage. This type of non-actionable alert often calls for improved documentation (e.g. a “runbook” indicating steps to perform). Another example might be a disk space alert that has been slowly climbing for a while and just crossed a threshold: action can’t be taken immediately, because the team needs to agree upon an action to take.
There is no action to take. For example, “CPU utilization” or “Packet loss.” These are your classic FYI alerts. Instead of alerting, these things should appear on a dashboard for use in troubleshooting when a problem is already known to exist.
An alert is non-investigable if its implications are obvious at first glance. Here are the two most common types of non-investigable alerts:
“Chief O’Brien” alerts. If you look at your phone and instantly know the commands you have to run to fix it, that’s a “Chief O’Brien” alert. There’s no need to bother a human to fix the issue; the response should be automated.
Redundant alerts. Sometimes you get an alert for increased error rates from one of your services, and by the time you get up and get to your laptop, you’ve gotten 8 more similar alerts. The first one might well have been a perfectly good alert, but the other 8 are likely non-investigable. Whatever you learn in investigating the first one will apply to the rest in exactly the same way. The correct response to alerts like these is to use dependencies or grouping.
What to do with this framework
Like I said, I like to have my team go through all the alerts they’ve received in the last week and discuss them. Imagine a spreadsheet where each alert is a row and there are columns labeled Actionable? and Investigable?
Actually, don’t bother. I imagined one for you:
This actionability/investigability framework helps the team identify bad alerts and agree on the precise nature of their badness. And as a bonus, the data in these spreadsheets can shine a light on trends in alert quality over time.
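If you want to tally those spreadsheet columns programmatically, the four quadrants boil down to a tiny classifier. This is just my sketch of the framework, not part of the original process; the alert names and True/False judgments are hypothetical.

```python
# A minimal sketch of the alert-review spreadsheet as code. The alert
# names and the actionable/investigable judgments are made-up examples.

QUADRANTS = {
    (True, True): "good alert",
    (True, False): "automate it",   # "Chief O'Brien" alerts
    (False, True): "dashboard it",  # investigable but not actionable
    (False, False): "noise",        # neither: delete or rework
}

def review(alerts):
    """Group a week's alerts (name, actionable, investigable) by quadrant."""
    summary = {}
    for name, actionable, investigable in alerts:
        summary.setdefault(QUADRANTS[(actionable, investigable)], []).append(name)
    return summary

week = [
    ("API error rate elevated", True, True),
    ("Apache needs restart", True, False),
    ("CPU utilization high", False, False),
]
print(review(week))
```

Plotting the quadrant counts week over week is one easy way to surface those trends in alert quality.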
I’ve had a lot of success with this framework throughout the years, and I’d like to hear how it works for others. Let me know if you try it out, or if you have a different model for addressing the same sorts of questions!
CHARLIE: There are too many tickets in our backlog!
JULIET: I know, right? So many of them are old-ass feature requests that’ll never get done.
MIKE: I’ve got an idea! Why don’t we auto-delete tickets older than 6 months?
(Everyone looks at each other in shock. Confetti wipe. MIKE is being carried through the office to thunderous applause.)
A maximum ticket age seems like a good idea. Big ticket backlogs are demoralizing, and it’s impossible to get any useful information out of them. Furthermore, we know with certainty that most things in the backlog are never getting done.
The submitter of a ticket knows more about why a piece of work is important than you do, right? Why not rely on that knowledge for triage?
No. Ticket cutoff ages are silly.
Ticket cutoff ages are silly because they rest on the assumption that a reporter will notice a ticket has been closed, read the notification, evaluate the ticket’s importance in the appropriate context, and reopen the ticket if – and only if – it’s important. This assumption is wildly flawed, as anyone with an inbox full of JIRA updates can attest. The evaluation step is especially absurd: how can ticket reporters be expected to consistently place requested work in the context of your team’s ever-shifting priorities and constraints?
Ticket cutoff ages are silly because the age of a ticket is a terrible proxy metric for value. There’s a crucial difference between urgent and important. Tasks that get resolved at a young age tend to be urgent ones, which are usually associated with achieving someone else’s goals. By enforcing a maximum age, your team will naturally focus on urgent tickets to the exclusion of important ones: generally those that advance the team’s goals. A ticket’s age has little connection with its importance.
Ticket cutoff ages are silly because they don’t even solve the problem they set out to solve: that demoralizing, disorganized pile of work that’s always in your peripheral vision. Sure, some tickets get closed, but you can never explain why. Important tasks disappear according to the whims of entropy, and unimportant tasks are always mixed in with everything else. Each time you try to understand your team’s commitments, you must reevaluate the merits of every ticket in the queue.
This method of constraining a ticket backlog is nothing short of an abdication of ownership. Instead of respecting our own time and that of the ticket reporter by making an explicit, up-front decision about a ticket’s fate, we let chance and frustration shape our product.
Wouldn’t it be better to take matters into our own hands and cap the backlog?
Unbounded ticket backlogs are one of the most pernicious influences facing an engineering team. Consider:
A team’s capacity (the amount of work it’s capable of doing in a given week) is limited. As the ticket backlog grows, so grows the number of tasks that will never be finished.
Keeping tickets that will never be finished amounts to a tacit denial of the team’s limited capacity. It’s pure fantasy to imagine that every ticket in the backlog is one that the team might eventually get around to.
Keeping tickets that will never be finished hides information about the team’s capacity and agenda, making it near impossible to reason about priorities. A giant backlog offers no information about what the team considers important, and it invites stakeholders to share in the delusion that all work will eventually get done.
A large ticket backlog is just demotivating. No one even contemplates digging into the backlog because of how Sisyphean that would feel.
To top it all off, a large, unbounded ticket backlog perpetuates itself. The more tickets you have, the less empowered you feel to say no.
Why ticket backlogs grow
Like any queueing system, a ticket backlog will grow without bound if tasks arrive faster (on average) than they leave.
Every engineering team has more ideas than time. There are many things that your team should probably do, but you’ll never have enough time to do them all. Add to this the firehose of feature requests, bug reports, and maintenance tasks, and it’s easy to see why tickets arrive in a backlog like… well… a firehose.
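The queueing claim above is easy to check with toy arithmetic. Here's a sketch (my illustration; the weekly rates are invented):

```python
# If tickets arrive faster on average than they're closed, the backlog
# grows without bound; otherwise it drains. Rates here are hypothetical.

def backlog_after(weeks, arrivals_per_week, closures_per_week, start=0):
    backlog = start
    for _ in range(weeks):
        # The backlog can't go negative: you can't close tickets you don't have.
        backlog = max(0, backlog + arrivals_per_week - closures_per_week)
    return backlog

print(backlog_after(52, 12, 10))  # net +2 tickets/week for a year
print(backlog_after(52, 10, 12))  # closures outpace arrivals: backlog stays empty
```

Even a small, steady imbalance compounds: at a net +2 tickets a week, you're a hundred tickets deeper after a year.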
But we already know where tickets come from. That’s only one of two important variables. Why don’t tickets ever leave faster than they arrive?
Because we refuse to close them, of course! When a ticket represents a neat idea our team had, we don’t want to close it because the idea is still neat. And when it represents a request from a different team, we don’t feel empowered to close it because we feel it’s our team’s job to fulfill that request.
So tickets enter our system faster than we get rid of them. This usually leads to one of two situations:
Growth without bound. Your ticket queue has 4000 tickets in it, and it’s become a running joke on the team that “oh sure, we’ll get around to it when we finish the rest of the ticket queue.” Sound familiar?
Automatic closing. You write a cron job to auto-close tickets that haven’t been touched in a year, on the assumption that “if it’s important, they’ll re-open it.” Instead of making informed, thoughtful decisions about what work is most important to your team, you leave the problem to the vagaries of ticket reporters and the march of entropy.
It doesn’t have to be like this. But the solution is so obvious that most people will dismiss it out of hand.
Cap your ticket backlog.
Letting your backlog grow without bound is a form of information hiding. It serves only to obfuscate. So why not just cap its size? Say “from now on, there can’t be more than 10 tickets sitting in our backlog.”
Think of all the benefits:
You’ll force yourself to make decisions about which tasks are critical to your team’s success.
You’ll develop a sense of how much work you can realistically commit to.
You’ll know at a glance which types of tasks are in most dire need of streamlining.
You’ll understand, all at once and without summarizing, what’s in the backlog.
You’ll start looking at the backlog and thinking “I can put a dent in that.”
At first it will be hard to come to terms with the fact that most of the things in your backlog are never going to get done. But that’s a fact. The sooner you accept it, the better for everyone: not just your team, but also the people who make requests of you. It doesn’t serve anyone to pretend that you can finish every piece of work that would be nice to do.
Putting it into practice
My team at Etsy has started capping its backlog at 10 issues.
Each Tuesday morning, we meet in Slack for 15 minutes. We look at a JIRA filter that shows all unassigned tickets that haven’t been updated in the last 2 days. If there are more than 10, we prune the list.
There are a bunch of different ways we can remove a ticket from the backlog, including:
Mark for clobbering. Apply the label clobbering-time to an issue. This adds it to the list of issues that we’ll draw from during a Clobbering Time session. Clobbering Time is a monthly event where we try to knock out as many non-urgent tasks as possible. As such, Clobbering Time issues should be resolvable by a single engineer within an hour.
Move to another project. If it’s not a thing our team needs to do, move the ticket to the appropriate project.
Assign to someone. If someone’s ready to work on the issue immediately, assign it to that person.
Request more details. If the issue isn’t clearly defined enough to begin work, assign it to the creator and request more details. Make sure to un-assign the issue once enough details are received.
Close. Just close outright, with a comment explaining that we don’t have enough bandwidth.
At first, we found a lot of low-hanging fruit. There were many clobbering-friendly tickets we could remove immediately, and there were many others that simply were no longer relevant.
As we use this process more, we’re going to start having to make tougher decisions. We’ll have to choose between tasks that seem equally essential. It won’t always be pleasant, but I believe it’s necessary and we’ll be glad we did it, because ignoring those difficult decisions is worse.
Some days I feel like estimates and plans are worthless. Other days I find myself idly speculating that if only we were better at planning and estimating, everything would be beautiful.
In an engineering organization, we want realistic plans. Realistic plans and roadmaps will hypothetically help us utilize our capacity effectively and avoid overcommitting. And our plans are usually built on estimates; that’s why we believe that “correct” estimates are more likely to generate a realistic plan. Therefore, we believe that correct estimates are useful for planning.
On the other hand, if you’ve ever worked with engineers, you also believe that estimates are never correct. Something always gets in the way, or we pad the estimate out to avoid being held responsible for missing a deadline, or (occasionally) the task ends up being much simpler than we expected. The time we estimate is never the time it actually takes to do the job.
These two statements seem to be in contradiction, but they are not. There is a key to holding both these beliefs in your head at once. And I believe that if an entire engineering org were to grasp this key together, it would unlock enormous potential.
The key is to treat uncertainty as a first-class citizen.
Why This Is Hard
Developing organizational intuition about uncertainty in estimates is really hard. It is in fact so hard that I’m not aware of any organization that has done it. I see two big reasons for this:
We don’t have good awareness of the amount of uncertainty in our estimates.
We pay lip service to the idea that estimates aren’t perfect, but we plan as if they are.
1. Uncalibrated estimates
When we give estimates, those estimates aren’t calibrated to a particular uncertainty level. Everybody has different levels of confidence in their estimates, and those levels are neither discussed nor adjusted. As a result, engineers and project managers find it impossible to turn estimates into meaningful plans.
2. Planning with incorrect use of uncertainty
You’ve surely noticed uncertainty being ignored. It usually goes something like this:
PM: How soon do you think you can have the power converters re-triangulated?
ENG: I’m not sure. It could take a day, or it could take five.
PM: I understand. What do you think is most likely? Can we say three days?
ENG: Sure, three or four.
PM: Okay. I’ll put it down for four days, just to be safe.
Both sides acknowledge that there is a great deal of uncertainty in this estimate. But a number needs to get written down, so they write down 4. In this step, information about uncertainty is lost. So it shouldn’t be a surprise later when re-triangulating the power converters takes 7 days and the whole plan gets thrown off.
Suppose 5 teams are working independently on different parts of a project. Each team has said they’re 80% confident that their part will be done by the end of the year. The project manager will want to know: how likely is it that all five parts will be done by the end of the year?
When we answer questions like this, we don’t usually think in terms of probability. And even if we do think in terms of probability, it’s easy to get it wrong. If you ask non-mathy people this question (and most project managers I’ve worked with are not mathy), their intuition will be: 80%.
But that intuition is extremely misleading. What we’re dealing with is a set of 5 independent events, each with 80% probability. Imagine rolling five 5-sided dice. Or, since 5-sided dice are not really a thing, imagine spinning five 5-sided dreidels (which are also not really a thing, but are at least geometrically possible). You spin your five dreidels and if any of them comes up ש, the project fails to meet its end-of-year deadline. The probability of success, then, is:
(80%)⁵ = 80% ⋅ 80% ⋅ 80% ⋅ 80% ⋅ 80% = 32.8%
Awareness of the probabilistic nature of estimates makes it obvious that a bunch of high-confidence estimates do not add up to a high-confidence estimate.
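You can verify the arithmetic in a couple of lines, both exactly and by simulation:

```python
import random

# Exact: five independent events, each with probability 0.8.
p_all = 0.80 ** 5
print(f"exact: {p_all:.1%}")  # 32.8%

# Monte Carlo sanity check: spin five "dreidels" per trial.
rng = random.Random(0)
trials = 100_000
hits = sum(all(rng.random() < 0.8 for _ in range(5)) for _ in range(trials))
print(f"simulated: {hits / trials:.1%}")
```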
To make matters even worse, most organizations don’t even talk about how confident they are. In real life, we wouldn’t even know the uncertainties associated with our teams’ estimates (this is problem 1: uncalibrated estimates). Nobody would even be talking about probabilities. They’d be saying “All five teams said they’re pretty sure they can be done by the end of the year, so we can be pretty sure we’ll hit an end-of-year target for the project.” But they’d be dead wrong.
Why This Is Worth Fixing
I believe that poor handling of uncertainty is a major antagonist not just of technical collaboration, but of social cohesion at large. When uncertainty is not understood:
Project managers feel let down. Day in and day out they ask for reasonable estimates, and they’re left explaining to management why the estimates were all incorrect and the project is going to be late.
Engineers feel held to unreasonable expectations. They have to commit to a single number, and when that number is too low they have to work extra hard to hit it anyway. It feels like being punished for something that’s no one’s fault.
Managers feel responsible for failures. Managers are supposed to help their teams collaborate and plan, but planning without respect for uncertainty leads to failure every time. And what’s worse, if you don’t understand the role of uncertainty, you can’t learn from the failure.
Leadership makes bad investments. Plans may look certain when they’re not at all. Return on an investment often degrades quickly with delays, making what looked like a good investment on paper turn out to be a dud.
Fostering an intuition for uncertainty in an organization could help fix all these problems. I think it would be crazy not to at least try.
How To Fix It
Okay, here’s the part where I admit that I don’t quite know what to do about this problem. But I’ve got a couple ideas.
1. Uncalibrated estimates
To solve the problem of uncalibrated estimates (estimates whose uncertainty is wrong or never acknowledged), what’s needed is practice. Teams need to make 80%-confidence estimates for everything – big and small – and record those estimates somewhere. Then when the work is actually done, they can review their estimates and compare to the real-world completion time. If they didn’t get about 80% of their estimates correct, they can learn to adjust their confidence.
Estimating with a certain level of confidence is a skill that can be learned through practice. But practice, as always, takes a while and it needs to be consistent. I’m not sure how one would get an organization to put in the work, but I do believe it would pay off if they did.
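As a sketch of what that record-and-review loop might look like (the task data below is invented for illustration):

```python
# Calibration check: what fraction of tasks finished within their
# 80%-confidence estimates? All numbers here are hypothetical.

def hit_rate(records):
    """records: (estimated_days, actual_days) pairs; returns fraction on time."""
    return sum(actual <= estimate for estimate, actual in records) / len(records)

history = [
    (3, 2), (5, 7), (1, 1), (8, 6), (2, 5),
    (4, 3), (6, 9), (2, 2), (5, 4), (3, 8),
]
rate = hit_rate(history)
print(f"hit rate: {rate:.0%}  (target: 80%)")
```

If the hit rate lands well below 80%, the team's estimates are overconfident and future estimates should be widened; well above 80%, and they're sandbagging.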
2. Planning with incorrect use of uncertainty
Managers and project managers should get some training in how to manipulate estimates with uncertainty values attached.
Things get tricky when tasks need to be done one after the other instead of in parallel. But I like to think we’re not adding complexity to the process of planning; rather we’re just revealing the ways in which the current process is broken. And recognizing the limits of our knowledge is a great way to keep our batch sizes small and our planning horizons close.
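To make that concrete, here's a Monte Carlo sketch (my illustration, not a prescribed method) of how uncertainty compounds across sequential tasks. The per-task (best, most likely, worst) durations are made up:

```python
import random

# Each sequential task's duration is drawn from a made-up triangular
# distribution: (best, most likely, worst) in days.
tasks = [(1, 3, 10), (2, 4, 8), (1, 2, 12)]

def total_duration(rng):
    # random.triangular takes (low, high, mode)
    return sum(rng.triangular(low, high, mode) for low, mode, high in tasks)

rng = random.Random(42)
samples = sorted(total_duration(rng) for _ in range(10_000))
p80 = samples[int(0.8 * len(samples))]
naive = sum(mode for _, mode, _ in tasks)
print(f"sum of 'most likely' estimates: {naive} days")
print(f"80th-percentile simulated total: {p80:.1f} days")
```

Because each distribution has a long right tail, the simulated 80th percentile lands well above the sum of the "most likely" numbers a PM would write down.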
I think we need to get used to calibrating estimates before we can see our way clear to address the planning challenges.
Can It Be Fixed?
It strikes me as super hard to fix the problems I’ve outlined here in a company that already has its old habits. Perhaps a company needs to be built from the ground up with a culture of uncertainty awareness. But I hope not.
If there is a way to make uncertainty a first-class citizen, I truly believe that engineering teams will benefit hugely from it.
I was pretty happy with my talk, especially the Star Trek: The Next Generation vignette in the middle. It was a lot of ideas to pack into a single talk, but I think a lot of people got the point. However, I did give a really unsatisfactory answer (30m46s) to the first question I received. The question was:
In the differential diagnosis steps, you listed performing tests to falsify assumptions. Are you borrowing that from medicine? In tech are we only trying to falsify assumptions, or are we sometimes trying to validate them?
I didn’t have a real answer at the time, so I spouted some bullshit and moved on. But it’s a good question, and I’ve thought more about it, and I’ve come up with two (related) answers: a common-sense answer and a pretentious philosophical answer.
The Common Sense Answer
My favorite thing about differential diagnosis is that it keeps the problem-solving effort moving. There’s always something to do. If you’re out of hypotheses, you come up with new ones. If you finish a test, you update the symptoms list. It may not always be easy to make progress, but you always have a direction to go, and everybody stays on the same page.
But when you seek to confirm your hypotheses, rather than to falsify others, it’s easy to fall victim to tunnel vision. That’s when you fixate on a single idea about what could be wrong with the system. That single idea is all you can see, as if you’re looking at it through a tunnel whose walls block everything else from view.
Tunnel vision takes that benefit of differential diagnosis – the constant presence of a path forward – and negates it. You keep running tests to try to confirm your hypothesis, but you may never prove it. You may just keep getting test results that are consistent with what you believe, but that are also consistent with an infinite number of hypotheses you haven’t thought of.
A focus on falsification instead of verification can be seen as a guard against tunnel vision. You can’t get stuck on a single hypothesis if you’re constrained to falsify other ones. The more alternate hypotheses you manage to falsify, the more confident you get that you should be treating for the hypotheses that might still be right.
Now, of course, there are times when it’s possible to verify your hunch. If you have a highly specific test for a problem, then by all means try it. But in general it’s helpful to focus on knocking down hypotheses rather than propping them up.
Published in 1959 – but based on Popper’s earlier book Logik der Forschung from 1934 – The Logic Of Scientific Discovery makes a then-controversial [now widely accepted (but not universally accepted, because philosophers make cats look like sheep, herdability-wise)] claim. I’ll paraphrase the claim like so:
Science does not produce knowledge by generalizing from individual experiences to theories. Rather, science is founded on the establishment of theories that prohibit classes of events, such that the reproducible occurrence of such events may falsify the theory.
Popper was primarily arguing against a school of thought called logical positivism, whose subscribers assert that a statement is meaningful if and only if it is empirically testable. But what matters to our understanding of differential diagnosis isn’t so much Popper’s absolutely brutal takedown of logical positivism (and damn is it brutal), as it is his arguments in favor of falsifiability as the central criterion of science.
I find one particular argument enlightening on the topic of falsification in differential diagnosis. It hinges on the concept of self-contradictory statements.
There’s an important logical precept named – a little hyperbolically – the Principle of Explosion. It asserts that any statement that contradicts itself (for example, “my eyes are brown and my eyes are not brown”) implies all possible statements. In other words: if you assume that a statement and its negation are both true, then you can deduce any other statement you like. Here’s how:
Assume that the following two statements are true:
“All cats are assholes”
“There exists at least one cat that is not an asshole”
Therefore the statement “Either all cats are assholes, or 9/11 was an inside job” (we’ll call this Statement A) is true, since the part about the asshole cats is true.
However, if the statement “there exists at least one cat that is not an asshole” is true too (which we’ve assumed it is) and 9/11 were not an inside job, then Statement A would be false, since neither of its two parts would be true.
So the only way left for Statement A to be true is for “9/11 was an inside job” to be a true statement. Therefore, 9/11 was an inside job.
Wake up, sheeple.
The Principle of Explosion is the crux of one of Popper’s most convincing arguments against the Principle of Induction as the basis for scientific knowledge.
It was assumed by many philosophers of science before Popper that science relied on some undefined Principle of Induction which allowed one to generalize from a finite list of experiences to a general rule about the universe. For example, the Principle of Induction would allow one to deduce from enough statements like “I dropped a ball and it fell” and “My friend dropped a wrench and it fell” to “When things are dropped, they fall.” But Popper argued against the existence of the Principle of Induction. In particular, he pointed out that:
If there were some way to prove a general rule by demonstrating the truth of a finite number of examples of its consequences, then we would be able to deduce anything from such a set of true statements.
Right? By the Principle of Explosion, a self-contradictory statement implies the truth of all statements. If we accepted the Principle of Induction, then the same evidence that proves “When things are dropped, they fall” would also prove “All cats are assholes and there exists at least one cat that is not an asshole,” which would prove every statement we can imagine.
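The inference steps in the cats-and-9/11 argument can even be checked mechanically. Here's a sketch that enumerates every truth assignment:

```python
from itertools import product

# Check the two inference steps used in the "explosion" argument above,
# for every truth assignment of P ("all cats are assholes") and an
# arbitrary Q ("9/11 was an inside job").

def explosion_steps_valid():
    for p, q in product([True, False], repeat=2):
        statement_a = p or q
        if p and not statement_a:            # step 1: P entails (P or Q)
            return False
        if statement_a and not p and not q:  # step 2: (P or Q) and not-P entail Q
            return False
    return True

print(explosion_steps_valid())  # True
```

Both steps are valid for every assignment; the absurd conclusion only goes through because the premises P and not-P can never actually hold at once.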
So what does this have to do with falsification in differential diagnosis? Well, imagine you’ve come up with these hypotheses to explain some API slowness you’re troubleshooting:
Hypothesis Alpha: contention on the table cache is too high, so extra latency is introduced for each new table opened
Hypothesis Bravo: we’re hitting our IOPS limit on the EBS volume attached to the database server
There are many test results that would be compatible with Hypothesis Alpha. But unless you craft your tests very carefully, those same results will also be compatible with Hypothesis Bravo. Without a highly specific test for table cache contention, you can’t prove Hypothesis Alpha through a series of observations that agree with it.
What you can do, however, is try to quickly falsify Hypothesis Bravo by checking some graphs against some AWS configuration data. And if you do that, then Hypothesis Alpha is your best remaining guess. Now you can start treating for table cache contention on the one hand, and attempting the more time-consuming process (especially if it’s correct!) of falsifying Hypothesis Alpha.
Isn’t this kind of abstract?
Haha OMG yes. It’s the most abstract. But that doesn’t mean it’s not a useful idea.
If it’s your job to troubleshoot problems, you know that tunnel vision is very real. If you focus on generating alternate hypotheses and falsifying them, you can resist tunnel vision’s allure.
Note: This blog post is not about computers or math or DevOps. I like those things, and I write about them usually. But not today.
Sometimes I read something and I’m like “that can’t be right,” but then I think about it for a while and I can’t figure out why it’s not right. This happens to me especially often with arguments about moral intuition.
We humans make moral judgements and decisions all day, every day, without even thinking about it. It’s a central part of what makes us us.
Try this: pick a moral belief that you hold very firmly. For example, I went with, “It’s wrong to treat people with less respect because of their sexual orientation.” Now, try to imagine no longer believing that thing. Imagine that everything about your mind is the same, except you no longer hold to that one belief or its consequences. It’s hard, isn’t it? It really feels like you wouldn’t be the same person.
We know that our moral beliefs change over the course of our lives, and so that feeling of our selves being tightly coupled to those beliefs must be an illusion. But still, it’s very disturbing when a well constructed thought experiment forces us to reevaluate our basic moral intuitions.
What follows is a thought experiment that has such a brain-breaking effect on me. We can’t all be moral geniuses who cut right through the Trolley Problem like this kid:
Making People Happy, or Making Happy People?
On this week’s episode of Sam Harris’s Waking Up podcast, the Scottish moral philosopher and effective altruism wunderkind Will MacAskill gave a very brief but very brain-breaking argument. It’s had me scratching my chin intensely for a couple days now.
It seems intuitive to me that we, as humans, have no moral obligation to make sure that more humans come into existence in the future. After all, it’s not like you owe anything to hypothetical people who will never come into existence. There seems to me to be no way that such a moral obligation could arise. I think a lot of people would agree with this intuition.
Will MacAskill’s argument (and I’m not sure it’s his originally, but I heard it from him) goes like this. Imagine what we’ll call World A. World A has some number of people in it, living their lives. And let’s also imagine World B, which has all the same people as World A, plus another person named Harry. Everybody in World B is exactly as well-off as their counterparts in World A, and Harry’s well-being is at a 6 out of 10. He’s moderately well-off.
According to my intuition, there is no moral reason to prefer World B, in which Harry exists, to World A, in which Harry never came into being. If we say the total moral value of World A is a and the moral value of World B is b, I believe that a = b.
Alright, MacAskill says, now let’s introduce World C. This world is identical to World A, except it includes a person named Harry whose well-being is an 8 out of 10. He’s very happy almost all the time!
Now, by the same logic I used before, letting World C’s total moral value equal c, I have to say that a = c. This world with a Harry at well-being level 8 is not preferable to a world in which Harry never existed.
This puts me in a bit of a pickle. Because, by straight-up math, we know that if a = b and a = c then b = c. In other words, there’s no moral reason to prefer World C to World B. But come on! World C is exactly like World B except that Harry is better off. Obviously it’s an objectively preferable world.
I feel reductio ad absurdum‘d and it makes me very uncomfortable. Does this mean I have an obligation to have kids if I think I can give them happy lives? I don’t believe that, but I’m not sure what to believe now.
You only need to learn a little bit of queueing theory before you start getting that ecstatic “everything is connected!” high that good math always evokes. So many damn things follow the same set of abstract rules. Queueing theory lets you reason effectively about an enormous class of diverse systems, all with a tiny number of theorems.
I want to share with you the most important fundamental fact I have learned about queues. It’s counterintuitive, but once you understand it, you’ll have deeper insight into the behavior not just of CPUs and database thread pools, but also grocery store checkout lines, ticket queues, highways – really just a mind-blowing collection of systems.
Okay. Here’s the Important Thing:
As you approach maximum throughput, average queue size – and therefore average wait time – approaches infinity.
“Wat,” you say? Let’s break it down.
A queueing system is exactly what you think it is: a processor that processes tasks whenever they’re available, and a queue that holds tasks until the processor is ready to process them. When a task is processed, it leaves the system.
The Important Thing I stated above applies to a specific type of queue called an M/M/1/∞ queue. That blob of letters and numbers and symbols is a description in Kendall’s notation. I won’t go into the M/M/1 part, as it doesn’t really end up affecting the Important Thing I’m telling you about. What does matter, however, is that ‘∞’.
The ‘∞’ in “M/M/1/∞” means that the queue is unbounded. There’s no limit to the number of tasks that can be waiting in the queue at any given time.
Like most infinities, this doesn’t reflect any real-world situation. It would be like a grocery store with infinite waiting space, or an I/O scheduler that can queue infinitely many operations. But really, all we’re saying with this ‘∞’ is that the system under study never hits its maximum occupancy.
The processor (or processors) of a queueing system has a given total capacity. Capacity is the number of tasks per unit time that can be processed. For an I/O scheduler, the capacity might be measured in IOPS (I/O operations per second). For a ticket queue, it might be measured in tickets per sprint.
Consider a simple queueing system with a single processor, like the one depicted in the GIF above. If, on average, 50% of the system’s capacity is in use, that means that the processor is working on a task 50% of the time. At 0% utilization, no tasks are ever processed. At 100% utilization, the processor is always busy.
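In symbols: if tasks arrive at rate λ and the processor can handle μ tasks per unit time, utilization is ρ = λ/μ. A two-line sketch (the rates here are made up for illustration):

```python
def utilization(arrival_rate, capacity):
    """Fraction of the processor's time spent busy (rho = lambda / mu)."""
    return arrival_rate / capacity

print(utilization(5, 10))   # 0.5 -> the processor is busy half the time
print(utilization(10, 10))  # 1.0 -> the processor is always busy
```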
Let’s think about 100% utilization a bit more. Imagine a server with all 4 of its CPUs constantly crunching numbers, or a dev team that’s working on high-priority tickets every hour of the day. Or a devOp who’s so busy putting out fires that he never has time to blog.
When a system has steady-state 100% utilization, its queue will grow to infinity. It might shrink over short time scales, but over a sufficiently long time it will grow without bound.
Why is that?
If the processor is always busy, that means there’s never even an instant when a newly arriving task can be assigned immediately to the processor. That means that, whenever a task finishes processing, there must always be a task waiting in the queue; otherwise the processor would have an idle moment. And by induction, we can show that a system that’s forever at 100% utilization will exceed any finite queue size you care to pick:
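You can watch this happen with a crude discrete-time sketch (the rates and step counts below are arbitrary, and this is a toy model rather than a faithful M/M/1 simulation): when arrivals come exactly as fast as the processor can serve them, the longer you run, the deeper the queue gets.

```python
import random

random.seed(0)

def simulate_queue(steps, arrival_prob, service_prob):
    """Toy queue: each tick, a task arrives with probability arrival_prob,
    and (if any task is present) one finishes with probability service_prob.
    Returns the deepest the queue ever got."""
    depth = 0
    max_depth = 0
    for _ in range(steps):
        if random.random() < arrival_prob:
            depth += 1
        if depth > 0 and random.random() < service_prob:
            depth -= 1
        max_depth = max(max_depth, depth)
    return max_depth

# Lightly loaded: the queue stays shallow no matter how long you run.
print(simulate_queue(100_000, 0.1, 0.9))
# 100% utilization (arrivals match service): run it longer and the
# maximum depth keeps climbing past any bound you pick.
print(simulate_queue(10_000, 0.5, 0.5))
print(simulate_queue(1_000_000, 0.5, 0.5))
```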
So basically, the statements “average capacity utilization is 100%” and “the queue size is growing without bound” are equivalent. And remember: by Little’s Law, we know that average wait time is directly proportional to queue size. That means that a queueing system with 100% average capacity utilization will have wait times that grow without bound.
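As a quick worked example of Little’s Law (L = λW, average queue size equals arrival rate times average time in the system; the numbers here are hypothetical):

```python
# Little's Law: L = lambda * W. Rearranged: W = L / lambda.
arrival_rate = 4         # tickets arriving per day (hypothetical)
avg_queue_size = 20      # tickets in the system, on average (hypothetical)

avg_time_in_system = avg_queue_size / arrival_rate
print(avg_time_in_system)  # 5.0 -> each ticket spends 5 days in the system
```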
If all the developers on a team are constantly working on critical tickets, then the time it takes to complete tickets in the queue will approach infinity. If all the CPUs on a server are constantly grinding, load average will climb and climb. And so on.
Now you may be saying to yourself, “100% capacity utilization is a purely theoretical construct. There’s always some amount of idle capacity, so queues will never grow toward infinity forever.” And you’re right. But you might be surprised at how quickly things start to look ugly as you get closer to maximum throughput.
What about 90%?
The problems start well before you get to 100% utilization. Why?
Sometimes a bunch of tasks will just happen to show up at the same time. These tasks will get queued up. If there’s not much capacity available to crunch through that backlog, then it’ll probably still be around when the next random bunch of tasks shows up. The queue will just keep getting longer until there happens to be a long enough period of low activity to clear it out.
I ran a qsim simulation to illustrate this point. It simulates a single queue that can process, on average, one task per second. I ran the simulation for a range of arrival rates to get different average capacity usages. (For the wonks: I used a Poisson arrival process and exponentially distributed processing times). Here’s the plot:
Notice how things start to get a little crazy around 80% utilization. By the time you reach 96% utilization (the rightmost point on the plot), average wait times are already 20 seconds. And what if you go further and run at 99.8% utilization?
Check out that Y axis. Crazy. You don’t want any part of that.
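If you’d rather not run a simulation, the standard M/M/1 closed form for average time in the system, W = 1/(μ − λ), tells the same story (it won’t exactly match the simulated plot, which measures a finite run, but it’s in the same ballpark):

```python
# Average time in an M/M/1 system: W = 1 / (mu - lambda).
# With capacity mu = 1 task/second, watch W blow up near 100% utilization.
mu = 1.0
for rho in [0.5, 0.8, 0.9, 0.96, 0.998]:
    lam = rho * mu
    wait = 1.0 / (mu - lam)
    print(f"{rho:.1%} busy -> {wait:.0f} s average wait")
```

At 96% utilization the closed form gives 25 seconds; at 99.8% it gives 500.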
As you approach maximum throughput, average queue size – and therefore average wait time – approaches infinity.
What’s the solution, then? We only have so many variables to work with in a queueing system. Any solution for the exploding-queue-size problem is going to involve some combination of these three approaches:
1. Limit the rate at which tasks enter the system
2. Increase the system’s capacity
3. Set an upper bound for the queue size
I like number 3. It forces you to acknowledge the fact that queue size is always finite, and to think through the failure modes that a full queue will engender.
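Here’s a minimal sketch of approach 3 using Python’s standard-library `queue.Queue` (the capacity of 3 and the drop-on-full policy are arbitrary choices for illustration): when the queue is full, new tasks are rejected immediately, which is a failure mode you chose rather than unbounded latency.

```python
import queue

# A bounded queue that sheds load instead of growing forever.
tasks = queue.Queue(maxsize=3)

accepted, rejected = 0, 0
for task in range(10):
    try:
        tasks.put_nowait(task)  # raises queue.Full when at maxsize
        accepted += 1
    except queue.Full:
        rejected += 1

print(accepted, rejected)  # 3 7
```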
Stay tuned for a forthcoming blog post in which I’ll examine some of the common methods for dealing with this property of queueing systems. Until then, I hope I’ve helped you understand a very important truth about many of the systems you encounter on a daily basis.