The paradox of the bloated backlog

I want to point out a paradox you may not have noticed.

A team of software engineers or SREs invariably has more good ideas than time. We know this very well. Pick any system we own, and we’ll come up with a list of 10 things we could do to make it better. Off the top of our head.

On the other hand, when our team is confronted with the opportunity to purge some old features or enhancements out of the backlog, there’s resistance. We say, “we might get around to this some day,” or “this still needs to get done.”

These two beliefs, taken together, reveal a deep lack of team self-confidence.

If our team always has more good ideas than time, then we’re never going to implement all the good ideas in our backlog. If we add more people, we’ll just get more good ideas, and the backlog will just get more bloated.

Why are we reluctant to ruthlessly remove old tickets, then? We know that we’re constantly generating good ideas. In fact, the ideas we’re generating today are necessarily (on average) better than most of the ideas in the backlog:

  1. We have more information about the system now then we used to, so our new ideas are more aligned with real-world facts, and
  2. We have more experience as engineers now, so we have developed better intuition about what kind of interventions will create the most value.

Seen in this light, a hesitance to let go of old ideas is revealed as a symptom of a deep pathology. It means we don’t believe in our own creativity and agency. If we did, we would have easy answers to the questions we always ask when we consider closing old tickets:

  • What if we forget about this idea? That’s okay. We’ll have plenty of other, better, more relevant ideas. We never stop generating them.
  • What if this problem gets worse over time? If the risk is enough that we should prioritize this ticket over our other work, then let’s do that now. Otherwise, we can cross that bridge if we ever get to it.
  • Will the reporter of this ticket even let us close it? Nobody owns our backlog but us. When we decide to close a ticket, all we owe the reporter is an honest explanation of why.

Leave behind the paradox of the bloated backlog and start believing in your team’s own agency and creativity. Hell, maybe even cap your backlog. A team with faith in its competence is a team unleashed.

Latency- and Throughput-Optimized Clusters Under Load

It’s good to have accurate and thorough metrics for our systems. But that’s only where observability starts. In order to get value out of metrics, we have to focus on the right ones: the ones that tell us about outcomes that matter to our users.

In The Latency/Throughput Tradeoff, we talked about why a given cluster can’t be optimized for low latency and high throughput at the same time. In conclusion, we decided that separate clusters should be provisioned for latency-centric use cases and throughput-centric use cases. And since these different clusters are optimized to provide different outcomes, we’ll need to interpret their metrics in accordingly different ways.

Let’s consider an imaginary graph dashboard for each type of cluster. We’ll walk through the relationships between metrics in each cluster, both under normal conditions and under excessive load. And then we’ll wrap up with some ideas about evaluating the capacity of each cluster over the longer term.

Metrics for comparison

In order to contrast our two types of clusters, we’ll need some common metrics. I like to employ the USE metrics at the top of my graph dashboards (more on that in a future post). So let’s use those:

  • Utilization: The total CPU usage for all hosts in the cluster.
  • Saturation: The number of queued requests.
  • Error rate: The rate at which lines are being added to hosts’ error logs across the cluster.

In addition to these three metrics, we want to see a fourth metric. For the latency-optimized cluster, we want to see latency. And for the throughput-optimized cluster, we want to see throughput.

One of the best things about splitting out latency-optimized and throughput-optimized clusters is that the relationships between metrics become much clearer. It’s hard to tell when something’s wrong if you’re not sure what kind of work your cluster is supposed to be doing at any given moment. But separation of concerns allows us to develop intuition about our systems’ behavior and stop groping around in the dark.

Latency-optimized cluster

Let’s look at the relationships between these four important metrics in a latency-optimized cluster under normal, healthy conditions:

2019-03-11 blog charts latency-oriented healthyUtilization will vary over time depending on how many requests are in flight. Saturation, however, should stay at zero. Any time a request is queued, we take a latency hit.\

Error rate would ideally be zero, but come on. We should expect it to be correlated to utilization, following the mental model that all requests carry the same probability of throwing an error.

Latency, then – the metric this cluster is optimized for – should be constant. It should not be affected by utilization. If latency is correlated to utilization, then that’s a bug. There will always be some wiggle in the long tail (the high percentiles), but for a large enough workload, the median and the 10th percentile should pretty much stay put.

Now let’s see what happens when the cluster is overloaded:

2019-03-12 blog charts latency-oriented loaded

We start to see plateaus in utilization, or at least wider, flatter peaks. This means that the system is low on “slack”: the idle resources that it should always have available to serve new requests. When all resources are busy, saturation rises above zero.

Error rate may still be correlated with utilization, or it may start to do wacky things. That depends on the specifics of our application. In a latency-optimized cluster, saturation is always pathological, so it shouldn’t be surprising to see error rates climb or spike when saturation rises above zero.

Finally, we start to see consistent upward trends in latency. The higher percentiles are affected first and most dramatically. Then, as saturation rises even higher, we can see the median rise too.

Throughput-optimized cluster

The behavior of our throughput-optimized cluster, on the other hand, is pretty different. When it’s healthy, it looks like this:

2019-03-22 blog charts throughput-oriented healthy.png

Utilization – which, remember, we’re measuring via CPU usage – is no longer a fluffy cloud. Instead, it’s a series of trapezoids. When no job is in progress, utilization is at zero. When a job starts, utilization climbs up to 100% (or as close to 100% as reality allows it to get), and then stays there until the job is almost done. Eventually, the number of active tasks drops below the number of available processors, and utilization trails off back to zero.

Saturation (the number of queued tasks) follows more of a sawtooth pattern. A client puts a ton of jobs into the system, bringing utilization up to its plateau, and then as we grind through jobs, we see saturation slowly decline. When saturation reaches zero, utilization starts to drop.

Unlike that of a latency-optimized cluster, the error rate in a throughput-optimized cluster shouldn’t be sensitive to saturation. Nonzero saturation is an expected and desired condition here. Error rate is, however, expected to be follow utilization. If it bumps around at all, it should plateau, not sawtooth.

And finally, in the spot where the other cluster’s dashboard had a latency graph, we now have throughput. This we should measure in requests per second per job queued! What our customers really care about is not how many requests per second our cluster is processing, but how many of their requests per second. We’ll see why this matters so much in a bit, when we talk about this cluster’s behavior under excessive load.

Throughput should be tightly correlated to utilization and nothing else. If it also seems to exhibit a negative correlation to saturation, then that’s worth looking into: it could mean that queue management is inappropriately coupled to job processing.

Now what if our throughput-optimized cluster starts to get too loaded? What does that look like?

2019-03-23 blog charts throughput-oriented loaded

Utilization by itself doesn’t tell us anything about the cluster’s health. Qualitatively, it looks just like before: big wide trapezoids. Sure, there a bit wider than they were before, but we probably won’t notice that – especially since it’s only on average that they’re wider.

Saturation is where we really start to see the cracks. Instead of the sawtooth pattern we had before, we start to get more sawteeth-upon-sawteeth. These are jobs starting up on top of jobs that are already running. A few of these are to be expected even under healthy circumstances, but if they become more frequent, it’s an indication that there may be too much work for the cluster to handle without violating throughput SLOs.

Error rate may not budge much in the face of higher load. After all, this cluster is supposed to hold on to big queues of requests. As far as it’s concerned, nothing is wrong with that.

And that’s why we need to measure throughput in requests per second per job queued. If all we looked at was requests per second, everything would look hunky dory. But when there are two jobs queued and they’re constrained to use the same pool of processors, somebody’s throughput is going to suffer. On this throughput graph, we can see that happen right before our eyes.

Longer-term metrics

Already, we’ve seen how decoupling these two clusters gives us a much clearer mental model of their expected behavior. As we get used to the relationships between the metrics we’ve surfaced, we’ll start to build intuition. That’s huge.

But we don’t have to stop there! Having a theory (which is what we built above) also allows us to reason about long-term changes in the observable properties of these clusters as their workloads shift.

Take the throughput-optimized cluster, for instance. As our customers place more and more demand on it, we expect to see more sawteeth-upon sawtooth and longer intervals between periods of zero saturation. This latter observation is key, since wait times grow asymptotically as utilization approaches 100%. So, if we’re going to want to do evaluate the capacity needs of our throughput-optimized cluster, we should start producing these metrics on day one and put them on a dashboard:

  • Number of jobs started when another job was already running, aggregated by day. This is the “sawteeth-upon-sawteeth” number.
  • Proportion of time spent with zero saturation, aggregated by day.
  • Median (or 90th percentile) time-to-first-execution for jobs, aggregated by day. By this I mean: for each customer job, how much time passed between the first request being enqueued and the first request starting to be processed. This, especially compared with the previous metric, will show us how much our system is allowing jobs to interfere with one another.

An analogous thought process will yield useful capacity evaluation metrics for the latency-optimized cluster.

TL;DR

UnsteadyBlueAardvark-size_restricted

Separating latency- and throughput-optimized workloads doesn’t just make it easier to optimize each. It carries the added benefit of making it easier to develop a theory about our system’s behavior. And when you have a theory that’s consistent with the signal and a signal that’s interpretable within your theory, you have observability.

The Latency/Throughput Tradeoff: Why Fast Services Are Slow And Vice Versa

Special thanks to the graceful and cunning Ben Ng for consulting on this post.

I’m finally getting around to reading that DevOps* book everybody’s been raving about, Site Reliability Engineering: How Google Runs Production Systems. My verdict so far: it’s pretty good.

Here’s one of the first passages to jump out to me, from Chapter 3: Embracing Risk:

The low-latency user wants Bigtable’s request queues to be (almost always) empty so that the system can process each outstanding request immediately upon arrival. (Indeed, inefficient queuing is often a cause of high tail latency.) The user concerned with offline analysis is more interested in system throughput, so that user wants request queues to never be empty. To optimize for throughput, the Bigtable system should never need to idle while waiting for its next request.

This is a profound and general insight. When I read this passage, my last decade of abject suffering suddenly came into focus for me.

When I say “abject suffering,” I’m of course talking about ElasticSearch administration. When a storage system like ElasticSearch has to serve both high-latency and high-throughput workloads, it is guaranteed to get ugly. This fact is super important, which is why I’m devoting this blog post to exploring the relationships among latency, throughput, and capacity from a queueing perspective. I hope I can make these relationships stick in your mind like they’ve stuck in mine.

* Go ahead. Tell me DevOps and SRE aren’t the same thing. I dare you.

The tradeoff between throughput and latency

Consider a service that responds to requests. As an example, let’s say it’s a service that takes as input a picture of a dog and returns a picture of that dog wearing a silly hat.

2019-02-25 dog wearing hat
Artist’s rendering

Like almost any service (exception: Tourbillon), our service can only handle a certan number of requests per second [to put hats on dogs (RPSTPHOD)]. We’ll call this number its capacity. If we have 200 processes devoted to dog-hatting, and dogs take on average 400 milliseconds to haberdash, then the theoretical capacity of the system is

(200) / (0.4s) = 500 hats per second

Now let’s consider the two types of users that depend on our service:

  • On-the-spot dog hatters. At any given time, these users have a single dog picture that requires a hat as soon as possible. Perhaps they’re using our service to support a website that generates a single dog-hat picture per page load, and they want their page to load quickly. These users are interested primarily in how quickly they can get a hat on a dog. In a word: latency.
  • Bulk dog-hatters. These users tend to have massive data sets that they want processed as quickly as possible. The most obvious example would be a law enforcement agency wanting to compare their large database of pet photos to surveillance footage of a particular dog robbing a bank while wearing a hat. Bulk dog-hatters care not about the latency of any individual dog-hatting, but about the throughput they can achieve. In other words: how close they can get to our service’s theoretical capacity of 500 hats per second.

But here’s the problem: no single cluster of dog-hatting servers can be optimal for both types of users. And the better we make the service for one kind of user, the worse we make it for the other.

The needs of on-the-spot users

In order to minimize latency for our on-the-spot users (without dropping any of their requests), we need to make sure that there’s always a processor idle when their request comes in. If we fail to make sure of this, then new requests will have to be queued while we wait for a spot to open up, thus inflating latency. The system needs some “slack.”

2019-02-25 blog charts low-latency

Since we need slack, we don’t ever want throughput to approach capacity. The closer we get to our system’s capacity, the more drastically latencies will balloon, like I talked about in this post.

The needs of bulk users

Our bulk dog-hatters, on the other hand, don’t care so much about request latency. Some of their individual requests might take seconds, or minutes, or even hours to complete. What they care about is how quickly our service can process their entire data set. In other words, they care about getting throughput as close as possible to capacity.

This means that, whenever a job is running, bulk dog-hatters want there to be (virtually) zero slack. Every processor should be active at all times. Consequently, our queue sizes will explode as soon as the job starts, and our queues will stay occupied until the job is almost done.

2019-02-25 blog charts high-throughput

In this case, we want our queues to be full whenever there’s a bulk job running. Anything else would give sub-optimal throughput.

Splitting the cluster up

The needs of on-the-spot and bulk users are incompatible. One group needs minimal latency, while the other group needs maximal throughput.

If both of these groups are using the same cluster, we’re going to have serious problems. On-the-spot users’ latencies will vary widely depending on whether there’s currently a bulk job in progress, and bulk users’ job times will vary depending on the number of on-the-spot users currently using the system. No matter how much we scale or tweak tuning parameters, neither group will get what they need. And what’s worse, we’ll be stuck in a perpetual tug-of-war between the priorities of these two groups.

So let’s split our cluster in two: a “low latency” cluster and a “high throughput” cluster. And let’s let our users pick the right one for their use case. This way, we’ll have much clearer expectations about the performance and scaling characteristics of our service, and we’ll avoid the frustrating priority tug-of-war that characterized our mixed-use cluster.

The split doesn’t have to be complete. Instead of having two wholly separate clusters, we could have some kind of load balancer that reserves a certain portion of our fleet for low-latency traffic and slots bulk jobs onto dedicated segments of the cluster. The details of every solution will vary. What matters is that on-the-spot and bulk dog-hatters aren’t drawing on the same pool of resources.

Once we do split up our cluster, then, what should we expect the performance characteristics of the new clusters to be? What will their graph dashboards look like when they’re healthy, or near capacity, or over capacity? In an upcoming post, I’ll use some more queueing reasoning to answer these questions. So get hype for that!

[UPDATE: It’s here!]

What makes a good alert?

Ever since the late 2000s, I’ve been implementing “alert review” processes on ops teams. As a team, we go through all the alerts we’ve received in the last week, and identify those that are bad.

But what makes an alert bad? How do we distinguish the good ones from the bad?

I use a simple framework wherein a good alert has the following properties:

  • Actionable: Indicates a problem for which the recipient is well placed to take immediate corrective action.
  • Investigable: (yes, I made this word up) Indicates a problem whose solution is not yet known by the organization.

These properties can be present or absent independently, so this framework identifies four types of alerts:

alert_types

Actionability

Actionability has been widely touted as a necessary condition for good alerts, and for good reason. Why send an alert about something you can’t fix? Just to make the recipient anxious?

Referring to my definition of an actionable alert:

Indicates a problem for which the recipient is well placed to take immediate corrective action.

we can see three main ways in which non-actionability shows up:

  • Someone is well placed to take immediate corrective action, but not the recipient. For example, an alert that indicates a problem with the Maple Syrup service instead pages the Butter team, which can’t do anything about it. In cases like these, the fix is often simple: change the alert’s recipient. Sometimes, though, you’ll first have to refactor your config to let it distinguish between Maple Syrup problems and Butter problems.
  • There is an action to take, but it can’t be taken immediately. For example, Apache needs to be restarted, but the recipient of the alert isn’t sure whether this will cause an outage. This type of non-actionable alert often calls for either improved documentation (e.g. a “runbook” indicating steps to perform). Another example might be a disk space alert that has been slowly climbing for a while and just crossed a threshold: action can’t be taken immediately, because the team needs to agree upon an action to take.
  • There is no action to take. For example, “CPU utilization” or “Packet loss.” These are your classic FYI alerts. Instead of alerting, these things should appear on a dashboard for use in troubleshooting when a problem is already known to exist.

Investigability

An alert is non-investigable if its implications are obvious at first glance. Here are the two most common types of non-investigable alerts:

  • “Chief O’Brien” alerts. If you look at your phone and instantly know the commands you have to run to fix it, that’s a “Chief O’Brien” alert. There’s no need to bother a human to fix the issue; the response should be automated.
  • Redundant alerts. Sometimes you get an alert for increased error rates from one of your services, and by the time you get up and get to your laptop, you’ve gotten 8 more similar alerts. The first one might well have been a perfectly good alert, but the other 8 are likely non-investigable. Whatever you learn in investigating the first one will apply to the rest in exactly the same way. The correct response to alerts like these is to use dependencies or grouping.

What to do with this framework

Like I said, I like to have my team go through all the alerts they’ve received in the last week and discuss them. Imagine a spreadsheet where each alert is a row and there are columns labeled Actionable? and Investigable?

Actually, don’t bother. I imagined one for you:

spreadsheet.png

This actionability/investigability framework helps the team identify bad alerts and agree on the precise nature of their badness. And as a bonus, the data in these spreadsheets can shine a light on trends in alert quality over time.

I’ve had a lot of success with this framework throughout the years, and I’d like to hear how it works for others. Let me know if you try it out, or if you have a different model for addressing the same sorts of questions!

Ticket Cutoff Ages Are Silly

CHARLIE:   There are too many tickets in our backlog!
JULIET:       I know, right? So many of them are old-ass feature requests that’ll never get done.
MIKE:          I’ve got an idea! Why don’t we auto-delete tickets older than 6 months?

(Everyone looks at each other in shock. Confetti wipe. MIKE is being carried through the office to thunderous applause.)

A maximum ticket age seems like a good idea. Big ticket backlogs are demoralizing, and it’s impossible to get any useful information out of them. Furthermore, we know with certainty that most things in the backlog are never getting done.

The submitter of a ticket knows more about why a piece of work is important than you do, right? Why not rely on that knowledge for triage?

No. Ticket cutoff ages are silly.

Ticket cutoff ages are silly because they rest on the assumption that a reporter will notice a ticket has been closed, read the notification, evaluate the ticket’s importance in the appropriate context, and reopen the ticket if – and only if – it’s important. This assumption is wildly flawed, as anyone with an inbox full of JIRA updates can attest. The evaluation step is especially absurd: how can ticket reporters be expected to consistently place requested work in the context of your team’s ever-shifting priorities and constraints?

Ticket cutoff ages are silly because the age of a ticket is a terrible proxy metric for value. There’s a crucial difference between urgent and important. Tasks that get resolved at a young age tend to be urgent ones, which are usually associated with achieving someone else’s goals. By enforcing a maximum age, your team will naturally focus on urgent tickets to the exclusion of important ones: generally those that advance the team’s goals. A ticket’s age has little connection with its importance.

Ticket cutoff ages are silly because they don’t even solve the problem they set out to solve: that demoralizing, disorganized pile of work that’s always in your peripheral vision. Sure, some tickets get closed, but you can never explain why. Important tasks disappear according to the whims of entropy, and unimportant tasks are always mixed in with everything else. Each time you try to understand your team’s commitments, you must reevaluate the merits of every ticket in the queue.

This method of constraining a ticket backlog is nothing short of an abdication of ownership. Instead of respecting our own time and that of the ticket reporter by making an explicit, up-front decision about a ticket’s fate, we let chance and frustration shape our product.

Wouldn’t it be better to take matters into our own hands and cap the backlog?

Cap Your Ticket Backlog

Unbounded ticket backlogs are one of the most pernicious influences facing an engineering team. Consider:

  • A team’s capacity (the amount of work it’s capable of doing in a given week) is limited. As the ticket backlog grows, so grows the number of tasks that will never be finished.
  • Keeping tickets that will never be finished amounts to a tacit denial of the team’s limited capacity. It’s pure fantasy to imagine that every ticket in the backlog is one that the team might eventually get around to.
  • Keeping tickets that will never be finished hides information about the team’s capacity and agenda, making it near impossible to reason about priorities. A giant backlog offers no information about what the team considers important, and it invites stakeholders to share in the delusion that all work will eventually get done.
  • A large ticket backlog is just demotivating. No one even contemplates digging into the backlog because of how Sisyphean that would feel.
  • To top it all off, a large, unbounded ticket backlog perpetuates itself. The more tickets you have, the less empowered you feel to say no.

Why ticket backlogs grow

Like any queueing system, a ticket backlog will grow without bound if tasks arrive faster (on average) than they leave.

Every engineering team has more ideas than time. There are many things that your team should probably do, but you’ll never have enough time to do them all. Add to this the firehose of feature requests, bug reports, and maintenance tasks, and it’s easy to see why tickets arrive in a backlog like… well… a firehose.

But we already know where tickets come from. That’s only one of two important variables. Why don’t tickets ever leave faster than they arrive?

Because refuse to close them, of course! When a ticket represents a neat idea our team had, we don’t want to close it because the idea is still neat. And when it represents a request from a different team, we don’t feel empowered to close it because we feel it’s our team’s job to fulfill that request.

So tickets enter our system faster than we get rid of them. This usually leads to one of two situations:

  1. Growth without bound. Your ticket queue has 4000 tickets in it, and it’s become a running joke on the team that “oh sure, we’ll get around to it when we finish the rest of the ticket queue.” Sound familiar?
  2. Automatic closing. You write a cron job to auto-close tickets that haven’t been touched in a year, on the assumption that “if it’s important, they’ll re-open it.” Instead of making informed, thoughtful decisions about what work is most important to your team, you leave the problem to the vagaries of ticket reporters and the march of entropy.

It doesn’t have to be like this. But the solution is so obvious that most people will dismiss it out of hand.

Cap your ticket backlog.

Letting your backlog grow without bound is a form of information hiding. It serves only to obfuscate. So why not just cap its size? Say “from now on, there can’t be more than 10 tickets sitting in our backlog.”

Think of all the benefits:

  • You’ll force yourself to make decisions about which tasks are critical to your team’s success.
  • You’ll develop a sense of how much work you can realistically commit to.
  • You’ll know at a glance which types of tasks are in most dire need of streamlining.
  • You’ll understand, all at once and without summarizing, what’s in the backlog.
  • You’ll start looking at the backlog and thinking “I can put a dent in that.”

At first it will be hard to come to terms with the fact that most of the things in your backlog are never going to get done. But that’s a fact. The sooner you accept it, the better for everyone: not just your team, but also the people who make requests of you. It doesn’t serve anyone to pretend that you can finish every piece of work that would be nice to do.

Putting it into practice

My team at Etsy has started capping its backlog at 10 issues.

Each Tuesday morning, we meet in Slack for 15 minutes. We look at a JIRA filter that shows all unassigned tickets that haven’t been updated in the last 2 days. If there are more than 10, we prune the list.

There are a bunch of different ways we can remove a ticket from the backlog, including:

  • Mark for clobbering. Apply the label clobbering-time to an issue. This adds it to the list of issues that we’ll draw from during a Clobbering Time session. Clobbering Time is a monthly event where we try to knock out as many non-urgent tasks as possible. As such, Clobbering Time issues should be resolvable by a single engineer within an hour.
  • Move to another project. If it’s not a thing our team needs to do, move the ticket to the appropriate project.
  • Assign to someone. If someone’s ready to work on the issue immediately, assign it to that person.
  • Request more details. If the issue isn’t clearly defined enough to begin work, assign it to the creator and request more details. Make sure to un-assign the issue once enough details are received.
  • Close. Just close outright, with a comment explaining that we don’t have enough bandwidth.

At first, we found a lot of low-hanging fruit. There were many clobbering-friendly tickets we could remove immediately, and there were many others that simply were no longer relevant.

As we use this process more, we’re going to start having to make tougher decisions. We’ll have to choose between tasks that seem equally essential. It won’t always be pleasant, but I believe it’s necessary and we’ll be glad we did it, because ignoring those difficult decisions is worse.

Let’s Stop Pretending Estimates Are Exact

Some days I feel like estimates and plans are worthless. Other days I find myself idly speculating that if only we were better at planning and estimating, everything would be beautiful.

In an engineering organization, we want realistic plans. Realistic plans and roadmaps will hypothetically help us utilize our capacity effectively and avoid overcommitting. And our plans are usually built on estimates; that’s why we believe that “correct” estimates are more likely to generate a realistic plan. Therefore, we believe that correct estimates are useful for planning.

On the other hand, if you’ve ever worked with engineers, you also believe that estimates are never correct. Something always gets in the way, or we pad the estimate out to avoid being held responsible for missing a deadline, or (occasionally) the task ends up being much simpler than we expected. The time we estimate is never the time it actually takes to do the job.C7ofJ_scaled

These two statements seem to be in contradiction, but they are not. There is a key to holding both these beliefs in your head at once. And I believe that if an entire engineering org were to grasp this key together, it would unlock enormous potential.

The key is to treat uncertainty as a first-class citizen.

Why This Is Hard

Developing organizational intuition in about estimates is really hard. It is in fact so hard that I’m not aware of any organization that has done it. I see two big reasons for this:

  1. We don’t have good awareness of the amount of uncertainty in our estimates.
  2. We pay lip service to the idea that estimates aren’t perfect, but we plan as if they are.

1. Uncalibrated estimates

When we give estimates, those estimates aren’t calibrated to a particular uncertainty level. Everybody has different levels of confidence in their estimates, and those levels are neither discussed nor adjusted. As a result, engineers and project managers find it impossible to turn estimates into meaningful plans.

2. Planning with incorrect use of uncertainty

You’ve surely noticed uncertainty being ignored. It usually goes something like this:

PM: How soon do you think you can have the power converters re-triangulated?
ENG: I’m not sure. It could take a day, or it could take five.
PM: I understand. What do you think is most likely? Can we say three days?
ENG: Sure, three or four.
PM: Okay. I’ll put it down for four days, just to be safe.

Both sides acknowledge that there is a great deal of uncertainty in this estimate. But a number needs to get written down, so they write down 4. In this step, information about uncertainty is lost. So it shouldn’t be a surprise later when re-triangulating the power converters takes 7 days and the whole plan gets thrown off.

Example

Suppose 5 teams are working independently on different parts of a project. Each team has said they’re 80% confident that their part will be done by the end of the year. The project manager will want to know how likely is it that all five parts will be done by the end of the year?

When we answer questions like this, we don’t usually think in terms of probability. And even if we do think in terms of probability, it’s easy to get it wrong. If you ask non-mathy people this question (and most project managers I’ve worked with are not mathy), their intuition will be: 80%.

DreidelsBut that intuition is extremely misleading. What we’re dealing with is a set of 5 independent events, each with 80% probability. Imagine rolling five 5-sided dice. Or, since 5-sided dice are not really a thing, imagine spinning five 5-sided dreidels (which are also not really a thing, but are at least geometrically possible). You spin your five dreidels and if any of them comes up ש‎, the project fails to meet its end-of-year deadline. The probability of success, then, is:

(80%)⁵ = 80% ⋅ 80% ⋅ 80% ⋅ 80% ⋅ 80% = 32.8%

Awareness of the probabilistic nature of estimates makes it obvious that a bunch of high-confidence estimates do not add up to a high-confidence estimate.

To make matters even worse, most organizations don’t even talk about how confident they are. In real life, we wouldn’t even know the uncertainties associated with our teams’ estimates (this is problem 1: uncalibrated estimates). Nobody would even be talking about probabilities. They’d be saying “All five teams said they’re pretty sure they can be done by the end of the year, so we can be pretty sure we’ll hit an end-of-year target for the project.” But they’d be dead wrong.

Why This Is Worth Fixing

I believe that poor handling of uncertainty is a major antagonist not just of technical collaboration, but of social cohesion at large. When uncertainty is not understood:

  • Project managers feel let down. Day in and day out they ask for reasonable estimates, and they’re left explaining to management why the estimates were all incorrect and the project is going to be late.
  • Engineers feel held to unreasonable expectations. They have to commit to a single number, and when that number is too low they have to work extra hard to hit it anyway. It feels like being punished for something that’s no one’s fault.
  • Managers feel responsible for failures. Managers are supposed to help their teams collaborate and plan, but planning without respect for uncertainty leads to failure every time. And what’s worse, if you don’t understand the role of uncertainty, you can’t learn from the failure.
  • Leadership makes bad investments. Plans may look certain when they’re not at all. Return on an investment often degrades quickly with delays, making what looked like a good investment on paper turn out to be a dud.

Fostering an intuition for uncertainty in an organization could help fix all these problems. I think it would be crazy not to at least try.

How To Fix It

Okay, here’s the part where I admit that I don’t quite know what to do about this problem. But I’ve got a couple ideas.

1. Uncalibrated estimates

To solve the problem of uncalibrated estimates (estimates whose uncertainty is wrong or never acknowledged), what’s needed is practice. Teams need to make 80%-confidence estimates for everything – big and small – and record those estimates somewhere. Then when the work is actually done, they can review their estimates and compare to the real-world completion time. If they didn’t get about 80% of their estimates correct, they can learn to adjust their confidence.

Estimating with a certain level of confidence is a skill that can be learned through practice. But practice, as always, takes a while and it needs to be consistent. I’m not sure how one would get an organization to put in the work, but I do believe it would pay off if they did.

2. Planning with incorrect use of uncertainty

Managers and project managers should get some training in how to manipulate estimates with uncertainty values attached.

Things get tricky when tasks need to be done one after the other instead of in parallel. But I like to think we’re not adding complexity to the process of planning; rather we’re just revealing the ways in which the current process is broken. And recognizing the limits of our knowledge is a great way to keep our batch sizes small and our planning horizons close.

I think we need to get used to calibrating estimates before we can see our way clear to address the planning challenges.

Can It Be Fixed?

It strikes me as super hard to fix the problems I’ve outlined here in a company that already has its old habits. Perhaps a company needs to be built from the ground up with a culture of uncertainty awareness. But I hope not.

If there is a way to make uncertainty a first-class citizen, I truly believe that engineering teams will benefit hugely from it.