A data set analyzed by an idiot, full of noise and worry, signifying nothing

Ever since I moved to Minneapolis and started working at Exosite, I’ve been using R almost daily to get to the bottom of ops mysteries. Sometimes it pays off and sometimes it doesn’t, but it’s always interesting.

Here’s an enigma I recently had the pleasure to rub some R on.

Huh…

One fine day, after a release of Portals (our webapp that lets you work with data from your cloud-enabled devices), I noticed something odd in the server metrics. The 90th-percentile response time had increased by a factor of four!

[Graph: Apache 90th-percentile response time, showing the 4x jump ("Awry")]

This would be alarming by itself, but here’s what made it interesting. We have two different ways of reporting the response time of our application. We have the Apache response time metric from the access logs (which is shown above), but we also have a metric generated by the PHP application itself, through statsd. A timer is started as soon as the PHP script begins running, and its value is reported to statsd as a timer metric when the script finishes. And the PHP metric, for the same time period, looked like this:

[Graph: PHP 90th-percentile response time for the same period ("Agley")]

This does not follow the same pattern. Sure, there’s some variability, but they’re both measuring response time: shouldn’t they track each other more closely? Let’s look at the 90th-percentile times for Apache (blue) and PHP (red) on a common scale:

[Graph: Apache (blue) and PHP (red) 90th-percentile response times on a common scale ("Awhat")]

Oh whoops — I should have warned you not to be drinking water when you looked at that graph. Sorry if you ruined your keyboard. Apache’s 90th-percentile response time is way higher than PHP’s. Like way higher. What gives?

Let’s summarize what we know so far:

  • 90th-percentile response times, as reported by Apache, have climbed above 5 seconds, which is much higher than normal.
  • PHP response times have not experienced any such climb, so something Apache-specific is causing this behavior.
  • Median values were not affected for either metric, so this issue is only affecting a particular subset of traffic.

Now it’s time to do some R.

Munging is my middle name

This disagreement between Apache response times and PHP response times is really intriguing, and might point us toward the origin of the issue, so let’s dig into it.

The first thing to do is pull some access log entries into R. We aggregate and parse our logs with Logstash, so each entry is a JSON blob like this (irrelevant bits snipped for brevity):

{
    "@fields": {
        "agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.37 Safari/537.36",
        "bytes": 2660,
        "clientip": "XXX.XXX.XXX.XXX",
        "hostheader": "portals.exosite.com",
        "overall_time": 0.097,
        "request": "/manage/devices",
        "response": "200",
        "servtime": 106908
    },
    "@timestamp": "2013-08-22T14:51:32.000Z",
    "@type": "apache_access"
}

The servtime field is the Apache response time in microseconds, and the overall_time field is the PHP response time in seconds. I’ve got a script that goes and grabs a time-slice of logs from the log server and converts them to CSV format, and the result looks like this:

"@timestamp","overall_time","servtime","hostheader"
"2013-08-21T13:00:49.000Z",0.083,93408,"foo.exosite.com"
"2013-08-21T13:00:48.000Z",0.173,185900,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.094,104675,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.122,131222,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.132,141991,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.096,106194,"baz.exosite.com"
"2013-08-21T13:00:46.000Z",0.136,146550,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.154,163942,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.174,184448,"bar.exosite.com"

That CSV is perfect for slurping into R like so:

> reqdata <- read.csv("/tmp/reqdata.csv", header=TRUE, as.is=TRUE)
> # Parse the timestamps with Hadley's lubridate package
> library(lubridate)
> reqdata$X.timestamp <- parse_date_time(reqdata$X.timestamp, "ymdHMs")
> # Remember, Apache response times are in microseconds, so we want to scale them to seconds
> reqdata$servtime <- reqdata$servtime / 10^6

And now we have a data frame called reqdata with a row for each log entry:

> head(reqdata)
X.timestamp overall_time servtime hostheader
1 2013-08-21 13:00:49 0.083 0.093408 foo.exosite.com
2 2013-08-21 13:00:48 0.173 0.185900 foo.exosite.com
3 2013-08-21 13:00:46 0.094 0.104675 bar.exosite.com
4 2013-08-21 13:00:46 0.122 0.131222 foo.exosite.com
5 2013-08-21 13:00:46 0.132 0.141991 bar.exosite.com
6 2013-08-21 13:00:46 0.096 0.106194 baz.exosite.com

The plot coagulates

Now that we have a nicely formatted data frame (we would also have accepted a pretty flower from your sweetie), we can use ggplot to check it out.

To get a handle on the Apache/PHP response time dichotomy, let’s plot one versus the other:

> library(ggplot2)
> p <- ggplot(reqdata, aes(overall_time, servtime))
> p + geom_point() +
    # A blue line showing the 1-to-1 relationship that we'd naïvely expect
    geom_abline(intercept=0, slope=1, color=I('blue')) +
    ggtitle('Apache response time versus PHP execution time for individual requests') +
    ylab('Apache response time (seconds)') +
    xlab('PHP execution time (seconds)')

[Scatterplot: Apache response time versus PHP execution time for individual requests]

So we can see here that most of the requests are doing what we expect: getting sent to PHP as soon as they come in, and getting sent back to the client as soon as they’re done. That’s what the blue line indicates: it’s a line of slope 1.

But then we’ve got this big near-vertical chunk of weird. Those are requests where PHP finished quickly (consistently under 200 milliseconds) but Apache took much longer to serve the request (up to 35 seconds). What is Apache doing for so long? Why can’t it just serve content like it’s supposed to?

We can get a big clue if we color the points by Host header (the domain requested by the browser):

> p + geom_point(aes(color=hostheader))

[Scatterplot: Apache response time versus PHP execution time, colored by Host header]

Now we’re getting somewhere! Look at all that green. That means that this issue is limited to a particular domain: baz.exosite.com.

The home stretch

Okay, so now we know:

  • Requests for baz.exosite.com are sitting around in Apache for up to 35 seconds before being delivered
  • Our PHP executes very quickly for this domain — it’s only Apache that’s sucking up the time

Given that this traffic is limited to a particular domain, I’m getting the suspicion that it’s homogeneous in other ways too. And we can confirm this (pretend I’d pulled in the user-agent and client IP address from the logs with my initial import):

> # Filter our dataset down to just the baz.exosite.com requests
> baz <- subset(reqdata, hostheader=='baz.exosite.com')
> unique(baz$useragent)
[1] "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.37 Safari/537.36"

There’s only one browser hitting this domain. Is there only one IP too? (These IPs have been changed to protect the irrelevant)

> unique(baz$clientip)
[1] "257.8.55.154" "257.28.192.38" "257.81.101.4" "257.119.178.94"

Now let’s skip to the most neato part. I plotted Apache response time versus time over the course of 2 hours, faceted by client IP address. It looks like this:

> ggplot(baz, aes(X.timestamp, servtime)) + geom_point(aes(color=clientip)) + facet_grid(clientip ~ .)

[Scatterplot: Apache response time over the two-hour window, faceted by client IP]

So what are we looking at? Here’s what I think.

Only one person, let’s call him François, uses this domain. François has a Portal with several auto-refreshing widgets on it. Also, François is on a train, and he’s using a wireless access point to browse the internet.

François has his Portal open in a tab that he’s forgotten about. It’s dutifully refreshing his widgets every few seconds, but as the train moves, François’s access point keeps switching towers. It starts getting a weak signal, it switches towers, François’s IP address therefore changes, and repeat.

PHP only logs how long it took to execute the script on the server, but Apache’s response time also includes the time spent writing the response back to the client over the network. Since François is on an ever-changing series of unreliable networks, he’s dropping packets left and right, and Apache is logging response times all over the map.

Alas

Alas, this information is useless. Sometimes that happens: sometimes you dig and dig and learn and learn and prepare your beautiful data, only to find at the last minute that it was all for naught.

But at least you learned something.

Howto: Setting up a proxy server that requires users to log in via Google Apps

I’ve spent way too much time on this in the last couple days, and the solution ended up being far too simple. I’ll spare you my agony by sticking to a straightforward howto format.

Motivation

I just set up Kibana, but of course I wanted my users to connect over HTTPS and with proper authentication. Well, I thought, we have Google Apps here at Exosite — how hard could it be to set up a gateway that uses OpenID to authenticate you with Google Apps before passing you through to the content?

The answer: it’s super easy to set up, but it’s kind of tricky to figure out at first. Luckily for you, I’ve already done the figuring out.

Step 1: Install the things

The following is all I need to install on Debian 7:

apt-get install apache2 libapache2-mod-auth-openid

Step 2: Write your Apache config

This example runs an HTTPS server with the hostname “coolstuff.danslimmon.com”, and it assumes you have an SSL cert for that domain in /etc/ssl/certs. It forces the user to log into Google Apps with an email address ending in “@danslimmon.com”, and then proxies their request to another web server listening on localhost port 5601.

If you don’t have an SSL cert and you don’t mind having your users get SSL validation warnings in their browsers, you can use the default Snakeoil certs that Debian’s ssl-cert package generates (/etc/ssl/certs/ssl-cert-snakeoil.pem and /etc/ssl/private/ssl-cert-snakeoil.key).

Put this file in “/etc/apache2/sites-available/openid”:

<VirtualHost *:443>
    ServerName coolstuff.danslimmon.com
    ProxyPass / http://localhost:5601/

    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/coolstuff.danslimmon.com.crt
    SSLCertificateKeyFile /etc/ssl/private/coolstuff.danslimmon.com.key
    SSLProxyEngine        on
    SSLProxyCACertificatePath /etc/ssl/certs

    # Require a danslimmon.com email address via Google's OpenID
    <Location />
        AuthType OpenID
        Require valid-user
        # Require Google's OpenID endpoint
        AuthOpenIDSingleIdP https://www.google.com/accounts/o8/id
        # Require an @danslimmon.com email address
        AuthOpenIDAXRequire email \
            http://axschema.org/contact/email @danslimmon\.com$
    </Location>

    BrowserMatch "MSIE [2-6]" \
        nokeepalive ssl-unclean-shutdown \
        downgrade-1.0 force-response-1.0
    # MSIE 7 and newer should be able to use keepalive
    BrowserMatch "MSIE [17-9]" ssl-unclean-shutdown
</VirtualHost>

Step 3: Enable the site

a2enmod ssl
a2enmod proxy
a2enmod proxy_http
a2enmod authopenid
a2ensite openid
service apache2 restart

That should do it!

Pulling data from OpenTSDB into R

Sometimes you want to jam data from OpenTSDB into R, so you can do neat things like parameterize it by time. Well, I wrote an R function that does just that!

Here it is on gist.

Here’s a usage example:

read.opentsdb("http://localhost:44242",
metrics=c("sum:dummy.linear","sum:dummy.random.unif{blah=*}"),
start="4h-ago")

I considered turning this into a CRAN package, but then I was like “naw.” There’s not very much to it, and writing CRAN packages seems like a big pain. Correct me if I’m wrong.

Anyway, enjoy!

Parameterizing metrics by time through Graphite abuse

Everybody has a metric-spewing system, like StatsD, and a graphing system, like Graphite. Graphite makes time-series plots. Time-series plots are great! But they’re not everything.

I like to know how my system responds to a given stimulus. I like to characterize my product by asking questions like:

  • When CPU usage rises, what happens to page load time?
  • How many concurrent users does it take before we start seeing decreased clicks?
  • What’s the relationship between cache hit rate and conversion rate?

In each of these cases, we’re comparing (as Theo Schlossnagle is so fond of saying) a system metric to a business metric. A system metric is something like CPU usage, or logged-in users, or cache hit rate. A business metric is something that relates more directly to your bottom-line, like page load time, click rate, or conversion rate.

Time series plots aren’t great at answering these kinds of questions. Take the first question for example: “When CPU usage rises, what happens to page load time?” Sure, you can use Graphite to superimpose the day’s load time plot over the CPU usage plot, like so:

From this chart you can see that CPU usage and page load time do both tend to be higher in the afternoon. But you’re only looking at one day’s worth of data, and you don’t know:

  • Whether this relationship holds on other days
  • How strongly correlated the two quantities are
  • What happens at lower/higher CPU usages than were observed today

To answer questions like these, what you want is a plot of page load time versus CPU usage, with time parameterized away. That is, for each moment in time, you plot a single point whose x-value is the CPU usage at that moment and whose y-value is the page load time at that moment. Like so:

This scatterplot tells you a lot more about your system’s characteristic response to rising CPU usage. You can easily see:

  • As CPU usage gets higher, page load time generally increases, but not linearly.
  • For a given CPU usage, there’s a line (yellow) below which page load time will not go.
  • Above a certain CPU usage (red), the relationship between load time and CPU usage becomes less strong (as evidenced by the spreading-out of the band of data points toward the right-hand side of the plot).

Time-parameterized plots like these are a great way to get to know your system. If you make an architectural change and you find that the shape of this plot has changed significantly, then you can learn a lot about the effect your change had.

But sadly, I haven’t been able to find an open-source tool that makes these plots easy to generate. So I’ll show you one somewhat ugly, but still not too time-consuming, method I devised. The gist is this:

  1. Configure a graphite instance to store every data point for the metrics in which you’re interested.
  2. Run a script to download the data and parameterize it with respect to time.
  3. Mess with it in R

Configure Graphite

The first thing we’re going to do is set up Graphite to store high-resolution data for the metrics we want to plot, going back 30 days.

“But Dan! Why do you need Graphite? Couldn’t you just have statsd write these metrics out to MongoDB or write a custom backend to print them to a flat file?”

Sure I could, hypothetical question lady, but one nice thing about Graphite is its suite of built-in data transformation functions. If I wanted, for example, to make a parameterized plot of the sum of some metric gathered from multiple sources, I could just use Graphite’s sumSeries() function, rather than having to go and find the matching data points and add them all up myself.

I’m going to assume you already have a Graphite instance set up. If I were doing this for real in production, I’d use a separate Graphite instance. But I set up a fresh disposable one for the purposes of this post, and put the following in /opt/graphite/conf/storage-schemas.conf:

[keep_nconn]
pattern = ^stats\.gauges\.nconn$
retentions = 10s:30d

[keep_resptime]
pattern = ^stats\.timers\.resptime\..*
retentions = 10s:30d

[default_blackhole]
pattern = .*
retentions = 10s:10s

This basically says: keep every data point for 30 days for stats.gauges.nconn (number of concurrent connections) and stats.timers.resptime (response times of API requests), and discard everything else.

Get parametric data

I wrote a script to print out a table of data points parameterized by time. Here it is: https://gist.github.com/danslimmon/5320247
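
If you’re curious what that might look like, here’s a rough R sketch of the same idea (it’s not the gist’s script): pull each metric from Graphite’s render API in CSV format and join the two series on timestamp. The port and the exact statsd timer name (upper_90) are assumptions; adjust them for your setup.

# Fetch one Graphite target as (timestamp, value) via the render API's CSV output
fetch_metric <- function(target, from="-30d",
                         base="http://localhost:8080/render") {
  url <- sprintf("%s?target=%s&from=%s&format=csv",
                 base, URLencode(target, reserved=TRUE), from)
  # Graphite's CSV output has no header row: target, timestamp, value
  df <- read.csv(url, header=FALSE,
                 col.names=c("target", "timestamp", "value"), as.is=TRUE)
  df[, c("timestamp", "value")]
}

nconn    <- fetch_metric("stats.gauges.nconn")
resptime <- fetch_metric("stats.timers.resptime.upper_90")

# Parameterize time away: keep one row per timestamp that both series share
data <- merge(nconn, resptime, by="timestamp")
names(data) <- c("timestamp", "nconn", "resptime_90p")

write.table(data, "/tmp/output.tsv", sep="\t", row.names=FALSE)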

Play with it in R

Now we can load this data into R:

data <- read.table("/path/to/output.tsv", header=TRUE)

We can get a scatterplot immediately:

plot(data$resptime_90p ~ data$nconn)

[Scatterplot: resptime_90p versus nconn]

There’s a lot of black here, which may be hiding some behavior. Maybe we’ll get a clearer picture by looking at a histogram of the data from each quartile (of nconn):

# Split the graphing area into four sections
par(mfrow=c(2,2))
# Get the quartile boundaries of nconn
quants <- quantile(data$nconn, c(.25,.5,.75))
# Draw a histogram of response times for each quartile of nconn
hist(data$resptime_90p[data$nconn <= quants[1]],
     xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q1", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[1] & data$nconn <= quants[2]],
     xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q2", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[2] & data$nconn <= quants[3]],
     xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q3", xlab="resptime_90p")
hist(data$resptime_90p[data$nconn > quants[3]],
     xlim=c(0, 200), ylim=c(0,70000), breaks=seq(0,2000,20), main="Q4", xlab="resptime_90p")

Which comes out like this:

[Four histograms: resptime_90p distribution within each quartile of nconn]

This shows us pretty intuitively how response time splays out as the number of concurrent connections rises.

Calculating the expectation value of knowing Jeopardy! answers

Jeopardy players all know the most common response in the history of the show: “What is Australia?” It’s appeared over 220 times.

But, if you’re looking for the best topics to study as a potential Jeopardy contestant, the number of times a clue has appeared isn’t necessarily what you’re interested in. Suppose clues with the response “What is Australia?” are only ever worth $200, and clues with the response “What is Burundi?” are only ever worth $2000. In that case, as long as the probability of Burundi appearing in a given show is more than 1/10 the probability of Australia appearing, you’re better off, moneywise, brushing up on your Burundi facts.

This is the utility of the expectation value: the probability of a given event multiplied by the payoff if that event happens. Here’s another way to think of it: if you played a chance game a million times (in our example, this would mean playing a million rounds of Jeopardy), the expectation value of betting on a given outcome (in our example, of studying a given country) is the average amount of money you’d win per game. If “What is Burundi?” showed up in 1 game out of 200 and were worth $2,000 when it did, knowing your Burundi facts would be worth $2,000 / 200 = $10 per game.

I want to be on Jeopardy, so to help myself prioritize what facts to learn, I calculated the expectation values of knowing every distinct response ever used in Jeopardy (courtesy of http://www.j-archive.com). Here’s my method:

  • The probability of a response appearing in a given game is the number of times that response has ever appeared, divided by the total number of clues in history, then multiplied by 60 (roughly the number of non-Final clues in a game; Final Jeopardy is ignored). NOTE: See the comments for a discussion of whether this method is valid. The answer seems to be “pretty much, because the probability of any given answer appearing in a show is so minuscule.”
  • The value of giving a correct response is adjusted for modern clue values ($200 to $1000 in the Jeopardy round, $400 to $2000 in the Double Jeopardy round)
  • We add up all the adjusted values of a response’s appearances and divide by the number of occurrences to get that response’s average value, then multiply by its probability of appearing in a game (see the sketch below).
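
Here’s a minimal R sketch of that calculation. The clues data frame and its column names are hypothetical stand-ins for the munged j-archive data, but the arithmetic is the same.

# Assumed input: one row per clue in history, with columns
#   response       - the correct response, e.g. "What is Australia?"
#   adjusted_value - the clue's value adjusted to modern dollar amounts
n_clues  <- nrow(clues)   # total clues in history
per_game <- 60            # non-Final clues per game

# Count and average the adjusted value of each distinct response
by_response <- aggregate(adjusted_value ~ response, data=clues,
                         FUN=function(v) c(count=length(v), avg=mean(v)))
counts <- by_response$adjusted_value[, "count"]
avgs   <- by_response$adjusted_value[, "avg"]

# P(response appears in a given game) ~= (count / total clues) * 60
p_appear    <- counts / n_clues * per_game
expectation <- p_appear * avgs

ranked <- data.frame(response=by_response$response, expectation=expectation)
head(ranked[order(-ranked$expectation), ], 10)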

It ends up being a pretty different list! Here are the top 1000 Jeopardy! answers by expectation value: Link to Gist

Australia is still number one, but compare with the top 1000 by count: Link to Gist. There are 228 entries on each list that are missing from the other, and the order of the shared items is very different, especially further down the list.

If you’re going to study for Jeopardy, studying things in order of decreasing expectation value strikes me as more intelligent than studying in order of decreasing historical count. What do you think?

The Universe is not listening

One of the oldest forms of superstition is the belief that particular words have a magical effect. Spells, incantations, Kotodama, “om”; it goes on and on. The word “abracadabra” is thought to have come from the ancient Aramaic for “I have created through speech” [srsly].

We really seem to wish the Universe were listening to us.

And it doesn’t even have to be a benevolent force listening to our words. Think about the old “Bloody Mary” urban legend, about the names of Yahweh and Voldemort. Even the ancient Greek horror stories about insulting the gods.

Now I’m not anti-folklore, but I do think that mixing superstition with engineering is a terrible idea. It’s corrosive to the data-oriented devops culture that we should all be trying to build.

The “Now You’ve Done It!” Fallacy

You’ve probably had or heard this conversation before:

ALTHEA: Hey, if you think about it, we haven’t had an outage for quite some time.

REGINALD: Althea, don’t say that! Now we’re going to have one.

Reginald is an asshole, and he’s hurting your team. Don’t be like Reginald.

Let’s think about what happened in this exchange. First, Althea made an observation: that it’s been abnormally long since the site went down. Bangin’, right? If true, this observation could have very interesting implications. The last thing you should do is ignore it. And if it’s part of a trend toward higher uptimes altogether, you should be trying to understand where that trend comes from, so you can keep it going.

Then Reginald was all like “Hey Althea, you need to stop commenting on good things that happen. We only talk about our failures on this team, because talking about our successes summons vengeful spirits to corrupt our hard disks and steal our packets.”

Sure, you should be skeptical of Althea’s statement, and ideally you’re keeping track of the frequency of your outages so you can confirm it. But Reginald’s attitude actively discourages rational inquiry.

And I’ve got another bone to pick with Reginald. He just told Althea that our job, in ops, is to live in constant terror of the next disaster. He’s saying that downtime just happens and that, when we have a good run of uptime, it’s just a fluke. But uptime doesn’t just happen: we work at it, we push for it, and we should celebrate when we achieve it.

Dispelling Superstition

A superstition is the belief that one event causes another without any physical process linking the two. Crack-stepping leads to broken maternal spines; a Pennsylvanian groundhog seeing his shadow leads to aberrations in global weather patterns; you know. And Reginald’s warning to Althea is nothing more than superstition: talking about good uptime, he says, causes downtime.

Reginald probably doesn’t even really believe this, but that doesn’t matter. He’s damaged the team’s culture by chiding Althea, since she’ll be less likely to vocalize her observations in the future.

Superstitions emerge from confirmation bias, which is very powerful. In the past, Reginald has probably suffered through one or two outages that occurred soon after somebody made a comment about good uptime. When the outage started, he immediately remembered the comment, and went “A-ha! I knew this would happen.” So now it doesn’t matter how many outages happen in the absence of such a comment, because in those cases he’s not thinking about his superstition. Likewise, it doesn’t matter how many of those comments get made nowhere near an outage. The only feedback Reginald’s superstition gets is positive. He’s like one of B.F. Skinner’s pigeons.

So how do you fight superstition? With empiricism, of course!

Track your MTBF: mean time between failures. Whenever a failure event happens, give it a rough severity rating and a timestamp and stick it in a spreadsheet. Get into this habit, and soon you’ll have a solid dataset that you can average by week or by month and show to Reginald (and Althea). If you’ve been working hard to improve stability and your MTBF has increased, then you should talk about that and you should feel good about it. You can use this data to erase Reginald’s counterproductive attitude and prevent it from harming your team’s culture any further.
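
If a spreadsheet isn’t your thing, the bookkeeping is tiny in R too. Here’s a sketch, assuming a hypothetical failures.csv with a timestamp column (in “YYYY-MM-DD HH:MM:SS” form) and a severity column:

# Load the failure log and sort it by time
failures <- read.csv("failures.csv", as.is=TRUE)
failures$timestamp <- as.POSIXct(failures$timestamp, tz="UTC")
failures <- failures[order(failures$timestamp), ]

# Time between consecutive failures, in hours
tbf <- diff(as.numeric(failures$timestamp)) / 3600

# Mean time between failures, grouped by the month in which each gap ended
month <- format(failures$timestamp[-1], "%Y-%m")
aggregate(tbf, by=list(month=month), FUN=mean)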