Thanks very much to host Stephen Townshend of the Slight Reliability podcast. We talked about incident response, diagnosis, and looking for trouble. It was very chill!
Full 28-minute episode:
Over my career as an SRE, I’ve diagnosed thousands of production issues. When I’m lucky, I have enough information at hand, and enough expertise in the systems involved, to get to the bottom of the problem on my own. But very often I need to bring together a team.
Troubleshooting with a team unleashes enormous power. Problems that would have taken me days to solve as an individual might take only hours or minutes, thanks to the benefit of pooled knowledge.
However, collaborative troubleshooting also comes with dangers. Time and again, I’ve seen groups struggle to make progress on an investigation due to miscommunication, misalignment, and confusion. Among other difficulties, the fundamental common ground breakdown can have especially heinous consequences in this context.
Over the years, I’ve developed a reliable method for harnessing the diagnostic power of groups. My approach is derived from a different field in which groups of experts with various levels of context need to reason together about problems in a complex, dynamic system: medicine.
I call this method clinical troubleshooting.
Although clinical troubleshooting can be useful in solo troubleshooting, it really shines as a group activity. It’s a lightweight structure that always adds value. I recommend reaching for clinical troubleshooting as soon as you need to involve another person in any “why” question about an unexpected behavior of your system.
Before you start, gather the people you’ll be troubleshooting with. Any synchronous communication channel can work for this: Slack, Zoom, a meeting room; you name it.
You don’t need a big group. In fact, a small group is best. What matters most is that you bring together diverse perspectives. If you’re a backend engineer, try to pull in a network engineer and a frontend engineer, or a support agent and a sysadmin. Cast a wide net.
Once you have an initial group, share a blank Google doc with everyone.
Add a Symptoms header to the doc.
You’re the one bringing the issue, so you must have some observations already. Write those down in a numbered list.
It’s important that it be a numbered list rather than a bulleted list. As the troubleshooting process goes on, you’re going to want to refer to individual symptoms (and, later, hypotheses and actions). If each symptom has a number and the number never changes, this is a lot easier.
Ask your collaborators to list symptoms, too. They may have observed some other facet of the problem, or they may think to look at a graph that you don’t know about.
Here’s what an initial symptom list might look like:
Symptoms
(In this and subsequent examples, don’t worry about understanding the exact technical situation. Clinical troubleshooting can be used on problems in any part of any tech stack.)
All the symptoms on the list should be precise and objective. In other words, if a statement is quantifiable, quantify it. Don’t make suppositions yet about why these phenomena have been observed. That comes next.
Once you’re all on the same page about what problem you’re investigating, the initial symptom list is done.
Add a Hypotheses header to the doc. Invite your colleagues to join you in suggesting hypotheses that might explain the symptoms.
Let the ideas flow, and write them all down. This is where having a diverse set of perspectives in the room really pays off. Your co-investigators will think of hypotheses that would never have occurred to you, and vice versa. The more of these you come up with, the more likely the actual explanation will be on the list.
A hypothesis can’t be just anything, though. A hypothesis must
For example, given the symptoms above, “requests to the storage API are getting queued up behind a long-running query” would not be a sound hypothesis, since it’s inconsistent with Symptom 3. If requests were queued up, we’d expect them to take longer before failing.
After some discussion, your hypothesis list might look like this:
Hypotheses
Again, use a numbered list. If a hypothesis is ruled out or deemed irrelevant, don’t delete it: you don’t want the list numbering to change. Instead, mark it in some way with formatting. I use strikethrough. Gray works too.
Add an Actions header.
In a new numbered list, choose one or two actions that will advance the troubleshooting effort. Usually, you should pick actions that will rule out, or “falsify,” one or more of the hypotheses on the table.
To rule out Hypothesis 2 above, for instance, you could review the logs for one of the error spikes and check whether all the affected requests were associated with the same load balancer or the same storage server. If the requests are distributed across your infrastructure, then Hypothesis 2 is ruled out (and Hypothesis 1 as well, for that matter!).
When you agree upon actions, it’s best to assign them to individuals. Sometimes an action can be taken right away, and other times it’ll take a while and the group will have to reconvene later. But ownership should never be unclear.
Ruling out hypotheses isn’t the only purpose of actions in clinical troubleshooting. You can also assign actions that expand the group’s understanding of the problem, in order to generate new symptoms and new hypotheses. These actions can be things like, “Read the documentation on system X’s network configuration,” or “Search for blog posts about error message E.” As long as there’s at least one hypothesis in the running, though, there ought to be at least one action in flight that could falsify it. That’s one of the ways clinical troubleshooting ensures constant progress.
When actions are completed, you get more information for the symptom list. More symptoms suggest new hypotheses. New hypotheses imply further actions. Just keep going through the cycle until you’re satisfied.
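To make the bookkeeping concrete, here is a minimal sketch of the numbered-list rule: numbers are assigned once, and a ruled-out entry is struck through rather than deleted. The method itself only needs a shared doc; the `Entry` and `NumberedList` names, the `~~` strikethrough rendering, and the placeholder hypotheses are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    number: int              # assigned once, never reused or renumbered
    text: str
    ruled_out: bool = False
    owner: str = ""          # mainly useful for actions


@dataclass
class NumberedList:
    entries: list = field(default_factory=list)

    def add(self, text, owner=""):
        """Append an entry and return its permanent number."""
        entry = Entry(number=len(self.entries) + 1, text=text, owner=owner)
        self.entries.append(entry)
        return entry.number

    def rule_out(self, number):
        # Mark instead of deleting, so every other number stays put.
        self.entries[number - 1].ruled_out = True

    def open_numbers(self):
        """Numbers of the entries that are still in the running."""
        return [e.number for e in self.entries if not e.ruled_out]

    def render(self):
        lines = []
        for e in self.entries:
            text = f"~~{e.text}~~" if e.ruled_out else e.text
            suffix = f" ({e.owner})" if e.owner else ""
            lines.append(f"{e.number}. {text}{suffix}")
        return "\n".join(lines)


# Placeholder hypotheses, not the ones from the example above.
hypotheses = NumberedList()
for text in ["placeholder hypothesis A",
             "placeholder hypothesis B",
             "placeholder hypothesis C"]:
    hypotheses.add(text)

hypotheses.rule_out(2)
print(hypotheses.render())
```

After ruling out the second entry, the third is still "Hypothesis 3", so anyone who referred to it earlier in the conversation is still pointing at the right thing.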
Sometimes you won’t be satisfied until you have a definitive diagnosis: a solid explanation for all the symptoms that’s been proven right. Other times, you’ll be satisfied as soon as you take an action that makes the problem go away, even if there’s still uncertainty about what exactly was going on.
In any case, clinical troubleshooting will reliably get you to the solution.
In the absence of structure, collaborative diagnosis can stall out. Or worse, go backward.
With clinical troubleshooting, there’s always a next step forward. Teams that practice this method will consistently get to the bottom of technical mysteries, even when strapped for data or under intense pressure. And over time, as this scientific way of thinking becomes a habit, and then a culture, we come to understand the behavior of our system that much better.
I can teach your team how to do this. Get in touch.
Every time our system fails, and we go to analyze the failure, we find ourselves saying things like “We didn’t know X was happening,” “we didn’t know Y could happen,” and so on. And it’s true: we didn’t know those things.
We can never predict with certainty what the next system failure will be. But we can predict, because painful experience has taught us, that some or all of the causes of that failure will be surprising.
We can use that!
When we go looking at data (and by “data” I mostly mean logs, traces, metrics, and so on, but data can be many things), sometimes we see something weird, and we go, like, Huh! That Huh! is a signal. If we follow that Huh! – get to the bottom of it, figure it out, make it not surprising anymore – two things happen. First, we get a chance to correct a latent problem which might some day contribute to a failure. And second, we make our mental model that much better.
Of course, any individual Huh! could turn out to be nothing. Perhaps there’s a bug. Perhaps circumstances have shifted, and our expectations no longer line up with reality. Or perhaps it’s just a monitoring blip. We won’t know until we run it down.
But, whatever the shortcomings of any particular investigation, a habit of investigating surprises has many attractive qualities. The main one is that we get to fix problems before those problems get worse, start bouncing off other problems, and cause fires. In other words: our system runs smoother. Consider what that’s worth.
I do most of the grocery shopping for my family. Which is, of course, a privilege. But it’s a time-consuming privilege. From the moment I enter the supermarket to the moment I enter the checkout line, it’s about 30 minutes on average. Multiply that by 49 grocery trips a year, and you get 24.5 hours spent in the supermarket annually.
That’s 24 hours a year when I’d rather be doing basically anything else. The whole experience of grocery shopping grates on my nerves: jockeying for position with all the other shopping carts, getting the goddamn cleanup robot in my way, listening to the inane patter of the pre-recorded PA messages. It’s miserable.
All considered, I don’t mind doing it. My family needs to eat, and this is part of the cost of eating.
Buuut…
When I finally reach the end of the slog and look in my cart, I can’t help but think, there’s no way this should’ve taken me 30 minutes. There’s just not that much stuff.
So I started paying close attention to how I spend my time. Ideally, the process would look like this:
In reality, though, the process is much more like this:
This process involves tremendously more context switching than the ideal. Because of deficiencies in my brain, I can’t remember more than 1 or 2 items at a time, and every context switch entails a risk of forgetting what I was in the middle of doing. Compounding this problem, my grocery list isn’t sorted in any particular order. I’m at the mercy of my incomplete knowledge of supermarket geography: if I miss an item because I walked past the aisle it’s in, I have to waste time walking back to that aisle.
This close examination of my time expenditure got me thinking: how much of that 30 minutes could be optimized away?
It became clear that, if I wanted to spend less time in the supermarket, my best bet was to calculate the most efficient travel path through the store. Backtracking and searching were huge sources of waste. Instead, I wanted to take a single pass through the store, grabbing everything I needed as I went.
The first solution that popped into my imagination was to write a script that could take my shopping list, compare it to data about the supermarket’s layout, and produce step-by-step instructions. The instructions would look something like this:
I implemented this as a simple Go program called grocery-run. This script contains a hard-coded shopping list, store layout, and arrangement of items. The simple routing algorithm traverses the aisles in order, walking down any aisle that contains an item on the shopping list, and checking that item off. It keeps track of which direction I’m walking so as to order the items appropriately and predict which side of the aisle the item will be found on. The output of grocery-run looks like this:
This served decently for a few weeks. Each weekend before going to the supermarket, I would transfer my shopping list from the place where my family keeps it (Trello) into the code. Then I’d run the script and transfer the output to written form in a notebook. I was of course losing time with all these transfers, but my main goal at the time was just to validate the hypothesis that a path-finding algorithm like this could save me time in the long run. If that hypothesis turned out to be correct, then from there I could set my mind to optimizing away much of the overhead of using the script.
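For a sense of what that routing pass involves, here is a rough sketch in Python. It is not the original Go code. The data shapes are assumptions: each item maps to an aisle, a position along the aisle (0 at the front of the store, 1 at the back), and the side it is on when walking front to back. The item names, aisle names, and `plan_route` itself are hypothetical.

```python
def plan_route(shopping_list, aisle_order, locations):
    """Return (steps, missing) for a single pass through the store.

    steps is a list of (aisle, item, side) in the order to pick things up;
    missing lists items whose location is not in the data set yet.
    """
    steps = []
    missing = []
    heading_back = True  # True while walking front-to-back; flips after each aisle

    # Group the wanted items by aisle.
    wanted_by_aisle = {}
    for item in shopping_list:
        if item not in locations:
            missing.append(item)
            continue
        aisle, position, side = locations[item]
        wanted_by_aisle.setdefault(aisle, []).append((position, item, side))

    for aisle in aisle_order:
        wanted = wanted_by_aisle.get(aisle)
        if not wanted:
            continue  # nothing on the list here, so don't walk down this aisle
        # Order items by where they sit along the current walking direction.
        wanted.sort(reverse=not heading_back)
        for _position, item, side in wanted:
            if not heading_back:
                # Walking back-to-front mirrors left and right.
                side = "left" if side == "right" else "right"
            steps.append((aisle, item, side))
        heading_back = not heading_back  # we come out at the other end of the aisle
    return steps, missing


# Hypothetical store data for illustration.
locations = {
    "bananas": ("produce", 0.2, "left"),
    "spinach": ("produce", 0.7, "right"),
    "pasta": ("aisle 4", 0.3, "left"),
    "olive oil": ("aisle 4", 0.8, "left"),
    "milk": ("dairy", 0.5, "right"),
}
aisle_order = ["produce", "aisle 3", "aisle 4", "dairy"]
shopping_list = ["milk", "olive oil", "bananas", "pasta", "spinach", "saffron"]

steps, missing = plan_route(shopping_list, aisle_order, locations)
for aisle, item, side in steps:
    print(f"{aisle}: {item} (on your {side})")
print("not in data set yet:", missing)
```

Items with no known location end up in `missing`, which matches the habit of noting new item locations during the trip and entering them later.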
At the supermarket, I’d follow the instructions on the paper, noting the locations of any items that weren’t yet in my data set for later entry. This data recording too was a drain on my time, but – I reasoned – one that would disappear as my data set grew over multiple trips.
The initial results were encouraging! Even with the extra time spent noting item locations, I was spending the same amount of shopping time per trip. And I was also learning some important things:
This grocery-run script served decently, but it ended up being a stepping stone to an even simpler solution. Since I didn’t need the left/right or front-of-store/back-of-store data, it turned out I could replace my whole script with a spreadsheet. When you can replace a script with a spreadsheet, it’s almost always a win.
Here’s the spreadsheet. It does basically the same thing as the grocery-run script. There’s one sheet for the current shopping list, one for the store layout, and one for the arrangement of items within the store.
This spreadsheet makes use of a feature of Google Sheets that I hadn’t leveraged before: VLOOKUP. This function lets you construct a lookup table in one part of your spreadsheet, which can be referenced from another part. I think of this by analogy to JOINs in SQL:
SELECT shopping_list.item
FROM shopping_list
JOIN store_arrangement USING (item)
JOIN store_layout USING (aisle)
ORDER BY store_layout.visit_order;
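The same two-step lookup can be sketched in Python with plain dictionaries standing in for the lookup sheets. This is only an illustration of the idea, not a description of how the spreadsheet is built: the dictionary contents and the `sorted_shopping_list` name are assumptions. Unlike the inner JOINs above, this sketch keeps items it can't find and sorts them to the end.

```python
# Hypothetical lookup tables, one per sheet.
store_arrangement = {"milk": "dairy", "bananas": "produce", "pasta": "aisle 4"}  # item -> aisle
store_layout = {"produce": 1, "aisle 4": 2, "dairy": 3}                          # aisle -> visit order


def sorted_shopping_list(shopping_list):
    """Return (aisle, item) pairs ordered by when the aisle is visited."""
    rows = []
    for item in shopping_list:
        aisle = store_arrangement.get(item)              # first lookup: item -> aisle
        order = store_layout.get(aisle, float("inf"))    # second lookup: aisle -> visit order
        rows.append((order, aisle, item))
    # Sort on visit order only; Python's sort is stable, so ties keep list order.
    rows.sort(key=lambda row: row[0])
    return [(aisle, item) for _order, aisle, item in rows]


result = sorted_shopping_list(["milk", "pasta", "bananas", "saffron"])
for aisle, item in result:
    print(aisle, item)
```

An item with no known aisle comes back with `None` in the aisle column, which is a cue to add it to the arrangement data.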
My workflow with this spreadsheet is more or less the same as before:
grocery-run script)
The final shopping list looks like this:
The left column represents the aisle, with empty spaces indicating repetition.
Before this intervention, my grocery shopping process was circuitous and wasteful.
Now it’s much more streamlined. Much less backtracking, no more futzing with my phone, no more checking items off the list and getting distracted in the meantime.
All told, this saves me about 15 minutes a week of shopping time. Over the course of the year, that’s
(15m) * (49 weeks a year, adjusting for vacation) = 12.25h
I like to think of this as about 1 book: due to this optimization, I get to read about 1 extra book a year, or do something else of equivalent value. Plus, I get to spend 12 hours less each year listening to that insipid Stop & Shop PA loop. I’ll take it!
For a year and change, I’ve been using a home-grown, constantly evolving task-tracking system I call Impulse. Part of the deal with Impulse is this: when I think of something I need to get done, I write it down immediately and get back to whatever I was doing.
Later, the thing I wrote down gets merged into a workstream. A workstream is a list of tasks in the order I intend to start them. My two main workstreams are work (stuff I’m getting paid for), and personal (stuff I’m not getting paid for).
Impulse’s central principle, which is both obvious and, for me, life-altering, is that the more time you spend doing a thing, the more of that thing you will get done. Sure, there are other variables that play a role in determining output – focus, effectiveness of planning, and a match between your skill set and the work you undertake, to name a few – but time spent working is the most important.
Consequently, I try to arrange my life so as to allot some time every day to working through the tasks in my workstreams. I work from top to bottom, in whatever order the tasks are in at the time I start.
Among the myriad benefits this system has brought to my life is that it mitigates the effect of Ugh Fields.
An ugh field is a flinch response to thinking about a given domain of life, conditioned over time through repeated, self-reinforcing negative associations.
For example, I’ve long had an ugh field (or, as I’ve called it before, an attentional sneeze) around replying to texts from my family. I won’t go into how this ugh field developed. It wasn’t my family’s fault; they’re great. Point is, every time I thought about family texts I needed to reply to, my mind would flinch away. Over time, this flinch came earlier and earlier, and with less of a connection to any real source of discomfort. It grew through a feedback loop: think about the texts, flinch away, accrue consequences for ignoring them, think about those consequences, flinch earlier and more forcefully next time.
By succumbing to this ugh field, I’ve done significant damage to my relationship with my family. But the damage is not irreparable, and Impulse is helping me repair it.
How can a simple task management system help repair years’ worth of self-conditioning? The key is to decouple the commitment to do a thing from the doing of the thing.
Time was, I didn’t have anywhere in particular to keep track of my plans and obligations. When it occurred to me that something needed doing (e.g. “read that my friend sent me”, “paint the railing on the back steps”, “reply to Dad about getting together in August”), I either had to do the thing right then, or remember to do it later. Thanks to attention deficit, this choice constitutes a dilemma. If I do the thing now, then I’ll lose track of what it was I was doing right beforehand. But if I decide to do the thing later, I’ll almost certainly forget to do it.
Now I have a third choice: record the task in my “inbox.” No matter how trivial it seems, just record it for now and get back to whatever I was doing. Later, on a recurring basis, merge the inbox into my personal and work workstreams.
Right off the bat, this helps. When I think of something I need to do, I don’t need to act on that thought right away. In fact I shouldn’t: that would distract me from whatever I’m already doing. I don’t have to feel bad about putting the thing off, so I don’t have so much of a flinch response.
Then, when it comes time to merge the tasks from my inbox (which is a Google Keep note) into the personal and work workstreams, there’s no longer an ugh field to overcome. I just put each task somewhere on the list, depending on how soon I want to get around to it. Here, another benefit kicks in: I’m forced to compare any new tasks to the other tasks I’ve committed to do, on equal footing.
Because I’m comparing each task to all the others on a single dimension (i.e. where in the execution order to slot it), instead of a flinch-inducing dreaded obligation, I now just have another piece of work that will take a certain amount of time and create a certain amount of value. Like any other task, there are only 2 ways to get it off the list:
Either outcome is fine. What matters is that, instead of flinching away from even considering a task, I end up having an explicit conversation with myself about whether and when to do it. This lets me make better decisions.
This benefit is compounded by the way I burn down the workstreams. Somewhere in my day plan, there’s always some scheduled “heads down time,” during which I will execute tasks in a given workstream. For example, starting at 3pm today, I’ll do personal tasks. Always starting from the top of the list.
This means there’s no time to dread the ugh tasks. I don’t have to worry about whether I’m choosing the best thing to work on at a given moment, or what’s next afterward. I just have to trust the commitment I’ve already made to myself and do things in the order I’ve agreed to do them. The deciding is decoupled from the doing, which is an enormous help.
If you’ve come up with a similar arrangement, or if you’ve solved this problem for yourself in a different way, I’d love to hear about it! Hit me up in the comments.
In Technical debt is not a thing, I argue that we should stop using the metaphor of technical debt to inform strategy. Instead, I propose a time horizon model, in which our goal as engineers (or what have you) is to produce the most value possible over some time window.
In the technical debt model, we identify some inefficiency in our workflow process and trace it back to some prior decision that necessitated this inefficiency. That decision, we say, entailed an accumulation of technical debt. We traded away our future productivity for an earlier delivery date. If we want that productivity back, we must “pay off” the debt by rectifying that decision.
In the time horizon model, by contrast, we don’t worry about how we arrived at the existing sociotechnical system. Over the history of the product, we’ve made many decisions to sacrifice completeness for expediency. In making those decisions we went down a path that ultimately led us to the system as it exists today. We can’t go back and choose a different path: there’s only forward.
Let’s say we’ve got a SaaS product that relies on manually generated TLS certificates. We have to do 2 hours of toil every 3 months to renew these certs.
If we believe in technical debt, we might look back at the decision to make cert renewal a manual process and say, “By not automating this, we took on technical debt. We must pay off this debt.” We’d make a ticket, give it the technical-debt tag, and eventually pick it up as part of our 20% time commitment to technical debt paydown.
By contrast, in the time horizon model, our team’s stated raison d’être is simply to produce the most value possible within our agreed-upon time window.
So instead, we’ll say something like “Manual cert renewal costs 2 hours of labor every 3 months. It would take us 15 hours of work to automate.” Those 15 hours could instead be spent delivering value directly, so we should only undertake this project if it will free us up to deliver more total value between now and the time horizon:
Our time horizon | The estimated time investment (which is also the opportunity cost) | The payoff over the time horizon | Our decision |
3 months | 15 hours | 2 hours | Don’t do it |
1 year | 15 hours | 8 hours | Don’t do it |
3 years | 15 hours | 24 hours | Maybe do it |
5 years | 15 hours | 40 hours | Maybe do it |
Of course, just because a given time investment passes the time horizon test doesn’t necessarily mean we should make that investment. We still need to compare it to the set of other efforts we could undertake, and devise an overall strategy that maximizes the value we’ll deliver over our time window.
The time horizon model gives us a basis for making these comparisons, and lets us establish a lower bound for the expected return on our time investments. It helps us focus on the right things.
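The arithmetic behind the cert renewal table is easy to reproduce. The sketch below uses the figures from the example (2 hours of toil every 3 months, 15 hours to automate); the function names are made up for illustration, and passing the test corresponds to "maybe do it" rather than "do it".

```python
def payoff_hours(horizon_months, toil_hours=2, toil_interval_months=3):
    """Hours of manual toil avoided between now and the time horizon."""
    renewals = horizon_months // toil_interval_months
    return toil_hours * renewals


def passes_horizon_test(horizon_months, investment_hours=15):
    """True when the payoff over the horizon exceeds the time invested."""
    return payoff_hours(horizon_months) > investment_hours


for months in (3, 12, 36, 60):
    verdict = "maybe do it" if passes_horizon_test(months) else "don't do it"
    print(f"{months} months: payoff {payoff_hours(months)}h vs 15h invested -> {verdict}")
```

With these numbers the break-even point falls between the 1-year and 3-year horizons, which is why only the last two rows of the table are candidates.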
Last week I wrote about a system I’ve been using to manage my attention. It’s inspired by the call stack, but it presupposes no knowledge of computer science. Since I need a name for it, I’ve been calling it Impulse.
Since that post, I’ve been using Impulse more seriously. It has been super illuminating. My practices are in rapid flux at the moment, but the core workflow has more or less solidified to the following:
At first, I thought Impulse was just a to-do list flipped upside down. But now I think it’s more than that.
Making a to-do list involves writing down all the things you need to do, and committing to do all of them. There are many who find this comforting! The list represents a finite sequence of discrete steps, and once you’ve gone through all the steps, you’re done. You can relax.
For me, it’s quite the opposite. I get paralyzed thinking about all the possible activities I could potentially do that might be valuable. If I let myself, I’ll sit there for an hour just silently panicking about what should be on the list. Should I tidy up my desk today? Should I clean the bathroom? Should I put my DuoLingo on here? Should I add a task to reply to that email from my Dad? Should I add breaks? Now the list is way too long to get done before lunch. Should I cut some things? Which things are most important? Which things are most urgent? Oh shoot, I also need to get groceries today. Which means I don’t have time to… and so on. It’s exhausting and it doesn’t get me anywhere.
With Impulse, the focus is always on the top line. That’s always the thing I’m currently doing, no matter what. I don’t have to worry so much about whether everything important is on the list, or whether it’s in the right order. I can just bang away at the thing on top, abiding in faith that if I simply spend time doing things, things will get done.
Here’s another way to look at it: with Impulse, the process of choosing what to do is decoupled from the process of doing it.
This makes work much less overwhelming. When I’m working, I never need to worry about how my current task fits into the big picture, or whether I’ve skipped a step, or whether there’s something more valuable I could be doing instead. There is only one thing to do: the line at the top of the stack.
I’ve been using Impulse for a whole bunch of purposes. As a to-do list when I’m puttering around the house:
dishes
fold laundry
throw out coffee cup
fold baby laundry
clean roomba, put back
outline followup blog post on impulse
To break down a task into pieces and make it more manageable and help keep track of where I am:
        wire clipper
        new strings
        something to put under guitar neck
        tuner
    get equipment together
        loosen all strings
        cut all strings
        throw old strings away
    take off old strings
    ...
restring guitar
To keep track of things I want to follow up on later:
write widget function
reply to RFC comments
tidy up desk
order thai food
review at end of day
And I use it in many other ways that are less well defined but no less intuitive.
Right now I’m just using Vim and Git. The stack is a text file in a Git repository, and I also maintain some buckets as auxiliary text files. Here’s a snapshot of the buckets I’m currently using:
pink: (called “pink” for historical reasons) Tasks that are ready to start, which don’t demand sustained focus, and which can probably be done in under 15 minutes. Examples: check whether there's a better time for that weekly meeting, fix dead link on website, read that Stephen Malkmus interview I saw
blue: (called “blue” for historical reasons) Tasks that are ready to start, which do require sustained focus, and which will probably take longer than 30 minutes. Examples: read RFC that Tom sent, connect anti-tip kit to bookshelf, watch latest Philosophytube video
ongoing: BIGGER projects that are currently in flight. I try to keep this limited to 2, and I’ll pull tasks from ongoing into the stack when I have time. I use ongoing to think through the next few steps of a project, and those steps are organized in the same way as the stack. Examples: write impulse followup blog post, document disaster recovery testing procedure, prepare slide show for shitposting Zoom
project: Projects that are not in flight. I pull these into ongoing as it empties out.
In addition to the files listed above, I have a few templates, which I can pull into the stack to represent tasks that I need to do repeatedly. Some of these templates are:
matins: A sequence of activities I do every morning when I sit down at my desk: personal email, work email, work slack, and planning out how I’m going to spend my day.
vespers: A sequence of (much less structured) activities I do at the end of work.
restring_guitar: The process for restringing a guitar, organized into an Impulse task. It’s a thing I do so infrequently that I always forget how to do it. So Impulse is a nice way both to save that information, and to learn the process more thoroughly.
This Vim/Git approach works… okay. There’s definitely some friction. Sometimes parent tasks get separated from their children because I’ve made a mistake moving things around. Syncing is a pain, because I have to remember to pull and commit and push. But despite these and other difficulties, I still find it super helpful to work from a stack.
As soon as I started using Impulse consistently, I ran into the age-old problem of consensus. I have both a work laptop and a personal laptop, and I use Impulse on both. It immediately became clear that the whole thing falls apart as soon as there’s more than one stack.
So the biggest challenge I can foresee in building a software implementation of Impulse is distributed consensus. I have some ideas about how to handle it, which may one day come to take the form of an eventually consistent object store built on top of Git. But I think I need to ignore the consensus problem for now.
Instead, I’ll try to hone a super streamlined ncurses-based UI for a single, in-memory Impulse instance:
--- Moving the Cursor
j ↓ move cursor down
k ↑ move cursor up
h ← move cursor to parent
l → move cursor to child
t move cursor to top
--- Moving tasks
J ⇧↓ move task down (among its siblings)
K ⇧↑ move task up (among its siblings)
H ⇧← move task left (make it a child of the task that's currently its grandparent)
L ⇧→ move task right (make it a child of the sibling directly above it)
--- Changing tasks
c add child task(s)
s add sibling task(s)
d delete task
Enter edit task name
--- Etc.
? help (this message)
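The trickiest bindings in that list are the ones that re-parent a task. Here is a small sketch of how they might work on an in-memory tree. Nothing here comes from an actual Impulse implementation: the `Task` class, the invisible root task, and the choice of where a moved task lands among its new siblings are all assumptions.

```python
class Task:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []  # display order, top to bottom

    def add_child(self, name):
        child = Task(name)
        child.parent = self
        self.children.append(child)
        return child

    def move_up(self):
        """K: swap with the sibling directly above, if there is one."""
        if self.parent is None:
            return
        siblings = self.parent.children
        i = siblings.index(self)
        if i > 0:
            siblings[i - 1], siblings[i] = siblings[i], siblings[i - 1]

    def move_right(self):
        """L: become a child of the sibling directly above."""
        if self.parent is None:
            return
        siblings = self.parent.children
        i = siblings.index(self)
        if i == 0:
            return  # no sibling above, nothing to attach to
        new_parent = siblings[i - 1]
        siblings.pop(i)
        self.parent = new_parent
        new_parent.children.append(self)  # assumption: lands last among its new siblings

    def move_left(self):
        """H: become a child of the current grandparent."""
        if self.parent is None or self.parent.parent is None:
            return
        old_parent = self.parent
        grandparent = old_parent.parent
        old_parent.children.remove(self)
        # Assumption: the task lands right after its former parent.
        grandparent.children.insert(grandparent.children.index(old_parent) + 1, self)
        self.parent = grandparent


root = Task("(root)")  # invisible container for top-level tasks
equipment = root.add_child("get equipment together")
tuner = root.add_child("tuner")

tuner.move_right()  # tuner is now a child of "get equipment together"
print([t.name for t in equipment.children])

tuner.move_left()   # and back to being a top-level task
print([t.name for t in root.children])
```

Keeping a parent pointer on every task makes the move-to-grandparent operation a couple of list edits, at the cost of having to update both sides of the link on every move.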
This should keep me busy for a while. Then I’ll see where I’m at.
I have ADHD and anxiety. They interact, and it’s impossible to say where one ends and the other begins.
I can focus on a task, but if I get distracted, it’s often very hard to bring myself back to whatever I was doing. The cost of “context switching” – rebuilding the mental edifice of what I was doing and why and what I was going to do next – is exceptionally high for me. Sometimes the context switching is so cumbersome that I get distracted again before it’s done. Depending on the degree of anxiety bouncing around in my brain, this cycle may just repeat until I run out of day.
So there’s that, and then there’s also this: I’m mostly in technical leadership roles these days. I spend a lot of time in meetings and Slack threads and emails, which means I don’t get much heads-down time in a given week. And what heads-down time I do get is often split up into hours here and half hours there. Plus I have a kid now, which, while wonderful and great, is another way that my calendar gets all chopped up.
In a given week, then, if I want to accomplish any significant effort that requires focus (usually writing, research, or coding), I need to minimize the cost of context switching. To this end I’ve lately been using a system inspired by the call stack. (Note: I say “inspired” because it is not a direct analog of the call stack; please don’t expect it to be.)
Suppose I’m coding. I open my text editor to write code, but I also open another text editor to an empty file, visible on the same monitor. I call this empty document the stack. Whatever the thing is that I need to do, I write it in the stack:
implement widget feature
Implementing the widget feature is a multi-step process, though, so I need to create a few children of the implement widget feature task:
    write definition and doc string for widget function
    write widget test
    implement widget function such that it passes tests
implement widget feature
Children of a task appear above the task and indented. The three lines above implement widget feature are its children.
When working with this stack system, the top item on the stack is always the thing I’m currently doing. So, in the example at hand, I’m currently working on the task write definition and doc string for widget function.
But maybe, when I get to work writing the doc string for the widget function, I remember that I saw a good blog post about how to design functions like this. Maybe I want to reread that blog post before I commit to an interface. That’s fine! I can push (add at the top) a line for that:
        reread that blog post i saw
    write definition and doc string for widget function
    write widget test
    implement widget function such that it passes tests
implement widget feature
Now the top item in the stack is reread that blog post i saw. So I load up the post and start reading.
Of course, after I’ve read a few paragraphs, I see a Slack notification that I need to attend to: an alert about API latency. Normally, I would go deal with that, and by the time I got back to my terminal I’d have to rebuild all my context. But with this stack system, I can just add an interrupt: a task that temporarily steals the top slot. Like so:
deal with API latency alert
reread that blog post i saw
    write definition and doc string for widget function
    write widget test
    implement widget function such that it passes tests
implement widget feature
As always, the thing I’m currently doing occupies the top line in the stack. So now I’ll hang out in Slack and Datadog for a while, digging into this alert. That may involve adding children of the “deal with API latency alert” task, or it may not, depending on how much I need to hold in my head. But let’s say it doesn’t – my teammate says she’s on it, and she doesn’t need help.
So I come back to the stack and pop that task off it. By “pop” I just mean delete the line. Since my current task is always the topmost line in the stack, I always delete (pop) the top line. Now I’m back to the widget work:
reread that blog post i saw
    write definition and doc string for widget function
    write widget test
    implement widget function such that it passes tests
implement widget feature
I don’t have to worry about remembering what I was in the middle of. The top line is “reread that blog post i saw”, so I can immediately jump back into that. When I’m done, I pop that task and move on to whatever’s now on top. And so on and so on, popping lines as I finish them, adding lines as I think of followup tasks or feel the need to break down tasks by adding children.
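If you’d rather script the stack than keep it in a text file, the push and pop operations described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the impulse project: the TaskStack class, its method names, and the choice to put pushed tasks at the left margin are all assumptions made for this example.

```python
class TaskStack:
    """A task stack. The last list element is the top line of the stack file."""

    def __init__(self):
        self._tasks = []  # each entry is (indent_level, description)

    def push(self, description):
        """Put a new task (a followup or an interrupt) on top of the stack."""
        # Assumption: pushed tasks sit at the left margin.
        self._tasks.append((0, description))

    def push_children(self, *descriptions):
        """Break the current task into children; the first one ends up on top."""
        depth = self._tasks[-1][0] + 1
        for description in reversed(descriptions):
            self._tasks.append((depth, description))

    def pop(self):
        """Delete the top line and return its text."""
        return self._tasks.pop()[1]

    def current(self):
        """The top line is always the thing being worked on right now."""
        return self._tasks[-1][1] if self._tasks else None

    def render(self):
        """Show the stack as text: top task first, children indented."""
        lines = ["    " * depth + text for depth, text in reversed(self._tasks)]
        return "\n".join(lines)


stack = TaskStack()
stack.push("implement widget feature")
stack.push_children(
    "write definition and doc string for widget function",
    "write widget test",
    "implement widget function such that it passes tests",
)
stack.push("reread that blog post i saw")
stack.push("deal with API latency alert")  # the interrupt
stack.pop()                                # teammate has it covered
print(stack.current())                     # back to the blog post
```

A plain list is enough here because the only operations the system needs are adding a line at the top and deleting the top line.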
I’ve started a GitHub project called impulse to elaborate on this idea, but so far the README for impulse is entirely aspirational.
Anyway, that’s my stack system. If you have context switching difficulties like me, try it out! It’s cool.
I work on a team that fixes complex systems under time pressure. My teammates have different skill sets, different priorities, and different levels of expertise. But we all have to troubleshoot and solve problems together.
This is really hard to do effectively. Fortunately for us in the relatively new domain of DevOps, situations like ours have been studied extensively in the last couple decades. We can use the results of this research to inform our own processes and automation for troubleshooting.
One of the most important concepts to emerge from recent teamwork research, common ground, helps us understand why collaborative troubleshooting breaks down over time. This breakdown leads to wasted effort and mistakes, even if the team maintains constant communication in a chat room. But if we extend ChatOps by drawing on some ideas from medical diagnosis, we can make troubleshooting way easier without losing the benefits of fluid team conversation.
Ergonomics researchers D.D. Woods and Gary Klein (the latter of whom I wrote about in What makes an expert an expert?) published a phenomenally insightful paper in 2004 called Common Ground and Coordination in Joint Activity. In it, they describe a particular kind of failure that occurs when people engage in joint cognition: the Fundamental Common Ground Breakdown. Once you learn about the Fundamental Common Ground Breakdown, you see it everywhere. Here’s how the Woods/Klein paper describes the FCGB:
- Party A believes that Party B possesses some knowledge.
- Party B doesn’t have this knowledge, and doesn’t know he is supposed to have it.
- Therefore, he or she doesn’t request it.
- This lack of a request confirms to Party A that Party B has the knowledge.
When this happens, Party A and Party B lose common ground, which Woods & Klein define as “pertinent knowledge, beliefs and assumptions that are shared among the involved parties.” The two parties start making incorrect assumptions about each other’s knowledge and beliefs, which causes their common ground to break down further and further. Eventually they reach a coordination surprise, which forces them to re-synchronize their understanding of the coordinated activity:
Seriously, the FCGB is everywhere. Check out the paper.
I’m especially interested in one particular area where an understanding of common ground can help us do better teamwork: joint troubleshooting.
Everybody’s into ChatOps these days, and I totally get it. When a critical system is broken, it’s super useful to get everybody in the same room and hash it out. ChatOps allows everybody to track progress, coordinate activities, and share results. And it also helps to have lots of different roles represented in the room:
The process of communicating across role boundaries isn’t just overhead: it helps us refine our own understanding, look for extra evidence, and empathize with each other’s perspectives.
But ChatOps still offers a lot of opportunities for common ground breakdown. The FCGB can occur whenever different people interpret the same facts in different ways. Interpretations can differ for many different reasons:
These FCGB opportunities all become more pronounced as the troubleshooting drags on and folks become tired, bored, and confused. And when somebody says they’ve lost track of common ground, what do we do? Usually one of two things: we provide a summary of recent events and let the person ask questions until they feel comfortable, or we tell them to read the backlog.
The Q&A approach has serious drawbacks. First of all, it requires somebody knowledgeable to stop what they’re doing and summarize the situation. If people are frequently leaving and entering the chat room, you end up with a big distraction. Second of all, it leaves lots of room for important information to get missed. The Fundamental Common Ground Breakdown happens when somebody doesn’t know what to ask, so fixing it with a Q&A session is kind of silly.
The other way people catch up with the troubleshooting effort is by reading the backlog. This is even more inefficient than Q&A. Here’s the kind of stuff you have to dig through when you’re reading a chat backlog:
There’s a lot to unpack there – and that’s just 18 messages! Imagine piecing together a troubleshooting effort that’s gone on for hours, or days. It would take forever, and you’d still make a lot of mistakes. It’s just not a good way to preserve common ground.
So what do we need?
I’ve blogged before about how much I love differential diagnosis. It’s a formalism that doctors use to keep the diagnostic process moving in the right direction. I’ve used it many times in ops since I learned about it. It’s incredibly useful.
In differential diagnosis, you get together with your team in front of a whiteboard – making sure to bring together people from a wide variety of roles – and you go through a cycle of 3 steps:
Once you’re done with step #3, you can cross out a hypothesis or two. Then you head back to step #1 and repeat the cycle until the problem is identified.
A big part of the power of differential diagnosis is that it’s written down. Anybody can walk into the room, read the whiteboard, and understand the state of the collaborative effort. It cuts down on redundant Q&A, because the most salient information is summarized on the board. It eliminates inefficient chat log reading – the chat log is still there, but you use it to search for specific pieces of information instead of reading it like a novel. But, most importantly, differential diagnosis cuts down on fundamental common ground breakdowns, because everybody has agreed to accept what’s on the whiteboard as the canonical state of troubleshooting.
We don’t want to lose the off-the-cuff, conversational nature of ChatOps. But we need a structured source of truth to provide a point-in-time understanding of the effort. And we (read: I) don’t want to write a whole damn software project to make that happen.
My proposal is this: use Trello for differential diagnosis, and integrate it with the chat through a Hubot plugin. I haven’t written this plugin yet, but it shouldn’t take long (I’ll probably fork hubot-trello and start from there). That way people could update the list of symptoms, hypotheses, and tests on the fly, and they’d always have a central source of common ground to refer to.
In the system I envision, the chat room conversation would be peppered with statements like:
Geordi: hubot symptom warp engine going full speed, but ship not moving
Hubot: Created (symp0): warp engine going full speed, but ship not moving
Beverly: hubot falsify hypo1
Hubot: Falsified (hypo1): feedback loop between graviton emitter and graviton roaster
Geordi: hubot finish test1
Hubot: Marked (test1) finished: reboot the quantum phase allometer
And the resulting differential diagnosis board, containing the agreed-upon state of the troubleshooting effort, might look like this example, with cards labeled to indicate that they’re no longer in play.
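To make the idea concrete, here is a rough sketch of the bookkeeping such a bot would do, written in plain Python with an in-memory dictionary standing in for the board. It does not use the Hubot or Trello APIs. The DiagnosisBoard class is hypothetical, and the “hypothesis” and “test” creation commands are assumptions, since the exchange above only shows “symptom”, “falsify”, and “finish”.

```python
import re


class DiagnosisBoard:
    """In-memory stand-in for a differential diagnosis board."""

    # Command word -> card id prefix. Only "symptom" appears in the chat
    # example; "hypothesis" and "test" are assumed for this sketch.
    PREFIXES = {"symptom": "symp", "hypothesis": "hypo", "test": "test"}

    def __init__(self):
        self.cards = {}  # card id -> {"text": ..., "status": ...}
        self._counts = {prefix: 0 for prefix in self.PREFIXES.values()}

    def _create(self, kind, text):
        prefix = self.PREFIXES[kind]
        card_id = f"{prefix}{self._counts[prefix]}"
        self._counts[prefix] += 1
        self.cards[card_id] = {"text": text, "status": "open"}
        return f"Created ({card_id}): {text}"

    def handle(self, message):
        """Return the bot's reply, or None if the message is ordinary chat."""
        match = re.fullmatch(r"hubot (\w+) (.+)", message.strip())
        if match is None:
            return None
        verb, rest = match.groups()
        if verb in self.PREFIXES:
            return self._create(verb, rest)
        if verb == "falsify" and rest in self.cards:
            self.cards[rest]["status"] = "falsified"
            return f"Falsified ({rest}): {self.cards[rest]['text']}"
        if verb == "finish" and rest in self.cards:
            self.cards[rest]["status"] = "finished"
            return f"Marked ({rest}) finished: {self.cards[rest]['text']}"
        return f"Sorry, I don't understand: {message}"


board = DiagnosisBoard()
print(board.handle("hubot symptom warp engine going full speed, but ship not moving"))
print(board.handle("hubot hypothesis tachyon leak in the plasma conduit"))
print(board.handle("hubot falsify hypo0"))
```

Anything that isn’t a bot command passes through untouched, so the conversation stays fluid while the board keeps the agreed-upon state.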
Let me know if your organization already has something like this, or has tried a formal differential diagnosis approach before. I’d love to read some observations about your team’s process in the comments. Also, VictorOps has a pretty neat suite of tools that approaches what I have in mind, but I still think a more conceptually structured (not to mention free) solution could be very useful.
Automation is most effective when it’s a team player. By using automation to preserve common ground, we can solve problems faster and more thoroughly, with less frustration and less waste. And that all sounds pretty good to me.
When we imagine how to use a resource effectively – be that resource a development team, a CPU core, or a port-a-potty – our thoughts usually turn to efficiency. Ideally, the resource gets used at 100% of its capacity: we have enough capacity to serve our needs without generating queues, but not so much that we’re wasting money on idle resources. In practice there are spikes and lulls in traffic, so we should provision enough capacity to handle those spikes when they arrive, but we should always try to minimize the amount of capacity that’s sitting idle.
Except what I just said is bullshit.
In the early chapters of Donald G. Reinertsen’s brain-curdlingly rich Principles of Product Development Flow, I learned a very important and counterintuitive lesson about queueing theory that puts the lie to this naïve aspiration to efficiency-above-all-else. I want to share it with you, because once you understand it you will see the consequences everywhere.
Queueing theory is an unreasonably effective discipline that deals with systems in which tasks take time to get processed, and if there are no processors available then a task has to wait its turn in a queue. Sound familiar? That’s because queueing theory can be used to study basically anything.
In its easiest-to-consume form, queueing theory tells us about average quantities in the steady state of a queueing system. Suppose you’re managing a small supermarket with 3 checkout lines. Customers take different, unpredictable amounts of time to finish their shopping. So they arrive at the checkout line at different intervals. We call the interval between two customers reaching the checkout line the arrival interval.
And customers also take different, unpredictable amounts of time to get checked out. The time it takes from when the cashier scans a customer’s first item to when they finish checking that customer out is called the processing time.
Each of these quantities has some variability in it and can’t be predicted in advance for a particular customer. But you can empirically determine the probability distribution of these quantities:
Given just the information we’ve stated so far, queueing theory can answer a lot of questions about your supermarket. Questions like:
These sorts of questions are super important in all sorts of systems, and queueing theory provides a shockingly generalizable framework for answering them. Here’s an important theme that shows up in a huge variety of queueing systems:
The closer you get to full capacity utilization, the longer your queues get. If you’re using 100% of capacity all the time, your queues grow to infinity.
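One way to put numbers on this theme is the simplest textbook model, the M/M/1 queue: a single processor, tasks arriving at random at an average rate $\lambda$, and exponentially distributed processing times with an average completion rate $\mu$. These modeling assumptions are chosen here purely for illustration; real systems will differ in the details. Under them, the utilization $\rho$ and the average number of tasks waiting in the queue $L_q$ are:

$$\rho = \frac{\lambda}{\mu}, \qquad L_q = \frac{\rho^2}{1 - \rho}$$

At $\rho = 0.5$ the average queue holds $0.5$ tasks. At $\rho = 0.9$ it holds $8.1$, and at $\rho = 0.99$ it holds $98.01$. As $\rho$ approaches 1, the denominator approaches zero and the average queue length grows without bound.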
This is counterintuitive but absolutely true, so let’s think through it.
What the hell? Isn’t using capacity efficiently how you’re supposed to get rid of queues? Well yes, but it doesn’t work if you do it all the time. You need some buffer capacity.
Let’s think about a generic queueing system with 5 processors. This system’s manager is all about efficiency, so the system operates at 100% capacity all the time. No idle time. That’s ideal, right?
Sure, okay, now what happens when a task gets completed? If we want to make sure we’re always operating at 100% capacity, then there needs to be a task waiting behind that one. Otherwise we’d end up with an idle processor. So our queueing system must look more like this:
In order to operate at 100% capacity all the time, we need to have at least as many tasks queued as there are processors. But wait! That means that when another new task arrives, it has to get in line behind those other tasks in the queue! Here’s what our system might look like a little while later:
Some queues may be longer than others, but no queue is ever empty. This forces the total number of items in the queue to grow without limit. Eventually our system will look like this:
If you don’t quite believe it, I don’t blame you. Go back through the logic and convince yourself. It took me a while to absorb the idea too.
You can think of a team as a queueing system. Tasks arrive in your queue at random intervals, and they take unpredictable amounts of time to complete. Each member of the team is a processor, and when everybody’s working as hard as they can, the system is at 100% capacity.
That’s what a Taylorist manager would want: everybody working as hard as they can, all the time, with no waste of capacity. But as we’ve seen, in any system with variability, that’s an unachievable goal. The closer you get to full capacity utilization, the faster your queues grow. The longer your queues are, the longer the average task waits in the queue before getting done. It gets bad real fast:
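To get a feel for how fast, here is a minimal simulation sketch of a single processor. It assumes exponentially distributed arrival intervals and processing times, which is a modeling choice made for this example rather than anything measured from a real team, and the function name mean_wait is hypothetical. Each task’s wait is derived from the previous task’s wait, its processing time, and the gap before the next arrival.

```python
import random


def mean_wait(utilization, n_tasks=100_000, seed=1):
    """Estimate the average time a task waits in the queue of one processor.

    Times are measured in units of the mean processing time.
    """
    rng = random.Random(seed)
    service_rate = 1.0                       # mean processing time = 1
    arrival_rate = utilization * service_rate
    wait = 0.0                               # the first task never waits
    total_wait = 0.0
    for _ in range(n_tasks):
        total_wait += wait
        service = rng.expovariate(service_rate)  # this task's processing time
        gap = rng.expovariate(arrival_rate)      # time until the next arrival
        # The next task waits for whatever work is left when it arrives.
        wait = max(0.0, wait + service - gap)
    return total_wait / n_tasks


for u in (0.5, 0.8, 0.9, 0.95):
    print(f"utilization {u:.0%}: mean wait ~ {mean_wait(u):.1f} processing times")
```

Under these assumptions, theory predicts average waits of 1, 4, 9, and 19 processing times at 50%, 80%, 90%, and 95% utilization. The simulated numbers will wobble around those values, especially at the high end, but the shape is the same: the last few points of utilization are the expensive ones.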
So there are very serious costs to pushing your capacity too hard for too long:
Any queueing system operating at full capacity is gonna build up giant queues. That includes your team. What should you do about it?
Just by being aware that this relationship exists, you can gain a lot of intuition about team dynamics. What I’m taking away from it is this: There’s a tradeoff between how fast your team gets planned work done and how long it takes your team to get around to tasks. This changes the way I think about side projects, and makes me want to find the sweet spot. Let me know what you take away from it.