Huh! as a signal

Every time our system fails, and we go to analyze the failure, we find ourselves saying things like “We didn’t know X was happening,” “We didn’t know Y could happen,” and so on. And it’s true: we didn’t know those things.

We can never predict with certainty what the next system failure will be. But we can predict, because painful experience has taught us, that some or all of the causes of that failure will be surprising.

We can use that!

When we go looking at data (and by “data” I mostly mean logs, traces, metrics, and so on, but data can be many things), sometimes we see something weird, and we go like, Huh! That Huh! is a signal. If we follow that Huh! – get to the bottom of it, figure it out, make it not surprising anymore – two things happen. First, we get a chance to correct a latent problem which might some day contribute to a failure. And second, we make our mental model that much better.

Of course, any individual Huh! could turn out to be nothing. Perhaps there’s a bug. Perhaps circumstances have shifted, and our expectations no longer line up with reality. Or perhaps it’s just a monitoring blip. We won’t know until we run it down.

But, whatever the shortcomings of any particular investigation, a habit of investigating surprises has many attractive qualities. The main one is that we get to fix problems before those problems get worse, start bouncing off other problems, and cause fires. In other words: our system runs smoother. Consider what that’s worth.

I used a spreadsheet to save myself 12 hours a year grocery shopping

I do most of the grocery shopping for my family. Which is, of course, a privilege. But it’s a time-consuming privilege. From the moment I enter the supermarket to the moment I enter the checkout line, it’s about 30 minutes on average. Multiply that by 49 grocery trips a year, and you get 24.5 hours spent in the supermarket annually.

That’s 24.5 hours a year when I’d rather be doing basically anything else. The whole experience of grocery shopping grates on my nerves: jockeying for position with all the other shopping carts, getting the goddamn cleanup robot in my way, listening to the inane patter of the pre-recorded PA messages. It’s miserable.

All things considered, I don’t mind doing it. My family needs to eat, and this is part of the cost of eating.

Buuut…

Why does it take me so long?

When I finally reach the end of the slog and look in my cart, I can’t help but think, there’s no way this should’ve taken me 30 minutes. There’s just not that much stuff.

So I started paying close attention to how I spend my time. Ideally, the process would look like this:

  • Walk to where the milk is
  • Put milk in the cart
  • Walk to where the eggs are
  • Put eggs in the cart
  • Walk to where the bread is
  • … and so on.

In reality, though, the process is much more like this:

  • Look at the list and decide what to get first
  • Walk in the general direction of that item
  • Hopefully find the item there, and put it in the cart
  • Take out my phone and check it off the list
  • Look at the list again and guess which of the remaining items is closest
  • Walk in the general direction of that, hoping I’ll see the right overhead sign and not miss the aisle

This process involves tremendously more context switching than the ideal. Because of deficiencies in my brain, I can’t remember more than 1 or 2 items at a time, and every context switch entails a risk of forgetting what I was in the middle of doing. Compounding this problem is the fact that my grocery list isn’t sorted in any particular order. I’m at the mercy of my incomplete knowledge of supermarket geography: if I miss an item because I walked past the aisle it’s in, I have to waste time walking back to that aisle.

This close examination of my time expenditure got me thinking: how much of that 30 minutes could be optimized away?

Better grocery shopping through data

It became clear that, if I wanted to spend less time in the supermarket, my best bet was to calculate the most efficient travel path through the store. Backtracking and searching were huge sources of waste. Instead, I wanted to take a single pass through the store, grabbing everything I needed as I went.

The first solution that popped into my imagination was to write a script that could take my shopping list, compare it to data about the supermarket’s layout, and produce step-by-step instructions. The instructions would look something like this:

  1. Starting from the store entrance,
  2. Go up Aisle 24. Find bread on the right side.
  3. Toward the back, find milk on the left side.
  4. Head to Aisle 14. Find yogurt against the back wall.
  5. Go down Aisle 12. Find coffee on the right side.

I implemented this as a simple Go program called grocery-run. It contains a hard-coded shopping list, store layout, and arrangement of items. The simple routing algorithm traverses the aisles in order, walking down any aisle that contains an item on the shopping list and checking that item off. It keeps track of which direction I’m walking, so as to order the items appropriately and predict which side of the aisle each item will be found on. The output looks much like the numbered instructions above.
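
The core of the routing logic is small enough to sketch. Here’s a stripped-down version in Go, with a hypothetical four-item layout and none of the direction or side-of-aisle bookkeeping that the real grocery-run does:

    package main

    import "fmt"

    // Hypothetical layout data. grocery-run hard-codes the real versions:
    // the aisles in the order I walk past them, and the aisle each item
    // lives in.
    var aisleOrder = []int{24, 14, 12}

    var arrangement = map[string]int{
        "bread":  24,
        "milk":   24,
        "yogurt": 14,
        "coffee": 12,
    }

    func main() {
        shoppingList := []string{"coffee", "milk", "bread", "yogurt"}

        // Bucket the shopping list by aisle.
        byAisle := map[int][]string{}
        for _, item := range shoppingList {
            byAisle[arrangement[item]] = append(byAisle[arrangement[item]], item)
        }

        // Walk the aisles in order, skipping any aisle with nothing on
        // the list, and print step-by-step instructions.
        fmt.Println("Starting from the store entrance:")
        step := 1
        for _, aisle := range aisleOrder {
            for _, item := range byAisle[aisle] {
                fmt.Printf("%d. Aisle %d: %s\n", step, aisle, item)
                step++
            }
        }
    }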

This served decently for a few weeks. Each weekend before going to the supermarket, I would transfer my shopping list from the place where my family keeps it (Trello) into the code. Then I’d run the script and transfer the output to written form in a notebook. I was of course losing time with all these transfers, but my main goal at the time was just to validate the hypothesis that a path-finding algorithm like this could save me time in the long run. If that hypothesis turned out to be correct, then from there I could set my mind to optimizing away much of the overhead of using the script.

At the supermarket, I’d follow the instructions on the paper, noting the locations of any items that weren’t yet in my data set for later entry. This data recording too was a drain on my time, but – I reasoned – one that would disappear as my data set grew over multiple trips.

The initial results were encouraging! Even with the extra time spent noting item locations, I was spending the same amount of shopping time per trip. And I was also learning some important things:

  • The left/right information wasn’t especially useful. Since I was spending no time mucking with my phone (as I was before, when I would refer to the list on Trello many times over the course of a trip), my eyes were free to scan the shelves on both sides of an aisle.
  • The front-of-store/back-of-store information wasn’t that useful either. Because I always knew which item was next, I would see the item as I reached it. Then I could use my own judgement about whether to continue walking to the end of the aisle or turn back.
  • Time spent noting item locations was indeed decreasing: from week to week, the store wasn’t getting rearranged much.
  • A paper shopping list was far more efficient than using my phone. First of all, I didn’t have to go through the distracting and time-consuming exercise of taking my phone out of my pocket, unlocking it, scanning the list, and putting it back. And moreover: since the order of the paper list was aligned with my path through the supermarket, I didn’t even need to spend time or attention checking things off. Everything before the current item had already been grabbed, and everything after it remained.

The next iteration

This grocery-run script served decently, but it ended up being a stepping stone to an even simpler solution. Since I didn’t need the left/right or front-of-store/back-of-store data, it turned out I could replace my whole script with a spreadsheet. When you can replace a script with a spreadsheet, it’s almost always a win.

Here’s the spreadsheet. It does basically the same thing as the grocery-run script. There’s one sheet for the current shopping list, one for the store layout, and one for the arrangement of items within the store.

This spreadsheet makes use of a feature of Google Sheets that I hadn’t leveraged before: VLOOKUP. This function lets you construct a lookup table in one part of your spreadsheet, which can be referenced from another part. I think of this by analogy to JOINs in SQL:

    SELECT shopping_list.item
    FROM shopping_list
    JOIN store_arrangement USING (item)
    JOIN store_layout USING (aisle)
    ORDER BY store_layout.visit_order;
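
In the sheet itself, a lookup is a formula along these lines (the sheet and column layout here are hypothetical):

    =VLOOKUP(A2, arrangement!A:B, 2, FALSE)

This pulls the aisle for the item in cell A2 out of the arrangement sheet; a second VLOOKUP against the layout sheet turns that aisle into a visit order, which is the column the shopping list gets sorted by.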

My workflow with this spreadsheet is more or less the same as before:

  1. Transfer shopping list from Trello to spreadsheet
  2. Sort shopping list sheet by the “visit order” column (this step corresponds roughly to running the grocery-run script)
  3. Transfer the resulting list to paper

The final shopping list looks like this:
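
(Items and aisle numbers illustrative; the real thing is handwritten in a notebook.)

    24  bread
        milk
    14  yogurt
    12  coffee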

The left column represents the aisle, with empty spaces indicating repetition.

Before and after

Before this intervention, my grocery shopping process was circuitous and wasteful.

Now it’s much more streamlined. Much less backtracking, no more futzing with my phone, no more checking items off the list and getting distracted in the meantime.

All told, this saves me about 15 minutes a week of shopping time. Over the course of the year, that’s

(15 min) * (49 trips a year, adjusting for vacation) = 735 min ≈ 12.25 hours

I like to think of this as about 1 book: due to this optimization, I get to read about 1 extra book a year, or do something else of equivalent value. Plus, I get to spend 12 hours less each year listening to that insipid Stop & Shop PA loop. I’ll take it!

A simple intervention against Ugh Fields

For a year and change, I’ve been using a home-grown, constantly evolving task-tracking system I call Impulse. Part of the deal with Impulse is this: when I think of something I need to get done, I write it down immediately and get back to whatever I was doing.

Later, the thing I wrote down gets merged into a workstream. A workstream is a list of tasks in the order I intend to start them. My two main workstreams are work (stuff I’m getting paid for), and personal (stuff I’m not getting paid for).

Impulse’s central principle, which is both obvious and, for me, life-altering, is that the more time you spend doing a thing, the more of that thing you will get done. Sure, there are other variables that play a role in determining output – focus, effectiveness of planning, and a match between your skill set and the work you undertake, to name a few – but time spent working is the most important.

Consequently, I try to arrange my life so as to allot some time every day to working through the tasks in my workstreams. I work from top to bottom, in whatever order the tasks are in at the time I start.

Among the myriad benefits this system has brought to my life is that it mitigates the effect of Ugh Fields.

Ugh Fields

An ugh field is a flinch response to thinking about a given domain of life, conditioned over time through repeated, self-reinforcing negative associations.

For example, I’ve long had an ugh field (or, as I’ve called it before, an attentional sneeze) around replying to texts from my family. I won’t go into how this ugh field developed. It wasn’t my family’s fault; they’re great. Point is, every time I thought about family texts I needed to reply to, my mind would flinch away. Over time, this flinch came earlier and earlier, and with less of a connection to any real source of discomfort. It grew through a feedback loop: think about the texts, flinch away, accrue consequences for ignoring them, think about those consequences, flinch earlier and more forcefully next time.

By succumbing to this ugh field, I’ve done significant damage to my relationship with my family. But the damage is not irreparable, and Impulse is helping me repair it.

Attenuating the ugh field

How can a simple task management system help repair years’ worth of self-conditioning? The key is to decouple the commitment to do a thing from the doing of the thing.

Time was, I didn’t have anywhere in particular to keep track of my plans and obligations. When it occurred to me that something needed doing (e.g. “read that thing my friend sent me”, “paint the railing on the back steps”, “reply to Dad about getting together in August”), I either had to do the thing right then, or remember to do it later. Thanks to attention deficit, this choice constitutes a dilemma. If I do the thing now, then I’ll lose track of what it was I was doing right beforehand. But if I decide to do the thing later, I’ll almost certainly forget to do it.

Now I have a third choice: record the task in my “inbox.” No matter how trivial it seems, just record it for now and get back to whatever I was doing. Later, on a recurring basis, merge the inbox into my personal and work workstreams.

Right off the bat, this helps. When I think of something I need to do, I don’t need to act on that thought right away. In fact I shouldn’t: that would distract me from whatever I’m already doing. I don’t have to feel bad about putting the thing off, so I don’t have so much of a flinch response.

Then, when it comes time to merge the tasks from my inbox (which is a Google Keep note) into the personal and work workstreams, there’s no longer an ugh field to overcome. I just put each task somewhere on the list, depending on how soon I want to get around to it. Here, another benefit kicks in: I’m forced to compare any new tasks to the other tasks I’ve committed to do, on equal footing.

Work’s just work

Because I’m comparing each task to all the others on a single dimension (i.e. where in the execution order to slot it), a task is no longer a flinch-inducing dreaded obligation: it’s just another piece of work that will take a certain amount of time and create a certain amount of value. Like any other task, there are only 2 ways to get it off the list:

  1. Spend time doing it, or
  2. Decide consciously not to do it.

Either outcome is fine. What matters is that, instead of flinching away from even considering a task, I end up having an explicit conversation with myself about whether and when to do it. This lets me make better decisions.

This benefit is compounded by the way I burn down the workstreams. Somewhere in my day plan, there’s always some scheduled “heads down time,” during which I will execute tasks in a given workstream. For example, starting at 3pm today, I’ll do personal tasks. Always starting from the top of the list.

This means there’s no time to dread the ugh tasks. I don’t have to worry about whether I’m choosing the best thing to work on at a given moment, or what’s next afterward. I just have to trust the commitment I’ve already made to myself and do things in the order I’ve agreed to do them. The deciding is decoupled from the doing, which is an enormous help.

If you’ve come up with a similar arrangement, or if you’ve solved this problem for yourself in a different way, I’d love to hear about it! Hit me up in the comments.

The time horizon outlook: an alternative to technical debt

In Technical debt is not a thing, I argue that we should stop using the metaphor of technical debt to inform strategy. Instead, I propose a time horizon model, in which our goal as engineers (or what have you) is to produce the most value possible over some time window.

In the technical debt model, we identify some inefficiency in our workflow process and trace it back to some prior decision that necessitated this inefficiency. That decision, we say, entailed an accumulation of technical debt. We traded away our future productivity for an earlier delivery date. If we want that productivity back, we must “pay off” the debt by rectifying that decision.

In the time horizon model, by contrast, we don’t worry about how we arrived at the existing sociotechnical system. Over the history of the product, we’ve made many decisions to sacrifice completeness for expediency. In making those decisions we went down a path that ultimately led us to the system as it exists today. We can’t go back and choose a different path: there’s only forward.

Let’s say we’ve got a SaaS product that relies on manually generated TLS certificates. We have to do 2 hours of toil every 3 months to renew these certs.

If we believe in technical debt, we might look back at the decision to make cert renewal a manual process and say, “By not automating this, we took on technical debt. We must pay off this debt.” We’d make a ticket, give it the technical-debt tag, and eventually pick it up as part of our 20% time commitment to technical debt paydown.

By contrast, in the time horizon model, our team’s stated raison d’être is simply to produce the most value possible within our agreed-upon time window.

So instead, we’ll say something like “Manual cert renewal costs 2 hours of labor every 3 months. It would take us 15 hours of work to automate.” Those 15 hours could instead be spent delivering value directly, so we should only undertake this project if it will free us up to deliver more total value between now and the time horizon:

Our time horizon    Estimated time investment (= opportunity cost)    Payoff over the time horizon    Our decision
3 months            15 hours                                          2 hours                         Don’t do it
1 year              15 hours                                          8 hours                         Don’t do it
3 years             15 hours                                          24 hours                        Maybe do it
5 years             15 hours                                          40 hours                        Maybe do it
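
Another way to see where the line falls: the automation pays for itself once the cumulative savings exceed the 15-hour investment, i.e. at

\text{breakeven horizon} = \frac{15 \text{ hours}}{2 \text{ hours} / 3 \text{ months}} = 22.5 \text{ months}

Any horizon much shorter than two years doesn’t justify the project, which is just what the table says.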

Of course, just because a given time investment passes the time horizon test doesn’t necessarily mean we should make that investment. We still need to compare it to the set of other efforts we could undertake, and devise an overall strategy that maximizes the value we’ll deliver over our time window.

The horizon model gives us a basis for making these comparisons, and lets us establish a lower bound for the expected return on our time investments. It helps us focus on the right things.

Takeaways: managing attention with stacks

Last week I wrote about a system I’ve been using to manage my attention. It’s inspired by the call stack, but it presupposes no knowledge of computer science. Since I need a name for it, I’ve been calling it Impulse.

Since that post, I’ve been using Impulse more seriously. It has been super illuminating. My practices are in rapid flux at the moment, but the core workflow has more or less solidified to the following:

  • Each line represents a task.
  • A task can have child tasks, which appear above the parent task and indented.
  • The top line is always the thing I’m currently working on.

At first, I thought Impulse was just a to-do list flipped upside down. But now I think it’s more than that.

Making a to-do list involves writing down all the things you need to do, and committing to do all of them. There are many who find this comforting! The list represents a finite sequence of discrete steps, and once you’ve gone through all the steps, you’re done. You can relax.

For me, it’s quite the opposite. I get paralyzed thinking about all the possible activities I could potentially do that might be valuable. If I let myself, I’ll sit there for an hour just silently panicking about what should be on the list. Should I tidy up my desk today? Should I clean the bathroom? Should I put my DuoLingo on here? Should I add a task to reply to that email from my Dad? Should I add breaks? Now the list is way too long to get done before lunch. Should I cut some things? Which things are most important? Which things are most urgent? Oh shoot, I also need to get groceries today. Which means I don’t have time to… and so on. It’s exhausting and it doesn’t get me anywhere.

With Impulse, the focus is always on the top line. That’s always the thing I’m currently doing, no matter what. I don’t have to worry so much about whether everything important is on the list, or whether it’s in the right order. I can just bang away at the thing on top, abiding in faith that if I simply spend time doing things, things will get done.

Here’s another way to look at it: with Impulse, the process of choosing what to do is decoupled from the process of doing it.

This makes work much less overwhelming. When I’m working, I never need to worry about how my current task fits into the big picture, or whether I’ve skipped a step, or whether there’s something more valuable I could be doing instead. There is only one thing to do: the line at the top of the stack.

I’ve been using Impulse for a whole bunch of purposes. As a to-do list when I’m puttering around the house:

dishes
fold laundry
throw out coffee cup
fold baby laundry
clean roomba, put back
outline followup blog post on impulse

To break down a task into pieces, make it more manageable, and keep track of where I am:

        wire clipper
        new strings
        something to put under guitar neck
        tuner
    get equipment together
        loosen all strings
        cut all strings
        throw old strings away
    take off old strings
    ...
restring guitar

To keep track of things I want to follow up on later:

write widget function
    reply to RFC comments
    tidy up desk
    order thai food
review at end of day

And I use it in many other ways that are less well defined but no less intuitive.

Current implementation

Right now I’m just using Vim and Git. The stack is a text file in a Git repository, and I also maintain some buckets as auxiliary text files. Here’s a snapshot of the buckets I’m currently using:

  • pink: (called “pink” for historical reasons) Tasks that are ready to start, which don’t demand sustained focus, and which can probably be done in under 15 minutes. Examples: check whether there's a better time for that weekly meeting, fix dead link on website, read that Stephen Malkmus interview I saw
  • blue: (called “blue” for historical reasons) Tasks that are ready to start, which do require sustained focus, and which will probably take longer than 30 minutes. Examples: read RFC that Tom sent, connect anti-tip kit to bookshelf, watch latest Philosophytube video
  • ongoing: BIGGER projects that are currently in flight. I try to keep this limited to 2, and I’ll pull tasks from ongoing into the stack when I have time. I use ongoing to think through the next few steps of a project, and those steps are organized in the same way as the stack. Examples: write impulse followup blog post, document disaster recovery testing procedure, prepare slide show for shitposting Zoom
  • project: Projects that are not in flight. I pull these into ongoing as it empties out.

In addition to the files listed above, I have a few templates, which I can pull into the stack to represent tasks that I need to do repeatedly. Some of these templates are:

  • matins: A sequence of activities I do every morning when I sit down at my desk: personal email, work email, work slack, and planning out how I’m going to spend my day.
  • vespers: A sequence of (much less structured) activities I do at the end of work.
  • restring_guitar: The process for restringing a guitar, organized into an Impulse task. It’s a thing I do so infrequently that I always forget how to do it. So Impulse is a nice way both to save that information, and to learn the process more thoroughly.

This Vim/Git approach works… okay. There’s definitely some friction. Sometimes parent tasks get separated from their children because I’ve made a mistake moving things around. Syncing is a pain, because I have to remember to pull and commit and push. But despite these and other difficulties, I still find it super helpful to work from a stack.

The software angle

As soon as I started using Impulse consistently, I ran into the age-old problem of consensus. I have both a work laptop and a personal laptop, and I use Impulse on both. It immediately became clear that the whole thing falls apart as soon as there’s more than one stack.

So the biggest challenge I can foresee in building a software implementation of Impulse is distributed consensus. I have some ideas about how to handle it, which may one day come to take the form of an eventually consistent object store built on top of Git. But I think I need to ignore the consensus problem for now.

Instead, I’ll try to hone a super streamlined ncurses-based UI for a single, in-memory Impulse instance:

--- Moving the Cursor

j ↓     move cursor down
k ↑     move cursor up
h ←     move cursor to parent
l →     move cursor to child
t       move cursor to top

--- Moving tasks

J ⇧↓    move task down (among its siblings)
K ⇧↑    move task up (among its siblings)
H ⇧←    move task left (make it a child of the task that's currently its grandparent)
L ⇧→    move task right (make it a child of the sibling directly above it)

--- Changing tasks

c       add child task(s)
s       add sibling task(s)
d       delete task
Enter   edit task name

--- Etc.

?       help (this message)
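
Underneath whatever UI ends up on top, the data model is tiny. Here’s a sketch in Go of what the core type might look like (hypothetical; not actual impulse code):

    package main

    import (
        "fmt"
        "strings"
    )

    // Task is one line in the stack. Children render above their parent,
    // one indent level deeper.
    type Task struct {
        Name     string
        Children []*Task
    }

    // Render prints a task tree in stack order: children first, indented,
    // then the task itself.
    func (t *Task) Render(depth int) {
        for _, c := range t.Children {
            c.Render(depth + 1)
        }
        fmt.Println(strings.Repeat("\t", depth) + t.Name)
    }

    func main() {
        stack := &Task{
            Name: "implement widget feature",
            Children: []*Task{
                {Name: "write definition and doc string for widget function"},
                {Name: "write widget test"},
                {Name: "implement widget function such that it passes tests"},
            },
        }
        stack.Render(0)
    }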

This should keep me busy for a while. Then I’ll see where I’m at.

Managing attention with stacks

I have ADHD and anxiety. They interact, and it’s impossible to say where one ends and the other begins.

I can focus on a task, but if I get distracted, it’s often very hard to bring myself back to whatever I was doing. The cost of “context switching” – rebuilding the mental edifice of what I was doing and why and what I was going to do next – is exceptionally high for me. Sometimes the context switching is so cumbersome that I get distracted again before it’s done. Depending on the degree of anxiety bouncing around in my brain, this cycle may just repeat until I run out of day.

So there’s that, and then there’s also this: I’m mostly in technical leadership roles these days. I spend a lot of time in meetings and Slack threads and emails, which means I don’t get much heads-down time in a given week. And what heads-down time I do get is often split up into hours here and half hours there. Plus I have a kid now, which, while wonderful and great, is another way that my calendar gets all chopped up.

In a given week, then, if I want to accomplish any significant effort that requires focus (usually writing, research, or coding), I need to minimize the cost of context switching. To this end I’ve lately been using a system inspired by the call stack. (Note: I say “inspired” because it is not a direct analog of the call stack; please don’t expect it to be.)

Suppose I’m coding. I open my text editor to write code, but I also open another text editor to an empty file, visible on the same monitor. I call this empty document the stack. Whatever the thing is that I need to do, I write it in the stack:

implement widget feature

Implementing the widget feature is a multi-step process, though, so I need to create a few children of the implement widget feature task:

	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

Children of a task appear above the task and indented. The three lines above implement widget feature are its children.

When working with this stack system, the top item on the stack is always the thing I’m currently doing. So, in the example at hand, I’m currently working on the task write definition and doc string for widget function.

But maybe, when I get to work writing the doc string for the widget function, I remember that I saw a good blog post about how to design functions like this. Maybe I want to reread that blog post before I commit to an interface. That’s fine! I can push (add at the top) a line for that:

		reread that blog post i saw
	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

Now the top item in the stack is reread that blog post i saw. So I load up the post and start reading.

Of course, after I’ve read a few paragraphs, I see a Slack notification that I need to attend to: an alert about API latency. Normally, I would go deal with that, and by the time I got back to my terminal I’d have to rebuild all my context. But with this stack system, I can just add an interrupt: a task that temporarily steals the top slot. Like so:

deal with API latency alert
		reread that blog post i saw
	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

As always, the thing I’m currently doing occupies the top line in the stack. So now I’ll hang out in Slack and Datadog for a while, digging into this alert. That may involve adding children of the deal with API latency alert task, or it may not, depending on how much I need to hold in my head. But let’s say it doesn’t – my teammate says she’s on it, and she doesn’t need help.

So I come back to the stack and pop that task off it. By “pop” I just mean delete the line. Since my current task is always the topmost line in the stack, I always delete (pop) the top line. Now I’m back to the widget work:

		reread that blog post i saw
	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

I don’t have to worry about remembering what I was in the middle of. The top line is reread that blog post i saw, so I can immediately jump back into that. When I’m done, I pop that task and move on to whatever’s now on top. And so on and so on, popping lines as I finish them, adding lines as I think of followup tasks or feel the need to break down tasks by adding children.

I’ve started a GitHub project called impulse to elaborate on this idea, but so far the README for impulse is entirely aspirational.

Anyway, that’s my stack system. If you have context switching difficulties like me, try it out! It’s cool.

Troubleshooting On A Distributed Team Without Losing Common Ground

I work on a team that fixes complex systems under time pressure. My teammates have different skill sets, different priorities, and different levels of expertise. But we all have to troubleshoot and solve problems together.

This is really hard to do effectively. Fortunately for us in the relatively new domain of DevOps, situations like ours have been studied extensively in the last couple decades. We can use the results of this research to inform our own processes and automation for troubleshooting.

One of the most important concepts to emerge from recent teamwork research, common ground, helps us understand why collaborative troubleshooting breaks down over time. This breakdown leads to wasted effort and mistakes, even if the team maintains constant communication in a chat room. But if we extend ChatOps by drawing on some ideas from medical diagnosis, we can make troubleshooting way easier without losing the benefits of fluid team conversation.

Common Ground

Ergonomics researchers D.D. Woods and Gary Klein (the latter of whom I wrote about in What makes an expert an expert?) published a phenomenally insightful paper in 2004 called Common Ground and Coordination in Joint Activity. In it, they describe a particular kind of failure that occurs when people engage in joint cognition: the Fundamental Common Ground Breakdown. Once you learn about the Fundamental Common Ground Breakdown, you see it everywhere. Here’s how the Woods/Klein paper describes the FCGB:

  • Party A believes that Party B possesses some knowledge.
  • Party B doesn’t have this knowledge, and doesn’t know they’re supposed to have it.
  • Therefore, they don’t request it.
  • This lack of a request confirms to Party A that Party B has the knowledge.

When this happens, Party A and Party B lose common ground, which Woods & Klein define as “pertinent knowledge, beliefs and assumptions that are shared among the involved parties.” The two parties start making incorrect assumptions about each other’s knowledge and beliefs, which causes their common ground to break down further and further. Eventually they reach a coordination surprise, which forces them to re-synchronize their understanding of the coordinated activity:

[Figure from Common Ground and Coordination in Joint Activity: the common ground breakdown and re-synchronization cycle]

Seriously, the FCGB is everywhere. Check out the paper.

I’m especially interested in one particular area where an understanding of common ground can help us do better teamwork: joint troubleshooting.

Common Ground Breakdown in Chatroom Troubleshooting

Everybody’s into ChatOps these days, and I totally get it. When a critical system is broken, it’s super useful to get everybody in the same room and hash it out. ChatOps allows everybody to track progress, coordinate activities, and share results. And it also helps to have lots of different roles represented in the room:

  • Operations folks, to provide insight into the differences between the system’s normal behavior and its current state
  • Software engineers, who bring detailed knowledge of the ways subsystems are supposed to work
  • Account managers and product managers and support reps: not just for their ability to translate technical jargon into the customer’s language for status reporting, but also because their understanding of customer needs can help establish the right priorities
  • Q.A. engineers, who can rule out certain paths of investigation early with their intuition for the ways in which subsystems tend to fail

The process of communicating across role boundaries isn’t just overhead: it helps us refine our own understanding, look for extra evidence, and empathize with each other’s perspectives.

But ChatOps still offers a lot of opportunities for common ground breakdown. The FCGB can occur whenever different people interpret the same facts in different ways. Interpretations can differ for many different reasons:

  • Some people have less technical fluency in the system than others. A statement like “OOM killer just killed Cassandra on db014” might change an ops engineer’s whole understanding of the problem, but such a shift could fly under the radar of, say, a support engineer.
  • Some people are multitasking. They may have a stake in the troubleshooting effort but be unable to internalize every detail from the chat room in real time.
  • Some people are co-located. They find it easier to discuss the problem using mouth words or by physically showing each other graphs, thereby adjusting their own shared understanding without transmitting these adjustments to the rest of the team.
  • Some people enter the conversation late, or leave for a while and come back. These people will miss common ground changes that happen during their absence.

These FCGB opportunities all become more pronounced as the troubleshooting drags on and folks become tired, bored, and confused. And when somebody says they’ve lost track of common ground, what do we do? One of two things: we provide a summary of recent events and let the person ask questions until they feel comfortable, or we tell them to read the backlog.

The Q&A approach has serious drawbacks. First of all, it requires somebody knowledgeable to stop what they’re doing and summarize the situation. If people are frequently leaving and entering the chat room, you end up with a big distraction. Second of all, it leaves lots of room for important information to get missed. The Fundamental Common Ground Breakdown happens when somebody doesn’t know what to ask, so fixing it with a Q&A session is kind of silly.

The other way people catch up with the troubleshooting effort is by reading the backlog. This is even more inefficient than Q&A. Here’s the kind of stuff you have to dig through when you’re reading a chat backlog:

[Screenshot: 18 messages of chat-room backlog from a troubleshooting session]

There’s a lot to unpack there – and that’s just 18 messages! Imagine piecing together a troubleshooting effort that’s gone on for hours, or days. It would take forever, and you’d still make a lot of mistakes. It’s just not a good way to preserve common ground.

So what do we need?

Differential Diagnosis as an Engine of Common Ground

I’ve blogged before about how much I love differential diagnosis. It’s a formalism that doctors use to keep the diagnostic process moving in the right direction. I’ve used it many times in ops since I learned about it. It’s incredibly useful.

In differential diagnosis, you get together with your team in front of a whiteboard – making sure to bring together people from a wide variety of roles – and you go through a cycle of 3 steps:

  1. Identify symptoms. Write down all the anomalies you’ve seen. Don’t try to connect the dots just yet; just write down your observations.
  2. Generate hypotheses. Brainstorm explanations for the symptoms you’ve observed. This is where it really helps to have a good cross-section of roles represented. The more diverse the ideas you write down, the better.
  3. Test hypotheses. Now that you have a list of things that might be causing the problem, you start narrowing down that list by coming up with a test that will prove or disprove a certain hypothesis.

Once you’re done with step #3, you can cross out a hypothesis or two. Then you head back to step #1 and repeat the cycle until the problem is identified.

A big part of the power of differential diagnosis is that it’s written down. Anybody can walk into the room, read the whiteboard, and understand the state of the collaborative effort. It cuts down on redundant Q&A, because the most salient information is summarized on the board. It eliminates inefficient chat log reading – the chat log is still there, but you use it to search for specific pieces of information instead of reading it like a novel. But, most importantly, differential diagnosis cuts down on fundamental common ground breakdowns, because everybody has agreed to accept what’s on the whiteboard as the canonical state of troubleshooting.

Integrating Differential Diagnosis with ChatOps

We don’t want to lose the off-the-cuff, conversational nature of ChatOps. But we need a structured source of truth to provide a point-in-time understanding of the effort. And we (read: I) don’t want to write a whole damn software project to make that happen.

My proposal is this: use Trello for differential diagnosis, and integrate it with the chat through a Hubot plugin. I haven’t written this plugin yet, but it shouldn’t take long (I’ll probably fork hubot-trello and start from there). That way people could update the list of symptoms, hypotheses, and tests on the fly, and they’d always have a central source of common ground to refer to.

In the system I envision, the chat room conversation would be peppered with statements like:

Geordi: hubot symptom warp engine going full speed, but ship not moving

Hubot: Created (symp0): warp engine going full speed, but ship not moving

Beverly: hubot falsify hypo1

Hubot: Falsified (hypo1): feedback loop between graviton emitter and graviton roaster

Geordi: hubot finish test1

Hubot: Marked (test1) finished: reboot the quantum phase allometer

And the resulting differential diagnosis board, containing the agreed-upon state of the troubleshooting effort, might look like this example, with cards labeled to indicate that they’re no longer in play.

What do you think?

Let me know if your organization already has something like this, or has tried a formal differential diagnosis approach before. I’d love to read some observations about your team’s process in the comments. Also, VictorOps has a pretty neat suite of tools that approaches what I have in mind, but I still think a more conceptually structured (not to mention free) solution could be very useful.

Automation is most effective when it’s a team player. By using automation to preserve common ground, we can solve problems faster and more thoroughly, with less frustration and less waste. And that all sounds pretty good to me.

When efficiency hurts more than it helps

When we imagine how to use a resource effectively – be that resource a development team, a CPU core, or a port-a-potty – our thoughts usually turn to efficiency. Ideally, the resource gets used at 100% of its capacity: we have enough capacity to serve our needs without generating queues, but not so much that we’re wasting money on idle resources. In practice there are spikes and lulls in traffic, so we should provision enough capacity to handle those spikes when they arrive, but we should always try to minimize the amount of capacity that’s sitting idle.

Except what I just said is bullshit.

In the early chapters of Donald G. Reinertsen’s brain-curdlingly rich Principles of Product Development Flow, I learned a very important and counterintuitive lesson about queueing theory that puts the lie to this naïve aspiration to efficiency-above-all-else. I want to share it with you, because once you understand it you will see the consequences everywhere.

Queueing theory?

Queueing theory is an unreasonably effective discipline that deals with systems in which tasks take time to get processed, and if there are no processors available then a task has to wait its turn in a queue. Sound familiar? That’s because queueing theory can be used to study basically anything.

In its easiest-to-consume form, queueing theory tells us about average quantities in the steady state of a queueing system. Suppose you’re managing a small supermarket with 3 checkout lines. Customers take different, unpredictable amounts of time to finish their shopping. So they arrive at the checkout line at different intervals. We call the interval between two customers reaching the checkout line the arrival interval.

And customers also take different, unpredictable amounts of time to get checked out. The time it takes from when the cashier scans a customer’s first item to when they finish checking that customer out is called the processing time.

Each of these quantities has some variability in it and can’t be predicted in advance for a particular customer. But you can empirically determine the probability distribution of these quantities:

[Graphs: probability distributions of arrival interval and processing time]

Given just the information we’ve stated so far, queueing theory can answer a lot of questions about your supermarket. Questions like:

  • How long on average will a customer have to wait to check out?
  • What proportion of customers will arrive at the checkout counter without having to wait in line?
  • Can you get away with pulling an employee off one of the registers to go stock shelves? And if you do that, how will you know when you need to re-staff that register?

These sorts of questions are super important in all sorts of systems, and queueing theory provides a shockingly generalizable framework for answering them. Here’s an important theme that shows up in a huge variety of queueing systems:

The closer you get to full capacity utilization, the longer your queues get. If you’re using 100% of capacity all time, your queues grow to infinity.

This is counterintuitive but absolutely true, so let’s think through it.

What happens when you have no idle capacity

What the hell? Isn’t using capacity efficiently how you’re supposed to get rid of queues? Well yes, but it doesn’t work if you do it all the time. You need some buffer capacity.

Let’s think about a generic queueing system with 5 processors. This system’s manager is all about efficiency, so the system operates at 100% capacity all the time. No idle time. That’s ideal, right?

[Diagram: 5 processors, every one busy, no tasks queued]

Sure, okay, now what happens when a task gets completed? If we want to make sure we’re always operating at 100% capacity, then there needs to be a task waiting behind that one. Otherwise we’d end up with an idle processor. So our queueing system must look more like this:

[Diagram: 5 busy processors, each with a task queued behind it]

In order to operate at 100% capacity all the time, we need to have at least as many tasks queued as there are processors. But wait! That means that when another new task arrives, it has to get in line behind those other tasks in the queue! Here’s what our system might look like a little while later:

[Diagram: the same system a while later, with queues of varying length]

Some queues may be longer than others, but no queue is ever empty. This forces the total number of items in the queue to grow without limit. Eventually our system will look like this:

[Diagram: the same system still later, with long queues at every processor]

If you don’t quite believe it, I don’t blame you. Go back through the logic and convince yourself. It took me a while to absorb the idea too.
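
If you’d rather see it in math: for the simplest textbook case (an M/M/1 queue: one processor, exponentially distributed arrival intervals and processing times), the average number of queued tasks at capacity utilization \rho is

\text{average queue length} = \frac{\rho^2}{1 - \rho}

At 50% utilization that’s 0.5 tasks. At 90% it’s 8.1. At 99% it’s 98. As utilization approaches 100%, it diverges. And here’s a little Go simulation (an illustrative sketch, not real tooling) that shows the same blowup:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // avgQueue simulates a single processor fed by random arrivals (an
    // M/M/1 queue with service rate 1) and returns the average number
    // of tasks waiting in the queue.
    func avgQueue(utilization float64, nTasks int) float64 {
        var clock, busyUntil, totalWait float64
        for i := 0; i < nTasks; i++ {
            clock += rand.ExpFloat64() / utilization // next arrival
            if busyUntil < clock {
                busyUntil = clock // the processor was sitting idle
            }
            totalWait += busyUntil - clock // time this task waits in queue
            busyUntil += rand.ExpFloat64() // processing time, mean 1
        }
        // Little's law: avg queue length = arrival rate * avg wait.
        return utilization * totalWait / float64(nTasks)
    }

    func main() {
        for _, u := range []float64{0.5, 0.9, 0.99} {
            fmt.Printf("utilization %.0f%%: avg queue length %.1f\n",
                u*100, avgQueue(u, 5_000_000))
        }
    }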

What this means for teams

You can think of a team as a queueing system. Tasks arrive in your queue at random intervals, and they take unpredictable amounts of time to complete. Each member of the team is a processor, and when everybody’s working as hard as they can, the system is at 100% capacity.

That’s what a Taylorist manager would want: everybody working as hard as they can, all the time, with no waste of capacity. But as we’ve seen, in any system with variability, that’s an unachievable goal. The closer you get to full capacity utilization, the faster your queues grow. The longer your queues are, the longer the average task waits in the queue before getting done. It gets bad real fast:

[Graph: average wait time rising steeply as capacity utilization approaches 100%]

So there are very serious costs to pushing your capacity too hard for too long:

  • Your queues get longer, which itself is demotivating. People are less effective when they don’t feel that their work is making a difference (see The Progress Principle)
  • The average wait time between a task arriving and getting done rises linearly with queue length. With long wait times, you hemorrhage value: you commit time and energy to ideas that might not be relevant anymore by the time you get around to them (again: read the crap out of Principles of Product Development Flow)
  • Since you’re already operating at or near full capacity, you can’t even deploy extra capacity to knock those queues down: it becomes basically impossible to ever get rid of them.
  • The increased wait time in your ticket queue creates long feedback times, nullifying the benefit of agile techniques.

Efficiency isn’t the holy grail

Any queueing system operating at full capacity is gonna build up giant queues. That includes your team. What should you do about it?

Just by being aware that this relationship exists, you can gain a lot of intuition about team dynamics. What I’m taking away from it is this: There’s a tradeoff between how fast your team gets planned work done and how long it takes your team to get around to tasks. This changes the way I think about side projects, and makes me want to find the sweet spot. Let me know what you take away from it.

Kanban Highway: The Least Popular Mario Kart Course

I’ve been reading a really excellent book on product development called The Principles of Product Development Flow, by Donald G. Reinertsen. It’s a very appealing book to me, because it sort of lays down the theoretical and mathematical case for agile product development. And you know that theory is the tea, earl grey, hot to my Jean-Luc Picard.

But as much as I love this book, I just have to bring up this chart that’s in it:

[The chart in question]

This is the Hindenburg of charts. I can’t even, and it’s not for lack of trying to even. Being horrified by the awfulness of this chart is left as an exercise for the reader, but don’t hold me responsible if this chart gives you ebola.

But despite the utter contempt I feel for that chart, I think the point it’s trying to make is very interesting. So let’s talk about highways.

Highways!

Highways need to be able to get many many people into the city in the morning and out of the city in the evening. So when civil engineers design highways, one of their main concerns is throughput, measured in cars per hour.

Average throughput can be measured in a very straightforward way. First, you figure out the average speed, in miles per hour, of the cars on the highway. The cars are all traveling different speeds depending on what lane they’re in, how old they are, etc. But you don’t care about all that variation: you just need the average.

The other thing you need to calculate is the density of cars on the highway, measured in cars per mile. You take a given length of highway, and you count how many cars are on it, then you repeat. Ta-da! Average car density.

Then you do some math:

\frac{cars}{hour} = \frac{cars}{mile} \cdot \frac{miles}{hour}

Easy peasy. But let’s think about what that means. Here’s a super interesting graph of average car speed versus average car speed:

[Graph: average car speed vs. average car speed, a perfectly straight diagonal line]

Stay with me. Here’s a graph of average car density versus average car speed:

[Graph: average car density decreasing as average car speed increases]

This makes sense, right? Cars tend to pack together at low speed. That’s called a bumper-to-bumper traffic jam. And when they’re going fast, cars tend to spread out because they need more time to hit the brakes if there’s a problem.

So, going back to our equation, what shape do we get when we multiply a linear equation by another linear equation? That’s right: we get a parabola:

[Graph: throughput vs. average car speed, a downward-opening parabola]
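
In symbols, with a hypothetical linear model for how density falls off with speed:

\frac{cars}{mile} = d_{max}\left(1 - \frac{v}{v_{max}}\right)

\frac{cars}{hour} = v \cdot d_{max}\left(1 - \frac{v}{v_{max}}\right)

The second equation is a downward-opening parabola in v, with its peak at half the free-flow speed v_{max}.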

That right there is the throughput curve for the highway (which in the real world isn’t quite so symmetric, but has roughly the same properties). On the left hand side, throughput is low because traffic is stopped in a bumper-to-bumper jam. On the right hand side, throughput is low too: the cars that are on the highway are able to go very fast, but there aren’t enough of them to raise the throughput appreciably.

So already we can pick up a very important lesson: Faster movement does not equate to higher throughput. Up to a point, faster average car speed improves throughput. Then you get up toward the peak of the parabola and it starts having less and less effect. And then you get past the peak, and throughput actually goes down as you increase speed. Very interesting.

Congestion

Now, looking at that throughput curve, you might be tempted to run your highway at the very top, where the highest throughput is. If you can get cars traveling the right average speed, you can maximize throughput thereby minimizing rush hour duration. Sounds great, right?

Well, not so fast. Suppose you’re operating right at the peak, throughput coming out the wazoo. What happens if a couple more cars get on the highway? The traffic’s denser now, so cars have to slow down to accommodate that density. The average speed is therefore reduced, which means we’re now a bit left of our throughput peak. So throughput has been reduced, but cars are still arriving at the same rate, so that’s gonna increase density some more.

[Diagram: the congestion feedback loop: higher density, lower speed, lower throughput, still higher density]

This is congestion collapse: a sharp, catastrophic drop in throughput that leads to a traffic jam. It can happen in any queueing system where there’s feedback between throughput and processing speed. It’s why traffic jams tend to start and end all at once rather than gradually appearing and disappearing.

The optimal place for a highway to be is just a little to the right of the throughput peak. This doesn’t hurt throughput much because the curve is pretty flat near the top, but it keeps us away from the dangerous positive feedback loop.

So what does all this have to do with product development workflow?

Kanban Boards Are Just Like Highways

On a kanban board, tickets move from left to right as we make progress on them. If we had a kanban board where tickets moved continuously rather than in discrete steps, we could measure the average speed of tickets on our board in inches per day (or, if we were using the metric system, centimeters per kilosecond):

[Diagram: measuring ticket speed in inches per day]

And we could also measure the density of tickets just like we measured the density of cars, by dividing the board into inch-wide slices and counting the tickets per inch:

[Diagram: measuring ticket density in tickets per inch]

Let’s see how seriously we can abuse the analogy between this continuous kanban board and a highway:

  • Tickets arrive in our queue at random intervals, just as cars pull onto a highway at random intervals.
  • Tickets “travel” at random speeds (in inches/day) because we’re never quite sure how long a given task is going to take. This is just like how cars travel at random speeds (in miles per hour).
  • Tickets travel more slowly when there are many tickets to do (because of context switching, interdependencies, blocked resources, etc.), just as cars travel more slowly when they’re packed more densely onto the highway.
  • Tickets travel more quickly when there are fewer tickets to do, just as cars travel more quickly when the road ahead of them is open.

There are similarities enough that we can readily mine traffic engineering patterns for help dealing with ticket queues. We end up with a very familiar throughput curve for our kanban board:

[Graph: the throughput curve for the kanban board]

And just like with highway traffic, we run the risk of congestion collapse if we stray too close to the left-hand side of this curve. Since kanban boards generally have a limit on the number of tickets in progress, however, our congestion won’t manifest as a board densely packed with tickets. Rather, it will take the form of very long queues of work waiting to start. This is just as bad: longer queues mean longer wait times for incoming work, and long queues don’t go away without a direct effort to smash them.

What we can learn from real-world queues

A kanban board is a queueing system like any other, and the laws of queueing theory are incredibly general in their applicability. So we can learn a lot about managing ticket throughput by looking at the ways in which other queueing systems have been optimized.

Measure your system’s attributes

First off: you need metrics. Use automation to measure and graph, at the very least,

  • Number of tickets in queue (waiting to start work)
  • Number of tickets in progress
  • Number of tickets completed per day (or week)

Productivity metrics smell bad to a lot of people, and I think that’s because they’re so often used by incompetent managers as “proof” that employees could be pulling more weight. But these metrics can be invaluable if you understand the theory that they fit into. You can’t improve without measuring.

Control occupancy to sustain throughput

As we’ve seen, when there are too many tickets in the system, completion times suffer in a self-reinforcing way. If we’re to avoid that, we need to control the number of tickets not just in progress, but occupying the system as a whole. This includes queued tickets.

In some cities (Minneapolis and Los Angeles, for example), highway occupancy is controlled during rush hour by traffic lights on the on-ramp. The light flashes green to let a single car at a time onto the highway, and the frequency at which that happens can be tuned to the current density of traffic. It’s a safeguard against an abrupt increase in density shoving throughput over the peak into congestion collapse.

But how can we control the total occupancy of our ticketing system when tickets arrive at random?

Don’t let long queues linger

If you’re monitoring your queue length, you’ll be able to see when there’s a sharp spike in incoming tickets. When that happens, you need to address it immediately.

For every item in a queue, the average wait time for all work in the system goes up. Very long queues cause very long wait times. And long queues don’t go away by themselves: if tickets are arriving at random intervals, then a long queue is just as likely to grow as it is to shrink.

One way to address a long queue is to provision a bit more capacity as soon as you see the queue forming. Think about supermarkets. When checkout lines are getting a bit too long, the manager will staff one or two extra lanes. All it takes is enough capacity to get the queues back down to normal – it’s not necessary to eliminate them altogether – and then those employees can leave the register and go back to whatever they were doing before.

The other way to address a long queue is to slash requirements. When you see a long queue of tickets forming, spend some time going through it and asking questions like

  • Can this ticket be assigned to a different team?
  • Can this feature go into a later release?
  • Are there any duplicates?
  • Can we get increased efficiency by merging any of these tickets into one? (e.g. through automation or reduced context switching)

If you can shave down your queue by eliminating some unnecessary work, your system’s wait times will improve and the required capacity to mop up the queue will be lower.

Provide forecasts of expected wait time

At Disney World, they tell you how long the wait will be for each ride. Do you think they do this because it’s a fun little bit of data? Of course not. It helps them break the feedback loop of congestion.

When the wait for Space Mountain is 15 minutes, you don’t think twice. But when the wait is an hour, you might say to yourself “Eh, maybe I’ll go get lunch now and see if the line’s shorter later.” So these wait time forecasts are a very elegant way to cut down on incoming traffic when occupancy is high. Just like those traffic lights on highway on-ramps.

Why not use Little’s law to make your own forecasts of expected ticket wait time? If you’re tracking the occupancy of your system (including queued tickets) and the average processing rate (in tickets completed per day), it’s just:

\text{Average Wait Time} = \frac{\text{Occupancy}}{\text{Average Processing Rate}}
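
For instance (numbers invented): with 40 tickets occupying the system and a completion rate of 8 tickets per day, a newly filed ticket can expect to wait about 40 / 8 = 5 days.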

If you display this forecast in a public place, like the home page for your JIRA project, people will think twice when they’re about to submit a ticket. They might say to themselves “If it’s gonna take that many days, I might as well do this task myself” or “The information I’m asking for won’t be useful a week from now, so I guess there’s no point filing this ticket.”

Forecasts like this allow you to shed incoming ticket load when queues are high without having to tell stakeholders “no.”

Queues are everywhere

If you learn a little bit about queueing theory, you’ll see queues everywhere you look. It’s a great lens for solving problems and understanding the world. Try it out.