The time horizon outlook: an alternative to technical debt

In Technical debt is not a thing, I argue that we should stop using the metaphor of technical debt to inform strategy. Instead, I propose a time horizon model, in which our goal as engineers (or what have you) is to produce the most value possible over some time window.

In the technical debt model, we identify some inefficiency in our workflow process and trace it back to some prior decision that necessitated this inefficiency. That decision, we say, entailed an accumulation of technical debt. We traded away our future productivity for an earlier delivery date. If we want that productivity back, we must “pay off” the debt by rectifying that decision.

In the time horizon model, by contrast, we don’t worry about how we arrived at the existing sociotechnical system. Over the history of the product, we’ve made many decisions to sacrifice completeness for expediency. In making those decisions we went down a path that ultimately led us to the system as it exists today. We can’t go back and choose a different path: there’s only forward.

Let’s say we’ve got a SaaS product that relies on manually generated TLS certificates. We have to do 2 hours of toil every 3 months to renew these certs.

If we believe in technical debt, we might look back at the decision to make cert renewal a manual process and say, “By not automating this, we took on technical debt. We must pay off this debt.” We’d make a ticket, give it the technical-debt tag, and eventually pick it up as part of our 20% time commitment to technical debt paydown.

By contrast, in the time horizon model, our team’s stated raison d’être is simply to produce the most value possible within our agreed-upon time window.

So instead, we’ll say something like “Manual cert renewal costs 2 hours of labor every 3 months. It would take us 15 hours of work to automate.” Those 15 hours could instead be spent delivering value directly, so we should only undertake this project if it will free us up to deliver more total value between now and the time horizon:

Our time horizonThe estimated time investment
(which is also the opportunity
cost)
The payoff over the time horizonOur decision
3 months15 hours2 hoursDon’t do it
1 year15 hours8 hoursDon’t do it
3 years15 hours24 hoursMaybe do it
5 years15 hours40 hoursMaybe do it

Of course, just because a given time investment passes the time horizon test doesn’t necessarily mean we should make that investment. We still need to compare it to the set of other efforts we could undertake, and devise an overall strategy that maximizes the value we’ll deliver over our time window.

The horizon model gives us a basis for making these comparisons, and lets us establish a lower bound for the expected return on our time investments. It helps us focus on the right things.

Using Little’s Law to scale applications

In a complex application, there are queues everywhere. Which is lucky, in a way, because it means we can use queueing theory to slice through a whole class of Gordian knots.

One of queueing theory’s most general insights is Little’s Law:

L = λW

L is the long-term average number of customers in the system, λ is the long-term average arrival rate of new customers, and W is the average amount of time that customers spend waiting.

In the parlance of queueing theory, “customer” doesn’t just mean “customer.” It means whatever unit of work needs to pass through the system. A customer can be a phone call or an IP packet or a literal customer at a grocery store or any one of infinitely many other things. As long as there are pieces of work that arrive, get queued, get processed, and then exit the system*, Little’s Law works. It’s breathtakingly general.

As an illustration, let me share an anecdote from my job.

*and as long as you’re not hitting a queue size limit

How many web servers do we need?

I’m on a team that’s responsible for a web app that looks more or less like this:

Requests come in from the Internet to the load balancer. The load balancer forwards the requests to a bunch of web servers, each of which, in turn, distributes requests among 6 independent worker threads. The worker threads run the business logic and send responses back up the stack. Pretty straightforward.

When a web server receives a request, it hands that request off to one of its worker threads. Or, if all the worker threads are busy, the request gets queued in a backlog to be processed once capacity becomes available.

If everything’s hunky dory, the backlog should be empty. There should always be idle capacity, such that requests never have to wait in a queue. But one day I noticed that the backlog wasn’t empty. The total number of backlogged requests across the fleet looked like this:

Things were getting queued at peak traffic, so we needed to scale up the number of web servers. But scale it up to what? I could have used trial and error, but instead, I turned to Little’s Law.

The first step was to establish the mapping between entities in my system and the general mathematical objects related by Little’s Law:

  • L: the number of in-flight requests. In other words, requests that have arrived at the load balancer and for which responses haven’t yet been sent back to the user.
  • λ: the rate at which new requests arrive at the load balancer.
  • W: the average request latency.

What I wanted to know – and didn’t have a metric for – was L. I did have a metric in my telemetry system for W, the average request latency.

While I didn’t exactly have a metric for λ, the arrival rate of requests, I did have the completion rate of requests (i.e. how many requests per second were being served). The long-term average arrival rate can’t differ from the completion rate, since every request does exit the system eventually. Therefore I was able to use the completion rate as a stand-in for λ. Here’s what I found (these aren’t the actual numbers):

L = λW
(average occupancy) = (arrival rate)(average wait time)
(average occupancy) = (1000 req/s)(340ms)
(average occupancy) = 340 requests

I chose an arrival rate close to the peak throughput of the system. This still works as a “long-term average,” though, since the interval between arrivals (on the order of 1 millisecond) is much less than the duration of the average request (on the order of 300 milliseconds).

So, according to Little’s Law, at peak-traffic times, there will be on average 340 requests in flight in the system. Sometimes more, sometimes less, but on average 340. From there, it was trivial to see why requests were getting queued:

(average web server occupancy) = (average occupancy) / (number of web servers)
(average web server occupancy) = (340 requests) / (40)
(average web server occupancy) = 8.5

If you recall that each web server maintains 6 worker threads, you’ll see the problem. No matter what fancy stuff we try to do with queueing discipline or load balancing algorithm or whatever, there will be on average 2.5 queued requests per worker.

Little’s Law can also tell us what we need to scale up to if we want to extract ourselves from this mire:

(total worker threads) ≥ (arrival rate)(average wait time)
(number of web servers)(worker threads per web server) ≥ 340

So we can either scale up web servers or worker threads per web server until their product is greater than 340.

Little’s Law is about long-term averages

This is only a lower bound, of course.

Little’s Law tells us about the long-term average behavior of a queueing system: nothing else. From moment to moment, the occupancy of the system will vary around this average. So, in the example above, we need to provision enough worker capacity to absorb these variations.

How much extra capacity do we need? Little’s Law can’t tell us. The answer will depend on the idiosyncrasies of our system and the traffic that passes through it. Different systems have different latency distributions, different arrival time distributions, different queueing disciplines, and so on. These variables all have some effect on our worst case and our 99th-percentile occupancy. So, in most cases, it’ll be important to get a sense for the empirical ratio between a system’s average occupancy and its occupancy at whatever percentile you decide to care about. But Little’s Law is still a very helpful tool.

If you do have a good sense of how worst-case occupancy varies with the average, you might even be able to use Little’s Law to inform your autoscaling strategy. As long as the system’s arrival rate changes much more gradually than the average latency of requests (such that you’re still working with long-term averages), you can rely on

L = λW

to predict your capacity requirements. Or, at least, I think you could. I haven’t tried it yet.

Technical debt is not a thing

In software engineering, we have this concept called “technical debt.”

When we talk about technical debt, we usually mean that, because of a given technical decision, we’ll be making “interest payments” out of our productivity until the original “debt” is paid off. These payments can take many forms, from manual toil to risk to cognitive load. Anything that saps our productivity on a recurring basis.

Technical debt is a metaphor. It’s a conceptual tool we use to grapple with elusive, abstract stuff like productivity and value. That abstract stuff gets grounded to a more concrete family of concepts: namely, money and debt.

All of which is well and good, if the metaphor helps us reason. But it doesn’t. The technical debt metaphor serves only to obscure and encumber. We must discard it.

Direct and indirect work

We use the technical debt metaphor when we’re talking about strategy: our plan for undertaking certain pieces of work in a certain order so as to achieve some goal.

The most straightforward kind of strategy for a software engineering team would be to do nothing but build high-value features. Pick the highest-value feature that can be built in a sprint, build, repeat. Get paychecks, win promotions.

This strategy works for some teams. But for most, it’s naïve. If we only do direct work (what I’m calling, more-or-less, feature work: work that creates value for customers), problems start to emerge. We find ourselves repeating manual procedures, fixing more user-facing bugs, and fighting more fires. All this unplanned work demands our labor, diverting it from direct work. It thereby reduces the effective rate at which we can create value: our team’s capacity (“velocity”, if you Agile).

In order to keep our capacity from slipping, we have to do indirect work. Examples of indirect work are many and varied: writing internal documentation, caring for test suites and CI pipelines, building metric dashboards, refactoring old crufty code.

Indirect work uses some of our capacity. But we hope that, by folding indirect work into our strategy, we can avoid a lot of disruptive and expensive unplanned work that would otherwise emerge.

However, on its own, indirect work doesn’t tend to float to the top of the priority list. We usually prioritize tasks based on how much value they directly create for customers. It’s hard to even talk about direct and indirect work in the same conversation. It feels as though they create forms of value that are incommensurable: impossible to prioritize with respect to each other according to a single metric. Therefore, direct work tends to elbow indirect work out of the roadmap.

Nevertheless, we know from experience what happens when we neglect to do enough indirect work. Everything falls apart and we fail.

How we use the technical debt metaphor

The metaphor of technical debt is mainly a tool for working around the incommensurability of direct and indirect work. It’s how we try to balance tasks whose output is an amount of value (direct work) and tasks whose output is an increase in the rate at which we can create value (indirect work).

“Rapid progress isn’t free,” we say to management. “It’s achieved by taking out a sort of loan. We get to ship sooner, but in exchange we take on an obligation to make recurring payments, in the currency of labor.” These recurring payments only touch the interest. They will never, on their own, pay down the principal. To do that, we need to undertake some more substantial effort. A “paydown” project that will liberate us from the waste – the toil, or risk, or cognitive load – that our technical debt entails.

The technical debt metaphor feels true. And it carries some useful baggage from its origins in finance to its new home in software engineering. Namely: neglecting your debts is short-sighted and irresponsible. Almost everybody agrees on this, at least with respect to real debt. So, if the metaphor holds, why shouldn’t it be true of tech debt too?

Thanks to the implicit association between paying down debts and responsibility, engineers and management both get what they want. Engineers get to spend time doing work that benefits them directly. In other words, they get a respite from the feeling of alienation from their labor product. Management gets a plausible justification for prioritizing work that keeps their team happy. It’s a win-win, right?

How the metaphor fails us

If you take out a real loan, your payments are usually proportional to the initial principal. It’s a rough proportionality, since interest rates and maturity dates vary, but in general: the bigger the loan, the higher the payments. Mortgage payments are higher than car payments.

This fact about loans has an important implication. Since the labor you have to expend to pay off a loan is (roughly) proportional to the amount of your recurring payment, loans are always worth paying off. If you have a tiny loan, it will take a tiny amount of labor to pay off, so it’s worth doing even though the interest payments are tiny. If you have a huge loan, then your interest payments constitute a significant financial drain, which makes the loan worth paying off even though it will take a lot of labor to do it. So we all share an intuition about loans: there’s no responsible alternative to paying them off.

This intuition implicitly tags along with the debt metaphor, because of how metaphor works, into the engineering space. We feel we must pay off our debts. But since technical debt paydown tasks never seem to make it onto the roadmap by themselves, we have to come up with some system to force them into the pipeline. To spend a certain proportion of our labor paying down technical debt, even though we won’t have any new features to show for it.

The number cited most often is 20%. 20% of our time feels like about how much it takes to pay down the right amount of tech debt. So we throw all the tech debt into a big bucket and try to devote about 20% of our labor to tasks in that bucket. After all, it’s debt: we can’t just not pay it.

But the dynamics of technical debt are quite different from those of real debt. Different enough that this intuition no longer stands up to reality.

How much are we paying in interest?

Imagine, for example, that a team decides to forgo automating a software product’s release process. By accepting a manual release process instead, they’re able to launch a week earlier than otherwise. But, since they lack an automated release process, they need to do an extra t hours of work each month.

The decision to accept a manual release process in order to launch earlier could be classified as taking on technical debt. Which means, if we apply our intuition about real debts, that it must be paid off. So it will go into the “tech debt” bucket and get done as part of the team’s 20% time. Regardless of the value of t. In fact, on most teams (in my experience), t isn’t even discussed. The task is qualitatively tech debt, so it goes in the bucket.

But you have to know t. If t is high enough, then there’s an enormous drain on our productivity which must be rectified urgently. If t is low enough, the task might not even be worth doing at all.

How much will it cost to pay down the principal?

Furthermore, in order to know whether it’s worth paying down a given bit of technical debt, we need to know how much it will cost to pay down.

Going back to the example from above: suppose it will take P hours of labor to automate the release process. If P is low enough, the automation is worth building. But if P is very high, it’s better to accept the toil and move on.

P doesn’t tend to be discussed either. Or, rather, first we decide that something is technical debt, and only when it comes time to prioritize it with respect to the other technical debt in the bucket do we ask about P.

Rejecting the technical debt metaphor

So, in order to strategize – to decide what work to do in what order – we need to estimate, at the very least:

  • t: The amount of value that will be created by doing a given task
  • P: The amount of labor required to do the task

Fine. But are these parameters not both necessary for evaluating feature work too? And if we need to consider them anyway, regardless of whether a task is technical debt, then what’s the point of calling some tasks technical debt paydown and others not?

The defining characteristic of technical debt is the decision to sacrifice some amount of future productivity in exchange for earlier delivery. But this decision is not unique to technical debt: it’s universal to engineering. Every change we ship – whether for a feature, or a bug, or a technical debt payment – every change carries an implicit decision about how much value to defer and how much to deliver now. “Technical debt” is just what we call it when the deferred value takes the form of a decrease in our team’s velocity. But that form doesn’t make it special or different, or incommensurable with other work.

When we make strategic decisions – decisions about what work to do in what order – we need a guiding principle. “Spend at least 20% of our time paying down tech debt” is not a principle. It’s an excuse to substitute our arbitrary personal taste for a realistic, evidence-based appraisal of cost and value.

And yet, something must be done.

But okay: so what if the technical debt discourse is useless? We’re dissatisfied with how much time we lose to recurring low-value-add tasks. Without the technical debt metaphor, how can we make the case that it’s worth investing time to eliminate these drains on our productivity?

One way is to adopt the simple model of a time horizon. Suppose a team declares the following as their goal:

To produce the most value possible over the coming year.

This team has a time horizon of 1 year. So, when there comes a proposal to make a particular improvement to their productivity (be it a process fix or some automation or whatever else), they can estimate:

  • How many hours of labor it will take to implement
  • How many weeks it will take to complete that labor
  • How many hours of labor it will save them over the year (the remainder of the year, that is, after subtracting the weeks spent implementing)

In this way, productivity investments can be evaluated each on their own merit, rather than being lumped together into the “technical debt” pile. The investments with the highest expected returns will emerge as clear priorities. And many proposals can be rejected entirely, if it’s determined that they require more labor than they can possibly save over a year.

With an agreed-upon time horizon, direct work and indirect work become commensurable. We get a more coherent strategy, and we don’t waste time wringing our hands over what’s technical debt and what’s not. The technical debt metaphor falls away, obsolete.

Learning about attention deficit through spreadsheets

I struggle with attention deficit. Often this manifests through Twitter. I’ll find myself scrolling it with no memory of how I got there. And no particular interest. Just scrolling.

This is not great. First of all, it’s a waste of time. But that’s okay – I’ve made a decision to continue having a Twitter account, and some amount of wasted time is a consequence of that. Sue me: I love posting.

The bigger problem is that whenever I unintentionally end up on Twitter, even if I only end up being there for a couple seconds, my whole mental stack gets disrupted. I lose context. If the thing I was doing was a thing that required a lot of abstraction juggling – such as coding or writing – the bowling pins may have already tumbled down by the time I even realize I’m scrolling Twitter again.

Impulse helps a lot. It gives me a trail of breadcrumbs to follow back to my prior train of thought. But still: the mental context has largely dissipated. I have to build my momentum back up again. If this happens many times a day, it’s a serious leak of labor power.

Spreadsheet!!!

Like all problems, this one can be solved with spreadsheets.

Okay, not solved. But maybe understood. Maybe. In part.

About a week ago, I logged out of Twitter on my work laptop. Now, every time I go to check Twitter, instead of wonderful, infinitely scrolling little posts, I see this:

Not posts

This immediately jars me out of my reverie. “Whoops!” I think, “This isn’t what I meant to be doing.” I can go back to the task at hand instead of wasting time on Twitter. Of course, it’s already too late for the cognitive context – that turned to salt the instant I opened the new browser tab. But at least now I can learn something from these attention faults.

I started a spreadsheet called Twitter Attempts. Over the workday, every time I see the Twitter login screen, I take it as a reminder to update this Twitter Attempts spreadsheet. I log the timestamp of the attention fault, as well as a note about what I was doing right beforehand. Then I go back to what I was doing. The spreadsheet looks roughly like so:

Observations

I’ve found this technique pretty illuminating.

Attention faults have precipitating events

I used to think of attention deficit as something like getting in a car accident. My attention is puttering along all lah-dee-dah, then pow! a different thing is happening.

The data lead(s) me to believe it’s not really like that. Instead of being spontaneous and random, attention faults are almost invariably preceded by some specific precipitating event. In other words, when I see the Twitter login screen and redirect myself to my spreadsheet, I can nearly always recall a particular event that seems to have “caused” the attention fault.

I call this sequence – precipitating event followed by attention fault – an attentional sneeze.

gross

If you’ll forgive the grossness of the metaphor, you could kind of think of a precipitating event as the pepper and checking Twitter as the snot. Sorry.

I review my spreadsheet at the end of every week. Here are some (synthesized; not verbatim) examples of the kinds of “peppers” I tend to find in the Notes column:

  • Just submitted a pull request
  • Just received an annoying Slack message
  • Just shifted from an emotionally fraught topic to a safe topic in meeting
  • Just finished writing a paragraph of documentation

An attention fault basically never just feels like, “I don’t know what happened! I was in the middle of typing a sentence and suddenly I was on Twitter.” This is surprising and I didn’t know it before. Hooray for spreadsheets!

Peppers fall into some categories

Here are some examples, straight from the data, of some of these pepper events:

  • somebody talking to me in zoom => that person talking to someone else instead
  • shift from asking in slack a question about git, to refactoring my log pipeline work
  • emotional shift from vulnerable team convo to administrivia
  • finished mtg w/ arthur
  • shift from asking for help to waiting for a reply

As I look at these entries and many more like them, there seem to emerge a few patterns. Peppers usually involve things like:

  1. A shift between pragmatic and intuitive modes of thought
  2. An emotional shift
  3. The beginning of an interval of waiting
  4. The completion of a task

These categories overlap quite a bit. In any case, they’re all boundaries between one mental state and another.

Sliding activation energy

Some days I feel centered and focused; other days I feel flustered and preoccupied. Distractable.

I’ve noticed that, on days characterized by high distractibility, the bar for what counts as “completion of a task” (the 4th type of pepper event) becomes a much lower bar. For example, suppose I’m replying to comments on a pull request. On a low-distractibility day I might finish replying to all the comments in the PR before checking Twitter. On a high-distractibility day, however, I might check Twitter after every reply! (But still – never in the middle of one.)

Limitations, plans

I don’t know where to go with any of this. It’s food for thought.

I will say that this particular well of data dries up pretty fast. There’s a positive feedback loop: I’m not logged into Twitter, so my mind eventually acclimates to this and substitutes some other type of flinch instead. As a result, fewer Twitter login attempts; as a result of that, stronger reinforcement that Twitter is no longer a viable distraction.

Another limitation of this avenue of investigation is its vulnerability to the availability heuristic. Just because I think a given attentional sneeze was caused by a particular pepper event, doesn’t mean it really was. But hey – all I have to work with is my own perceptions.

At any rate, the exercise has already been pretty worth it. And if any of this sounds interesting, I’d be super keen to hear how this experiment goes for you!

Big Problems and Small Problems under load

In a complex system, like a SaaS product, you have countless interdependent things all happening at every moment. On a gross scale: means of production go in; value comes out. But under the hood, all sorts of interactions are going on, at various rates and latencies, with various exit conditions, stacked to various heights of indirection. And the rules for these interactions change with every deploy.

As load gets added to the system, you start to experience 🏋️‍♀️🔥🪦 load problems. The system just can’t keep up with traffic, and it starts to break down under various circumstances, in all kinds of different more- or less spectacular ways.

At the onset of load problems, it feels like everything is breaking at once. Like some threshold was breached and we’ve entered a new domain. We always had problems, but now we have a new Big Problem.

Big Problems, though, are always composed of smaller problems. Big Problems conditioned by load are not atomic novel phenomena. There’s a knot of interdependent processes, and when this knot gets pulled tight by load, it binds up.

Maybe there was a code deploy 6 weeks ago that inadvertently made this API endpoint’s median latency
50% higher. And maybe that API endpoint holds open an AMQP connection until it finishes. And maybe some customer just spun up 500 instances of a container that hits the endpoint. Now all these tangled Small Problems constrict, resources are exhausted, and the system stops producing.

It’s just these sorts of interdependency knots that cause an increase in the frequency of customer-facing problems as load rises.

We rely heavily on Big Problems to tell us which Small Problems need fixing. “How did this failure happen? It happened because of this and this and this contributing factor. Which of these can we fix?” And of course the results of this sort of analysis are useful. The Small Problems we identify this way have a crucial unifying feature: we know, because we’ve seen it happen, that they can each be contributing factors to Big Problems.

But this is not the only way to identify Small Problems. We can go digging in our data and find them, endeavoring to solve them before they get a chance to cause a Big Problem.

Okay, maybe we can’t prevent every Big Problem this way. And without the gating effect of post-mortem analysis, we’ll likely spend some of our time fixing Small Problems that never would’ve amounted to anything.

But even if we’re just poking around in the dark, we’ll probably end up preventing some Big Problems, right? And Big Problems are expensive. I think, on balance, it’s worth the effort.

Somebody ought to be getting paid to be Maxwell’s Demon. Ideally, nearly everybody in the org ought to be spending some of their time investigating mysteries. It ought to be an explicit part of the job.

Managing entropy in complex systems: the Maxwell’s Demon approach


Take a nontrivial software system and put it on the internet. Problems will emerge. Some problems will be serious; others less so. We won’t notice most of them.

A software system in production is a bucket filled with fluid. Each particle of the fluid is a discrete problem. The problems bounce around and collide with each other and do all kinds of stochastic stuff from moment to moment.

At the very bottom of the bucket are problems so minute that they can hardly be called problems at all. They have low energy. They don’t interact much with each other or with anything else.

Higher up, you find higher-and-higher-energy particles. Problems that cause small hiccups, or sporadic bouts of sluggishness.

Somewhere near the top, there’s a threshold. When a problem gets enough energy to cross this threshold, we passively notice it. Maybe it causes an outage, or maybe it just causes a false positive alert. Maybe a support ticket gets filed. Maybe it’s just a weird spike in a graph. However we perceive it, we’re forced to take it seriously.

Once problems get enough energy, we can’t help but notice them. But before that, they already exist.

What happens before a particle jumps this energy threshold?

Perhaps the problem is entirely novel – no part of it existed before now. A code deploy with a totally self-contained bug. A DOS attack. If it’s something like that: oh well.

But more often, a problem we just perceived has been acted upon by a more gradual process. Problems bounce around in the bucket, and occasionally they bounce into each other and you get a problem with higher energy than before. Or circumstances shift, and a problem that was once no big deal becomes a big deal. Over time, particles that started in the middle – or even at the bottom – can work their way up to the passive perception line.

If problems usually hang out below the perception threshold for a while before they cross it, then we can take advantage of that in two ways. One way is to lower the threshold for passive perception. Raise the sensitivity of our monitors without sacrificing specificity. This is hard, but worthwhile.

The other way to take advantage of the fluid-like behavior of problems is to spend energy finding and fixing problems before they boil. I call this the Maxwell’s demon approach. You go looking for trouble. You poke around in dashboards and traces and logs, find things that look weird, turn them around in your hands until you understand them, and ultimately fix them. Maybe you have a ticket backlog of possible problems you’ve found, and it’s somebody’s job to burn down that backlog. Ideally it’s the job of a team using a shared-context system like differential diagnosis.

If you make it somebody’s job to be Maxwell’s demon, you can find and fix all sorts of problems before they become bigger problems. If you don’t make it someone’s job, then no problem will get taken seriously until it’s an outage.

Takeaways: managing attention with stacks

Last week I wrote about a system I’ve been using to manage my attention. It’s inspired by the call stack, but it presupposes no knowledge of computer science. Since I need a name for it, I’ve been calling it Impulse.

Since that post, I’ve been using Impulse more seriously. It has been super illuminating. My practices are in rapid flux at the moment, but the core workflow has more or less solidified to the following:

  • Each line represents a task.
  • A task can have child tasks, which appear above the parent task and indented.
  • The top line is always the thing I’m currently working on.

At first, I thought Impulse was just a to-do list flipped upside down. But now I think it’s more than that.

Making a to-do list involves writing down all the things you need to do, and committing to do all of them. There are many who find this comforting! The list represents a finite sequence of discrete steps, and once you’ve gone through all the steps, you’re done. You can relax.

For me, it’s quite the opposite. I get paralyzed thinking about all the possible activities I could potentially do that might be valuable. If I let myself, I’ll sit there for an hour just silently panicking about what should be on the list. Should I tidy up my desk today? Should I clean the bathroom? Should I put my DuoLingo on here? Should I add a task to reply to that email from my Dad? Should I add breaks? Now the list is way too long to get done before lunch. Should I cut some things? Which things are most important? Which things are most urgent? Oh shoot, I also need to get groceries today. Which means I don’t have time to… and so on. It’s exhausting and it doesn’t get me anywhere.

With Impulse, the focus is always on the top line. That’s always the thing I’m currently doing, no matter what. I don’t have to worry so much about whether everything important is on the list, or whether it’s in the right order. I can just bang away at the thing on top, abiding in faith that if I simply spend time doing things, things will get done.

Here’s another way to look at it: with Impulse, the process of choosing what to do is decoupled from the process of doing it.

This makes work much less overwhelming. When I’m working, I never need to worry about how my current task fits into the big picture, or whether I’ve skipped a step, or whether there’s something more valuable I could be doing instead. There is only one thing to do: the line at the top of the stack.

I’ve been using Impulse for a whole bunch of purposes. As a to-do list when I’m puttering around the house:

dishes
fold laundry
throw out coffee cup
fold baby laundry
clean roomba, put back
outline followup blog post on impulse

To break down a task into pieces and make it more manageable and help keep track of where I am:

        wire clipper
        new strings
        something to put under guitar neck
        tuner
    get equipment together
        loosen all strings
        cut all strings
        throw old strings away
    take off old strings
    ...
restring guitar

To keep track of things I want to follow up on later:

write widget function
    reply to RFC comments
    tidy up desk
    order thai food
review at end of day

And I use it in many other ways that are less well defined but no less intuitive.

Current implementation

Right now I’m just using Vim and Git. The stack is a text file in a Git repository, and I also maintain some buckets as auxiliary text files. Here’s a snapshot of the buckets I’m currently using:

  • pink: (called “pink” for historical reasons) Tasks that are ready to start, which don’t demand sustained focus, and which can probably be done in under 15 minutes. Examples: check whether there's a better time for that weekly meeting, fix dead link on website, read that Stephen Malkmus interview I saw
  • blue: (called “blue” for historical reasons) Tasks that are ready to start, which do require sustained focus, and which will probably take longer than 30 minutes. Examples: read RFC that Tom sent, connect anti-tip kit to bookshelf, watch latest Philosophytube video
  • ongoing: BIGGER projects that are currently in flight. I try to keep this limited to 2, and I’ll pull tasks from ongoing into the stack when I have time. I use ongoing to think through the next few steps of a project, and those steps are organized in the same way as the stack. Examples: write impulse followup blog post, document disaster recovery testing procedure, prepare slide show for shitposting Zoom
  • project: Projects that are not in flight. I pull these into ongoing as it empties out.

In addition to the files listed above, I have a few templates, which I can pull into the stack to represent tasks that I need to do repeatedly. Some of these templates are:

  • matins: A sequence of activities I do every morning when I sit down at my desk: personal email, work email, work slack, and planning out how I’m going to spend my day.
  • vespers: A sequence of (much less structured) activities I do at the end of work.
  • restring_guitar: The process for restringing a guitar, organized into an Impulse task. It’s a thing I do so infrequently that I always forget how to do it. So Impulse is a nice way both to save that information, and to learn the process more thoroughly.

This Vim/Git approach works… okay. There’s definitely some friction. Sometimes parent tasks get separated from their children because I’ve made a mistake moving things around. Syncing is a pain, because I have to remember to pull and commit and push. But despite these and other difficulties, I still find it super helpful to work from a stack.

The software angle

As soon as I started using Impulse consistently, I ran into the age-old problem of consensus. I have both a work laptop and a personal laptop, and I use Impulse on both. It immediately became clear that the whole thing falls apart as soon as there’s more than one stack.

So the biggest challenge I can foresee in building a software implementation of Impulse is distributed consensus. I have some ideas about how to handle it, which may one day come to take the form of an eventually consistent object store built on top of Git. But I think I need to ignore the consensus problem for now.

Instead, I’ll try to hone a super streamlined ncurses-based UI for a single, in-memory Impulse instance:

--- Moving the Cursor

j ↓     move cursor down
k ↑     move cursor up
h ←     move cursor to parent
l →     move cursor to child
t       move cursor to top

--- Moving tasks

J ⇧↓    move task down (among its siblings)
K ⇧↑    move task up (among its siblings)
H ⇧←    move task left (make it a child of the task that's currently its grandparent)
L ⇧→    move task right (make it a child of the sibling directly above it)

--- Changing tasks

c       add child task(s)
s       add sibling task(s)
d       delete task
Enter   edit task name

--- Etc.

?       help (this message)

This should keep me busy for a while. Then I’ll see where I’m at.

Managing attention with stacks

I have ADHD and anxiety. They interact, and it’s impossible to say where one ends and the other begins.

I can focus on a task, but if I get distracted, it’s often very hard to bring myself back to whatever I was doing. The cost of “context switching” – rebuilding the mental edifice of what I was doing and why and what I was going to do next – is exceptionally high for me. Sometimes the context switching is so cumbersome that I get distracted again before it’s done. Depending on the degree of anxiety bouncing around in my brain, this cycle may just repeat until I run out of day.

So there’s that, and then there’s also this: I’m mostly in technical leadership roles these days. I spend a lot of time in meetings and Slack threads and emails, which means I don’t get much heads-down time in a given week. And what heads-down time I do get is often split up into hours here and half hours there. Plus I have a kid now, which, while wonderful and great, is another way that my calendar gets all chopped up.

In a given week, then, if I want to accomplish any significant effort that requires focus (usually writing, research, or coding), I need to minimize the cost of context switching. To this end I’ve lately been using a system inspired by the call stack. (Note: I say “inspired” because it is not a direct analog of the call stack; please don’t expect it to be.)

Suppose I’m coding. I open my text editor to write code, but I also open another text editor to an empty file, visible on the same monitor. I call this empty document the stack. Whatever the thing is that I need to do, I write it in the stack:

implement widget feature

Implementing the widget feature is a multi-step process, though, so I need to create a few children of the implement widget feature task:

	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

Children of a task appear above the task and indented. The three lines above implement widget feature are its children.

When working with this stack system, the top item on the stack is always the thing I’m currently doing. So, in the example at hand, I’m currently working on the task write definition and doc string for widget function.

But maybe, when I get to work writing the doc string for the widget function, I remember that I saw a good blog post about how to design functions like this. Maybe I want to reread that blog post before I commit to an interface. That’s fine! I can push (add at the top) a line for that:

		reread that blog post i saw
	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

Now the top item in the stack is reread that blog post i saw. So I load up the post and start reading.

Of course, after I’ve read a few paragraphs, I see a Slack notification that I need to attend to: an alert about API latency. Normally, I would go deal with that, and by the time I got back to my terminal I’d have to rebuild all my context. But with this stack system, I can just add an interrupt: a task that temporarily steals the top slot. Like so:

deal with API latency alert
		reread that blog post i saw
	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

As always, the thing I’m currently doing occupies the top line in the stack. So now I’ll hang out in Slack and Datadog for a while, digging into this alert. That may involve adding children of the deal with API latency alert task, or it may not, depending on how much I need to hold in my head. But let’s say it doesn’t – my teammate says she’s on it, and she doesn’t need help.

So I come back to the stack and pop that task off it. By “pop” I just mean delete the line. Since my current task is always the topmost line in the stack, I always delete (pop) the top line. Now I’m back to the widget work:

		reread that blog post i saw
	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

I don’t have to worry about remembering what I was in the middle of. The top line is reread that blog post i saw, so I can immediately jump back into that. When I’m done, I pop that task and move on to whatever’s now on top. And so on and so on, popping lines as I finish them, adding lines as I think of followup tasks or feel the need to break down tasks by adding children.

I’ve started a Github project called impulse to elaborate on this idea, but so far the README for impulse is entirely aspirational.

Anyway, that’s my stack system. If you have context switching difficulties like me, try it out! It’s cool.

Stop apologizing for bugs

Look out, honey, ’cause I’m using technology

Ain’t got time to make no apology

The Stooges, “Search and Destroy”

For the last year or so, I’ve made a conscious effort to stop apologizing for bugs in my code.

Apologizing for bugs is very tempting. I used to do it a lot. When my code was involved in a failure that screwed up a coworker’s day or caused a user-facing problem, I’d say “Whoops! Sorry! I should’ve thought about that.”

The motivation for apologizing is sound: you want to do your best for the team. In principle, you could have prevented a problem if you’d just done something slightly different. Through this lens, apologizing for bugs may seem innocuous. But it contributes to a bunch of cultural anti-patterns:

  • It reinforces the idea that any one person or piece of code can be blamed for a given failure. Short of malice, this is never the case.
  • It gives the impression that, when you wrote the code, you should have written it better. This is a counterfactual that rarely holds up to examination.
  • It positions shame as the correct emotion to feel about bugs in your code: if you were a better engineer – a better teammate – the bug wouldn’t exist.
  • If you’re a more senior engineer on your team, the effects of these anti-patterns are magnified: people see you apologizing for bugs, so they think that they should be striving to write bug-free code. They may feel ashamed if their code has bugs.

Even if you don’t intellectually believe any of these fallacies about bugs, the act of apologizing reinforces them. Your teammates can’t know what you really believe; they can only know what you say and do.

Everyone knows that all code has bugs. Code is written under constraints. Deadlines. Goals other than quality. Imperfect knowledge of the future. Even your own skill as an engineer is a constraint. If we all tried to write perfect, bugless code, we’d never accomplish anything. So how does it make sense to apologize for bugs?

This rule I’ve made for myself forces me to distinguish between problems caused by constraints and problems caused by my own faults. If I really think I caused a problem through some discrete action (or lack of action), then that’s something I’ll apologize for. But if I wrote code that got something done, and it just so happens that it didn’t work in a given situation, then I have nothing to apologize for. There was always bound to be something.

Make a resolution not to apologize for bugs. Especially if you’re in a leadership position. It’s a simple way to tweak attitudes about mistakes and failure in a positive way.

Do-nothing scripting: the key to gradual automation

Every ops team has some manual procedures that they haven’t gotten around to automating yet. Toil can never be totally eliminated.

Very often, the biggest toil center for a team at a growing company will be its procedure for modifying infrastructure or its procedure for provisioning user accounts. Partial instructions for the latter might look like this:

  1. Create an SSH key pair for the user.
  2. Commit the public key to Git and push to master.
  3. Wait for the build job to finish.
  4. Find the user’s email address in the employee directory.
  5. Send the user their private key via 1Password.

This is a relatively short example. Sometimes there are 20 steps in the process. Sometimes there are branches and special cases to keep track of as you go. Over time, these procedures can become unmanageably large and complex.

Procedures like this are frustrating because they’re focus-intensive yet require very little thought. They demand our full attention, but our attention isn’t rewarded with interesting problems or satisfying solutions – just another checkbox checked. I have a word for a procedure like this: a slog.

We know that this procedure is ripe for automation. We can easily see how to automate any given step. And we know that a computer could carry out the instructions with far greater speed and accuracy than we can, and with less tendency toward practical drift.

However, automating slogs sometimes feels like an all-or-nothing proposition. Sure, we could write a script to handle step 2, or step 5. But that wouldn’t really make the procedure any less cumbersome. It would lead to a proliferation of single-purpose scripts with different conventions and expectations, and you’d still have to follow a documented multi-step procedure for using those scripts.

This perception of futility is the problem we really need to solve in order to escape from these manual slogs. I’ve found an approach that works pretty reliably: do-nothing scripting.

Do-nothing scripting

Almost any slog can be turned into a do-nothing script. A do-nothing script is a script that encodes the instructions of a slog, encapsulating each step in a function. For the example procedure above, we could write the following do-nothing script:

import sys

def wait_for_enter():
    raw_input("Press Enter to continue: ")

class CreateSSHKeypairStep(object):
    def run(self, context):
        print("Run:")
        print("   ssh-keygen -t rsa -f ~/{0}".format(context["username"]))
        wait_for_enter()

class GitCommitStep(object):
    def run(self, context):
        print("Copy ~/new_key.pub into the `user_keys` Git repository, then run:")
        print("    git commit {0}".format(context["username"]))
        print("    git push")
        wait_for_enter()

class WaitForBuildStep(object):
    build_url = "http://example.com/builds/user_keys"
    def run(self, context):
        print("Wait for the build job at {0} to finish".format(self.build_url))
        wait_for_enter()

class RetrieveUserEmailStep(object):
    dir_url = "http://example.com/directory"
    def run(self, context):
        print("Go to {0}".format(self.dir_url))
        print("Find the email address for user `{0}`".format(context["username"]))
        context["email"] = raw_input("Paste the email address and press enter: ")

class SendPrivateKeyStep(object):
    def run(self, context):
        print("Go to 1Password")
        print("Paste the contents of ~/new_key into a new document")
        print("Share the document with {0}".format(context["email"]))
        wait_for_enter()

if __name__ == "__main__":
    context = {"username": sys.argv[1]}
    procedure = [
        CreateSSHKeypairStep(),
        GitCommitStep(),
        WaitForBuildStep(),
        RetrieveUserEmailStep(),
        SendPrivateKeyStep(),
    ]
    for step in procedure:
        step.run(context)
    print("Done.")

This script doesn’t actually do any of the steps of the procedure. That’s why it’s called a do-nothing script. It feeds the user a step at a time and waits for them to complete each step manually.

At first glance, it might not be obvious that this script provides value. Maybe it looks like all we’ve done is make the instructions harder to read. But the value of a do-nothing script is immense:

  • It’s now much less likely that you’ll lose your place and skip a step. This makes it easier to maintain focus and power through the slog.
  • Each step of the procedure is now encapsulated in a function, which makes it possible to replace the text in any given step with code that performs the action automatically.
  • Over time, you’ll develop a library of useful steps, which will make future automation tasks more efficient.

A do-nothing script doesn’t save your team any manual effort. It lowers the activation energy for automating tasks, which allows the team to eliminate toil over time.