5 production surprises worth investigating

As an SRE, I’m a vocal believer in following one’s nose: seeking out surprising phenomena and getting to the bottom of them. By adopting this habit, we can find and fix many classes of problems before they turn into incidents. Over time, this makes things run much more smoothly.

But what is surprising? How do we decide what to ignore and what to chase down?

Here are 5 kinds of surprise that are worth digging into in production data.

Keep in mind that this list is based on one SRE’s experience. Someone else, with different intuitions about system behavior, would make a different list. And that list would be more valuable to them than this list.

The most important thing is not that we always pick the right surprises to investigate. It’s that we make time, consistently, to seek out and explain surprises.

Anyway. Here’s my list.

1. Long-running requests

Pick a time window. Sort all the requests during that window descending by duration. At the top of this list, you’ll find the classes of requests that spent an extraordinarily long time in your system.
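
As a rough sketch, assuming your request logs can be exported with a duration field (the file and column names here are made up), the exercise might look like this in pandas:

import pandas as pd

# Hypothetical request-log export, one row per request. The file name and the
# columns ("route", "duration_ms", "started_at") are illustrative.
logs = pd.read_csv("requests.csv", parse_dates=["started_at"])

# Pick a time window.
window = logs[logs["started_at"].between("2024-05-01 09:00", "2024-05-01 10:00")]

# Sort descending by duration and look at the slowest requests.
print(window.sort_values("duration_ms", ascending=False).head(50)[["route", "duration_ms"]])

# Group by route to see which classes of request dominate the slow tail.
print(window.groupby("route")["duration_ms"].quantile(0.99).sort_values(ascending=False).head(10))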

Were you expecting to see these specific kinds of requests? If not, then congratulations: you’re surprised. And this is quite likely a surprise worth investigating, because

  1. while these slow requests are running, they’re eating up resources (CPU, memory, sockets, and so on), which gives them an outsized potential to degrade the performance of other, unrelated requests.
  2. without intervention, the underlying problem will probably get worse over time. Why wouldn’t it? It has already gotten this bad. And if it gets bad enough, it will eventually cause customer pain.
  3. many of the phenomena contributing to this slowness probably affect a variety of requests. So, if you can make these requests perform better, you may be able to make other requests perform better too.

We don’t have to rigorously analyze and optimize every slow request that goes through the system. Just by coming to understand the issues that cause our slowest requests to be slow, we often find opportunities to make high-impact improvements and mitigate the risk of ending up on an incident call later.

2. Saturation

Saturation is when units of work are waiting in a queue to be processed – be they web requests, or IP packets, or threads, or what have you.

To the extent we’re trying to serve each request as quickly as possible, saturation anywhere is waste. We shouldn’t expect any saturation. Or, at least, we should strive to thoroughly understand every source of it. If we don’t understand it, it will probably get worse over time.

Find every metric that represents a saturation. Put them all on one dashboard. Look at that dashboard every day.

3. Crashes

Every time an exception bubbles up to the top of the stack, that’s unexpected. If it were expected, then the code would have caught and handled the exception.

A tool like Sentry or Datadog Error Tracking (I mention these specific products only because I know them, not because they’re necessarily the best) lets you classify crashes, and see which ones are new or getting more frequent. Granted: it’s often the case that there are far too many classes of crash to investigate all or even most of them. But this shouldn’t stop us from chasing down some of them, according to our intuition about their relative importance.

We have imperfect information until we get our hands dirty, so we won’t always end up picking the highest-impact investigations to undertake. But fixing crashes is always better than ignoring them, and by dividing up the extant crashes into groups and prioritizing one group at a time, we can create some very useful guarantees.

For example, if we find all the crashes that correspond to deadlocks, and fix those, then we guarantee that our system has no deadlocks. Thereafter, problems caused by deadlocks – which tend to be subtle, elusive, and therefore time-consuming to solve – can no longer quietly sow chaos.

4. Utilization spikes

It’s expected for a utilization metric (for example, network bandwidth or memory utilization) to move around a lot. But it’s not expected for utilization to exhibit large spikes. When this happens, it means some recurring event is using more than its fair share of some resource.

When utilization hits 100%, things will degrade catastrophically and immediately. So if you have utilization spikes representing 20% of capacity, then steady-state utilization only has to be 80% before catastrophic degradations start occurring. Or, if the spikes get worse, and now they’re 30% of capacity, then 70% becomes the new safe limit. We’ll have to waste at least 30% of our capacity by leaving it idle in order to prevent outages.

So why not understand the spikes now, before they get worse?

5. Correlation between latency and throughput

When we see recurring swings in latency metrics, it’s easy to say to ourselves, “Yeah, well, when we’re under more load, things slow down.” But… why? Systems don’t always behave like that! They do it when there’s a bottleneck. If the system were behaving as expected and no resources were exhausted, then the latency graph would be more or less flat.

In other words, if traffic tends to be higher mid-day on weekdays, and latency also tends to rise mid-day on weekdays, then you’re likely saturating at least one resource. Would you rather find that bottleneck now, or during an incident?
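
If you want to put a number on that relationship, one quick check is to export throughput and latency for the same window from your metrics system and correlate them. A minimal sketch (the file and column names are made up):

import pandas as pd

# Hypothetical per-minute export from the metrics system; column names are illustrative.
metrics = pd.read_csv("metrics_export.csv")  # columns: requests_per_s, p50_latency_ms

r = metrics["requests_per_s"].corr(metrics["p50_latency_ms"])
print(f"throughput/latency correlation: {r:.2f}")
# A strongly positive correlation is a hint that at least one resource saturates under load.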

I hope this list is a good place to start if you’re looking to dig up problems. But, ultimately, you must be the judge of what’s surprising to you. Find something you can’t explain. Explain it. Repeat.

Platform teams don’t need to act like companies

Lately you see a lot of software company R&D teams organized around internal products. The Search Team provides a Search service and its “customers” are the teams whose code consumes that service. The Developer Productivity Team’s product is a suite of tools for managing local development environments and running tests, and its “customers” are the developers who use those tools. And so on.

The idea behind this organizational trick seems to be that product-oriented teams will act like companies. They’ll make better strategic decisions because their goals will be aligned with those of their internal customers.

This is pretty silly.

If a company satisfies the needs of its customers and potential customers, it can grow and thrive and put resources into attracting more revenue. If a company fails to satisfy customer needs, it loses out to competitors. As a company changes its strategy, the market provides feedback. These dynamics don’t exist for an internal team – especially a platform (or “infrastructure,” or “ops”) team.

When asked to think like a company in this way, a platform team will define their product as something like, “a reliable, performant, and secure platform for running software products.” Fine – but think of all the things that a company can do that you can’t do:

  • You can’t attract new customers
  • You can’t pivot to a different product
  • You can’t focus on a different market segment
  • You can’t sacrifice some set of customers to attract another

If you try to “think like a company” as a platform team, you will try to do what successful tech companies do: launch attractive features. But you can waste a lot of time that way, because the most important feature of a software platform is boringness.

If the platform team must act like a company, then it should act like a struggling company that’s desperate to keep its last few enterprise customers. This implies a very different strategy than that of a startup. Change as little as possible and put the bulk of your energy into efforts that increase boringness: documenting, simplifying, and fixing. Don’t build anything exciting if you can avoid it, and if you must build something, build it in service of boringness.

Don’t build Cities of Nowhere

I just finished reading Seeing Like A State: How Certain Schemes to Improve the Human Condition Have Failed by James C. Scott (full text online). I highly recommend it. Through examples ranging from Soviet collectivization to the construction of Brasilia, the book argues that grand, centralized planning efforts in the high modernist tradition are all doomed to failure. One simply can’t substitute pure reason – no matter how beautiful and internally consistent – for local human decision-making informed by direct experience.

To take one striking anecdote, Le Corbusier spent some time lobbying Soviet intelligentsia to implement his redesign of Moscow. However:

Stalin’s commissars found his plans for Moscow as well as his project for the Palace of Soviets too radical. The Soviet modernist El Lissitzky attacked Le Corbusier’s Moscow as a “city of nowhere, … [a city] that is neither capitalist, nor proletarian, nor socialist, … a city on paper, extraneous to living nature, located in a desert through which not even a river must be allowed to pass (since a curve would contradict the style).” As if to confirm El Lissitzky’s charge that he had designed a “city of nowhere,” Le Corbusier recycled his design virtually intact—aside from removing all references to Moscow—and presented it as La ville radieuse, suitable for central Paris.

Seeing Like A State – James C. Scott
Le Corbusier's vision for the ideal Moscow. Symmetric, vertical, orderly.
A map of actual Moscow. Asymmetric, spread out across either side of the Moskva, disorderly.

In Scott’s book, this pattern plays out over and over. Planners, relying heavily on what they imagine are universal principles, produce designs for human life that are nevertheless completely at odds with how humans actually live and work. These designed spaces possess a symmetric, holistic beauty which blinds their creators to the needs of the infinitely complex human communities that are meant to populate them. The planned city ultimately fails to improve the human condition, since improving the human condition is one of the many considerations which must bow to the planner’s aesthetic.

Toward the end of the book – although this is only a short passage and certainly not the thrust – Scott gives 4 rules of thumb for development planning. Building up a SaaS product is clearly different in many ways (not least the stakes) from planning human development. But the parallels got me thinking in engineering terms, and I find that these rules also work quite well as rules of thumb for making changes to a complex software system. By following them, we can mostly avoid wasting effort on huge endeavors that end up being Cities of Nowhere.

1. Take small steps

In an experimental approach to social change, presume that we cannot know the consequences of our interventions in advance. Given this postulate of ignorance, prefer wherever possible to take a small step, stand back, observe, and then plan the next small move.

In software, taking small steps is a challenge of discipline. We work with pure thought-stuff. In principle, we can build whatever we can imagine, so it’s always tempting to solve more of the problem.

But taking small steps has by now become the common wisdom in our industry. Single-feature pull requests are encouraged over massive, multifaceted ones. We roll out features to small groups of users before ramping up. Prototypes and MVPs abound.

Where we still have much to learn from Scott is the “stand back, observe” part. Often, we’re tempted to simply let the machine do the observing for us: if there’s anything wrong with our change, the integration tests will fail, or the deploy will fail, or we’ll get an alert. While such automated signals are indispensable, they’re not sufficient. To understand the real-world effects of our small changes, we have to exercise the further discipline of curiosity. With our particular change in mind, we have to search diligently for evidence of its effects, both intended and unintended, direct and indirect. Observability is not enough – we must actively observe.

2. Favor reversibility

Prefer interventions that can easily be undone if they turn out to be mistakes. Irreversible interventions have irreversible consequences. Interventions into ecosystems require particular care in this respect, given our great ignorance about how they interact. Aldo Leopold captured the spirit of caution required: “The first rule of intelligent tinkering is to keep all the parts.”

It’s pretty clear how this reversibility consideration applies to deploying software and infrastructure. Most changes should be trivially reversible by “rolling back” the deploy. Where this is impossible (such as in certain classes of database migrations and infrastructure changes), we come up with more case-specific back-out plans, or we end up inventing reversible patterns despite ourselves. This amounts to an implicit recognition that our changes can always have unexpected consequences. Which is good!

But, in a socio-technical system, the technology isn’t the only thing that gets altered over time. We must also favor reversibility with respect to the social elements – with respect to procedures, policies, and organizational structures.

One pattern I like for this is an experiment ledger. As a team, you keep a running register (e.g. in a spreadsheet) of the different experiments you’re trying. These can be anything from a new recurring meeting to a new on-call rotation to a rearrangement of your kanban columns. Each experiment in the ledger has one or more check-in dates, when the team will discuss the results of the experiment and decide whether to keep going or abandon the course.

Of course, for many reasons, not every change can be reversible. Not least because even after you reverse something, the taste stays in people’s mouths. But taken together with the rest of Scott’s advice, reversibility is a sensible attribute to strive for.

3. Plan on surprises

Choose plans that allow the largest accommodation to the unforeseen. In agricultural schemes this may mean choosing and preparing land so that it can grow any of several crops. In planning housing, it would mean “designing in” flexibility for accommodating changes in family structures or living styles. In a factory it may mean selecting a location, layout, or piece of machinery that allows for new processes, materials, or product lines down the road.

No matter how much time and sweat you put into the design of a system – no matter how much of the problem you try to solve a priori – there will always be surprises. It’s just the nature of a complex system, and even more so for a system with inputs you can’t control (e.g. customer traffic patterns).

Therefore, watch carefully for both expected and unexpected results. That’s what “plan on surprises” means to me: make small, reversible changes, and in the meantime look closely for new unexpected behaviors that you can investigate and understand. This will give you much more insight into your system’s abilities and constraints than any application of pure thought.

4. Plan on human inventiveness

Always plan under the assumption that those who become involved in the project later will have or will develop the experience and insight to improve on the design.

Write with clarity and humility on the motivations for your designs. Explain what you did and what you chose not to do, and why. The reasons for a particular design are never self-evident, no matter what cosmic beauty they may have in your head.

Taken together, Scott’s rules sketch out a pragmatic philosophy for managing the evolution of complex systems. Instead of grand redesigns that attempt to solve all problems at once, we should prefer targeted, reversible changes. We should change significant things about the system only when we can fully explain why it’s necessary, and afterward we should exercise diligence and curiosity in making sure we understand what we changed.

The time horizon outlook: an alternative to technical debt

In Technical debt is not a thing, I argue that we should stop using the metaphor of technical debt to inform strategy. Instead, I propose a time horizon model, in which our goal as engineers (or what have you) is to produce the most value possible over some time window.

In the technical debt model, we identify some inefficiency in our workflow process and trace it back to some prior decision that necessitated this inefficiency. That decision, we say, entailed an accumulation of technical debt. We traded away our future productivity for an earlier delivery date. If we want that productivity back, we must “pay off” the debt by rectifying that decision.

In the time horizon model, by contrast, we don’t worry about how we arrived at the existing sociotechnical system. Over the history of the product, we’ve made many decisions to sacrifice completeness for expediency. In making those decisions we went down a path that ultimately led us to the system as it exists today. We can’t go back and choose a different path: there’s only forward.

Let’s say we’ve got a SaaS product that relies on manually generated TLS certificates. We have to do 2 hours of toil every 3 months to renew these certs.

If we believe in technical debt, we might look back at the decision to make cert renewal a manual process and say, “By not automating this, we took on technical debt. We must pay off this debt.” We’d make a ticket, give it the technical-debt tag, and eventually pick it up as part of our 20% time commitment to technical debt paydown.

By contrast, in the time horizon model, our team’s stated raison d’être is simply to produce the most value possible within our agreed-upon time window.

So instead, we’ll say something like “Manual cert renewal costs 2 hours of labor every 3 months. It would take us 15 hours of work to automate.” Those 15 hours could instead be spent delivering value directly, so we should only undertake this project if it will free us up to deliver more total value between now and the time horizon:

Our time horizon | The estimated time investment (which is also the opportunity cost) | The payoff over the time horizon | Our decision
3 months | 15 hours | 2 hours | Don't do it
1 year | 15 hours | 8 hours | Don't do it
3 years | 15 hours | 24 hours | Maybe do it
5 years | 15 hours | 40 hours | Maybe do it
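
Here’s the same comparison as a toy calculation, just to make the arithmetic explicit (the numbers mirror the table; the function is purely illustrative, and it ignores the delay while the automation is being built):

# 2 hours of toil every 3 months vs. a 15-hour automation project.
def payoff_hours(horizon_months, toil_hours=2, toil_period_months=3):
    """Hours of toil avoided over the horizon if we automate now."""
    return toil_hours * horizon_months / toil_period_months

investment_hours = 15

for horizon_months in (3, 12, 36, 60):
    saved = payoff_hours(horizon_months)
    verdict = "maybe do it" if saved > investment_hours else "don't do it"
    print(f"{horizon_months:>2}-month horizon: invest {investment_hours}h to save {saved:.0f}h -> {verdict}")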

Of course, just because a given time investment passes the time horizon test doesn’t necessarily mean we should make that investment. We still need to compare it to the set of other efforts we could undertake, and devise an overall strategy that maximizes the value we’ll deliver over our time window.

The horizon model gives us a basis for making these comparisons, and lets us establish a lower bound for the expected return on our time investments. It helps us focus on the right things.

Using Little’s Law to scale applications

In a complex application, there are queues everywhere. Which is lucky, in a way, because it means we can use queueing theory to slice through a whole class of Gordian knots.

One of queueing theory’s most general insights is Little’s Law:

L = λW

L is the long-term average number of customers in the system, λ is the long-term average arrival rate of new customers, and W is the average amount of time that customers spend waiting.

In the parlance of queueing theory, “customer” doesn’t just mean “customer.” It means whatever unit of work needs to pass through the system. A customer can be a phone call or an IP packet or a literal customer at a grocery store or any one of infinitely many other things. As long as there are pieces of work that arrive, get queued, get processed, and then exit the system*, Little’s Law works. It’s breathtakingly general.

As an illustration, let me share an anecdote from my job.

*and as long as you’re not hitting a queue size limit

How many web servers do we need?

I’m on a team that’s responsible for a web app that looks more or less like this:

Requests come in from the Internet to the load balancer. The load balancer forwards the requests to a bunch of web servers, each of which, in turn, distributes requests among 6 independent worker threads. The worker threads run the business logic and send responses back up the stack. Pretty straightforward.

When a web server receives a request, it hands that request off to one of its worker threads. Or, if all the worker threads are busy, the request gets queued in a backlog to be processed once capacity becomes available.

If everything’s hunky dory, the backlog should be empty. There should always be idle capacity, such that requests never have to wait in a queue. But one day I noticed that the backlog wasn’t empty: requests were getting queued at peak traffic, across the whole fleet.

Clearly we needed to scale up the number of web servers. But scale it up to what? I could have used trial and error, but instead, I turned to Little’s Law.

The first step was to establish the mapping between entities in my system and the general mathematical objects related by Little’s Law:

  • L: the number of in-flight requests. In other words, requests that have arrived at the load balancer and for which responses haven’t yet been sent back to the user.
  • λ: the rate at which new requests arrive at the load balancer.
  • W: the average request latency.

What I wanted to know – and didn’t have a metric for – was L. I did have a metric in my telemetry system for W, the average request latency.

While I didn’t exactly have a metric for λ, the arrival rate of requests, I did have the completion rate of requests (i.e. how many requests per second were being served). The long-term average arrival rate can’t differ from the completion rate, since every request does exit the system eventually. Therefore I was able to use the completion rate as a stand-in for λ. Here’s what I found (these aren’t the actual numbers):

L = λW
(average occupancy) = (arrival rate)(average wait time)
(average occupancy) = (1000 req/s)(340ms)
(average occupancy) = 340 requests

I chose an arrival rate close to the peak throughput of the system. This still works as a “long-term average,” though, since the interval between arrivals (on the order of 1 millisecond) is much less than the duration of the average request (on the order of 300 milliseconds).

So, according to Little’s Law, at peak-traffic times, there will be on average 340 requests in flight in the system. Sometimes more, sometimes less, but on average 340. From there, it was trivial to see why requests were getting queued:

(average web server occupancy) = (average occupancy) / (number of web servers)
(average web server occupancy) = (340 requests) / (40)
(average web server occupancy) = 8.5

If you recall that each web server maintains 6 worker threads, you’ll see the problem. No matter what fancy stuff we try to do with queueing disciplines or load balancing algorithms or whatever, there will be on average 2.5 queued requests per web server.

Little’s Law can also tell us what we need to scale up to if we want to extract ourselves from this mire:

(total worker threads) ≥ (arrival rate)(average wait time)
(number of web servers)(worker threads per web server) ≥ 340

So we can either scale up the number of web servers or the number of worker threads per web server until their product is at least 340.
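
Putting the whole calculation in one place, with the same illustrative numbers as above:

import math

# Little's Law: L = λW
arrival_rate = 1000        # requests per second (completion rate standing in for λ)
avg_latency_s = 0.340      # average request latency, W
threads_per_server = 6
servers_now = 40

avg_occupancy = arrival_rate * avg_latency_s   # L = 340 requests in flight, on average
print(f"requests per web server: {avg_occupancy / servers_now:.1f}")             # 8.5, versus 6 threads
print(f"minimum web servers: {math.ceil(avg_occupancy / threads_per_server)}")   # 57, before any headroom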

Little’s Law is about long-term averages

This is only a lower bound, of course.

Little’s Law tells us about the long-term average behavior of a queueing system: nothing else. From moment to moment, the occupancy of the system will vary around this average. So, in the example above, we need to provision enough worker capacity to absorb these variations.

How much extra capacity do we need? Little’s Law can’t tell us. The answer will depend on the idiosyncrasies of our system and the traffic that passes through it. Different systems have different latency distributions, different arrival time distributions, different queueing disciplines, and so on. These variables all have some effect on our worst case and our 99th-percentile occupancy. So, in most cases, it’ll be important to get a sense for the empirical ratio between a system’s average occupancy and its occupancy at whatever percentile you decide to care about. But Little’s Law is still a very helpful tool.

If you do have a good sense of how worst-case occupancy varies with the average, you might even be able to use Little’s Law to inform your autoscaling strategy. As long as the system’s arrival rate changes on a timescale much longer than the average request latency (such that you’re still working with long-term averages), you can rely on

L = λW

to predict your capacity requirements. Or, at least, I think you could. I haven’t tried it yet.
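
For what it’s worth, here’s a sketch of what that might look like (untested, with an arbitrary headroom factor standing in for that empirical ratio):

import math

def desired_servers(arrival_rate_rps, avg_latency_s, threads_per_server, headroom=1.5):
    """Predict fleet size from Little's Law (L = λW), padded by an empirically
    chosen headroom factor to absorb variation around the long-term average."""
    expected_occupancy = arrival_rate_rps * avg_latency_s
    return math.ceil(expected_occupancy * headroom / threads_per_server)

# e.g. 1000 req/s at 340 ms average latency, 6 worker threads per server:
print(desired_servers(1000, 0.340, 6))   # 85 servers with 1.5x headroom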

Technical debt is not a thing

In software engineering, we have this concept called “technical debt.”

When we talk about technical debt, we usually mean that, because of a given technical decision, we’ll be making “interest payments” out of our productivity until the original “debt” is paid off. These payments can take many forms, from manual toil to risk to cognitive load. Anything that saps our productivity on a recurring basis.

Technical debt is a metaphor. It’s a conceptual tool we use to grapple with elusive, abstract stuff like productivity and value. That abstract stuff gets grounded to a more concrete family of concepts: namely, money and debt.

All of which is well and good, if the metaphor helps us reason. But it doesn’t. The technical debt metaphor serves only to obscure and encumber. We must discard it.

Direct and indirect work

We use the technical debt metaphor when we’re talking about strategy: our plan for undertaking certain pieces of work in a certain order so as to achieve some goal.

The most straightforward kind of strategy for a software engineering team would be to do nothing but build high-value features. Pick the highest-value feature that can be built in a sprint, build, repeat. Get paychecks, win promotions.

This strategy works for some teams. But for most, it’s naïve. If we only do direct work (what I’m calling, more-or-less, feature work: work that creates value for customers), problems start to emerge. We find ourselves repeating manual procedures, fixing more user-facing bugs, and fighting more fires. All this unplanned work demands our labor, diverting it from direct work. It thereby reduces the effective rate at which we can create value: our team’s capacity (“velocity”, if you Agile).

In order to keep our capacity from slipping, we have to do indirect work. Examples of indirect work are many and varied: writing internal documentation, caring for test suites and CI pipelines, building metric dashboards, refactoring old crufty code.

Indirect work uses some of our capacity. But we hope that, by folding indirect work into our strategy, we can avoid a lot of disruptive and expensive unplanned work that would otherwise emerge.

However, on its own, indirect work doesn’t tend to float to the top of the priority list. We usually prioritize tasks based on how much value they directly create for customers. It’s hard to even talk about direct and indirect work in the same conversation. It feels as though they create forms of value that are incommensurable: impossible to prioritize with respect to each other according to a single metric. Therefore, direct work tends to elbow indirect work out of the roadmap.

Nevertheless, we know from experience what happens when we neglect to do enough indirect work. Everything falls apart and we fail.

How we use the technical debt metaphor

The metaphor of technical debt is mainly a tool for working around the incommensurability of direct and indirect work. It’s how we try to balance tasks whose output is an amount of value (direct work) and tasks whose output is an increase in the rate at which we can create value (indirect work).

“Rapid progress isn’t free,” we say to management. “It’s achieved by taking out a sort of loan. We get to ship sooner, but in exchange we take on an obligation to make recurring payments, in the currency of labor.” These recurring payments only touch the interest. They will never, on their own, pay down the principal. To do that, we need to undertake some more substantial effort. A “paydown” project that will liberate us from the waste – the toil, or risk, or cognitive load – that our technical debt entails.

The technical debt metaphor feels true. And it carries some useful baggage from its origins in finance to its new home in software engineering. Namely: neglecting your debts is short-sighted and irresponsible. Almost everybody agrees on this, at least with respect to real debt. So, if the metaphor holds, why shouldn’t it be true of tech debt too?

Thanks to the implicit association between paying down debts and responsibility, engineers and management both get what they want. Engineers get to spend time doing work that benefits them directly. In other words, they get a respite from the feeling of alienation from their labor product. Management gets a plausible justification for prioritizing work that keeps their team happy. It’s a win-win, right?

How the metaphor fails us

If you take out a real loan, your payments are usually proportional to the initial principal. It’s a rough proportionality, since interest rates and maturity dates vary, but in general: the bigger the loan, the higher the payments. Mortgage payments are higher than car payments.

This fact about loans has an important implication. Since the labor you have to expend to pay off a loan is (roughly) proportional to the amount of your recurring payment, loans are always worth paying off. If you have a tiny loan, it will take a tiny amount of labor to pay off, so it’s worth doing even though the interest payments are tiny. If you have a huge loan, then your interest payments constitute a significant financial drain, which makes the loan worth paying off even though it will take a lot of labor to do it. So we all share an intuition about loans: there’s no responsible alternative to paying them off.

This intuition implicitly tags along with the debt metaphor, because of how metaphor works, into the engineering space. We feel we must pay off our debts. But since technical debt paydown tasks never seem to make it onto the roadmap by themselves, we have to come up with some system to force them into the pipeline. To spend a certain proportion of our labor paying down technical debt, even though we won’t have any new features to show for it.

The number cited most often is 20%: spending a fifth of our time feels like about what it takes to pay down the right amount of tech debt. So we throw all the tech debt into a big bucket and try to devote about 20% of our labor to tasks in that bucket. After all, it’s debt: we can’t just not pay it.

But the dynamics of technical debt are quite different from those of real debt. Different enough that this intuition no longer stands up to reality.

How much are we paying in interest?

Imagine, for example, that a team decides to forgo automating a software product’s release process. By accepting a manual release process instead, they’re able to launch a week earlier than otherwise. But, since they lack an automated release process, they need to do an extra t hours of work each month.

The decision to accept a manual release process in order to launch earlier could be classified as taking on technical debt. Which means, if we apply our intuition about real debts, that it must be paid off. So it will go into the “tech debt” bucket and get done as part of the team’s 20% time. Regardless of the value of t. In fact, on most teams (in my experience), t isn’t even discussed. The task is qualitatively tech debt, so it goes in the bucket.

But you have to know t. If t is high enough, then there’s an enormous drain on our productivity which must be rectified urgently. If t is low enough, the task might not even be worth doing at all.

How much will it cost to pay down the principal?

Furthermore, in order to know whether it’s worth paying down a given bit of technical debt, we need to know how much it will cost to pay down.

Going back to the example from above: suppose it will take P hours of labor to automate the release process. If P is low enough, the automation is worth building. But if P is very high, it’s better to accept the toil and move on.

P doesn’t tend to be discussed either. Or, rather, first we decide that something is technical debt, and only when it comes time to prioritize it with respect to the other technical debt in the bucket do we ask about P.

Rejecting the technical debt metaphor

So, in order to strategize – to decide what work to do in what order – we need to estimate, at the very least:

  • t: The amount of value that doing a given task will create (in the release-automation example, the recurring toil of t hours per month that it would eliminate)
  • P: The amount of labor required to do the task

Fine. But are these parameters not both necessary for evaluating feature work too? And if we need to consider them anyway, regardless of whether a task is technical debt, then what’s the point of calling some tasks technical debt paydown and others not?

The defining characteristic of technical debt is the decision to sacrifice some amount of future productivity in exchange for earlier delivery. But this decision is not unique to technical debt: it’s universal to engineering. Every change we ship – whether for a feature, or a bug, or a technical debt payment – every change carries an implicit decision about how much value to defer and how much to deliver now. “Technical debt” is just what we call it when the deferred value takes the form of a decrease in our team’s velocity. But that form doesn’t make it special or different, or incommensurable with other work.

When we make strategic decisions – decisions about what work to do in what order – we need a guiding principle. “Spend at least 20% of our time paying down tech debt” is not a principle. It’s an excuse to substitute our arbitrary personal taste for a realistic, evidence-based appraisal of cost and value.

And yet, something must be done.

But okay: so what if the technical debt discourse is useless? We’re dissatisfied with how much time we lose to recurring low-value-add tasks. Without the technical debt metaphor, how can we make the case that it’s worth investing time to eliminate these drains on our productivity?

One way is to adopt the simple model of a time horizon. Suppose a team declares the following as their goal:

To produce the most value possible over the coming year.

This team has a time horizon of 1 year. So, when there comes a proposal to make a particular improvement to their productivity (be it a process fix or some automation or whatever else), they can estimate:

  • How many hours of labor it will take to implement
  • How many weeks it will take to complete that labor
  • How many hours of labor it will save them over the year (the remainder of the year, that is, after subtracting the weeks spent implementing)

In this way, productivity investments can be evaluated each on their own merit, rather than being lumped together into the “technical debt” pile. The investments with the highest expected returns will emerge as clear priorities. And many proposals can be rejected entirely, if it’s determined that they require more labor than they can possibly save over a year.
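
As a toy illustration of that kind of ranking (every number and task name below is made up):

# Rank productivity proposals against a 1-year time horizon.
horizon_weeks = 52

proposals = [
    # (name, hours to implement, weeks until it's done, hours saved per week once done)
    ("automate cert renewal", 15, 1, 2 / 13),      # ~2 hours every 3 months
    ("parallelize the CI suite", 60, 3, 4),
    ("rewrite the billing service", 900, 26, 6),
]

def net_hours(invest_h, weeks_to_finish, saved_per_week):
    """Hours of labor freed up over the remainder of the year, minus the hours invested."""
    return saved_per_week * max(horizon_weeks - weeks_to_finish, 0) - invest_h

for name, invest_h, weeks, saved in sorted(proposals, key=lambda p: -net_hours(*p[1:])):
    print(f"{name}: {net_hours(invest_h, weeks, saved):+.0f} hours over the year")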

With an agreed-upon time horizon, direct work and indirect work become commensurable. We get a more coherent strategy, and we don’t waste time wringing our hands over what’s technical debt and what’s not. The technical debt metaphor falls away, obsolete.

Learning about attention deficit through spreadsheets

I struggle with attention deficit. Often this manifests through Twitter. I’ll find myself scrolling it with no memory of how I got there. And no particular interest. Just scrolling.

This is not great. First of all, it’s a waste of time. But that’s okay – I’ve made a decision to continue having a Twitter account, and some amount of wasted time is a consequence of that. Sue me: I love posting.

The bigger problem is that whenever I unintentionally end up on Twitter, even if I only end up being there for a couple seconds, my whole mental stack gets disrupted. I lose context. If the thing I was doing was a thing that required a lot of abstraction juggling – such as coding or writing – the bowling pins may have already tumbled down by the time I even realize I’m scrolling Twitter again.

Impulse helps a lot. It gives me a trail of breadcrumbs to follow back to my prior train of thought. But still: the mental context has largely dissipated. I have to build my momentum back up again. If this happens many times a day, it’s a serious leak of labor power.

Spreadsheet!!!

Like all problems, this one can be solved with spreadsheets.

Okay, not solved. But maybe understood. Maybe. In part.

About a week ago, I logged out of Twitter on my work laptop. Now, every time I go to check Twitter, instead of wonderful, infinitely scrolling little posts, I see this:

Not posts

This immediately jars me out of my reverie. “Whoops!” I think, “This isn’t what I meant to be doing.” I can go back to the task at hand instead of wasting time on Twitter. Of course, it’s already too late for the cognitive context – that turned to salt the instant I opened the new browser tab. But at least now I can learn something from these attention faults.

I started a spreadsheet called Twitter Attempts. Over the workday, every time I see the Twitter login screen, I take it as a reminder to update this Twitter Attempts spreadsheet. I log the timestamp of the attention fault, as well as a note about what I was doing right beforehand. Then I go back to what I was doing. The spreadsheet is nothing fancy: just a Timestamp column and a Notes column.

Observations

I’ve found this technique pretty illuminating.

Attention faults have precipitating events

I used to think of attention deficit as something like getting in a car accident. My attention is puttering along all lah-dee-dah, then pow! a different thing is happening.

The data lead(s) me to believe it’s not really like that. Instead of being spontaneous and random, attention faults are almost invariably preceded by some specific precipitating event. In other words, when I see the Twitter login screen and redirect myself to my spreadsheet, I can nearly always recall a particular event that seems to have “caused” the attention fault.

I call this sequence – precipitating event followed by attention fault – an attentional sneeze.

gross

If you’ll forgive the grossness of the metaphor, you could kind of think of a precipitating event as the pepper and checking Twitter as the snot. Sorry.

I review my spreadsheet at the end of every week. Here are some (synthesized; not verbatim) examples of the kinds of “peppers” I tend to find in the Notes column:

  • Just submitted a pull request
  • Just received an annoying Slack message
  • Just shifted from an emotionally fraught topic to a safe topic in meeting
  • Just finished writing a paragraph of documentation

An attention fault basically never just feels like, “I don’t know what happened! I was in the middle of typing a sentence and suddenly I was on Twitter.” This is surprising and I didn’t know it before. Hooray for spreadsheets!

Peppers fall into some categories

Here are some examples, straight from the data, of some of these pepper events:

  • somebody talking to me in zoom => that person talking to someone else instead
  • shift from asking in slack a question about git, to refactoring my log pipeline work
  • emotional shift from vulnerable team convo to administrivia
  • finished mtg w/ arthur
  • shift from asking for help to waiting for a reply

As I look at these entries and many more like them, there seem to emerge a few patterns. Peppers usually involve things like:

  1. A shift between pragmatic and intuitive modes of thought
  2. An emotional shift
  3. The beginning of an interval of waiting
  4. The completion of a task

These categories overlap quite a bit. In any case, they’re all boundaries between one mental state and another.

Sliding activation energy

Some days I feel centered and focused; other days I feel flustered and preoccupied. Distractable.

I’ve noticed that, on days characterized by high distractibility, the bar for what counts as “completion of a task” (the 4th type of pepper event) gets much lower. For example, suppose I’m replying to comments on a pull request. On a low-distractibility day I might finish replying to all the comments in the PR before checking Twitter. On a high-distractibility day, however, I might check Twitter after every reply! (But still – never in the middle of one.)

Limitations, plans

I don’t know where to go with any of this. It’s food for thought.

I will say that this particular well of data dries up pretty fast. There’s a positive feedback loop: I’m not logged into Twitter, so my mind eventually acclimates to this and substitutes some other type of flinch instead. As a result, fewer Twitter login attempts; as a result of that, stronger reinforcement that Twitter is no longer a viable distraction.

Another limitation of this avenue of investigation is its vulnerability to the availability heuristic. Just because I think a given attentional sneeze was caused by a particular pepper event, doesn’t mean it really was. But hey – all I have to work with is my own perceptions.

At any rate, the exercise has already been pretty worth it. And if any of this sounds interesting, I’d be super keen to hear how this experiment goes for you!

Takeaways: managing attention with stacks

Last week I wrote about a system I’ve been using to manage my attention. It’s inspired by the call stack, but it presupposes no knowledge of computer science. Since I need a name for it, I’ve been calling it Impulse.

Since that post, I’ve been using Impulse more seriously. It has been super illuminating. My practices are in rapid flux at the moment, but the core workflow has more or less solidified to the following:

  • Each line represents a task.
  • A task can have child tasks, which appear above the parent task and indented.
  • The top line is always the thing I’m currently working on.

At first, I thought Impulse was just a to-do list flipped upside down. But now I think it’s more than that.

Making a to-do list involves writing down all the things you need to do, and committing to do all of them. There are many who find this comforting! The list represents a finite sequence of discrete steps, and once you’ve gone through all the steps, you’re done. You can relax.

For me, it’s quite the opposite. I get paralyzed thinking about all the possible activities I could potentially do that might be valuable. If I let myself, I’ll sit there for an hour just silently panicking about what should be on the list. Should I tidy up my desk today? Should I clean the bathroom? Should I put my DuoLingo on here? Should I add a task to reply to that email from my Dad? Should I add breaks? Now the list is way too long to get done before lunch. Should I cut some things? Which things are most important? Which things are most urgent? Oh shoot, I also need to get groceries today. Which means I don’t have time to… and so on. It’s exhausting and it doesn’t get me anywhere.

With Impulse, the focus is always on the top line. That’s always the thing I’m currently doing, no matter what. I don’t have to worry so much about whether everything important is on the list, or whether it’s in the right order. I can just bang away at the thing on top, abiding in faith that if I simply spend time doing things, things will get done.

Here’s another way to look at it: with Impulse, the process of choosing what to do is decoupled from the process of doing it.

This makes work much less overwhelming. When I’m working, I never need to worry about how my current task fits into the big picture, or whether I’ve skipped a step, or whether there’s something more valuable I could be doing instead. There is only one thing to do: the line at the top of the stack.

I’ve been using Impulse for a whole bunch of purposes. As a to-do list when I’m puttering around the house:

dishes
fold laundry
throw out coffee cup
fold baby laundry
clean roomba, put back
outline followup blog post on impulse

To break down a task into pieces and make it more manageable and help keep track of where I am:

        wire clipper
        new strings
        something to put under guitar neck
        tuner
    get equipment together
        loosen all strings
        cut all strings
        throw old strings away
    take off old strings
    ...
restring guitar

To keep track of things I want to follow up on later:

write widget function
    reply to RFC comments
    tidy up desk
    order thai food
review at end of day

And I use it in many other ways that are less well defined but no less intuitive.

Current implementation

Right now I’m just using Vim and Git. The stack is a text file in a Git repository, and I also maintain some buckets as auxiliary text files. Here’s a snapshot of the buckets I’m currently using:

  • pink: (called “pink” for historical reasons) Tasks that are ready to start, which don’t demand sustained focus, and which can probably be done in under 15 minutes. Examples: check whether there's a better time for that weekly meeting, fix dead link on website, read that Stephen Malkmus interview I saw
  • blue: (called “blue” for historical reasons) Tasks that are ready to start, which do require sustained focus, and which will probably take longer than 30 minutes. Examples: read RFC that Tom sent, connect anti-tip kit to bookshelf, watch latest Philosophytube video
  • ongoing: BIGGER projects that are currently in flight. I try to keep this limited to 2, and I’ll pull tasks from ongoing into the stack when I have time. I use ongoing to think through the next few steps of a project, and those steps are organized in the same way as the stack. Examples: write impulse followup blog post, document disaster recovery testing procedure, prepare slide show for shitposting Zoom
  • project: Projects that are not in flight. I pull these into ongoing as it empties out.

In addition to the files listed above, I have a few templates, which I can pull into the stack to represent tasks that I need to do repeatedly. Some of these templates are:

  • matins: A sequence of activities I do every morning when I sit down at my desk: personal email, work email, work slack, and planning out how I’m going to spend my day.
  • vespers: A sequence of (much less structured) activities I do at the end of work.
  • restring_guitar: The process for restringing a guitar, organized into an Impulse task. It’s a thing I do so infrequently that I always forget how to do it. So Impulse is a nice way both to save that information, and to learn the process more thoroughly.

This Vim/Git approach works… okay. There’s definitely some friction. Sometimes parent tasks get separated from their children because I’ve made a mistake moving things around. Syncing is a pain, because I have to remember to pull and commit and push. But despite these and other difficulties, I still find it super helpful to work from a stack.

The software angle

As soon as I started using Impulse consistently, I ran into the age-old problem of consensus. I have both a work laptop and a personal laptop, and I use Impulse on both. It immediately became clear that the whole thing falls apart as soon as there’s more than one stack.

So the biggest challenge I can foresee in building a software implementation of Impulse is distributed consensus. I have some ideas about how to handle it, which may one day come to take the form of an eventually consistent object store built on top of Git. But I think I need to ignore the consensus problem for now.

Instead, I’ll try to hone a super streamlined ncurses-based UI for a single, in-memory Impulse instance:

--- Moving the Cursor

j ↓     move cursor down
k ↑     move cursor up
h ←     move cursor to parent
l →     move cursor to child
t       move cursor to top

--- Moving tasks

J ⇧↓    move task down (among its siblings)
K ⇧↑    move task up (among its siblings)
H ⇧←    move task left (make it a child of the task that's currently its grandparent)
L ⇧→    move task right (make it a child of the sibling directly above it)

--- Changing tasks

c       add child task(s)
s       add sibling task(s)
d       delete task
Enter   edit task name

--- Etc.

?       help (this message)

This should keep me busy for a while. Then I’ll see where I’m at.

Managing attention with stacks

I have ADHD and anxiety. They interact, and it’s impossible to say where one ends and the other begins.

I can focus on a task, but if I get distracted, it’s often very hard to bring myself back to whatever I was doing. The cost of “context switching” – rebuilding the mental edifice of what I was doing and why and what I was going to do next – is exceptionally high for me. Sometimes the context switching is so cumbersome that I get distracted again before it’s done. Depending on the degree of anxiety bouncing around in my brain, this cycle may just repeat until I run out of day.

So there’s that, and then there’s also this: I’m mostly in technical leadership roles these days. I spend a lot of time in meetings and Slack threads and emails, which means I don’t get much heads-down time in a given week. And what heads-down time I do get is often split up into hours here and half hours there. Plus I have a kid now, which, while wonderful and great, is another way that my calendar gets all chopped up.

In a given week, then, if I want to accomplish any significant effort that requires focus (usually writing, research, or coding), I need to minimize the cost of context switching. To this end I’ve lately been using a system inspired by the call stack. (Note: I say “inspired” because it is not a direct analog of the call stack; please don’t expect it to be.)

Suppose I’m coding. I open my text editor to write code, but I also open another text editor to an empty file, visible on the same monitor. I call this empty document the stack. Whatever the thing is that I need to do, I write it in the stack:

implement widget feature

Implementing the widget feature is a multi-step process, though, so I need to create a few children of the implement widget feature task:

	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

Children of a task appear above the task and indented. The three lines above implement widget feature are its children.

When working with this stack system, the top item on the stack is always the thing I’m currently doing. So, in the example at hand, I’m currently working on the task write definition and doc string for widget function.

But maybe, when I get to work writing the doc string for the widget function, I remember that I saw a good blog post about how to design functions like this. Maybe I want to reread that blog post before I commit to an interface. That’s fine! I can push (add at the top) a line for that:

		reread that blog post i saw
	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

Now the top item in the stack is reread that blog post i saw. So I load up the post and start reading.

Of course, after I’ve read a few paragraphs, I see a Slack notification that I need to attend to: an alert about API latency. Normally, I would go deal with that, and by the time I got back to my terminal I’d have to rebuild all my context. But with this stack system, I can just add an interrupt: a task that temporarily steals the top slot. Like so:

deal with API latency alert
		reread that blog post i saw
	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

As always, the thing I’m currently doing occupies the top line in the stack. So now I’ll hang out in Slack and Datadog for a while, digging into this alert. That may involve adding children of the deal with API latency alert task, or it may not, depending on how much I need to hold in my head. But let’s say it doesn’t – my teammate says she’s on it, and she doesn’t need help.

So I come back to the stack and pop that task off it. By “pop” I just mean delete the line. Since my current task is always the topmost line in the stack, I always delete (pop) the top line. Now I’m back to the widget work:

		reread that blog post i saw
	write definition and doc string for widget function
	write widget test
	implement widget function such that it passes tests
implement widget feature

I don’t have to worry about remembering what I was in the middle of. The top line is reread that blog post i saw, so I can immediately jump back into that. When I’m done, I pop that task and move on to whatever’s now on top. And so on and so on, popping lines as I finish them, adding lines as I think of followup tasks or feel the need to break down tasks by adding children.
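
If it helps to see those mechanics spelled out, here’s a toy model of the text-file convention in Python (just a sketch, not the actual tool):

# The stack is a list of (indent, text) lines, top of stack first.
stack = []

def push(text, indent=0):
    """Add a task at the top of the stack. Children use indent = parent's indent + 1."""
    stack.insert(0, (indent, text))

def pop():
    """Finish the current task: the top line is always the thing being worked on."""
    return stack.pop(0)

push("implement widget feature")
push("implement widget function such that it passes tests", indent=1)
push("write widget test", indent=1)
push("write definition and doc string for widget function", indent=1)
push("reread that blog post i saw", indent=2)
push("deal with API latency alert")   # an interrupt steals the top slot
pop()                                 # my teammate's got it; back to the blog post

print("\n".join("\t" * indent + text for indent, text in stack))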

I’ve started a Github project called impulse to elaborate on this idea, but so far the README for impulse is entirely aspirational.

Anyway, that’s my stack system. If you have context switching difficulties like me, try it out! It’s cool.

Stop apologizing for bugs

Look out, honey, ’cause I’m using technology

Ain’t got time to make no apology

The Stooges, “Search and Destroy”

For the last year or so, I’ve made a conscious effort to stop apologizing for bugs in my code.

Apologizing for bugs is very tempting. I used to do it a lot. When my code was involved in a failure that screwed up a coworker’s day or caused a user-facing problem, I’d say “Whoops! Sorry! I should’ve thought about that.”

The motivation for apologizing is sound: you want to do your best for the team. In principle, you could have prevented a problem if you’d just done something slightly different. Through this lens, apologizing for bugs may seem innocuous. But it contributes to a bunch of cultural anti-patterns:

  • It reinforces the idea that any one person or piece of code can be blamed for a given failure. Short of malice, this is never the case.
  • It gives the impression that, when you wrote the code, you should have written it better. This is a counterfactual that rarely holds up to examination.
  • It positions shame as the correct emotion to feel about bugs in your code: if you were a better engineer – a better teammate – the bug wouldn’t exist.
  • If you’re a more senior engineer on your team, the effects of these anti-patterns are magnified: people see you apologizing for bugs, so they think that they should be striving to write bug-free code. They may feel ashamed if their code has bugs.

Even if you don’t intellectually believe any of these fallacies about bugs, the act of apologizing reinforces them. Your teammates can’t know what you really believe; they can only know what you say and do.

Everyone knows that all code has bugs. Code is written under constraints. Deadlines. Goals other than quality. Imperfect knowledge of the future. Even your own skill as an engineer is a constraint. If we all tried to write perfect, bugless code, we’d never accomplish anything. So how does it make sense to apologize for bugs?

This rule I’ve made for myself forces me to distinguish between problems caused by constraints and problems caused by my own faults. If I really think I caused a problem through some discrete action (or lack of action), then that’s something I’ll apologize for. But if I wrote code that got something done, and it just so happens that it didn’t work in a given situation, then I have nothing to apologize for. There was always bound to be something.

Make a resolution not to apologize for bugs. Especially if you’re in a leadership position. It’s a simple way to tweak attitudes about mistakes and failure in a positive way.