Concurrent locks and MultiXacts in Postgres

Pretty recently, I was troubleshooting a performance issue in a production Rails app backed by Postgres. There was this one class of query that would get slower and slower over the course of about an hour. The exact pathology is a tale for another time, but the investigation led me down the rabbit hole of a particular data structure in Postgres internals called a MultiXact.

Let’s look at what a MultiXact is, why it exists, and under what circumstances one comes into existence.

What the hell is a MultiXact?

A two-panel meme based on The Fifth Element. First panel shows Leeloo holding up her Multipass with the text, "Leeloo Dallas MultiXact". Second panel shows Korben, exasperated, saying "Yeah she knows it's a MultiXact"

In Postgres, it’s possible for two transactions (or indeed, even three!) to lock the same row at the same time, as long as their locks don’t conflict with one another. The rules for what kinds of locks conflict with what other kinds of locks are admirably documented.

When you lock a row in Postgres, what you do is, you put your transaction ID in the xmax field of the corresponding tuple in storage (more on xmax on the Cybertec blog). This way, anyone who comes along looking for this row will know that you already have it locked. The latecomer can then wait for the lock:
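
You can actually watch this happen: Postgres exposes xmin and xmax as hidden system columns on every table, and a plain SELECT from another session (which doesn’t block on row locks) will show you the stored values. A quick sketch, using the foo table from the examples below:

-- Run this from a second session while the locking transaction is still open.
-- xmax holds the locker’s transaction ID; 0 means nobody has locked
-- (or deleted) this version of the row.
SELECT xmin, xmax, * FROM foo WHERE x = 9;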

Transaction 768 runs:

SELECT *
FROM foo
WHERE x = 9
FOR UPDATE;

Transaction 768 now has a row lock. The row’s xmax field contains the value 768.

Then Transaction 769 runs:

SELECT *
FROM foo
WHERE x = 9
FOR UPDATE;

Transaction 769 retrieves the current row, sees that Transaction 768 already holds a lock that conflicts with the lock it wants, and waits for Transaction 768 to be over.

But what if two transactions both want to lock the same row at the same time, with lock modes that don’t conflict? For example:

Transaction 772 runs:

SELECT *
FROM foo
WHERE x = 9
FOR SHARE;

Afterward, Transaction 772 has a row lock. The row’s xmax field contains the value 772.

Then Transaction 773 runs:

SELECT *
FROM foo
WHERE x = 9
FOR SHARE;

What happens now?

Transaction 773 can’t just write its transaction ID into the xmax field. That would amount to preempting Transaction 772’s lock… which would defeat the whole point of locking. To solve this problem, Postgres creates a MultiXact. A MultiXact essentially bundles together some set of transactions so that those transactions can all lock the same row at the same time. Instead of a transaction ID, a new MultiXact ID is written to the row’s xmax.

Transaction 772 runs:

SELECT *
FROM foo
WHERE x = 9
FOR SHARE;

Then Transaction 773 runs:

SELECT *
FROM foo
WHERE x = 9
FOR SHARE;

Now both transactions have the row locked. The row’s xmax field is set to 14, which is a MultiXact ID. MultiXact 14 refers to transactions 772 and 773 by their transaction IDs.

Next, Transaction 773 runs:

COMMIT;

Transaction 773 is now over, but the row’s xmax value is still 14. Since MultiXacts are immutable, MultiXact 14 still refers to the now-defunct Transaction 773 as well as the ongoing Transaction 772.

Finally, Transaction 772 runs:

COMMIT;

With both transactions over, there are no remaining locks active on the row. Its xmax value is still 14, and will remain 14 until another transaction locks the row or the table is vacuumed.
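
If you want to see who is actually in the MultiXact that a row’s xmax points to, the pgrowlocks contrib extension is (as far as I know) the easiest way to peek from psql. A minimal sketch, assuming the foo table from the example above:

CREATE EXTENSION IF NOT EXISTS pgrowlocks;

-- For each currently locked row in foo: whether the locker recorded in xmax
-- is a MultiXact, which transaction IDs are members, and their lock modes.
SELECT locked_row, multi, xids, modes
FROM pgrowlocks('foo');

Run that from a third session while transactions 772 and 773 still hold their share locks and, if I’m reading the contrib docs right, you should see multi = true with both transaction IDs listed in xids.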

It bears repeating that MultiXacts are immutable. If transactions 104 and 108 both have row R locked as part of MultiXact 19, and transaction 117 locks row R too, transaction 117 can’t just join MultiXact 19. Instead, a new MultiXact with ID 20 is created, which contains 104, 108, and 117.

This means that every time an additional transaction wants to lock an already-locked row, Postgres has to write out a whole new MultiXact to a buffer. For large MultiXacts, the time-cost of all this reading and writing can become quite significant, especially since access to the underlying data region is subject to a set of global mutexes.
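
If you’re on Postgres 13 or newer and want a sense of how hard this machinery is working, the pg_stat_slru view breaks out activity for the SLRU caches, including the ones that back MultiXact storage. A rough sketch; interpreting which rows matter is left to you:

-- The rows whose names mention multixact cover the offset and member areas
-- where MultiXacts are stored; heavy blks_read/blks_written there suggests
-- MultiXact churn is costing real I/O.
SELECT name, blks_hit, blks_read, blks_written
FROM pg_stat_slru;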

Counterintuitively, a MultiXact doesn’t always involve multiple transactions.

When I first encountered MultiXacts in production, it was extra confusing, because I was seeing MultiXacts get created within a single transaction. “What gives?” I thought! “They’re supposed to be multi.”

Turns out, we were doing something like this:

BEGIN;
SELECT * FROM queue_jobs
WHERE id = 4
FOR SHARE;

SAVEPOINT foo;

SELECT *
FROM queue_jobs
WHERE id = 4
FOR UPDATE;

A savepoint doesn’t create a whole new transaction – it starts a subtransaction within the current one – but nevertheless, Postgres needs to keep track of the fact that the FOR UPDATE lock was taken after the savepoint, so that that lock can be relinquished in case of a subsequent ROLLBACK TO foo command. As a result, a new MultiXact gets created and its ID gets placed in the row’s xmax field.
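
If you want to watch it happen, here’s a hypothetical psql session that checks the hidden xmax column at each step (the exact IDs will differ, of course):

BEGIN;
SELECT * FROM queue_jobs WHERE id = 4 FOR SHARE;
SELECT xmax FROM queue_jobs WHERE id = 4;  -- our own transaction ID

SAVEPOINT foo;
SELECT * FROM queue_jobs WHERE id = 4 FOR UPDATE;
SELECT xmax FROM queue_jobs WHERE id = 4;  -- now a freshly created MultiXact ID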

I hope I’ve contributed to your understanding of MultiXacts in Postgres! It’s the sort of thing where you don’t ever need to think or know about it, until suddenly you do.

Squeeze the hell out of the system you have

About a year ago, I raised a red flag with colleagues and managers about Postgres performance. Our database was struggling to keep up with the load generated by our monolithic SaaS application. CPU utilization was riding between 60 and 80%, and at least once it spiked to 100%, causing a brief outage.

Now, we had been kicking the can down the road with respect to Postgres capacity for a long time. When the database looked too busy, we’d replace it with a bigger instance and move on. This saved us a lot of time and allowed us to focus on other things, like building features, which was great.

But this time, it wasn’t possible to scale the DB server vertically: we were already on the biggest instance. And we were about to overload that instance.

Lots of schemes were floated. Foremost among them:

  • Shard writes. Spin up a cluster of independent databases, and write data to one or the other according to some partitioning strategy.
  • Do micro-services. Split up the monolith into multiple interconnected services, each with its own data store that could be scaled on its own terms.

Both of these options are cool! A strong case can be made for either one on its merits. With write sharding, we could potentially increase our capacity by 2 or even 3 orders of magnitude. With micro-services, we’d be free to use “the right tool for the job,” picking data stores optimized to the requirements of each service workload. Either branch of the skill tree would offer exciting options for fault tolerance and operational resilience.

Either way, everyone had to agree: we’d outgrown our old, naïve implementation. Onward and upward! We can do hard things!

In situations like this, presented with a dazzling array of next-generation architecture options that can be built to last us through the decade, it’s easy to forget what our goal was: to get database performance under control.

Complexity costs attention.

Sometimes, leaps in complexity must be made. It’s generally a good problem to have. If enough demand is being placed on your system to render obsolete your existing technology, then even more growth is probably on the horizon! If you can just put in the investment and build the more advanced architecture now, then you’ll be looking at a bright future of unconstrained year-over-year success.

But don’t just consider the implementation cost. The real cost of increased complexity – often the much larger cost – is attention.

If you decide to shard across databases, then not only must you pay the money-, time-, and opportunity cost of building out the new architecture: you must also take the new complexity into account in every subsequent technical decision. Want to shard writes? Fine, but this complicates every future decision about backups, monitoring, migrations, the ORM, and network topology (just to name a few). And don’t get me started on micro-services.

Just think about how massive these costs are. How much feature delivery will have to be delayed or foregone to support the additional architectural complexity?

Always squeeze first

We should always put off significant complexity increases as long as possible.

When complexity leaps are on the table, there’s usually also an opportunity to squeeze some extra juice out of the system you have. By tweaking the workload, tuning performance, or supplementing the system in some way, you may be able to add months or even years of runway. When viable, these options are always preferable to building out a next-gen system.

Let’s return to the example of the overloaded Postgres instance. In that case, what we ended up doing was twofold:

  1. Two engineers (me and my colleague Ted – but mostly Ted) spent about 3 months working primarily on database performance issues. There was no silver bullet. We used our telemetry to identify heavy queries (the sketch just after this list gives the flavor), dug into the (Rails) codebase to understand where they were coming from, and optimized or eliminated them. We also tuned a lot of Postgres settings.
  2. Two more engineers cut a path through the codebase to run certain expensive read-only queries on a replica DB. This effort bore fruit around the same time as (1), when we offloaded our single most frequent query (a SELECT triggered by polling web clients).
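
To give a flavor of the query-hunting in (1) – this isn’t our exact tooling, just an illustration – Postgres’s pg_stat_statements extension is a common way to surface the statements eating the most database time:

-- Requires the pg_stat_statements extension to be installed and preloaded.
-- The column is total_exec_time on Postgres 13+; older versions call it total_time.
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;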

These two efforts together reduced the maximum weekly CPU usage on the database from 90% to 30%.

Now we can sleep at night. We have a huge amount of room to grow, both in terms of CPU headroom and our ability to shed load from the primary. And furthermore, since our work touched many parts of the codebase and demanded collaboration with lots of different devs, we now have a strong distributed knowledge base about the existing system. We’re well positioned to squeeze it even more if need be.

This doesn’t mean complexity is bad

Of course, I’m not saying complexity is bad. It’s necessary. Some day we’ll reach a fundamental limit of our database architecture, and before that day arrives, we’ll need to make a jump in complexity.

But until then, because we squeezed first, we get to keep working with the most boring system possible. This is by far the cheaper and more practical option.

Incident metrics tell you nothing about reliability

When an incident response process is created, there arise many voices calling for measurement. “As long as we’re creating standards for incidents, let’s track Mean-Time-To-Recovery (MTTR) and Mean-Time-To-Detection (MTTD) and Mean-Time-Between-Failures (MTBF)!” they say.

Usually there’s no further discussion on this point. Everyone agrees that more data is better (“but let’s make sure we interpret the data correctly, of course!”). As practitioners, it’s exciting to think that these metrics will finally let us demonstrate concrete results for all the effort we put into reliability. Middle management, for its part, looks forward to having a “north star” for reliability work. No more wishy-washy “we believe this project improved system reliability substantially” hand-waving: the initiative either moved the needle on MTBF or it didn’t.

The assumption underlying all this is that incident data can shed light on a system’s reliability. But this assumption is never warranted. Analysis of your incident data may shed light on your incidents. But incident data really can’t tell you anything about your reliability.

Picture this. Your team owns a web service. You’ve set up an alert on latency. If the 90th percentile latency exceeds 100 milliseconds, your team gets paged and you start an incident.

One day, you find out that, really, if that 90th percentile latency even gets above 50 milliseconds, it’s already a problem. Downstream services start getting sluggish, and user experience suffers. “No problem,” you say, lowering the alert threshold to 50 milliseconds.

Now you’re able to observe more failures. Or, if you prefer: you’ve widened the definition of “failure” to include more events. But the system’s reliability has in no way been altered. And yet, you will have more incidents. Possibly quite a few more! Mean-time-between-failures will drop.

Picture this. Every once in a while, the database gets into a “wedged” state, where a bunch of queries are piled up and work grinds to a halt. It’s simple to un-wedge it: just manually kill the query that caused the pileup, and everything else gets flushed through the system. Each time the database gets wedged, the database team gets paged, they start an incident, they find and kill the query, and the incident is over. These incidents happen semi-regularly, but at least they’re over quickly.

After a while, someone finally implements a fix. Those problematic queries? Gone! No more pointless incidents spun up just to manually clear out a pileup. And, more importantly: no more service degradations resulting from this problem. The reliability of the system has improved!

So what happens to the incident metrics? MTBF goes up, of course. That checks out: there are fewer incidents now. But what about MTTR? Before the fix, MTTR was low. These query-pileup incidents were frequent and short, so they kept mean-time-to-recovery anchored down. After the fix – which, remember, corresponded to a direct and pure improvement in reliability – mean-time-to-recovery is higher. Whoopsy daisy!

Picture this. You have a particular customer whose use-case tests the limits of your product’s capabilities. And furthermore, this customer is, let’s say, somewhat of a squeaky wheel. Every 2 or 3 weeks, they file a support ticket about something-or-other, and the problem is usually tricky enough that an incident gets spun up.

Eventually this customer has had enough. They leave. This represents both a sizable blow to revenue and a scathing indictment of your product’s reliability at scale. But, on the bright side, both MTTR and MTBF benefit enormously! That’ll look great on the quarterly slide deck.

When someone says they want to measure incident metrics (MTTR or MTBF or any other one in the family), I always want to ask them, “What’s your model?”

Many people rush to collect data without first considering whether they can come up with a model powerful enough to interpret the data. If our model is bad, then we’ll make bad decisions even with good data. And if, as is often the case, we have no explicit model at all, then our implicit model is just, “Reliability up, MTBF up, MTTR down.” Which, as we’ve seen, doesn’t match our observations even in straightforward, common scenarios.

Don’t waste time and energy collecting metrics you’ll never be able to use. Learn as much as you can from the failures themselves, and then move on.

5 production surprises worth investigating

As an SRE, I’m a vocal believer in following one’s nose: seeking out surprising phenomena and getting to the bottom of them. By adopting this habit, we can find and fix many classes of problems before they turn into incidents. Over time, this makes things run much smoother.

But what is surprising? How do we decide what to ignore and what to chase down?

Here are 5 kinds of surprise that are worth digging into in production data.

Keep in mind that this list is based on one SRE’s experience. Someone else, with different intuitions about system behavior, would make a different list. And that list would be more valuable to them than this list.

The most important thing is not that we always pick the right surprises to investigate. It’s that we make time, consistently, to seek out and explain surprises.

Anyway. Here’s my list.

1. Long running requests

Pick a time window. Sort all the requests during that window descending by duration. At the top of this list, you’ll find classes of requests that spent extraordinarily long in your system.
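
Purely as an illustration: if your request telemetry lands somewhere queryable – say a hypothetical request_logs table with path, started_at, and duration_ms columns – pulling the slowest requests for the last hour might look something like this:

-- request_logs and its columns are hypothetical; substitute whatever your
-- telemetry system actually provides.
SELECT path, started_at, duration_ms
FROM request_logs
WHERE started_at >= now() - interval '1 hour'
ORDER BY duration_ms DESC
LIMIT 50;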

Were you expecting to see these specific kinds of requests? If not, then congratulations: you’re surprised. And this is quite likely a surprise worth investigating, because

  1. while these slow requests are running, they’re eating up resources (CPU, memory, sockets, and so on), which gives them an outsized potential to degrade the performance of other, unrelated requests.
  2. without intervention, the underlying problem will probably get worse over time. Why wouldn’t it? It has already gotten this bad. And if it gets bad enough, it will eventually cause customer pain.
  3. many of the phenomena contributing to this slowness probably affect a variety of requests. So, if you can make these requests perform better, you may be able to make other requests perform better too.

We don’t have to rigorously analyze and optimize every slow request that goes through the system. Just by coming to understand the issues that cause our slowest requests to be slow, we often find opportunities to make high-impact improvements and mitigate the risk of ending up on an incident call later.

2. Saturation

Saturation is when units of work are waiting in a queue to be processed – be they web requests, or IP packets, or threads, or what have you.

To the extent we’re trying to serve each request as quickly as possible, saturation anywhere is waste. We shouldn’t expect any saturation. Or, at least, we should strive to thoroughly understand every source of it. If we don’t understand it, it will probably get worse over time.

Find every metric that represents a saturation. Put them all on one dashboard. Look at that dashboard every day.
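
To pick one concrete example from the Postgres world (one saturation signal among many): sessions stuck waiting on locks are exactly this kind of queued-up work, and you can count them directly. A sketch, assuming a reasonably recent Postgres:

-- Each backend with wait_event_type = 'Lock' is queued behind a lock
-- rather than doing useful work.
SELECT count(*) AS backends_waiting_on_locks
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';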

3. Crashes

Every time an exception bubbles up to the top of the stack, that’s unexpected. If it were expected, then the code would have caught and handled the exception.

A tool like Sentry or Datadog Error Tracking (I mention these specific products only because I know them, not because they’re necessarily the best) lets you classify crashes, and see which ones are new or getting more frequent. Granted: it’s often the case that there are far too many classes of crash to investigate all or even most of them. But this shouldn’t stop us from chasing down some of them, according to our intuition about their relative importance.

We have imperfect information until we get our hands dirty, so we won’t always end up picking the highest-impact investigations to undertake. But fixing crashes is always better than ignoring them, and by dividing up the extant crashes into groups and prioritizing one group at a time, we can create some very useful guarantees.

For example, if we find all the crashes that correspond to deadlocks, and fix those, then we guarantee that our system has no deadlocks. Thereafter, problems caused by deadlocks – which tend to be subtle, elusive, and therefore time-consuming to solve – can no longer quietly sow chaos.

4. Utilization spikes

It’s expected for a utilization metric (for example, network bandwidth or memory utilization) to move around a lot. But it’s not expected for utilization to exhibit large spikes. When this happens, it means some recurring event is using more than its fair share of some resource.

When utilization hits 100%, things will degrade catastrophically and immediately. So if you have utilization spikes representing 20% of capacity, then steady-state utilization only has to be 80% before catastrophic degradations start occurring. Or, if the spikes get worse, and now they’re 30% of capacity, then 70% capacity becomes the new safe limit. We’ll have to waste at least 30% of our capacity by leaving it idle in order to prevent outages.

So why not understand the spikes now, before they get worse?

5. Correlation between latency and throughput

When we see recurring swings in latency metrics, it’s easy to say to ourselves, “Yeah, well, when we’re under more load, things slow down.” But… why? Systems don’t always behave like that! They do it when there’s a bottleneck. If the system were behaving as expected and no resources were exhausted, then the latency graph would be more or less flat.

In other words, if traffic tends to be higher mid-day on weekdays, and latency also tends to rise mid-day on weekdays, then you’re likely saturating at least one resource. Would you rather find that bottleneck now, or during an incident?

I hope this list is a good place to start if you’re looking to dig up problems. But, ultimately, you must be the judge of what’s surprising to you. Find something you can’t explain. Explain it. Repeat.

Platform teams don’t need to act like companies

Lately you see a lot of software company R&D teams organized around internal products. The Search Team provides a Search service and its “customers” are the teams whose code consumes that service. The Developer Productivity Team’s product is a suite of tools for managing local development environments and running tests, and its “customers” are the developers who use those tools. And so on.

The idea behind this organizational trick seems to be that product-oriented teams will act like a company. They’ll make better strategic decisions because their goals will be aligned with those of their internal customers.

This is pretty silly.

If a company satisfies the needs of its customers and potential customers, it can grow and thrive and put resources into attracting more revenue. If a company fails to satisfy customer needs, it loses out to competitors. As a company changes its strategy, the market provides feedback. These dynamics don’t exist for an internal team – especially a platform (or “infrastructure,” or “ops”) team.

When asked to think like a company in this way, a platform team will define their product as something like, “a reliable, performant, and secure platform for running software products.” Fine – but think of all the things that a company can do that you can’t do:

  • You can’t attract new customers
  • You can’t pivot to a different product
  • You can’t focus on a different market segment
  • You can’t sacrifice some set of customers to attract another

If you try to “think like a company” as a platform team, you will try to do what successful tech companies do: launch attractive features. But you can waste a lot of time that way, because the most important feature of a software platform is boringness.

If the platform team must act like a company, then it should act like a struggling company that’s desperate to keep its last few enterprise customers. This implies a very different strategy than that of a startup. Change as little as possible and put the bulk of your energy into efforts that increase boringness: documenting, simplifying, and fixing. Don’t build anything exciting if you can avoid it, and if you must build something, build it in service of boringness.

Don’t build Cities of Nowhere

I just finished reading Seeing Like A State: How Certain Schemes to Improve the Human Condition Have Failed by James C. Scott (full text online). I highly recommend it. Through examples ranging from Soviet collectivization to the construction of Brasilia, the book argues that grand, centralized planning efforts in the high modernist tradition are all doomed to failure. One simply can’t substitute pure reason – no matter how beautiful and internally consistent – for local human decision-making informed by direct experience.

To take one striking anecdote, Le Corbusier spent some time lobbying Soviet intelligentsia to implement his redesign of Moscow. However:

Stalin’s commissars found his plans for Moscow as well as his project for the Palace of Soviets too radical. The Soviet modernist El Lissitzky attacked Le Corbusier’s Moscow as a “city of nowhere, … [a city] that is neither capitalist, nor proletarian, nor socialist, … a city on paper, extraneous to living nature, located in a desert through which not even a river must be allowed to pass (since a curve would contradict the style).” As if to confirm El Lissitzky’s charge that he had designed a “city of nowhere,” Le Corbusier recycled his design virtually intact—aside from removing all references to Moscow—and presented it as La ville radieuse, suitable for central Paris.

Seeing Like A State – James C. Scott
Le Corbusier's vision for the ideal Moscow. Symmetric, vertical, orderly.
A map of actual Moscow. Asymmetric, spread out across either side of the Moskva, disorderly.

In Scott’s book, this pattern plays out over and over. Planners, relying heavily on what they imagine are universal principles, produce designs for human life that are nevertheless completely at odds with how humans actually live and work. These designed spaces possess a symmetric, holistic beauty which blinds their creators to the needs of the infinitely complex human communities that are meant to populate them. The planned city ultimately fails to improve the human condition, since improving the human condition is one of the many considerations which must bow to the planner’s aesthetic.

Toward the end of the book – although this is only a short passage and certainly not the thrust – Scott gives 4 rules of thumb for development planning. Building up a SaaS product is clearly different in many ways (not least of which is the stakes) from planning human development. But the parallels got me thinking in engineering terms, and I find that these rules also work quite well as rules of thumb for making changes to a complex software system. By following them, we can mostly avoid wasting effort on huge endeavors that end up being Cities of Nowhere.

1. Take small steps

In an experimental approach to social change, presume that we cannot know the consequences of our interventions in advance. Given this postulate of ignorance, prefer wherever possible to take a small step, stand back, observe, and then plan the next small move.

In software, taking small steps is a challenge of discipline. We work with pure thought-stuff. In principle, we can build whatever we can imagine, so it’s always tempting to solve more of the problem.

But taking small steps has by now become the common wisdom in our industry. Single-feature pull requests are encouraged over massive, multifaceted ones. We roll out features to small groups of users before ramping up. Prototypes and MVPs abound.

Where we still have much to learn from Scott is the “stand back, observe” part. Often, we’re tempted to simply let the machine do the observing for us: if there’s anything wrong with our change, the integration tests will fail, or the deploy will fail, or we’ll get an alert. While such automated signals are indispensable, they’re not sufficient. To understand the real-world effects of our small changes, we have to exercise the further discipline of curiosity. With our particular change in mind, we have to search diligently for evidence of its effects, both intended and unintended, direct and indirect. Observability is not enough – we must actively observe.

2. Favor reversibility

Prefer interventions that can easily be undone if they turn out to be mistakes. Irreversible interventions have irreversible consequences. Interventions into ecosystems require particular care in this respect, given our great ignorance about how they interact. Aldo Leopold captured the spirit of caution required: “The first rule of intelligent tinkering is to keep all the parts.”

It’s pretty clear how this reversibility consideration applies to deploying software and infrastructure. Most changes should be trivially reversible by “rolling back” the deploy. Where this is impossible (such as in certain classes of database migrations and infrastructure changes), we come up with more case-specific back-out plans, or we end up inventing reversible patterns despite ourselves. This amounts to an implicit recognition that our changes can always have unexpected consequences. Which is good!

But, in a socio-technical system, the technology isn’t the only thing that gets altered over time. We must also favor reversibility with respect to the social elements – with respect to procedures, policies, and organizational structures.

One pattern I like for this is an experiment ledger. As a team, you keep a running register (e.g. in a spreadsheet) of the different experiments you’re trying. These can be anything from a new recurring meeting to a new on-call rotation to a rearrangement of your kanban columns. Each experiment in the ledger has one or more check-in dates, when the team will discuss the results of the experiment and decide whether to keep going or abandon the course.

Of course, for many reasons, not every change can be reversible. Not least because even after you reverse something, the taste stays in people’s mouths. But taken together with the rest of Scott’s advice, reversibility is a sensible attribute to strive for.

3. Plan on surprises

Choose plans that allow the largest accommodation to the unforeseen. In agricultural schemes this may mean choosing and preparing land so that it can grow any of several crops. In planning housing, it would mean “designing in” flexibility for accommodating changes in family structures or living styles. In a factory it may mean selecting a location, layout, or piece of machinery that allows for new processes, materials, or product lines down the road.

No matter how much time and sweat you put into the design of a system – no matter how much of the problem you try to solve a priori – there will always be surprises. It’s just the nature of a complex system, and even more so for a system with inputs you can’t control (e.g. customer traffic patterns).

Therefore, watch carefully for both expected and unexpected results. That’s what “plan on surprises” means to me: make small, reversible changes, and in the meantime look closely for new unexpected behaviors that you can investigate and understand. This will give you much more insight into your system’s abilities and constraints than any application of pure thought.

4. Plan on human inventiveness

Always plan under the assumption that those who become involved in the project later will have or will develop the experience and insight to improve on the design.

Write with clarity and humility on the motivations for your designs. Explain what you did and what you chose not to do, and why. The reasons for a particular design are never self-evident, no matter what cosmic beauty they may have in your head.

Taken together, Scott’s rules sketch out a pragmatic philosophy for managing the evolution of complex systems. Instead of grand redesigns that attempt to solve all problems at once, one should prefer targeted, reversible changes. We should change significant things about the system only when we can fully explain why it’s necessary, and afterward we should exercise diligence and curiosity in making sure we understand what we changed.

The time horizon outlook: an alternative to technical debt

In Technical debt is not a thing, I argue that we should stop using the metaphor of technical debt to inform strategy. Instead, I propose a time horizon model, in which our goal as engineers (or what have you) is to produce the most value possible over some time window.

In the technical debt model, we identify some inefficiency in our workflow process and trace it back to some prior decision that necessitated this inefficiency. That decision, we say, entailed an accumulation of technical debt. We traded away our future productivity for an earlier delivery date. If we want that productivity back, we must “pay off” the debt by rectifying that decision.

In the time horizon model, by contrast, we don’t worry about how we arrived at the existing sociotechnical system. Over the history of the product, we’ve made many decisions to sacrifice completeness for expediency. In making those decisions we went down a path that ultimately led us to the system as it exists today. We can’t go back and choose a different path: there’s only forward.

Let’s say we’ve got a SaaS product that relies on manually generated TLS certificates. We have to do 2 hours of toil every 3 months to renew these certs.

If we believe in technical debt, we might look back at the decision to make cert renewal a manual process and say, “By not automating this, we took on technical debt. We must pay off this debt.” We’d make a ticket, give it the technical-debt tag, and eventually pick it up as part of our 20% time commitment to technical debt paydown.

By contrast, in the time horizon model, our team’s stated raison d’être is simply to produce the most value possible within our agreed-upon time window.

So instead, we’ll say something like “Manual cert renewal costs 2 hours of labor every 3 months. It would take us 15 hours of work to automate.” Those 15 hours could instead be spent delivering value directly, so we should only undertake this project if it will free us up to deliver more total value between now and the time horizon:

Our time horizon    Estimated time investment (= opportunity cost)    Payoff over the horizon    Our decision
3 months            15 hours                                          2 hours                    Don’t do it
1 year              15 hours                                          8 hours                    Don’t do it
3 years             15 hours                                          24 hours                   Maybe do it
5 years             15 hours                                          40 hours                   Maybe do it
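
(The payoff column is just the 2 hours of avoided toil per quarter multiplied by the number of quarters in the horizon: 2 hours at 3 months, 8 hours over a year, 24 hours over 3 years, 40 hours over 5.)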

Of course, just because a given time investment passes the time horizon test doesn’t necessarily mean we should make that investment. We still need to compare it to the set of other efforts we could undertake, and devise an overall strategy that maximizes the value we’ll deliver over our time window.

The horizon model gives us a basis for making these comparisons, and lets us establish a lower bound for the expected return on our time investments. It helps us focus on the right things.

Using Little’s Law to scale applications

In a complex application, there are queues everywhere. Which is lucky, in a way, because it means we can use queueing theory to slice through a whole class of Gordian knots.

One of queueing theory’s most general insights is Little’s Law:

L = λW

L is the long-term average number of customers in the system, λ is the long-term average arrival rate of new customers, and W is the average amount of time that each customer spends in the system, from arrival to departure.

In the parlance of queueing theory, “customer” doesn’t just mean “customer.” It means whatever unit of work needs to pass through the system. A customer can be a phone call or an IP packet or a literal customer at a grocery store or any one of infinitely many other things. As long as there are pieces of work that arrive, get queued, get processed, and then exit the system*, Little’s Law works. It’s breathtakingly general.

As an illustration, let me share an anecdote from my job.

*and as long as you’re not hitting a queue size limit

How many web servers do we need?

I’m on a team that’s responsible for a web app that looks more or less like this:

Requests come in from the Internet to the load balancer. The load balancer forwards the requests to a bunch of web servers, each of which, in turn, distributes requests among 6 independent worker threads. The worker threads run the business logic and send responses back up the stack. Pretty straightforward.

When a web server receives a request, it hands that request off to one of its worker threads. Or, if all the worker threads are busy, the request gets queued in a backlog to be processed once capacity becomes available.

If everything’s hunky dory, the backlog should be empty. There should always be idle capacity, such that requests never have to wait in a queue. But one day I noticed that the backlog wasn’t empty. The total number of backlogged requests across the fleet looked like this:

Things were getting queued at peak traffic, so we needed to scale up the number of web servers. But scale it up to what? I could have used trial and error, but instead, I turned to Little’s Law.

The first step was to establish the mapping between entities in my system and the general mathematical objects related by Little’s Law:

  • L: the number of in-flight requests. In other words, requests that have arrived at the load balancer and for which responses haven’t yet been sent back to the user.
  • λ: the rate at which new requests arrive at the load balancer.
  • W: the average request latency.

What I wanted to know – and didn’t have a metric for – was L. I did have a metric in my telemetry system for W, the average request latency.

While I didn’t exactly have a metric for λ, the arrival rate of requests, I did have the completion rate of requests (i.e. how many requests per second were being served). The long-term average arrival rate can’t differ from the completion rate, since every request does exit the system eventually. Therefore I was able to use the completion rate as a stand-in for λ. Here’s what I found (these aren’t the actual numbers):

L = λW
(average occupancy) = (arrival rate)(average wait time)
(average occupancy) = (1000 req/s)(340ms)
(average occupancy) = 340 requests

I chose an arrival rate close to the peak throughput of the system. This still works as a “long-term average,” though, since the interval between arrivals (on the order of 1 millisecond) is much less than the duration of the average request (on the order of 300 milliseconds).

So, according to Little’s Law, at peak-traffic times, there will be on average 340 requests in flight in the system. Sometimes more, sometimes less, but on average 340. From there, it was trivial to see why requests were getting queued:

(average web server occupancy) = (average occupancy) / (number of web servers)
(average web server occupancy) = (340 requests) / (40)
(average web server occupancy) = 8.5

If you recall that each web server maintains 6 worker threads, you’ll see the problem. No matter what fancy stuff we try to do with queueing discipline or load balancing algorithm or whatever, there will be on average 2.5 queued requests at each web server.

Little’s Law can also tell us what we need to scale up to if we want to extract ourselves from this mire:

(total worker threads) ≥ (arrival rate)(average wait time)
(number of web servers)(worker threads per web server) ≥ 340

So we can either scale up web servers or worker threads per web server until their product is greater than 340.
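
With 6 worker threads per web server, that works out to at least 340 / 6 ≈ 57 web servers – up from the 40 we were running – or, equivalently, more worker threads per server until the product clears 340.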

Little’s Law is about long-term averages

This is only a lower bound, of course.

Little’s Law tells us about the long-term average behavior of a queueing system: nothing else. From moment to moment, the occupancy of the system will vary around this average. So, in the example above, we need to provision enough worker capacity to absorb these variations.

How much extra capacity do we need? Little’s Law can’t tell us. The answer will depend on the idiosyncrasies of our system and the traffic that passes through it. Different systems have different latency distributions, different arrival time distributions, different queueing disciplines, and so on. These variables all have some effect on our worst case and our 99th-percentile occupancy. So, in most cases, it’ll be important to get a sense for the empirical ratio between a system’s average occupancy and its occupancy at whatever percentile you decide to care about. But Little’s Law is still a very helpful tool.

If you do have a good sense of how worst-case occupancy varies with the average, you might even be able to use Little’s Law to inform your autoscaling strategy. As long as the system’s arrival rate changes much more gradually than the average latency of requests (such that you’re still working with long-term averages), you can rely on

L = λW

to predict your capacity requirements. Or, at least, I think you could. I haven’t tried it yet.

Technical debt is not a thing

In software engineering, we have this concept called “technical debt.”

When we talk about technical debt, we usually mean that, because of a given technical decision, we’ll be making “interest payments” out of our productivity until the original “debt” is paid off. These payments can take many forms, from manual toil to risk to cognitive load. Anything that saps our productivity on a recurring basis.

Technical debt is a metaphor. It’s a conceptual tool we use to grapple with elusive, abstract stuff like productivity and value. That abstract stuff gets grounded to a more concrete family of concepts: namely, money and debt.

All of which is well and good, if the metaphor helps us reason. But it doesn’t. The technical debt metaphor serves only to obscure and encumber. We must discard it.

Direct and indirect work

We use the technical debt metaphor when we’re talking about strategy: our plan for undertaking certain pieces of work in a certain order so as to achieve some goal.

The most straightforward kind of strategy for a software engineering team would be to do nothing but build high-value features. Pick the highest-value feature that can be built in a sprint, build, repeat. Get paychecks, win promotions.

This strategy works for some teams. But for most, it’s naïve. If we only do direct work (what I’m calling, more-or-less, feature work: work that creates value for customers), problems start to emerge. We find ourselves repeating manual procedures, fixing more user-facing bugs, and fighting more fires. All this unplanned work demands our labor, diverting it from direct work. It thereby reduces the effective rate at which we can create value: our team’s capacity (“velocity”, if you Agile).

In order to keep our capacity from slipping, we have to do indirect work. Examples of indirect work are many and varied: writing internal documentation, caring for test suites and CI pipelines, building metric dashboards, refactoring old crufty code.

Indirect work uses some of our capacity. But we hope that, by folding indirect work into our strategy, we can avoid a lot of disruptive and expensive unplanned work that would otherwise emerge.

However, on its own, indirect work doesn’t tend to float to the top of the priority list. We usually prioritize tasks based on how much value they directly create for customers. It’s hard to even talk about direct and indirect work in the same conversation. It feels as though they create forms of value that are incommensurable: impossible to prioritize with respect to each other according to a single metric. Therefore, direct work tends to elbow indirect work out of the roadmap.

Nevertheless, we know from experience what happens when we neglect to do enough indirect work. Everything falls apart and we fail.

How we use the technical debt metaphor

The metaphor of technical debt is mainly a tool for working around the incommensurability of direct and indirect work. It’s how we try to balance tasks whose output is an amount of value (direct work) and tasks whose output is an increase in the rate at which we can create value (indirect work).

“Rapid progress isn’t free,” we say to management. “It’s achieved by taking out a sort of loan. We get to ship sooner, but in exchange we take on an obligation to make recurring payments, in the currency of labor.” These recurring payments only touch the interest. They will never, on their own, pay down the principal. To do that, we need to undertake some more substantial effort. A “paydown” project that will liberate us from the waste – the toil, or risk, or cognitive load – that our technical debt entails.

The technical debt metaphor feels true. And it carries some useful baggage from its origins in finance to its new home in software engineering. Namely: neglecting your debts is short-sighted and irresponsible. Almost everybody agrees on this, at least with respect to real debt. So, if the metaphor holds, why shouldn’t it be true of tech debt too?

Thanks to the implicit association between paying down debts and responsibility, engineers and management both get what they want. Engineers get to spend time doing work that benefits them directly. In other words, they get a respite from the feeling of alienation from their labor product. Management gets a plausible justification for prioritizing work that keeps their team happy. It’s a win-win, right?

How the metaphor fails us

If you take out a real loan, your payments are usually proportional to the initial principal. It’s a rough proportionality, since interest rates and maturity dates vary, but in general: the bigger the loan, the higher the payments. Mortgage payments are higher than car payments.

This fact about loans has an important implication. Since the labor you have to expend to pay off a loan is (roughly) proportional to the amount of your recurring payment, loans are always worth paying off. If you have a tiny loan, it will take a tiny amount of labor to pay off, so it’s worth doing even though the interest payments are tiny. If you have a huge loan, then your interest payments constitute a significant financial drain, which makes the loan worth paying off even though it will take a lot of labor to do it. So we all share an intuition about loans: there’s no responsible alternative to paying them off.

This intuition implicitly tags along with the debt metaphor, because of how metaphor works, into the engineering space. We feel we must pay off our debts. But since technical debt paydown tasks never seem to make it onto the roadmap by themselves, we have to come up with some system to force them into the pipeline. To spend a certain proportion of our labor paying down technical debt, even though we won’t have any new features to show for it.

The number cited most often is 20%. 20% of our time feels like about how much it takes to pay down the right amount of tech debt. So we throw all the tech debt into a big bucket and try to devote about 20% of our labor to tasks in that bucket. After all, it’s debt: we can’t just not pay it.

But the dynamics of technical debt are quite different from those of real debt. Different enough that this intuition no longer stands up to reality.

How much are we paying in interest?

Imagine, for example, that a team decides to forgo automating a software product’s release process. By accepting a manual release process instead, they’re able to launch a week earlier than otherwise. But, since they lack an automated release process, they need to do an extra t hours of work each month.

The decision to accept a manual release process in order to launch earlier could be classified as taking on technical debt. Which means, if we apply our intuition about real debts, that it must be paid off. So it will go into the “tech debt” bucket and get done as part of the team’s 20% time. Regardless of the value of t. In fact, on most teams (in my experience), t isn’t even discussed. The task is qualitatively tech debt, so it goes in the bucket.

But you have to know t. If t is high enough, then there’s an enormous drain on our productivity which must be rectified urgently. If t is low enough, the task might not even be worth doing at all.

How much will it cost to pay down the principal?

Furthermore, in order to know whether it’s worth paying down a given bit of technical debt, we need to know how much it will cost to pay down.

Going back to the example from above: suppose it will take P hours of labor to automate the release process. If P is low enough, the automation is worth building. But if P is very high, it’s better to accept the toil and move on.

P doesn’t tend to be discussed either. Or, rather, first we decide that something is technical debt, and only when it comes time to prioritize it with respect to the other technical debt in the bucket do we ask about P.
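
Put roughly: over a planning horizon of H months (more on time horizons below), automating the release process is worth doing only if P is comfortably less than t × H, the toil it would save over that horizon. Neither number is knowable without actually asking.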

Rejecting the technical debt metaphor

So, in order to strategize – to decide what work to do in what order – we need to estimate, at the very least:

  • t: The amount of value that will be created by doing a given task (in the example above, the recurring toil that automation would eliminate)
  • P: The amount of labor required to do the task

Fine. But are these parameters not both necessary for evaluating feature work too? And if we need to consider them anyway, regardless of whether a task is technical debt, then what’s the point of calling some tasks technical debt paydown and others not?

The defining characteristic of technical debt is the decision to sacrifice some amount of future productivity in exchange for earlier delivery. But this decision is not unique to technical debt: it’s universal to engineering. Every change we ship – whether for a feature, or a bug, or a technical debt payment – every change carries an implicit decision about how much value to defer and how much to deliver now. “Technical debt” is just what we call it when the deferred value takes the form of a decrease in our team’s velocity. But that form doesn’t make it special or different, or incommensurable with other work.

When we make strategic decisions – decisions about what work to do in what order – we need a guiding principle. “Spend at least 20% of our time paying down tech debt” is not a principle. It’s an excuse to substitute our arbitrary personal taste for a realistic, evidence-based appraisal of cost and value.

And yet, something must be done.

But okay: so what if the technical debt discourse is useless? We’re dissatisfied with how much time we lose to recurring low-value-add tasks. Without the technical debt metaphor, how can we make the case that it’s worth investing time to eliminate these drains on our productivity?

One way is to adopt the simple model of a time horizon. Suppose a team declares the following as their goal:

To produce the most value possible over the coming year.

This team has a time horizon of 1 year. So, when there comes a proposal to make a particular improvement to their productivity (be it a process fix or some automation or whatever else), they can estimate:

  • How many hours of labor it will take to implement
  • How many weeks it will take to complete that labor
  • How many hours of labor it will save them over the year (the remainder of the year, that is, after subtracting the weeks spent implementing)

In this way, productivity investments can be evaluated each on their own merit, rather than being lumped together into the “technical debt” pile. The investments with the highest expected returns will emerge as clear priorities. And many proposals can be rejected entirely, if it’s determined that they require more labor than they can possibly save over a year.

With an agreed-upon time horizon, direct work and indirect work become commensurable. We get a more coherent strategy, and we don’t waste time wringing our hands over what’s technical debt and what’s not. The technical debt metaphor falls away, obsolete.

Learning about attention deficit through spreadsheets

I struggle with attention deficit. Often this manifests through Twitter. I’ll find myself scrolling it with no memory of how I got there. And no particular interest. Just scrolling.

This is not great. First of all, it’s a waste of time. But that’s okay – I’ve made a decision to continue having a Twitter account, and some amount of wasted time is a consequence of that. Sue me: I love posting.

The bigger problem is that whenever I unintentionally end up on Twitter, even if I only end up being there for a couple seconds, my whole mental stack gets disrupted. I lose context. If the thing I was doing was a thing that required a lot of abstraction juggling – such as coding or writing – the bowling pins may have already tumbled down by the time I even realize I’m scrolling Twitter again.

Impulse helps a lot. It gives me a trail of breadcrumbs to follow back to my prior train of thought. But still: the mental context has largely dissipated. I have to build my momentum back up again. If this happens many times a day, it’s a serious leak of labor power.

Spreadsheet!!!

Like all problems, this one can be solved with spreadsheets.

Okay, not solved. But maybe understood. Maybe. In part.

About a week ago, I logged out of Twitter on my work laptop. Now, every time I go to check Twitter, instead of wonderful, infinitely scrolling little posts, I see this:

Not posts

This immediately jars me out of my reverie. “Whoops!” I think, “This isn’t what I meant to be doing.” I can go back to the task at hand instead of wasting time on Twitter. Of course, it’s already too late for the cognitive context – that turned to salt the instant I opened the new browser tab. But at least now I can learn something from these attention faults.

I started a spreadsheet called Twitter Attempts. Over the workday, every time I see the Twitter login screen, I take it as a reminder to update this Twitter Attempts spreadsheet. I log the timestamp of the attention fault, as well as a note about what I was doing right beforehand. Then I go back to what I was doing. The spreadsheet looks roughly like so:

Observations

I’ve found this technique pretty illuminating.

Attention faults have precipitating events

I used to think of attention deficit as something like getting in a car accident. My attention is puttering along all lah-dee-dah, then pow! a different thing is happening.

The data lead(s) me to believe it’s not really like that. Instead of being spontaneous and random, attention faults are almost invariably preceded by some specific precipitating event. In other words, when I see the Twitter login screen and redirect myself to my spreadsheet, I can nearly always recall a particular event that seems to have “caused” the attention fault.

I call this sequence – precipitating event followed by attention fault – an attentional sneeze.

gross

If you’ll forgive the grossness of the metaphor, you could kind of think of a precipitating event as the pepper and checking Twitter as the snot. Sorry.

I review my spreadsheet at the end of every week. Here are some (synthesized; not verbatim) examples of the kinds of “peppers” I tend to find in the Notes column:

  • Just submitted a pull request
  • Just received an annoying Slack message
  • Just shifted from an emotionally fraught topic to a safe topic in meeting
  • Just finished writing a paragraph of documentation

An attention fault basically never just feels like, “I don’t know what happened! I was in the middle of typing a sentence and suddenly I was on Twitter.” This is surprising and I didn’t know it before. Hooray for spreadsheets!

Peppers fall into some categories

Here are some examples, straight from the data, of some of these pepper events:

  • somebody talking to me in zoom => that person talking to someone else instead
  • shift from asking in slack a question about git, to refactoring my log pipeline work
  • emotional shift from vulnerable team convo to administrivia
  • finished mtg w/ arthur
  • shift from asking for help to waiting for a reply

As I look at these entries and many more like them, there seem to emerge a few patterns. Peppers usually involve things like:

  1. A shift between pragmatic and intuitive modes of thought
  2. An emotional shift
  3. The beginning of an interval of waiting
  4. The completion of a task

These categories overlap quite a bit. In any case, they’re all boundaries between one mental state and another.

Sliding activation energy

Some days I feel centered and focused; other days I feel flustered and preoccupied. Distractable.

I’ve noticed that, on days characterized by high distractibility, the bar for what counts as “completion of a task” (the 4th type of pepper event) becomes a much lower bar. For example, suppose I’m replying to comments on a pull request. On a low-distractibility day I might finish replying to all the comments in the PR before checking Twitter. On a high-distractibility day, however, I might check Twitter after every reply! (But still – never in the middle of one.)

Limitations, plans

I don’t know where to go with any of this. It’s food for thought.

I will say that this particular well of data dries up pretty fast. There’s a positive feedback loop: I’m not logged into Twitter, so my mind eventually acclimates to this and substitutes some other type of flinch instead. As a result, fewer Twitter login attempts; as a result of that, stronger reinforcement that Twitter is no longer a viable distraction.

Another limitation of this avenue of investigation is its vulnerability to the availability heuristic. Just because I think a given attentional sneeze was caused by a particular pepper event, doesn’t mean it really was. But hey – all I have to work with is my own perceptions.

At any rate, the exercise has already been pretty worth it. And if any of this sounds interesting, I’d be super keen to hear how this experiment goes for you!