In “Technical debt is not a thing,” I argue that we should stop using the metaphor of technical debt to inform strategy. Instead, I propose a time horizon model, in which our goal as engineers (or what have you) is to produce the most value possible over some time window.
In the technical debt model, we identify some inefficiency in our workflow and trace it back to some prior decision that necessitated this inefficiency. That decision, we say, entailed an accumulation of technical debt. We traded away our future productivity for an earlier delivery date. If we want that productivity back, we must “pay off” the debt by rectifying that decision.
In the time horizon model, by contrast, we don’t worry about how we arrived at the existing sociotechnical system. Over the history of the product, we’ve made many decisions to sacrifice completeness for expediency. In making those decisions we went down a path that ultimately led us to the system as it exists today. We can’t go back and choose a different path: there’s only forward.
Let’s say we’ve got a SaaS product that relies on manually generated TLS certificates. We have to do 2 hours of toil every 3 months to renew these certs.
If we believe in technical debt, we might look back at the decision to make cert renewal a manual process and say, “By not automating this, we took on technical debt. We must pay off this debt.” We’d make a ticket, give it the technical-debt tag, and eventually pick it up as part of our 20% time commitment to technical debt paydown.
By contrast, in the time horizon model, our team’s stated raison d’être is simply to produce the most value possible within our agreed-upon time window.
So instead, we’ll say something like “Manual cert renewal costs 2 hours of labor every 3 months. It would take us 15 hours of work to automate.” Those 15 hours could instead be spent delivering value directly, so we should only undertake this project if it will free us up to deliver more total value between now and the time horizon:
[Table: our time horizon; the estimated time investment (which is also the opportunity cost); the payoff over the time horizon. The first two rows conclude “Don’t do it”; the last two conclude “Maybe do it.”]
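The time horizon test for the cert example comes down to a few lines of arithmetic. Here is a minimal sketch using the figures from the text (2 hours of toil every 3 months, 15 hours to automate). The candidate horizons are hypothetical values picked for illustration, and the sketch assumes toil is only counted at each completed 3-month renewal.

```python
TOIL_HOURS_PER_RENEWAL = 2     # from the example: 2 hours of manual work...
RENEWAL_INTERVAL_MONTHS = 3    # ...every 3 months
AUTOMATION_COST_HOURS = 15     # estimated investment (also the opportunity cost)


def payoff_hours(horizon_months: int) -> int:
    """Hours of toil avoided within the horizon if we automate now."""
    renewals = horizon_months // RENEWAL_INTERVAL_MONTHS
    return renewals * TOIL_HOURS_PER_RENEWAL


def passes_time_horizon_test(horizon_months: int) -> bool:
    """True when the payoff within the horizon exceeds the investment."""
    return payoff_hours(horizon_months) > AUTOMATION_COST_HOURS


# Hypothetical horizons, in months (illustrative values only).
for horizon in (6, 12, 24, 36):
    verdict = "Maybe do it" if passes_time_horizon_test(horizon) else "Don't do it"
    print(f"{horizon:>2} months: payoff {payoff_hours(horizon):>2} h "
          f"vs. cost {AUTOMATION_COST_HOURS} h -> {verdict}")
```

Under these assumptions the investment first pays for itself at the eighth renewal, which is a 24-month horizon. Anything shorter fails the test.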
Of course, just because a given time investment passes the time horizon test doesn’t necessarily mean we should make that investment. We still need to compare it to the set of other efforts we could undertake, and devise an overall strategy that maximizes the value we’ll deliver over our time window.
The horizon model gives us a basis for making these comparisons, and lets us establish a lower bound for the expected return on our time investments. It helps us focus on the right things.
In a complex application, there are queues everywhere. Which is lucky, in a way, because it means we can use queueing theory to slice through a whole class of Gordian knots.
One of queueing theory’s most general insights is Little’s Law:
L = λW
L is the long-term average number of customers in the system, λ is the long-term average arrival rate of new customers, and W is the average amount of time that a customer spends in the system (waiting plus being served).
In the parlance of queueing theory, “customer” doesn’t just mean “customer.” It means whatever unit of work needs to pass through the system. A customer can be a phone call or an IP packet or a literal customer at a grocery store or any one of infinitely many other things. As long as there are pieces of work that arrive, get queued, get processed, and then exit the system*, Little’s Law works. It’s breathtakingly general.
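To see the law hold on something small, here is a sketch that measures L directly (by integrating the number of customers in the system over time) and compares it with λW. The arrival and departure times are made up, and the check assumes the system is empty at both ends of the observation window, which is what makes the two sides match exactly.

```python
import math


def time_average_occupancy(visits, window):
    """Average number of customers in the system over [0, window]."""
    events = []
    for arrival, departure in visits:
        events.append((arrival, +1))
        events.append((departure, -1))
    events.sort()

    area = 0.0       # running integral of (customers in system) dt
    in_system = 0
    last_time = 0.0
    for time, change in events:
        area += in_system * (time - last_time)
        in_system += change
        last_time = time
    return area / window


# Made-up (arrival, departure) times; the system is empty at t=0 and t=6.
visits = [(0, 3), (1, 2), (2, 6), (4, 5)]
window = 6

L = time_average_occupancy(visits, window)                # measured directly
lam = len(visits) / window                                # arrival rate
W = sum(d - a for a, d in visits) / len(visits)           # avg time in system

print(f"L = {L:.2f}, lambda * W = {lam * W:.2f}")
assert math.isclose(L, lam * W)
```

Both sides come out to 1.5 here: four customers spend a combined 9 time units in the system over a window of 6.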
As an illustration, let me share an anecdote from my job.
*and as long as you’re not hitting a queue size limit
How many web servers do we need?
I’m on a team that’s responsible for a web app that looks more or less like this:
Requests come in from the Internet to the load balancer. The load balancer forwards the requests to a bunch of web servers, each of which, in turn, distributes requests among 6 independent worker threads. The worker threads run the business logic and send responses back up the stack. Pretty straightforward.
When a web server receives a request, it hands that request off to one of its worker threads. Or, if all the worker threads are busy, the request gets queued in a backlog to be processed once capacity becomes available.
If everything’s hunky dory, the backlog should be empty. There should always be idle capacity, such that requests never have to wait in a queue. But one day I noticed that the backlog wasn’t empty. The total number of backlogged requests across the fleet looked like this:
Things were getting queued at peak traffic, so we needed to scale up the number of web servers. But scale it up to what? I could have used trial and error, but instead, I turned to Little’s Law.
The first step was to establish the mapping between entities in my system and the general mathematical objects related by Little’s Law:
L: the number of in-flight requests. In other words, requests that have arrived at the load balancer and for which responses haven’t yet been sent back to the user.
λ: the rate at which new requests arrive at the load balancer.
W: the average request latency.
What I wanted to know – and didn’t have a metric for – was L. I did have a metric in my telemetry system for W, the average request latency.
While I didn’t exactly have a metric for λ, the arrival rate of requests, I did have the completion rate of requests (i.e. how many requests per second were being served). The long-term average arrival rate can’t differ from the completion rate, since every request does exit the system eventually. Therefore I was able to use the completion rate as a stand-in for λ. Here’s what I found (these aren’t the actual numbers):
I chose an arrival rate close to the peak throughput of the system. This still works as a “long-term average,” though, since the interval between arrivals (on the order of 1 millisecond) is much less than the duration of the average request (on the order of 300 milliseconds).
So, according to Little’s Law, at peak-traffic times, there will be on average 340 requests in flight in the system. Sometimes more, sometimes less, but on average 340. From there, it was trivial to see why requests were getting queued:
(average web server occupancy) = (average occupancy) / (number of web servers)
(average web server occupancy) = (340 requests) / (40)
(average web server occupancy) = 8.5
If you recall that each web server maintains 6 worker threads, you’ll see the problem. No matter what fancy stuff we try to do with queueing discipline or load balancing algorithm or whatever, there will be on average 2.5 queued requests per web server (8.5 in flight minus 6 being worked on).
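The arithmetic so far can be sketched as below. The arrival rate and latency are assumptions: they are consistent with the orders of magnitude mentioned above (arrivals roughly every millisecond, requests lasting roughly 300 milliseconds) and were chosen so that L comes out to 340, but they are not the real figures. The server and worker counts come from the text.

```python
ARRIVAL_RATE_PER_SECOND = 1000.0  # assumed for illustration
AVERAGE_LATENCY_SECONDS = 0.34    # assumed; chosen so that L is 340
WEB_SERVERS = 40                  # from the text
WORKERS_PER_SERVER = 6            # from the text


def average_in_flight(arrival_rate, average_latency):
    """Little's Law: L = lambda * W."""
    return arrival_rate * average_latency


def queued_per_server(in_flight, servers, workers_per_server):
    """Average requests per server beyond what its workers can hold.

    This is the same back-of-the-envelope estimate as in the text: it
    works purely with averages and ignores moment-to-moment variation.
    """
    occupancy = in_flight / servers
    return max(0.0, occupancy - workers_per_server)


in_flight = average_in_flight(ARRIVAL_RATE_PER_SECOND, AVERAGE_LATENCY_SECONDS)
occupancy = in_flight / WEB_SERVERS
backlog = queued_per_server(in_flight, WEB_SERVERS, WORKERS_PER_SERVER)

print(f"{in_flight:.0f} in flight, {occupancy:.1f} per server, "
      f"{backlog:.1f} queued per server")
# prints: 340 in flight, 8.5 per server, 2.5 queued per server
```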
Little’s Law can also tell us what we need to scale up to if we want to extract ourselves from this mire:
(total worker threads) ≥ (arrival rate)(average request latency)
(number of web servers)(worker threads per web server) ≥ 340
So we can either scale up web servers or worker threads per web server until their product is at least 340.
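Solving that inequality for either knob is a one-liner. A small sketch, with helper names that are mine rather than anything from our tooling:

```python
import math

AVERAGE_IN_FLIGHT = 340  # L, from Little's Law above


def min_web_servers(in_flight, workers_per_server):
    """Fewest servers whose combined worker threads cover in_flight."""
    return math.ceil(in_flight / workers_per_server)


def min_workers_per_server(in_flight, web_servers):
    """Fewest worker threads per server that cover in_flight."""
    return math.ceil(in_flight / web_servers)


print(min_web_servers(AVERAGE_IN_FLIGHT, 6))          # 57 servers at 6 workers each
print(min_workers_per_server(AVERAGE_IN_FLIGHT, 40))  # 9 workers on each of 40 servers
```

Either option clears the bar: 57 × 6 = 342 and 40 × 9 = 360 worker threads.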
Little’s Law is about long-term averages
This is only a lower bound, of course.
Little’s Law tells us about the long-term average behavior of a queueing system: nothing else. From moment to moment, the occupancy of the system will vary around this average. So, in the example above, we need to provision enough worker capacity to absorb these variations.
How much extra capacity do we need? Little’s Law can’t tell us. The answer will depend on the idiosyncrasies of our system and the traffic that passes through it. Different systems have different latency distributions, different arrival time distributions, different queueing disciplines, and so on. These variables all have some effect on our worst case and our 99th-percentile occupancy. So, in most cases, it’ll be important to get a sense for the empirical ratio between a system’s average occupancy and its occupancy at whatever percentile you decide to care about. But Little’s Law is still a very helpful tool.
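One way to get at that empirical ratio is to sample the number of in-flight requests over time and divide the percentile you care about by the average. The sketch below assumes you can export such samples from your telemetry. The sample values are invented, and the nearest-rank percentile is just one reasonable definition among several.

```python
def percentile_nearest_rank(samples, percent):
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (percent * len(ordered) + 99) // 100  # ceiling of percent/100 * n
    return ordered[rank - 1]


def headroom_ratio(occupancy_samples, percent=99):
    """Ratio of the chosen occupancy percentile to the average occupancy."""
    average = sum(occupancy_samples) / len(occupancy_samples)
    return percentile_nearest_rank(occupancy_samples, percent) / average


# Invented in-flight request counts, sampled at peak traffic.
samples = [300, 320, 330, 335, 340, 340, 345, 350, 360, 380]

ratio = headroom_ratio(samples)
print(f"p99 / average occupancy: {ratio:.2f}")
```

With these made-up samples the average is 340 and the 99th percentile is 380, so you would want roughly 12% more worker capacity than the bare minimum from Little’s Law. Real traffic will give a different ratio, and a trustworthy 99th percentile needs far more than ten samples.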
If you do have a good sense of how worst-case occupancy varies with the average, you might even be able to use Little’s Law to inform your autoscaling strategy. As long as the system’s arrival rate changes slowly relative to the average request latency (such that you’re still working with long-term averages), you can rely on
L = λW
to predict your capacity requirements. Or, at least, I think you could. I haven’t tried it yet.
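For what it’s worth, here is roughly what I imagine that calculation looking like. It is an untested sketch, not something running in production: the function name, the headroom ratio, and every input value are hypothetical.

```python
import math


def desired_web_servers(arrival_rate, average_latency,
                        headroom_ratio, workers_per_server):
    """Servers needed to hold average occupancy times a safety factor."""
    average_in_flight = arrival_rate * average_latency  # L = lambda * W
    target_capacity = average_in_flight * headroom_ratio
    return math.ceil(target_capacity / workers_per_server)


# Hypothetical inputs: recent arrival rate (requests/s), recent average
# latency (s), an empirically measured headroom ratio, and 6 workers per
# server as in the example above.
servers = desired_web_servers(
    arrival_rate=1000.0,
    average_latency=0.34,
    headroom_ratio=1.5,
    workers_per_server=6,
)
print(f"scale to {servers} web servers")
```

An autoscaler built on this would recompute the target periodically from recent averages of arrival rate and latency, which only makes sense while both change slowly enough for those averages to mean something.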