In a complex application, there are queues everywhere. Which is lucky, in a way, because it means we can use queueing theory to slice through a whole class of Gordian knots.
One of queueing theory’s most general insights is Little’s Law:
L = λW
L is the long-term average number of customers in the system, λ is the long-term average arrival rate of new customers, and W is the average amount of time that customers spend waiting.
In the parlance of queueing theory, “customer” doesn’t just mean “customer.” It means whatever unit of work needs to pass through the system. A customer can be a phone call or an IP packet or a literal customer at a grocery store or any one of infinitely many other things. As long as there are pieces of work that arrive, get queued, get processed, and then exit the system*, Little’s Law works. It’s breathtakingly general.
As an illustration, let me share an anecdote from my job.
*and as long as you’re not hitting a queue size limit
How many web servers do we need?
I’m on a team that’s responsible for a web app that looks more or less like this:
Requests come in from the Internet to the load balancer. The load balancer forwards the requests to a bunch of web servers, each of which, in turn, distributes requests among 6 independent worker threads. The worker threads run the business logic and send responses back up the stack. Pretty straightforward.
When a web server receives a request, it hands that request off to one of its worker threads. Or, if all the worker threads are busy, the request gets queued in a backlog to be processed once capacity becomes available.
If everything’s hunky dory, the backlog should be empty. There should always be idle capacity, such that requests never have to wait in a queue. But one day I noticed that the backlog wasn’t empty. The total number of backlogged requests across the fleet looked like this:
Things were getting queued at peak traffic, so we needed to scale up the number of web servers. But scale it up to what? I could have used trial and error, but instead, I turned to Little’s Law.
The first step was to establish the mapping between entities in my system and the general mathematical objects related by Little’s Law:
- L: the number of in-flight requests. In other words, requests that have arrived at the load balancer and for which responses haven’t yet been sent back to the user.
- λ: the rate at which new requests arrive at the load balancer.
- W: the average request latency.
What I wanted to know – and didn’t have a metric for – was L. I did have a metric in my telemetry system for W, the average request latency.
While I didn’t exactly have a metric for λ, the arrival rate of requests, I did have the completion rate of requests (i.e. how many requests per second were being served). The long-term average arrival rate can’t differ from the completion rate, since every request does exit the system eventually. Therefore I was able to use the completion rate as a stand-in for λ. Here’s what I found (these aren’t the actual numbers):
L = λW
(average occupancy) = (arrival rate)(average wait time)
(average occupancy) = (1000 req/s)(340ms)
(average occupancy) = 340 requests
I chose an arrival rate close to the peak throughput of the system. This still works as a “long-term average,” though, since the interval between arrivals (on the order of 1 millisecond) is much less than the duration of the average request (on the order of 300 milliseconds).
So, according to Little’s Law, at peak-traffic times, there will be on average 340 requests in flight in the system. Sometimes more, sometimes less, but on average 340. From there, it was trivial to see why requests were getting queued:
(average web server occupancy) = (average occupancy) / (number of web servers)
(average web server occupancy) = (340 requests) / (40)
(average web server occupancy) = 8.5
If you recall that each web server maintains 6 worker threads, you’ll see the problem. No matter what fancy stuff we try to do with queueing discipline or load balancing algorithm or whatever, there will be on average 2.5 queued requests per worker.
Little’s Law can also tell us what we need to scale up to if we want to extract ourselves from this mire:
(total worker threads) ≥ (arrival rate)(average wait time)
(number of web servers)(worker threads per web server) ≥ 340
So we can either scale up web servers or worker threads per web server until their product is greater than 340.
Little’s Law is about long-term averages
This is only a lower bound, of course.
Little’s Law tells us about the long-term average behavior of a queueing system: nothing else. From moment to moment, the occupancy of the system will vary around this average. So, in the example above, we need to provision enough worker capacity to absorb these variations.
How much extra capacity do we need? Little’s Law can’t tell us. The answer will depend on the idiosyncrasies of our system and the traffic that passes through it. Different systems have different latency distributions, different arrival time distributions, different queueing disciplines, and so on. These variables all have some effect on our worst case and our 99th-percentile occupancy. So, in most cases, it’ll be important to get a sense for the empirical ratio between a system’s average occupancy and its occupancy at whatever percentile you decide to care about. But Little’s Law is still a very helpful tool.
If you do have a good sense of how worst-case occupancy varies with the average, you might even be able to use Little’s Law to inform your autoscaling strategy. As long as the system’s arrival rate changes much more gradually than the average latency of requests (such that you’re still working with long-term averages), you can rely on
L = λW
to predict your capacity requirements. Or, at least, I think you could. I haven’t tried it yet.