In a complex system, like a SaaS product, you have countless interdependent things all happening at every moment. On a gross scale: means of production go in; value comes out. But under the hood, all sorts of interactions are going on, at various rates and latencies, with various exit conditions, stacked to various heights of indirection. And the rules for these interactions change with every deploy.
As load gets added to the system, you start to experience 🏋️♀️🔥🪦 load problems. The system just can’t keep up with traffic, and it starts to break down under various circumstances, in all kinds of more or less spectacular ways.
At the onset of load problems, it feels like everything is breaking at once. Like some threshold was breached and we’ve entered a new domain. We always had problems, but now we have a new Big Problem.
Big Problems, though, are always composed of smaller problems. Big Problems conditioned by load are not atomic novel phenomena. There’s a knot of interdependent processes, and when this knot gets pulled tight by load, it binds up.
Maybe there was a code deploy 6 weeks ago that inadvertently made this API endpoint’s median latency 50% higher. And maybe that API endpoint holds open an AMQP connection until it finishes. And maybe some customer just spun up 500 instances of a container that hits the endpoint. Now all these tangled Small Problems constrict, resources are exhausted, and the system stops producing.
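To put rough numbers on that knot, here is a back-of-the-envelope sketch using Little's Law (average concurrency = arrival rate × time in system). Every figure in it is made up for illustration: the 800-connection broker limit, the 2,000 requests per second of baseline traffic, the 200 ms median latency, and the 2 requests per second per container are all assumptions, not measurements from any real system.

```python
def connections_in_use(requests_per_second: int, latency_ms: int) -> float:
    """Little's Law: average open connections = arrival rate x time held open."""
    return requests_per_second * latency_ms / 1000


# Hypothetical figures, chosen only to illustrate the interaction.
CONNECTION_LIMIT = 800            # assumed cap on AMQP connections
BASE_RPS = 2000                   # assumed normal traffic to the endpoint
BASE_LATENCY_MS = 200             # assumed latency before the deploy
SLOW_LATENCY_MS = 300             # 50% higher after the deploy
SURGE_RPS = BASE_RPS + 500 * 2    # 500 containers at an assumed 2 req/s each

scenarios = {
    "baseline": connections_in_use(BASE_RPS, BASE_LATENCY_MS),
    "slow deploy only": connections_in_use(BASE_RPS, SLOW_LATENCY_MS),
    "surge only": connections_in_use(SURGE_RPS, BASE_LATENCY_MS),
    "slow deploy + surge": connections_in_use(SURGE_RPS, SLOW_LATENCY_MS),
}

for name, in_use in scenarios.items():
    status = "EXHAUSTED" if in_use > CONNECTION_LIMIT else "ok"
    print(f"{name:>20}: {in_use:6.0f} connections ({status})")
```

With these assumed numbers, neither Small Problem breaches the limit by itself. Only the combination does, which is why the failure feels like it came out of nowhere.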
It’s just these sorts of interdependency knots that cause an increase in the frequency of customer-facing problems as load rises.
We rely heavily on Big Problems to tell us which Small Problems need fixing. “How did this failure happen? It happened because of this and this and this contributing factor. Which of these can we fix?” And of course the results of this sort of analysis are useful. The Small Problems we identify this way have a crucial unifying feature: we know, because we’ve seen it happen, that they can each be contributing factors to Big Problems.
But this is not the only way to identify Small Problems. We can go digging in our data and find them, endeavoring to solve them before they get a chance to cause a Big Problem.
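As one example of what that digging might look like, here is a minimal sketch that compares each endpoint's median latency across two time windows and flags the ones that got noticeably slower. The function name `latency_regressions`, the 1.25× threshold, the endpoint paths, and the sample values are all hypothetical; in practice the samples would come from whatever metrics or log store you already have.

```python
from statistics import median


def latency_regressions(before, after, threshold=1.25):
    """Return {endpoint: ratio} for endpoints whose median latency grew
    by at least `threshold` times between the two windows."""
    flagged = {}
    for endpoint, old_samples in before.items():
        new_samples = after.get(endpoint)
        if not old_samples or not new_samples:
            continue  # no data in one of the windows; nothing to compare
        old_median = median(old_samples)
        if old_median == 0:
            continue  # avoid dividing by zero on degenerate data
        ratio = median(new_samples) / old_median
        if ratio >= threshold:
            flagged[endpoint] = ratio
    return flagged


# Made-up latency samples in milliseconds, keyed by endpoint.
six_weeks_ago = {"/v1/jobs": [100, 120, 110], "/v1/users": [40, 42, 41]}
this_week = {"/v1/jobs": [150, 180, 165], "/v1/users": [41, 40, 43]}

flagged = latency_regressions(six_weeks_ago, this_week)
for endpoint, ratio in flagged.items():
    print(f"{endpoint}: median latency is {ratio:.2f}x what it was")
```

A check like this won't say whether a slowdown matters. It only surfaces a Small Problem for someone to go look at before load pulls the knot tight.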
Okay, maybe we can’t prevent every Big Problem this way. And without the gating effect of post-mortem analysis, we’ll likely spend some of our time fixing Small Problems that never would’ve amounted to anything.
But even if we’re just poking around in the dark, we’ll probably end up preventing some Big Problems, right? And Big Problems are expensive. I think, on balance, it’s worth the effort.
Somebody ought to be getting paid to be Maxwell’s Demon. Ideally, nearly everybody in the org ought to be spending some of their time investigating mysteries. It ought to be an explicit part of the job.