Squeeze the hell out of the system you have

About a year ago, I raised a red flag with colleagues and managers about Postgres performance. Our database was struggling to keep up with the load generated by our monolithic SaaS application. CPU utilization was riding between 60 and 80%, and at least once it spiked to 100%, causing a brief outage.

Now, we had been kicking the can down the road with respect to Postgres capacity for a long time. When the database looked too busy, we’d replace it with a bigger instance and move on. This saved us a lot of time and allowed us to focus on other things, like building features, which was great.

But this time, it wasn’t possible to scale the DB server vertically: we were already on the biggest instance. And we were about to overload that instance.

Lots of schemes were floated. Foremost among them:

  • Shard writes. Spin up a cluster of independent databases, and write data to one or the other according to some partitioning strategy (roughly sketched below).
  • Do micro-services. Split up the monolith into multiple interconnected services, each with its own data store that could be scaled on its own terms.
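
To make the first of those concrete, here’s a rough sketch of what shard routing might have looked like in a Rails app like ours. It’s purely hypothetical; we never built it. The shard names, the Event model, and the CRC-based partitioning are made up, and it assumes Rails 6.1+ horizontal sharding with the extra databases declared in database.yml and wired up via connects_to.

    # Hypothetical sketch of the "shard writes" option. Not something we built.
    # Assumes Rails 6.1+ horizontal sharding with shard_one/_two/_three
    # declared in database.yml and hooked up via connects_to. Event and
    # account_id are made-up names for illustration.
    require "zlib"

    SHARDS = %i[shard_one shard_two shard_three].freeze

    # Deterministically map the partition key to a shard, so every read and
    # write for a given account lands on the same database.
    def shard_for(account_id)
      SHARDS[Zlib.crc32(account_id.to_s) % SHARDS.size]
    end

    # Every write (and eventually every read) now has to pass through shard
    # selection.
    def record_event(account_id, payload)
      ActiveRecord::Base.connected_to(shard: shard_for(account_id), role: :writing) do
        Event.create!(account_id: account_id, payload: payload)
      end
    end

Even in this toy version, the routing decision leaks into every call site. Keep that in mind for what follows.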

Both of these options are cool! A strong case can be made for either one on its merits. With write sharding, we could potentially increase our capacity by 2 or even 3 orders of magnitude. With micro-services, we’d be free to use “the right tool for the job,” picking data stores optimized to the requirements of each service workload. Either branch of the skill tree would offer exciting options for fault tolerance and operational resilience.

Either way, everyone had to agree: we’d outgrown our old, naïve implementation. Onward and upward! We can do hard things!

In situations like this, presented with a dazzling array of next-generation architecture options that can be built to last us through the decade, it’s easy to forget what our goal was: to get database performance under control.

Complexity costs attention.

Sometimes, leaps in complexity must be made. It’s generally a good problem to have. If enough demand is being placed on your system to render obsolete your existing technology, then even more growth is probably on the horizon! If you can just put in the investment and build the more advanced architecture now, then you’ll be looking at a bright future of unconstrained year-over-year success.

But don’t just consider the implementation cost. The real cost of increased complexity – often the much larger cost – is attention.

If you decide to shard across databases, then not only must you pay the money-, time-, and opportunity cost of building out the new architecture: you must also take the new complexity into account in every subsequent technical decision. Want to shard writes? Fine, but this complicates every future decision about backups, monitoring, migrations, the ORM, and network topology (just to name a few). And don’t get me started on micro-services.

Just think about how massive these costs are. How much feature delivery will have to be delayed or forgone to support the additional architectural complexity?

Always squeeze first

We should always put off significant complexity increases as long as possible.

When complexity leaps are on the table, there’s usually also an opportunity to squeeze some extra juice out of the system you have. By tweaking the workload, tuning performance, or supplementing the system in some way, you may be able to add months or even years of runway. When viable, these options are always preferable to building out a next-gen system.

Let’s return to the example of the overloaded Postgres instance. In that case, what we ended up doing was twofold:

  1. Two engineers (me and my colleague Ted – but mostly Ted) spent about 3 months working primarily on database performance issues. There was no silver bullet. We used our telemetry to identify heavy queries, dug into the (Rails) codebase to understand where they were coming from, and optimized or eliminated them. We also tuned a lot of Postgres settings.
  2. Two more engineers cut a path through the codebase to run certain expensive read-only queries on a replica DB. This effort bore fruit around the same time as (1), when we offloaded our single most frequent query (a SELECT triggered by polling web clients). A rough sketch of both steps follows.
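
For the curious, here’s roughly what those two steps look like in code. This is a simplified sketch, not our actual changes: it assumes the pg_stat_statements extension is enabled (one common way to find heavy queries; the details of our own telemetry are beside the point here), a primary_replica entry in database.yml, Rails 6+, and a made-up Notification model standing in for the data the web clients poll for.

    # Simplified sketch, not our actual changes.
    #
    # (1) Find the queries eating the most database time (e.g. from a Rails
    #     console). Assumes the pg_stat_statements extension is enabled; the
    #     column is total_exec_time on Postgres 13+ (total_time before that).
    heavy = ActiveRecord::Base.connection.select_all(<<~SQL)
      SELECT query, calls, total_exec_time
      FROM pg_stat_statements
      ORDER BY total_exec_time DESC
      LIMIT 10
    SQL
    heavy.each { |row| puts row.inspect }

    # (2) Offload read-only queries to a replica.
    #     app/models/application_record.rb, assuming a primary_replica entry
    #     (replica: true) in database.yml.
    class ApplicationRecord < ActiveRecord::Base
      self.abstract_class = true
      connects_to database: { writing: :primary, reading: :primary_replica }
    end

    # Then wrap the hot, read-only code path (e.g. the endpoint the web
    # clients poll) so its SELECTs hit the replica instead of the primary.
    # Notification and current_user are made-up names for illustration.
    ActiveRecord::Base.connected_to(role: :reading) do
      Notification.where(user_id: current_user.id, read: false).to_a
    end

The point isn’t these particular snippets; it’s that both steps keep us on the boring architecture we already had.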

These two efforts together reduced the maximum weekly CPU usage on the database from 90% to 30%.

Now we can sleep at night. We have a huge amount of room to grow, both in terms of CPU headroom and our ability to shed load from the primary. And furthermore, since our work touched many parts of the codebase and demanded collaboration with lots of different devs, we now have a strong distributed knowledge base about the existing system. We’re well positioned to squeeze it even more if need be.

This doesn’t mean complexity is bad

Of course, I’m not saying complexity is bad. It’s necessary. Some day we’ll reach a fundamental limit of our database architecture, and before that day arrives, we’ll need to make a jump in complexity.

But until then, because we squeezed first, we get to keep working with the most boring system possible. This is by far the cheaper and more practical option.

Thoughts on “Squeeze the hell out of the system you have”

    1. Johnathan W

      You missed the point of the article if you needed the code. Don’t be a dick.

      It was an insightful article proposing that the reader try to get the most out of their current (simple) architecture before adding complexity. A rule-of-thumb article.

  1. Tony

    Thanks for the article. It’s something I truly believe in: we keep using bigger and bigger systems or more complicated designs instead of putting any work into optimising the current system.

    I would have loved to know the % fix from each. Both dedicated read systems and optimised queries/tuning can affect performance so massively.

    Having cached reads or views for information that doesn’t have to be completely up to date is something I have used as well, like a view which only updates every 5 minutes and all queries use the cached result with that expectation.

    1. I would love to know the percentage from each, but it all became really hard to pick apart. I would say, by the time we switched any read-only queries to the reader DB, we had already gotten down from 90% to 50%. So that 40% must have come from some combination of perf tweaking and query optimization.

      It’s the sort of thing where you make one PR and it moves the needle a very small amount, and you make another PR and it moves a different needle a bit, and only when you step back after a concerted effort do you see how much impact you’ve had overall.

      It was important to measure the effect of each small change, but not because we were gonna add them all up later. Just because it’s evidence that we’re going in the right direction.

  2. Paulo André

    Really enjoyed this pragmatic write-up, Dan. Having been part, years ago, of situations where we (carelessly) chose complexity due to shiny object syndrome, I can definitely relate.

    These days as a leadership coach and advisor to startups, it strikes me that the same thinking applies to organizational and process complexity.

    Appreciate the thought-provoking essay!

  3. Sig

    The one thing I noticed here was that your most expensive query was a SELECT from a polling web client. My instinct would be to figure out how to get away from polling and to something event-based.

  4. Goran

    Great read! As one of my ex-managers used to say, you cannot hardware yourself out of a software problem. While vertical scaling is the easy button, don’t forget code optimization 🙂
