3 questions that will make you a phenomenal rubber duck

As a Postgres reliability consultant and SRE, I’ve spent many hours being a rubber duck. Now I outperform even the incisive bath toy.

“Rubber duck debugging” is a widespread, tongue-in-cheek term for the practice of explaining, out loud, a difficult problem that you’re stumped on. Often, just by putting our troubles into words, we suddenly discover insights that unlock progress. The person we’re speaking to could just as well be an inanimate object, like a rubber duck. Hence the term.

Rubber ducks are great, but a human can add even more value. In this article, I’ll share my 3 favorite questions to ask when someone comes to me feeling stumped in a troubleshooting endeavor. These questions work even when you have no particular expertise in the problem domain. Master them, and you’ll quickly gain a reputation as the person to talk to when people get stuck. This is a great reputation to have!

Question 1: How did you first start investigating this?

As we investigate a problem, our focus shifts from one thing to another to another. We go down one path and forget about others. We zoom in on details and neglect to zoom back out. It’s easy to lose perspective.

“How did you first start investigating this?” works well because, through the act of recounting their journey from initial observation to where they are now, your colleague will often regain perspective they’ve lost along the way. And by asking this particular question, you avoid having to suggest that they may have lost perspective – which could make them defensive.

Even if your colleague hasn’t lost perspective, hearing the story of the investigation so far will help you ask better questions and help them organize their thoughts.

Question 2: What observations have you made?

In troubleshooting a complex problem, it’s easy to forget what you already know. As you go along, you make lots of observations, small and large, interesting and boring, relevant and irrelevant. You can’t hold them all in your head.

When someone’s stuck, it often helps to review their observations. Not theories, not difficulties, not actions: directly observed facts.

Reviewing observations can help in a few different ways:

  • They may be entertaining a hypothesis that clashes with some previously learned (but since forgotten) fact. If so, they can now go ahead and discard that hypothesis.
  • Juxtaposing two observations may suggest a hypothesis that never occurred to them before, because they never held those two observations in their head simultaneously.
  • Listing out their observations may bring to mind something they haven’t looked at yet.

As your colleague recounts their observations, write them down in a numbered list. And, if you can, ask clarifying questions. Questions like “Does X always happen concurrently with Y, or only sometimes?” and “How does this differ from the normal behavior?”

Never underestimate the power of precisely stating the facts.

Question 3: If your hypothesis were wrong, how could we disprove it?

This question is my favorite.

One of the most common ways people get stuck in troubleshooting is tunnel vision. They get a single idea in their head about the cause of the problem, and that becomes all they can think about.

This question, “If your hypothesis were wrong, how could we disprove it?”, flips the script. Instead of racking their brain trying to prove their theory, it gets them thinking about other possibilities. Asking this question can lead to lots of different outcomes, all of which represent progress:

  1. You come up with a way to disprove the hypothesis, and successfully disprove it. This may make your colleague sad for a few hours, but when they come back to the problem, they’ll make huge strides.
  2. You come up with a way to disprove the hypothesis, but fail to disprove it. The hypothesis is thus bolstered, and the next step becomes clear: elaborate a few different versions of it and try to disprove those.
  3. You can’t think of any way to disprove it. This means it’s probably not a hypothesis at all, since it’s not falsifiable. Therefore you must replace it with a new hypothesis. This may feel like a setback, but it’s really the only way forward.

How it fits together

Under the hood, these 3 questions are just different ways of invoking hypothetico-deductive reasoning, which I’ve written about previously (see Troubleshooting On A Distributed Team Without Losing Common Ground and You Know Who’s Smart? Friggin’ Doctors, Man.). I don’t know of any better way to achieve consistent problem-solving results in the face of complexity.

If you’re interested in learning how to apply these techniques in your career or in your organization, I can help. Shoot me an email!

Why transaction order matters, even if you’re only reading

There are 4 isolation levels defined by the SQL standard, and Postgres supports them through the SET TRANSACTION statement. They are:

  • Read uncommitted: It’s possible to read data written by another concurrent uncommitted transaction (this is called a “dirty read”). (Note: Postgres accepts this isolation level but treats it as Read committed, so dirty reads never actually happen.)
  • Read committed: Dirty reads are not possible.
  • Repeatable read: “Nonrepeatable reads” are not possible. A nonrepeatable read is when you read some data, then repeat the same read within the same transaction and find that the data has been modified by another transaction that committed in the meantime.
  • Serializable: The strongest isolation level, this guarantees that repeating a read within a transaction will not only always return the same data for each row, but also the same set of rows. And it further guarantees that the end result of any set of concurrent transactions will be a possible outcome of having run those same transactions in serial, in some order.

This last guarantee is one against serialization anomalies. A serialization anomaly is any sequence of events that produces a result that would be impossible if those transactions ran one-at-a-time (“in serial”). This guarantee is tricky to reason about, but the Postgres docs have a great example to demonstrate it.
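As a quick syntax aside, here’s how you pick an isolation level in Postgres. This is a minimal sketch; both forms below are standard Postgres syntax:

-- Set the isolation level as part of BEGIN:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... queries ...
COMMIT;

-- Or with SET TRANSACTION, which must run before any query in the transaction:
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- ... queries ...
COMMIT;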

A simple serialization anomaly

The example from the docs goes like this. You have the following table, mytab:

 class | value
-------+-------
     1 |    10
     1 |    20
     2 |   100
     2 |   200
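If you want to follow along at home, a setup like this reproduces that table (the integer column types are my assumption; the docs example only shows the data):

-- Hypothetical setup matching the data above.
CREATE TABLE mytab (class int NOT NULL, value int NOT NULL);
INSERT INTO mytab (class, value)
VALUES (1, 10), (1, 20), (2, 100), (2, 200);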

And the following two transactions run concurrently:

  1. Transaction A: BEGIN;  Transaction B: BEGIN;
  2. Transaction A: SELECT SUM(value) FROM mytab WHERE class = 1; (the result is 30)
  3. Transaction B: SELECT SUM(value) FROM mytab WHERE class = 2; (the result is 300)
  4. Transaction A: INSERT INTO mytab (class, value) VALUES (2, 30);
  5. Transaction B: INSERT INTO mytab (class, value) VALUES (1, 300);
  6. Transaction A: COMMIT;  Transaction B: COMMIT;

This is a serialization anomaly, since no serial ordering of A and B could produce the same end result:

  • A => B: Transaction B writes 330.
  • B => A: Transaction A writes 330.

The explanation is pretty straightforward once you walk through it. Each transaction inserts a row into exactly the set of rows that the other one reads to decide what to insert, but neither sees the other’s new row. So you end up with a result that could never occur if the transactions had been run in serial.
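Spelling out the arithmetic makes the mismatch concrete:

-- Rows added to mytab under each execution:
-- Concurrent run: A inserts (2, 30) and B inserts (1, 300)
-- Serial A => B:  A inserts (2, 30); B then computes SUM(class 2) = 100 + 200 + 30 = 330 and inserts (1, 330)
-- Serial B => A:  B inserts (1, 300); A then computes SUM(class 1) = 10 + 20 + 300 = 330 and inserts (2, 330)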

But I also noticed this passage in the docs:

A READ ONLY transaction may be able to release its SIRead locks before completion, if it detects that no conflicts can still occur which could lead to a serialization anomaly. In fact, READ ONLY transactions will often be able to establish that fact at startup and avoid taking any predicate locks. If you explicitly request a SERIALIZABLE READ ONLY DEFERRABLE transaction, it will block until it can establish this fact.
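For reference, the deferrable variant mentioned at the end of that passage is requested like this (syntax from the Postgres docs; the query is just an illustration):

-- May block while acquiring its snapshot, until it can run with no risk of
-- serialization failure; after that it takes no predicate (SIRead) locks.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
SELECT SUM(value) FROM mytab;
COMMIT;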

This surprised me because, intuitively, I wouldn’t think that a READ ONLY transaction could participate in a serialization anomaly. How can it be that just reading data concurrently with other transactions could cause a result inconsistent with serial execution? Such a transaction can’t even lock rows. I racked my brains trying to think of such a situation, but came up empty.

A serialization anomaly from a READ ONLY transaction

Lucky for me, back in about 2000, somebody else racked their brains about it too, and produced this 3-page paper describing just such a situation. Right near the beginning, I read:

It has been widely assumed that, under SI [a.k.a. SERIALIZABLE transaction isolation mode], read-only transactions always execute serializably provided the concurrent update transactions are serializable. The reason for this is that all SI reads return values from a single instant of time when all committed transactions have completed their writes and no writes of non-committed transactions are visible. This seems to imply that read-only transactions will not read anomalous results so long as the update transactions with which they execute do not write such results.

“Yeah, exactly!” I thought. But it turns out to be possible, and here’s how it works.

Suppose we have the following table, representing bank account balances.

              Table "public.accounts"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | not null |
value | integer | | not null |
Indexes:
"accounts_pkey" PRIMARY KEY, btree (id)

And suppose further that the following two accounts exist, with zero balance:

# SELECT * FROM accounts;
 id | value
----+-------
  1 |     0
  2 |     0
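If you’d like to reproduce this, the following matches the table definition and rows shown above:

CREATE TABLE accounts (
    id    integer PRIMARY KEY,
    value integer NOT NULL
);
INSERT INTO accounts (id, value) VALUES (1, 0), (2, 0);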

These two accounts are owned by the same customer, and the bank expects the sum of their balances to stay non-negative. If the sum goes negative, a $1 overdraft fee is assessed. Now consider the following 3 DB transactions:

  • T1: Deposits $20 into account 2.
  • T2: Withdraws $10 from account 1.
  • T3: Retrieves the balances of both accounts and prints them out.

Let’s walk through these transactions and see how a serialization anomaly can emerge:

  1. T1, T2, T3: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  2. T2: SELECT id, value FROM accounts WHERE id IN (1, 2); (both balances are 0; T2 needs them to know whether the sum will go negative)
  3. T1: SELECT value FROM accounts WHERE id = 2; (still 0)
  4. T1: UPDATE accounts SET value = 20 WHERE id = 2; (deposits $20 into account 2)
  5. T1: COMMIT; (T2 is now working with an outdated value for account 2)
  6. T3: SELECT id, value FROM accounts WHERE id IN (1, 2); (T3 gets values (0, 20))
  7. T3: COMMIT; (T3 sends this result to the printer)
  8. T2: UPDATE accounts SET value = -11 WHERE id = 1; (withdraws $10 from account 1, charging a $1 overdraft fee)
  9. T2: COMMIT;

When T2 tries to commit, it fails with the following error:

ERROR:  could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during write.
HINT: The transaction might succeed if retried.

Postgres has identified a potential serialization anomaly and prevented it by aborting T2. It’s a serialization anomaly because, if T2 were allowed to commit, then the customer would be charged an overdraft fee for withdrawing $10 despite having a print-out showing a total balance of $20. This could never happen if these transactions were run one-at-a-time in any order:

  • T1 => T2 => T3: print-out has a total balance of $10, and no overdraft fee is charged.
  • T1 => T3 => T2: print-out shows $20, and no overdraft fee is charged.
  • T2 => T1 => T3: print-out shows $9 ($-11 + $20).
  • T2 => T3 => T1: print-out shows $-11.
  • T3 => T1 => T2: print-out shows $0.
  • T3 => T2 => T1: print-out shows $0.

And, as we can also see, this serialization anomaly couldn’t occur if T3 were absent. Either of the possible orderings of T1 and T2 produces an internally consistent result.

Avoiding this problem

If you’re not using the Serializable isolation level, then instead of an error at COMMIT time, you just get a bug. And a scrap of paper that the SEC is going to have some embarrassing questions about.

So you can either use Serializable, or stick with a lower isolation level such as Read Committed and introduce explicit locking. For example, if T2 used SELECT FOR UPDATE to exclusively lock the rows it reads (in particular, the row for account 2), then T1’s update of account 2 would block until T2 committed. This would eliminate the bug.
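Here’s a sketch of what that could look like for T2 (the exact WHERE clause and fee logic are up to the application; the point is to lock every row the balance check depends on):

-- T2 at the default Read Committed level, taking row locks up front.
BEGIN;
SELECT id, value
  FROM accounts
 WHERE id IN (1, 2)
   FOR UPDATE;  -- locks both rows; T1's UPDATE of account 2 now blocks until we commit
-- compute the new balance for account 1, including the $1 fee if the sum would go negative
UPDATE accounts SET value = -11 WHERE id = 1;
COMMIT;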

Another fix, using the Repeatable Read isolation level, would be to add an additional row containing the sum of balances for each group of accounts, and require that any transaction updating one of those balances also update the sum row. Whichever transaction tried to update that row last would get an error and be forced to retry the transaction, thus eliminating the bug.
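A sketch of that approach; the account_sums table and its customer_id grouping are my own invention for illustration:

-- One row per group of accounts, maintained alongside the balances.
CREATE TABLE account_sums (customer_id integer PRIMARY KEY, total integer NOT NULL);
INSERT INTO account_sums VALUES (1, 0);  -- accounts 1 and 2 both belong to customer 1

-- Every transaction that changes a balance must also update the sum row:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
UPDATE accounts     SET value = value + 20 WHERE id = 2;
UPDATE account_sums SET total = total + 20 WHERE customer_id = 1;
-- If a concurrent transaction already updated this sum row and committed, the UPDATE above
-- fails with "could not serialize access due to concurrent update" and we retry.
COMMIT;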

If you can, I recommend starting with Serializable, and only switching to a lower isolation level on a transaction-by-transaction basis when the performance cost of serializability checks becomes too high.
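If it helps, Serializable can be made the default rather than something you remember to ask for. default_transaction_isolation is a standard Postgres setting; the database name below is just a placeholder:

-- Make Serializable the default for new transactions in this database.
ALTER DATABASE mydb SET default_transaction_isolation = 'serializable';

-- Then opt individual hot-path transactions back down where measurements justify it:
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- ... queries ...
COMMIT;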