The Universe is not listening

One of the oldest forms of superstition is the belief that particular words have a magical effect. Spells, incantations, Kotodama, “om”; it goes on and on. The word “abracadabra” is thought to have come from the ancient Aramaic for “I have created through speech” [srsly].

We really seem to wish the Universe were listening to us.

And it doesn’t even have to be a benevolent force listening to our words. Think about the old “Bloody Mary” urban legend, about the names of Yahweh and Voldemort. Even the ancient Greek horror stories about insulting the gods.

Now I’m not anti-folklore, but I do think that mixing superstition with engineering is a terrible idea. It’s corrosive to the data-oriented devops culture that we should all be trying to build.

The “Now You’ve Done It!” Fallacy

You’ve probably had or heard this conversation before:

ALTHEA: Hey, if you think about it, we haven’t had an outage for quite some time.

REGINALD: Althea, don’t say that! Now we’re going to have one.

Reginald is an asshole, and he’s hurting your team. Don’t be like Reginald.

Let’s think about what happened in this exchange. First, Althea made an observation: that it’s been abnormally long since the site went down. Bangin’, right? If true, this observation could have very interesting implications. The last thing you should do is ignore it. And if it’s part of a trend toward higher uptimes altogether, you should be trying to understand where that trend comes from, so you can keep it going.

Then Reginald was all like “Hey Althea, you need to stop commenting on good things that happen. We only talk about our failures on this team, because talking about our successes summons vengeful spirits to corrupt our hard disks and steal our packets.”

Sure, you should be skeptical of Althea’s statement, and ideally you’re keeping track of the frequency of your outages so you can confirm it. But Reginald’s attitude actively discourages rational inquiry.

And I’ve got another bone to pick with Reginald. He just told Althea that our job, in ops, is to live in constant terror of the next disaster. He’s saying that downtime just happens and that, when we have a good run of uptime, it’s just a fluke. But uptime doesn’t just happen: we work at it, we push for it, and we should celebrate when we achieve it.

Dispelling Superstition

A superstition is the belief that one event causes another without any physical process linking the two events. Crack-stepping leads to broken maternal spines; a Pennsylvanian groundhog seeing his shadow leads to aberrations in global weather patterns; you know. And Reginald’s warning to Althea is nothing more than superstition: talking about good uptime, he says, causes downtime.

Reginald probably doesn’t even really believe this, but that doesn’t matter. He’s damaged the team’s culture by chiding Althea, since she’ll be less likely to vocalize her observations in the future.

Superstitions emerge from confirmation bias, which is very powerful. In the past, Reginald has probably suffered through one or two outages that occurred soon after somebody made a comment about good uptime. When the outage started, he immediately remembered the comment, and went “A-ha! I knew this would happen.” So now it doesn’t matter how many outages happen in the absence of such a comment, because in those cases he’s not thinking about his superstition. Likewise, it doesn’t matter how many of those comments get made nowhere near an outage. The only feedback Reginald’s superstition gets is positive. He’s like one of B.F. Skinner’s pigeons.

So how do you fight superstition? With empiricism, of course!

Track your MTBF: mean time between failures. Whenever a failure event happens, give it a rough severity rating and a timestamp and stick it in a spreadsheet. Get into this habit, and soon you’ll have a solid dataset that you can average by week or by month and show to Reginald (and Althea). If you’ve been working hard to improve stability and your MTBF has increased, then you should talk about that and you should feel good about it. You can use this data to erase Reginald’s counterproductive attitude and prevent it from harming your team’s culture any further.
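If the spreadsheet ever outgrows you, the arithmetic is easy to script. Here is a minimal sketch in Python, under some loud assumptions: the failure rows in FAILURES are made up for illustration, the function names (parse, mtbf, mtbf_by_month) are hypothetical, and "time between failures" is simplified to the gap between consecutive failure timestamps, ignoring how long each outage lasted.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical failure log: (timestamp, rough severity) rows,
# like the ones you would keep in the spreadsheet. Invented data.
FAILURES = [
    ("2023-01-03 04:10", 2),
    ("2023-01-10 16:10", 1),
    ("2023-02-01 16:10", 3),
    ("2023-03-15 16:10", 2),
]


def parse(rows):
    """Return the failure timestamps as sorted datetime objects."""
    return sorted(datetime.strptime(ts, "%Y-%m-%d %H:%M") for ts, _ in rows)


def mtbf(timestamps):
    """Mean gap between consecutive failures, or None with fewer than two."""
    if len(timestamps) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return sum(gaps, timedelta()) / len(gaps)


def mtbf_by_month(timestamps):
    """Average the gaps per month, filing each gap under the month it ended in."""
    buckets = defaultdict(list)
    for earlier, later in zip(timestamps, timestamps[1:]):
        buckets[later.strftime("%Y-%m")].append(later - earlier)
    return {
        month: sum(gaps, timedelta()) / len(gaps)
        for month, gaps in sorted(buckets.items())
    }


if __name__ == "__main__":
    stamps = parse(FAILURES)
    print("Overall MTBF:", mtbf(stamps))
    for month, value in mtbf_by_month(stamps).items():
        print(month, value)
```

With the invented rows above, the gaps grow from about a week to six weeks, which is exactly the kind of trend Althea was pointing at. Filtering the rows by severity before calling mtbf would let you ask the same question about only the serious outages.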

