Sometimes, during an incident post-mortem, we need to talk about the actions of a specific person. It’s awkward. We want to be blameless, right? If we say, “Steve rebooted the wrong server,” aren’t we undermining the principle of blamelessness? Shouldn’t we say, “An engineer rebooted the wrong server” instead?
This tension is rooted in a misunderstanding of blamelessness. The purpose of the blameless post-mortem is not to make everyone feel comfortable. Discomfort can be healthy and useful. The purpose of the blameless post-mortem is to let us find explanations deeper than human error.
If we say, “This outage happened because Steve rebooted the wrong server,” then we make Steve the sole agent in the narrative. By this logic, the only way to avoid more future outages is for Steve to do better (or get fired). This is the kind of shallow analysis that blameless post-mortems help us avoid.
But it’s okay to say, “Steve rebooted the wrong server,” as long as you don’t blame Steve for the failure. You’re just stating a fact. Everybody knows it was Steve who rebooted the server. Steve sure knows it: he’s been beating himself up about it all weekend.
But why did Steve reboot the wrong server? Could it have had something to do with the fact that the server’s command prompt contains the opaque EC2 instance ID instead of a meaningful hostname? Was Steve coming off a 70-hour week? Come to think of it, why is it that rebooting that particular server can cause a total failure? If you’re asking questions like these instead of stopping at “Steve fucked up,” then you’re doing blamelessness.
So just say the name. It’s fine.
P.S. – obviously this doesn’t apply if you’re in the kind of organization that will retaliate against Steve for making an honest mistake. Don’t sell out your coworkers to bloodthirsty management.