Troubleshooting On A Distributed Team Without Losing Common Ground

I work on a team that fixes complex systems under time pressure. My teammates have different skill sets, different priorities, and different levels of expertise. But we all have to troubleshoot and solve problems together.

This is really hard to do effectively. Fortunately for us in the relatively new domain of DevOps, situations like ours have been studied extensively in the last couple decades. We can use the results of this research to inform our own processes and automation for troubleshooting.

One of the most important concepts to emerge from recent teamwork research, common ground, helps us understand why collaborative troubleshooting breaks down over time. This breakdown leads to wasted effort and mistakes, even if the team maintains constant communication in a chat room. But if we extend ChatOps by drawing on some ideas from medical diagnosis, we can make troubleshooting way easier without losing the benefits of fluid team conversation.

Common Ground

Ergonomics researchers D.D. Woods and Gary Klein (the latter of whom I wrote about in What makes an expert an expert?) published a phenomenally insightful paper in 2004 called Common Ground and Coordination in Joint Activity. In it, they describe a particular kind of failure that occurs when people engage in joint cognition: the Fundamental Common Ground Breakdown. Once you learn about the Fundamental Common Ground Breakdown, you see it everywhere. Here’s how the Woods/Klein paper describes the FCGB:

  • Party A believes that Party B possesses some knowledge.
  • Party B doesn’t have this knowledge, and doesn’t know they’re supposed to have it.
  • Therefore, Party B doesn’t request it.
  • This lack of a request confirms to Party A that Party B has the knowledge.

When this happens, Party A and Party B lose common ground, which Woods & Klein define as “pertinent knowledge, beliefs and assumptions that are shared among the involved parties.” The two parties start making incorrect assumptions about each other’s knowledge and beliefs, which causes their common ground to break down further and further. Eventually they reach a coordination surprise, which forces them to re-synchronize their understanding of the coordinated activity:

[Figure from the Woods & Klein common ground paper]

Seriously, the FCGB is everywhere. Check out the paper.

I’m especially interested in one particular area where an understanding of common ground can help us do better teamwork: joint troubleshooting.

Common Ground Breakdown in Chatroom Troubleshooting

Everybody’s into ChatOps these days, and I totally get it. When a critical system is broken, it’s super useful to get everybody in the same room and hash it out. ChatOps allows everybody to track progress, coordinate activities, and share results. And it also helps to have lots of different roles represented in the room:

  • Operations folks, to provide insight into the differences between the system’s normal behavior and its current state
  • Software engineers, who bring detailed knowledge of the ways subsystems are supposed to work
  • Account managers and product managers and support reps: not just for their ability to translate technical jargon into the customer’s language for status reporting, but also because their understanding of customer needs can help establish the right priorities
  • Q.A. engineers, who can rule out certain paths of investigation early with their intuition for the ways in which subsystems tend to fail

The process of communicating across role boundaries isn’t just overhead: it helps us refine our own understanding, look for extra evidence, and empathize with each other’s perspectives.

But ChatOps still offers a lot of opportunities for common ground breakdown. The FCGB can occur whenever different people interpret the same facts in different ways. Interpretations can differ for many different reasons:

  • Some people have less technical fluency in the system than others. A statement like “OOM killer just killed Cassandra on db014” might change an ops engineer’s whole understanding of the problem, but such a shift could fly under the radar of, say, a support engineer.
  • Some people are multitasking. They may have a stake in the troubleshooting effort but be unable to internalize every detail from the chat room in real time.
  • Some people are co-located. They find it easier to discuss the problem using mouth words or by physically showing each other graphs, thereby adjusting their own shared understanding without transmitting these adjustments to the rest of the team.
  • Some people enter the conversation late, or leave for a while and come back. These people will miss common ground changes that happen during their absence.

These FCGB opportunities all become more pronounced as the troubleshooting drags on and folks become tired, bored, and confused. And when somebody says they’ve lost track of common ground, what do we do? One of two things: either somebody summarizes recent events and answers questions until the person feels caught up, or we tell them to read the backlog.

The Q&A approach has serious drawbacks. First of all, it requires somebody knowledgeable to stop what they’re doing and summarize the situation. If people are frequently leaving and entering the chat room, you end up with a big distraction. Second of all, it leaves lots of room for important information to get missed. The Fundamental Common Ground Breakdown happens when somebody doesn’t know what to ask, so fixing it with a Q&A session is kind of silly.

The other way people catch up with the troubleshooting effort is by reading the backlog. This is even more inefficient than Q&A. Here’s the kind of stuff you have to dig through when you’re reading a chat backlog:

[Screenshot: a HipChat troubleshooting backlog of 18 messages]

There’s a lot to unpack there – and that’s just 18 messages! Imagine piecing together a troubleshooting effort that’s gone on for hours, or days. It would take forever, and you’d still make a lot of mistakes. It’s just not a good way to preserve common ground.

So what do we need?

Differential Diagnosis as an Engine of Common Ground

I’ve blogged before about how much I love differential diagnosis. It’s a formalism that doctors use to keep the diagnostic process moving in the right direction. I’ve used it many times in ops since I learned about it. It’s incredibly useful.

In differential diagnosis, you get together with your team in front of a whiteboard – making sure to bring together people from a wide variety of roles – and you go through a cycle of 3 steps:

  1. Identify symptoms. Write down all the anomalies you’ve seen. Don’t try to connect the dots just yet; just write down your observations.
  2. Generate hypotheses. Brainstorm explanations for the symptoms you’ve observed. This is where it really helps to have a good cross-section of roles represented. The more diverse the ideas you write down, the better.
  3. Test hypotheses. Now that you have a list of things that might be causing the problem, you start narrowing down that list by coming up with a test that will prove or disprove a certain hypothesis.

Once you’re done with step #3, you can cross out a hypothesis or two. Then you head back to step #1 and repeat the cycle until the problem is identified.

A big part of the power of differential diagnosis is that it’s written down. Anybody can walk into the room, read the whiteboard, and understand the state of the collaborative effort. It cuts down on redundant Q&A, because the most salient information is summarized on the board. It eliminates inefficient chat log reading – the chat log is still there, but you use it to search for specific pieces of information instead of reading it like a novel. But, most importantly, differential diagnosis cuts down on fundamental common ground breakdowns, because everybody has agreed to accept what’s on the whiteboard as the canonical state of troubleshooting.
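
As a concrete (if fanciful) illustration, here’s a minimal sketch of the state such a board carries at any point in the cycle, using the warp-engine scenario from the ChatOps example below. This is just my own illustration; the field names are made up for the sketch:

  // A minimal, illustrative sketch of a differential diagnosis board's state.
  // Each pass through the cycle adds symptoms, adds hypotheses, and knocks
  // hypotheses out.
  const board = {
    symptoms: [
      { id: 'symp0', text: 'warp engine going full speed, but ship not moving' }
    ],
    hypotheses: [
      { id: 'hypo1', text: 'feedback loop between graviton emitter and graviton roaster', falsified: true }
    ],
    tests: [
      { id: 'test1', text: 'reboot the quantum phase allometer', hypothesis: 'hypo1', finished: true }
    ]
  };

  // The cycle repeats until only one hypothesis is left standing.
  const stillInPlay = board.hypotheses.filter((h) => !h.falsified);
  console.log(`Hypotheses still in play: ${stillInPlay.map((h) => h.id).join(', ') || 'none'}`);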

Integrating Differential Diagnosis with ChatOps

We don’t want to lose the off-the-cuff, conversational nature of ChatOps. But we need a structured source of truth to provide a point-in-time understanding of the effort. And we (read: I) don’t want to write a whole damn software project to make that happen.

My proposal is this: use Trello for differential diagnosis, and integrate it with the chat through a Hubot plugin. I haven’t written this plugin yet, but it shouldn’t take long (I’ll probably fork hubot-trello and start from there). That way people could update the list of symptoms, hypotheses, and tests on the fly, and they’d always have a central source of common ground to refer to.

In the system I envision, the chat room conversation would be peppered with statements like:

Geordi: hubot symptom warp engine going full speed, but ship not moving

Hubot: Created (symp0): warp engine going full speed, but ship not moving

Beverly: hubot falsify hypo1

Hubot: Falsified (hypo1): feedback loop between graviton emitter and graviton roaster

Geordi: hubot finish test1

Hubot: Marked (test1) finished: reboot the quantum phase allometer

And the resulting differential diagnosis board, containing the agreed-upon state of the troubleshooting effort, would be a simple set of lists (symptoms, hypotheses, and tests), with cards labeled to indicate that they’re no longer in play.
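
To give a sense of how little glue code this would take, here’s a rough sketch of what the “symptom” handler in such a Hubot script might look like. I haven’t written this yet: the environment variable names, the symp0-style counter, and the Trello list setup are all assumptions for illustration, and the other commands (falsify, finish, and so on) would follow the same pattern.

  // Hypothetical sketch of the "hubot symptom <description>" command. Assumes a
  // Trello board with a "Symptoms" list, plus TRELLO_KEY, TRELLO_TOKEN, and
  // TRELLO_SYMPTOM_LIST_ID in the environment; all of these names are made up.
  module.exports = (robot) => {
    robot.respond(/symptom (.+)/i, (msg) => {
      const description = msg.match[1];

      // Create a card on the Symptoms list via Trello's REST API.
      const query = [
        `key=${process.env.TRELLO_KEY}`,
        `token=${process.env.TRELLO_TOKEN}`,
        `idList=${process.env.TRELLO_SYMPTOM_LIST_ID}`,
        `name=${encodeURIComponent(description)}`
      ].join('&');

      robot.http(`https://api.trello.com/1/cards?${query}`)
        .post('')((err, res, body) => {
          if (err || res.statusCode !== 200) {
            msg.reply(`Couldn't create the symptom card: ${err || body}`);
            return;
          }
          // Hand out short IDs (symp0, symp1, ...) so the room can refer back to
          // cards with "hubot falsify ..." and "hubot finish ...".
          const n = robot.brain.get('symptomCount') || 0;
          robot.brain.set('symptomCount', n + 1);
          msg.send(`Created (symp${n}): ${description}`);
        });
    });
  };

Falsifying a hypothesis or finishing a test would amount to adding a label to the corresponding card, which the Trello API also supports.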

What do you think?

Let me know if your organization already has something like this, or has tried a formal differential diagnosis approach before. I’d love to read some observations about your team’s process in the comments. Also, VictorOps has a pretty neat suite of tools that approaches what I have in mind, but I still think a more conceptually structured (not to mention free) solution could be very useful.

Automation is most effective when it’s a team player. By using automation to preserve common ground, we can solve problems faster and more thoroughly, with less frustration and less waste. And that all sounds pretty good to me.

10 thoughts on “Troubleshooting On A Distributed Team Without Losing Common Ground”

  1. Dieter_be

    Great post Dan, thanks.
    I’ve also been thinking about adding more structure to conversations. Troubleshooting sessions can be quite chaotic, and the single, linear dimension of a conversation (chat or otherwise) puts a big damper on the effectiveness of working together. I think the same holds true for team meetings, where you often come up with things that should be addressed, extra questions, etc., but the conversation can rarely evolve in more than one direction at the same time. So I had been thinking about a “second stream” (next to the actual conversation/meeting), which could be a chat room in which people could jot down notes and questions as they come up. Part of the use here is that you can get them out of your head so you can focus on the conversation at hand, but it also helps make sure all points eventually do get addressed, and I feel it would help optimize the time of the people involved. Some topics only require a subset of people to participate, and it’s always nice to be able to organize a meeting to optimize for that, even if the topics of the meeting change while it’s being held.

    Anyway, this may seem like a different topic, but I think the problems are quite similar, and perhaps a solution can be built that addresses both.
    Any particular reason why you think a kanban board would be a good fit? The usual benefits of a kanban board don’t seem to apply (progression through columns, WIP limits, etc.), so it seems like you’re choosing to use it as a generic way to organize things into a few categories, unless I’m missing something?

    I had been thinking of just a command line tool / custom little web app or something.

    1. You’re right: the reason I jumped to Trello is that it already exists. It’s a way of organizing pieces of information into groups, and it does the most important things for my use case: it lets you label things as no longer active, lets you attach images to things, and it provides an API.

      I needed something a little more structured and automatable than, say, a Google Doc. But for the case of managing tangents in a meeting, I use Google Docs and I find the solution pretty compelling. It takes a little more effort to parse, but it’s also more flexible and user-friendly than any DIY tool I could write.

  2. Hi Dan,

    Thank you for this post. Definitely can commiserate with the chatOps/fog of war/getting up to speed chaos in the heat of the moment thing.

    I read your earlier post here: https://danslimmon.wordpress.com/2012/07/20/you-know-whos-smart-friggin-doctors-man/. When you were diagnosing the donations dip issue, you mentioned “we” offhand; how do you decide who “we” are? As this post states, diversity of opinion at the outset gives the best outcome for the differential diagnosis method you’ve described. Are there concrete examples of a DevOps way of getting core stakeholders together at the beginning of this differential exercise?

    Also, it’s really interesting that you looked toward the domain of medicine to find problem-solving methods in a communal setting, when DevOps can also trace its origins back to lean manufacturing, e.g. Toyota car manufacturing.

    1. Great question about diversity of opinions. I don’t have a formal answer for it, nor can I cite one from the medical literature. So I looked at the social science literature on brainstorming, and I still can’t find much research on the effect of role diversity.

      My inspiration for emphasizing the need for role diversity in this post came directly from this paper about space shuttle mission control:

      https://drive.google.com/file/d/0B4ObM99Dmka6TWRFMDk4NFdncWs/view

      which argues that, in translating one team’s ideas to the parlance of a different team, we force ourselves to refine our own understanding and gather stronger evidence. All of which improves our results.

      So, to your question: I try to involve people from all parts of the service chain in the hope that together we’ll have a holistic model of the problem, all the way from customer to server. As long as you’re trying to draw on the strengths of every part of the team, you’ll pick roughly the right set of people. For the “dip” issue, I believe we had 2 ops people, a coder, an account manager, and the VP of technology in the room.

  3. Leila Johannessen’s PhD work showed clearly the need to invest in establishing & maintaining common ground as a prerequisite for future problem solving. She observed experts investing time and attention in tracking a process and in tracking another cognitive agent’s understanding of that process in order to be able to solve new problems quickly. The process was too complex to allow an expert to be called in after the problem was engaged — the time and effort needed to get up to speed in order to begin work on solving the problem was too much and too long for the expertise to be of use. In all likelihood what you are calling common ground is only one part of the Common Ground that is required to solve devops problems.

    Sustaining common ground depends on the individuals staying abreast of the situation — both of the process and the understanding that others have about that process — and this requires substantial effort. It is one reason that true outsiders rarely contribute much to cutting-edge problem solving in devops. One might think of the expertise that can be brought to bear as a penumbra: not directly involved but aware of what is going on and (quickly) recruitable and (quickly) up to speed.

    There are studies showing how groups accomplish this, some in highly demanding and complex settings. NASA’s Mission Control uses voice loops, described by Emily Patterson and Jennifer Watts. These are audio channels dedicated to specific roles or functions but available for monitoring by the larger group. Interesting in their study is the fact that (1) very few people speak on these channels, and those that do speak briefly, so that the channels are largely quiet, and (2) people monitor multiple channels simultaneously with ease while they work. People choose which channels to monitor depending on their roles and the mission conditions. It is clear that the monitoring adds some small cognitive load but that this listening in pays back dividends. Perhaps most notable of the dividends is the relative paucity of direct comm coordination overhead required and the ease, and especially the smoothness, with which the network of actors can be engaged in problem solving. Although this is surely a unique, mature situation where the network of actors and the supporting comm have been tailored over many years, it provides a good example of what can be accomplished.

    The Mission Control setting is particularly powerful because there is a great deal of shared information projected to all the actors. We are mostly lacking this in devops, although the dashboards and other monitoring information available go some way toward supporting the same sort of common awareness. Because the kinds of problems that occur in internet operational settings are so varied, it’s probably not possible to create a priori a single display that can support Common Ground or even common ground ;-> I know that Mark Imbriaco’s group is seeking to build tools that can support common information displays in order to promote common grounding. This is a challenging area and likely to see significant change over the next 2 years.

    See
    Johannessen, Cook, & Woods, 1995. “Grounding Explanations in Evolving, Diagnostic Situations.” NASA-CR-198027. http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19950018915.pdf

    A more complete account of the study is found in Johannessen, L., 2008. “Maintaining Common Ground: An Analysis of Cooperative Communication in the Operating Room.” In Nemeth, C., ed., Improving Healthcare Team Communication: Building on Lessons from Aviation and Aerospace. Ashgate Press. pp. 179-203. (ISBN 978-0754670254)

    Patterson, Watts-Perotti, Woods, 1999. “Voice Loops as Coordination Aids in Space Shuttle Mission Control.” Comp Supp Coop Work 8: 353-371. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.7042&rep=rep1&type=pdf
