Think Like a Detective: Using 5W2H to Solve Production Mysteries

I love the idea behind “you build it, you run it”: it’s a great way to make sure a team is accountable for the product it creates. But when things go wrong, it’s not always easy to find out what happened. I can relate to the feeling of being lost in the middle of a production incident, unable to find the root cause of the problem.

In the end, we are all detectives trying to solve the mysteries of our production systems. And, like any good detective, we need to ask the right questions to get to the bottom of things. That’s where the 5W2H method comes in.

What is the 5W2H method?

The 5W2H method is a straightforward yet powerful questioning technique, often attributed to Toyota’s quality-management practices. It’s a simple way to make sure you’re asking the right questions and not overlooking an important angle.

I’ve seen the 5W2H method applied in various contexts, from action plans to project planning. My first encounter with it was years ago while working with a product manager who was a big advocate of the method. He consistently used it to ensure we were asking the right questions when planning new features.

I was struck by its simplicity and effectiveness in getting to the root of problems. This inspired me to think: why not apply the 5W2H method to solve production mysteries?

Applying the 5W2H Method to Production Issues

The acronym stands for seven key questions: What, Why, Where, When, Who, How, and How Much.
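One way to keep the seven questions in front of you during an incident is to treat them as a small checklist. The sketch below is only an illustration: the `IncidentNotes` class, its field names, and the example incident are all made up for this post, not part of any standard tool.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentNotes:
    """One free-text answer per 5W2H question (hypothetical structure)."""

    what: str = ""
    why: str = ""
    where: str = ""
    when: str = ""
    who: str = ""
    how: str = ""
    how_much: str = ""

    def unanswered(self) -> list[str]:
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Made-up incident: only two questions have been answered so far.
notes = IncidentNotes(
    what="Checkout API returns HTTP 500",
    when="Started shortly after the 14:00 deploy",
)
print("Still open:", ", ".join(notes.unanswered()))
```

Calling `unanswered()` as the investigation progresses shows which questions are still open, without forcing you to answer them in any particular order.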

What?

The first question is “What?”, and it’s about understanding what happened. What is the problem? What is the impact? What is the root cause?

For example, if you’re trying to figure out why a service went down, you can start by asking:

  • What is the error message?
  • What is the unexpected behavior?

It’s important to gather as much information as possible, so you can have a clear picture of what happened.
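As a rough illustration of the “What?” step, the sketch below counts the most frequent error messages in a handful of log lines. The log format (timestamp, level, message) and the sample lines are assumptions made for this example; your logs will almost certainly look different.

```python
from collections import Counter

# Made-up log lines; in practice these would come from your log store.
log_lines = [
    "2024-05-01T14:02:11Z ERROR checkout: connection refused to payments-db",
    "2024-05-01T14:02:12Z INFO checkout: request served",
    "2024-05-01T14:02:13Z ERROR checkout: connection refused to payments-db",
    "2024-05-01T14:02:15Z ERROR cart: timeout calling inventory",
]


def top_errors(lines, limit=3):
    """Count ERROR messages, ignoring the timestamp and level prefix."""
    messages = []
    for line in lines:
        parts = line.split(" ", 2)  # timestamp, level, rest of the line
        if len(parts) == 3 and parts[1] == "ERROR":
            messages.append(parts[2])
    return Counter(messages).most_common(limit)


for message, count in top_errors(log_lines):
    print(f"{count}x {message}")
```

Seeing which message dominates is often enough to write a first, concrete answer to “What is the error message?”.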

Why?

The second question is “Why?”, and it’s about understanding why it happened. Why did the problem occur? Why did the system behave that way?

At this point, you can start brainstorming possible causes based on the information you gathered in the previous step. A pro tip: leverage historical data, such as logs and incident reports, to help you understand the context of the problem.

Where?

The third question is “Where?”, and it’s about understanding where it happened. Where did the problem occur? Where is the impact?

Now it’s time to pinpoint the affected part of the system and understand the scope of the problem. It’s important to know whether the problem is isolated or affecting multiple services. If applicable, also identify the environment (production, staging, etc.) and the version involved.
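To make the “isolated or multiple services” check concrete, here is a minimal sketch that groups error events by service within one environment. The event shape (`service` and `env` keys) and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical error events, e.g. exported from a monitoring tool.
events = [
    {"service": "checkout", "env": "production"},
    {"service": "checkout", "env": "production"},
    {"service": "cart", "env": "production"},
    {"service": "checkout", "env": "staging"},
]


def affected_services(events, env):
    """Return a {service: error_count} mapping for one environment."""
    counts = defaultdict(int)
    for event in events:
        if event["env"] == env:
            counts[event["service"]] += 1
    return dict(counts)


production = affected_services(events, "production")
scope = "isolated" if len(production) == 1 else "multiple services"
print(production, "->", scope)
```

If only one service shows up, the investigation can stay narrow; if several do, it is worth looking at what they share, such as a database or a network path.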

When?

The fourth question is “When?”, and it’s about understanding when it happened. When did the problem occur? When did the system start behaving that way?

It’s important to know the timeline of the problem, so you can put it in context and identify possible triggers, such as deployments or configuration changes.

Also, if the problem is time-sensitive or recurring, it’s important to know when it’s most likely to happen.
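A simple way to look for triggers is to list the changes made shortly before the incident started. The sketch below assumes you can get a list of changes with timestamps from somewhere (a deploy tool, a change log); the names, times, and the one-hour window are invented for the example.

```python
from datetime import datetime, timedelta

# Made-up incident start and change history.
incident_start = datetime(2024, 5, 1, 14, 2)
changes = [
    ("deploy checkout v1.8.0", datetime(2024, 5, 1, 9, 30)),
    ("config change: payments-db pool size", datetime(2024, 5, 1, 13, 55)),
    ("deploy cart v2.3.1", datetime(2024, 5, 1, 14, 30)),
]


def recent_changes(changes, start, window=timedelta(hours=1)):
    """Return changes made within `window` before `start`, newest first."""
    hits = [(name, ts) for name, ts in changes if start - window <= ts <= start]
    return sorted(hits, key=lambda item: item[1], reverse=True)


for name, ts in recent_changes(changes, incident_start):
    minutes = int((incident_start - ts).total_seconds() // 60)
    print(f"{name} ({minutes} min before the incident)")
```

A change that lands just before the incident is only a suspect, not a culprit, but it gives the “Why?” brainstorm a good place to start.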

Who?

The fifth question is “Who?”, and it’s about understanding who was involved. Who was affected by the problem? Who was responsible for the system?

This information helps you identify which team should be involved in the investigation and make sure that the right people are aware of the problem.

How?

The sixth question is “How?”, and it’s about understanding how it happened. How did the problem occur? How did the system behave that way?

This is a good moment to outline the steps that led to the problem and understand the sequence of events that caused it.

Keeping track of those steps can help you identify possible causes and understand the context of the problem.

How much?

The seventh question is “How much?”, and it’s about understanding the magnitude of the problem. How often did it occur? How large was the impact?

Many people might be involved depending on the impact of the problem, so it’s important to understand its scope and how much it affected the system.

For example, depending on the impact, you might need to update a status page or send a notification to customers.
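To put a number on “How much?”, a common starting point is the share of failed requests. The sketch below is an assumption-heavy illustration: the 5% threshold, the function names, and the request counts are invented, and your team should pick its own criteria for when to communicate.

```python
def error_rate(failed, total):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def suggested_action(rate, status_page_threshold=0.05):
    """Map an error rate to a communication step (hypothetical policy)."""
    if rate >= status_page_threshold:
        return "update the status page and notify customers"
    if rate > 0:
        return "keep investigating internally"
    return "no customer impact detected"


# Made-up numbers: 120 failed requests out of 2000.
rate = error_rate(failed=120, total=2000)
print(f"{rate:.1%} of requests failed: {suggested_action(rate)}")
```

Agreeing on thresholds like this ahead of time means nobody has to debate them in the middle of an incident.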

Conclusion

One of the secrets of being a good detective is asking the right questions. The 5W2H method is a fantastic way to ensure you’re asking the right questions when trying to solve production mysteries. But don’t push yourself too hard: you don’t need to ask all the questions at once. Take your time, and ask the questions that make sense for the context of the problem. It’s a method to support your investigation, not a rule, so feel free to adapt it to your needs.

I hope you enjoyed this blog post. If you have any questions, please feel free to reach out to me on Twitter. Let me know if using the 5W2H method helped you solve a production mystery; I would love to hear your story. Don’t forget to share this post with your friends and colleagues. 🚀