How to drive root cause analysis and fix defects at any company
When was the last time your systems failed and customers were impacted? This has happened hundreds of times in my career. Sometimes I was the engineer who caused the issue, especially in my early years at Amazon, and sometimes I was the General Manager of mission-critical services.
So shit happened - your service is down and unresponsive. The first critical task is resolving the customer issue, and then resolving the service issue. Once you’ve done that, what now? Do you move on and hope the problem is gone for good? No, please don’t do that!
You will need to step back and do a post-incident review. Then use those learnings to ensure that the defects and systemic issues have been eliminated. In an ideal world, failures don’t happen. Until we live in that world, the intention behind the post-incident review is to help teams structure an incident review and facilitate conversation around what can be done better. It’s an opportunity to look introspectively at the team, process, and technology, and to openly discuss how the customer can be better served.
There are likely hundreds of articles on the importance of root cause analysis, but there is much confusion about how to do it effectively. Blame is not at the heart of the process. Instead, the goal is to identify the issue and take concrete steps so that the same issue is avoided in the future. If you don’t understand how you failed, it’s almost impossible to avoid the same situation.
You will want to assign a clear owner for the Correction of Error (COE, aka the post-incident response), and have that team spend the next few days documenting details.
You, as a leader, must ensure they are blame-free. Individuals should never be called out or shamed.
It is important that the COE is reviewed broadly, not only to identify the root cause of the error but also to ensure that the lessons learned from the incident are shared with and understood by other teams. The owning team is accountable for the actions that come out of the COE, and team members must sign up to prioritize and complete them. The focus is on learning lessons and making corrections so that future errors can be avoided.
A typical AWS COE has the following key sections:
Summary: A concise and self-critical description of what happened.
Metrics/Graphs: A visualization that conveys the duration and severity of the event.
Customer Impact: A section answering questions related to the impact on customers during the event.
How many customers were impacted?
How long did the event last?
What was the impact on customers during the event?
Incident Response Analysis: An analysis of our response to the event.
How did we do during the event?
What did we understand while it was happening?
Post-Incident Analysis: An analysis of what was learned since the event occurred.
Timeline: An almost forensic account, minute by minute (or even second by second), of what happened.
The 5 Whys: A causal graph of why this event happened. Ask why it happened, then iterate with another ‘Why?’ until you reach the underlying factor that was the real culprit.
Lessons Learned: A list of the main lessons the team learned from the event.
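The Timeline and Customer Impact sections lend themselves to lightweight tooling. As a minimal sketch (the events, timestamps, and function names here are hypothetical, not part of any AWS tool), a script can derive the impact window for the Metrics section directly from the timeline entries:

```python
from datetime import datetime

# Hypothetical incident timeline: (ISO timestamp, what happened).
# In a real COE this would be far more granular, often second by second.
events = [
    ("2024-03-01T14:02:00", "alarm fired: elevated 5xx error rate"),
    ("2024-03-01T14:05:00", "on-call engineer paged"),
    ("2024-03-01T14:31:00", "bad deployment rolled back"),
    ("2024-03-01T14:40:00", "error rate back to baseline"),
]

def impact_window(events):
    """Return the start, end, and duration of customer impact,
    assuming the first and last timeline entries bound the event."""
    start = datetime.fromisoformat(events[0][0])
    end = datetime.fromisoformat(events[-1][0])
    return start, end, end - start

start, end, duration = impact_window(events)
print(f"Impact window: {start} -> {end} ({duration})")
```

Numbers derived this way keep the Summary and Metrics sections honest: the stated duration comes straight from the forensic timeline rather than from memory.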