When implementing an IT Service Management (ITSM) system, I always look forward to spending time on root cause analysis (RCA). Of course, Incident and Problem Management play a central role in ITSM design: it's crucial to give your teams, customers, and systems intuitive ways to communicate when something has gone wrong. However, it is equally important that organizations spend time identifying the key driver of these problems by performing an RCA, so that they do not recur. At the end of the day, incidents and problems cost your organization money, and a good RCA can help save it. It's this viewpoint that has led me to dub RCA the under-appreciated hero of ITSM, and in this post I will share the aspects of a successful RCA that can help vanquish problems once and for all.
It's important to distinguish between Problem Management and Incident Management. In broad strokes, the goal of Problem Management is to get to the root cause of one or more incidents and then address it with an appropriate change so that the incident does not recur. This increases the mean time between failures, and in this sense the approach is proactive. Incident Management, on the other hand, aims to reduce the mean time to recovery by responding to and resolving incidents quickly; its approach is reactive.
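To make the two metrics concrete, here is a minimal sketch of how mean time to recovery and mean time between failures could be computed from an incident log. The incident timestamps and function names are hypothetical, and I am assuming that "time between failures" means the uptime from the end of one incident to the start of the next; some teams measure it start to start instead.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage started, service restored).
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),
    (datetime(2024, 1, 11, 9, 0), datetime(2024, 1, 11, 11, 0)),
    (datetime(2024, 1, 21, 9, 0), datetime(2024, 1, 21, 9, 30)),
]


def mean_time_to_recovery(incidents):
    """Average outage duration: the figure Incident Management tries to shrink."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mean_time_between_failures(incidents):
    """Average uptime between incidents: the figure Problem Management tries to grow.

    Assumption: measured from the end of one incident to the start of the next.
    """
    ordered = sorted(incidents)
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    if not gaps:
        raise ValueError("need at least two incidents")
    return sum(gaps, timedelta()) / len(gaps)


print("Mean time to recovery:", mean_time_to_recovery(incidents))
print("Mean time between failures:", mean_time_between_failures(incidents))
```

Faster response lowers the first number; a successful RCA that prevents recurrence raises the second.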
The core function of root cause analysis is to uncover the underlying reason why a problem occurred. While there are many different tools and approaches for performing an RCA, I've consolidated the key steps into the diagram below:
I've worked with teams that have a well-defined RCA process and others that are just beginning. I reference this diagram when we focus on RCA because it helps to illustrate how simple a process RCA can be. There aren't rigid guidelines or rules to follow; organizations can adopt their own RCA policies. What many don't realize, especially those who have yet to adopt RCA as a business process, is that it has a big payoff: cost savings.
Root cause analysis can be a cost-saving tool for organizations for a couple of reasons. First, identifying and acting on problems early saves money. The longer a problem goes on, the more money it costs the organization, and a properly deployed RCA process is built to help organizations become proactive rather than reactive. Second, the main goal of the RCA process is to prevent incidents from cropping up again. If the incident does not recur, then there won't be further downtime or lost production, saving money in the long run.
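The arithmetic behind that second point can be sketched in a few lines. Every figure below is a made-up illustration rather than data from a real engagement, and the function and parameter names are my own; the idea is simply to compare the downtime avoided with the time invested in the RCA.

```python
def rca_net_savings(recurrences_avoided, hours_down_per_incident,
                    downtime_cost_per_hour, rca_effort_hours, analyst_hourly_rate):
    """Estimate net savings from an RCA: avoided downtime cost minus RCA effort cost."""
    avoided_cost = recurrences_avoided * hours_down_per_incident * downtime_cost_per_hour
    rca_cost = rca_effort_hours * analyst_hourly_rate
    return avoided_cost - rca_cost


# Hypothetical scenario: an incident that would otherwise recur 4 times a year,
# 2 hours of downtime each time at 5,000 per hour, versus 16 hours of RCA work
# at 100 per hour.
savings = rca_net_savings(
    recurrences_avoided=4,
    hours_down_per_incident=2,
    downtime_cost_per_hour=5000,
    rca_effort_hours=16,
    analyst_hourly_rate=100,
)
print(savings)  # 38400
```

Plugging in your own incident history and rates is a quick way to show stakeholders what an RCA is worth.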
When working with organizations to implement an RCA process, I coach my clients on several aspects that can help the organization embrace RCA. They are:
The goal of Problem Management is to get to root cause.
Incident Management goal: reduce the mean time to recovery (by responding and resolving fast); reactive.
Problem Management goal: increase the mean time between failures (by determining the root cause of one or more incidents and addressing it with an appropriate change to prevent recurrence); proactive.
Ultimately, where incidents and problems cost your organization money, RCA saves it. It is for this reason that I think of RCA as an under-appreciated hero of ITSM. While the biggest barrier to RCA can be time, investing that time upfront will prevent repeat incidents from cropping up, saving your company time and resources in the long run. By implementing a few of these tips, I hope you come to appreciate RCA as I have. If you have any questions, let us know; we'd love to help.