As mentioned in episode 81 – What is “Systems Thinking,” I will talk about root cause analysis, or RCA – another aspect of a systems approach. Let’s define it and talk about how it can be applied.
In episode 33: Lean Safety, I recommended some excellent books to help you out. TapRoot® is another great resource, as is the book Pre-Accident Investigation by Todd Conklin – I have this one, and it contains great insight. But remember, whatever books you read or concepts you follow, if all you get is a strategic overview of these principles, then you will need to follow that up with tactical training to be able to use the right tools and techniques for a given situation. For example, I went through HOP training and was a HOP coach for a former employer. So jump on LinkedIn, ask your network how they are applying these concepts, and ask for help.
One thing I want to share about Conklin’s book that has stuck with me (there is a lot, but this one is a favorite): “Safety is not the absence of events; it is the presence of defenses.” So true!
So let’s get into a definition of RCA. First, I want to reference TapRoot®. Over 30 years ago, they started with research into human performance and the best incident investigation and root cause analysis systems available. They put this knowledge to use to build a systematic investigation process with a coherent investigation philosophy. Then, they used and refined the system in the field. In 1991, they wrote the first TapRoot® Manual that put all of this experience together into an incident investigation system called TapRoot®.
According to TapRoot®, a definition of Root Cause Analysis is as follows: “The systematic process of finding the knowledge or best practice needed to prevent a problem.” What root cause analysis tools or methods should you use? Here is guidance to help you pick the root cause analysis system you should use:
- You need to understand what happened. You can’t understand WHY an incident occurred if you don’t understand HOW it happened, and before either of those, you need to know WHAT happened.
- You need to identify the multiple Causal Factors (this is the HOW) that caused the problem (the incident). Your root cause analysis system should have tools to help you identify these points that will be the start of finding root causes.
- You will need to dig deeper and find each Causal Factor’s root causes. These are the causes of human performance and equipment reliability issues. TapRoot, with many years of experience, has found that investigators (even experienced investigators) need guidance – an expert system – to help them consistently identify the root causes of human performance and equipment reliability issues. This guidance should be part of the root cause analysis system. Plus, the root cause analysis tool must find fixable causes of human error without placing blame. Blame is a major cause of failed root cause analysis.
- If this is a major issue, you should go beyond the specific root causes of this particular incident. For major investigations, you should look one level deeper for the Generic (systemic) Cause of each root cause. Not every root cause will have a Generic Cause. But, if you can identify the Generic Cause of a root cause, you may be able to develop corrective action that will eliminate a whole class of problems. Thus, your systematic process should guide you to find Generic Causes for major investigations.
- Root cause analysis is useless if you don’t develop effective corrective actions (fixes) that will prevent repeat incidents. TapRoot has seen that investigators may not be able to develop effective fixes for problems they haven’t seen fixed before. Therefore, your root cause analysis system should have guidance for developing effective fixes.
- Finally, you will need to get management approval to make the changes (the fixes) to prevent repeat problems. Thus, your root cause analysis system should include tools to effectively present what you have found and the corrective actions to management so they can approve the resources needed to make the changes happen.
Another take on this is that RCA involves investigating the patterns of adverse effects, finding hidden flaws in the system, and discovering specific actions that could have contributed directly and indirectly to the problem, which often means that RCA reveals more than one root cause. Thus the term lends itself to criticism by some folks: it implies one root cause when we know that isn’t the intent. I call these folks word nerds, and these are the gurus I refer to when I point this out, not the ones I mentioned earlier who understand this. They take the literal meaning instead of using their brains a little and understanding that it is just a catchphrase, and that the purpose or spirit of the phrase is what is essential.
A second example of an organization focused on this topic is ASQ. The American Society for Quality (ASQ), formerly the American Society for Quality Control (ASQC), is a knowledge-based global community of quality professionals, with nearly 80,000 members dedicated to promoting and advancing quality tools, principles, and practices in their workplaces and communities.
According to an ASQ article published back in 2004:
“Simply stated, RCA is a tool designed to help identify not only what and how an event occurred, but also why it happened. Only when investigators can determine why an event or failure occurred will they be able to specify workable corrective measures that prevent future events of the type observed.
Understanding why an event occurred is the key to developing effective recommendations. Imagine an occurrence during which an operator is instructed to close valve A; instead, the operator closes valve B. The typical investigation would probably conclude operator error was the cause.
This is an accurate description of what happened and how it happened. However, if the analysts stop here, they have not probed deeply enough to understand the reasons for the mistake. Therefore, they do not know what to do to prevent it from occurring again.
In the case of the operator who turned the wrong valve, we are likely to see recommendations such as retrain the operator on the procedure, remind all operators to be alert when manipulating valves, or emphasize to all personnel that careful attention to the job should be maintained at all times. Such recommendations do little to prevent future occurrences.
Generally, mistakes do not just happen but can be traced to some well-defined causes. In the case of the valve error, we might ask, “Was the procedure confusing? Were the valves clearly labeled? Was the operator familiar with this particular task?”
The answers to these and other questions will help determine why the error took place and what the organization can do to prevent a recurrence. In the case of the valve error, recommendations might include revising the procedure or performing procedure validation to ensure references to valves match the valve labels found in the field.
Identifying root causes is the key to preventing similar recurrences. An added benefit of an effective RCA is that, over time, the root causes identified across the population of occurrences can be used to target major opportunities for improvement.”
So this is why the battle against the word “why” confuses me (see, I cannot even avoid using the word in that statement!). I mean, I get that it can be used negatively, as in “why did you do that?” Tone and inflection play into this a lot, as well. But we know that blame isn’t the reason for asking, nor is it how we should ask the question. I mean, if you have ever heard a child ask “why” (and repeat it 25 times!), then you probably know that it is merely an attempt to understand the world around them. We don’t discourage children from learning more about the world around them by asking “why,” but we do discourage adults for some reason. As adults, we are supposed to be smarter than that, able to use logic and reason – you would think we would have figured that out. So instead of hating on the word “why,” shouldn’t we spend more time on training and education in its proper use?
Let me illustrate the use of “why” with an example not related to workplace safety. Simon Sinek, who wrote the book “Start with Why,” is best known for popularizing the concept of WHY in his first TED Talk in 2009, the third most popular TED video of all time. He talked about the importance of why, before the how or what. He gave the example of Apple: why is it they are so much more successful than other tech companies? They have access to the same talent pool, same consultants, same media. So what makes them so special? You wouldn’t explain this by describing what they make, or how they make those items – but the why gets to their core purpose, which then determines how they make what they make. Remember the TapRoot® or ASQ approach? Sound familiar?
OK, so you might be saying that the context of the TED Talk is completely different when it comes to workplace incidents or accidents. Maybe, but in both cases, asking why is meant to seek a deeper understanding of potential drivers or root causes, rather than explaining how something happened or even simply describing what happened in a single word or phrase. For example:
What happened: There was a fire in the furnace room.
How it happened: Combustible materials were stored too close to the furnace.
Why it happened: We have not established a fire prevention program or an effective housekeeping program, updated roles/responsibilities, assigned accountability, conducted/tracked training, developed a tracking system, inspected/audited, and reviewed to ensure it was being followed/maintained.
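For those who track investigations in a spreadsheet or a small script, the furnace room example above can be written down as a simple record. This is only a rough sketch to show the three layers side by side, not a TapRoot® or ASQ tool: the `Incident` class, its field names, and the `is_complete` check are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """One incident described at three depths: what, how, and why (hypothetical structure)."""
    what: str                                      # the event itself
    how: list[str] = field(default_factory=list)   # causal factors
    why: list[str] = field(default_factory=list)   # root causes (gaps in the system)

    def is_complete(self) -> bool:
        # An investigation that stops at "what" or "how" has not reached root causes.
        return bool(self.what and self.how and self.why)


furnace_fire = Incident(
    what="There was a fire in the furnace room.",
    how=["Combustible materials were stored too close to the furnace."],
    why=[
        "No established fire prevention program",
        "No effective housekeeping program",
        "Roles/responsibilities not updated and accountability not assigned",
        "Training not conducted/tracked",
        "No inspections/audits or reviews to ensure the program was followed",
    ],
)

# The same incident, but with the investigation stopped at "how".
stopped_early = Incident(what=furnace_fire.what, how=furnace_fire.how)

print(furnace_fire.is_complete())
print(stopped_early.is_complete())
```

The point of the check is the same as the point of the example: a record with a “what” and a “how” but no “why” describes the incident without explaining it.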
So use the word “why” or don’t use it, as long as you seek a deeper understanding of all of the potential contributors to an incident. It is important to define this topic accurately because it is crucial to continuous process improvement – the foundation of a safety management system, which kicked this whole series off a few episodes ago.
Before I wrap this up, let me say that those criticizing traditional RCA, claiming it doesn’t get it done, likely do not fully understand root cause analysis, or they are peddling something – their book or training course. Or they could just be taking the term literally when most of us are using it to describe ALL of the things we need to do to “root out” all of the potential contributors (causal factors) to failure. See, everyone gets hung up on the words. I mean yes, they matter to an extent. But someone with a high EQ who is incredibly self-aware will be able to understand the context. And we should expect that in our profession.
I have even read posts saying things like, “don’t start at the end of the event and work back because there will be bias. Start at the beginning and work forward.” But if this approach is based on the fact that humans are flawed and will bring bias into the investigation (since we already know what happened), don’t you think those same flaws (biases) will be present regardless of where you start? Of course they will. So we must focus on how we go about investigations – in this case, objectivity is the critical component.
Besides, we were likely trained this way. For recordable injuries, look at the old OSHA 301 form: the very first question is NOT “What happened, starting with the event.” It is “What was the employee doing just before the incident occurred? Describe the activity, as well as the tools, equipment, or material the employee was using. Be specific.” So yeah, start BEFORE the event. Oh, and by the way, starting at the beginning is but ONE technique, ONE approach; it still gets to potential root causes and is, therefore, a root cause analysis tool.
There are other types of incidents where you may need to start with the current state and walk backward. I am reminded of my old fire days. If there were a fire and no one was around, are you going to ask the fire how it started its day? If you happen to have video available, guess where you are going to start – the point at which you see the first sign of smoke or flames.
Working backward, in this case, might be needed to establish a timeline. From there, you can determine at what point in the process it started, what stage of the cycle the process was in, time of day, environmental conditions or temperature at the time, etc. The reason you are investigating will also drive how you approach the investigation, as well as the tools and techniques you use. Criminal, insurance, fraud, or liability investigations, for example, each call for a different approach.
In the 1988 edition of Modern Accident Investigation and Analysis, Ted S. Ferry has several chapters breaking down various aspects of accidents that need to be investigated. Among them are Chapter 4: Human Aspects; Chapter 7: Systems Investigation; and Chapter 14: Where Did Management Fail? All along the way, he discusses the relationship between people, environment, and systems.
We have come a long way in many areas, but one thing is sure: we have known for a long time that a systems approach, looking at all aspects, is the best way to truly root out all potential contributors so that we may make smarter decisions moving forward.
Ferry writes: “Mishaps are a sign of inefficient operations and poor operating practices. We can conclude that a good investigation will uncover, among other things, management practices or oversights that have contributed to the mishap.”
We can even go way back to the 1960s, to the MORT (Management Oversight and Risk Tree) process, which defines a mishap as “the unwanted transfer of energy that produces injury to persons, damage to property, degradation of an ongoing process, and other unwanted losses.” MORT is based on the concept that all accidental losses arise from two sources: (1) specific job oversights and omissions, and (2) the management systems factors that control the job. It is frequently used to solve other management problems, as well.
I could go on, but I think you get my point. Become a student of these approaches. Learn about the history of these topics. Our industry will continue to adapt as we learn more about the interactions humans have with systems and organizations. But I wanted to tackle this topic since it is so critical to supporting a safety management system. How we execute Root Cause Analysis will drive every improvement we make to our people, processes, and policies/procedures, and how leaders continue to develop the sense through which they view their businesses.
I want to hear your take – how do you use continuous process improvement and RCA to support your safety management system? Have you struggled with any of the tools? Are you confused about which tools are ideal for which situations or for the problems you are trying to solve? Let me know – reach out via LinkedIn or shoot me an email.