The Challenges of RCA in ITIL and the "New" Deming Cycle
RCA Work is Hard!
The incident was caused by a primary access router, which was 'flapping' due to a defective port. (...) Because the router was not 'hard down', redundancy was not invoked.
Site contact / escalation information not current. Inaccurate documentation on site business description.
Both examples show that it is easy to find a plausible explanation or description of what happened. However, the questions that should have followed were: Why was that defective port 'flapping'? and Why was that documentation inaccurate or not current?
The RCA methodologies do not provide much guidance about when you have reached a "true" root cause. Determining a true root cause is difficult for four reasons:
a. Technical Aspects - IT technology can be complex. Interdependencies between different IT towers, obscure utilities, databases, applications, networks, security, etc., make the task of finding a root cause challenging. Increasingly fast technology developments (e.g., in the areas of virtualization or security) and the associated needs for continuous training can result in a never-ending race against time.
b. Multiple Causes - At times, there is multi-causation: a main root cause and some secondary contributing factors. For example, a break/fix is the root cause, but an operator also missed a system message during the outage. Or a change exceeded its change window and got wrecked by the scheduled backup.
c. Experience - Nothing beats experience in the process of acquiring a broad range of knowledge and understanding of a technical architecture. "Been there, done that" is an obvious advantage when looking for a root cause. Some of that experience walks out the door when seasoned IT professionals leave the organization.
d. Logic - Books (e.g., by Kahneman and Tversky) have been written to explain how people make choices in a complex decision tree of the kind that would be depicted in an Ishikawa fishbone diagram. One has only to look at the Wikipedia entry on logic to be floored by the science and the number of theories around the topic. Investigating logical relations between IT incidents and their root causes is not an activity for people who are intimidated by hard sudokus.
The above four reasons should mandate the use of experienced, sharp, and costly resources for RCA efforts.
Novices think that by following the heuristic [as outlined in any methodology], they will arrive at the correct solution; however, difficult problems often require a trial and error method. Yet novices will stubbornly stick to a failing solution, whereas experts with deep conceptual understandings will quickly see that a solution is not working and respond with a completely new procedure. Their problem solving has everything to do with adaptability and deep knowledge structures and nothing to do with the simple problem solving methods described above. (Taken from the article "Leadership and Direction" by Donald Clark.)
In an internal IT environment, it is difficult to free up such resources for proactive work. In an outsourced environment, freeing up these resources might be even more difficult because margins can be paper-thin. When outsourcing service providers sign a new contract, they often assign newly hired employees to service that particular account. Because of utilization and profit considerations, RCAs are sometimes assigned to these fresh resources instead of to seasoned professionals.
Metrics and Help Desk Bonuses
Another hurdle might be plain open revolt against pPM. In the aforementioned financial services company, the service desk leads got a quarterly bonus based on First Call Resolution (FCR) rates. The pPM would have brought the FCR down. If you eliminate the top 10 incident reasons, you eliminate the tickets that keep coming back to the service desk. However, the service desk people love those calls: they know the answer, so the tickets are easily solved on the first call, which boosts the FCR percentage.
In the above company, there was a protest and outcry from the service desk leads against the potential elimination of these tickets and the effect on their bonuses.
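The effect on FCR described above can be illustrated with a small calculation. This is only a sketch: the ticket volumes and first-call resolution counts are invented for illustration and do not come from the company in question, and the names TicketGroup and fcr are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TicketGroup:
    name: str
    tickets: int              # tickets per quarter (hypothetical figure)
    resolved_first_call: int  # of those, resolved on the first call


def fcr(groups):
    """First Call Resolution rate: first-call resolutions / all tickets."""
    total = sum(g.tickets for g in groups)
    if total == 0:
        return 0.0
    return sum(g.resolved_first_call for g in groups) / total


# Assumed numbers: recurring incidents are easy, the rest are harder.
recurring = TicketGroup("top 10 recurring incidents", 600, 570)
other = TicketGroup("all other incidents", 400, 200)

fcr_before = fcr([recurring, other])  # 770 / 1000
fcr_after = fcr([other])              # 200 / 400, recurring tickets eliminated

print(f"FCR before pPM: {fcr_before:.0%}")  # prints "FCR before pPM: 77%"
print(f"FCR after pPM:  {fcr_after:.0%}")   # prints "FCR after pPM:  50%"
```

Under these assumed numbers, the service desk handles 60% fewer tickets, yet its FCR falls from 77% to 50%, so a bonus tied to FCR alone would penalize the desk for a genuine improvement.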
When the help desk or service desk is outsourced, there might be similar reasons for the service provider to be unhappy about the elimination of the frequently occurring tickets. On the one hand, easy revenue is eliminated. On the other, higher-skilled and more expensive resources are needed to address the not-so-frequent tickets. Lower profits and unhappy account delivery managers can be the result.