The Challenges of RCA in ITIL and the "New" Deming Cycle

Proactive problem management (pPM) and root cause analysis (RCA) are the right path, but you may find open revolt along the way, writes ITSM Watch columnist Jan Vromant of Deloitte Consulting.
Jun 5, 2008

Jan Vromant

At a recent consulting engagement, I was helping an automobile OEM supplier document its ITIL processes. This multibillion-dollar company had outsourced its entire IT department to a trio of major outsourcing service providers. Gary, the manager in charge of most of the ITIL processes, was irritated and complained that the service providers did not understand what a Root Cause Analysis (RCA) involves.

His annoyed comment reminded me of the picture of “Plan, Do, Stop” that takes the place of the “Plan, Do, Check, Act” of the Deming cycle. Based on Gary’s explanation, it appeared that the service providers were unwilling or unable to support the “Check” and “Act” steps, even though Gary stated that their contracts clearly stipulated the obligation.

Root Cause Analysis

In terms of ITIL, an RCA finds its rightful place in the Problem Management area. In reactive Problem Management (rPM), the focus is on finding an immediate work-around. In proactive Problem Management (pPM), however, the focus is squarely on the RCA. You could consider the RCA the “Check” in the Deming cycle. Identifying the change that would reduce the likelihood of the incident(s) recurring is then the last piece, the “Act” in the cycle.

ITIL v3 has appropriately split up reactive and proactive Problem Management: rPM is now in Service Operation, while pPM is in Continual Service Improvement.

The maturity of Problem Management, according to the ITIL v3 book “Continual Service Improvement”, is 1.83 on a scale of 5—barely above Configuration Management. It puzzled me why more companies have not established a mature pPM. After all, isn't proactive Problem Management one of the processes where it is relatively easy to show a return on effort?

At $10 to $25 per incident ticket, the return on investment can be convincingly demonstrated, and can result in money in the bank. However, the complaint from Gary made it crystal clear that it is not that easy.
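As a rough illustration of that arithmetic, the sketch below estimates the net first-year savings from eliminating one recurring incident category. Only the $10 to $25 per-ticket range comes from the figures above; the monthly ticket volume, the one-time cost of the fix, and the function name `annual_savings` are hypothetical, chosen purely for the example.

```python
def annual_savings(tickets_per_month, cost_per_ticket, fix_cost=0.0):
    """Net first-year savings from eliminating a recurring incident category."""
    avoided_cost = tickets_per_month * 12 * cost_per_ticket
    return avoided_cost - fix_cost


# Hypothetical inputs: 3,000 tickets a month and a one-time fix costing $50,000.
low = annual_savings(3000, 10, fix_cost=50_000)
high = annual_savings(3000, 25, fix_cost=50_000)
print(f"Net first-year savings: ${low:,.0f} to ${high:,.0f}")
```

A calculation like this only covers service desk handling costs; it leaves out any time the end-user loses, so it is likely to understate the real benefit.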

Throughout my client experiences, I have found multiple reasons why Problem Management is relatively immature and why it is so difficult to implement pPM:

No Time

A common reason (and one I have heard over and over again) is, “We don't have the time.” For these clients, the current situation seems to be an endless loop of fire-fighting without the time to effectively address the issues at hand. This challenge reminds me of the story of a logger who is cutting a tree with a blunt saw. A visitor once asked him why he didn’t sharpen his saw. “Don't you see that I don't have the time?” answered the logger.

Similarly, some IT environments reward “heroes” in a fire-fighting environment and, as a result, implicitly encourage and perpetuate “cowboy” behavior, which tends to reduce any time available for proactive activities.

No Money

Another possible scenario involves IT proposing specific changes to management in order to reduce incidents. Often, the changes do not get implemented because the manager does not have access to sufficient funds in an already stretched budget. In addition, the technicians often cannot demonstrate that the proposed change will indeed eliminate or prevent the recurring incident. As a result, the proposed change is shelved.

Eventually your rank-and-file troops and their managers get disheartened and give up such proactive efforts.

Chasing Which Incidents?

Another challenge involves the lack of details about incident categories. Organizations sometimes use an ad-hoc process to quantify and report on incident frequencies. The process often depends on manual classification by service desk personnel and on the manual review of samples of incident records.

Usually, each ticket contains a terse text description of what caused the customer to contact the service desk. Understandably, the service desk people want to resolve the incident rather than spend their time describing how they solved it. As a result, the lack of detail and the difficulty of automatically quantifying the incident categories can make it hard to prioritize the activities of pPM resources.
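One way to make the categories countable, even with terse descriptions, is a simple keyword match followed by a frequency ranking. The sketch below is a minimal illustration under stated assumptions, not a description of any particular service desk tool: the keyword rules, the sample ticket texts, and the function names `categorize` and `rank_categories` are all hypothetical.

```python
from collections import Counter

# Hypothetical keyword rules; a real service desk would tune these to its own tickets.
CATEGORY_KEYWORDS = {
    "printing": ("print", "pdf"),
    "password": ("password", "locked out"),
    "network": ("vpn", "wifi", "network"),
}


def categorize(description):
    """Return the first category whose keyword appears in the description."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def rank_categories(descriptions):
    """Count tickets per category, most frequent first."""
    return Counter(categorize(d) for d in descriptions).most_common()


# Made-up ticket descriptions, for illustration only.
tickets = [
    "Blank page when printing to PDF",
    "Forgot password",
    "PDF print comes out empty",
    "VPN drops every hour",
    "Cannot print insurance form",
]
for category, count in rank_categories(tickets):
    print(category, count)
```

Even a crude ranking like this gives the pPM team a starting point for deciding which recurring incidents deserve an RCA first; the share of tickets landing in "uncategorized" shows how much the rules still miss.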

Why Do It?

The benefits of pPM for end-user productivity are sometimes not well understood, because the effect of an incident on end-user productivity is not consistently visible or acknowledged. The service desk typically reports the average time spent resolving a ticket, but that figure tells an incomplete story about the potential productivity loss.

For example, at a major financial services company, the top ticket category, recorded at several thousand tickets per month, was end-users complaining that printing some insurance information from the internet to a PDF file left them with a blank page in the finished PDF. Before the end-user called the service desk, chances are high that he or she had already spent time trying a number of different options in an effort to resolve the issue independently.
