http://www.itsmwatch.com/itil/article.php/3751206/The-Challenges-of-RCA-in-ITIL-and-the-New-Deming-Cycle.htm

By Jan Vromant
Jun 5, 2008

At a recent consulting engagement, I was helping an automobile OEM supplier with the documentation of its ITIL processes. This multi-billion dollar company had outsourced its whole IT department to a trio of major outsourcing service providers. Gary, the manager in charge of most of the ITIL processes, was irritated and complained about the service providers' lack of understanding of a Root Cause Analysis (RCA).

His annoyed comment reminded me of the picture in which “Plan, Do, Stop” takes the place of the “Plan, Do, Check, Act” of the Deming cycle. Based upon Gary’s explanation, it appeared that the service providers were unwilling or unable to support the “Check” and “Act” steps, even though Gary stated that their contracts clearly stipulated the obligation.

Root Cause Analysis

In terms of ITIL, an RCA finds its rightful place in the Problem Management area. In reactive Problem Management (rPM), the focus is on finding an immediate work-around. However, in proactive Problem Management (pPM), the focus is squarely on the RCA. You could consider the RCA as the “Check” in the Deming cycle. Identifying the change that could reduce the likelihood of the recurrence of the incident(s) would then be the last piece, the “Act” in the cycle.

ITIL v3 has appropriately split up reactive and proactive Problem Management: rPM is now in Service Operation, pPM is in Continual Service Improvement.

The maturity of Problem Management, according to the ITIL v3 book “Continual Service Improvement”, is 1.83 on a scale of 5—barely above Configuration Management. It puzzled me why more companies have not established a mature pPM. After all, isn't proactive Problem Management one of the processes where it is relatively easy to show a return on effort?

At $10 to $25 per incident ticket, the return on investment can be convincingly demonstrated, and can result in money in the bank. However, the complaint from Gary made it crystal clear that it is not that easy.

Throughout my client experiences, I have found multiple reasons why Problem Management is relatively immature and why it is so difficult to implement pPM:

No Time

A common reason (and one I have heard over and over again) is, “We don't have the time.” For these clients, the current situation seems to be an endless loop of fire-fighting without the time to effectively address the issues at hand. This challenge reminds me of the story of a logger who is cutting a tree with a blunt saw. A visitor once asked him why he didn’t sharpen his saw. “Don't you see that I don't have the time?” answered the logger.

Similarly, some IT environments reward “heroes” in a fire-fighting environment and, as a result, implicitly encourage and perpetuate “cowboy” behavior, which tends to reduce any available time for proactive activities.

No Money

Another possible scenario involves IT proposing specific changes to management in order to reduce incidents. Often, the changes do not get implemented because the manager does not have access to sufficient funds in an already stretched budget. In addition, the technicians are often unable to demonstrate that the proposed change can indeed eliminate or prevent the recurring incident. As a result, the proposed change is shelved.

Eventually your rank-and-file troops and their managers get disheartened and give up such proactive efforts.

Chasing Which Incidents?

Another challenge involves the lack of details about incident categories. Organizations sometimes use an ad-hoc process to quantify and report on incident frequencies. The process often depends on manual classification by service desk personnel and on the manual review of samples of incident records.

Usually, each ticket contains a terse text description of what caused the customer to contact the service desk. Understandably, the service desk people want to resolve the incident rather than spend their time describing how they solved it. As a result, the lack of details and the difficulty of automatically quantifying the incident categories can make it hard to prioritize the activities of pPM resources.

Why Do It?

The benefits of pPM as related to end-user productivity are sometimes not well understood, because the effect of an incident on the end-user productivity is not consistently visible or acknowledged. The service desk typically provides a report on the average time spent on the resolution of a ticket, but that time frame tells an incomplete story about the potential productivity loss.

For example, at a major financial services company, the top ticket category, recorded at several thousand tickets per month, was end-users complaining that printing some insurance information from the internet to a PDF file left them with a blank page in the finished PDF. Before the end-user called the service desk, chances are high that he or she had already spent time trying a number of different options in an effort to resolve the issue independently.

The cost of the ticket at this financial services company was calculated to be around $20, but the end-user productivity loss could easily be higher than the cost of the ticket. The ticket cost and the sometimes much bigger productivity loss add up to a cumulative effect.
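To make the cumulative effect concrete, here is a minimal sketch in Python. Only the $20 ticket cost comes from the example above. The function name, the ticket volume, the minutes lost per ticket, and the hourly rate are hypothetical values chosen purely for illustration.

```python
def monthly_incident_cost(tickets_per_month, ticket_cost,
                          minutes_lost_per_ticket, hourly_rate):
    """Return (service desk cost, end-user productivity loss, total) per month."""
    desk_cost = tickets_per_month * ticket_cost
    # Time the end-user lost before and during the call, priced at an hourly rate.
    productivity_loss = tickets_per_month * (minutes_lost_per_ticket / 60) * hourly_rate
    return desk_cost, productivity_loss, desk_cost + productivity_loss


# Assumed inputs: 3,000 tickets a month, 45 minutes lost per ticket, $40 per hour.
desk, lost, total = monthly_incident_cost(3000, 20.0, 45, 40.0)
print(desk, lost, total)  # 60000.0 90000.0 150000.0
```

Under these assumed numbers, the productivity loss is larger than the service desk cost that normally shows up in the reports.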

RCA Work is Hard!

There are several methods to do RCAs, and none of them are easy. Some examples are Kepner-Tregoe, the 5 Whys, and the Ishikawa methodology. These methods are well documented, and there is abundant information available from a variety of sources. However, the effective use of these methods requires training and experience. Here are a couple of examples of RCAs that highlight the need for training and experience:

“The incident was caused by a primary access router, which was 'flapping' due to a defective port. (...) Because the router was not 'hard down', redundancy was not invoked.”

“Site contact / escalation information not current. Inaccurate documentation on site business description.”

Both examples show that it is easy to find a plausible explanation or description of what happened. However, the questions that should have followed were, “Why was that defective port 'flapping'?” and “Why was that documentation inaccurate or not current?”

The RCA methodologies do not provide much guidance about when you have reached a "true" root cause. Determining a true root cause is difficult for four reasons:

a. Technical Aspects - IT technology can be complex. Interdependencies between different IT towers, obscure utilities, databases, applications, networks, security, etc., make the task of finding a root cause challenging. Increasingly fast technology developments (e.g., in the areas of virtualization or security) and the associated needs for continuous training can result in a never-ending race against time.

b. Multiple Causes – At times, there is “multi-causation”: a main root cause and some secondary contributing factors. For example, a break/fix is the root cause, but an operator missed a system message during the outage. In another example, the change exceeded the change window and got wrecked by the scheduled backup.

c. Experience – Nothing beats experience in the process of acquiring a broad range of knowledge and understanding of a technical architecture. “Been there, done that” is an obvious advantage when looking for a root cause. Some of that experience walks out the door when seasoned IT professionals leave the organization.

d. Logic - Books (e.g., by Kahneman and Tversky) have been written to explain how people make choices in a complex decision tree that would be depicted in an Ishikawa fishbone diagram. One has only to look at the entry for the topic “logic” in the online Wikipedia to be floored by the science and the number of theories around the topic. Investigating logical relations between IT incidents and their root causes is not an activity to be pursued by people who are intimidated by hard sudokus.

The above four reasons should mandate the usage of experienced, sharp, and costly resources to work on RCA efforts.

“Novices think that by following the heuristic [as outlined in any methodology], they will arrive at the correct solution; however, difficult problems often require a trial and error method. Yet novices will stubbornly stick to a failing solution, whereas experts with deep conceptual understandings will quickly see that a solution is not working and respond with a completely new procedure. Their problem solving has everything to do with adaptability and deep knowledge structures and nothing to do with the simple problem solving methods described above.” (Taken from the article “Leadership and Direction” by Donald Clark.)

In an internal IT environment, it is difficult to free up such resources for proactive work. In an outsourced environment, the availability of these resources might be even more difficult because margins can be paper-thin. When outsourcing service providers sign a new contract, they often staff newly hired employees to service that particular account. Because of utilization and profit considerations, RCAs are sometimes assigned to these fresh resources instead of to seasoned professionals.

Metrics and Help Desk Bonuses

Another hurdle might be just plain open revolt against pPM. In the aforementioned financial services company, the service desk leads got a quarterly bonus based on the First Call Resolution (FCR) rates. pPM would have brought the FCR down. If you eliminate the top 10 incident reasons, you eliminate the tickets that keep coming to the service desk over and over. However, the service desk people love those calls, because they know the answer and the tickets can be easily solved, which boosts the FCR percentage.

In the above company, there was a protest and outcry from the service desk leads against the potential elimination of these tickets and the effect on their bonuses.

When the service desk or help desk is outsourced, there might be similar reasons for the service provider to be unhappy about the elimination of the frequently occurring tickets. On the one hand, easy revenue is eliminated. On the other, higher skilled and more expensive resources are needed to address the not-so-frequent tickets. Lower profits and unhappy account delivery managers can be the result.


High FCR and Poor pPM

The understanding of FCR metrics can be a major hurdle. As described in the above example, a high FCR can be the sign of a poorly functioning pPM. Most outsourcing contracts mention the FCR as a specific service level. A typical contractual phrasing is, “First Call Resolution will be greater than or equal to 75%.” Such a service level could undermine the usage of pPM because pPM normally drives down the FCR. You actually want the FCR to go down, as it could indicate a well-functioning pPM and the reduction of the number of easy tickets over time. (It also might point to a decreasing performance of the service desk caused by, for example, skyrocketing attrition rates.) You can only tell the two apart, and achieve the improvement, by having a good reporting system and a keen understanding of the metrics and the service levels of your IT operations.
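The arithmetic behind a falling FCR can be sketched in a few lines of Python. The ticket counts below are hypothetical, and the sketch assumes that every ticket eliminated by pPM was one the service desk used to resolve on the first call.

```python
def fcr(resolved_first_call, total_tickets):
    """First Call Resolution rate as a fraction of all tickets."""
    return resolved_first_call / total_tickets


# Hypothetical month before pPM: 1,000 tickets, 800 resolved on the first call.
before = fcr(800, 1000)

# Assume pPM eliminates 400 recurring tickets, all of them easy first-call fixes.
after = fcr(800 - 400, 1000 - 400)

print(f"{before:.1%} -> {after:.1%}")  # 80.0% -> 66.7%
```

In this scenario, the service desk handles 40% fewer tickets, yet the FCR drops below the contractual 75% threshold quoted above. Read in isolation, the metric would penalize the improvement.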

Link with Change Management

A poor change management process can be another hurdle because of the cycle of Incident Management ► Problem Management ► Change Management. Perhaps the weekly Change Advisory Board (CAB) lacks members with sufficient financial knowledge, so the effect of a particular change on the number of incidents, and hence its return on investment (ROI), is poorly understood.

Proactive changes often fall through the cracks in a typical CAB meeting, because they lack the urgency of changes related to immediate operational needs or the importance of mega-projects. In addition, the documentation of proactive changes often does not clearly convey the financial benefits of the proposed change and instead focuses on technological gobbledygook.

Countermeasures

There are several countermeasures that you can put in place to boost your pPM efforts:

1. Understand How to Get to the Root Cause

There is a relatively easy way to understand when you have reached the root cause. You know you have gotten to the root of the problem when the elimination of that issue, through a formal change, will eliminate the recurrence of that set of tickets. When you do a deep enough analytical dive into the root causes of most issues, the final result you will find for any root cause is “because we are human.”

Some examples include:

· Programming an application and failing to correct known bugs due to time constraints;

· Switching off the electricity of a major data center with 635 servers because of work on the air-conditioning system; and

· Not finding the “any” key on the keyboard.

A similar situation is when you determine that the problem is caused by something out of your control. The following example conveys the idea:

I know that my 3rd party application freezes whenever I do a certain thing. I have no idea what code is causing it, but I have communicated it to the vendor, who provides me with a patch that fixes it. Do I know the root cause? No. Have I solved the problem and prevented its recurrence? Yes.

The real or fundamentally correct root cause does not matter, as long as you can eliminate the tickets and enhance your end-user productivity. The moment you can identify any change that costs less than the cost of the incidents it eliminates, you have a mini-investment project with a positive ROI.
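A minimal sketch of that comparison follows. The function name and all input numbers are hypothetical, and the calculation looks at one year and counts only the ticket cost, ignoring end-user productivity loss.

```python
def change_roi(change_cost, tickets_eliminated_per_year, cost_per_ticket):
    """First-year ROI of a change: (savings - cost) / cost."""
    savings = tickets_eliminated_per_year * cost_per_ticket
    return (savings - change_cost) / change_cost


# Assumed inputs: a $15,000 change that removes 2,400 tickets a year at $20 each.
roi = change_roi(15000.0, 2400, 20.0)
print(f"{roi:.0%}")  # 220%
```

Any result above zero means that the change costs less than the incidents it eliminates. Adding the productivity loss to the savings would only raise the figure.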

2. A Good Tool

Trying to implement pPM can be painful if you don’t have an effective Incident and Problem Management tool. Preferably, the tool should be linked with your Change and Configuration Management process. The absence of a tool makes the RCA research and the implementation and documentation of your pPM efforts particularly difficult in two areas.

The first area is the categorization of the incidents. An effective tool will make the proper categorization of the incidents mandatory. Thus the tool will help spot and report the Top 10 incident categories and help in prioritizing your pPM activities. Second, the linkage with Change and Configuration Management will assist in the seamless changing and updating of your environment to reduce the likelihood of recurrence of the incidents.
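As an illustration of the first area, the sketch below ranks incident categories from exported ticket records. The record layout (a list of dictionaries with a "category" field) and the category names are assumptions made for this example, not the format of any particular tool.

```python
from collections import Counter


def top_categories(tickets, n=10):
    """Rank incident categories by ticket count, highest first."""
    # Tickets without a category are counted under one visible label,
    # so that poor categorization shows up in the report instead of hiding.
    counts = Counter(t.get("category") or "UNCATEGORIZED" for t in tickets)
    return counts.most_common(n)


# Hypothetical sample of exported tickets.
tickets = [
    {"id": 1, "category": "pdf-printing"},
    {"id": 2, "category": "password-reset"},
    {"id": 3, "category": "pdf-printing"},
    {"id": 4, "category": None},
    {"id": 5, "category": "pdf-printing"},
    {"id": 6, "category": "password-reset"},
]

print(top_categories(tickets, n=2))
# [('pdf-printing', 3), ('password-reset', 2)]
```

A ranking like this is only as good as the categorization behind it, which is why a tool that makes the category mandatory matters.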

3. Process

A well-defined process structure with clear roles and responsibilities, metrics, and strong cross-functional links between users, support personnel, and service providers is the foundation to facilitate pPM.

Accountability: In an internal IT environment, a way to tackle the challenge of pPM is by assigning responsibility for its implementation to an individual—not a committee! In addition, you should define appropriate metrics to judge the performance of the pPM process. These metrics should then flow into the personnel performance review. This is the “throat-to-choke” principle. You should be on the road to improvement the moment you define the accountability and the consequences for not reaching mutually agreed upon goals.


People: Because pPM is technically challenging, you should try to dedicate people to it. Driving down the number, duration, and impact of incidents is a skill that requires training and constant vigilance. The usage of dedicated resources also facilitates the financial tracking of the effort, as it is easier to measure the cost of pPM when you are working with full-time people than with scattered time slices of shared resources.

Metrics: It seems redundant to mention reporting and metrics. Everybody knows the importance of appropriate metrics and reports to manage your environment, right?

I recently visited a client who had outsourced its service desk to an Eastern European entity. The level of reporting was far below expectation. The few reports that reached our client were riddled with inconsistencies. More than 60% of the tickets were uncategorized. There was no effort whatsoever to match clustered service calls with specific incidents in the IT environment. The reported number of tickets was actually the number of “touches” and included Level 2 support. The FCR metric was not available.

With such a level of reporting maturity, the probability of a high-performing pPM tends to be low. A solid reporting structure and the correct metrics are necessary to start tackling pPM.

4. Incentives

In an outsourcing environment, it is relatively easy for the client to include a clause and a corresponding service level in the contract to increase the probability that pPM is actually done. The basic effect of the contractual clause is to put the onus on the outsourcing service provider and to use metrics and an approval process to reinforce that pPM needs to be performed.

I believe it is not unreasonable to demand a 5% to 15% reduction of the number of tickets year-over-year. As a corollary, it can also make sense to provide the outsourcer with an incentive based on the improvements, e.g., an “uplift” on the price per ticket.
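A service level of this kind is easy to check once the yearly ticket counts are reliable. The sketch below assumes a 5% minimum reduction, the low end of the range above. The function names and the ticket counts are hypothetical.

```python
def yoy_reduction(previous_year_tickets, current_year_tickets):
    """Year-over-year ticket reduction as a fraction of last year's volume."""
    return (previous_year_tickets - current_year_tickets) / previous_year_tickets


def meets_target(previous_year_tickets, current_year_tickets, minimum_reduction=0.05):
    """True if the reduction reaches the contractual minimum."""
    return yoy_reduction(previous_year_tickets, current_year_tickets) >= minimum_reduction


# Assumed volumes: 40,000 tickets last year, 36,400 this year.
print(f"{yoy_reduction(40000, 36400):.1%}")  # 9.0%
print(meets_target(40000, 36400))            # True
```

The same reduction figure could feed the calculation of an “uplift” for the outsourcer, provided both parties agree on how tickets are counted.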

In an internal IT environment, incentives can be monetary or in the form of a drawing for a prize or dinner for two at a favorite restaurant. The monetary value of the incentive is not very important. However, you can enhance morale and encourage the desired behavior by recognizing the effort.

5. Communication

The importance of communication cannot be over-emphasized. In training sessions or introductory presentations, I typically start with the statement that ITIL and its processes are about the principle of the 3 C’s: Communication, Communication, and Communication.

pPM is a great example of the principle. Since it is invisible to the customer, there is a need for good data that communicates the proactive work you have done. If you don’t communicate, you are in a situation where “there is no glory in pPM.”

Here are possible elements of a communication strategy that could enhance your pPM by making the Problem Management people and their activities more visible:

  • Announcements and compliments to the group that solved an insidious problem;
  • Money-saving case studies in the IT newsletter; and
  • Additions to the Frequently Asked Questions (FAQs) intranet website with the name of the IT person who added the FAQ.

6. "Check, Act"

Finally, keep the Deming or Shewhart cycle (“Plan, Do, Check, Act”) in mind, and specifically, the last two elements of the cycle.

The “Check” part is the measuring and collecting of metrics regarding the effect on recurrence of tickets. The most important aspect of “Check” is to define the review mechanism or process, and to appoint a person (again, not a committee!) responsible for determining the root causes within a particular time frame. If the required changes are relatively important, the review activity in which you gauge the ROI of particular changes that could eliminate major chunks of incident numbers might be a part of your project portfolio process.

The “Act” part is based upon the metrics, the financial impact, and the ROI of the decision. It deals not only with the approval and execution of the change to reduce the likelihood of the recurring tickets, but also with the governance and appointment of the people responsible for the identified corrective action items.

Conclusion

Like so many aspects of IT, the initiation and execution of RCA in an IT environment need a crisp definition of processes, internal discipline and governance. These elements are the core drivers of action and reaction and the resulting enhanced corporate productivity. When you have these elements in place, your organization can go for the full Deming “Plan, Do, Check, Act” cycle, and not for the half-baked “Plan, Do, Stop.”

(Thanks to David Cannon, Jeanette Smith, Lynn Sturdevant, and Dirk Weber for their help with this article.)

Jan Vromant is a process architecture consultant, focusing on outsourcing-related services and ITIL processes, and is a Lead in the Outsourcing Advisory Services of Deloitte Consulting LLP. He is an ITIL (v2) Service Manager and certified ISO/IEC 20000 consultant, and earned an MBA from Rice University in Houston. Before joining Deloitte, Jan worked at Royal Dutch/Shell, BMC Software, PricewaterhouseCoopers Consulting, and Hewlett-Packard. He is originally from Belgium and lives in the Detroit area.


     
