http://www.itsmwatch.com/news/article.php/3678556/Improving-Incident-Management.htm
By George Spafford
May 18, 2007

Incident Management is concerned with deviations from, and threats to, the standard operation of services. Over time, even the best services will have incidents. How IT reacts to them is pivotal not just to operations, in its drive to reduce mean time to repair (MTTR) and increase mean time between failures (MTBF), but also to customer satisfaction. As a result, many IT departments look for opportunities to improve their Incident Management process. One approach is to understand the expanded Incident Management lifecycle and look for ways to improve each stage.

Incident resolution follows an observable lifecycle. Consider the following six stages:

1. Occurrence
2. Detection
3. Diagnosis
4. Repair
5. Recovery
6. Restoration
When we look at each step of the expanded lifecycle, we can look for process improvement opportunities. This approach allows for greater scrutiny because there is a model to walk stakeholders through mentally. Let's step through each of them now.

First, there isn't much we can do about an incident occurring, so let's start with step two: the detection of incidents. One approach is to identify deviations from standard operational norms through automated alerts. This is a necessary reactive stance aimed at identifying that something has occurred, such as a service on a server failing. The second approach is more proactive and involves the use of identified thresholds to send alerts and/or alarms. Some monitoring tools allow for multiple levels of events. Thus, a warning alert may be sent at 90% of disk utilization and then a critical alarm sent at 98%. The last opportunity involves monitoring trends and making management decisions accordingly. The intent is not to use one of the three to the exclusion of the others. Instead, a blended approach should be pursued to prevent incidents in the first place and to detect them effectively and efficiently when they do transpire.

First Things First

When diagnosing incidents, the most important first question to ask is always, "What changed?" People who are not schooled in the linkages between changes and incidents don't always ask it. We know statistically that an incident is typically preceded by some change in state. If that information can be detected through automated tools and then shared with Incident Management, diagnosis will speed up, with both the first fix rate and MTTR metrics improving along with customer satisfaction. In addition to feeding change data to Incident Management, the problem-solving skills of personnel can be improved through education in the configuration items (CIs) they are responsible for and through tracking incidents and resolutions developed by vendors.
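To make the multi-level threshold idea concrete, here is a minimal sketch in Python. It uses the 90% warning and 98% critical levels from the disk utilization example; the function and constant names are hypothetical and do not come from any particular monitoring tool.

```python
from typing import Optional

# Illustrative thresholds taken from the disk utilization example in the text.
WARNING_PCT = 90.0
CRITICAL_PCT = 98.0


def classify_disk_utilization(used_pct: float) -> Optional[str]:
    """Return "critical", "warning", or None for a disk utilization percentage.

    Hypothetical helper: a real monitoring tool would supply its own
    event levels and notification mechanism.
    """
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError(f"utilization must be between 0 and 100, got {used_pct}")
    # Check the most severe level first so that 98% and above is never
    # reported as a mere warning.
    if used_pct >= CRITICAL_PCT:
        return "critical"
    if used_pct >= WARNING_PCT:
        return "warning"
    return None
```

A caller would poll utilization on whatever schedule suits the service and raise an alert or alarm only when the returned level is not None. In practice the thresholds would be tuned per CI rather than fixed in code.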
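The "What changed?" question lends itself to a simple lookup once change data is available to Incident Management. The sketch below assumes a hypothetical ChangeRecord structure and a 24-hour default lookback window; neither comes from the article or from a specific Change Management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical record of one change applied to a configuration item."""
    ci_name: str
    description: str
    applied_at: datetime


def recent_changes(
    changes: Iterable[ChangeRecord],
    ci_name: str,
    incident_time: datetime,
    lookback: timedelta = timedelta(hours=24),  # assumed window, tune as needed
) -> List[ChangeRecord]:
    """Return changes to ci_name made within lookback before incident_time.

    Results are ordered newest first, since the most recent change is
    usually the first one worth examining.
    """
    window_start = incident_time - lookback
    matches = [
        change
        for change in changes
        if change.ci_name == ci_name
        and window_start <= change.applied_at <= incident_time
    ]
    return sorted(matches, key=lambda change: change.applied_at, reverse=True)
```

Given the time an incident was detected and the name of the affected CI, the function hands the person diagnosing the incident a short list of candidate causes. A fuller version would also consider changes to CIs that the affected one depends on.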
Moving along, repairing a CI requires the right skills, tools and, if appropriate, spare parts. In some cases, repairing also means involving vendors. Improving this phase of the lifecycle involves looking for opportunities to reduce the time spent waiting for parts, people, and so on. It may mean negotiating with vendors. It may require training internal staff and evolving the vendor relationship. In some cases spare parts may be purchased in advance, or stored on-site in a depot arrangement with the vendors, to reduce the time spent waiting. Where CIs are distributed, look for ways to reduce transit time between systems through automation, additional staffing, augmented work schedules, and so on.

Recovering CIs involves putting them back into production, both physically and logically. Again, distributed systems can be a challenge. Tools, staffing augmentation, and other means can reduce the time spent. The objective is to drive down time lost to transit, waiting, and so on.

Lastly, the service affected by the incident is restored to production. A subtle means of speeding this up is to ensure there are efficient and effective ways to tell users that a service is back online. Again, in distributed environments this can be a challenge, but it can be overcome via email, text messaging, and similar channels.

In conclusion, Incident Management is a very important process for ensuring the needs of the business are met. It is also important because it is so customer and user facing. IT will be judged in no small part by how it responds to incidents and how users are supported during the course of an incident. By reviewing the expanded Incident Management lifecycle, groups can work together to identify opportunities to improve.

To learn more about the importance of Change Management and linking it to Incident Management, please visit the IT Process Institute and read about Visible Ops at http://www.itpi.org.
George Spafford is a principal consultant with Pepperweed Consulting and a long-time IT professional. George's professional focus is on compliance, security, management and overall process improvement.