http://www.itsmwatch.com/news/article.php/3678556/Improving-Incident-Management.htm

By George Spafford
May 18, 2007

Incident Management is concerned with deviations from, and threats to, the standard operation of services. During the course of time, even the best services will have incidents.

How IT reacts to them will be pivotal not just to operations, in its drive to reduce mean time to repair (MTTR) and increase mean time between failures (MTBF), but also to customer satisfaction.

As a result, many IT departments strive to find opportunities to improve their Incident Management process. One approach is to understand the expanded Incident Management lifecycle and look for means to improve each stage.

Incident resolution does follow an observable lifecycle. Consider the following six stages:

  • Occurrence – Something happens to a configuration item (CI).
  • Detection – The incident is detected by monitoring tools, by IT personnel or, in the worst case, by the user.
  • Diagnosis – The next step is to determine what has happened.
  • Repair – Then the CI needs to be corrected. This may be a true solution or a temporary work-around aimed at getting the user back to some degree of productive work.
  • Recover – The CI is then put back into production.
  • Restore – Finally the service is put back into production.

Now, when we look at each step of the expanded lifecycle, we can look for process improvement opportunities. This approach allows for greater scrutiny because there is a model to walk stakeholders through mentally. Let’s go through the steps in turn.

First, there isn’t much we can do about incidents occurring, so let’s start with step two: the detection of incidents. Three approaches are available. The first is to identify deviations from standard operational norms through automated alerts. This is a necessary reactive stance aimed at identifying that something has occurred, such as a service on a server failing.

The second approach is more proactive and uses defined thresholds to trigger alerts and alarms. Some monitoring tools allow for multiple levels of events. Thus, a warning alert may be sent at 90% disk utilization and a critical alarm at 98%.
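As a rough illustration of the two-level threshold idea, here is a minimal Python sketch. The 90% and 98% figures come from the example above; the function names `classify_utilization` and `disk_used_pct` are hypothetical, and a real monitoring tool would supply its own configuration and alert delivery.

```python
import shutil
from typing import Optional

# Thresholds taken from the disk utilization example in the text.
WARNING_PCT = 90.0
CRITICAL_PCT = 98.0


def classify_utilization(used_pct: float) -> Optional[str]:
    """Return "critical", "warning", or None for a utilization percentage.

    The critical level is checked first so that a reading of 99% is not
    reported as a mere warning.
    """
    if used_pct >= CRITICAL_PCT:
        return "critical"
    if used_pct >= WARNING_PCT:
        return "warning"
    return None


def disk_used_pct(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

Calling `classify_utilization(disk_used_pct("/"))` on a schedule would give the warning-then-critical behavior described above; what happens with the result (page, email, ticket) is left to the monitoring tool.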

First Things First

The last approach involves monitoring trends and making management decisions accordingly. The intent is not to use one of the three to the exclusion of the others. Instead, a blended approach should be pursued to prevent incidents in the first place and to detect them effectively and efficiently when they do occur.
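Trend monitoring can be as simple as fitting a straight line to recent measurements and asking when it will cross a threshold. The sketch below assumes one utilization sample per day and a linear trend, both simplifications; `days_until_threshold` is a hypothetical helper, not part of any monitoring product.

```python
from typing import Optional, Sequence


def days_until_threshold(
    daily_pct: Sequence[float], threshold_pct: float = 90.0
) -> Optional[float]:
    """Estimate days from the latest sample until the threshold is reached.

    Fits an ordinary least-squares line through the samples (day index
    0, 1, 2, ...). Returns None when there are too few samples or when
    utilization is flat or falling, and 0.0 when the fitted line is
    already at or above the threshold.
    """
    n = len(daily_pct)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - (n - 1))
```

A result such as "six days until the warning level" is the kind of information that supports a management decision (buy storage, archive data) before any alert fires.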

When diagnosing incidents, the most important first question to ask is always, “What changed?”

People not schooled in the linkage between changes and incidents do not always ask it. We know statistically that an incident is typically preceded by some change in state. If changes can be detected through automated tools and shared with Incident Management, diagnosis will be faster, and both the first-fix rate and MTTR will improve along with customer satisfaction.
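To make the “What changed?” question concrete, the following Python sketch looks up recent changes to the affected CI when an incident is detected. The `ChangeRecord` fields, the `recent_changes` function and the 72-hour lookback window are all assumptions for illustration; a real change log or CMDB will have its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical change log entry; field names are illustrative."""
    change_id: str
    ci: str                # configuration item the change touched
    applied_at: datetime
    summary: str


def recent_changes(
    changes: Iterable[ChangeRecord],
    ci: str,
    detected_at: datetime,
    lookback: timedelta = timedelta(hours=72),  # assumed window
) -> List[ChangeRecord]:
    """Changes to `ci` inside the lookback window, most recent first."""
    window_start = detected_at - lookback
    hits = [
        c for c in changes
        if c.ci == ci and window_start <= c.applied_at <= detected_at
    ]
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)
```

Attaching this list to the incident record gives the person diagnosing it a short set of suspects to check first, with the most recent change at the top.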

Beyond feeding change data into Incident Management, organizations can improve the problem-solving skills of personnel through education in the configuration items they are responsible for and by tracking incidents and resolutions developed by vendors.

There are skills specific to diagnosing incidents that evolve over time, and periodic reviews of how successful engineers diagnose specific incidents will benefit everyone.

Moving along, repairing a CI requires the right skills, tools and, if appropriate, spare parts. In some cases, repair also means involving vendors. Improving this phase of the lifecycle means looking for opportunities to reduce the time spent waiting for parts, people, etc.

It may mean negotiating with vendors. It may require training internal staff and evolving the vendor relationship. In some cases spare parts may be purchased in advance, or stored on-site in a depot arrangement with the vendors, to reduce the time spent waiting.

Where CIs are distributed, look for avenues to reduce transit time between systems through automation, additional staffing, augmented work schedules, and so on.

Recovering CIs involves putting them back into production, physically and logically. Again, distributed systems can be a challenge. Tools, staff augmentation and other means can reduce the time spent. The objective is to drive down time lost to transit, waiting, and so on.

Lastly, the service affected by the incident is restored to production. A subtle means of speeding this up is to ensure there are efficient and effective ways to tell users that a service is back online.

Again, in distributed environments this can be a challenge, but it can be overcome via email, text messaging, etc.
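Finding where time is lost across the six stages presupposes measuring each one. The sketch below assumes an incident record carries one timestamp per stage, marking when that stage was reached or completed; the stage keys and the helper names `stage_durations` and `slowest_stage` are hypothetical.

```python
from datetime import datetime
from typing import Dict

# Stage names follow the expanded lifecycle listed earlier in the article.
STAGES = ["occurrence", "detection", "diagnosis", "repair", "recover", "restore"]


def stage_durations(timestamps: Dict[str, datetime]) -> Dict[str, float]:
    """Minutes elapsed between each stage and the one before it.

    Assumes `timestamps` holds one entry per stage name. "occurrence" has
    no predecessor, so it does not appear in the result.
    """
    durations = {}
    for previous, current in zip(STAGES, STAGES[1:]):
        delta = timestamps[current] - timestamps[previous]
        durations[current] = delta.total_seconds() / 60
    return durations


def slowest_stage(timestamps: Dict[str, datetime]) -> str:
    """Name of the stage that consumed the most time."""
    durations = stage_durations(timestamps)
    return max(durations, key=durations.get)
```

Aggregating these per-stage figures over many incidents shows whether the bigger opportunity lies in detection, diagnosis, repair, or the hand-back to users.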

In conclusion, Incident Management is a very important process for ensuring the needs of the business are met. It is also important because it is so visible to customers and users.

IT will be judged in no small part by how it responds to incidents and how users are supported during the course of the incident. By reviewing the expanded Incident Management lifecycle, groups can work together to identify opportunities to improve.

To learn more about the importance of Change Management and linking it to Incident Management, please visit the IT Process Institute and read about Visible Ops at http://www.itpi.org.

George Spafford is a principal consultant with Pepperweed Consulting and a long-time IT professional. George's professional focus is on compliance, security, management and overall process improvement.


     
