The Role of Application Monitoring in ITIL

By Suparno Biswas and Xavier Idrovo

Application monitoring is a critical tool for successful ITIL-based Incident and Problem Management.
Jan 30, 2006

ITSM Watch Staff

In ITIL the service support area implicitly addresses “monitoring and operations” through Incident and Problem Management. Although this seems obvious to a practitioner, it may send unclear messages to an ITIL newcomer.

Incidents in ITIL can be automated, tool-generated events, or they can be service calls from users experiencing service quality deterioration, reported by email, phone call, fax or internet browser. The confusion can be nipped in the bud by distinguishing between automated events from infrastructure components and incidents reported manually by users.

Here is an observation from a mid-size IT organization (300-400 employees):

The Challenge: The Helpless Helpdesk

A Service Desk exists, which is essentially a help desk for service requests and service calls. The tool in place helps in logging the calls and processing them through their various states until they are closed.

The various components of the infrastructure, such as the network, database, and servers, are monitored. The alarms, however, are not accessible to the Service Desk, but only to the specialists.

The real-time state of infrastructure, which is a source for “incidents,” goes unnoticed by the Service Desk. This immediately adds to the time Service Desk personnel need to resolve service calls. Further, end users are aware of the deteriorated or failed infrastructure components before the Desk – a traditional “reactive” approach.

The Sleepless Specialists

In the current situation the specialists are the knowledgeable personnel and the administrators who can successfully decode the alarms from the infrastructure components, troubleshoot and finally resolve them.

The prime reason for directing all the alarms to the admin team is that they have the expertise to resolve the problem quickly, which is the goal of Incident Management.

Now, due to IT budget constraints, there is always a possibility that the admin team does not have backup for each specialized area, or is not prepared to handle incidents or alarms during off-hours. The Help Desk, though, is up and running during “off-hours” with shifts.

What ends up happening is that some resources are not fully utilized while the experts work around the clock, losing sleep over repeated workarounds. Needless to say, the specialists' bandwidth is better used for innovation than for routine tasks.

The Fix

First and foremost, a robust monitoring system needs to be in place. The Service Desk needs to play an active role in designing this system. When determining what needs to be monitored and which events should be generated, very often the default out-of-the-box configurations are chosen. As a result, the alarm browser is inundated with messages and the tired operators or users start ignoring events. A goal-oriented, minimalist approach is helpful -- start small and grow in phases. (General rule of thumb: if it shows only 80% of what you need, go with it!)
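
As a rough illustration of the minimalist approach, the sketch below keeps only events at or above a chosen severity and suppresses repeats of an alarm that is already displayed. The severity names, the `Event` fields and the `filter_events` function are assumptions made for this example; they do not come from any particular monitoring product.

```python
from dataclasses import dataclass

# Hypothetical severity ranking; a real monitoring tool defines its own levels.
SEVERITY_RANK = {"info": 0, "warning": 1, "minor": 2, "major": 3, "critical": 4}


@dataclass(frozen=True)
class Event:
    source: str    # component that raised the event, e.g. "db01"
    check: str     # what was measured, e.g. "disk_usage"
    severity: str  # one of the SEVERITY_RANK keys


def filter_events(events, min_severity="major"):
    """Keep events at or above min_severity and drop repeats of the same alarm."""
    threshold = SEVERITY_RANK[min_severity]
    seen = set()
    kept = []
    for event in events:
        if SEVERITY_RANK[event.severity] < threshold:
            continue  # below the threshold: not worth an operator's attention
        key = (event.source, event.check)
        if key in seen:
            continue  # the same alarm is already in the browser
        seen.add(key)
        kept.append(event)
    return kept
```

Starting with a high threshold and lowering it in phases is one way to "start small and grow" without flooding the alarm browser.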

The subject matter experts should regularly update Service Desk with workarounds and known problems. If an alarm is displayed, the receiver should have a clear, well-documented plan of action to resolve the incident. For low-impact situations, automate and delegate certain tasks to the Service Desk. These situations should be well-defined and bounded. In short, enhance the Service Desk’s incident-resolving capabilities through periodic knowledge transfer.
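
One possible way to make the "plan of action" concrete is a small lookup table that the subject matter experts maintain and the Service Desk consults. The sketch below is only an illustration: the alarm names, the `Runbook` fields and the `route_alarm` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    action: str       # documented plan of action for the receiver
    low_impact: bool  # True if the Service Desk may resolve it on its own


# Hypothetical entries maintained by the subject matter experts.
RUNBOOKS = {
    "print_queue_stalled": Runbook("Restart the print spooler service.", True),
    "db_replication_lag": Runbook("Page the on-call database administrator.", False),
}


def route_alarm(alarm_type):
    """Return (owner, action) for an alarm, defaulting to the specialists."""
    runbook = RUNBOOKS.get(alarm_type)
    if runbook is None:
        # No knowledge transfer has happened yet for this alarm.
        return ("specialist", "No documented workaround; escalate and document.")
    owner = "service_desk" if runbook.low_impact else "specialist"
    return (owner, runbook.action)
```

Anything without an entry falls through to the specialists, which keeps the delegated situations well-defined and bounded.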

Develop “application probes” and run them periodically. This is a very comprehensive way to emulate users' experiences and, although there are technological hurdles, it is worth the investment.
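
A very simple probe might look like the following sketch, which requests a page the way a user would and checks the status, the content and the response time. The function names, the five-second limit and the idea of checking for an expected piece of text are assumptions for illustration; a real probe would usually walk through a full user transaction.

```python
import time
import urllib.request


def http_fetch(url, timeout):
    """Fetch a URL and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def run_probe(url, expected_text, max_seconds=5.0, fetch=http_fetch):
    """Emulate one user request and report whether it succeeded in time."""
    started = time.monotonic()
    try:
        status, body = fetch(url, max_seconds)
    except Exception as exc:  # timeouts, DNS errors, HTTP errors: all are failures
        return {"ok": False, "reason": f"request failed: {exc}"}
    elapsed = time.monotonic() - started
    if status != 200:
        return {"ok": False, "reason": f"unexpected status {status}"}
    if expected_text not in body:
        return {"ok": False, "reason": "expected text missing"}
    if elapsed > max_seconds:
        return {"ok": False, "reason": "too slow"}
    return {"ok": True, "reason": "ok", "seconds": elapsed}
```

The `fetch` parameter exists so the probe logic can be exercised without a network; a scheduler would call `run_probe` periodically and raise an event whenever `ok` is false.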

The monitoring system is only as good as the notification system. There should not be a SPOF (single point of failure) in the notification architecture. Email can be a very effective channel of notification, until the email system goes down. A multichannel approach is needed, which might be tied to the Disaster Recovery/Service Continuity effort.
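
The multichannel idea can be sketched as an ordered list of channels that are tried one after another until one delivers. The `notify` function and the convention that a channel raises an exception on failure are assumptions for this example, not the interface of any real notification product.

```python
def notify(message, channels):
    """Try each (name, send) channel in order; return the name that delivered.

    Each send callable is expected to raise an exception on failure.
    """
    failures = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:
            failures.append(f"{name}: {exc}")  # remember why, then fall through
    raise RuntimeError("all notification channels failed: " + "; ".join(failures))
```

With channels listed as, say, email first and a pager second, an outage of the email system no longer silences the alarm.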

Define an escalation path to ensure the issue is handled within the necessary time frame. Needless to say, the escalation path must be enforced by senior management and tied into service level agreements.
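
An escalation path can be expressed as a list of time thresholds and the role that takes ownership once each threshold is reached. The roles and time limits below are purely hypothetical; in practice they would be derived from the service level agreements.

```python
from datetime import timedelta

# Hypothetical path: (time since the incident was opened, who owns it from then on).
ESCALATION_PATH = [
    (timedelta(minutes=0), "service_desk"),
    (timedelta(minutes=30), "on_call_specialist"),
    (timedelta(hours=2), "operations_manager"),
]


def current_owner(age, path=ESCALATION_PATH):
    """Return the highest escalation level whose threshold has been reached."""
    owner = path[0][1]
    for threshold, role in path:  # path is ordered from lowest to highest threshold
        if age >= threshold:
            owner = role
    return owner
```

For example, `current_owner(timedelta(minutes=45))` returns `"on_call_specialist"` under the thresholds assumed above.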

The Role of Operations Bridge in ITIL Processes

Lately there has been a lot of emphasis on application management and service-oriented architecture. The foundation for both is event-based monitoring. Alarms and events are correlated and rolled up into more service-centric reports (real-time or not). So, if due diligence is missing from the monitoring of applications, servers, databases, network and the other vital infrastructure components, the overall quality of the service status will be flawed -- the "garbage in, garbage out" syndrome. This can generate a huge amount of frustration and loss of faith in the management tools.
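
To show how component states roll up into a service view, here is a minimal sketch in which a service takes the worst state of the components it depends on, and a component with no monitoring data makes the service "unknown" rather than falsely healthy. The service names, component names and the three states are assumptions for the example.

```python
SEVERITY_RANK = {"ok": 0, "warning": 1, "critical": 2}

# Hypothetical mapping of a business service to the components it depends on.
SERVICE_MAP = {
    "claims_portal": ["web01", "app01", "db01"],
    "email": ["mail01"],
}


def service_status(component_status, service_map=SERVICE_MAP):
    """Roll component states up to services, taking the worst state of each."""
    result = {}
    for service, components in service_map.items():
        states = [component_status.get(name) for name in components]
        if any(state is None for state in states):
            result[service] = "unknown"  # missing input: do not report a false "ok"
        else:
            result[service] = max(states, key=SEVERITY_RANK.__getitem__)
    return result
```

The "unknown" branch is the guard against "garbage in, garbage out": a gap in component monitoring shows up in the service report instead of being hidden by it.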

Incident Management and the Service Desk need an operational view of the infrastructure they are responsible for. At a minimum, this should be policy for the Service Desk and Operations. There are organizational challenges with Service Desk and Operations-type functions. If they are two separate organizations, then efforts should be made to increase their cohesiveness.

A communications plan is another significant work item responsible for the success of any ITSM or IT Service Improvement program. At various stages of the project, reinforcement is needed to remind the teams about the goals, objectives, charters and policies, which in turn helps prioritize and reorient tasks. (A simple and effective way to address this is via ITIL certification and training classes.)

Suparno Biswas is a senior IT manager at New Jersey Manufacturers Insurance, and has led various teams of engineers, architects and managers to successfully deliver IT infrastructure management projects. He has 17 years' experience in the computing industry, focusing on enterprise management, software development lifecycle, system architecture and design.

Xavier Idrovo has more than 15 years of experience in the IT computing industry and more than five years implementing ITIL best practices, with emphasis on Incident Management and Service Desk. He is currently employed by New Jersey Manufacturers Insurance Group, where he leads the Service Desk and plays a leading role in Incident Management.