
Getting Started with Event Management

Event management isn’t monitoring; it’s what you do about it, writes ITSM Watch columnist George Spafford of Pepperweed Consulting.
Jan 25, 2008

George Spafford

Ensuring confidentiality, integrity and availability is becoming increasingly important as organizations realize that business stops, or is at least greatly hindered, when IT fails. As a result, IT organizations have implemented increasingly sophisticated monitoring systems to try to identify the health of production services.

In ITIL v2 we knew that alerts should be routed through Incident Management for proper dispositioning. Now, with ITIL v3, we have an additional perspective to consider: Event Management. With Event Management we can define a process that uses automation to manage events more effectively and efficiently.

ITIL defines an event as “… any detectable or discernable occurrence that has significance for the management of the IT infrastructure or the delivery of IT service and evaluation of the impact a deviation might cause to the services.” (ITIL Service Operation book, Page 35). While it may sound like monitoring, the two are different. Monitoring happens all the time whether an event is present or not. Event management is concerned with interpreting the monitored data and taking an appropriate action.

When an event transpires, it must first be detected. As groups deploy increasingly sophisticated monitoring systems, the issue becomes one of understanding what needs to be looked for, and that is defined in service design.

The next step after an event is detected is to filter it. As monitoring tends to collect more data than we are interested in, there will be some events that are logged by the device but that we are not interested in, and thus they are not processed by Event Management.
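To make the filtering step concrete, here is a minimal sketch in Python. The `Event` fields, the event type labels and the `EVENTS_OF_INTEREST` set are all hypothetical names chosen for illustration; in practice the list of relevant events would come out of service design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # configuration item that raised the event (assumed field)
    event_type: str   # label such as "disk_usage" (assumed naming)
    value: float = 0.0


# Hypothetical list of event types that service design marked as relevant.
EVENTS_OF_INTEREST = {"disk_usage", "link_down", "batch_job_finished"}


def filter_events(events):
    """Return only the events that Event Management should process."""
    return [e for e in events if e.event_type in EVENTS_OF_INTEREST]


raw = [
    Event("srv01", "disk_usage", 91.0),
    Event("srv01", "debug_trace"),   # logged by the device, but not of interest
    Event("sw07", "link_down"),
]
kept = filter_events(raw)
```

Everything else is still logged by the device; it simply never enters the Event Management process.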

After filtering we need to determine the significance of the event, and there are now systems that can help perform event correlation. These applications are known as “event correlation engines” and essentially apply business rules to look for events of interest. Rules can factor in how many times and in what time frame an event has transpired, whether other configuration items have had similar events, whether a configuration item has not responded for the last two minutes, and so on.

These rules can get very complex but the intent is to use automation to find the proverbial needle in the haystack. In other words, it would be more effective and efficient if we could use automation to find the events we should be acting on versus someone manually trying to find events.
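As an illustration of one such rule, the sketch below flags a configuration item when the same type of event occurs a set number of times within a time window. It is a simplified example, not how any particular event correlation engine works; the `RepeatRule` class, the "login_failure" label and the numbers are assumptions, and timestamps are assumed to arrive in order, in seconds.

```python
from collections import defaultdict, deque


class RepeatRule:
    """Fire when `threshold` matching events arrive within `window` seconds."""

    def __init__(self, event_type, threshold, window):
        self.event_type = event_type
        self.threshold = threshold
        self.window = window
        # Configuration item name -> timestamps of recent matching events.
        self._seen = defaultdict(deque)

    def observe(self, ci, event_type, timestamp):
        """Record one event; return True if the rule fires for this CI."""
        if event_type != self.event_type:
            return False
        times = self._seen[ci]
        times.append(timestamp)
        # Discard timestamps that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


# Hypothetical rule: three login failures on one CI within 60 seconds.
rule = RepeatRule("login_failure", threshold=3, window=60)
results = [
    rule.observe("srv01", "login_failure", 0),
    rule.observe("srv01", "login_failure", 20),
    rule.observe("srv01", "login_failure", 50),   # third within 60s: fires
    rule.observe("srv01", "login_failure", 200),  # earlier events have expired
]
```

A real engine would combine many rules like this one, including rules that look across several configuration items at once.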

After the event is filtered and its significance determined, there are three paths that can be taken:

Informational – These are events that should be logged for potential future analysis including confirming if the service is operating as expected. No further action is required. Examples include capacity data, licensing utilization, batch job completion results, and so on.

Warning – During service design, thresholds are identified that help gauge the status of a system. When the threshold is reached, predefined parties, or notification groups, are alerted that the threshold has been reached. These people then review the warning and take appropriate action. Warnings could be generated when certain levels are reached for disk storage, network activity, processor utilization, and so on.

Exception – This branch is reserved for configuration items (hardware, software or services) that are operating abnormally or have failed. Abnormal behavior criteria should be defined during service design to better understand what types of scenarios trigger what types of exception handling. Depending on what is defined, an exception could trigger an incident, problem or change record to be opened. Exceptions include events such as a device failing, a service’s response time failing to meet defined SLA levels, and so on.
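The three paths above can be sketched as a small routing function. This is a deliberately simplified illustration: the `classify` and `route` functions, their parameters and the returned action strings are hypothetical, and real thresholds and abnormal-behavior criteria would be set during service design.

```python
from enum import Enum


class Significance(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


def classify(failed, value=None, warning_threshold=None):
    """Pick one of the three paths for an event (simplified assumption)."""
    if failed:
        # The configuration item has failed or is operating abnormally.
        return Significance.EXCEPTION
    if value is not None and warning_threshold is not None:
        if value >= warning_threshold:
            return Significance.WARNING
    return Significance.INFORMATIONAL


def route(ci, significance):
    """Return a description of the next step for each path."""
    if significance is Significance.EXCEPTION:
        return f"open incident record for {ci}"
    if significance is Significance.WARNING:
        return f"alert notification group about {ci}"
    return f"log event for {ci}"


# Hypothetical example: disk usage at 85% against an 80% warning threshold.
disk_path = classify(failed=False, value=85.0, warning_threshold=80.0)
disk_action = route("srv01", disk_path)
```

In this sketch an exception always opens an incident record; as noted above, what an exception actually triggers depends on what was defined for that scenario.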

As you can see, some of the responses are manual and others could be automated. In the case of automatic responses such as rebooting a system, we still need records to indicate what happened and why, in case there is later root cause analysis. This could be recorded via standard requests for change, incident records, etc. We always want to enable people to answer the question “What changed?”
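One way to picture this record keeping is a wrapper that writes a record every time an automated response runs. The sketch below is an assumption-laden illustration: the in-memory `action_log` list stands in for whatever incident or change records your tooling keeps, and `run_automated_response` and `what_changed` are hypothetical helper names.

```python
from datetime import datetime, timezone

# Stand-in for incident or change records (assumption: a real system
# would write to the service management tool instead).
action_log = []


def run_automated_response(ci, action, reason, perform):
    """Run an automated action and record what happened and why."""
    record = {
        "ci": ci,
        "action": action,
        "reason": reason,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    try:
        perform()
        record["outcome"] = "succeeded"
    except Exception as exc:
        # Record the failure too; it matters for later root cause analysis.
        record["outcome"] = f"failed: {exc}"
    action_log.append(record)
    return record


def what_changed(ci):
    """Answer 'What changed?' for one configuration item."""
    return [r for r in action_log if r["ci"] == ci]


# Hypothetical usage: the reboot itself is stubbed out with a no-op.
run_automated_response(
    "srv01", "reboot", "no heartbeat for two minutes", perform=lambda: None
)
history = what_changed("srv01")
```

The point of the design is that the record is written whether the action succeeds or fails, so the history is complete when someone investigates later.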

As with any process, you will start event management and learn as you go. An important step is to review major incidents and significant events to see whether the event management process performed adequately. In some cases faster methods of detection may be needed. In other situations there may be a need to work with other process areas, such as Availability and Capacity Management, to improve the business rules in the event correlation engine. The important point is to have a means to ensure that the process is continuously evolving to meet the needs of the business.

George Spafford is a principal consultant with Pepperweed Consulting and a long-time IT professional. George's professional focus is on compliance, security, management and overall process improvement.
