
Rightsizing Your Service Level Management Implementation

You can't measure everything, so don't, advises ITSM Watch columnist Frank Bucalo of CA.
Mar 27, 2008

Frank Bucalo


This article takes the ITIL implementation principles we presented in earlier articles on rightsizing service cost models and applies them to the challenge of service level monitoring.


As we did earlier, we are using a worst-case scenario as our use case—a mission-critical service with more than 50 configuration items involved in providing this service from end to end.


Approach 1 – Measure Everything


A bottom-up approach would require a prohibitive amount of resources to determine the physical state of one service. One would have to:


·       Determine the physical state of each component at a given point in time.

·       Determine the logical state of each layer (e.g., database, application server) in the system at any moment, including all fail-over and load-balancing.

·       Determine the logical state of the end to end system.

·       Consider all service schedules and business calendars, possibly across multiple geographic regions.

·       Aggregate metrics and perform evaluations against agreed service levels over agreed time periods.

·       Potentially re-compute service levels when required due to false negatives or positives.


Such an approach would be difficult and expensive to establish and maintain, but it may be desirable given the right business scenario (e.g., a real-time equity trading system). At times, such an approach is not even possible. For example, a policy prohibiting monitoring of security devices would make it impossible to determine system availability and would invalidate all other measurements.


Approach 2 – End User Input


A top-down approach may prove to be more efficient for many systems. For example, suppose you have a lightly used system where risk analysis has told you there is little cost associated with system unavailability. In this case, one can use end-user service desk calls as a proxy for actual measurement of service availability. This is obviously much less complex and expensive to establish and maintain.
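To make the top-down idea concrete, here is a minimal sketch (assuming hypothetical ticket data and a simple outage-overlap calculation, not any particular service desk product) of estimating availability from service desk outage tickets rather than from instrumentation:

```python
from datetime import datetime

# Hypothetical outage tickets from the service desk: (opened, closed) times.
# In practice these would be pulled from the service desk database.
tickets = [
    (datetime(2008, 3, 1, 9, 0), datetime(2008, 3, 1, 9, 45)),
    (datetime(2008, 3, 14, 13, 30), datetime(2008, 3, 14, 14, 0)),
]

def proxy_availability(tickets, period_start, period_end):
    """Estimate service availability using outage tickets as a proxy."""
    period = (period_end - period_start).total_seconds()
    downtime = sum(
        (min(closed, period_end) - max(opened, period_start)).total_seconds()
        for opened, closed in tickets
        if closed > period_start and opened < period_end
    )
    return 1.0 - downtime / period

start, end = datetime(2008, 3, 1), datetime(2008, 4, 1)
pct = round(proxy_availability(tickets, start, end) * 100, 3)
print(pct)  # 99.832
```

Two tickets totaling 75 minutes of downtime in a month yield roughly 99.83% availability, with no per-component monitoring at all.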


Applying Intelligent Principles


Given the requirement to use actual system metrics, you can potentially reduce the resources required to establish and maintain service level agreement (SLA) compliance monitoring. We have discovered some general principles in rightsizing SLA monitoring:


Consider the capabilities of your enterprise management tool set. In many cases, an existing enterprise network management system may be able to both monitor and aggregate available information.


Prefer state-change events over polling events. For example, an SLA based on availability of 10,000 servers using polling every 10 seconds will generate 3.6 million SLA events every hour. This could quickly overwhelm SLA aggregation engines. Since state-changes are rare, pushing state-change events when servers go down and come up will generate only a few SLA events per hour. Such a load can be easily managed by most SLA aggregation engines.
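A back-of-the-envelope calculation shows the difference in event volume for the article's example (the outage rate below is an assumed figure, purely for illustration):

```python
# Polling: 10,000 servers polled every 10 seconds.
servers = 10_000
poll_interval_s = 10
polling_events_per_hour = servers * (3600 // poll_interval_s)

# State-change: push one "down" and one "up" event per outage.
outages_per_hour = 3  # assumed rate, for illustration only
state_change_events_per_hour = outages_per_hour * 2

print(polling_events_per_hour)       # 3600000
print(state_change_events_per_hour)  # 6
```

Six events per hour instead of 3.6 million is a load that any SLA aggregation engine can absorb.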


Correlate at the lowest level of aggregation. For example, a failure in a single server that is clustered may not represent a state change in the availability of the subsystem (e.g., database) because of the existence of the failover servers. Some enterprise management systems feature the ability to do some level of correlation by applying rules to determine logical system state (e.g., business process views).
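The correlation rule above can be sketched in a few lines. The example below (a simplified model, not any vendor's business-process-view feature) emits an SLA event only when the whole clustered subsystem changes state, so a single node failure behind a failover partner never reaches the SLA engine:

```python
class Cluster:
    """A clustered subsystem that emits SLA events only on subsystem-level
    state changes, not on individual node failures."""

    def __init__(self, nodes):
        self.up = {n: True for n in nodes}
        self.events = []  # SLA events actually emitted

    def _available(self):
        # The subsystem is available while any node is up (failover).
        return any(self.up.values())

    def set_node(self, node, is_up):
        before = self._available()
        self.up[node] = is_up
        after = self._available()
        if before != after:  # correlate: emit only on subsystem change
            self.events.append("UP" if after else "DOWN")

db = Cluster(["db1", "db2"])
db.set_node("db1", False)  # failover absorbs this; no SLA event
db.set_node("db2", False)  # whole subsystem down; "DOWN" emitted
db.set_node("db2", True)   # service restored; "UP" emitted
print(db.events)           # ['DOWN', 'UP']
```

Three node-level state changes produce only two SLA events, because the first failure was masked by the failover server.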


Applying such correlation rules to determine logical system state can significantly improve resource utilization.


Set your SLA measurement granularity appropriately. For example, suppose you have an SLA based on the average transaction processing time and you are generating one million transactions per month.

-      If you check the processing time on each transaction, you would generate 2 million SLA events per month ((1 start time + 1 end time) * (1 million transactions)) for your SLA engine to digest and compute. This would probably overwhelm the engine.

-      If you are already recording individual transaction times in a local database for auditing purposes, you can obtain that average with a single SLA event per month by running one SQL statement against the application database (pseudo-SQL): SELECT AVG(transaction_time) FROM application_audit WHERE transaction_start_time BETWEEN <month_start> AND <month_end>.

-      In practice, it is wise to use a finer granularity so that you can proactively monitor your SLA for compliance. For example, you could adjust the SQL statement to generate one record per day, or even per hour. Such low volumes would rarely be a problem for an SLA correlation and aggregation engine.
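The pseudo-SQL above can be made concrete with an in-memory database. This sketch (the `application_audit` table and its column names follow the article's pseudo-SQL, not any real product schema) lets the database aggregate so that only one row per day, rather than one event per transaction, reaches the SLA engine:

```python
import sqlite3

# Build a toy audit table like the one the application already keeps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE application_audit "
             "(transaction_start_time TEXT, transaction_time REAL)")
rows = [("2008-03-01 09:00", 1.2), ("2008-03-01 10:00", 0.8),
        ("2008-03-02 09:30", 2.0)]
conn.executemany("INSERT INTO application_audit VALUES (?, ?)", rows)

# One aggregated SLA record per day instead of one event per transaction.
daily = conn.execute(
    "SELECT date(transaction_start_time), AVG(transaction_time) "
    "FROM application_audit "
    "WHERE transaction_start_time BETWEEN '2008-03-01' AND '2008-04-01' "
    "GROUP BY date(transaction_start_time)").fetchall()
print(daily)  # [('2008-03-01', 1.0), ('2008-03-02', 2.0)]
```

Three audited transactions collapse into two daily averages; at one million transactions per month, the same query would still emit only about thirty rows.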



As you can see, by understanding your business scenario, using the capabilities of your existing tools (such as your service desk and enterprise management systems), and applying smart design, you can rightsize SLA monitoring to deliver high benefit at minimal cost.


Frank Bucalo is a senior architect at CA. Frank has more than 20 years of experience implementing business applications for the Wall Street community. Over the last five years, Frank has built a track record of successfully delivering ITIL implementations, from business analysis through intelligent design to technical implementation.
