You can't measure everything, so don't, advises ITSM Watch columnist Frank Bucalo of CA.
This article takes the ITIL implementation principles we presented in earlier articles on rightsizing service cost models and applies them to the challenge of service level monitoring.
As we did earlier, we are using a worst-case scenario as our use case: a mission-critical service with more than 50 configuration items involved in providing the service from end to end.
Approach 1: Measure Everything
A bottom-up approach would require a prohibitive amount of resources to determine the physical state of one service. One would have to:
· Determine the physical state of each configuration item.
· Determine the logical state of each layer (e.g., database, application server) in the system at any moment, including all fail-over and load-balancing.
· Determine the logical state of the end-to-end system.
· Consider all service schedules and business calendars, possibly across multiple geographic regions.
· Aggregate metrics and perform evaluations against agreed service levels over agreed time periods.
· Potentially re-
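The aggregation and evaluation step in the list above can be illustrated with a minimal sketch. It assumes outages have already been reduced to non-overlapping (start, end) intervals for the end-to-end service, and it ignores service schedules and business calendars. The function names and the sample target are hypothetical, not taken from any particular tool.

```python
from datetime import datetime


def availability(outages, window_start, window_end):
    """Return the fraction of the measurement window during which the service was up.

    `outages` is a list of (start, end) datetime pairs, assumed not to overlap.
    """
    window_seconds = (window_end - window_start).total_seconds()
    down_seconds = 0.0
    for start, end in outages:
        # Clip each outage to the measurement window before counting it.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()
    return 1.0 - down_seconds / window_seconds


def meets_sla(outages, window_start, window_end, target):
    """Compare measured availability with the agreed target (e.g., 0.999)."""
    return availability(outages, window_start, window_end) >= target


# Hypothetical example: one two-hour outage in a 30-day month.
june_start = datetime(2024, 6, 1)
june_end = datetime(2024, 7, 1)
june_outages = [(datetime(2024, 6, 10, 2, 0), datetime(2024, 6, 10, 4, 0))]
june_availability = availability(june_outages, june_start, june_end)
```

Even this simplified calculation depends on every earlier step in the list being done correctly, which is where most of the cost of the bottom-up approach lies.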
Such an approach would be difficult and expensive to establish and maintain, but may be desirable given the right business scenario (e.g., a real-time equity trading system). At times, such an approach is not possible. For example, a policy prohibiting the monitoring of security devices would make it impossible to determine system availability and would invalidate all other measurements.
Approach 2: End User Input
A top-down approach may prove to be more efficient for many systems. For example, suppose you have a lightly used system where risk analysis has told you there is little cost associated with system unavailability. In this case, one can use end-user service desk calls as a proxy for actual measurement of service availability. This is obviously much less complex and expensive to establish and maintain.
Applying Intelligent Principles
Given the requirement to use actual system metrics, you can potentially reduce the resources required to establish and maintain service level agreement (SLA)
Consider the capabilities of your enterprise management tool set. In many cases, an existing enterprise network management system may be able to both monitor and aggregate available information.
Prefer state-change events over polling events. For example, an SLA based on availability of 10,000 servers using polling every 10 seconds will generate 3.6 million SLA events every hour. This could quickly overwhelm SLA aggregation engines. Since state-changes are rare, pushing state-change events when servers go down and
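The arithmetic behind the polling figure can be checked with a short sketch. The polling numbers come from the example above; the state-change rate used for comparison is purely an assumed value for illustration, and both function names are hypothetical.

```python
def polling_events_per_hour(servers, poll_interval_seconds):
    """Events generated when every server is polled at a fixed interval."""
    polls_per_hour = 3600 // poll_interval_seconds
    return servers * polls_per_hour


def state_change_events_per_hour(servers, changes_per_server_per_hour):
    """Events generated when servers report only when their state changes."""
    return servers * changes_per_server_per_hour


# 10,000 servers polled every 10 seconds: 360 polls per server per hour.
polled = polling_events_per_hour(10_000, 10)

# Assumed rate: each server changes state once every 100 hours on average.
pushed = state_change_events_per_hour(10_000, 0.01)
```

Under that assumed rate, the event volume drops from 3.6 million per hour to roughly 100 per hour.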
Correlate at the lowest level of aggregation. For example, a failure in a single server that is clustered may not represent a state change in the availability of the subsystem (e.g., database) because of the existence of the failover servers. Some enterprise management systems feature the ability to do some level of correlation by applying rules to determine logical system state (e.g., business process views).
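As a rough illustration of correlating at the cluster level, the sketch below tracks the physical state of each clustered server and reports a logical state change only when the subsystem as a whole changes. The class, the `min_healthy` rule, and the server names are assumptions made for this example; real enterprise management systems express such rules in their own configuration.

```python
from enum import Enum


class State(Enum):
    UP = "up"
    DOWN = "down"


def cluster_state(member_states, min_healthy=1):
    """A cluster is logically up while at least `min_healthy` members are up."""
    healthy = sum(1 for state in member_states if state is State.UP)
    return State.UP if healthy >= min_healthy else State.DOWN


class ClusterCorrelator:
    """Turns physical server events into logical subsystem events."""

    def __init__(self, members, min_healthy=1):
        self.members = {name: State.UP for name in members}
        self.min_healthy = min_healthy
        self.logical = cluster_state(self.members.values(), min_healthy)

    def on_server_event(self, name, new_state):
        """Record a physical state change.

        Returns the new logical state if it changed, otherwise None, so only
        genuine subsystem state changes are forwarded to the SLA engine.
        """
        self.members[name] = new_state
        logical = cluster_state(self.members.values(), self.min_healthy)
        if logical is not self.logical:
            self.logical = logical
            return logical
        return None


database = ClusterCorrelator(["db1", "db2"])
first = database.on_server_event("db1", State.DOWN)   # failover server still up
second = database.on_server_event("db2", State.DOWN)  # whole subsystem is down
```

Here the first failure produces no SLA event because the failover server keeps the database available; only the second failure is reported.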
Applying these rules to determine logical system state could significantly improve resource utilization:
- Set your SLA measurement granularity appropriately. For example, suppose you have an SLA based on the average transaction processing time and you are generating 1 million transactions per month.
- If you check the processing time on each transaction, you would generate 2 million SLA events per month ((1 start time + 1 end time) * (1 million transactions)) for your SLA engine to digest and
- If you are already recording individual transaction times in a local database for auditing purposes, you can obtain that average with a single SLA event per month by running a SQL statement against the application database (pseudo-SQL): SELECT AVG(transaction_time) FROM application_audit WHERE transaction_start_time BETWEEN <month_start> AND <month_end>.
- In practice, it is wise to use a lower level of granularity so that you can proactively monitor your SLA measurement for
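The granularity trade-off in the list above can be sketched with an in-memory SQLite database. The table and column names follow the pseudo-SQL; the column types, the ISO-8601 text timestamps, the sample rows, and the function names are assumptions for illustration only. One query returns the monthly average as a single value, and a second returns daily averages for finer-grained monitoring.

```python
import sqlite3


def monthly_average(conn, month_start, month_end):
    """Average transaction time for start times in [month_start, month_end)."""
    row = conn.execute(
        "SELECT AVG(transaction_time) FROM application_audit "
        "WHERE transaction_start_time >= ? AND transaction_start_time < ?",
        (month_start, month_end),
    ).fetchone()
    return row[0]  # None when no transactions fall in the range


def daily_averages(conn, month_start, month_end):
    """One (day, average) pair per day that has at least one transaction."""
    return conn.execute(
        "SELECT substr(transaction_start_time, 1, 10) AS day, "
        "AVG(transaction_time) FROM application_audit "
        "WHERE transaction_start_time >= ? AND transaction_start_time < ? "
        "GROUP BY day ORDER BY day",
        (month_start, month_end),
    ).fetchall()


# Hypothetical audit data; timestamps are ISO-8601 text so they sort correctly.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE application_audit "
    "(transaction_start_time TEXT, transaction_time REAL)"
)
conn.executemany(
    "INSERT INTO application_audit VALUES (?, ?)",
    [
        ("2024-06-03T09:00:00", 1.0),
        ("2024-06-15T14:30:00", 2.0),
        ("2024-07-01T00:00:05", 10.0),  # falls outside June
    ],
)
june_avg = monthly_average(conn, "2024-06-01T00:00:00", "2024-07-01T00:00:00")
june_daily = daily_averages(conn, "2024-06-01T00:00:00", "2024-07-01T00:00:00")
```

Either way, the SLA engine receives a handful of aggregated values rather than two events per transaction.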
Summary
As you can see, by understanding your business scenario, using capabilities of your existing tools, such as your service desk and enterprise management systems, and smart design,
Frank Bucalo is a senior architect at CA. Frank has more than 20 years of experience implementing business applications for the Wall Street
