6 Steps to Service Outage Analysis
SOA provides both a valuable learning exercise and a clear, justified request for change (RFC) to improve service availability and customer satisfaction, writes ITSM Watch columnist Hank Marquis of itSM Solutions.
As is common with ITIL, which is descriptive rather than prescriptive, ITIL does not explain how to carry out a service outage analysis (SOA). In this article I explain what an SOA is, describe its benefits, and give you an easy-to-follow six-step guide to performing one.
The reason to use SOA is to identify the causes of outages and thus reduce their frequency and duration. SOA aims to improve mean time to repair (MTTR).
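As a rough illustration of the metric involved, MTTR can be computed as the average time between a service going down and being restored. The sketch below assumes outage records are available as pairs of timestamps; the sample data and the function name `mean_time_to_repair_minutes` are hypothetical and not part of ITIL.

```python
from datetime import datetime

# Hypothetical outage records: (service down, service restored).
outages = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 2, 12, 14, 0), datetime(2024, 2, 12, 14, 45)),
    (datetime(2024, 3, 20, 22, 15), datetime(2024, 3, 20, 23, 0)),
]

def mean_time_to_repair_minutes(records):
    """Average outage duration in minutes across all records."""
    durations = [
        (restored - down).total_seconds() / 60
        for down, restored in records
    ]
    return sum(durations) / len(durations)

mttr = mean_time_to_repair_minutes(outages)
print(f"MTTR: {mttr:.0f} minutes over {len(outages)} outages")
```

Tracking this figure alongside the number of outages per period gives a simple before-and-after measure for any improvements an SOA recommends.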
The result of an SOA is a clear understanding of what caused an outage and of the risk of future outages from the same cause or causes. An SOA can also produce recommendations for avoiding the issue in the future.
With benefits like these, you might think that performing an SOA is complicated. In reality, the opposite is true: you can perform an SOA without any major investment in software, tools, or training.
Performing an SOA is straightforward. Working with problem management and customers, you examine past outages to identify the configuration items (CIs), whether products, people, or processes, related to each outage. In effect, you review the impact on the organization and infrastructure as reflected by how the organization responded to the outage.
This differs from proactive problem management because availability management has a scope that includes the organization (people, process, training, staffing, etc.).
Getting Started
To get going, collect outage data in the form of incidents and any related closed problems or known errors. Gather a team of people familiar with the outages, the infrastructure, processes, procedures, people, and so on. Be sure to include a customer representative and perhaps some users on the team as well; their input will be critical in guiding the team through the SOA process.
Once you have the team empowered, lead them through the following six steps:
Group related outages together by vendor, product, family, application, customer, etc. Then, using customer and user input as appropriate, categorize each outage as significant or less significant. Focus only on those labeled significant, and monitor the less significant for future outages.
For each outage tagged as significant, review the root cause of the unavailability, for example faulty hardware or software. This requires closed incidents and problems; because the outage has been resolved, the root cause is probably already known.
Perform a simple Pareto analysis to break the significant issues into a smaller group. Using the Pareto 80/20 rule, you can rank the related outages and their causes.
You will typically find that the majority (roughly 80%) of the outages result from a select few causes (roughly 20% of the organization or infrastructure). Focus on that small set of causes, since it accounts for most of the outages.
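The Pareto ranking described above can be done in a spreadsheet, but a minimal Python sketch may make the arithmetic concrete. It assumes each significant outage has already been given a single root-cause label; the sample labels, the function name `pareto_rank`, and the 80% threshold parameter are illustrative assumptions rather than anything ITIL prescribes.

```python
from collections import Counter

# Hypothetical sample data: one root-cause label per significant outage.
outage_causes = [
    "faulty disk", "failed change", "faulty disk", "expired certificate",
    "failed change", "faulty disk", "operator error", "faulty disk",
    "failed change", "faulty disk",
]

def pareto_rank(causes, threshold=0.8):
    """Rank causes by outage count, most frequent first, and return the
    leading causes that together reach `threshold` of all outages."""
    counts = Counter(causes)
    total = sum(counts.values())
    ranked = []
    cumulative = 0
    for cause, count in counts.most_common():
        cumulative += count
        ranked.append((cause, count, cumulative / total))
        if cumulative / total >= threshold:
            break  # the remaining causes are the "trivial many"
    return ranked

vital_few = pareto_rank(outage_causes)
for cause, count, share in vital_few:
    print(f"{cause}: {count} outages, cumulative share {share:.0%}")
```

With real data the split will rarely be exactly 80/20; the point of the ranking is only to show which few causes deserve attention first.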
