Why MTTR Matters
Many people have heard of MTBF, or Mean Time Between Failures. Here's another metric that high-performing organizations use to measure and improve their processes.
As IT practitioners look for solutions, frameworks like ITIL are gaining popularity because of their time-tested best practices and process-based approach. While opinions may vary on what constitutes a high-performing IT organization, a critical characteristic is having a "culture of causality." In other words, these organizations use causality as the primary problem-solving tool, instead of blindly applying fixes (i.e., a culture of "let's see if this works").
The Challenge: Change Creates Risks
Change inherently creates risk. A 2001 IDC study reinforces this notion, showing that 78% of all IT downtime is caused by change. If you could simply eliminate change from the computing environment, you would substantially decrease risk.
Sadly, this is just wishful thinking for most IT shops. Changes are required to add new features or software required by the business, replace or upgrade hardware, apply software patches, and so forth.
Sometimes controlling and managing change risk is difficult, because the source of change is beyond our control. For example, the security requirements to keep production configurations current necessitate that certain changes be made. Reading any article on the current patching crisis shows the scope of the problem, and this is only one of many sources of change.
However, many changes are within our span of control, and the challenge becomes how to reduce complexity and scope. For example, consider when someone makes a change to a critical system and, simultaneously, another person makes a seemingly unrelated change, while a third person installs a maintenance release that completely undoes all the work of the other two people.
In this situation, it is very likely that at least one person will experience some surprising and undesired side effects of all these changes (e.g., a critical service interruption, system failure, test failure, or server reboot). When the failure occurs, all the parties will spend an inordinate amount of time diagnosing what just happened. Worse, multiple people may unknowingly be working on the same issue and actually undermine each other's efforts.
To make matters worse, consider this scenario happening in a test environment, one specifically designed to fully test changes before they are deployed in order to reduce outage risks. Now there is no accurate documentation of what the pre-production changes were. When the changes are released into production and failures occur, the production team is doomed to repeat the firefighting, without the benefit of having fought the same fires before.
Mean Time To Repair (MTTR)
While MTBF measures the time between failures, MTTR measures the time between the service interruption and service restoration. MTTR includes problem diagnosis and problem repair. When changes are uncontrolled and unmanaged, as in the scenario described above, MTTR is dominated by problem diagnosis.
Typically, 80% of MTTR is spent in the diagnosis phase. However, when changes are better documented and managed, the time required to successfully diagnose the problem is much lower, and consequently, MTTR is lower as well.
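To make the arithmetic concrete, here is a minimal Python sketch of how MTTR and the share of it spent on diagnosis could be computed from incident records. The Incident record and its field names are assumptions made for illustration, and the sample timestamps are invented so that the diagnosis share works out to the 80% figure cited above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record layout; these field names are assumptions.
    interrupted_at: datetime  # service interruption begins
    diagnosed_at: datetime    # cause identified, repair can begin
    restored_at: datetime     # service restored


def mean_time_to_repair(incidents):
    """Average time from service interruption to service restoration."""
    total = sum((i.restored_at - i.interrupted_at for i in incidents), timedelta())
    return total / len(incidents)


def diagnosis_share(incidents):
    """Fraction of all repair time that was spent on diagnosis."""
    diagnosis = sum((i.diagnosed_at - i.interrupted_at for i in incidents), timedelta())
    total = sum((i.restored_at - i.interrupted_at for i in incidents), timedelta())
    return diagnosis / total


# Invented sample data: a 120-minute outage and a 60-minute outage.
incidents = [
    Incident(datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 10, 36), datetime(2024, 1, 8, 11, 0)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 48), datetime(2024, 1, 9, 15, 0)),
]

print(mean_time_to_repair(incidents))  # 1:30:00
print(diagnosis_share(incidents))      # 0.8
```

Recording the moment of diagnosis separately from the moment of restoration is what makes it possible to see whether better change documentation is actually shrinking the diagnosis phase.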
Consider two scenarios resulting from a critical service interruption. In each case, the incident is escalated to a team charged with restoring service, referred to in ITIL parlance as problem managers.
In one scenario the problem managers scramble to diagnose the issue, trying to find the causal factors that could have contributed to the outage. Critical time is lost while they make phone calls, convene meetings, perhaps make an incorrect diagnosis and apply a fix that fails, and so forth. All of these contribute to a high MTTR.
In the other scenario, all causal factors that could have contributed to the outage are easily at hand. The problem managers refer to the change calendar to see all authorized and scheduled changes, see every change made to the affected asset and who made it, and use a configuration database to see all of the asset's dependencies. In this case, MTTR is not only much lower, but correct diagnoses can be made without even logging on to any servers.
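The lookup in this second scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Change record, the dependency map (standing in for a configuration database), the candidate_causes function, and the 72-hour lookback window are all hypothetical, not features of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    # Hypothetical change-log entry; field names are assumptions.
    asset: str
    made_by: str
    applied_at: datetime
    authorized: bool
    summary: str


def candidate_causes(changes, dependencies, affected_asset, outage_at,
                     lookback=timedelta(hours=72)):
    """Return recent changes to the affected asset or anything it depends on."""
    # Collect the asset plus its direct and indirect dependencies.
    in_scope, pending = set(), [affected_asset]
    while pending:
        asset = pending.pop()
        if asset not in in_scope:
            in_scope.add(asset)
            pending.extend(dependencies.get(asset, []))
    window_start = outage_at - lookback
    hits = [c for c in changes
            if c.asset in in_scope and window_start <= c.applied_at <= outage_at]
    # Most recent change first, since it is often the first one to examine.
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)


# Invented example: the web tier depends on the app tier, which depends on the database.
dependencies = {"web": ["app"], "app": ["db"]}
outage = datetime(2024, 1, 10, 12, 0)
changes = [
    Change("db", "alice", outage - timedelta(hours=2), True, "index rebuild"),
    Change("web", "bob", outage - timedelta(hours=100), True, "certificate renewal"),
    Change("mail", "carol", outage - timedelta(hours=1), True, "spam filter update"),
    Change("app", "dave", outage - timedelta(minutes=30), False, "hotfix"),
]

for change in candidate_causes(changes, dependencies, "web", outage):
    print(change.asset, change.made_by, change.authorized, change.summary)
```

With these sample records, only the app hotfix and the database index rebuild are returned: the mail change is on an unrelated asset and the certificate renewal falls outside the lookback window.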
The difference between the two scenarios shows why high MTTR is so closely correlated with uncontrolled change. As the level of control over each change decreases, MTTR becomes longer and less predictable.
Many approaches can be taken to measure and reduce MTTR. However, five key elements must be addressed in order to reduce MTTR and improve operations:
1. Establish a Change Advisory Board (CAB). Have a team composed of IT operations, security and the relevant stakeholders review changes before they go into production. Map out the types of changes and the review processes needed. It is important to include an emergency review process for changes that must happen during a crisis.
2. Keep Track of Changes. It sounds simple but it takes a lot of time and discipline. For every change, there should be a log entry. To start, this could even be a manual journal. Larger shops will want to investigate configuration management databases.
3. Deploy Detective Controls. To avoid accidental or purposeful bypassing of change control procedures, systems should be deployed that detect variances and notify the appropriate parties. This reinforces the need to follow proper processes and improves security by ensuring that unauthorized malicious changes aren't taking place.
4. Track MTTR. Keep track of MTTR at the system, system-type, and department levels, or whatever reporting structure is appropriate for your organization. The intent is to apply statistical process control and see how MTTR is trending over time.
5. Regularly Review the Data. Collecting the data doesn't matter unless it is routinely reviewed to identify areas for improvement. Depending on your goals and resources available, you may opt to focus on the systems with the top five MTTR values and see what can be done to improve. Certainly, it is beneficial to use MTTR as part of a continuous improvement process (CIP).
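Steps 4 and 5 can be sketched as follows in Python. This is a simplified illustration, not a full statistical process control implementation: the function names are hypothetical, the input is assumed to be (system, minutes to restore) pairs, and the upper control limit is taken as the baseline mean plus three standard deviations, which is only one of several ways such limits are set in practice.

```python
from collections import defaultdict
from statistics import mean, pstdev


def mttr_by_system(incidents):
    """incidents: iterable of (system, minutes_to_restore) pairs."""
    durations = defaultdict(list)
    for system, minutes in incidents:
        durations[system].append(minutes)
    return {system: mean(values) for system, values in durations.items()}


def top_offenders(mttr, n=5):
    """The n systems with the highest MTTR, worst first."""
    return sorted(mttr.items(), key=lambda item: item[1], reverse=True)[:n]


def out_of_control(baseline, recent, sigmas=3):
    """Flag recent periods whose MTTR exceeds the baseline's upper control limit."""
    upper_limit = mean(baseline) + sigmas * pstdev(baseline)
    return [(period, value) for period, value in recent if value > upper_limit]


# Invented sample data, in minutes.
incidents = [("mail", 30), ("mail", 90), ("db", 240), ("web", 45)]
mttr = mttr_by_system(incidents)
print(top_offenders(mttr, n=2))  # [('db', 240), ('mail', 60)]

weekly_baseline = [60, 55, 65, 60]
print(out_of_control(weekly_baseline, [("week 5", 62), ("week 6", 95)]))  # [('week 6', 95)]
```

Grouping by system type or department works the same way: only the key used in the (key, minutes) pairs changes.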
By itself, MTTR is not a magic number to track. As with any single metric, it can be misrepresented or abused. For instance, suppose an organization has a large number of unplanned outages with a long MTTR. To hide this, it can put all changes into an overly large maintenance window. Because there is no official outage, the failed changes and their repair times are never factored into the official MTTR score.
Gene Kim is the chief technology officer and co-founder of Tripwire, Inc. He can be reached at firstname.lastname@example.org. George Spafford is an IT process consultant with more than 12 years of experience in business, information technology and project management. He can be reached at email@example.com.