
Understanding 'Mean Time Between Failure'

May 14, 2004

George Spafford

One must also conclude that if the chance of errors increases with the level of complexity, then so too must the probability of errors that can cause security breaches. Thus, not only is failure inevitable, but so is the existence of at least one exploitable security hole.
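The compounding effect can be illustrated with a minimal sketch. It assumes, purely for illustration, that each component independently carries the same small chance of containing an exploitable flaw; the function name and the 0.1% figure are hypothetical and do not come from any measured system.

```python
def p_at_least_one_flaw(per_component_p: float, components: int) -> float:
    """Chance that at least one of `components` parts is flawed.

    Assumption: flaws are independent and equally likely in every part.
    """
    return 1.0 - (1.0 - per_component_p) ** components


# Hypothetical 0.1% chance of an exploitable flaw per component.
for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} components: {p_at_least_one_flaw(0.001, n):.1%}")
```

Even with a tiny per-component probability, the chance that at least one flaw exists approaches certainty as the component count grows.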

Mean Time to Repair (MTTR)

Let's face it, accidents can and will happen. Fault tolerance can create a false sense of security. From our 30,000-hour example, we could unrealistically expect 3.42 years of uninterrupted bliss, but reality and Mr. Murphy don't like this concept.
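To put a number on why that expectation is unrealistic, here is a minimal sketch. It assumes failures arrive at a constant rate (an exponential model, which this article does not specify), and the function names are illustrative only.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def mtbf_in_years(mtbf_hours: float) -> float:
    """Convert an MTBF quoted in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def survival_probability(run_hours: float, mtbf_hours: float) -> float:
    """Chance of running failure-free for `run_hours`.

    Assumption: constant failure rate (exponential model).
    """
    return math.exp(-run_hours / mtbf_hours)


mtbf = 30_000
print(f"{mtbf_in_years(mtbf):.2f} years")          # 3.42 years
print(f"{survival_probability(mtbf, mtbf):.1%}")   # 36.8%
```

Under that assumption, only about 37% of such systems would reach the full 30,000 hours without a failure. The mean is an average, not a guarantee.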

Yes, fault tolerance reduces the chance of some errors, but as the system's inherent complexity and level of interaction increase, so does the chance of an accident. How often is a fault-tolerant system simple? How many people in your organization fully understand your fault-tolerant systems? There are many questions that can be asked, but here is the most important one: "When the system fails, and it will fail, how easy will it be to recover?"

Not too surprisingly, highly fault-tolerant systems often present a dichotomy. On one hand, their likelihood of failure is lower than that of a standard system lacking redundancy, but on the other, when they do fail, they can be a bear to troubleshoot and get back online.

Instead of spending tens, if not hundreds, of thousands of dollars on fault-tolerant hardware, what if IT balanced the costs of fault tolerance with an eye toward unrelentingly driving down the MTTR of the systems? True systemic fault tolerance is a combination of hardware, software, processes, training, and effective documentation. Sometimes teams focus on the hardware first, treat the software requirements as a distant second, and then totally overlook the process, training, and documentation needs.

Always remember that availability can be addressed by trying to prevent downtime through fault tolerance as well as by reducing the time spent recovering when an actual outage does occur. Therefore, activities surrounding the rapid restoration of service and problem resolution are essential. The ITIL Service Support book provides great guidance on both initially restoring services through Incident Management and ultimately addressing the root causes of the outage via Problem Management.
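The two levers can be compared with the standard steady-state relationship, availability = MTBF / (MTBF + MTTR). The sketch below uses hypothetical figures chosen only for illustration: a fault-tolerant system that rarely fails but takes a day to restore, and a simpler system that fails three times as often but is restored in an hour.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of outage per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical figures, not measurements from any real system.
fault_tolerant = availability(mtbf_hours=30_000, mttr_hours=24)
fast_recovery = availability(mtbf_hours=10_000, mttr_hours=1)

for name, a in (("fault-tolerant", fault_tolerant),
                ("fast-recovery", fast_recovery)):
    print(f"{name}: {a:.4%} available, "
          f"{downtime_hours_per_year(a):.2f} h down per year")
```

With these assumed numbers, the system with the lower MTBF comes out ahead: roughly 7 hours of downtime per year for the fault-tolerant system versus under 1 hour for the one that recovers quickly.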

The Recovery-Oriented Computing (ROC) research team, a joint project at Berkeley and Stanford, also publishes useful material on designing systems for rapid recovery.


Of course MTBF matters -- it is an important metric to track in regard to system reliability. The main point is that even fault-tolerant servers fail. As the level of complexity and coupling increases, systemic failure becomes inevitable as component failures accumulate and interact in unexpected ways. IT should look at availability holistically and address both the fault tolerance of the initial system design and the speed with which a failed system can be recovered.

In some cases, it may make far more sense to invest less in capital-intensive hardware and more in the training, documentation, and processes necessary to both prevent and recover from failures.