Understanding 'Mean Time Between Failure'
Mean Time Between Failure (MTBF) is an important IT management metric for discrete components as well as overall systems. But misreading MTBF can be costly.
For simplicity, let's define MTBF as the average time between failures. It's either based on historical data or estimated by vendors, and it is used as a benchmark for reliability. Organizations that trend MTBF over time can readily see which devices are failing more often than average and take appropriate action.
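As a minimal sketch of trending MTBF from historical data, the snippet below averages the hours between recorded failures for each device and flags the ones that fall below the fleet average. The device names and hour figures are invented for illustration, and the simple fleet-average threshold is an assumption, not a recommended policy.

```python
from statistics import mean

def observed_mtbf(hours_between_failures):
    """Average time between failures, in hours, from historical gaps."""
    return mean(hours_between_failures)

# Hypothetical history: hours of operation between successive failures.
fleet = {
    "disk-a": [9_000, 11_000, 10_000],
    "disk-b": [2_500, 3_000, 3_500],
}

mtbf_by_device = {name: observed_mtbf(gaps) for name, gaps in fleet.items()}
fleet_average = mean(list(mtbf_by_device.values()))

# A lower MTBF means the device fails more often than its peers.
below_average = [name for name, m in mtbf_by_device.items() if m < fleet_average]

for name, m in mtbf_by_device.items():
    print(f"{name}: observed MTBF {m:,.0f} hours")
print("Failing more often than average:", below_average)
```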
If we assume that there are 8,760 hours per year (365 days x 24 hours per day), then we can divide a vendor's MTBF claim, stated in hours, by that figure to see how long the system should run in years. If we buy a system or component rated at an MTBF of 30,000 hours, then we might assume that, on average, the system would run 3.42 years without a failure. Granted, there are always statistical variations around the average, but 3.42 years doesn't seem bad at all, does it?
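The conversion above is simple enough to sketch in a few lines. The function name is made up for this example, and the calculation ignores leap years, as the 8,760-hour figure does.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def mtbf_hours_to_years(mtbf_hours: float) -> float:
    """Convert an MTBF rating in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR

years = mtbf_hours_to_years(30_000)
print(f"{years:.2f} years")  # prints "3.42 years"
```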
There's a problem with this rationale, however, especially when applied to complex systems. First, as previously mentioned, MTBF is both an estimate and an average. You run the risk of being one of the seeming statistical anomalies whose far higher frequency of failure gets smoothed out by the averaging! The reason could simply be that the MTBF estimate was derived under environmental conditions, such as heat and power, that differ from your own.
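To see how misleading the average can be, here is a rough sketch that assumes failures follow an exponential distribution, that is, a constant failure rate. That model is an assumption introduced for this example only; vendors may use different models, and real hardware often departs from it. The function name is likewise hypothetical.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, as above

def survival_probability(hours: float, mtbf_hours: float) -> float:
    """Chance of no failure before `hours`, ASSUMING an exponential model."""
    return math.exp(-hours / mtbf_hours)

mtbf = 30_000  # rated MTBF in hours
for years in (1, 2, 3.42):
    p = survival_probability(years * HOURS_PER_YEAR, mtbf)
    print(f"{years} years: {p:.0%} chance of no failure")
```

Under this assumption, only roughly 37% of units would reach the 3.42-year mark without a failure, so "on average" is a long way from "most of the time."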
Second, fault-tolerance costs accelerate very rapidly as higher and higher MTBF levels are sought. Third, and perhaps most important, fault-tolerant systems (hardware, software, documentation and processes) are typically more complex than non-fault-tolerant systems, and they become increasingly complex as the level of fault tolerance increases. This increased complexity, in and of itself, creates fertile ground for disasters.
Coupling, Complexity and Normal Accidents
In 1984, Charles Perrow wrote an amazing book titled Normal Accidents: Living with High Risk Technologies. In it he observed that system accidents can be the result of one big failure, but most often are caused by the unexpected interactions between failures of multiple components.
In other words, complex systems whose components are tightly integrated typically fail through the culmination of multiple components failing and interacting in unexpected ways. For example, it's very rare that a plane has a wing fall off mid-flight. It's far more likely that several component failures interact in unpredictable ways that, when combined, cause a catastrophe. Let's investigate this line of reasoning further.
First, errors can be readily visible or latent. The former we can deal with when we detect them. The latter are far more insidious because they can be in a system and undetected, "waiting to spring," if you will.
Second, complex systems made up of hundreds, if not thousands, of components that interact tightly are considered to be tightly "coupled." The possible pathways of interaction are not necessarily predictable. Perrow points out that during an accident, the interaction of failed components can initially be incomprehensible.
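One way to get a feel for why those pathways are hard to predict is simply to count them. The sketch below counts only the possible pairs of components, using the standard "n choose 2" formula; it is an illustration of scale, not a model from Perrow's book, and the function name is made up for this example.

```python
from math import comb

def pairwise_interactions(n_components: int) -> int:
    """Number of distinct component pairs that could, in principle, interact."""
    return comb(n_components, 2)

for n in (10, 100, 1_000):
    print(f"{n:>5} components -> {pairwise_interactions(n):>7,} possible pairs")
```

A system of 1,000 components has 499,500 possible pairs, and that figure ignores interactions among three or more components, so it understates the space an operator would have to reason about during an accident.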
Let's take a highly fault-tolerant database server with its own external RAID and then use clustering software to join it with another server located in another data center.
At this point, we have a pretty complex system composed of thousands of components that are all tightly coupled. The IT operations staff is capable and diligent, performing nightly backups.
Now, imagine a programming error on the RAID controller: an unexpected combination of data throughput and multi-threaded on-board processor activity causes a periodic buffer overflow, and the resulting corrupted data is then written to the drives. It doesn't happen often, but it does happen. Because the systems are exact duplicates of one another, the issue occurs on both nodes of the cluster.
At an observable level, everything would seem to be OK because the error is latent. It isn't readily apparent until one or more database structures become corrupt enough to draw attention to the issue. Once that happens, the network and database people scramble to find out what is wrong, tracking through the logs for clues and checking for security breaches because "it was running fine." The point is that multiple components can interact in unforeseen ways to bring down a relatively fault-tolerant system.