Fault Tree Analysis in 6 Steps
Often misunderstood, FTA requires nothing more complex than paper, pencil, and an understanding of the service at hand, writes ITSM Watch columnist Hank Marquis of itSM Solutions.
FTA sounds complicated, but it's actually fairly easy to use. Reactively, FTA starts with a top-level event, like a service outage, which you evaluate to determine the root cause. Proactively, FTA begins with an event you do not want to occur, like a server failure, and then helps you understand how to prevent the event from occurring.
FTA requires nothing more complex than paper, pencil, and an understanding of the service at hand. To get the most value from FTA, you will need accurate contextual information about the configuration items (CIs) involved. The following 6 simple steps can help you resolve tough design issues or problems quickly and easily.
1. Select a top-level event for analysis. Try to be specific, for example, "Email server down for more than 4 hours." Sources of top-level events include Problem/Known Error Records, service outage analysis, potential failures from brainstorming, and what-if scenarios based on service level agreements.
2. Identify faults that could lead to the top-level event. Continuing the above example, one possible fault leading to an outage lasting more than four hours might be "loss of power"; another might be "hardware failure." List all the faults in boxes under the top-level event, and connect each fault box to the top-level event box by drawing lines.
3. For each fault, list as many causes as possible in boxes below the related fault. Continuing the example above, in the case of "loss of power," some causes might be electrical outage, power supply failure, and so on. Connect the cause boxes to the appropriate fault box.
4. Draw a diagram of the fault tree. Two logic operators, "and" and "or," also known as logic gates, are used to represent the logical relationships among faults and causes.
For example, "Email server down for more than 4 hours" could be caused by "loss of power" or "hardware failure." Another combination might be "loss of building power" and "battery backup exhausted."
Update faults and causes by grouping logically related items using "and" or "or" gates, both between the top-level event and its faults and between faults and their causes. Redraw the lines so they run from the top-level event to logic gates to faults to logic gates to causes.
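Although paper and pencil are all FTA needs, the gate logic from step 4 can also be sketched in a few lines of Python to check a drawing. This is only a minimal illustration, not part of the method itself: the Node class, its occurs method, and the "BASIC" label for a cause with no gate are hypothetical names, and placing "loss of building power" and "battery backup exhausted" together under "loss of power" is an assumption about how the example tree fits together.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """One box in the fault tree: the top-level event, a fault, or a cause."""
    name: str
    gate: str = "BASIC"  # "AND", "OR", or "BASIC" for a cause with no gate
    children: List["Node"] = field(default_factory=list)

    def occurs(self, active: Set[str]) -> bool:
        """Return True if this box is triggered by the given active causes."""
        if self.gate == "BASIC":
            return self.name in active
        results = [child.occurs(active) for child in self.children]
        if self.gate == "AND":
            return all(results)  # every input below the gate must occur
        if self.gate == "OR":
            return any(results)  # any single input below the gate is enough
        raise ValueError(f"unknown gate: {self.gate}")


# Assumed structure: "loss of power" needs BOTH causes (an "and" gate).
loss_of_power = Node("loss of power", "AND", [
    Node("loss of building power"),
    Node("battery backup exhausted"),
])

# The top-level event occurs if EITHER fault occurs (an "or" gate).
top_event = Node("Email server down for more than 4 hours", "OR", [
    loss_of_power,
    Node("hardware failure"),
])

# Building power alone is not enough while the battery backup holds.
print(top_event.occurs({"loss of building power"}))  # False
print(top_event.occurs({"loss of building power",
                        "battery backup exhausted"}))  # True
print(top_event.occurs({"hardware failure"}))  # True
```

Trying different sets of active causes this way shows which single causes can trigger the top-level event on their own and which only matter in combination, which is the same question the drawn tree is meant to answer.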