
Fault Tree Analysis in 6 Steps

Often misunderstood, FTA requires nothing more complex than paper, pencil, and an understanding of the service at hand, writes ITSM Watch columnist Hank Marquis of itSM Solutions.
Dec 21, 2006

Hank Marquis

If you are IT Infrastructure Library (ITIL) certified, you have no doubt heard of fault tree analysis, or FTA, as a means to improve availability. FTA requires no special software and can help you discover the root causes of a failure or detect where potential failures may occur.

It sounds complicated, but it’s actually fairly easy to use. Reactively, FTA starts with a top-level event, like a service outage, which you evaluate to determine the root cause. Proactively, FTA begins with an event you do not want to occur, like a server failure, and then helps you understand how to prevent the event from occurring.

In either case, you use the fault tree diagram to identify countermeasures that eliminate the causes of the failure. FTA does this through structured analysis of the contributing faults, and their causes, that lead (or have led) to the top-level event.

FTA requires nothing more complex than paper, pencil, and an understanding of the service at hand. To get the most value from FTA, you will need accurate contextual information about the configuration items (CIs) involved. The following six simple steps can help you resolve tough design issues or problems quickly and easily.

1. Select a top-level event for analysis. Try to be specific, for example, “Email server down for more than 4 hours.” Sources of top-level events include Problem/Known Error Records, service outage analysis, potential failures from brainstorming, and “what-if” scenarios based on service level agreements.

2. Identify faults that could lead to the top-level event. Continuing the above example, one possible fault leading to an outage lasting more than four hours might be “loss of power”; another might be “hardware failure.” List all the faults in boxes under the top-level event, and draw lines connecting each fault box to the top-level event box.

3. For each fault, list as many causes as possible in boxes below the related fault. Continuing the example above, in the case of “loss of power,” some causes might be “electrical outage,” “power supply failure,” and so on. Connect the boxes to the appropriate fault box.

4. Draw a diagram of the “fault tree.” Two logic operators, AND and OR (also known as logic gates), are used to represent how faults and causes combine to produce the event above them.

For example, “Email server down for more than 4 hours” could be caused by “loss of power” OR “hardware failure”: either one alone is enough. As an example of AND, a fault might require both “loss of building power” AND “battery backup exhausted” to occur together.

Update the faults and causes by grouping logically related items with AND or OR gates, both between the top-level event and its faults and between each fault and its causes. Then redraw the lines so that they run from the top-level event to logic gates to faults to logic gates to causes.
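
For readers who prefer to check a finished diagram in code, the tree built in steps 1 through 4 can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the `Node` class and its `occurs` method are hypothetical names invented for this illustration, and placing the AND gate beneath “loss of power” is one plausible reading of the example, not a prescribed structure. FTA itself still needs nothing more than paper and pencil.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """One box in the fault tree: the top-level event, a fault, or a cause.

    Hypothetical helper class for illustration only.
    """
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF" for a basic cause
    children: List["Node"] = field(default_factory=list)

    def occurs(self, active: Set[str]) -> bool:
        """Return True if this event occurs, given the active basic causes."""
        if self.gate == "LEAF":
            return self.name in active
        results = [child.occurs(active) for child in self.children]
        if self.gate == "AND":
            return all(results)  # every child must occur
        if self.gate == "OR":
            return any(results)  # one child is enough
        raise ValueError(f"unknown gate: {self.gate}")


# Assumed layout: power is lost only if building power fails AND the
# battery backup is exhausted; the outage happens on power loss OR
# hardware failure.
loss_of_power = Node(
    "loss of power",
    gate="AND",
    children=[Node("loss of building power"), Node("battery backup exhausted")],
)
top_event = Node(
    "Email server down for more than 4 hours",
    gate="OR",
    children=[loss_of_power, Node("hardware failure")],
)

# Try a few "what-if" scenarios against the tree.
building_only = top_event.occurs({"loss of building power"})
both_power = top_event.occurs(
    {"loss of building power", "battery backup exhausted"}
)
hardware_only = top_event.occurs({"hardware failure"})
```

With this layout, losing building power alone does not trigger the top-level event, because the AND gate also requires the battery backup to be exhausted, while a hardware failure alone does trigger it through the OR gate. Walking a paper diagram by hand follows the same logic.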
