The Operator Did It
The realistic answer is that nobody could know that all of these independent issues would interact to cause an outage. How would you address this type of scenario?
The hard part, and this is why operators get blamed the most, is that failures in the component systems often cause scenarios that were previously unfathomable. If you visit this site and go to the accidents and errors page, you will see a long list of accidents, many of which are blamed on operators. Why? My hunch is that they needed to blame somebody. "We are going to put you into a situation for which you are totally unprepared and see how you do. If you fail, we'll blame you." That sure sounds promising, doesn't it?
So what does this mean? First, vendors and internal groups must place an emphasis on proper design and thorough testing. The testing needs to be formalized and carried out by software test engineers versed in the proper methodologies. Note that quality must come first, before features. Testing is a much-needed detective control that helps ensure quality, but it is not the total solution; it must be integrated with the overall system so that its feedback drives process improvement loops. To borrow a phrase from manufacturing -- you don't inspect quality in, you build quality in.
Second, an effective change management process must be in place and followed. There must be detective controls that can assist with the mapping of changes found in production back to authorized change orders. Only authorized changes should be allowed to remain. The ITIL Service Support book and the ITPI Visible Ops methodology provide great guidance here.
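To make the detective control concrete, here is a minimal sketch of the reconciliation step: every change detected in production is matched against the list of authorized change orders, and anything without a match is flagged for review. The record layout, the host and file names, and the "CO-" identifiers are all hypothetical, invented for illustration; they are not taken from ITIL or Visible Ops, and a real implementation would draw its data from a change-detection tool and a change management system.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class DetectedChange:
    """One change found in production (hypothetical record layout)."""
    host: str
    path: str
    # ID of the change order the change claims to belong to, if any.
    change_order_id: Optional[str]


def reconcile(
    detected: Iterable[DetectedChange],
    authorized_ids: Set[str],
) -> Tuple[List[DetectedChange], List[DetectedChange]]:
    """Split detected changes into (authorized, unauthorized)."""
    authorized: List[DetectedChange] = []
    unauthorized: List[DetectedChange] = []
    for change in detected:
        # A missing or unknown change-order ID means the change
        # cannot be mapped back to an authorized change order.
        if change.change_order_id in authorized_ids:
            authorized.append(change)
        else:
            unauthorized.append(change)
    return authorized, unauthorized


# Illustrative data only.
detected_changes = [
    DetectedChange("web01", "/etc/httpd/conf/httpd.conf", "CO-1001"),
    DetectedChange("db01", "/etc/my.cnf", None),
    DetectedChange("web02", "/opt/app/app.cfg", "CO-9999"),
]
approved_orders = {"CO-1001", "CO-1002"}

ok, flagged = reconcile(detected_changes, approved_orders)
for change in flagged:
    print(f"Unauthorized change on {change.host}: {change.path}")
```

The sketch only reports; deciding whether a flagged change is rolled back or retroactively authorized remains a matter for the change management process itself.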
Third, we must evolve adaptive processes that rapidly recognize and adapt to variations from the understood mean. This applies not just to application logic, but to manual human processes as well. Systems and their operators must be adept at recognizing the need to change and then actually changing in a secure, timely and efficient manner.
Fourth, members of failure review boards must avoid taking the easy way out. Rather than flag the outcome as a result of operator error, ask yourself these two simple questions: "Could anyone have realistically known what to do in that situation?" and "Could anyone have computed the solution and acted in the timeframe allotted?" Quite often, the answer is "no," which points back to systemic issues rooted in process, technology, or both.
Rather than expect operators to perform superhuman acts of omniscience, we must confront the systemic issues in processes and technology so that the accidents do not happen again. Yes, people can and do make mistakes. The point is that it is too simple to blame the operator for unexplained failures.
Organizations must dig in and ensure that there is learning after the accident and that appropriate measures are introduced to prevent recurrences. This is done by addressing root causes -- not just playing the blame game.