
How to Mitigate the Risk of Failed Changes, Part I

So long as change processes are ignored, risk will be ignored too, writes ITSMWatch columnist George Spafford.
Dec 11, 2009
By George Spafford





To begin with, a change should be labeled “Failed” if it cannot be implemented according to the formally defined plan, if it causes incidents to occur, or if the stated change objective is not achieved. Organizations often spend a surprising amount of time arguing about what a failed change is. When they get the underpinning concepts wrong, they create more risks and lose the ability to make proper management decisions and take effective corrective action.

In other words, if a person in production is installing the change, either individually or as part of a release, and that change cannot be implemented as planned, then the change should be aborted and the rollback plan executed. If the person doing the install begins making ad hoc changes to get the change implemented, the predictability of outcomes diminishes and the risk of errors, and thus incidents, increases.

For example, if the installer discovers during an install that a dependent file needs to be updated, but that update was not part of the planning and testing of the change, then doing the upgrade ad hoc may create problems for other applications that rely on that file. Indeed, to make matters more interesting (or worse, depending on where you sit), the newly introduced error may be latent, or dormant, until period-end, when the batch processes kick off and people have forgotten all about the “quick fix” the installer did two months earlier.

As long as people make changes on the fly, even changes to install procedures, those changes will not have the support of management-approved processes including Change Management, Service Testing and Validation, Release and Deployment (if appropriate), and so on. Furthermore, if those changes aren’t recorded appropriately in the configuration management system (CMS), there is a very real risk that a future change will fail or undo what was corrected on the fly. The bottom line is that if the installer deviates from the plan, the risks to the organization increase, sometimes dramatically, so deviation should be avoided.
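The three-part definition of a failed change given at the start of this article can be written down as a simple check. The following is only a minimal sketch in Python; the `ChangeResult` class, its field names, and the outcome strings are hypothetical and are not taken from any particular ITSM tool.

```python
from dataclasses import dataclass


@dataclass
class ChangeResult:
    """Hypothetical summary of how a change went (field names are assumptions)."""
    implemented_as_planned: bool  # installed exactly per the approved plan
    caused_incidents: bool        # one or more incidents traced to the change
    objective_achieved: bool      # the stated change objective was met


def change_outcome(result: ChangeResult) -> str:
    """Return "failed" if any one of the three failure criteria applies."""
    failed = (
        not result.implemented_as_planned
        or result.caused_incidents
        or not result.objective_achieved
    )
    return "failed" if failed else "successful"
```

Note that under this definition a change that was forced into production through ad hoc steps counts as failed even if it appears to work, because it was not implemented as planned.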

Incidents and Problems

First, let’s define the two terms. An incident is anything that impacts, or may impact, the standard operation of a system. A problem is the root cause of one or more incidents.

If a change causes an incident and is immediately identified as the cause, or if problem management later conducts an analysis and identifies the change as the root cause, then the related change record needs to have its status, or outcome, changed to “failed” and a decision made as to how best to proceed.

If the incident happens immediately after the implementation, then the rollback plan should be executed. As time passes, the ability to use the rollback plan diminishes because newer changes may be negatively affected by the rollback. If the rollback plan is deemed no longer valid, then a new change needs to be planned and submitted to Change Management following the correct change model. The important points here are that the change is deemed failed and that the incident and/or problem records that relate to the change need to be properly associated with it in the CMS.
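The handling described above can be sketched as a short routine: mark the change as failed, associate the incident or problem record with it, and then choose between the rollback plan and a new change. This is an illustration only; `ChangeRecord`, `handle_failed_change`, the record IDs, and the returned strings are all hypothetical names, and a real CMS would have its own record structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRecord:
    """Hypothetical change record; real CMS schemas will differ."""
    change_id: str
    status: str = "implemented"
    related_records: List[str] = field(default_factory=list)


def handle_failed_change(change: ChangeRecord,
                         related_record_id: str,
                         rollback_plan_valid: bool) -> str:
    """Record the failure and return the next step to take."""
    # The change is deemed failed regardless of which path is chosen.
    change.status = "failed"
    # Associate the incident or problem record with the change.
    change.related_records.append(related_record_id)
    if rollback_plan_valid:
        return "execute rollback plan"
    # Rollback is no longer safe: go back through Change Management.
    return "submit new change"
```

For example, `handle_failed_change(ChangeRecord("CHG-1"), "INC-7", rollback_plan_valid=False)` would leave the record marked as failed, linked to `INC-7`, and would point toward a new change rather than a rollback.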

Emergency Changes

One initial management reaction is to say that emergency changes will be submitted to correct all implementation errors. This should not be the automatic response. Of all the change models, emergency changes carry the greatest degree of risk because they typically have the least testing and overall scrutiny prior to implementation in production. The premise that the implementers will remember not only to create the emergency change record but also the details of what was done in the heat of battle is weak.

Worse yet, allowing emergency changes to be the default response will not only send a message that ill-conceived changes are okay, it will also give an illusion of safety, because people will assume that all changes will be captured, properly documented and reflected in development and test systems. In reality, you will find that people forget to create the emergency change records and/or do not remember everything they changed.

This mindset of changing on the fly and documenting after the fact creates a culture where the true state of the production environment is unknown, which in turn creates new and unknown risks for the organization. The active errors that blow up are one thing, but the latent errors are not immediately observable and, much like icebergs, remain largely hidden until a collision happens.

Sometimes Though ...

Now, despite all that I've said so far and the problems associated with emergency changes, there can be situations where they are valid. For example, imagine a case where a huge marketing campaign is already under way and customers expect a webpage with certain functionality to be working on Monday, yet the implementation ran into problems Friday night. In that case, management needs to discuss the risks, and using an emergency change to forge ahead may be a valid response.

The main point here is that emergency changes should only be used where the risk of not making the change exceeds the risk of making the change. Emergency changes should always be exceptions, not the rule.
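The rule above is a comparison of two risks. As a rough sketch, and assuming that each risk is scored as likelihood multiplied by impact (a common convention, but an assumption here and not something this article prescribes), the decision could look like this in Python. The function names and the example figures are hypothetical.

```python
def risk_score(likelihood: float, impact: float) -> float:
    """Assumed scoring model: likelihood (0 to 1) times impact (any consistent unit)."""
    return likelihood * impact


def emergency_change_justified(risk_of_not_changing: float,
                               risk_of_changing: float) -> bool:
    """An emergency change is warranted only if inaction is the riskier option."""
    return risk_of_not_changing > risk_of_changing


# Illustrative figures only: a likely, costly outage if nothing is done
# versus a less likely failure of the emergency change itself.
inaction = risk_score(likelihood=0.9, impact=100.0)
action = risk_score(likelihood=0.3, impact=100.0)
decision = emergency_change_justified(inaction, action)
```

The strict comparison is deliberate: when the two risks are equal, the sketch does not approve the emergency change, which keeps emergency changes the exception.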

In the next article, I will talk about the opposite of failure: success.

George Spafford is an experienced executive, a prolific author and speaker, and has consulted and conducted training on strategy, IT management, information security and overall process improvement globally. He can be reached at gspaff@hotmail.com.
