How to Mitigate the Risk of Failed Changes, Part I
So long as change processes are ignored, risk will be ignored, too, writes ITSMWatch columnist George Spafford.
In other words, if a person in production is installing the change, either individually or as part of a release, and that change cannot be implemented as planned, then the change should be aborted and the rollback plan executed. If the person doing the install begins making ad hoc changes to get the change implemented, then the predictability of outcomes will diminish and the risk of errors, and thus incidents, will increase.
As long as people make changes on the fly, even changes to install procedures, those changes will not have the support of management-approved processes including Change Management, Service Testing and Validation, Release and Deployment (if appropriate), and so on. Furthermore, if those changes aren't recorded appropriately in the configuration management system (CMS), then there is a very real risk that a future change will fail or undo what was corrected on the fly. The bottom line is that if the installer deviates from the plan, the risks to the organization increase, sometimes dramatically, so such deviations should be avoided.
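The abort-and-roll-back rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: `ChangePlan`, `install`, and the status strings are hypothetical names invented here, not part of ITIL or of any particular change management tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChangePlan:
    """Hypothetical approved plan: ordered install steps plus a rollback."""
    steps: List[Callable[[], None]]
    rollback: Callable[[], None]
    log: List[str] = field(default_factory=list)


def install(plan: ChangePlan) -> str:
    """Run the approved steps in order; on any failure, stop and roll back.

    There is deliberately no branch that improvises a fix: a deviation
    from the plan ends the installation.
    """
    for number, step in enumerate(plan.steps, start=1):
        try:
            step()
            plan.log.append(f"step {number} ok")
        except Exception as exc:
            plan.log.append(f"step {number} failed: {exc}")
            plan.rollback()
            plan.log.append("rollback executed")
            return "failed"
    return "successful"
```

The log kept on the plan stands in for whatever the installer would record against the change; in practice that information belongs in the change record and the CMS.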
Incidents and Problems
First, let's define the two terms. An incident is anything that impacts, or may impact, the standard operation of a system. A problem is the root cause of one or more incidents.
If a change causes an incident and is immediately identified as the cause, or if problem management later conducts an analysis and identifies a change as the root cause, then the related change record needs to have its status, or outcome, changed to failed, and a decision must be made as to how best to proceed.
If the incident happens immediately after the implementation, then the rollback plan should be executed. As time passes, the ability to use the rollback plan diminishes because newer changes may be negatively affected by it. If the rollback plan is deemed no longer valid, then a new change needs to be planned and submitted to Change Management following the correct change model. The important points here are that the change is then deemed failed and that incident and/or problem records that relate to the change need to be properly associated in the CMS.
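The handling just described can be sketched as follows. This is a minimal Python sketch under stated assumptions: `ChangeRecord`, `handle_change_caused_failure`, and the returned action strings are hypothetical, and whether the rollback plan is still valid is a human judgment passed in as a flag rather than something the code works out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRecord:
    """Hypothetical change record as it might be held in a CMS."""
    change_id: str
    status: str = "implemented"
    related_records: List[str] = field(default_factory=list)


def handle_change_caused_failure(
    change: ChangeRecord,
    record_ids: List[str],
    rollback_still_valid: bool,
) -> str:
    """Mark the change failed, link its incidents/problems, pick next step."""
    # The change is deemed failed regardless of how it gets corrected.
    change.status = "failed"

    # Associate every related incident or problem record exactly once.
    for record_id in record_ids:
        if record_id not in change.related_records:
            change.related_records.append(record_id)

    if rollback_still_valid:
        return "execute rollback plan"
    # Later changes may depend on this one, so a fresh change is planned
    # and submitted under the appropriate change model instead.
    return "submit new change to Change Management"
```

For example, an incident raised the same night would usually be passed with `rollback_still_valid=True`, while a problem record traced back to the change weeks later would more likely be passed with `False`.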
One initial management reaction is to say that emergency changes will be submitted to correct all implementation errors. This should not be the automatic response. Of all the change models, emergency changes carry the greatest degree of risk because they typically have the least testing and overall scrutiny prior to implementation in production. The premise that the implementer(s) will remember not only to create the emergency change record but also the details of what was done in the heat of battle is weak.
Worse yet, allowing emergency changes to be the default response will not only send a message that ill-conceived changes are okay but it will also give an illusion of safety because people will assume that all changes will be captured, properly documented and reflected in development and test systems. In reality, you will find that people forget to create the emergency change records and/or do not remember everything they changed.
This mindset of changing on the fly and documenting after the fact creates a culture in which the state of the production environment is unknown, which in turn creates new and unknown risks for the organization. The active errors that blow up are one thing, but the latent errors introduced are not immediately observable and, much like icebergs, remain largely hidden until a collision happens.
Sometimes Though ...
Now, despite all that I've said so far and the problems associated with emergency changes, there can be situations where they are valid. For example, imagine a case where a huge marketing campaign is already under way and customers expect a webpage with certain functionality to be working on Monday, yet the implementation ran into problems Friday night. In that case, management needs to discuss the risks, and using an emergency change to forge ahead may be a valid response.
The main point here is that emergency changes should only be used where the risk of not making the change exceeds the risk of making the change. Emergency changes should always be exceptions, not the rule.
In the next article, I will talk about the opposite of failure: success.
George Spafford is an experienced executive, a prolific author and speaker, and has consulted and conducted training on strategy, IT management, information security and overall process improvement globally. He can be reached at firstname.lastname@example.org.