Emergency Changes Shouldn't Change Anything
Your response to any IT emergency should be as systematic and controlled as any other activity in IT, writes ITSM Watch columnist George Spafford of Pepperweed Consulting.
The ITIL change management process is intended to balance the risks associated with making a change against the risks to the organization of not making the change. To do this, it recommends a series of controls that help manage risk including the formal submission of requests for change, creation of change records, scrutiny of requests, testing, and so on.
These steps, of course, take time and during a crisis, time is a scarce commodity. To facilitate the ability to respond quickly, while still supplying a modicum of risk management, ITIL recommends that one or more emergency change models be created.
Emergency Change Models
The intent is that personnel involved with changes always have procedures and formal guidance to follow rather than trying to perform tasks ad hoc, especially in an emergency. Too often, groups have normal change models, but when a crisis happens the entire change management process is suspended and a flurry of unmanaged, untracked activity introduces even more risk to the organization.
Instead, a procedure must be put in place that allows rapid changes to occur but also allows for a level of review and communication that management is comfortable with. As any change creates risk, the compensating controls built into the emergency change procedures must trade off speed of action against the risk of making matters worse. For example, testing may be reduced, which gets the change into production faster but also increases the risk that a defective change reaches production and causes problems, either immediately or at some later date.
For each service, the required emergency change model should be identified. Each model should specify what testing is needed, what communications are required, and so on. A balance must also be struck: too few models sacrifices agility, while too many models leads to confusion and errors during an emergency. The following sections cover what each model should define.
Define the Criteria
The next step is to define the conditions a situation must meet before an emergency change model can be used. Some groups limit emergency changes to the break/fix work required to get a service operational again. Others also include business criteria.
Who Can Approve
For each model we must determine who can approve an emergency request for change. Some emergency changes are known and predictable, carry low risk, and thus can be pre-authorized much like a standard change. For example, if a database server is known to have memory leaks and a reboot is known to solve the problem, then that reboot could follow a change model that has been pre-approved. In that case the change is essentially just logged for tracking purposes.
Another type of emergency change model is to require that an emergency change advisory board (E-CAB) be convened, either physically or virtually, to quickly review the change and decide whether to authorize it or not. The E-CAB is either a subset of the normal change advisory board (CAB) or certain duly authorized people. Note: The E-CAB is defined in advance. Membership is not decided ad hoc.
It's also important to understand that in a crisis the change manager can make an emergency decision to approve a change. Some groups allow this model outright and others allow it only if the E-CAB cannot be assembled. The idea is that this is the person held accountable for managing the risks of changes, and in an emergency he or she is empowered to make a decision.
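The three approval routes described above (pre-authorized, E-CAB, change manager) can be sketched as a small routing function. This is only an illustration under stated assumptions: the names `EmergencyChangeModel`, `ApprovalPath` and `approval_path` are hypothetical, and the sketch models the variant in which the change manager approves only when the E-CAB cannot be assembled.

```python
from dataclasses import dataclass
from enum import Enum


class ApprovalPath(Enum):
    PRE_AUTHORIZED = "pre-authorized"
    ECAB = "E-CAB"
    CHANGE_MANAGER = "change manager"


@dataclass(frozen=True)
class EmergencyChangeModel:
    """Hypothetical description of one emergency change model."""
    name: str
    pre_authorized: bool           # known, predictable, low-risk change
    change_manager_fallback: bool  # change manager may approve if no E-CAB


def approval_path(model: EmergencyChangeModel, ecab_available: bool) -> ApprovalPath:
    """Return who approves a request raised under the given model."""
    if model.pre_authorized:
        # Approved in advance; the change is only logged for tracking.
        return ApprovalPath.PRE_AUTHORIZED
    if ecab_available:
        # The predefined E-CAB reviews and decides.
        return ApprovalPath.ECAB
    if model.change_manager_fallback:
        # Assumed policy: fall back to the change manager.
        return ApprovalPath.CHANGE_MANAGER
    raise RuntimeError(f"No approver available for model {model.name!r}")
```

Under these assumptions, a pre-approved database reboot resolves to `ApprovalPath.PRE_AUTHORIZED` regardless of E-CAB availability, while a model without pre-authorization resolves to the E-CAB when it can be convened and to the change manager otherwise.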
Post-Emergency Review
There are two things that must be done as a crisis is resolved. First, the emergency changes made must be reviewed to see if they were truly emergencies. We do not want the emergency change model to devolve into an escape route for groups who fail to adequately plan. Emergency changes must be the exception and fewer than 10% of the total number of changes. Why? Because they carry the greatest degree of risk to the organization since the normal testing, review and other steps were likely bypassed for expediency.
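The "fewer than 10%" guideline is easy to check from change records. The following is a minimal sketch; the function name and the idea of checking per reporting period are assumptions, not part of any ITIL tooling.

```python
def within_emergency_limit(total_changes: int,
                           emergency_changes: int,
                           limit_percent: int = 10) -> bool:
    """Return True if emergency changes are fewer than limit_percent of all changes.

    Integer arithmetic is used so that exactly 10% is reliably reported
    as exceeding the "fewer than 10%" guideline.
    """
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    if not 0 <= emergency_changes <= total_changes:
        raise ValueError("emergency_changes must be between 0 and total_changes")
    return emergency_changes * 100 < limit_percent * total_changes
```

For example, 19 emergency changes out of 200 is within the limit, while 20 out of 200 is exactly 10% and is not.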
This brings us to the second point. All emergency changes need to be reviewed and tested afterwards. If security, regression and integration testing were skipped, for example, then that testing needs to be performed as soon as possible. Some errors are immediately visible when changes are implemented. They are known as active errors and we can address them right away. Errors that remain hidden until they combine with something else unforeseen to cause a failure or security event are known as latent errors. These latent errors are what we especially need to check for after emergency changes are implemented.
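One way to make sure skipped testing is actually made up is to record, for each emergency change, which tests were bypassed and which have since been completed. The sketch below is illustrative only; `EmergencyChangeRecord` and its fields are hypothetical names, and the change identifier is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EmergencyChangeRecord:
    """Hypothetical record used during the post-emergency review."""
    change_id: str
    skipped_tests: set[str] = field(default_factory=set)
    completed_tests: set[str] = field(default_factory=set)

    def outstanding_tests(self) -> list[str]:
        """Tests bypassed during the emergency that still need to be run."""
        return sorted(self.skipped_tests - self.completed_tests)

    def review_complete(self) -> bool:
        """True once every skipped test has been performed afterwards."""
        return not self.outstanding_tests()


# Example: three kinds of testing were skipped, one has been made up so far.
record = EmergencyChangeRecord(
    change_id="EXAMPLE-001",  # invented identifier
    skipped_tests={"security", "regression", "integration"},
)
record.completed_tests.add("regression")
```

Here `record.outstanding_tests()` returns `['integration', 'security']`, and `record.review_complete()` stays `False` until those are run as well.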
The development and implementation of one or more emergency change models can greatly benefit organizations. When crises happen, the risks associated with changes still need to be managed, and the best way to do this is by following well-designed processes and procedures.
George Spafford is a principal consultant with Pepperweed Consulting and a long-time IT professional. George's professional focus is on compliance, security, management and overall process improvement.