Say What ... Changed? (aka, How'd That Happen?!?)
Having a better understanding of what changed aids not only IT operations but also information security and reduces risk, writes ITSMWatch columnist George Spafford.
The first question that must be asked at the outset of a security or operational incident is “What changed?” Why? Because the majority of network availability incidents stem from human error, either during design or ongoing operation. Also, most outages are preceded by some form of change to code/executables or configurations at the application or infrastructure layers. So, the faster you can determine whether anything changed, the faster the appropriate response can be pursued and service restored.
When we look at the expanded incident management lifecycle, we can see that an incident occurs, is then detected (either by IT or users), diagnosed, repaired, recovered and, finally, the service is restored. When looking at incident response times in the real world, there are two significant gaps: the first is in actually detecting that an event just happened (or might happen) and the second is in diagnosing what is wrong.
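To see where those gaps sit for a given incident, the time between each lifecycle phase can be computed from recorded timestamps. The following is a minimal sketch, assuming each incident record carries one timestamp per phase; the phase keys and the `phase_durations` function are hypothetical names chosen for illustration, not part of any particular tool.

```python
from __future__ import annotations

from datetime import datetime

# Lifecycle phases in the order described above (key names are illustrative).
PHASES = ["occurred", "detected", "diagnosed", "repaired", "recovered", "restored"]


def phase_durations(timestamps: dict[str, datetime]) -> dict[str, float]:
    """Return the minutes elapsed between each pair of consecutive phases."""
    return {
        f"{start}->{end}": (timestamps[end] - timestamps[start]).total_seconds() / 60
        for start, end in zip(PHASES, PHASES[1:])
    }
```

Comparing the "occurred->detected" and "detected->diagnosed" values across many incidents would show how much of the total outage time is spent before anyone starts the repair.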
At this point, we need to take a step back. From a services perspective, we want to capture what needs to be monitored, why and how to respond as early in the lifecycle as possible. This includes how to monitor individual configuration item (CI) resources relating to memory, storage, processor, network, environmental concerns, etc. The actual monitoring modalities should be part of design, build, testing and deployment. This combines elements of the service lifecycle, flowing from design through transition into operations, with feedback loops that identify new events and new methods and feed them back into design to be built into the next release.
Another dimension of CI state that we should monitor and manage is change to application and infrastructure software. All things being equal, an incident is most likely to arise from a software issue caused by a change, not from hardware failure or environmental issues. We have talked about detective controls for change management in the past, but we need to evolve our thinking by leveraging ITIL v3’s event management process, which can then trigger other processes including incident management, security management, and so on.
To level set, every change represents a risk to IT services and the business overall. Changes that do not follow the management-approved change management process are, by definition, uncontrolled risks. In other words, each such change introduces risks that have not been properly evaluated or mitigated.
Practitioners often oversimplify their changes or do not fully think through what can go wrong. The law of unintended consequences is unforgiving in IT. In some cases problematic changes create active errors where systems crash and the newly introduced error is very obvious. In others, the errors are latent, or dormant, and will not negatively impact performance until they combine in some way that is probably unforeseen. Active and latent errors lead to incidents that rob IT of resources, help enable hackers to penetrate the organization, lower availability and indirectly frustrate customers.
To counter this, we need to know about all changes to services in scope. In order to understand the entire population of relevant changes on a given CI, automated tools are needed to detect changes, apply rules and generate meaningful alerts and reporting.
To be clear, we cannot rely on reported changes because, for example, an accidental change or a malicious change will not be reported in the change management system! All things being equal, the change management tool is likely to contain only a subset of the actual change population.
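One common way to detect changes independently of what people report is to compare cryptographic hashes of files against a known-good baseline. The following is a minimal sketch of that idea, assuming the CI's configuration and executables live under a single directory; `snapshot` and `diff_snapshots` are hypothetical helper names, and a production tool would also need to cover permissions, registry or database settings, and protection of the baseline itself.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def snapshot(root: str) -> dict[str, str]:
    """Map each file under root (by relative path) to the SHA-256 of its contents."""
    base = Path(root)
    return {
        p.relative_to(base).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or modified relative to the baseline."""
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(
            path
            for path in baseline.keys() & current.keys()
            if baseline[path] != current[path]
        ),
    }
```

A baseline would be taken when the CI is in its approved state, and later snapshots compared against it; any non-empty result is a detected change, whether or not it was ever entered in the change management tool.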
When any change is detected, there are several possible vectors through which it may have been introduced:
An authorized person making an authorized change – This is exactly what we want but not what we always get! The result is the new approved state for the CI.
An authorized person making an unauthorized change but with good intentions – Here, a person thinks that they are helping by making a quick change without reporting it, but they are instead putting the organization at risk. The change should be rolled back and the person formally warned because they are creating unacceptable risks.
An authorized person making an unauthorized change accidentally – This can happen very easily. Human error is a huge factor and is repeatedly blamed for incidents. All it takes is one typographical error or one drag and drop of the wrong files to introduce errors. These mistakes need to be detected and proactively corrected.
An authorized person making an unauthorized change with malicious intentions – This is a very bad situation. Someone with knowledge of the systems is engaging in wrongful acts intentionally. If this is detected, a security event needs to be triggered, countermeasures taken, forensics collected, and so on per the security incident response process. Once approved by information security, the system should be restored to its approved state.
An unauthorized person making an unauthorized change – Here we have a hacker! As with the previous item, once activity is detected, a security event needs to be triggered, countermeasures taken, forensics collected, and so on per the security incident response process. Once approved by information security, the system should be restored to its approved state.
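The five vectors above map onto a small decision table. The sketch below encodes that mapping under some assumptions: the `Intent` and `Response` enums and the `recommended_response` function are hypothetical names, intent is treated as an input that an investigation has already established, and any change by an unauthorized person is treated as a security incident.

```python
from __future__ import annotations

from enum import Enum


class Intent(Enum):
    WELL_MEANING = "well-meaning"
    ACCIDENTAL = "accidental"
    MALICIOUS = "malicious"


class Response(Enum):
    ACCEPT = "record as the new approved state"
    ROLL_BACK_AND_WARN = "roll back the change and formally warn the person"
    CORRECT = "correct the mistake proactively"
    SECURITY_INCIDENT = (
        "follow the security incident response process, then restore the "
        "approved state once information security approves"
    )


def recommended_response(
    person_authorized: bool,
    change_authorized: bool,
    intent: Intent | None = None,
) -> Response:
    """Map a detected change to the response described for its vector."""
    if not person_authorized:
        # Assumption: anything done by an unauthorized person is a security incident.
        return Response.SECURITY_INCIDENT
    if change_authorized:
        return Response.ACCEPT
    if intent is None:
        raise ValueError("intent must be established for an unauthorized change")
    if intent is Intent.MALICIOUS:
        return Response.SECURITY_INCIDENT
    if intent is Intent.ACCIDENTAL:
        return Response.CORRECT
    return Response.ROLL_BACK_AND_WARN
```

In practice intent is rarely known at detection time, so a real workflow would likely start every unreconciled change as an open investigation and settle on one of these responses afterwards.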
Thus, we need to detect changes and route them through event management accordingly, including reconciling detected changes against approved changes. This will help both operations and security make decisions about any changes that do not reconcile. In addition, this detected change data needs to be supplied to incident, problem and information security management teams so they can immediately answer the question “What changed?” and decide how to proceed.
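Reconciliation can be sketched as matching each detected change to an approved change record for the same CI. The example below is one simple way to do it, assuming approved changes carry a scheduled implementation window; the `ApprovedChange` and `DetectedChange` records, their fields, and the `reconcile` function are hypothetical and not drawn from any specific change management tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ApprovedChange:
    change_id: str
    ci: str
    window_start: datetime
    window_end: datetime


@dataclass(frozen=True)
class DetectedChange:
    ci: str
    detected_at: datetime


def reconcile(
    detected: list[DetectedChange],
    approved: list[ApprovedChange],
) -> tuple[list[tuple[DetectedChange, ApprovedChange]], list[DetectedChange]]:
    """Split detected changes into those matching an approved change and those that do not.

    A detected change reconciles when an approved change exists for the same CI
    and the detection time falls inside that change's approved window.
    """
    reconciled = []
    unreconciled = []
    for change in detected:
        ticket = next(
            (
                a
                for a in approved
                if a.ci == change.ci
                and a.window_start <= change.detected_at <= a.window_end
            ),
            None,
        )
        if ticket is None:
            unreconciled.append(change)
        else:
            reconciled.append((change, ticket))
    return reconciled, unreconciled
```

The unreconciled list is what operations and security would review, and the full set of detected changes for a CI is what an incident or problem team would pull up first when asking what changed.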
Having the ability to detect unauthorized changes and having defined consequences will improve availability and mean time to repair, reduce unplanned work, increase the amount of time spent on projects and improve security … what’s not to like?
George Spafford is an experienced consultant, a prolific author and speaker, and has consulted and conducted training on strategy, IT management, information security and overall process improvement globally. He can be reached at firstname.lastname@example.org.