Ten years ago, almost to the day, I wrote a series of articles that I still consider some of my best stuff on IT Service Management. Since the original publisher is defunct, I thought I’d resurface them here.
I got lots of positive feedback about my series of articles for The ITSM Review, which use a train crash in Cherry Valley, Illinois as a case study for understanding incident and problem management. (It was part of a wider theme of my articles for The ITSM Review using railroad examples for service management.)
I was reminded of it today by this nasty accident
UEA_PA_Alerts via Twitter
It always mystifies me that people (and ITIL) don’t grok this simple model: incident management is about users, problem management is about causes.
Incident management is
- not a data model
- not a process
- not a team
- not a policy
- not goals and metrics
Incident management is ALL OF THEM – the incident practice.
Once you use that proper perspective then I hope you stop trying to stuff things into incident management that have nothing to do with restoring the user experience.
The team who work on incidents are skilled in making users happy, and focused on that, not debugging technical causes of incidents.
If all the people who work on incidents do not collaborate as a team, whether they are distributed across organisational units or not, you have a fundamental issue.
And if you do not have a single view of all problems as a portfolio (across organisational units), with a manager determining their relative priorities and allocating resources, then your problem resolution practice will be sub-optimal.
We need a clear measure of how good we are at maintaining the user experience. A focused incident management practice gives us that.
Conversely, we need to manage our programme of work to manage the risks of problems, and to systematically eliminate them. We can only do that if we centrally manage ALL causes of incidents in one place, within the problem practice.
Blurring the delineation between problem and incident practices, as ITIL does, undermines all this. It’s all in Cherry Valley:
It had been raining for days in and around Rockford, Illinois that Friday afternoon in 2009, some of the heaviest rain locals had ever seen. Around 7:30 that night, people in Cherry Valley – a nearby dormitory suburb – began calling various emergency services: the water that had been flooding the road and tracks had broken through the Canadian National railroad’s line, washing away the trackbed.
An hour later, in driving rain, freight train U70691-18 came through the level crossing in Cherry Valley at 36 m.p.h, pulling 114 cars (wagons) mostly full of fuel ethanol – 8 million litres of it – bound for Chicago. Although ten cross-ties (sleepers) dangled in mid air above running water just beyond the crossing, somehow two locomotives and about half the train bounced across the breach before a rail weld fractured and cars began derailing. As the train tore in half the brakes went into emergency stop. 19 ethanol tank-cars derailed, 13 of them breaching and catching fire.
In a future article we will look at the story behind why one person waiting in a car at the Cherry Valley crossing died in the resulting conflagration, 600 homes were evacuated and $7.9M in damages were caused.
Here we will be focused on the rail traffic controller (RTC) who was the on-duty train dispatcher at the CN’s Southern Operations Control Center in Homewood, Illinois. We won’t be concerned for now with the RTC’s role in the accident: we will talk about that next time. For now, we are interested in what he and his colleagues had to do after the accident.
While firemen battled to prevent the other cars going up in what could have been the mother of all ethanol fires, and paramedics dealt with the dead and injured, and police struggled to evacuate houses and deal with the road traffic chaos – all in torrential rain and widespread surface flooding – the RTC sat in a silent heated office 100 miles away watching computer monitors. All hell was breaking loose there too. Some of the heaviest rail traffic in the world – most of it freight – flows through and around Chicago; and one of the major arteries had just closed.
One of the major services of a railroad is delivering goods, on time. Nobody likes to store materials if they can help it: railroads deliver “just in time”. Many of the goods carried are perishables: fruit and vegetables from California, stock and meat from the midwest, all flowing east to the population centres of the USA. The railroad had made commitments regarding the delivery of those goods: what we would call Service Level Targets. Those SLTs were enshrined in contractual arrangements – Service Level Agreements – with penalty clauses. And now trains were late: SLTs were being breached.
A number of RTCs and other staff in Homewood switched into familiar routines:
- The US rail network is complex – a true network. Trains were scheduled to alternate routes, and traffic on those routes was closed up as tightly bunched together as the rules allowed to create extra capacity.
- Partner managers got on the phone to the Union Pacific and BNSF railroads to negotiate capacity on their lines under reciprocal agreements already in place for situations just such as this one.
- Customer relations staff called clients to negotiate new delivery times.
- Traffic managers searched rail yard inventories for alternate stock of ethanol, that could be delivered early.
- Crew managers told crews to pick up their trains in new locations and organised transport to get them there.
Fairly quickly, service was restored: oranges got squeezed in Manhattan, pigs and cows went to their deaths, and corn hootch got burnt in cars instead of all over the road in Cherry Valley.
This is Incident Management.
None of it had anything to do with what was happening in the little piece of hell that Cherry Valley had become. The people in heavy waterproofs, hi-viz and helmets, splashing around in the dark and rain, saving lives and property and trying to restore some semblance of local order – that’s not Incident Management.
At least I don’t think it is. I think they had a problem.
An incident is an interruption to service and a problem is an underlying cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes.
To me that is a simple definition that works well. If you read the books and listen to the pundits you will get more complex models that seem to imply that everything done until trains once more rolled smoothly through Cherry Valley is Incident Management. I beg to differ. If the customer gets steak and orange juice then Cherry Valley could be still burning for all they care: Incident Management has met its goals.
Railways (railroads) remind us of how the real world works.
In our last article, we left Cherry Valley, Illinois in its own little piece of hell. For those who missed it, in 2009 a Canadian National railroad train carrying eight million litres of ethanol derailed at a level crossing in the little town of Cherry Valley after torrential rain washed out the roadbed beneath the track. 19 tankers of ethanol derailed, 13 of them split or spilled, and the mess somehow caught fire in the downpour. One person in the cars waiting at the crossing died and several more were seriously injured.
In that previous article we looked at the Incident Management. As I said then, an incident is an interruption to service and a problem is an underlying cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes. I also mentioned that ITIL doesn’t see it that crisply delineated. Anyway, let us return to Cherry Valley…
One group of people worked inside office buildings making sure the trains kept rolling around the obstruction so that the railroad met its service obligations to its users. This was the Incident Management practice: restoring service to the users, focusing on perishable deliveries such as livestock and fruit.
Another group thrashed around in the chaos that was Cherry Valley, trying to fix a situation that was very very broken. Their initial goal was containment: save and treat people in vehicles, evacuate surrounding houses, stop the fire, stop the spills, move the other 100 tank-cars of ethanol away, get rid of all this damn flooding and mud.
The intermediate goal was repair and restore: get trains running again. Often this is done with a “shoo-fly”: a temporary stretch of track laid around the break, which trains inch gingerly across whilst more permanent repairs are effected. This is not a Workaround as we use the term in ITSM. The Workaround was to get trains onto alternate routes or pass freight to other companies. A shoofly is temporary infrastructure: it is part of the problem fix just as a temporary VM server instance would be. While freight ran on other roads or on a shoofly, they would crane the derailed tankers back onto the track or cart them away, then start the big job of rebuilding the road-base that had washed away – hopefully with better drains this time – and relaying the track. Compared to civil engineering our IT repairs look quick, and certainly less strenuous.
Which brings us to the longer-term goal: permanent remediation of the problem. Not only does the permanent fix include new rail roadbed and proper drainage; the accident report (www.ntsb.gov/doclib/reports/2012/RAR1201.pdf) makes it clear that CN’s procedures and communications were deficient as well. Cherry Valley locals were calling 911 an hour beforehand to report the wash-out.
We will talk more about the root causes and long term improvement later. Let’s stay in Cherry Valley for now. It is important to note that the lives and property the emergency responders were saving were unconnected to the services, users or customers of the railroad. All the people working on all these aspects of the problem had only a secondary interest in the timeliness of pigs and oranges and expensive petrol. They were not measured on freight delivery times: they were measured on speed, quality and permanence of the fix, and prevention of any further damage.
If you read the books and listen to the pundits you will get more complex models that seem to imply that everything done until trains once more rolled smoothly through Cherry Valley is Incident Management. I beg to differ. To me it is pretty clear: Incident and Problem practices are delineated by different activities, teams, skills, techniques, tools, goals and metrics. Incident: user service levels. Problem: causes.
While I am arguing with ITIL definitions, let’s look at another aspect of Incidents. ITIL says that something broken is an Incident if it could potentially cause a service interruption in future. Once again this ignores the purpose, roles, skills and tools of Incident Management and Problem Management. Such a fault is clearly a Problem, a (future) cause of an Incident.
(Incidentally, it is hard to imagine many faults in IT that aren’t potentially the cause of a future interruption or degradation of service. If we follow this reasoning to its absurd conclusion, every fault is an incident and nothing is a problem).
Perhaps one reason ITIL hangs these “potential incidents” where it does is because of another odd definition: a Problem is apparently the cause of “one or more incidents”. What’s odd about that? ITIL is big on pro-active (better called pre-emptive) problem management, and yet apparently we need to wait until something causes at least one incident before we can start treating it as a problem. I think the washout in Cherry Valley was a problem long before train U70691-18 barrelled into town.
One of my favourite railroad illustrations is about watching trains. When a train rolls by, keep an eye on nearby staff: those on platforms, down by the track, on waiting trains. On most railroads, staff will stop what they are doing and watch the train – the whole train, watching until it has all gone by. In the old days they would wave to the guard (conductor) on the back of the train. Nowadays they may say something to the driver via radio.
Laziness? Sociability? Railfans? Possibly. But quite likely it is part of their job – it may well be company policy that everybody watches every passing train. The reason is visual inspection. Even in these days of radio telemetry from the FRED (Flashing Rear End Device, a little box on the back that replaces the caboose/guardsvan of old) and trackside detectors for cracked wheels and hotboxes (overheating bearings), there is still no substitute for the good old human eyeball for spotting anything from joyriders to dragging equipment. It is everyone’s responsibility to watch and report: not a bad policy in IT either.
What they are spotting are Problems. The train is still rolling so the service hasn’t been interrupted … yet.
Other Problems make themselves known by interrupting the service. A faulty signal stops a train. In the extreme case the roadbed washes away. We can come up with differing names for things that have and haven’t interrupted/degraded service yet, but I think that is arguing about angels dancing on pinheads. They are all Problems to me: the same crews of people with heavy machinery turn out to fix them while the trains roll by delivering they care not what to whom. Oh sure, they have a customer focus: they care that the trains are indeed rolling and on time, but the individual service levels and customer satisfaction are not their direct concern. There are people in cozy offices who deal with the details of service levels and incidents.
Next time we will return to the once-again sleepy Cherry Valley to discuss the root causes of this accident.
Proactive Problem Management
Just because you rebuild the track doesn’t mean the train won’t derail again.
We have been looking in past articles at the tragic events in little Cherry Valley, Illinois in 2009. One person died and several more were seriously injured when a train-load of ethanol derailed at a level crossing. We talked about the resulting Incident Management, which focused on customers, trains and cargo – ensuring the services still operated, employing workarounds. Then we considered the Problem Management: the injured people and the wreck and the broken track – removing the causes of service disruption, restoring normal service.
In a previous article I said ITIL has an odd definition of Problem. ITIL says a Problem is the cause of “one or more incidents”. ITIL promotes proactive (better called pre-emptive) Problem Management, and yet apparently we need to wait until something causes at least one Incident before we can start treating it as a Problem. I think the washout in Cherry Valley was a problem long before train U70691-18 barrelled into town. A Problem is in fact the cause of zero or more Incidents. A Problem is a problem, whether it has caused an Incident yet or not.
We talked about how I try to stick to a nice crisp simple model of Incident vs. Problem. To me, an incident is an interruption to service and a problem is an underlying (potential) cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes.
ITIL doesn’t see it that crisply delineated: the two concepts are muddied together. ITIL – and many readers – would say that putting out the fires, clearing the derailed tankers, rebuilding the roadbed, and relaying the rails can be regarded as part of the Incident resolution process because the service isn’t “really” restored until the track is back.
In the last article I said this thinking may arise because of the weird way ITIL defines a Problem. I have a hunch that there is a second reason: people consider removing the cause of the incident to be part of the incident because they see Incident=Urgent, Problem=Slow. They want the Incident Manager and the Service Desk staff to hustle until the cause is removed. This is just silly. There is no reason why Problems can’t be resolved with urgency. Problems should be categorised by severity and priority and impact just like Incidents are. The Problem team should go into urgent mode when necessary to mobilise resources, and the Service Desk are able to hustle the Problem along just as they would an Incident.
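To picture Problems being prioritised just as Incidents are, here is a minimal sketch. It assumes a simple impact-plus-urgency score of the kind often used for incidents, applied unchanged to problem records. The `ProblemRecord` class, the 1 to 3 scales and the example entries are hypothetical illustrations, not anything defined by ITIL or used by CN.

```python
from dataclasses import dataclass


@dataclass
class ProblemRecord:
    problem_id: str
    description: str
    impact: int    # assumed scale: 1 = widespread, 3 = minor
    urgency: int   # assumed scale: 1 = act now, 3 = can wait
    incidents_so_far: int = 0  # a problem may have caused zero incidents

    @property
    def priority(self) -> int:
        # Assumed scoring: lower number = more urgent work (range 1 to 5).
        return self.impact + self.urgency - 1


# Illustrative entries only.
portfolio = [
    ProblemRecord("PRB-1", "Faulty signal stopping trains", 2, 2, incidents_so_far=1),
    ProblemRecord("PRB-2", "Roadbed washed out at crossing", 1, 1, incidents_so_far=1),
    ProblemRecord("PRB-3", "Relief drain of unknown capacity", 1, 2),
]

# One portfolio, one ordering, whether or not an incident has happened yet.
work_queue = sorted(portfolio, key=lambda p: p.priority)
for p in work_queue:
    print(p.priority, p.problem_id, p.description)
```

The point of the sketch is that urgency is an attribute of the Problem itself: the washed-out roadbed goes to the top of the queue without having to be relabelled as an Incident.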
This inclusion of cause-removal over-burdens and de-focuses the Incident Management process. Incident Management should have a laser focus on the user and by implication the customer. It should be performed by people who are expert at serving the user. Its goal is to meet the user’s needs. Canadian National’s incident managers were focused on getting deliveries to customers despite a missing bit of track.
Problem Management is about fixing faults. It is performed by people expert at fixing technology. The Canadian National incident managers weren’t directing clean-up operations in Cherry Valley: they left that to the track engineers and the emergency services.
But the way ITIL has it, some causes are removed as part of Incident resolution and some are categorised as Problems, with the distinction being unclear (“For some incidents, it will be appropriate…” ITIL Service Operation 2011 188.8.131.52). The moment you make Incident Management responsible for sometimes fixing the fault as well as meeting the user’s needs, you have a mashup of two processes, with two sometimes-conflicting goals, and performed by two very different types of people. No wonder it is a mess.
It is a mess from a management point of view when we get a storm of incidents. Instead of linking all related incidents to an underlying Problem, we relate them to some “master incident” (this isn’t actually in ITIL but it is common practice).
It is a mess from a prioritisation point of view. The poor teams who fix things are now serving two processes: Incident and Problem. In order to prioritise their work they need to track a portfolio of faults that are currently being handled as incidents and faults that are being handled as problems, and somehow merge a holistic picture of both. Of course they don’t. The Problem Manager doesn’t have a complete view of all faults nor does the Incident Manager, and the technical teams are answerable to both.
It is a mess from a data modelling point of view as well. If you want to determine all the times that a certain asset broke something, you need to look for incidents it caused and problems it caused.
Every cause of a service impact (or potential impact) should be recorded immediately as a problem, so we can report and manage them in one place.
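As a rough illustration of what "one place" could look like in a data model, the sketch below records every cause as a Problem and lets Incidents hang off it. All class, field and asset names are hypothetical; this is one possible shape under those assumptions, not a schema from ITIL or from any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Incident:
    """An interruption to service as experienced by users."""
    incident_id: str
    service: str
    summary: str


@dataclass
class Problem:
    """A cause (or potential cause) of service impact, recorded against an asset."""
    problem_id: str
    asset: str
    description: str
    incident_ids: list[str] = field(default_factory=list)  # zero or more incidents


def problems_for_asset(problems: list[Problem], asset: str) -> list[Problem]:
    """Every fault recorded against an asset, found with a single lookup."""
    return [p for p in problems if p.asset == asset]


# Illustrative records only.
problems = [
    Problem("PRB-1", "track-segment-A", "Drainage inadequate behind embankment"),
    Problem("PRB-2", "track-segment-A", "Roadbed washed out", ["INC-1", "INC-2"]),
    Problem("PRB-3", "signal-7", "Faulty signal", ["INC-3"]),
]

hits = problems_for_asset(problems, "track-segment-A")
print([p.problem_id for p in hits])
print(sum(len(p.incident_ids) for p in hits))
```

Because a storm of related Incidents links to the same Problem record, there is no need for a "master incident", and a Problem with an empty `incident_ids` list is still a first-class record.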
That whole tirade is by way of introducing the idea of reactive and proactive Problem Management.
Reactive Problem Management responds to an incident to remove the cause of the disruption to service. The ITIL definition is more tortuous because it treats “restoring the service” as Incident Management’s job, but it ends up saying a similar thing: “Reactive problem management is concerned with solving problems in response to one or more incidents” (SO 2011 4.4.2).
Pro-active Problem Management fixes problems that aren’t currently causing an incident to prevent them causing incidents (ITIL says “further” incidents).
So cleaning up the mess in Cherry Valley and rebuilding the track was reactive Problem Management.
Once the trains were rolling they didn’t stop there. Clearly there were some other problems to address. What caused the roadbed to be washed away in the first place? Why did a train thunder into the gap at normal track speed? Why did the tank-cars rupture and how did they catch fire?
In Cherry Valley, the drainage was faulty. Water was able to accumulate behind the railway roadbed embankment, causing flooding and eventually overflowing the roadbed, washing out below the track, leaving rails dangling in the air. The next time there was torrential rain, it would break again. That’s a problem to fix.
Canadian National’s communication processes were broken. The dispatchers failed to notify the train crew of a severe weather alert, which they were supposed to do. If they had, the train would have operated at reduced speed. That’s a problem to fix.
The CN track maintenance processes worked, perhaps lackadaisically but they worked as designed. The processes could have been a lot better, but were they broken? No.
The tank cars were approved for transporting ethanol. Those were not required to be equipped with head shields (extra protection at the ends of the tank to resist puncturing), jackets, or thermal protection. In March 2012 the NTSB recommended (R-12-5) “that all newly manufactured and existing general service tank cars authorized for transportation of denatured fuel ethanol … have enhanced tank head and shell puncture resistance systems”. The tank-cars weren’t broken (before the crash). This is not fixing a problem, it is improving the safety to mitigate the risk of rupture.
I don’t think pro-active Problem Management is about preventing problems from occurring, preventing causes of service disruption from entering the environment, making systems more robust or higher quality. That is once again over-burdening a process. If you delve too far into preventing future problems, you cross over into Availability and Capacity and Risk Management and Service Improvement (and Change Management!), not Problem Management.
ITIL agrees: “Proactive problem management is concerned with identifying and solving problems and known errors before further incidents related to them can occur again”. Proactive Problem Management prevents the recurrence of Incidents, not Problems.
In order to ensure that incidents will not recur, we need to dig down to find all the underlying causes. In many methodologies we go after that mythical beast, the Root Cause. We will talk about that next time.
Railroads don’t like derailments
Most readers have got the story now from my recent articles: Cherry Valley, Illinois, 2009, rain bucketing down, huge train-load of ethanol derails, fire, death, destruction.
Eventually the Canadian National’s crews and the state’s emergency services cleaned up the mess, and CN rebuilt the track-bed and the track, and trains rolled regularly through Cherry Valley again.
Then the authorities moved in to find out what went wrong and to try to prevent it happening again. In this case the relevant authority is the US National Transportation Safety Board (NTSB).
Every organisation should have a review process for requests, incidents, problems and changes, with some criteria for triggering a review. In this case it was serious enough that an external agency reviewed the incident. The NTSB had a good look and issued a report (www.ntsb.gov/doclib/reports/2012/RAR1201.pdf). Read it as an example of what a superb post-incident review looks like. Some of our major IT incidents involve as much financial loss as this one and sadly some also involve loss of life.
IT has a fascination with “root cause”. Root Cause Analysis (RCA) is a whole discipline in its own right. The Kepner-Tregoe technique (ITIL Service Operation 2011 Appendix C) calls it “true cause”.
The rule of thumb is to keep asking “Why?” until the answers aren’t useful any more and that – supposedly – is your root cause.
This belief in a single underlying cause of things going wrong is a misguided one. The world doesn’t work that way – it is always more complex.
The NTSB found a multitude of causes for the Cherry Valley disaster. Here are just some of them:
- It was extreme weather
- The CN central rail traffic controller (RTC) didn’t put out a weather warning to the train crew which would have made them slow down, although required to do so and although he was in radio communication with the crew
- The RTC did not notify track crews
- The track inspector checked the area at 3pm and observed no water build-up
- Floodwater washed out a huge hole in the track-bed under the tracks, leaving the rails hanging in the air.
- Railroads normally post their contact information at all grade crossings but the first citizen reporting the washout could not find the contact information at the crossing where the washout was, so he called 911
- The police didn’t communicate well with CN about the washed out track: they first alerted two other railroads
- There wasn’t a well-defined protocol for such communication between police and CN
- Once CN learned of the washout they couldn’t tell the RTC to stop trains because his phone was busy
- Although the train crew saw water up to the tops of the rails in some places they did not slow down of their own accord
- There was a litany of miscommunication between many parties in the confusion after the accident
- The federal standard for ethanol cars didn’t require them to be double-skinned or to have puncture-proof bulkheads (it will soon: this tragedy triggered changes)
- There had been a previous washout at the site and a 36” pipe was installed as a relief drain for flooding. Nobody calculated what size pipe was needed and nobody investigated where the flood water was coming from. After the washout the pipe was never found.
- The county’s storm water retention pond upstream breached in the storm. The storm retention pond was only designed to handle a “ten year storm event”.
- Local residents produced photographic evidence that the berm and outlet of the pond had been deteriorating for several years beforehand.
OK you tell me which is the “root cause”.
Causes don’t normally arrange themselves in a nice tree all leading back to one. There are several fundamental contributing causes. Anyone who watches Air Crash Investigation knows it takes more than one thing to go wrong before we have an accident.
Sometimes one of them stands out like the proverbial. So instead of calling it root cause I’m going to call it primary cause. Sure the other causes contributed but this was the biggest contributor, the primary.
Ian Clayton once told me that root cause …er… primary cause analysis is something you do after the fact as part of the review and wash-up. In the heat of the crisis who gives a toss what the primary cause is – remove the most accessible of the causes. Any disaster is based on multiple causes. Removing any one cause of an incident will likely restore service. Then when we have time to consider what happened and take steps to prevent a recurrence, we should probably try to address all causes. Don’t do Root Cause Analysis, just do Cause Analysis, seeking the multiple contributing causes. If we need to focus efforts then the primary cause is the one to start with, which implies that a key factor in deciding primacy is how broad the potential is for causing more incidents.
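A rough sketch of that idea in Python, assuming each contributing cause is given a judgement score for how broadly it could feed future incidents. The `Cause` class and every score are made up for illustration; none of the numbers come from the NTSB report.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    description: str
    breadth: int      # hypothetical score: potential to cause more incidents
    accessible: bool  # hypothetical flag: can it be removed quickly?


# Cause descriptions paraphrase the list above; the scores are invented.
causes = [
    Cause("No weather warning passed to the train crew", breadth=4, accessible=True),
    Cause("Relief drain never sized for the flood water", breadth=3, accessible=False),
    Cause("No defined protocol between police and railroad", breadth=5, accessible=False),
]

# In the heat of the crisis: go for the most accessible cause first.
first_to_remove = next(c for c in causes if c.accessible)

# In the review afterwards: keep every cause, and treat the broadest as primary.
primary = max(causes, key=lambda c: c.breadth)
others = [c for c in causes if c is not primary]

print(first_to_remove.description)
print(primary.description)
print(len(others))
```

Nothing is discarded here: the primary cause only decides where effort goes first, and the other contributing causes stay on the list to be addressed.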
It is not often you read something that completely changes the way you look at IT. This paper, “How Complex Systems Fail”, rocked me. Reading this made me completely rethink ITSM, especially Root Cause Analysis, Major Incident Reviews, and Change Management. You must read it. Now. I’ll wait.
It says that all complex systems are broken. It is only when the broken bits line up in the right way that the system fails.
It dates from 1998! Richard Cook is a doctor, an MD. He seemingly knocked this paper off on his own. It is a whole four pages long, and he wrote it with medical systems in mind. But that doesn’t matter: it is deeply profound in its insight into any complex system and it applies head-on to our delivery and support of IT services.
“Complex systems run as broken systems”
“Change introduces new forms of failure”
“Views of ‘cause’ limit the effectiveness of defenses against future events… likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly.”
“Failure free operations require experience with failure.”
Many times the person “to blame” for a primary cause was just doing their job. All complex systems are broken. Every day the operators make value judgements and risk calls. Sometimes they don’t get away with it. There is a fine line between considered risks and incompetence – we have to keep that line in mind. Just because they caused the incident doesn’t mean it is their fault. Think of the word “fault” – what they did may not have been faulty, it may just be what they have to do every day to get the job done. Too often, when they get away with it they are considered to have made a good call; when they don’t they get crucified.
That’s not to say negligence doesn’t happen. We should keep an eye out for it, and deal with it when we find it. Equally we should not set out on cause analysis with the intent of allocating blame. We do cause analysis for the purpose of preventing a recurrence of a similar Incident by removing the existing Problems that we find.
I will close by once again disagreeing with ITIL’s idea of Problem Management. As I said in my last article, pro-active Problem Management is not about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality.
It is overloading Problem Management to also make it deal with “How could this happen?” and “How do we prevent it from happening again?” That is dealt with by Risk Management (an essential practice that ITIL does not even recognise) feeding into Continual Service Improvement to remove the risk. The NTSB were not doing Problem Management at Cherry Valley.
Next time we will look at continual improvement and how it relates to problem prevention.
Problem, risk, change, CSI, service portfolio, projects: they all make changes to services. How they inter-relate is not well defined or understood. We will try to make the model clearer and simpler.
Problem and Risk and Improvement
As I said in a previous article, pro-active Problem Management is not about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality. It is overloading Problem Management to also make it deal with “How could this happen?” and “How do we prevent it from happening again?”
In this series of articles, we have been talking about an ethanol train derailment in the USA as a case study for our discussions of service management. The US National Transportation Safety Board wrote a huge report about the disaster, trying to identify every single factor that contributed and to recommend improvements. The NTSB were not doing Problem Management at Cherry Valley. The crews cleaning up the mess and rebuilding the track were doing problem management. The local authorities repairing the water reservoir that burst were doing problem management. The NTSB was doing risk management and driving service improvement.
Arguably, fixing procedures which were broken was also problem management. The local dispatcher failed to tell the train crew of a severe weather warning as he was supposed to do, which would have required the crew to slow down and watch out. So training and prompts could be considered problem management.
But somewhere there is a line where problem management ends and improvement begins, in particular what ITIL calls continual service improvement or CSI.
In the Cherry Valley incident, the police and railroad could have communicated better with each other. Was the procedure broken? No, it was just not as effective as it could be. The type of tank car approved for ethanol transportation was not required to have double bulkheads on the ends to reduce the chance of getting punctured. Fixing that is not problem management, it is improving the safety of the tank cars. I don’t think improving that communications procedure or the tank car design is problem management; if you follow that thinking to its logical conclusion then every improvement is problem management.
But wait: the less-than-effective communications procedure and the single-skinned tank cars are also risks. A number of thinkers including Jan van Bon argue that risk and problem management are the same thing. I think there is a useful distinction: a problem is something that is known to be broken, that will definitely cause service interruptions if not fixed; a “clear and present danger”. Risk is something much broader, of which problems are a subset. The existence of a distinct problem management practice gives that practice the focus it needs to address the immediate and certain risks.
(Risk management is an essential practice that ITIL – strangely – does not even recognise as a distinct practice; the 2011 edition of ITIL’s Continual Service Improvement book attempts to plug this hole. COBIT does include risk management, big time. USMBOK does too, though in its own distinctive way it lumps risk management under Customer services; I disagree: there are risks to our business too that don’t affect the customer.)
So risk management and problem management aren’t the same thing. Risk management and improvement aren’t the same thing either. CSI is about improving the value (quality) as well as reducing the risks.
To summarise all that: problem management is part of risk management, which is part of service improvement.
Service Portfolio and Change
Now for another piece of the puzzle. Service Portfolio practice is about deciding on new services, improvements to services, and retirement of services. Portfolio decisions are – or should be – driven by business strategy: where we want to get to, how we want to approach getting there, what bounds we put on doing that.
Portfolio decisions should be made by balancing value and risk. Value is benefits minus costs. There is a negative benefit and a set of risks associated with the impact on existing services of building a new service: there is the impact of the project dragging people and resources away from production, and the on-going impact of increased complexity, the draining of shared resources, and so on. So portfolio decisions need to be made holistically, in the context of both the planned and live services. And in the context of retired services too: “tell me again why we are planning to build a new service that looks remarkably like the one we killed off last year?” A lot of improvement is about capturing the lessons of the past.
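For readers who think in code, here is a minimal sketch of that balancing act. It takes "value is benefits minus costs" literally and then subtracts a risk figure to rank proposals. The class name, the field names, the linear scoring rule and all the numbers are my own illustrative assumptions, not part of ITIL or any other framework; a real portfolio decision involves far more judgement than one subtraction.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """A candidate change to the service portfolio (hypothetical model)."""
    name: str
    benefits: float       # expected benefit, in arbitrary money units
    costs: float          # build and run costs, including the drag on live services
    risk_exposure: float  # assumed: likelihood x impact, in the same units

    @property
    def value(self) -> float:
        # Value is benefits minus costs.
        return self.benefits - self.costs

    @property
    def score(self) -> float:
        # Assumption: a simple linear trade-off between value and risk.
        return self.value - self.risk_exposure


def rank(proposals):
    """Return proposals ordered from most to least attractive."""
    return sorted(proposals, key=lambda p: p.score, reverse=True)


# Illustrative numbers only.
proposals = [
    Proposal("New service", benefits=100, costs=60, risk_exposure=25),
    Proposal("Improve existing service", benefits=50, costs=20, risk_exposure=5),
]
ranked_names = [p.name for p in rank(proposals)]
```

With these made-up figures the new service has the higher value, but once its risk to existing services is counted, the modest improvement ranks first.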
Portfolio management is a powerful technique that is applied at multiple levels. Project and Programme Portfolio Management is all the rage right now, but it only tells part of the story. Managing projects in programmes and programmes in portfolios only manages the changes that we have committed to make; it doesn’t look at those changes in the context of existing live services as well. When we allocate resources across projects in PPM we are not looking at the impact on business-as-usual (BAU); we are not doling out resources across projects and BAU from a single pool. That is what a service portfolio gives us: the truly holistic picture of all the effort in our organisation across change and BAU.
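The "single pool" idea can be sketched in a few lines. This is only an illustration under my own assumptions: the function name, the priority numbers and the person-day figures are hypothetical, and the greedy "highest priority first" rule is just one simple way to share out a pool. The point is that BAU sits in the same list as the projects and competes for the same resources.

```python
def allocate(pool, demands):
    """Share one pool of effort across BAU and projects.

    demands is a list of (name, priority, requested) tuples, where a lower
    priority number means more important. Grants are made greedily in
    priority order until the pool runs out.
    """
    allocation = {}
    remaining = pool
    for name, _priority, requested in sorted(demands, key=lambda d: d[1]):
        granted = min(requested, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


# Illustrative numbers only: 100 person-days to share.
demands = [
    ("Project A", 2, 30),
    ("Project B", 3, 30),
    ("BAU: keep live services running", 1, 60),
]
result = allocate(100, demands)
```

Here BAU takes its 60 days first, Project A gets its 30, and Project B is left with 10 of the 30 it asked for. A PPM view that only sees the two projects would never show why Project B is starved.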
Service portfolio management is a superset of organisational change management. Portfolio decisions are – or should be – decisions about what changes go ahead for new services and what changes are allowed to update existing services, often balancing them off against each other and against the demands of keeping the production services running. “Sure the new service is strategic, but the risk of not patching this production server is more urgent and we can’t do both at once because they conflict, so this new service must wait until the next change window”. “Yes, the upgrade to Windows 13 is overdue, but we don’t have enough people or money to do it right now because the new payments system must go live”. “No, we simply cannot take on another programme of work right now: BAU will crumble if we try to build this new service before we finish some of these other major works”.
Or in railroad terms: “The upgrade to the aging track through Cherry Valley must wait another year because all available funds are ear-marked for a new container terminal on the West Coast to increase the China trade”. “The NTSB will lynch us if we don’t do something about Cherry Valley quickly. Halve the order for the new double-stack container cars”.
Everything we change is service improvement. Why else would we do it? If we define improvement as increasing value or reducing risk, then everything we change should be to improve the services to our customers, either directly or indirectly.
Therefore our improvement programme should manage and prioritise all change. Change management and service improvement planning are one and the same.
So organisational change management is CSI. They are looking at the beast from different angles, but it is the same animal. In generally accepted thinking, organisational change practice tends to be concerned with the big chunky changes and CSI tends to be focused more on the incremental changes. But try to find the demarcation between the two. You can’t decide on major change without understanding the total workload of changes large and small. You can’t plan a programme of improvement work for only minor improvements without considering what major projects are planned or happening.
In summary, change/CSI is one part of service portfolio management, which also considers delivery of BAU live services. A railroad will stop doing minor sleeper (tie) replacements and other track maintenance when it knows it is going to completely re-lay or re-locate the track in the near future. After decades of retreat, railroads in the USA are investing in infrastructure to meet a coming boom (China trade, ethanol madness, looming shortage of truckers); but they had better beware not to draw too much money away from delivering on existing commitments, and not to disrupt traffic too much with major works.
Simplifying service change
ITIL as it is today seems to have a messy, complicated story about change. We have a whole bunch of different practices all changing our services, from Service Portfolio to Change Management to Problem Management to CSI. How they relate to each other is not entirely clear, and how they interact with risk management or project management is undefined.
There are common misconceptions about these practices. CSI is often thought of as “twiddling the knobs”, fine-tuning services after they go live. Portfolio management is often thought of as being limited to deciding what new services we need. Risk management is seen as just auditing and keeping a list. Change Management can mean anything from production change control to organisational transformation depending on who you talk to.
It is confusing to many. If you agree with the arguments in this article then we can start to simplify and clarify the model:
I have added in Availability, Capacity, Continuity, Incident and Service Level Management practices as sources of requirements for improvement. These are the feedback mechanisms from operations. In addition the strategy, portfolio and request practices are sources of new improvements. I’ve also placed the operational change and release practices in context as well.
These are merely the thoughts of this author. I can’t map them directly to any model I recall, but I am old and forgetful. If readers can make the connection, please comment below.