BA computer chaos: the unanswered questions
I admit I’m no IT expert, but over the past few days I’ve spoken to plenty of people who are.
These are people who have either engineered airline IT networks or actually worked on British Airways’ systems in the past.
What I’ve heard is a lot of confusion and scepticism at the idea that a local power surge could have wreaked such havoc.
There is also confusion as to why back-up systems didn’t do their job.
Only the people in the room know exactly what happened, so these views are based on the information made public, and bucketfuls of IT experience, including at BA.
One put it like this: “BA has two data centres near Heathrow, about a kilometre apart, so how could a power surge affect both?”
Then there are all the fail-safes in place.
The two data centres mirror each other, I’m told, so when one collapses the other should take over.
All the big installations have back-up power. If the mains fails, a UPS (uninterruptible power supply) kicks in. It’s basically a big battery that keeps things ticking over until the power comes back on, or a diesel generator is fired up.
This UPS is meant to take the hit from any “surge”, so the servers don’t have to.
All the big servers and large routers, I’m told, also have dual power supplies fed from different sources.
I’m also told that, certainly a while ago, BA used to run regular planned outages to confirm all the back-up bits were working, along with daily inspections of the computer room. There is no reason to think these were stopped.
It’s not even clear who was monitoring the system at the crucial time. Was it a contractor? How much experience did they have?
The point is this. Certainly up until a while ago, British Airways’ IT systems had a variety of safety nets in place to protect them from big dumps of uncontrolled power, and to get things back on their feet quickly if there was any problem.
I’m assuming those safety nets are still there, so why did they fail? And did human error play a part in all this?
British Airways chief executive, Alex Cruz, told me recently that the company has launched an exhaustive investigation into what went wrong, although no one can say when it will report back, or whether the findings will ever be made public.
If BA wants to repair its reputation, its owner IAG needs to convince the public that making hundreds of IT staff redundant last year did not leave the airline woefully short of experts who could have fixed the meltdown sooner. And that it won’t happen again – at least not on this epic scale.
Mr Cruz was adamant, by the way, that the outsourcing did not contribute in any way to this mess.