Delta’s problems started with a small fire and a power outage. The fire was extinguished quickly, but by then it had sparked a chain reaction that led to more than 2,000 cancelled flights, millions of dollars lost, a tarnished reputation and a lot of questions for IT professionals.
During the Delta outage, about 300 of the airline’s 7,000 servers weren’t connected to the backup power system, a vulnerability the company had not been aware of despite investing “hundreds of millions of dollars in technology infrastructure upgrades and systems, including backup systems,” Delta CEO Ed Bastian said in a video message.
“It’s not clear the priorities in our investment have been in the right place,” Bastian added. “It has caused us to ask a lot of questions which candidly we don’t have a lot of answers for.”
If you’re an IT civilian like me (who wasn’t flying Delta this week), the CEO’s frank admission elicits sympathy — except it shouldn’t.
“Not having servers hooked up to backup power — that’s IT 101,” said DR expert Michael Herrera, CEO of MHA Consulting, a business continuity consulting firm in Glendale, Ariz. “When you do a full shutdown of your data center and go to backup power to make sure it can handle the load, this is something you find right away. In my opinion, this should not have happened at this level of an organization.”
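The shutdown test Herrera describes can be backed up by a simple inventory audit. The sketch below is illustrative only, with invented server and circuit names, and assumes the inventory records which power feeds each machine is wired to:

```python
# Hypothetical inventory audit: flag servers with no feed on a backup
# (UPS or generator) circuit. All names and fields are invented examples.

servers = [
    {"name": "res-booking-01", "power_feeds": ["utility-A", "ups-1"]},
    {"name": "crew-sched-04", "power_feeds": ["utility-B"]},  # no backup feed
    {"name": "gate-ops-12", "power_feeds": ["utility-A", "gen-2"]},
]

BACKUP_PREFIXES = ("ups-", "gen-")

def unprotected(servers):
    """Return names of servers with no feed on a backup power circuit."""
    return [
        s["name"] for s in servers
        if not any(f.startswith(BACKUP_PREFIXES) for f in s["power_feeds"])
    ]

print(unprotected(servers))  # ['crew-sched-04']
```

An audit like this only catches what the inventory records, which is why Herrera's full shutdown test matters: loading the backup system for real exposes gaps the paperwork misses.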
Mark Jaggers, a data center recovery and continuity analyst at Gartner, also thinks Delta may have dropped the ball on power source configuration testing, but he sees the airline’s situation as indicative of a larger issue that plagues IT departments: insufficient disaster recovery planning and testing.
“A lot of people do disaster recovery testing around moving a workload between different sites, but once they have done that, do they go back and look for defects in the design of the systems that are there? I don’t know that many companies are doing that sort of testing after the fact or as part of a disaster recovery test,” he said.
“IT environments have become so complex with intricate interdependencies that outages like this are becoming the norm, because it takes a failure of just one component or one human mistake to cause a widespread outage,” said Stephanie Balaouras, an analyst at Forrester Research.
Delta isn’t the only airline to experience this cascading effect. Last month, Southwest Airlines suffered computer problems that canceled hundreds of flights and caused major delays. Similarly, last year a computer glitch affected some 5,000 United Airlines flights.
“I think we’re seeing this a lot with airlines now because of the scale a lot of airlines have reached,” Jaggers said. “Any time they have a failure it can affect a lot of people immediately and then because of the way that airlines are structured, the business recovery time takes quite a while. It tends to impact more than just the six hours of downtime. In Delta’s case it’s taking four days.”
‘It can happen’
What else can IT executives learn from the Delta outage? It all comes back to the basics, says Roberta Witty, a Gartner analyst focused on business continuity management.
“Focus on what’s mission-critical and make sure that’s up and running and working well all the way through a loss of a component, a loss of power, loss of network and loss of data so that at least you’ve got a strong base for mission-critical IT services. And then move on to less mission-critical,” Witty said.
“You want to put your investment where it has most meaning for the company — and that’s all through risk assessment and business impact analysis,” she said.
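The risk-assessment-driven prioritization Witty describes can be illustrated with a toy business impact calculation. The services, dollar figures and recovery-time objectives below are invented for the example:

```python
# Illustrative business impact analysis: rank services by estimated
# exposure (hourly downtime cost x recovery-time objective) so DR
# investment goes to mission-critical systems first. Figures are made up.

services = {
    "reservations": {"cost_per_hour": 500_000, "rto_hours": 1},
    "crew_scheduling": {"cost_per_hour": 200_000, "rto_hours": 2},
    "loyalty_portal": {"cost_per_hour": 10_000, "rto_hours": 24},
}

def dr_priority(services):
    """Order services by potential loss within their recovery window."""
    return sorted(
        services,
        key=lambda s: services[s]["cost_per_hour"] * services[s]["rto_hours"],
        reverse=True,
    )

print(dr_priority(services))
# ['reservations', 'crew_scheduling', 'loyalty_portal']
```

A real business impact analysis weighs far more than downtime cost, but even a crude ranking like this makes the spending conversation Witty calls for concrete.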
Of course, that is easier said than done. A lot of changes happen in just one year, so one can imagine the changes organizations go through in three, five or seven years, Witty said, emphasizing the challenge of keeping a disaster recovery environment up to date.
But vigilance, while time consuming and expensive, is required.
“What firms need to do is continuously rationalize and modernize their IT environments and maintain a continuous dependency mapping,” Balaouras said. In addition, each time there’s a change to the environment, all plans and policies should be updated. Whenever there’s a substantial change, a major test of the DR plan should be mandatory.
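A minimal sketch of the continuous dependency mapping Balaouras recommends, assuming a hand-maintained map with hypothetical service and component names: when one component changes, the map surfaces every downstream service whose DR plan should be reviewed and re-tested.

```python
# Sketch of dependency mapping: given which services rely on which
# components, a change to one component yields every transitively
# affected service. All names are invented for illustration.

from collections import deque

depends_on = {  # service -> components/services it relies on
    "check_in": ["db-primary", "power-grid-A"],
    "boarding": ["check_in", "badge-scanner"],
    "flight_ops": ["db-primary"],
}

def impacted(changed, depends_on):
    """Return all services transitively affected by a changed component."""
    hit, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for svc, deps in depends_on.items():
            if node in deps and svc not in hit:
                hit.add(svc)
                queue.append(svc)
    return sorted(hit)

print(impacted("db-primary", depends_on))
# ['boarding', 'check_in', 'flight_ops']
```

In practice this map would come from a CMDB or discovery tooling rather than a hand-edited dict, but the principle is the same: the map, not memory, decides what a change touches.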
“It’s one area where almost all enterprises are terrible,” she added.
The best place to have this kind of review process, said Gartner’s Witty, is in the early stages of the systems development lifecycle, because issues can be caught before the system is even deployed. Adding DR after the fact can also be a lot more expensive, she added.
But perhaps the most fundamental change IT executives need to make with regard to DR is in their outlook. When DR experts like Herrera and his team perform threat and risk assessments that lay out the most relevant threats to an organization, clients often refuse to believe certain threat scenarios will ever happen, sometimes going so far as to laugh Herrera out of the room. His take:
“I think people in today’s world first have to say ‘it can happen.'”
CIO news roundup for week of August 8
The massive Delta outage wasn’t the only big story this week. Here’s what else grabbed headlines:
- Data theft cases on the rise. In a recent survey of IT professionals by Ponemon Institute, 76% of respondents said their companies experienced loss or theft of company data over the past two years, compared with 67% of respondents in the 2014 survey; 78% of respondents were concerned about ransomware threats. Unstructured data such as emails and documents are among the most prized targets in most breaches. As for the reasons behind the rise in data thefts? The study cites poor governance, including “compromises in insider accounts that are exacerbated by far wider employee and third-party access to sensitive information than is necessary,” coupled with a lack of monitoring of employee and third-party file and email activity, as the main factors. The study, sponsored by Varonis, surveyed 3,027 professionals across the U.S. and Europe.
- Bagging AI startups. Chip maker Intel said Tuesday it was acquiring San Diego-based deep learning startup Nervana Systems for over $400 million. “Their IP and expertise in accelerating deep learning algorithms will expand Intel’s capabilities in the field of AI,” Diane Bryant, executive vice president and general manager of the data center group for Intel, said in a statement. The Intel acquisition comes on the heels of Apple’s purchase last week of Seattle-based artificial intelligence and machine learning startup Turi for $200 million. In other acquisition news, Hewlett Packard Enterprise is buying SGI for $275 million in an effort to expand in the areas of big data analytics and high-performance computing.
- GV chief executive quits. Bill Maris, the founder and now former CEO of the venture capital arm of Google’s parent company Alphabet, said this week he was leaving the firm after eight years. “I have an 11-month-old son and a wife. And I legitimately want to spend more time with them,” Maris told Recode in an interview. GV’s managing partner David Krane was named as the new CEO. Formerly known as Google Ventures, the firm has $2.4 billion in assets. Uber is among the many companies it has invested in. Maris’ exit follows the departure of Chris Urmson, CTO of Google’s autonomous car project, and that of Nest CEO Tony Fadell.
- CVS launches mobile payment. CVS Pharmacy customers can now pay for their purchases, pick up prescriptions and earn loyalty program rewards with a single scan of a barcode in the CVS mobile app at store checkout. “We’ve been excited by the level of customer adoption of these digital solutions, and we will continue our quick pace of innovation and deployment to make our customers’ health care experience even easier,” Brian Tilzer, senior vice president and chief digital officer at CVS Health, said in a statement.
Assistant editor Mekhala Roy contributed to this week’s news roundup.