Delta CEO says CrowdStrike-Microsoft outage cost the airline $500 million

MicroWave@lemmy.world · 9 months ago

ricecake · 8 months ago

You are correct that Delta was an outlier, but not with regard to the scale of the outage. The difference was that their scheduling software was down far longer and they handled a lot of the customer side of things significantly less well.

Generally, your protection against operating system issues is the aforementioned restriction on changes and how they go out.
If something is stable, you can expect it to remain stable unless something changes or random chance breaks something.
The operational cost of running multiple operating systems in production like you describe would be high. Typically software is only written to work on one platform, and while it can be modified to work on others, it’s usually a cost with no benefit outside of a consumer environment.
Different operating systems have different performance characteristics you need to factor in for load scaling, different security models, and different maintenance requirements.
Often, but not always, server administrators will focus on one OS, so adding more to the mix can mean people are rusty with whichever is your backup, which can be worse than just focusing on fixing the issue with the primary.
OS bugs are rare, and they usually manifest early or randomly. It’s why production deployments tend to use the OS as long as it’s supported: change means learning the new issues and you’ve probably already encountered all the bullshit with what you’re currently using. That’s why the Linux distros tend to have long term support versions, and Windows Server tends to just get support for a long time with terrible documentation.

I’m a Linux guy, so defending Windows feels weird, and I want to include that I don’t think anyone should use it, particularly for a server, but the professional in me acknowledges that it’s a perfectly functional hammer.

As we’ve learned more, I’ve become more disparaging of Delta’s choice to not keep the scheduling system modernized in a way that could recover faster, and to not invest enough in making systems homogeneous across different airports. I still think that these issues are largely independent of their actual disaster recovery or resiliency plans.
Inevitably, the lawsuits will determine that the blame for the damage is split between the two of them. My bet is 70/30 CrowdStrike/Delta, since it’s easy to demonstrate that the issue was fundamentally caused by CrowdStrike and negatively impacted other airlines and businesses in general. Some was clearly Delta’s fault for failing to keep a system modernized enough to handle a massive shift like this; they would have been similarly disrupted by any outage with flight cancellations.

rekorse@lemmy.world · 8 months ago

Would you say that an OS forced-update type of error like this is so rare that Delta didn’t need to plan for it? If I understand you right, it’s not actually a problem that Delta used Windows for their servers, at least not to the point it would affect liability.

If Delta was the only airline who set up their infrastructure in this way, to the point it was markedly different from other companies, could they argue Delta essentially didn’t protect itself at all?

I’m still having a lot of trouble figuring out how CrowdStrike would even assess a risk like this if the possible payment is based on how well a company recovers and how much income they lost.

I actually agree with your 70/30 split, but unless Delta paid more than the other airlines to justify the payout in damages, it’s still confusing to me how the amount CrowdStrike has to pay depends to some degree on Delta’s setup and restoration.

I think there’s just no better way to handle this and I’m searching for an answer that doesn’t exist.

Furbland@lemmy.world · 8 months ago

oh hey you’re the vegan cat guy

ricecake · 8 months ago

Only for the sake of specific-ness: CrowdStrike forced the update, not the OS. :) And yeah, that’s generally unheard of. So unheard of that it reversed my professional recommendation, based purely on how they could release a product that bypassed user expectations so aggressively and without any documentation that it was happening.
I work in the security sector with computers, and before all this I would have said “yeah, CrowdStrike is a widely deployed product and if it fits your requirements it’s reasonable to use”. Now I would strongly recommend against it, not because of this incident, but because of the engineering, product and safety culture that thought it was okay to design a product this way without user controls or even documentation around any part of it. Their post-incident report is horrifying in how much testing it reveals they weren’t doing.

I wouldn’t advise someone to use Windows for a server, but that’s a preference thing, not a “hazard” thing. If they had a working Windows setup I wouldn’t even comment on it.

What it sounds like happened to Delta is that they were set up roughly like other companies. Maybe a little loose on different setups at different airports. That’s a forgivable level of slop. Where they differed was in having a piece of software that couldn’t handle being entirely shut off, and then immediately loaded to 100% with no ease in.
Scheduling is a type of computer problem that’s very susceptible to getting increasingly difficult the bigger the number of things being worked with. Like exponentially more difficult, but it’s actually worse than exponential.
I know nothing about their system, but I can guess that it worked fine when it was running because it needed to make a small number of scheduling decisions at a time, and could look at the existing state of things as a decided “fact”. Start the system fresh, and suddenly it needs to compare the hundreds of airports, hundreds more planes and crews, and thousands of possible routes to each other, and is looking at literally billions of possible schedules which it needs to sort through to pick the best ones.
Other airlines appear to have scheduling systems that were either developed using more modern techniques that can find “good enough” very efficiently, or the application was written to fail less easily or had better hardware so it could work faster.
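
To put rough numbers on the "worse than exponential" growth described above, here is a toy sketch in Python. It assumes the simplest possible version of the problem, pairing n crews one-to-one with n flights, which is a simplification for illustration and not how any real airline scheduler (Delta's included) is known to work. The function name is made up.

```python
import math


def cold_start_pairings(n: int) -> int:
    """Ways to pair n crews one-to-one with n flights: n! (toy model, an assumption)."""
    return math.factorial(n)


if __name__ == "__main__":
    # Compare plain exponential growth (2**n) with factorial growth (n!).
    for n in (5, 10, 20, 30):
        print(f"n={n:>2}  2^n={2 ** n:,}  n!={cold_start_pairings(n):,}")
```

At n = 20, 2^n is about a million while n! is already in the quintillions. A system that is already running only has to slot a few new decisions into a mostly fixed state, so it is working with a tiny n; a cold start has to face the whole thing at once.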

For whatever reason, delta was the only one that had the key bit of software fail to come back up.

Delta has higher costs than the other airlines because there are regulations protecting travelers and ensuring they get appropriate refunds and accommodations if their flights are cancelled. Other airlines were able to shift people around and get going again before they had to pay out too much in ticket refunds, food, or hotels.
Delta is arguing that CrowdStrike is responsible for the total cost of the incident, which would include all the refunds and hotels, since CrowdStrike caused it.
CrowdStrike recently responded that they think their liability is no greater than $10 million. They seem to be taking the position that they’re only responsible for the immediate effects, so things like diverting aircraft, needing to manually poke systems and all that.

“Yeah I t-boned you when I ran a red light, so I owe you for the damage to your car, but your car was a dangerous piece of crap so I’m not responsible for your broken legs, hospital bills or lost wages”.
I think the judge will find that running the red light means they are responsible for the extended consequences of their actions, even if they’re vastly in excess of what anyone would have predicted up front, but also that the car was pretty dangerous and it was really only a matter of time, so it’s not all on them.

If there’s one thing I’ve learned from reading about court cases, it’s that a civil suit like this will get really complicated with how they assess damages and responsibilities.

And yeah, there’s no perfect answer for computer system stability. You can never get perfect stability, and each 9 you add to your 99.9% uptime costs more than the last one. Eventually you have teams of people whose full time job is keeping the system up for an additional second per year. And even with that, sometimes Google still goes down because it’s all a numbers game.
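
For a sense of scale on those 9s, here is a small Python sketch of the yearly downtime budget each extra nine leaves you. It only does the arithmetic on allowed downtime; it says nothing about what each nine costs to achieve. The function name is made up, and the year is assumed to be 365 days.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: ignoring leap years


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an uptime with `nines` nines (3 -> 99.9%)."""
    return SECONDS_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for nines in range(2, 8):
        print(f"{nines} nines: {allowed_downtime_seconds(nines):,.1f} seconds/year")
```

At 99.9% you get about 8.76 hours of downtime a year. By seven nines the whole budget is a little over three seconds, which is the point where people really are fighting over single seconds per year.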

I didn’t mean to ramble so long, but I have opinions and I get type-y before bed. :)