Could a CrowdStrike-Type Outage Hit IBM i?
August 21, 2024 Alex Woodie
It’s been a month since CrowdStrike caused a massive IT outage that took down large swaths of the Internet, causing billions of dollars in damage. The CrowdStrike outage was a Windows-only affair, but it’s worth asking: Could a similar incident impact IBM i?
The largest IT outage in history began the morning of July 19, when a poorly configured update to CrowdStrike’s endpoint security software, called Falcon, caused 8.5 million Windows computers to crash, resulting in the dreaded blue screen of death (BSOD).
For computer users of a certain vintage, the BSOD is – or at least was – commonplace. Windows used to crash all the time, forcing users to either unplug the computer or execute the three-finger salute: Ctrl-Alt-Del. But Microsoft eventually got its act together, and today’s Windows systems are much more stable than they used to be, which is one reason the mass BSOD was such a head turner.
The impact of the CrowdStrike outage was immediate and far-reaching. Tens of thousands of banks, hospitals, governments, schools, news organizations, stores, and airlines faced partial or full outages. More than 5,000 flights were cancelled around the world as airlines struggled to restart their systems.
Delta Air Lines was hit particularly hard, as the airline was still struggling to get its systems reset more than a week after the initial outage. Microsoft Azure, a heavy user of the Falcon software, was also hit hard, as the public cloud service needed to restart many systems individually. Worldwide damage from the outage was estimated at $10 billion.
The outage raised many questions. What was the cause of it? Doesn’t CrowdStrike test its updates before pushing them out? How could such a massive outage happen in today’s sophisticated IT age? And is the IBM i susceptible to such an event?
The answers, it turns out, are revealing.
Like all security software companies, CrowdStrike continually updates its client software to keep up with constantly changing security threats. As one of the world’s most popular providers of endpoint protection software, and a close Microsoft partner, CrowdStrike has nearly unfettered access to low-level internals of the Windows operating system. Despite its trusted position, CrowdStrike fell victim to bad DevOps practices, according to its own root cause analysis of the incident.
The post-mortem identified several problems with CrowdStrike’s coding and testing processes. The core problem had to do with a parameter count mismatch in the machine learning-based sensor detection engine, which led to the BSOD in what the company calls the Channel File 291 incident. But the bigger problem is arguably the lack of testing conducted by CrowdStrike.
According to CrowdStrike, an update to Channel File 291 contained an array that was coded with 21 fields, or parameters. However, the sensor itself was developed with only 20 parameters.
“The Content Interpreter expected only 20 values,” the company says. “Therefore, the attempt to access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash.”
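To make the failure mode concrete, here is a minimal, hypothetical C++ sketch, not CrowdStrike’s actual code, of an interpreter built for 20 fields being handed content that references a 21st. The unchecked read walks off the end of the array, and in kernel mode that kind of fault can take down the whole machine.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical illustration only -- not CrowdStrike's actual code.
// A "content interpreter" that was built expecting 20 input fields.
constexpr std::size_t kExpectedFields = 20;

// Reads the field at the requested index without bounds checking,
// mirroring the class of bug described in the root cause analysis.
int read_field_unchecked(const int* fields, std::size_t index) {
    return fields[index];  // index 20 on a 20-element array reads past the end
}

int main() {
    int fields[kExpectedFields] = {};   // only 20 values exist
    std::size_t requested = 20;         // the content asks for a 21st value

    // Undefined behavior: an out-of-bounds read. In user space this may return
    // garbage or crash the process; in kernel mode it can crash the machine.
    std::printf("%d\n", read_field_unchecked(fields, requested));

    // A bounds check would have turned this into a recoverable error:
    // if (requested >= kExpectedFields) { /* reject the update, keep running */ }
    return 0;
}
```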
The parameter count mismatch “evaded multiple layers of build validation and testing,” the company said. Compounding the problem, CrowdStrike failed to run regression tests for compatibility with older data formats, testing only “the happy path,” and it never tested the update with invalid data, only valid data.
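A hedged sketch of the kind of negative-path testing the post-mortem says was missing: the parser below uses invented names, not CrowdStrike’s code, and rejects content whose field count doesn’t match what the engine expects, while the tests exercise malformed inputs as well as the happy path.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical sketch of the validation and tests that were missing.
// parse_fields() rejects content whose field count doesn't match what the
// engine was built for, instead of blindly indexing into it.
std::optional<std::vector<int>> parse_fields(const std::vector<int>& raw,
                                             std::size_t expected_count) {
    if (raw.size() != expected_count) {
        return std::nullopt;  // malformed content: refuse it, don't crash
    }
    return raw;
}

int main() {
    // Happy path: content with the expected 20 fields is accepted.
    assert(parse_fields(std::vector<int>(20, 0), 20).has_value());

    // Unhappy paths the post-mortem says were never exercised:
    assert(!parse_fields(std::vector<int>(21, 0), 20).has_value());  // too many fields
    assert(!parse_fields({}, 20).has_value());                       // empty content
    return 0;
}
```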
Making matters worse, CrowdStrike did not use a staggered release. Instead, it pushed the update to all clients simultaneously, triggering the BSOD on some 8.5 million computers worldwide.
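For readers unfamiliar with the practice, a staggered (or ring-based) rollout pushes an update to a small slice of the fleet first and halts promotion if telemetry looks bad. The C++ sketch below is purely illustrative; the ring names, thresholds, and functions are assumptions, not any vendor’s real deployment API.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sketch of a staggered (ring-based) rollout, the deployment
// pattern that was not used for this update. Names and thresholds are
// illustrative only.
struct Ring {
    std::string name;
    double fraction;  // share of the fleet in this ring
};

// Promote the update ring by ring, stopping as soon as a ring reports
// a crash rate above the acceptable threshold.
bool rollout(const std::vector<Ring>& rings,
             double (*observe_crash_rate)(const Ring&),
             double max_crash_rate) {
    for (const Ring& ring : rings) {
        double rate = observe_crash_rate(ring);
        std::printf("ring %s (%.0f%% of fleet): crash rate %.2f%%\n",
                    ring.name.c_str(), ring.fraction * 100, rate * 100);
        if (rate > max_crash_rate) {
            std::printf("halting rollout at ring %s\n", ring.name.c_str());
            return false;  // the bad update never reaches the rest of the fleet
        }
    }
    return true;
}

int main() {
    std::vector<Ring> rings = {{"canary", 0.01}, {"early", 0.10}, {"broad", 0.89}};
    // Simulated telemetry: the canary ring crashes immediately.
    auto crash_rate = [](const Ring&) { return 1.0; };
    rollout(rings, crash_rate, 0.01);
    return 0;
}
```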
CrowdStrike has pledged to fix the DevOps defects that led to the outage, including better test procedures, more automated testing, and more deployment layers and acceptance checks. Customers will be given more control over how the updates are deployed, and CrowdStrike will work with two third-party security vendors to review its coding and end-to-end quality control and release processes, the company says.
“We are using the lessons learned from this incident to better serve our customers,” CrowdStrike CEO and founder George Kurtz wrote.
So, could such an event happen on IBM i? While the IBM midrange box is immune to the BSOD (the IBM i’s death screen would likely be green), it’s a question that many have raised, according to Pascal Polverini, the CTO of Polverini & Partners.
“I’ve seen some posts from people saying ‘This could not happen on IBM i,’” Polverini told IT Jungle recently. “No, it could also happen on IBM i. No matter what, if you are an insurance company and you’ve got cloud services and you’ve got a bug, that’s it. You are fried.”
Polverini recently launched a suite of end-to-end testing tools for IBM i. From unit and integration tests to regression and stress tests, the ReplicTest suite is designed to serve all the testing needs of IBM i customers.
The CrowdStrike incident was a rude reminder of just how critical testing is. Without conducting a thorough analysis of changes to software, even small errors can turn into large events that cause ripple effects across a company, an industry, and even the globe.
“We all know testing is important, of course. Any programmer spends on average one-third of his time to test,” Polverini said. “If you can free this time because you have a complete suite tool to do that, everyone sleeps better, which is always good. Everything will be more solid, and you will have time to program more.”
Besides the lack of a staggered deployment, IMHO the fundamental flaw here is one of design and computer architecture. Good architecture, like its physical counterpart, protects you from complexity and accidents.
People seem not to realize how many now-critical systems really run on very frail foundations.
Having a third-party component, updated by an external company whenever it wishes, and basically running in kernel space (!?!), is a really big hazard.
Windows was born as a single-user-focused operating system, and it shows.
On IBM i, the advantage is that SLIC-level patches and the system itself are issued by a single company that also manages the hardware, and its object-level concept can protect against things that shouldn’t be executable.
Ideally, an operating system should have a strictly specified interface for security software to plug into, maybe even running such security software on a dedicated core (we have plenty today) rather than letting it run freely on the system. A fault or bug in such a security system should be isolated and handled by policy (i.e., restrict the system, keep running…). That the industry accepts such a “crowdstr0ke” architecture for sensitive things is really out of this world IMHO.
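To illustrate the commenter’s point, here is a hypothetical C++ sketch of a security module confined behind a narrow interface, with a fault policy that restricts the system and keeps running instead of crashing it. Every name in it is invented for illustration; no real operating system or vendor API is implied.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical sketch of the commenter's idea: a security module running
// behind a strict interface, with a policy that decides what happens when
// the module faults, instead of the fault taking the whole system down.
// All names here are illustrative, not a real OS or vendor API.

enum class FaultPolicy { RestrictAndContinue, HaltSystem };

struct ScanResult { bool threat_found; };

// The only entry point the OS exposes to the security module.
ScanResult scan_file(const std::string& path) {
    if (path.empty()) throw std::runtime_error("malformed input");  // simulated bug
    return {false};
}

// The OS-side supervisor: a fault in the module is contained and handled
// according to policy rather than crashing the system.
void supervised_scan(const std::string& path, FaultPolicy policy) {
    try {
        ScanResult r = scan_file(path);
        std::printf("scan of %s: %s\n", path.c_str(),
                    r.threat_found ? "threat" : "clean");
    } catch (const std::exception& e) {
        std::printf("security module fault: %s\n", e.what());
        if (policy == FaultPolicy::RestrictAndContinue) {
            std::printf("policy: restrict new network connections, keep running\n");
        } else {
            std::printf("policy: halt the system\n");
        }
    }
}

int main() {
    supervised_scan("payroll.dat", FaultPolicy::RestrictAndContinue);
    supervised_scan("", FaultPolicy::RestrictAndContinue);  // triggers the simulated fault
    return 0;
}
```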