If you were part of the endless legions of IT workers furiously fixing Windows machines over the weekend thanks to the CrowdStrike bug, I salute your service—and if you were affected by the disruptions to flights, hospital services, banking and more, I commiserate. Most of us, however, remained unaffected, as, according to Microsoft, less than 1% of Windows devices fell victim to the bug.
Still, that’s 8.5 million devices causing turmoil worldwide, and as a result, Microsoft says it deployed hundreds of its engineers and experts to work with customers to restore their stricken services (via The Verge). Microsoft also engaged directly with CrowdStrike to work on a solution, and CrowdStrike released its own, separate statement covering some of the technical issues that caused the event.
At the core of the fault was a configuration file (a "channel file", in CrowdStrike's terminology) delivered in an update to CrowdStrike’s Falcon platform, which triggered a logic error that in turn caused a blue screen of death (BSOD) boot loop on Windows systems running the Falcon sensor software.
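CrowdStrike hasn't published the offending code, so the following is a purely hypothetical, user-mode sketch of the failure class being described: code that trusts a value read from a configuration file and uses it to index a fixed-size table. In a boot-start kernel driver, the resulting invalid memory access doesn't just kill a process, it halts the whole machine—and does so again on every restart until the bad file is removed.

```c
/* Hypothetical illustration only -- not CrowdStrike's code. It shows how a
   single missing bounds check on config-supplied data becomes a crash. */
#include <stdio.h>

#define TABLE_ENTRIES 16

typedef struct {
    unsigned int index;              /* value parsed from the config file */
} rule_t;

static const char *table[TABLE_ENTRIES];  /* fixed-size internal table */

static const char *lookup(const rule_t *rule) {
    /* Bug: the index comes straight from file data and is never validated.
       A malformed file supplying index >= TABLE_ENTRIES reads past the end
       of the table; in kernel mode that invalid read is a bugcheck (BSOD).
       The fix is one line: if (rule->index >= TABLE_ENTRIES) return NULL; */
    return table[rule->index];
}

int main(void) {
    rule_t bad = { .index = 0xFFFF };         /* stands in for malformed data */
    printf("%p\n", (void *)lookup(&bad));     /* out-of-bounds read */
    return 0;
}
```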
The update was designed to “target newly observed, malicious named pipes being used by common C2 frameworks in cyberattacks”, but instead threw some very important infrastructure into a loop, causing a gigantic knock-on effect.
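If "named pipes" means nothing to you: they're a standard Windows inter-process communication primitive that legitimate software uses constantly, which is exactly why command-and-control (C2) tooling likes to hide behind them. As a minimal sketch (the pipe name here is made up; real detections key on names observed in actual attack tooling), creating one looks like this:

```c
/* Minimal sketch of the legitimate Windows named pipe API -- the mechanism
   the Falcon update's detection content was meant to watch. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    HANDLE pipe = CreateNamedPipeA(
        "\\\\.\\pipe\\demo_pipe",    /* hypothetical pipe name */
        PIPE_ACCESS_DUPLEX,          /* read and write */
        PIPE_TYPE_MESSAGE | PIPE_READMODE_MESSAGE | PIPE_WAIT,
        1,                           /* one instance */
        512, 512,                    /* output/input buffer sizes */
        0,                           /* default timeout */
        NULL);                       /* default security attributes */
    if (pipe == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateNamedPipeA failed: %lu\n", GetLastError());
        return 1;
    }
    puts("Pipe created; waiting for a client to connect...");
    ConnectNamedPipe(pipe, NULL);    /* blocks until a client connects */
    CloseHandle(pipe);
    return 0;
}
```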
CrowdStrike has since corrected the logic error in a further update, and Microsoft has released a custom recovery tool to remove the faulty file. Prior to the tool's release, admins needed to boot affected Windows devices into Safe Mode or the Windows Recovery Environment and delete the buggy file manually.
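For reference, CrowdStrike's published workaround had admins delete files matching C-00000291*.sys from the CrowdStrike drivers directory. In practice that was a one-line del command at a Safe Mode prompt, but here's the same match-and-delete step sketched in C, purely for illustration (it needs admin rights, and obviously shouldn't be run on a healthy machine):

```c
/* Illustrative sketch of the manual remediation step: find and delete the
   faulty channel files. Real-world fixes used Safe Mode and a del command,
   or Microsoft's recovery tool. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    const char *dir = "C:\\Windows\\System32\\drivers\\CrowdStrike\\";
    char pattern[MAX_PATH];
    snprintf(pattern, sizeof pattern, "%sC-00000291*.sys", dir);

    WIN32_FIND_DATAA fd;
    HANDLE h = FindFirstFileA(pattern, &fd);   /* expand the wildcard */
    if (h == INVALID_HANDLE_VALUE) {
        puts("No matching channel files found.");
        return 0;
    }
    do {
        char path[MAX_PATH];
        snprintf(path, sizeof path, "%s%s", dir, fd.cFileName);
        if (DeleteFileA(path))
            printf("Deleted %s\n", path);
        else
            printf("Could not delete %s (error %lu)\n", path, GetLastError());
    } while (FindNextFileA(h, &fd));
    FindClose(h);
    return 0;
}
```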
However, questions have been raised as to how such an update was allowed onto critical Windows systems en masse in the first place, causing a disaster that may end up being one of the worst tech outages of all time. Ex-Microsoft engineer David W Plummer has tweeted a comparison of how Windows debugging was handled during his time at the company, and how this particular event differs.
"How we did this in the old days: When I was on Windows, this was the type of thing that greeted you every morning. Every. Single. Morning. You see, we all had a secondary 'debug' PC, and each night we'd run NTStress on all of them, and all the lab machines. NTStress would…" (pic.twitter.com/rZkvpujbcr, July 20, 2024)
The problem in this case is that the event was caused by a CrowdStrike driver that passed WHQL (Windows Hardware Quality Labs) testing but still possessed the capability to download and execute p-code that hadn't been signed by Microsoft. Essentially, a third-party driver sitting at the heart of the system can still bring it down with a dodgy update, even if Microsoft's processes for its own updates have appropriate levels of testing and certification.
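To make that architecture concrete, here's a purely illustrative, user-mode analogue (and only an analogue—Falcon's actual internals aren't public): the compiled binary stands in for the WHQL-signed driver and never changes, yet what it does is entirely dictated by uncertified data it loads at run time.

```c
/* User-mode analogue of the trust boundary: a fixed, "certified" interpreter
   whose behaviour is driven by p-code delivered as data. The opcodes are
   hypothetical -- the point is that certification covered the interpreter,
   not the content it later executes. */
#include <stdio.h>
#include <stddef.h>

enum { OP_HALT, OP_LOG, OP_BLOCK };    /* made-up p-code opcodes */

static void interpret(const unsigned char *pcode, size_t len) {
    for (size_t ip = 0; ip < len; ip++) {
        switch (pcode[ip]) {
        case OP_HALT:  return;
        case OP_LOG:   puts("log event");        break;
        case OP_BLOCK: puts("block named pipe"); break;
        default:
            /* An opcode the certified interpreter never anticipated: from
               here on, nothing that happens was covered by WHQL testing. */
            puts("undefined behaviour territory");
            return;
        }
    }
}

int main(void) {
    /* Stands in for a downloaded content update: pure data, never signed,
       yet it fully determines what the "certified" component does. */
    const unsigned char update[] = { OP_LOG, OP_BLOCK, 0x29, OP_HALT };
    interpret(update, sizeof update);
    return 0;
}
```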
Well, it’s all been a bit of a clusterfudge, hasn’t it? Microsoft is unlikely to be happy that its name is once again in the headlines for server-related issues, although in recent years it’s often been security breaches that have earned it criticism. As of now, the issue appears to have been fixed, at least, and perhaps some lessons have been learned for third-party updates in future.