One year ago, a faulty update from a cybersecurity firm took down hospitals, airlines, banks, and government offices around the world.

On July 19, 2024, Crowdstrike pushed an update to Falcon, its software that runs on Microsoft Windows computers and collects data on potential new cyberattack methods.

The routine operation turned into a “Blue Screen of Death” (BSOD) for roughly 8.5 million Windows devices in what many considered one of the largest IT outages in history.

The fallout meant significant financial losses for Crowdstrike’s customers, estimated at around $10 billion (€8.59 billion). 

“There were no real warning signs that an incident of this nature was likely,” Steve Sands, fellow of the Chartered Institute for IT, told Euronews Next.

“Most organisations that rely on Windows would have had no planning in place to cater for such an event”.

But what did Crowdstrike learn from the outage and what can other companies do to avoid the next one? 

‘Round-the-clock’ surveillance of IT environment needed

A year after the Crowdstrike outage, further outages at banks and “major service providers” suggest that the cybersecurity community hasn’t changed much, according to Eileen Haggerty, vice president of product and solutions at cloud security company NETSCOUT.

So far this year, a Google Cloud outage brought down Cloudflare and Spotify in June, changes to Microsoft’s Authenticator app led to an outage for thousands using Outlook or Gmail in July, and a software flaw at SentinelOne deleted the critical networks necessary to keep its programs running.

Haggerty said that companies need visibility into their systems so they can respond to possible software problems before they happen, which means “round-the-clock monitoring” of their networks and their entire IT environment.

Haggerty suggests that IT teams conduct “synthetic tests,” which simulate how a site would handle real traffic before a critical function fails. 

These tests would provide companies “with the vital foresight they need to anticipate issues before they even have a chance to materialise,” she added. 
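
As a rough illustration of the idea, a synthetic test can be as simple as a scripted probe that exercises a critical function on a schedule and flags failures or slow responses. The sketch below is a minimal example in Python and does not describe NETSCOUT's or any other vendor's product; the names `run_synthetic_check`, `ProbeResult` and `max_latency_s` are assumptions made for illustration, and the probe is passed in as a plain function so the example runs without a network.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    name: str
    ok: bool
    latency_s: float
    detail: str = ""


def run_synthetic_check(
    name: str,
    probe: Callable[[], bool],
    max_latency_s: float = 1.0,
) -> ProbeResult:
    """Run one scripted probe and judge it on success and response time.

    `probe` is a hypothetical stand-in for a scripted user journey
    (for example, logging in or loading a payment page).
    """
    start = time.perf_counter()
    try:
        passed = probe()
        detail = "" if passed else "probe returned a failing result"
    except Exception as exc:  # a crashed probe counts as a failed check
        passed = False
        detail = f"probe raised {type(exc).__name__}: {exc}"
    latency = time.perf_counter() - start

    # A correct but slow response is still treated as a failure.
    if passed and latency > max_latency_s:
        passed = False
        detail = f"too slow: {latency:.3f}s > {max_latency_s:.3f}s"
    return ProbeResult(name, passed, latency, detail)


if __name__ == "__main__":
    # Illustrative probes only; a real one would call the live service.
    print(run_synthetic_check("login", lambda: True))
    print(run_synthetic_check("checkout", lambda: False))
```

In practice a scheduler would run checks like this around the clock and raise an alert when one fails.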

In a blog post, Microsoft said that synthetic monitoring is not airtight and is not always “representative of the user experience,” because organisations often push new releases, which can cause the whole system to become unstable. 

The blog post added that synthetic monitoring can still shorten the time it takes to fix a mistake once it is spotted.

Haggerty also suggests that, after an outage, companies build a detailed repository of information about why the incident happened so they can anticipate potential challenges before they become an issue.

Sands said these reports should include plans for resilience and recovery, along with an evaluation of where the company relies on external companies.

Any company looking to build in “resilience” should do so as early as possible, since it is difficult for it to be “bolted on later,” he said.

“Many companies will have updated their incident response plans based on what happened,” Sands said.

“However, experience tells us that many will already have forgotten the relatively short-term impact and chaos caused and will have done little or nothing”.

Nathalie Devillier, an expert at the EU’s European Cyber Competence Centre, told Euronews last year that cloud and IT security providers serving Europe should be based on the continent.

“Both should be in the European space so as not to rely on foreign technology solutions that, as we can see today, have impacts on our machines, on our servers, on our data every day,” she said at the time. 

What has Crowdstrike itself done after the outage?

Crowdstrike said in a blog post this month that it developed a self-recovery mode to “detect crash loops and … transition systems into safe mode” on its own.
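
Crowdstrike has not published the logic behind this mode in the blog post quoted here, so the following Python sketch is only a generic illustration of how crash-loop detection can work: count the boots that failed to reach a healthy state within a recent time window and switch to safe mode once a threshold is hit. The class name `BootTracker`, the threshold of three failed boots and the ten-minute window are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BootTracker:
    """Hypothetical crash-loop detector, not Crowdstrike's implementation."""

    max_failed_boots: int = 3   # assumed threshold
    window_s: float = 600.0     # assumed look-back window (10 minutes)
    failed_boot_times: List[float] = field(default_factory=list)

    def record_boot(self, now: float, previous_boot_healthy: bool) -> str:
        """Return "safe" if the machine appears stuck in a crash loop."""
        if previous_boot_healthy:
            # A healthy boot breaks the loop, so the history is reset.
            self.failed_boot_times.clear()
            return "normal"

        self.failed_boot_times.append(now)
        # Keep only the failures that fall inside the look-back window.
        self.failed_boot_times = [
            t for t in self.failed_boot_times if now - t <= self.window_s
        ]
        if len(self.failed_boot_times) >= self.max_failed_boots:
            return "safe"
        return "normal"


if __name__ == "__main__":
    tracker = BootTracker()
    for seconds in (0.0, 60.0, 120.0):
        print(seconds, tracker.record_boot(seconds, previous_boot_healthy=False))
```

The design choice worth noting is the time window: isolated crashes days apart do not trigger safe mode, while several failed boots in quick succession do.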

There’s also a new interface that gives the company’s customers greater flexibility in testing system updates, such as setting different deployment schedules for test systems and critical infrastructure so that updates do not reach both at the same time.

A content pinning feature also lets customers lock specific versions of their content and choose when and how updates are applied. 
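
To make the two ideas concrete, the Python sketch below shows one way staggered deployment schedules and version pinning could be combined. It is a hypothetical policy, not Crowdstrike's interface: the ring names, the delays in `RING_DELAY_HOURS` and the function `eligible_version` are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed delays: how long after release each group of machines waits.
RING_DELAY_HOURS = {"test": 0, "general": 24, "critical": 72}


@dataclass
class Host:
    name: str
    ring: str                             # "test", "general" or "critical"
    pinned_version: Optional[str] = None  # set to lock a specific version


def eligible_version(
    host: Host,
    releases: List[Tuple[str, float]],
    now_h: float,
) -> Optional[str]:
    """Pick the version a host should run at time `now_h` (in hours).

    `releases` holds (version, release time in hours), oldest first.
    """
    # A pinned host ignores the schedule until the pin is removed.
    if host.pinned_version is not None:
        return host.pinned_version

    delay = RING_DELAY_HOURS[host.ring]
    chosen = None
    for version, released_at_h in releases:
        # A release is eligible once it has aged past the ring's delay.
        if now_h - released_at_h >= delay:
            chosen = version
    return chosen


if __name__ == "__main__":
    releases = [("1.0", 0.0), ("1.1", 100.0)]
    for host in (
        Host("lab-01", "test"),
        Host("ward-pc", "critical"),
        Host("atm-07", "critical", pinned_version="0.9"),
    ):
        print(host.name, eligible_version(host, releases, now_h=110.0))
```

Under a policy like this, a faulty release would reach test machines first, leaving time to halt it before it is due on critical systems.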

Crowdstrike also now has a Digital Operations Center that it says will give the company “deeper visibility and faster response” across the millions of computers using the technology worldwide.

It also conducts regular reviews of its code, quality processes and operational procedures.

“What defined us wasn’t that moment, it was everything that came next,” George Kurtz, the CEO of Crowdstrike, said in a LinkedIn post this week, noting that the company is now “grounded in resilience, transparency and relentless execution”. 

While Crowdstrike has made some changes, Sands believes it might be “an impossible ask” to avoid another outage at that same level because computers and networks “are by their nature highly complex with many dependencies”.

“We can certainly improve the resilience of our systems from an architecture and design perspective … and we can prepare better to detect, respond and recover our systems when outages happen,” he said.
