Navigating tech catastrophes: five key lessons from the CrowdStrike outage
By Chris Drumgoole, General Manager - Cloud and Infrastructure, DXC Technology
Tuesday, 10 September, 2024
In today’s hyperconnected world, where businesses rely heavily on digital platforms, any interruption in service can be catastrophic. This year, a defective software update from cybersecurity vendor CrowdStrike crashed systems running Microsoft’s Windows operating system, causing a major shutdown at many of the world’s largest enterprises across sectors including airlines, banks, government agencies and retailers. Some IT experts have described the CrowdStrike incident as one of the most extensive outages in history.
As organisations continue to recover from this incident, it is essential to reflect on the key lessons learned to mitigate the effects of future disruptions.
1. Contingency planning is essential
The outage has prompted industry-wide discussions about vulnerabilities, data protection, supply chain impacts and other critical concerns. When facing such a crisis, prioritising tasks effectively is vital. Focusing on the most critical aspects first, such as restoring essential systems, can make a significant difference in minimising downtime.
This incident also highlights the need for thorough testing, risk assessments and clear communication protocols to prevent widespread disruptions. Including the entire supply chain in contingency planning is crucial, as third-party risks can significantly impact business operations during outages or cyber threats.
2. Around-the-clock commitment
IT outages do not adhere to traditional business hours, making it imperative for businesses to maintain a 24/7 response capability. Continuous network monitoring, swift incident response and effective resource management are essential for prompt restoration of services.
The ability to respond quickly, regardless of the time of day, can make all the difference in minimising the impact on customers.
3. The human element is vital
While technical solutions are critical, the human touch remains an essential aspect of problem-solving. This outage underscores the challenge of applying best practices for cloud-based infrastructure while keeping people informed and equipped to carry out technology testing.
Technicians often needed to engage directly with end users, guiding them through the complex restoration process. At DXC Technology, for example, some technicians had to talk non-technical users through the steps over the phone. It is at such times that patience and empathy are essential, however high the stakes.
4. Vendor relationships are crucial
Strong relationships with vendors can be pivotal during an IT crisis. Regular contact with vendors, understanding their update processes and maintaining direct communication lines are all essential for effective incident response.
The ability to quickly collaborate with vendors can help address issues more efficiently and reduce the overall impact of the outage.
5. Effective communication is key
Clear and timely communication during a crisis is paramount. Promptly updating customers about the situation, managing expectations and providing transparent information can significantly reduce confusion and anxiety.
Establishing reliable communication channels ensures clarity and helps maintain trust with customers. Additionally, gathering feedback from customers about their experience during the incident can help refine response strategies for future preparedness.
The CrowdStrike outage serves as a powerful reminder that no system is immune to failure. However, by focusing on resilience in infrastructure, transparent communication, preparedness through incident response, continuous monitoring and learning from each incident, organisations can navigate technological catastrophes more effectively.