Misapplied config change caused Google service outage
Google has blamed a misapplied configuration change for a network issue impacting access to Google services for some users in multiple markets including parts of Australia.
The disruption in Google’s network in parts of the US reduced performance and increased error rates on Google services including Google Cloud Platform, YouTube, Gmail, Google Drive and others for up to four hours on Sunday.
In a blog post, Google Vice President of Engineering Benjamin Treynor Sloss said preliminary analysis suggests that the disruption was caused by a minor configuration change meant for a single region being applied to more servers and regions than expected.
This caused the inadvertently reconfigured regions to stop using more than half of their available network capacity, resulting in network congestion that had flow-on effects as far as Australia.
“For most Google users there was little or no visible change to their services — search queries might have been a fraction of a second slower than usual for a few minutes but soon returned to normal, their Gmail continued to operate without a hiccup and so on,” he said.
“However, for users who rely on services homed in the affected regions, the impact was substantial, particularly for services like YouTube or Google Cloud Storage which use large amounts of network bandwidth to operate.”
While Google’s engineering team identified the issue within seconds, diagnosis and correction took significantly longer. The problem was discovered just before 6 am AEST and was mostly fixed by 9 am, but complete mitigation took until just before 10 am.
The disruption resulted in a 2.5% drop in YouTube views and a 30% reduction in Google Cloud Storage traffic. In addition, around 1% of active Gmail users reported problems with their accounts.
Treynor Sloss said Google’s engineering teams are now conducting a thorough post-mortem of the incident and plan to follow it with a “focused engineering sprint” aimed at putting mitigation strategies in place to prevent a similar incident from happening again.
“We know that people around the world rely on Google’s services, and over the years have come to expect Google to always work. We take that expectation very seriously — it is our mission, and our inspiration. When we fall short, as happened Sunday, it motivates us to learn as much as we can, and to make Google’s services even better, even faster and even more reliable.”