93 minutes & counting
Google’s admitted the 93-minute Compute Engine outage was the fault of autoscaling not working. So it was more of an autofail, really. Ha ha ha.
Still, at least it wasn’t a real-life human person that cocked up. For once.
The Big G put it down to a “network programming failure” and said the autoscaler didn’t work as it should have. This led to Compute Engine being out for over an hour and a half.
Too late to apologise?
In an extraordinarily long quasi-apology, they dribbled:
“Propagation of Google Compute Engine networking configuration for newly created and migrated VMs is handled by two components. The first is responsible for providing a complete list of VMs, networks, firewall rules, and scaling decisions. The second component provides a stream of updates for the components in a specific zone.”
OK mate, but what does that mean? Well, it means that during the outage the first component sent no data at all. So VMs in other zones had no idea how to speak to their newly created or migrated mates. The Autoscaler needed that same initial data too, so it also fell over.
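To make sense of the two-component setup Google describes, here’s a rough sketch: one component hands out a full baseline snapshot, the other streams per-zone updates on top of it. All the names and data shapes below are invented for illustration; they are not Google’s actual internals.

```python
# Hypothetical model of the two-part config propagation Google describes.
# Component 1: a snapshot provider giving the complete picture.
# Component 2: a per-zone stream of incremental updates.

def snapshot_provider():
    """Component 1: complete list of VMs, networks, firewall rules,
    and scaling decisions (all values made up for this sketch)."""
    return {
        "vms": ["vm-1", "vm-2"],
        "networks": ["default"],
        "firewall_rules": ["allow-ssh"],
        "scaling_decisions": [],
    }

def apply_config(zone_state, snapshot, updates):
    """Consumers (other zones, the autoscaler) need the snapshot first;
    the update stream is useless without a baseline to apply it to."""
    if not snapshot:
        # This is roughly what happened in the outage: component 1
        # sent nothing, so everything downstream stalled.
        raise RuntimeError("no baseline config; cannot apply updates")
    zone_state.update(snapshot)
    for delta in updates:  # Component 2: stream of updates for the zone
        zone_state.update(delta)
    return zone_state

state = apply_config({}, snapshot_provider(), [{"vms": ["vm-1", "vm-2", "vm-3"]}])
```

The point of the sketch is the dependency: the delta stream only makes sense relative to a baseline, so an empty snapshot takes out every consumer at once.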
Why though? Apparently a good old “stuck process” (maybe they should have tried turning it off and on again?).
What should have happened is that the failover mechanism should have essentially ctrl-alt-deleted the errant process, killed it off, and let a replacement take over. But instead, it decided not to bother. Maybe it was busy. Maybe it just couldn’t be bothered. Who can tell.
“The engineering team was alerted when the propagation of network configuration information stalled. They manually failed over to the replacement task to restore normal operation of the data persistence layer.”
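The failover that should have fired automatically (and ended up being done by hand) amounts to a watchdog: notice that the propagation task has stopped making progress, kill it, and promote a standby. A toy version, with all names and the timeout invented for illustration:

```python
# Toy watchdog failover: purely illustrative, not Google's code.
import time

class Task:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.last_progress = time.monotonic()

def failover_if_stuck(primary, standby, timeout_s=30.0, now=None):
    """Return the task that should be serving. In the incident, the
    equivalent of this check never fired, so humans had to step in."""
    now = time.monotonic() if now is None else now
    if now - primary.last_progress > timeout_s:
        primary.alive = False  # the ctrl-alt-delete step
        return standby         # replacement task takes over
    return primary

primary, standby = Task("primary"), Task("standby")
primary.last_progress -= 120   # simulate a process stuck for two minutes
serving = failover_if_stuck(primary, standby)
```

A real implementation would also need the replacement task to re-resolve its peers, which is exactly what Google’s second fix below is about.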
They went on to promise they would:
“Stop VM migrations if the configuration data is stale”, and get “the data persistence layer to re-resolve their peers during long-running processes, to allow failover to replacement tasks.”
<shamelessplug>Sometimes, having a fully managed host like, er, well, I dunno, Hyve seems like a very good idea compared to the likes of Google and AWS…</shamelessplug>