Yesterday, content delivery network provider Cloudflare suffered a global outage that affected websites around the world. Users saw 502 errors and were unable to load websites for over half an hour.
Cloudflare has around 16 million customers, including some of the world’s most popular websites. It offers a range of services including content delivery, load balancing, routing, DDoS protection and firewalls.
Visitors to websites that used Cloudflare’s services received 502 errors, which were caused by a massive spike in CPU utilisation across Cloudflare’s network. The spike was triggered by a faulty software deployment, which left sites unable to load and showing an error page instead. Once the deployment was rolled back, service returned to normal.
This latest Cloudflare downtime comes just a week after network performance issues in which sites were affected by a BGP routing leak.
Ironically, the popular monitoring site Downdetector was also down, so people could not use it to check whether sites were affected.
Because of the scale of the downtime, there was speculation that Cloudflare had suffered a DDoS attack, but the company has since revealed on its official blog that the outage was caused by a software deployment.
In the blog post, Cloudflare stated: “Starting at 13:42 UTC today we experienced a global outage across our network that resulted in visitors to Cloudflare-proxied domains being shown 502 errors (‘Bad Gateway’). The cause of this outage was the deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules.”
Now that so much of the internet is centralised, and many organisations use a single provider for all of their performance and security services, incidents like this are more likely to happen and to affect many sites at once. Website owners should not rely on one provider without having backup or business continuity services in place.
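To illustrate what a backup can look like in practice, here is a minimal Python sketch of client-side failover, offered as an illustration rather than a recommended architecture. The function name `fetch_with_failover`, the `fetch` callback and the example hostnames are all hypothetical; real setups more commonly handle failover at the DNS or load balancer level.

```python
from typing import Callable, Sequence, Tuple


def fetch_with_failover(
    endpoints: Sequence[str],
    fetch: Callable[[str], Tuple[int, str]],
) -> Tuple[str, str]:
    """Try each endpoint in order and return (endpoint, body) from the
    first one that does not answer with a 5xx status such as 502.

    `fetch` is a hypothetical callback that returns (status_code, body).
    """
    last_status = None
    for endpoint in endpoints:
        status, body = fetch(endpoint)
        if status < 500:
            return endpoint, body
        last_status = status  # remember the failure and try the next provider
    raise RuntimeError(f"all endpoints failed, last status {last_status}")


if __name__ == "__main__":
    # Simulated responses: the primary provider returns 502, the backup works.
    responses = {
        "primary.example.com": (502, "Bad Gateway"),
        "backup.example.com": (200, "Hello from the backup"),
    }
    print(fetch_with_failover(list(responses), responses.__getitem__))
```

The sketch only reacts to HTTP status codes; a production version would also need timeouts and handling for connection errors.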
Cloudflare deploys software across its network constantly and has automated systems for testing and deploying updates. Any company that carries out global deployments should roll out updates incrementally: even if a deployment happens ‘overnight’ in one time zone, it is still daytime somewhere else in the world.
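As a rough sketch of what an incremental rollout could look like, the Python below deploys to one group of locations at a time and rolls everything back as soon as a group looks unhealthy. It is not Cloudflare's actual process; the function `staged_rollout`, its callbacks and the location names are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> bool:
    """Deploy stage by stage. If any location in a stage is unhealthy,
    roll back every location deployed so far and return False."""
    deployed: List[str] = []
    for stage in stages:
        for location in stage:
            deploy(location)
            deployed.append(location)
        if not all(healthy(location) for location in stage):
            # Undo in reverse order so the most recent change is reverted first.
            for location in reversed(deployed):
                rollback(location)
            return False
    return True


if __name__ == "__main__":
    live = set()
    # Hypothetical stages: a small canary first, then progressively more traffic.
    stages = [["canary"], ["eu-west", "us-east"], ["rest-of-world"]]
    ok = staged_rollout(
        stages,
        deploy=live.add,
        rollback=live.discard,
        healthy=lambda location: location != "us-east",  # simulate one bad location
    )
    print(ok, sorted(live))
```

In this simulated run the problem is caught at the second stage, so the final stage is never touched and the earlier locations are restored.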
Were you affected by the Cloudflare outage? Let us know in the comments or Tweet us @Hyve!