On a Tuesday at the beginning of last month, there was a sudden spike in the number of support requests we were getting. Various companies were reporting that seemingly unrelated services had all started throwing errors. It turned out these were all connected. Half of the internet had gone AWOL...
Upon further investigation, it transpired that Cloudflare were having a major outage. Cloudflare may not be a company name you've ever even heard of before. However, they provide services that are essential to the backbone of a massive chunk of the internet. If that fails, as it did here, all the services that rely on that infrastructure for their operation or data simply stop working. In this situation, waiting is the only solution - there's no way to coax these apps back into life until they can get back to the data they need.
Why did it happen?
In this specific instance, the problem was down to human error. Cloudflare have written a very good post-mortem blog post that explains in great depth what went wrong, and what they've done to ensure it can't go wrong in that way again. It gets very technical very quickly, but this sort of analysis is refreshing to see - often there's very little detail available at all when problems like these occur.
Cloudflare is by no means the only provider of internet services to have had issues. Pretty much all services experience the occasional downtime - Microsoft Office 365, Google et al. As we move more and more of our data into "the cloud", this will become an increasingly serious problem for many companies, as it can cause massive disruption to people's ability to get their job done (depending on the service that has disappeared).
"This would never have happened back in my day..."
Of course, while it's tempting to wave your fist at "the cloud" and wonder if we're moving backwards, it's easy to forget that hosting these services onsite isn't necessarily any more reliable. Any system with moving parts is either going to break, or going to need maintenance - including downtime - to prevent it from breaking.
The difference when a service is run locally is that you feel more in control when these things happen. You (or we) have direct access to the thing that has gone wrong and can take direct action to resolve it. That doesn't necessarily mean it can be fixed quickly though. The most serious problems can sometimes take days to fix.
This is a problem that cloud services have mitigated somewhat. Yes, things can go wrong. However, they're run on such a massive scale, with the sort of redundancy that only the biggest companies could ever afford, that these outages - while annoying - tend to be quite short and rarely catastrophic.
Nevertheless, we do share your frustration when one of these services disappears and there's little we can do in the moment to help. The inevitability of these sorts of problems with any service may be something that early adopters of Elon Musk's recently proposed plan to put chips into our brains want to consider before they take the plunge...