Mark Zuckerberg was left counting the personal cost of bad PR yesterday (about $6 billion, according to Bloomberg) on a day when his company couldn’t get out of the news headlines, for all the wrong reasons.
The billionaire Facebook CEO’s bad day at the office started with whistleblower Frances Haugen finally revealing her identity in a round of interviews that looked set to lay siege to the Monday headlines. Anonymous revelations by the former Facebook product manager had fuelled an entire Wall Street Journal series about the harm inflicted or ignored by Instagram and Facebook, and her unmasking was its denouement. It was supposed to be big news, and for a while it was.
But then something even bigger happened.
Facebook, Instagram, and WhatsApp completely disappeared. For six hours.
Despite losing access to the world’s favourite confirmation bias apparatus, conspiracy theorists didn’t miss a beat. Putting two and two together to make five, they decided that it was all too convenient and that Facebook was using the dead cat strategy to rob Haugen of the spotlight!
It was a convenient theory, but there is no evidence for it besides an interesting coincidence, and it ignores the fact that Facebook taking itself out to silence a whistleblower is a far more interesting story than Facebook simply taking itself out by accident. I’m afraid that in the absence of more compelling information, Hanlon’s Razor will have to suffice: “Never attribute to malice that which is adequately explained by stupidity”.
What we can say for sure is that Facebook took itself and its stablemates out with a spectacular self-inflicted wound, in the form of a toxic Border Gateway Protocol (BGP) update.
The Internet is a patchwork of hundreds of thousands of separate networks, called Autonomous Systems, that are stitched together with BGP. To route data across the Internet, Autonomous Systems need to know which IP addresses other Autonomous Systems either control or can route traffic to. They share this information with each other using BGP.
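To make that a little more concrete, here is a minimal sketch of the bookkeeping involved, assuming a toy model in which each announcement simply maps an IP prefix to the Autonomous System that originated it. The `RouteTable` class and its methods are hypothetical names invented for this illustration, and the prefixes and AS numbers come from ranges reserved for documentation. Real BGP speakers also track AS paths, policies, and much more.

```python
import ipaddress


class RouteTable:
    """Toy view of the routes one Autonomous System has learned from its peers."""

    def __init__(self):
        # Maps an announced prefix (ip_network) to the AS number that originated it.
        self.routes = {}

    def announce(self, prefix, origin_as):
        """Record that origin_as says it can deliver traffic for prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def lookup(self, address):
        """Return the AS to send traffic to, or None if no known prefix covers address."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        # The most specific (longest) matching prefix wins.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RouteTable()
table.announce("198.51.100.0/24", 64500)  # documentation prefix, documentation AS number
table.announce("198.51.100.0/25", 64501)  # a more specific route for half of that range

print(table.lookup("198.51.100.10"))  # covered by both prefixes; the /25 is chosen
print(table.lookup("203.0.113.1"))    # nobody has announced this range
```

The important property is the last line: if no Autonomous System has announced a prefix covering an address, there is simply nowhere to send the traffic.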
According to Cloudflare, which has published an excellent explanation of what it saw, Facebook’s trouble started when its Autonomous System issued a BGP update withdrawing routes to its own DNS servers. With those servers unreachable, nobody could translate the name facebook.com into an IP address, so the address stopped working. In Cloudflare’s words: “With those withdrawals, Facebook and its sites had effectively disconnected themselves from the Internet.”
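The chain of cause and effect can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a model of Facebook’s actual network: the nameserver address, the `social.example` name, the prefixes, and the AS number are all hypothetical placeholders drawn from documentation ranges, and the sketch only covers the withdrawal of the prefix that contains the DNS servers.

```python
import ipaddress

# Hypothetical values for illustration only (documentation ranges, not real ones).
NAMESERVER_IP = "192.0.2.53"
ZONE = {"social.example": "198.51.100.7"}  # records held by the nameserver

# Routes the rest of the Internet has learned: prefix -> originating AS number.
routes = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,     # prefix containing the nameserver
    ipaddress.ip_network("198.51.100.0/24"): 64500,  # prefix containing the web servers
}


def reachable(address):
    """True if some announced prefix covers the address."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in routes)


def withdraw(prefix):
    """Model a BGP withdrawal: everyone forgets the route to prefix."""
    routes.pop(ipaddress.ip_network(prefix), None)


def resolve(name):
    """Ask the nameserver for an address; fails if the nameserver cannot be reached."""
    if not reachable(NAMESERVER_IP):
        return None  # in real life the query just times out
    return ZONE.get(name)


before = resolve("social.example")  # the nameserver answers with the stored record
withdraw("192.0.2.0/24")            # the update that removes the route to the nameserver
after = resolve("social.example")   # the record still exists, but nobody can ask for it
print(before, after)
```

Nothing was deleted from `ZONE` in this sketch. The records are still there, but with no route to the server that holds them, the name might as well not exist.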
Cloudflare appears to have noticed the problem almost straight away, so we can assume that Facebook did too. So why did it take six more hours to fix it? The social media scuttlebutt, later confirmed in Facebook’s own terse explanation, was that the outage disabled the very tools Facebook’s enormous number of remote workers would normally rely on both to communicate with each other and to fix the problem:
“The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.”
The unconfirmed part of the same scuttlebutt is that Facebook is so 21st century that folks were locked out of offices, and even a server room, which had to be entered forcibly in order to fix the configuration issue locally.
Of course that could just be another conspiracy theory, but as somebody who has themselves been stranded outside a building, forced to look through a glass door at the very computer that controls that door attempting and failing to boot from the broken network I had come to investigate, let me assure you that it’s not an outrageous suggestion.
The Facebook Empire withdrawing itself from the Internet didn’t stop people looking for it, though. In fact, it made them look much, much harder (just imagine everyone, everywhere, frustrated, hitting “refresh” or reinstalling Instagram until they’re bored, and you get the idea). Unanswered DNS requests spiked, and DNS resolvers groaned, as computers groped around in the dark looking for facebook.com domains that could no longer be resolved.
When they weren’t pummelling DNS resolvers, the rest of the Facebook diaspora was forced to find other forms of entertainment or other means of communication. Some local mobile phone operators reported being overwhelmed, and encrypted messaging app Signal said it welcomed “millions” of new users as people looked for alternatives to WhatsApp.
And let’s not forget that there are companies that rely on Facebook, Instagram, and WhatsApp to drive business, and there are services that use Facebook logins for authentication. And then there’s the influencers. All of them had to stop. For six hours. Won’t somebody think of the influencers?
When it finally sank in that nobody could use Facebook, Instagram, or WhatsApp, it started to dawn on us all just how much so many of us have put Facebook and its products at the centre of our lives.
And then we all went to Twitter to tell everyone else how good or bad it all was. Thankfully, it withstood the onslaught.
Which leads us to the “so what?” part of our story. This is a security blog after all, and if this wasn’t a cyberattack, you may be wondering what all of this has to do with security. Where’s the lesson in all of this?
Single points of failure, people.
That’s it. That’s the tweet.