Facebook Failure

For several hours on 4 October 2021, Facebook disappeared from the Internet. Not just Facebook, but all of the services that Facebook provides. Big deal, right? It’s just a website. Just social media.

Except Facebook isn’t just a website. It is an entire communications ecosystem that has worked very hard to get every person on the planet using its platform for everything. The number of things that broke during the outage is far greater than most people would expect.

So, what happened to cause this outage? And what lessons can we learn from this?

What happened?

The two best accounts of what happened are from Cloudflare and Facebook themselves. Cloudflare describes what things looked like from the outside, while Facebook recounts what happened inside the company.

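In short, according to Facebook’s own account, a command issued during routine maintenance unintentionally took down the connections in Facebook’s backbone network, and a bug in the audit tool that should have blocked the command let it through. Facebook’s DNS servers then withdrew their BGP route advertisements, which essentially told the Internet that Facebook no longer exists. Again, if this were nearly any other company, this would have been no big deal. It also would have been easy to revert the change once it was detected. It would take a little while for the fix to propagate through the Internet and for the site to be accessible again, but it could happen.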

Unfortunately for Facebook, it was not that simple. This appears to be a direct result of running everything off of their own infrastructure. For example, many companies pay another company to host their authoritative DNS, the service that tells the Internet “the address for this domain name is over here.” This means that even if the company’s network goes down, the pointer to the network is still up. Facebook does not do this. They run their own authoritative name servers inside their own network. So when they told the Internet that Facebook no longer exists, they took that pointer with them.

Essentially, they not only bulldozed all of the roads that lead to Facebook, making it impossible to reach, but also removed all signs that point to Facebook, erased it from maps, smudged over satellite imagery of the location, and generally made it appear as though Facebook never existed. This meant that technicians could not remotely make any changes (which is usually the preferred method).

It gets worse.

Facebook runs nearly everything off of its own network. Not just normal corporate communications, but things like access control systems, electronic door locks, telephones… everything. This caused an impressive series of cascading failures: people couldn’t get into buildings, couldn’t call the security specialist who could bypass the electronic access controls, and couldn’t get to the server racks to physically access the networking devices and walk back the bad change.

This is why the outage lasted as long as it did. How do you contact your security specialist when your phone system runs off of the network that is down? And if you have a list of backup phone numbers, it’s probably on an online page, which is also down.

Again, the impacts of Facebook being down extended well beyond facebook.com. Apps and devices kept retrying their lookups for Facebook’s domains, and Cloudflare reported that this flood of extra queries put a heavy load on public DNS resolvers, which could slow down or time out lookups for non-FB services as well. As Facebook also owns WhatsApp, that too was down. This was especially important in areas of the world where WhatsApp is the only free messaging service. FB has worked to provide free access to Facebook and WhatsApp in some less developed countries, so those two services are essentially the Internet and the phone for many people. Thus, an American social media company’s technical blip caused communications isolation for many.

Lessons Learned

That’s all well and good, but what do we do about it?

First, check out this thread from Foone. A couple of points in particular I’d like to highlight:

Print out your most important documents. Points of contact for what to do when you lose power, network, or phone, or when you have to evacuate the building and notify someone, need to be on paper, not just digital. They also need to be easily accessible. Common techniques include sticking a document protector to the door and slipping your important stuff inside, or putting it all in a binder and placing that next to the door (if you’ve ever seen SDS/MSDS, you know what I’m talking about). Make sure that you’ve pared down the info to just what you are comfortable with having on the nonsecure side of the door, because that’s where it is going to be most useful to you in case you can’t get the door open.

For every system, ask “what happens if that system goes down?” You don’t need to build an elaborate threat model involving state actors hacking into your network. I’ve probably seen more outages from someone tripping over a power cable than from any malicious intent. What happens *when* a system goes down (because failure is always an option) is much more important for resilience planning than thinking about *why* the system goes down.

One of the biggest problems for Facebook was that they did not have proper out-of-band (OOB) communications. OOB comms are an alternate communications path, so that when one method goes down, you still have a backup. This is very much like a PACE (Primary, Alternate, Contingency, Emergency) plan. The problem was that the ACE for Facebook rode the same network as the P. That is, they didn’t actually have a separate path.

A good OOB comms plan will have separate physical infrastructure from a separate provider. For example, if my main Internet provider is AT&T, then I need a backup that is not AT&T. If my phones are also through AT&T, it is not truly OOB communications. Same for cell phones. Yes, those are three paths, but they share enough infrastructure that I still need something different to truly have OOB communications.
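One way to sanity-check a PACE plan against this rule is to write down, for each path, who provides it and what physical infrastructure it rides on, then look for overlaps. The following is only a minimal sketch under stated assumptions: the `CommsPath` fields, the infrastructure labels, and the satellite provider name are all made up for illustration, and a real inventory would need far more detail.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class CommsPath:
    name: str
    provider: str               # company that operates the service
    infrastructure: frozenset   # hypothetical labels for what the path rides on


def shared_dependencies(paths):
    """Return {(name_a, name_b): shared items} for every pair of paths that overlap."""
    overlaps = {}
    for a, b in combinations(paths, 2):
        shared = set(a.infrastructure & b.infrastructure)
        if a.provider == b.provider:
            shared.add(f"provider:{a.provider}")
        if shared:
            overlaps[(a.name, b.name)] = shared
    return overlaps


# Illustrative plan mirroring the AT&T example; every value here is an assumption.
plan = [
    CommsPath("primary: office internet", "AT&T", frozenset({"fiber-entry-A"})),
    CommsPath("alternate: desk phones", "AT&T", frozenset({"fiber-entry-A"})),
    CommsPath("contingency: cell phones", "AT&T", frozenset({"cell-tower"})),
    CommsPath("emergency: satellite phone", "hypothetical-sat-co", frozenset({"satellite"})),
]

if __name__ == "__main__":
    for pair, shared in shared_dependencies(plan).items():
        print(pair, "share", sorted(shared))
```

Any pair that shows up in the result is not truly out of band with respect to each other. In this made-up plan, only the satellite phone shares nothing with the other three paths.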

Ensuring that some of your specialized systems are on separate networks is also advisable. A good rule of thumb is that if a system is important enough to have a battery backup in case of power failure, it should have some sort of network backup or be on an alternate network in case of network failure. Electronic door locks are an example from this scenario. Because the locks were on the same network that carried user traffic, they failed right along with everything else. If they had been on their own dedicated network with a separate path to the authentication servers (as is common for systems that are more operational technology (OT) than information technology (IT), such as Industrial Control Systems, HVAC, etc.), then they could have continued working even when the main network went down.
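The battery-backup rule of thumb is easy to turn into a quick audit of a system inventory. This is a rough sketch, not a real tool: the `System` record, the network names, and the sample inventory are all hypothetical, and it assumes you already know which networks each system can use to reach the servers it depends on.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    has_battery_backup: bool
    networks: tuple  # hypothetical names of networks the system can use


def missing_network_backup(systems, main_network="corporate-lan"):
    """Names of battery-backed systems that have no path other than the main network."""
    return [
        s.name
        for s in systems
        if s.has_battery_backup and set(s.networks) <= {main_network}
    ]


# Made-up inventory for illustration only.
inventory = [
    System("door-locks", True, ("corporate-lan",)),
    System("hvac-controller", True, ("corporate-lan", "ot-network")),
    System("lobby-display", False, ("corporate-lan",)),
]

if __name__ == "__main__":
    print(missing_network_backup(inventory))
```

Anything this returns is a system you already decided was worth keeping alive through a power failure, but which will still go dark when the main network does.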

If standing up a completely separate network for your non-IT systems is not feasible, at least have a method of physically bypassing the electronic device to complete the job manually. The Facebook incident has just added to the body of evidence that many systems do not handle failure to connect to the Internet well. Thermostats that set heat to max and automated pet feeders that stop dispensing food are just some examples of poor design. Every electronic lock needs a physical key, and every thermostat needs to accept button-pushes, not just app instructions.

As Foone’s thread points out, sometimes you’ll just bypass the lock/door completely. Depending on your risk profile, this may mean that you set off some alarms. This is particularly true in areas that need special access controls, such as server rooms and SCIFs. To that end, part of the documentation you should have printed out is when it is acceptable to take an action that will raise an alarm (preferably with the phone number of the people who will receive the alarm, so that you can explain the situation and they don’t send cops in with guns blazing).

Potentially the hardest part of preparing for network failure is simply understanding what traverses your network. What are your business processes? What resources do those processes access? What is the path to those resources? What are the components that can cause that path, and the process, to fail? What processes happen completely locally? What processes need to traverse the Internet? This is simple to state, but extremely difficult to do in practice.
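Even a crude map of which processes depend on which components makes the “what happens if that system goes down?” question answerable. The sketch below assumes a deliberately simplified model in which every listed component is required (no redundancy), and every process and component name is invented for illustration.

```python
from collections import Counter

# Hypothetical map: business process -> components on the path to its resources.
DEPENDENCIES = {
    "badge in at front door": {"door controller", "corporate LAN", "auth server"},
    "call security specialist": {"VoIP phones", "corporate LAN"},
    "look up emergency contacts": {"wiki", "corporate LAN", "DNS"},
    "read printed contact sheet": set(),  # completely local, nothing to fail
}


def impacted_processes(dependencies, failed_component):
    """Processes that stop working when the given component fails."""
    return sorted(
        process
        for process, components in dependencies.items()
        if failed_component in components
    )


def most_shared_components(dependencies):
    """(component, number of dependent processes) pairs, most-shared first."""
    counts = Counter(c for components in dependencies.values() for c in components)
    return counts.most_common()


if __name__ == "__main__":
    print(impacted_processes(DEPENDENCIES, "corporate LAN"))
    print(most_shared_components(DEPENDENCIES)[0])
```

The component at the top of the second list is the one whose failure hurts the most, which makes it the first candidate for a backup path or a manual workaround.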

Once you have that information, now what? This is the essence of risk management: understanding what is going on, what risks to accept, and what risks to mitigate. This can be especially difficult on tactical networks, where the Internet connection can be degraded, yet hosting the storage and compute locally to complete the task may not be practical given the power, space, and cooling it requires. There is often no good answer.