Facebook Failure

For several hours on 4 October 2021, Facebook disappeared from the Internet. Not just Facebook, but all of the services that Facebook provides. Big deal, right? It’s just a website. Just social media.

Except Facebook isn’t just a website. It is an entire communications ecosystem that has worked very hard to get every person on the planet using its platform for everything. The number of things that broke during the outage was far greater than most people would expect.

So, what happened to cause this outage? And what lessons can we learn from this?

What happened?

The two best accounts for what happened are from Cloudflare and Facebook themselves. Cloudflare describes what things looked like from the outside, while Facebook recounts what happened inside the company.

In short, an automation tool (possibly a code review bot) pushed an update that essentially told the Internet that Facebook no longer exists. Again, if this were nearly any other company, this would have been no big deal. It also would have been easy to revert the change once it was detected. It would take a little while for the corrected records to propagate through the Internet before everything became reachable again, but it would happen.

Unfortunately for Facebook, it was not that simple for them. This appears to be a direct result of running everything off of their own infrastructure. For example, most companies pay another company to act as a registrar – that is, to tell the Internet “I own my domain name, and the web address for my domain name is facebook.com.” This means that even if the company’s network goes down, the pointer to the network is still up. Facebook does not do this. They run their own registrar. So when they told the Internet that Facebook no longer exists, they took that pointer with them.

Essentially, they not only bulldozed all of the roads that lead to Facebook, making it impossible to reach; they also removed all signs that point to Facebook, erased it from maps, smudged over satellite imagery of the location, and generally made it appear as though Facebook never existed. This meant that technicians could not remotely make any changes (which is usually the preferred method).
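To make the name-resolution side of this concrete, here is a toy sketch of the lookup chain. This is not Facebook’s actual system, and real DNS is far more involved; every name and address below is invented for illustration. The point is that the registration can still exist while the name servers that answer for it are unreachable:

```python
# Toy model of the DNS lookup chain (illustrative only -- real DNS is
# far more involved). All names and addresses here are made up.

# The registrar/registry layer: which name servers answer for a domain?
REGISTRY = {"example-social.com": ["ns1.example-social.com"]}

# The name servers themselves: which address does a hostname map to?
NAME_SERVERS = {
    "ns1.example-social.com": {"www.example-social.com": "203.0.113.10"}
}

def resolve(hostname: str, domain: str) -> str:
    """Walk the chain: registry -> name server -> address record."""
    nameservers = REGISTRY.get(domain)
    if not nameservers:
        raise LookupError(f"no name servers registered for {domain}")
    for ns in nameservers:
        records = NAME_SERVERS.get(ns)
        if records and hostname in records:
            return records[hostname]
    raise LookupError(f"no reachable name server can answer for {hostname}")

# Normal operation: the chain works end to end.
print(resolve("www.example-social.com", "example-social.com"))  # 203.0.113.10

# The outage: routes to the name servers are withdrawn, so the servers
# are unreachable -- even though the registration itself still exists.
NAME_SERVERS.clear()
try:
    resolve("www.example-social.com", "example-social.com")
except LookupError as err:
    print(err)
```

Running your own registrar and name servers on the same network you just disconnected means both layers of the chain vanish at once.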

It gets worse.

Facebook runs nearly everything off of its own network. Not just normal corporate communications, but things like access control systems, electronic door locks, telephones… everything. This caused an impressive series of cascading failures where people couldn’t get into buildings, couldn’t call the security specialist who could bypass the electronic access controls, and couldn’t get to the server racks to physically reach the networking devices and walk back the bad update.

This is why the outage lasted as long as it did. How do you contact your security specialist when your phone system runs off of the network that is down? And if you have a list of backup phone numbers, it’s probably on an online page, which is also down.

Again, the impacts of Facebook being down extended well beyond facebook.com. The flood of failed lookups and retries for Facebook’s domains put enormous extra load on public DNS resolvers, degrading performance even for unrelated services. As Facebook also owns WhatsApp, that too was down. This was especially important in areas of the world where WhatsApp is the only free messaging service. Facebook has worked to provide free access to Facebook and WhatsApp in some less developed countries, so those two services are essentially the Internet and the phone for many people. Thus, an American social media company’s technical blip caused communications isolation for many.

Lessons Learned

That’s all well and good, but what do we do about it?

First, check out this thread from Foone. A couple of points in particular I’d like to highlight:

Print out your most important documents. Points of contact and procedures for when you lose power, network, or phone, or have to evacuate the building and notify someone, need to be on paper, not just in digital form. They also need to be easily accessible. Common techniques include sticking a document protector to the door and slipping your important stuff inside, or putting it all in a binder placed next to the door (if you’ve ever seen SDS/MSDS sheets, you know what I’m talking about). Make sure that you’ve pared the info down to just what you are comfortable having on the nonsecure side of the door, because that’s where the documents will be most useful to you if you can’t get the door open.

For every system, ask “what happens if that system goes down?” You don’t need to build an elaborate threat model involving state actors hacking into your network. I’ve probably seen more outages from someone tripping over a power cable than from any malicious act. What happens *when* a system goes down (because failure is always an option) is much more important for resilience planning than thinking about *why* the system goes down.

One of the biggest problems for Facebook was that they did not have proper out of band (OOB) communications. OOB comms are an alternate communications path, so that when one method goes down, you still have a backup. This is very much like a PACE (Primary, Alternate, Contingency, Emergency) plan. The problem was that the ACE for Facebook rode the same network as P. That is, they didn’t actually have a separate path.

A good OOB comms plan will have separate physical infrastructure from a separate provider. For example, if my main Internet provider is AT&T, then I need a backup that is not AT&T. If my phones are also through AT&T, it is not truly OOB communications. Same for cell phones. Yes, those are three paths, but they share enough infrastructure that I still need something different to truly have OOB communications.
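The shared-infrastructure analysis above can be sketched in a few lines. The channel names and “carrier” attribute below are invented for illustration; the point is that a PACE plan is only as good as the independence of its paths:

```python
# Sketch of a PACE (Primary, Alternate, Contingency, Emergency) comms
# check. Channel names and carriers are invented for illustration --
# the point is shared-infrastructure analysis.

PACE_PLAN = [
    {"name": "desk VoIP phone", "carrier": "corp-network"},
    {"name": "softphone app",   "carrier": "corp-network"},
    {"name": "cell phone",      "carrier": "cell-provider"},
    {"name": "satellite phone", "carrier": "satellite"},
]

def usable_channels(plan, failed_carriers):
    """Return the channels that survive when the given carriers fail."""
    return [c["name"] for c in plan if c["carrier"] not in failed_carriers]

# If P and A both ride the corporate network, one failure kills both;
# only the truly independent paths (cell, satellite) remain.
print(usable_channels(PACE_PLAN, {"corp-network"}))
```

If that list ever comes back empty for a single carrier failure, your PACE plan is a PACE plan in name only.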

Ensuring that some of your specialized systems are on separate networks is also advisable. A good rule of thumb: if a system is important enough to have a battery backup in case of power failure, it should have some sort of network backup, or be on an alternate network, in case of network failure. The electronic door locks are an example from this scenario. Because the locks were on the same network that carried user traffic, they failed right along with everything else. If they had been on their own dedicated network (as is common for systems that are more operational technology (OT) than information technology (IT), such as industrial control systems, HVAC, etc.), with a separate path to the authentication servers, they could have kept working even when the main network went down.

If standing up a completely separate network for your non-IT systems is not feasible, at least have a method of physically bypassing the electronic device to complete the job manually. The Facebook incident has just added to the body of evidence that many systems do not handle failure to connect to the Internet well. Thermostats that set heat to max, or automated pet feeders that stop dispensing food, are just some examples of poor design. Every electronic lock needs a physical key, and every thermostat needs to accept button presses, not just app instructions.
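Graceful degradation can also be built into the device itself. Here is a sketch (invented names, not a real access-control product) of a door controller that caches the last known-good badges so it can keep working when the central authentication server is unreachable, instead of failing closed and locking everyone out:

```python
# Sketch of a door controller that degrades gracefully: it caches the
# last known-good badges so it keeps working when the central auth
# server is unreachable. All names here are invented.

class DoorController:
    def __init__(self, auth_server):
        self.auth_server = auth_server  # callable: badge_id -> bool
        self.cached_badges = set()      # last known-good badges

    def check_badge(self, badge_id):
        try:
            allowed = self.auth_server(badge_id)
        except ConnectionError:
            # Network down: fall back to the local cache rather than
            # locking everyone out.
            return badge_id in self.cached_badges
        if allowed:
            self.cached_badges.add(badge_id)
        return allowed

def good_server(badge_id):
    return badge_id in {"alice", "bob"}

def flaky_server(badge_id):
    raise ConnectionError("auth server unreachable")

door = DoorController(good_server)
door.check_badge("alice")            # allowed -- and now cached
door.auth_server = flaky_server      # simulate the network outage
print(door.check_badge("alice"))     # True, served from the cache
print(door.check_badge("mallory"))   # False, never seen before
```

Whether caching credentials locally is acceptable is itself a risk decision; the point is to decide the failure behavior deliberately rather than inherit whatever the vendor shipped.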

As Foone stated above, sometimes you’ll just bypass the lock/door completely. Depending on your risk profile, this may mean that you set off some alarms. This is particularly true in areas that need special access controls, such as server rooms and SCIFs. To that end, part of the documentation you should have printed out is when it is acceptable to take an action that will raise an alarm (preferably with the phone number of the people who will receive the alarm, so that you can explain the situation and they don’t send the cops in guns blazing).

Potentially the hardest part of preparing for network failure is simply understanding what traverses your network. What are your business processes? What resources do those processes access? What is the path to those resources? What are the components that can cause that path, and the process, to fail? What processes happen completely locally? What processes need to traverse the Internet? This is simple to state, but extremely difficult to do in practice.
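One place to start is aggregating whatever flow records you have into a map of which business processes depend on which resources. The record fields below are invented for illustration; real flow data would need to be tied back to processes with considerably more work:

```python
# Sketch: given flow records, map each business process to the
# resources it depends on, and flag which processes must traverse
# the Internet. Record fields here are invented for illustration.

flows = [
    {"process": "payroll",    "dest": "db.corp.internal",   "external": False},
    {"process": "payroll",    "dest": "bank-api.example",   "external": True},
    {"process": "recruiting", "dest": "mail.corp.internal", "external": False},
]

def dependency_map(records):
    """process name -> set of resources it touches."""
    deps = {}
    for r in records:
        deps.setdefault(r["process"], set()).add(r["dest"])
    return deps

def internet_dependent(records):
    """Processes that break if the Internet uplink does."""
    return sorted({r["process"] for r in records if r["external"]})

print(dependency_map(flows))
print(internet_dependent(flows))
```

Even a crude map like this answers the questions above: payroll dies with the uplink, recruiting email survives, and anything that never shows up in the map is a process you didn’t know you had.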

Once you have that information, now what? This is the essence of risk management – understanding what is going on, which risks to accept, and which risks to mitigate. This can be especially difficult on tactical networks, where the Internet connection can be degraded but the power, space, and cooling needed to host local storage and compute may not be practical. There is often no good answer.

Bridging Captive Portals

So, I’m working on my research project, and it involves a bunch of virtual machines (which I have set up according to @da_667‘s instructions from his book). Cool, too easy, everything works fine.

…until I have to travel. Fortunately, my lab fits on my laptop. Unfortunately, when I connected to hotel Internet, none of my labs had Internet. The host had no issues, so what was up with the VMs?

As part of his instructions, DA said to put the firewall (pfSense) into bridged mode. This essentially puts it parallel to the host OS on the network interface, so it operates completely independently of the host OS. When the host OS made its way through the captive portal, the pfSense VM didn’t. And since the pfSense box doesn’t exactly have a web browser from which I can authenticate to the captive portal… no Internet.

Solution: change the PF from Bridged to NAT. This places it behind the host OS, so once the captive portal was dealt with, everything that is NATed is fine. Worked like a charm!

Question: why place it in bridged mode in the first place? I honestly can’t be sure, but I suspect that the host OS can cause some unexpected behavior for connections that are NATed through it. I definitely have lots of network hiccups happening (that aren’t getting in the way of my actual research, but are still a touch annoying), and while I can’t confirm that this is the cause, it wouldn’t surprise me.

Other question: is there a method of traversing captive portals from the PF? Maybe? As long as this worked, I didn’t care enough to keep troubleshooting, as I had my actual research to do.
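For the curious, the usual way devices without browsers notice a portal is a probe: fetch a known URL that should return a predictable response, and treat anything else as interception. Here is a sketch of that heuristic fed canned responses rather than live traffic (the expected 204-with-empty-body probe is the convention several operating systems use; the exact logic below is my own simplification):

```python
# Sketch of the standard captive-portal detection heuristic: probe a
# known URL that should return HTTP 204 with an empty body; a redirect
# or a different body means something is intercepting you. Responses
# here are canned for illustration rather than fetched live.

def looks_like_captive_portal(status: int, body: str) -> bool:
    if status in (301, 302, 307):
        return True                 # redirected to a login page
    if status == 204 and body == "":
        return False                # got exactly what we expected
    return True                     # anything else is suspicious

print(looks_like_captive_portal(204, ""))                     # False: open net
print(looks_like_captive_portal(302, "<html>Login</html>"))   # True: portal
```

Actually authenticating from a headless firewall is another matter entirely, which is why switching to NAT and letting the host handle the portal was the pragmatic fix.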

Supporting the Business

In Information Security (Infosec), one of the first things most of us learn is the concept of CIA – Confidentiality, Integrity, and Availability. This is such an important concept that it is part of US Code:

(1) The term “information security” means protecting information and information systems from unauthorized access, use, disclosure, disruption, modification, or destruction in order to provide—

(A) integrity, which means guarding against improper information modification or destruction, and includes ensuring information nonrepudiation and authenticity;

(B) confidentiality, which means preserving authorized restrictions on access and disclosure, including means for protecting personal privacy and proprietary information; and

(C) availability, which means ensuring timely and reliable access to and use of information.

From https://www.law.cornell.edu/uscode/text/44/3542

While this may look nice and balanced on paper, in the real world, they are anything but balanced. Each person will have different priorities. Members of the intelligence community are often very focused on confidentiality, newspapers want information to be available, and you probably want your bank account to maintain an accurate balance (integrity). Because each of those use cases has a different priority, they will take different approaches to securing their systems.

It is for this reason that the first step in applying the Risk Management Framework (RMF) is understanding the business context:

Software risk management occurs in a business context. Risks are unavoidable and are a necessary part of software development. Management of risks, including the notions of risk aversion and technical tradeoff, is deeply impacted by business motivation. Thus, the first stage of software risk management involves getting a handle on the business situation. Commonly, business goals are neither obvious nor explicitly stated. In some cases, you may even have difficulty expressing these goals clearly and consistently. During this stage, the analyst must extract and describe business goals, priorities, and circumstances in order to understand what kinds of software risks to care about and which business goals are paramount. Business goals include, but are not limited to, increasing revenue, meeting service level agreements, reducing development costs, and generating high return on investment.

From https://www.us-cert.gov/bsi/articles/best-practices/risk-management/risk-management-framework-%28rmf%29

If you are lucky, your organization has a vision and strategy that give you some sense of what the business’s goals are. While these documents may not explicitly state that they are more concerned with integrity than confidentiality, they should give you a sense of what the business values. They will often spell out specific lines of effort that let you know what it is that you need to support.

Odds are, you’re not going to find “build and secure communications infrastructure” as one of those lines of effort unless you’re working for a communications infrastructure company. You’re more likely to see things involving creating new products, or improving the service that is provided to customers. This is why the company exists: to provide a service. The network that you build and secure does not exist for its own sake; it exists to support that company’s purpose. If you are going to build and secure a network that supports your organization’s purpose, you must first understand your organization’s purpose.

As an example: let’s say that you work for a construction company. You have been tasked with procuring a new vehicle to support the company’s operations. The identified capability gap is that while most of the supplies get delivered to the construction site by freight vehicles, there are always a few things that come up during the course of the job, and someone needs to make runs to a local hardware store to pick them up. Do you select a sweet fuel-efficient compact car, or do you get the less fuel-efficient pickup truck?

Unless you have an extremely high degree of confidence that you’ll never have to pick up any wood, pipes, or anything else more than 3 feet long, you’re probably going to go with the pickup truck. Yes, it is more expensive and will cost more in fuel, but it actually meets the needs of the company. You need to conduct the same sort of analysis and make decisions based off of the needs of the organization when you are designing security solutions.

Security often has the reputation for being the office that says “no”. No, you can’t synch all of your corporate files to your personal Google Drive. No, you can’t plug your personal phone into your corporate computer, even if you’re just charging it. No, you can’t go out and install a file server under your desk.

Each of these things that security says “no” to are legitimate business needs that aren’t being met with the current IT infrastructure. What we need to start doing, as a community, is figuring out how to get to “yes”. There’s a couple of things that you need to do in order to switch from saying “no” and leaving it at that. The first is that you need to actually (gasp) talk to your customers (even if they’re in your company, they’re your customers. Remember, you exist to support their needs) about what it is that they’re trying to do. Why do they want to synch their files to Google Drive? Are they having a hard time accessing files while on business trips? Is your corporate file share super flaky and backups regularly fail, so files get lost? Do they want to be able to roll back to previous versions of files? Take the time to tease the actual business need out of the stated requirement, and you can probably find something that you are happy saying “yes” to.

While you’re talking to your customers, take some time to learn about what they do and how they do their jobs. It is very easy for us techno-geeks to fall into the trap of thinking that the people around us aren’t smart because they don’t computer well. The reality is that most people are just smart in different disciplines. The unassuming professionals over in accounting can do wizardry with spreadsheets the likes of which you’ve never seen. The executive assistants keep not only their boss’s calendar in their heads, but most of the organization’s structure and who’s who, and they can play politics with the best of them. If you’re open to it, the people around you will impress the heck out of you.

Back to why you need to learn what they do and how they do it: these are the people whose workflows you need to support. It is really easy to say “macros are evil” and ban them across the company, but what will you affect by doing that? Those accounting pros probably rely on them. And they email them to each other. And to outside parties. And they get documents with macros emailed to them. So you need to figure out how to handle macros as safely as possible. Similarly, it is literally the job of your recruiters to open documents sent to them by random strangers with no prior coordination. How do you support that process?
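“Handle macros as safely as possible” starts with knowing which documents even carry them. Modern Office files are zip archives, and macro-enabled ones contain a `vbaProject.bin` entry, so a gateway can flag them for closer handling instead of blanket-banning. A minimal sketch (the in-memory stand-in files are fabricated purely to demonstrate the check):

```python
# Sketch: modern Office files (.docx/.docm/.xlsx/.xlsm) are zip
# archives, and macro-enabled ones carry a vbaProject.bin entry.
# A mail gateway could use this to flag (not silently ban) macro
# documents for closer handling.

import io
import zipfile

def has_vba_macros(file_bytes: bytes) -> bool:
    """Return True if the Office file contains a VBA macro project."""
    try:
        with zipfile.ZipFile(io.BytesIO(file_bytes)) as zf:
            return any(name.endswith("vbaProject.bin") for name in zf.namelist())
    except zipfile.BadZipFile:
        return False  # not an OOXML file; handle that case separately

# Build two tiny stand-in "documents" in memory to demonstrate:
def fake_office_file(names):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()

plain = fake_office_file(["word/document.xml"])
macro = fake_office_file(["word/document.xml", "word/vbaProject.bin"])
print(has_vba_macros(plain))  # False
print(has_vba_macros(macro))  # True
```

With that signal you can route macro documents into sandboxing or extra review while letting the accountants keep emailing their spreadsheets.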

In supporting these processes, it is important not to make the workflows of your customers any more convoluted. “Spin up a Linux VM, where you SSH to the file share to pull down the resume that you copied there from your email” is not a good process: it is nontrivial to learn and not something that most of your users have the time to troubleshoot. “Build a portal where people submit resumes, the recruiter only sees an image of the resume, the recruiter can add notes to the candidate’s packet, and the candidate gets tracked through the process to hiring” is a potential solution that is much more likely to support the recruiter’s workflow instead of making it more onerous. And it is probably less work than teaching your recruiters Linux commands.

Another area where Infosec tends to freak out is legacy systems, especially where some sort of controller is concerned. Medical and lab equipment are common culprits here – the printer for the lab will only speak Windows XP, can’t be virtualized (it needs archaic physical ports), and costs $100k to replace. So, figure out how you can support the business. Does it need to be networked? Can you firewall it? Segment it from normal computers? Can you baseline what its legitimate communications look like and monitor it very closely? Yes, these all take time and hard work and research, and that is what you are getting paid for.

It is your job to take your knowledge about CIA and apply it to your organization. It is also your job to understand that different sections within your organization may have different priorities. Some sections may be more tolerant of losing access to their data than others – it is your job to know which is which. This not only helps you identify the correct protections for each business unit, but also helps you prioritize your own efforts. Is it more important for your engineers to have access to blueprints from two years ago, for your payroll department to be able to process transactions, or to recover your public website from being defaced? I can’t answer that question for you – you need to know your business.

If all of this sounds like you need to spend a lot of time in receive mode from company leadership, you’d be correct. They are the ones ultimately choosing what direction the company moves in, so they are the ones that set your priorities. As such, you need to not only understand their language, but be able to speak it back. Especially if you are one of the senior Infosec people on staff, you are in a position to shape their understanding of the threats facing the company. It helps tremendously if you can speak of risks to the business, not just technobabble. Bonus points if you can use stories of bad things happening to your competitors.

As you are working to communicate risks to your management, make sure that you understand the difference between vulnerability and exploitability. It is possible to have a system that is extremely vulnerable (hasn’t been patched in two years) that isn’t actually exploitable (it isn’t networked). Granted, rarely will you be this lucky. But as you are assessing the risks to your business, make sure that you have an understanding of what the actual paths to exploit a given system are, and factor that into your severity calculations.

What happens when you get this wrong? If you’re particularly lucky, you’ll just have some extremely frustrated users who feel like they can’t do their jobs. If you’re less lucky, those frustrated users will find ways of going around you. This is commonly known as shadow IT. This is where your users go out and procure their own AWS instance because you won’t give them space to host a web app, or start connecting their work computers to a MiFi because your proxy doesn’t allow them to do necessary research. Your “no” attitude, instead of making the enterprise more secure, actually reduces security, because you now have more of your network exposed, not monitored by you, and probably poorly configured.

One of the best ways of avoiding shadow IT is to have members of your team that have been those frustrated users, and can recognize when you saying “no” might have some unintended consequences. Where do you find former frustrated users? Recruit people who have had a career before Infosec. These people may have gotten interested in IT/Infosec precisely because they’ve been frustrated users and want to make things better (cherish these people!) or they may just have gotten disenchanted with whatever they were doing before. In either case, they bring a wealth of experience and perspective to your team that you probably won’t get if Infosec has been your only career. Listen to diverse voices, and include them in your decision-making. Above all else: remember who you work for.


Here’s an example of what I’m *not* trying to do – I don’t want to add more crap to home networks; I just want juicy, juicy log data, and to be able to understand it.

I’m definitely going to need to learn me some PowerShell to make this work – to dig through the events and make them sensible. Two good starting points:

Event Log Queries Using PowerShell

Search the event log with the Get-WinEvent PowerShell cmdlet