BGP, DNS and the fragility of our critical systems


This is an article written by Malcolm Heath, Senior Threat Researcher at F5 Labs.

On October 4, 2021, Facebook experienced a six-hour global outage that extended to its related properties, including WhatsApp, Instagram and Oculus VR.

In a blog post, Santosh Janardhan, Facebook’s vice president of engineering, said the outage began when the company’s engineers issued a command that unintentionally disconnected Facebook’s data centers from the rest of the world.

The command was issued during routine maintenance work, with the intention of assessing the availability of global backbone capacity. Instead, it took down all the connections in Facebook’s backbone network. Facebook’s systems are designed to audit commands like this to prevent such mistakes, but a bug in the audit tool prevented it from stopping the command. The change cut the network connections between Facebook’s data centers and the Internet.

Given the magnitude of this event, we thought it would be good to dig a little deeper into some of the internet technologies that we rely on so heavily.

It’s always DNS

Domain Name System (DNS) is a single point of failure for Internet systems. DNS maps names, such as facebook.com, to IP addresses, allowing users to easily refer to sites by name.

DNS, in effect, provides a translation between names and IP addresses, like an address book. When a site’s DNS servers are down, this lookup cannot take place and users will not be able to reach the site. Keeping your DNS servers up, running, and secure is an essential part of site reliability.

Except when it’s BGP

Underneath DNS is another technology that is at least as critical. This is a routing protocol (one of many) called Border Gateway Protocol (BGP). BGP is the protocol that allows autonomous systems (large sets of networks controlled by a single entity) to let other autonomous systems know how to reach the networks they control. BGP does not forward traffic itself; it is the protocol that routers use to share this information with each other. After receiving this information, routers can decide where to forward the data.

Why is BGP important?

Suppose you type “f5.com” into a web browser. This causes your computer to do a DNS lookup, and the local DNS server your computer is using will hopefully return an IP address of 107.162.162.40. This is the address book part.
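The address book step can be sketched in a few lines of Python. This is only an illustration: the resolve function and its lookup parameter are names made up for this example, and the parameter exists purely so the lookup can be swapped out to show what happens when resolution fails. By default it calls the standard library's socket.getaddrinfo.

```python
import socket


def resolve(name, lookup=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for name, or [] if the lookup fails."""
    try:
        # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
        # sockaddr[0] is the IP address as a string.
        results = lookup(name, None)
    except socket.gaierror:
        # Raised when the name cannot be resolved, for example because the
        # DNS servers for the site cannot be reached.
        return []
    return sorted({result[4][0] for result in results})
```

Calling resolve("f5.com") on a machine with working DNS returns whatever addresses the resolver hands back at that moment. An empty list is the situation described above: the name is fine, the site may even be running, but nobody can find it.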

However, your computer must now be able to send traffic to that IP address. It is important to note that routing decisions are made hop by hop. Each router through which your data passes will decide the next leg of the route by looking at the destination IP address and consulting its routing table to determine where to forward the data next.

If the router participates in BGP, this routing table is built from advertisements it has received from other BGP-enabled routers.

This will include information on which networks can be reached through which routers. It will also contain information about how close each router is to the destination. Close, in this case, does not mean the number of routers the data will have to traverse, but rather the number of autonomous systems it will traverse. A complex algorithm is used to determine which of the possible routes is the best. “Best” can also mean a lot of things, as factors such as exit policies and transit agreements between ISPs are also taken into account.

If it turns out that Router A’s routing table shows two routers to which it can forward data in order to reach 107.162.162.40, it will choose one of the two, based on those metrics.
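To make the choice between two routes concrete, here is a minimal Python sketch. It is a heavy simplification of real BGP path selection, which has many more tie-breakers: it only compares a local preference value (standing in for operator policy) and then the number of autonomous systems in the path. The router names, the AS numbers (taken from the range reserved for documentation) and the Advertisement class are all invented for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Advertisement:
    prefix: str
    next_hop: str
    as_path: tuple         # autonomous systems the route crosses
    local_pref: int = 100  # hypothetical policy value; higher wins


def best_path(candidates):
    """Prefer the highest local_pref, then the shortest AS path."""
    return min(candidates, key=lambda a: (-a.local_pref, len(a.as_path)))


via_b = Advertisement("107.162.0.0/16", "router-B", (64500, 64501, 64502))
via_c = Advertisement("107.162.0.0/16", "router-C", (64510, 64502))

# With equal policy, the route crossing fewer autonomous systems wins.
print(best_path([via_b, via_c]).next_hop)

# Policy can override path length, e.g. because of a transit agreement.
preferred_b = replace(via_b, local_pref=200)
print(best_path([preferred_b, via_c]).next_hop)
```

The first call picks router-C because its path crosses two autonomous systems rather than three. The second picks router-B because policy outranks path length, which is one reason “best” does not always mean shortest.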

Similar routing decisions are made by each router that receives the data, either forwarding it to another router or determining that it is directly connected to the 107.162.0.0/16 network and forwarding the data to the final destination. The same process then happens in reverse to route the return traffic, possibly through a different set of routers, back to the client.
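The hop-by-hop process can be sketched as a small Python simulation. The three routers and their tables are hypothetical, and a real routing table holds far more than a prefix and a next hop, but the decision each router makes has the same shape: find the most specific prefix that matches the destination and hand the packet to that next hop.

```python
import ipaddress

# Hypothetical routers and tables, invented for illustration.
# Each table maps a prefix to a next-hop router name, or to DIRECT when the
# router is attached to that network itself.
DIRECT = "direct"
TABLES = {
    "A": {"0.0.0.0/0": "B"},
    "B": {"107.162.0.0/16": "C", "0.0.0.0/0": "A"},
    "C": {"107.162.0.0/16": DIRECT},
}


def next_hop(router, destination):
    """Pick the most specific (longest) matching prefix in this router's table."""
    dst = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in TABLES[router].items()
        if dst in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # no route: the packet is dropped
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def trace(start, destination, max_hops=16):
    """Follow next-hop decisions one router at a time and return the path taken."""
    path = [start]
    router = start
    for _ in range(max_hops):
        hop = next_hop(router, destination)
        if hop is None:
            return path + ["dropped"]
        if hop == DIRECT:
            return path + ["delivered"]
        path.append(hop)
        router = hop
    return path + ["gave up"]


print(trace("A", "107.162.162.40"))
```

No router in this sketch knows the whole path. Router A only knows to hand everything to B, B knows that C is the way to 107.162.0.0/16, and C knows it is directly connected.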

There are many benefits to this scheme. As long as a possible end-destination router for traffic is available (and most companies with a large Internet presence have many such routers), our data should eventually make it there. Because the information needed to serve a site is broken down into multiple packets, those packets may even take different routes.

This is a feature: if an intermediate router goes down, the packets that make up our request or response can be rerouted to avoid the problem. This works as long as the routing tables are consistent and contain good information. After all, the Internet was originally designed to route around nuclear strikes.

Can you provide an illuminating metaphor?

Imagine that you want to go to your friend’s house, but you have never been there. You look up their address. That is the DNS part. Now you need to figure out how to get there, so you go to the nearest intersection and ask someone which direction you need to go. They tell you to turn left. You continue along this road until you reach another intersection and ask again. This person tells you to go right.

You continue this process until you reach your destination. It is possible that someone will tell you, “normally I would say go over the bridge, but the bridge is out, so go left here and ask at the next intersection”. Or they may say, “going left is more direct, but going right and taking the freeway is actually faster.”

The route you take won’t always be the most direct way to get there, or even necessarily the fastest, but it will help you avoid roadblocks, collapsed bridges, and washed out roads. If everyone you ask has good information, you’ll get where you’re going. The medium by which this good information is communicated is BGP. If BGP provides incorrect information, or no information on how to get where you want to go, bad things can happen.

Is BGP bulletproof?

In a word, no. It is very robust and scalable, which is an essential feature when trying to interconnect billions of hosts. But problems can arise.

A route advertisement may omit routes that it should include, which means that the associated network simply disappears from the Internet. No one knows how to get there, and traffic destined for that network will be dropped.

This is sometimes done intentionally. It is called blackholing a route, and it is usually done to block connections to or from a given network. There are a variety of reasons for doing so: for example, to block DDoS traffic from a hostile network or, in certain circumstances, to take an entire country off the Internet during a civil crisis. The result is that network traffic is simply dropped, often without notification to the sender. The blocked network will not receive any traffic and will effectively be cut off from the (digital) world.
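Whether the omission is accidental or deliberate, the effect on a router is the same, and it can be sketched in Python. The RoutingTable class and the router name below are hypothetical and reduce a real table to a prefix-to-next-hop mapping; the prefix is the example network used earlier.

```python
import ipaddress


class RoutingTable:
    """A toy routing table: prefixes learned from advertisements, nothing more."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, destination):
        """Return the next hop for destination, or None if no route covers it."""
        dst = ipaddress.ip_address(destination)
        matches = [net for net in self.routes if dst in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


table = RoutingTable()
table.announce("107.162.0.0/16", "router-B")
print(table.lookup("107.162.162.40"))  # a next hop exists

table.withdraw("107.162.0.0/16")
print(table.lookup("107.162.162.40"))  # None: the router can only drop the packet
```

Nothing about the destination itself has changed between the two lookups. The servers may be perfectly healthy; the rest of the network has simply lost the directions.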

A route may also be announced incorrectly. Misconfiguration on the part of an autonomous system can give the impression that it can route traffic to networks it does not control. Done intentionally, this is called BGP hijacking. While there are defenses against it, it has happened many times, causing large amounts of traffic to go to very strange places, perhaps in an attempt to capture and inspect that traffic for espionage purposes.
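One common way a bad announcement attracts traffic is by being more specific than the legitimate one, since routers prefer the longest matching prefix. The Python sketch below illustrates just that one mechanism; it is not a description of any particular incident, and the labels "legitimate AS" and "hijacking AS" and the pick_route function are made up for the example.

```python
import ipaddress


def pick_route(routes, destination):
    """Return the origin of the longest matching prefix, or None if nothing matches."""
    dst = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), origin)
        for prefix, origin in routes.items()
        if dst in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


routes = {"107.162.0.0/16": "legitimate AS"}
before = pick_route(routes, "107.162.162.40")

# A second autonomous system announces a smaller slice of the same network.
routes["107.162.162.0/24"] = "hijacking AS"
after = pick_route(routes, "107.162.162.40")

print(before, "->", after)
```

The legitimate route is still in the table after the second announcement. It just loses for any address inside the /24, while addresses elsewhere in the /16 are unaffected.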

Accidents are much more frequent. For example, a network operator or automated system misconfigures something. Either the necessary route disappears entirely, or the misconfiguration ends up creating a routing loop (where traffic is endlessly forwarded between two routers), or it sends the traffic to a router that doesn’t know anything about the route, which then drops it.
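These failure cases can be sketched with a toy forwarding function in Python. Each hypothetical table below holds one router's next hop for a single destination. One detail is added that the description above leaves out: IP packets carry a time-to-live (TTL) counter that is decremented at each hop, so in practice a looping packet is eventually discarded rather than circulating forever.

```python
def forward(tables, start, ttl=8):
    """Follow next hops for one destination until delivery, drop, or TTL expiry."""
    router, hops = start, [start]
    while ttl > 0:
        nxt = tables.get(router)
        if nxt is None:
            return hops, "dropped: no route"
        if nxt == "deliver":
            return hops, "delivered"
        router = nxt
        hops.append(router)
        ttl -= 1  # each hop uses up one unit of the packet's TTL
    return hops, "dropped: TTL expired"


healthy = {"A": "B", "B": "C", "C": "deliver"}
looped = {"A": "B", "B": "A"}   # A and B each think the other is the way forward
dead_end = {"A": "B"}           # B has no idea where the destination is

print(forward(healthy, "A"))
print(forward(looped, "A", ttl=4))
print(forward(dead_end, "A"))
```

In the looped case the packet bounces between A and B until its TTL runs out. In the dead-end case it is dropped as soon as it reaches a router with no matching route.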

A wake up call

The outage was an unexpected incident for Facebook. However, it is proof that any organization, regardless of size, can be affected by outages.

The story is a good reminder for all of us to pay a little more attention to this lesser-known but critically important part of Internet plumbing, and how it helps get all those cat videos to our browsers in (possibly) one piece.

Malcolm Heath is a senior threat researcher at F5 Labs. His career has included incident response, program management, penetration testing, code auditing, vulnerability research, and exploit development in organizations large and small. Prior to joining F5 Labs, he was a Senior Security Engineer at F5 SIRT.
