When an outage affects one component of the Internet infrastructure, there can often be ripple effects downstream affecting other components or services, directly or indirectly. We would like to share our observations of this impact in the case of two recent such outages, measured at different levels of the DNS hierarchy, and discuss the resulting increase in query volume due to the behavior of recursive resolvers.
In early October 2021, the internet experienced two major outages, affecting Facebook’s services and the .club top-level domain, both of which failed to resolve properly for some time. Throughout these outages, Verisign and other DNS operators reported significant increases in query volume. We provided consistent responses throughout, with the correct delegation data pointing to the correct name servers.
While these higher request rates do not affect Verisign’s ability to respond, they do raise a broader operational question: Could the repeated nature of these requests, indicating a lack of negative caching, be mistaken for a denial of service attack?
On October 4, 2021, Facebook experienced a widespread outage that lasted nearly six hours. During that time, most of its systems were inaccessible, including those that provide Facebook’s DNS service. The outage affected facebook.com, instagram.com, whatsapp.net and other domain names.
Under normal conditions, the .com and .net authoritative name servers respond to around 7,000 queries per second in total for the three domain names mentioned earlier. During this particular outage, however, request rates for these domain names reached over 900,000 requests per second (a more than 100x increase), as shown in Figure 1 below.
During this outage, the recursive nameservers received no response from Facebook’s nameservers. Instead, these queries timed out. In such situations, recursive name servers typically return a SERVFAIL or “server failure” response, presented to end users as a “this site cannot be reached” error.
Figure 1 shows an increasing query rate over the duration of the outage. Facebook uses relatively low TTLs on its DNS records, anywhere from one to five minutes; the TTL is the setting that tells DNS resolvers how long to cache a response before issuing a new query. This in turn means that, five minutes into the outage, all relevant records would have expired from all recursive resolver caches, or at least from those that respect the publisher’s TTLs. It’s not immediately clear why the query rate continued to climb throughout the outage, or whether it would have eventually plateaued had the outage continued.
To get an idea of where the traffic is coming from, we group query sources by their autonomous system number. The top five autonomous systems, along with all the others grouped together, are shown in Figure 2.
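The grouping described above can be sketched in a few lines of Python. This is only an illustration, not the tooling behind Figure 2: the function name `top_sources` is hypothetical, and the sketch assumes each query has already been mapped from its source address to an origin AS number by some other means.

```python
from collections import Counter

def top_sources(query_asns, n=5):
    """Count queries per origin AS and fold everything past the top n
    into a single "All Others" bucket.

    query_asns: iterable with one AS number per observed query
                (assumed to be resolved from the source IP beforehand).
    """
    counts = Counter(query_asns)
    top = counts.most_common(n)
    others = sum(counts.values()) - sum(count for _, count in top)
    result = dict(top)
    if others:
        result["All Others"] = others
    return result

# Example with made-up data: AS15169 (Google), AS13335 (Cloudflare),
# and two AS numbers from the range reserved for documentation.
sample = [15169, 15169, 13335, 64500, 64501]
summary = top_sources(sample, n=2)
```

With `n=2`, the two documentation AS numbers in the sample are folded into the "All Others" bucket, mirroring how the figure groups the long tail.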
From Figure 2, we can see that, at their peak, queries for these domain names to Verisign’s .com and .net authoritative name servers from the most active recursive resolvers (those of Google and Cloudflare) increased by around 7,000x and 2,000x respectively over their average non-outage rates.
On October 7, 2021, three days after the Facebook outage, the .club and .hsbc TLDs also experienced a three-hour outage. In this case, the affected authoritative servers remained reachable, but responded with SERVFAIL messages. The effect on recursive resolvers was essentially the same: since they received no useful data, they repeatedly retried their queries to the parent zone. During the incident, the A-root and J-root servers operated by Verisign observed a 45x increase in queries for .club domain names, from 80 queries per second before the outage to 3,700 queries per second during it.
Similar to the previous example, this outage also showed an upward trend in the query rate over its duration. In this case, that might be explained by the two-day TTLs on the .club delegation records in the root zone. However, the theoretical analysis is complicated by the fact that the authoritative name server records in the child zone use longer TTLs (six days), while the authoritative name server address records use shorter TTLs (10 minutes). Here we do not see a significant amount of query traffic from Google sources; instead, the increase in query volume is largely attributable to the long tail of recursive resolvers in “All Others”.
Earlier this year, Verisign set up a botnet sinkhole and analyzed the traffic it received. This botnet uses over 1,500 second-level domain names, presumably for command and control. We observed queries from about 50,000 clients every day. As an experiment, we configured our sinkhole name servers to return SERVFAIL and REFUSED responses for two of the botnet’s domain names.
When configured to return a valid response, each domain name’s query rate peaks at around 50 queries per second. However, when configured to return SERVFAIL, the query rate for a single domain name increases to 60,000 per second, as shown in Figure 5. Additionally, the query rate for the botnet domain name also increases at the TLD and root name servers, even though these services are functioning normally and the data relating to the botnet domain name has not changed, just as in the two outages described above. Figure 6 shows data from the same experiment (although for a different date), colored by source autonomous system. Here we can see that about half of the increase in query traffic is generated by a single organization’s recursive resolvers.
These two outages and one experiment all demonstrate that recursive name servers can become unnecessarily aggressive when responses to queries are not received due to connectivity issues, timeouts, or misconfigurations.
In all three of these cases, we saw significant increases in query rates from recursive resolvers across the internet, with particular contributors, such as Google Public DNS and Cloudflare’s resolver, identified on each occasion.
Often in cases like this, we turn to Internet Standards for guidance. RFC 2308 is a 1998 Standards Track specification that describes negative caching of DNS queries. The RFC covers name errors (e.g., NXDOMAIN), no data, server failures, and timeouts. Unfortunately, it states that negative caching for server failures and timeouts is optional. We have submitted an Internet-Draft that proposes updating RFC 2308 to require negative caching for DNS resolution failures.
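To make the idea of negative caching for resolution failures concrete, here is a minimal sketch in Python. It is not taken from the Internet-Draft or from any resolver implementation: the class name `FailureCache`, its methods, and the 5-second default are all assumptions chosen for illustration. RFC 2308 caps the caching of server failures at five minutes and keys them on a fuller tuple that includes the query class and server address; this sketch simplifies the key to the query name and type.

```python
import time

class FailureCache:
    """Remember recent resolution failures so they are not retried at once.

    failure_ttl is an assumed, illustrative value in seconds.
    """

    def __init__(self, failure_ttl=5.0, clock=time.monotonic):
        self.failure_ttl = failure_ttl
        self.clock = clock          # injectable so the sketch can be tested
        self._expires = {}          # (qname, qtype) -> expiry timestamp

    def record_failure(self, qname, qtype):
        """Note that a query timed out or returned SERVFAIL."""
        key = (qname.lower(), qtype)
        self._expires[key] = self.clock() + self.failure_ttl

    def is_suppressed(self, qname, qtype):
        """Return True while a recent failure for this query is still cached."""
        key = (qname.lower(), qtype)
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._expires[key]  # entry expired; allow a fresh attempt
            return False
        return True

# Usage with a fake clock so the behavior is deterministic.
now = [0.0]
cache = FailureCache(failure_ttl=5.0, clock=lambda: now[0])
cache.record_failure("Example.COM", "A")
suppressed_early = cache.is_suppressed("example.com", "A")
now[0] = 5.0
suppressed_late = cache.is_suppressed("example.com", "A")
```

A resolver using something like this would answer repeat queries with a cached failure while `is_suppressed` returns True, instead of sending a new query upstream each time.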
We believe it is important to the security, stability, and resiliency of the internet’s DNS infrastructure that implementers of recursive resolvers and public DNS services carefully consider how their systems behave when none of a domain name’s authoritative name servers are providing responses, yet the parent zones are providing proper referrals. We find it difficult to rationalize the patterns we are currently seeing, such as hundreds of queries per second from individual recursive resolver sources. The global DNS would be better served by more appropriate rate limiting, and by algorithms such as exponential backoff, to handle the types of cases we have highlighted here. Verisign remains committed to leading and contributing to the continued security, stability, and resiliency of the DNS for all internet users worldwide.
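As a rough illustration of the exponential backoff mentioned above, the following Python sketch computes a capped retry schedule and counts how many retries would fit into an outage of a given length. The function names and every parameter value (1-second base delay, doubling factor, 300-second cap) are assumptions made for this example, not recommendations from any specification or behavior of any particular resolver.

```python
def backoff_delays(attempts, base=1.0, factor=2.0, cap=300.0):
    """Return the wait (in seconds) before each of the first `attempts` retries."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor
    return delays

def retries_within(duration, base=1.0, factor=2.0, cap=300.0):
    """Count the retries scheduled within `duration` seconds of the first failure."""
    elapsed = 0.0
    delay = base
    count = 0
    while True:
        wait = min(delay, cap)
        if elapsed + wait > duration:
            return count
        elapsed += wait
        count += 1
        delay = min(delay * factor, cap)  # keep the stored delay bounded

schedule = backoff_delays(10)
six_hour_retries = retries_within(6 * 60 * 60)
```

Under these assumed parameters, one resolver retrying one name would send 79 queries over a six-hour outage, which stands in sharp contrast to hundreds of queries per second. A production implementation would typically also add random jitter so that many resolvers do not retry in lockstep.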