Stop using .IO domain names for production traffic



Nick Parsons

Director of Developer Marketing @ GetStream.io

Note: If you are using Stream, be sure to update your API client to the latest version for a big improvement in reliability. For those of you on a custom API client, take a look at our updated REST documentation.

Domain resolution is one of the basic services of the Internet. This is something we usually spend very little time thinking about. Of course, that changes when it breaks. Over the past year, .IO domain outages have been the number one reason our customers couldn’t use Stream. In particular, the outage of September 20, 2017 turned out to be a major headache. This article goes into detail about .IO domain name reliability issues and how we work around them.

The Internet Domain Name System (DNS) infrastructure is large and complex. Due to its decentralized nature, if the issue is with your DNS provider or the larger DNS infrastructure, there is nothing you can do but sit back and wait for the issue to be resolved. The only practical solution to deal with DNS failures is to fall back on a backup domain.

This makes DNS failures quite unpleasant. Many risks are complex and costly to mitigate, and in some scenarios it is virtually impossible to do so.

What went wrong: A global outage at the ‘.io’ top level domain

On September 20, 2017, our system monitors and health checks started showing intermittent failures. Pings to our website and API servers were failing because “getstream.io” records could not be resolved to a valid IP address.

Domain name resolution is required to access our core API service and dashboard. Without it, customers will not be able to find the address of our servers. It goes without saying that this was immediately classified as critical and received the full attention of our team.

After an initial investigation, we discovered that resolving any getstream.io record would randomly fail with an incorrect NXDOMAIN error returned. Subsequently, one of our engineers identified that resolving .io domains would consistently fail on 2 of the 6 authoritative .io name servers. The other four were working fine, which explained the seemingly random nature of the errors.
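The per-nameserver check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling we actually used: it sends one plain DNS query over UDP to each authoritative server and reports the response code, so a server that wrongly answers NXDOMAIN stands out. The nameserver hostnames in the usage note are placeholders, and the function names are hypothetical.

```python
import socket
import struct

# Response codes from RFC 1035; anything else is reported numerically.
RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Encode a minimal DNS query for `name` (type A by default, class IN)."""
    # Flags are 0: a standard query without "recursion desired", which is
    # what you want when asking an authoritative server directly.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def rcode_of(response: bytes) -> str:
    """Extract the response code from the low 4 bits of the DNS flags field."""
    if len(response) < 4:
        return "malformed"
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODE_NAMES.get(code, f"RCODE{code}")


def check_nameservers(name, nameservers, timeout=2.0):
    """Ask each nameserver about `name` and map server -> response code."""
    query = build_query(name)
    results = {}
    for ns in nameservers:
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
                sock.settimeout(timeout)
                sock.sendto(query, (ns, 53))
                data, _ = sock.recvfrom(512)
            results[ns] = rcode_of(data)
        except OSError as exc:  # timeouts and lookup failures are OSError subclasses
            results[ns] = f"error: {exc}"
    return results
```

Calling `check_nameservers("getstream.io", ["ns1.example.net", "ns2.example.net"])` with the real list of TLD nameservers (for example, as returned by `dig NS io.`) would show which servers disagree with the rest.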

A bad response is an NXDOMAIN answer for a name that actually exists.

Since this happened on authoritative nameservers, we contacted our DNS provider and then also tried contacting NIC.io. To our surprise, we discovered that NIC.io was only reachable by phone between 7:00 a.m. and 12:00 a.m. UTC Monday through Friday and did not publish any status information about the health of the service.

In the meantime, we started researching who else was affected by this outage and posted about it on Twitter and Hacker News. While waiting for the outage to end, we also increased the TTL of our DNS records so that the number of DNS queries would be as low as possible. Shortly thereafter, we received a response from G Cadeaux informing us that NIC.io was working on the problem.

The outage lasted for almost 2 hours, during which time one fifth of the DNS queries for any getstream.io record failed. For something that sits in front of our entire service, this is a huge issue, and it raised more than a few questions on our end.

Couldn’t this happen with any TLD?

We get it. Sometimes things break. In reality, a similar failure could have occurred in any top level domain.

When we started in 2014, we decided .io was great from a brand perspective. Stream is a technical product and our audience is primarily technical, so .io seemed like a good fit. Using the same domain for our APIs was more a consequence of that choice than a deliberate decision.

It is not possible to estimate the likelihood that .com nameservers will experience the same type of failures as .io nameservers. One thing that surprised us is that while about 20% of DNS resolutions for all .io domains were totally broken, it was hard to find people complaining about it on Twitter. In fact, I think we were among the first to tweet about it. If this had happened to all .com domains, it would have been all over the news.

What really went wrong

Unfortunately, we discovered the hard way that NIC.IO is not equipped with the technical support and systems necessary to manage a top level domain. Not being able to reach them while a major outage was occurring is unacceptable.

Looking further, it doesn’t take much research to discover that the .io TLD team has made several mistakes over the past few years. Just to name a few:

Searching for .io on HN returns a long list of similar failures.

What is the best immediate solution?

Adding a .com domain and using it by default in all of our API clients is clearly the low-hanging fruit. Of course, we could have the same problem if .com went down. However, we are much more confident in the management behind .com. We believe that not only would the problem have been identified earlier, but it also would not have taken hours to acknowledge and remedy the situation.

Our roadmap to a trouble-free DNS service

These DNS issues have made us pause and think about all the ways DNS can fail.

  1. Losing control of our own domain. Alarmingly, this can happen in several ways.
  2. Route53 failure. Since we delegate getstream.io to Route53 nameservers, a failure of their nameservers would disrupt our service. The DynDNS DDoS outage of 2016 is one example of such a provider outage.
  3. .com TLD failure.

Since we control the API clients, implementing a failover mechanism is easy. Setting up and maintaining a backup domain and/or a backup DNS provider, however, can be very difficult. For a backup domain, we would need to keep hundreds of DNS records in sync and duplicate our SSL certificates. For a backup DNS provider, we would need to keep all DNS records in sync across two different providers and make sure that we are not using any Route53-specific functionality. As an AWS customer, this is a major challenge because DNS is deeply integrated with other AWS services in so many ways.
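To illustrate why the client side is the easy part, here is a minimal sketch of a failover wrapper, assuming an API that is reachable under the same path on several hostnames. It is not the code in our SDKs. The hostnames in the usage note and the function names are hypothetical, and the `fetcher` parameter exists only so the transport can be swapped out.

```python
import urllib.error
import urllib.request


def fetch(url: str, timeout: float = 3.0) -> bytes:
    """Default transport: a plain HTTPS GET using the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def request_with_failover(path, hosts, fetcher=fetch):
    """Try each host in order and return the first successful response body.

    Only connection-level problems (DNS resolution failures, timeouts,
    refused connections) trigger a failover to the next host.
    """
    last_error = None
    for host in hosts:
        try:
            return fetcher(f"https://{host}{path}")
        except urllib.error.HTTPError:
            # The host resolved and answered; an HTTP error status is not
            # a DNS problem, so do not mask it by trying another domain.
            raise
        except OSError as exc:
            # URLError (which wraps DNS lookup failures) and socket
            # timeouts are both OSError subclasses.
            last_error = exc
    raise ConnectionError(f"all hosts failed: {list(hosts)}") from last_error
```

A client would call something like `request_with_failover("/health", ["api.example.com", "api.example.org"])`, listing the preferred domain first. Whether to retry the primary domain on the next request or stick with the backup for a while is a design choice this sketch leaves open.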

Going forward, our plan is to add a .org domain and find a DNS provider to manage the nameservers.
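Running a second provider means detecting when the two record sets drift apart. The sketch below assumes each provider's zone has already been exported as a set of `(name, type, value)` tuples; how to export them depends on the provider and is not shown. The record values are made-up examples, and the function names are hypothetical.

```python
from typing import Dict, Set, Tuple

Record = Tuple[str, str, str]  # (name, type, value)


def normalize(record: Record) -> Record:
    """Make records comparable: case-insensitive names, no trailing dot."""
    name, rtype, value = record
    return (name.rstrip(".").lower(), rtype.upper(), value.strip())


def diff_zones(primary: Set[Record], secondary: Set[Record]) -> Dict[str, Set[Record]]:
    """Return the records that exist on only one of the two providers."""
    a = {normalize(r) for r in primary}
    b = {normalize(r) for r in secondary}
    return {
        "missing_in_secondary": a - b,
        "extra_in_secondary": b - a,
    }


if __name__ == "__main__":
    # Hypothetical exports from two providers.
    primary = {
        ("api.example.com.", "A", "192.0.2.10"),
        ("www.example.com", "CNAME", "lb.example.net"),
    }
    secondary = {
        ("API.example.com", "a", "192.0.2.10"),
    }
    for kind, records in diff_zones(primary, secondary).items():
        print(kind, sorted(records))
```

A check like this only covers standard records. Provider-specific features, such as the Route53 functionality mentioned above, have no direct equivalent to compare against and need to be handled separately.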

Conclusion

Looking back, using a .IO domain for our core APIs was not a good choice. The September 20 outage demonstrated how severe the problems with the .IO TLD and its supporting infrastructure are. Based on our experience, we do not recommend using a .IO domain name if availability is important.

To work around the DNS issue, Stream API traffic now runs on a .com domain name. The site is still running on .io because it’s harder to change and not as critical in terms of uptime. To further improve reliability, we are considering:

  • Addition of a backup .ORG domain name.
  • Using a backup DNS provider for the .COM or .ORG domain name.
  • Client-side DNS failover implementation in our SDKs.

DNS as a whole is one of those things that most people take for granted, but it can easily cause serious downtime and other issues. Using a widely used TLD like .com / .net / .org is the best and easiest way to ensure reliability.

This is a collaboration of the GetStream.io team, led by Tommaso Barbugli, CTO at GetStream.io. The original blog post is available at https://getstream.io/blog/stop-using-io-domain-names-for-production-traffic/.
