Facebook says that a configuration error broke its connection to a key network backbone, disconnecting all of its data centers from the Internet and leaving its DNS servers unreachable.
The unusual combination of errors took down the web operations of Facebook, Instagram and WhatsApp in a massive global outage that lasted more than five hours. In effect, Facebook said, a single errant command took down web services used by more than 7 billion accounts worldwide.
Early external analyses of the outage focused on Facebook’s domain name system (DNS) servers and changes to a network route in the Border Gateway Protocol (BGP), issues that were clearly visible from Internet records. Those turned out to be secondary problems triggered by Facebook’s backbone outage.
During planned network maintenance, “a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” according to a blog post by Facebook VP of Infrastructure Santosh Janardhan.
The errant command would normally be caught by an auditing tool, but “a bug in that audit tool didn’t properly stop the command,” Facebook said.
Technical Overview of the Facebook Outage
Here’s the section of the blog post that explains the issue and resulting outage, which is worth reading in full:
The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself.
This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.
This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.
One of the jobs performed by our smaller facilities is to respond to DNS queries. DNS is the address book of the internet, enabling the simple web names we type into browsers to be translated into specific server IP addresses. Those translation queries are answered by our authoritative name servers that occupy well known IP addresses themselves, which in turn are advertised to the rest of the internet via another protocol called the border gateway protocol (BGP).
To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.
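The failure mode Facebook describes — DNS servers withdrawing their own BGP route advertisements when they lose contact with the data centers — can be sketched in simplified form. The probe endpoints, prefix, and function names below are illustrative assumptions, not Facebook’s actual tooling; the point is the health-check logic, in which a node that cannot reach the backbone stops advertising itself.

```python
import socket

# Hypothetical backbone endpoints this DNS node probes to judge its own
# health. These hostnames are placeholders, not real Facebook systems.
BACKBONE_PROBES = [("dc1.backbone.example", 443), ("dc2.backbone.example", 443)]

def backbone_reachable(probes, timeout=2.0):
    """Return True if at least one backbone endpoint accepts a TCP connection."""
    for host, port in probes:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False

def reconcile_bgp(announce, withdraw, prefix="192.0.2.0/24"):
    """Advertise our DNS prefix only while the backbone looks healthy.

    `announce` and `withdraw` stand in for whatever interface the routing
    daemon exposes; the prefix is a documentation-range placeholder.
    """
    if backbone_reachable(BACKBONE_PROBES):
        announce(prefix)
    else:
        # Backbone unreachable: declare ourselves unhealthy and withdraw
        # the route, exactly the step that made Facebook's (still running)
        # DNS servers invisible to the rest of the Internet.
        withdraw(prefix)
```

In this sketch the trade-off Facebook describes is visible: withdrawing the route protects clients from a DNS server that cannot answer usefully, but when the whole backbone disappears, every node withdraws at once and the service vanishes from the Internet.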
Manual Restarts Extend the Delay
Recovery became difficult because all of Facebook’s data centers were inaccessible, and the DNS outage hobbled many of the network tools that would normally be key to troubleshooting and repairing the problems.
With remote management tools unavailable, the affected systems had to be manually debugged and restarted by technicians in the data centers. “It took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online,” said Janardhan.
A final problem was how to restart Facebook’s huge global data center network and handle an immediate surge of traffic. This is a challenge that goes beyond network logjams to the data center hardware and power systems.
“Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk,” said Janardhan.
The data center industry exists to eliminate downtime in IT equipment by ensuring power and network are always available. A key principle is to eliminate single points of failure, and Monday’s outage illustrates how hyperscale networks that serve global audiences can also enable outages at unprecedented scale.
Now that the details of the outage are known, Facebook’s engineering team will assess what went wrong and seek to prevent a recurrence.
“Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one,” Janardhan said. “After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway. … From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.”