On August 30, CenturyLink experienced a major network outage that lasted for over five hours and disrupted CenturyLink customers nationwide as well as many other networks. What was unique about the outage was its scope: it affected video streaming services, game platforms, and even webcasts of European soccer.
This is an example of how telecom network outages have expanded in size and scope and can now be global in scale. This is a development that I find disturbing because it means that our telecom networks are growing more vulnerable over time.
The story of what happened that day is fascinating and I’m including two links for those who want to peek into how the outages were viewed by outsiders who are engaged in monitoring Internet traffic flow. First is this report from a Cloudflare blog that was written on the day of the outage. Cloudflare is a company that specializes in protecting large businesses and networks from attacks and outages. The blog describes how Cloudflare dealt with the outage by rerouting traffic away from the CenturyLink network. This story alone is a great example of modern network protections that have been put into place to deal with major Internet traffic disruptions.
The second report comes from ThousandEyes, which is now owned by Cisco. The company is similar to Cloudflare and helps clients deal with security issues and network disruptions. The ThousandEyes report comes from the day after the outage and discusses the likely reasons for the outage. Again, this is an interesting story for those who don’t know much about the operations of the large fiber networks that constitute the Internet. ThousandEyes confirms the suspicions that were expressed the day before by Cloudflare that the issue was caused by a powerful network command issued by CenturyLink using Flowspec that resulted in a logic loop that turned off and restarted BGP (Border Gateway Protocol) over and over again.
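For readers who want to see why a loop like this never settles on its own, here is a toy sketch in Python. It is not a model of CenturyLink's routers or of the real Flowspec protocol. It assumes, purely for illustration, that the faulty rule arrives over a BGP session, that the rule ends up blocking BGP traffic itself, and that the rule disappears when the session that delivered it goes down. The function name and event labels are made up for this example.

```python
from typing import List


def simulate_flap(cycles: int, rule_blocks_bgp: bool = True) -> List[str]:
    """Toy model of a routing session that keeps resetting itself.

    Assumptions (illustrative only, not taken from the outage reports):
    - the filtering rule is delivered over the BGP session,
    - a bad rule blocks BGP traffic, which drops that session,
    - rules learned over a session vanish when the session drops.
    """
    events: List[str] = []
    for _ in range(cycles):
        # The session comes up and the neighbor sends the rule again.
        events.append("bgp_up")
        events.append("rule_installed")
        if not rule_blocks_bgp:
            # A harmless rule leaves the session alone, so nothing repeats.
            break
        # The bad rule filters BGP itself, so the session collapses...
        events.append("bgp_down")
        # ...and the rule that caused the collapse is removed with it,
        # which lets the session come back up and restart the cycle.
        events.append("rule_removed")
    return events


if __name__ == "__main__":
    for event in simulate_flap(3):
        print(event)
```

In this sketch the loop only ends because the simulation stops counting. Nothing inside the cycle breaks it, which is consistent with the reports that the outage dragged on for hours.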
It’s reassuring to know that there are companies like Cloudflare and ThousandEyes that can stop network outages from permeating into other networks. But what is also clear from the reporting of the event is that a single incident or bad command can take out huge portions of the Internet.
That is something worth examining from a policy perspective. It’s easy to understand how this happens at companies like CenturyLink. The company has acquired numerous networks over the years from the old Qwest network up to the Level 3 networks and has integrated them all into a giant platform. The idea that the company owns a large global network is touted to business customers as a huge positive – but is it?
Network owners like CenturyLink have consolidated and concentrated control of the network in a few key network hubs run by a relatively small staff of network engineers. ThousandEyes says that the CenturyLink Network Operations Center in Denver is one of the best in existence, and I’m sure they are right. But that network center controls a huge piece of the country’s Internet backbone.
I can’t find where CenturyLink ever gave the exact reason why the company issued a faulty Flowspec command. It may have been used to try to tamp down a problem at one customer or have been part of more routine network upgrades implemented early on a Sunday morning when the Internet is at its quietest. From a policy perspective, it doesn’t matter – what matters is that a single faulty command could take down such a large part of the Internet.
This should cause concern for several reasons. First, if one unintentional faulty command can cause this much damage, then the network is susceptible to this being done deliberately. I’m sure that the network engineers running the Internet will say that’s not likely to happen, but they also would have expected this particular outage to have been stopped much sooner and more easily.
I think the biggest concern is that the big network owners have adopted the idea of centralization to such an extent that outages like this one are more and more likely. Centralization of big networks means that outages can now reach globally rather than staying local, as they did just a decade ago. Our desire to be as efficient as possible through centralization has increased the risk to the Internet, not decreased it.
A good analogy for understanding the risk in our Internet networks comes from looking at the nationwide electric grid. It used to be routine to purposefully allow neighboring grids to automatically interact until it became obvious after some giant rolling blackouts that we needed firewalls between grids. The electric industry reworked the way that grids interact, and the big rolling regional outages disappeared. It’s time to have that same discussion about the Internet infrastructure. Right now, the security of the Internet is in the hands of a few corporations that stress the bottom line first, and which have willingly accepted increased risk to our Internet backbones as a price to pay for cost efficiency.