Network Outages Go Global

On August 30, CenturyLink experienced a major network outage that lasted more than five hours and disrupted not only CenturyLink customers nationwide but many other networks as well. What was unique was the scope of the disruption: the outage affected video streaming services, gaming platforms, and even webcasts of European soccer.

The event shows how telecom network outages have expanded in size and scope to the point where they can now be global in scale. I find that development disturbing, because it means our telecom networks are growing more vulnerable over time.

The story of what happened that day is fascinating, and I’m including two links for those who want a peek at how the outage looked to outsiders who monitor Internet traffic flows. First is this report from a Cloudflare blog written on the day of the outage. Cloudflare specializes in protecting large businesses and networks from attacks and outages, and its blog describes how it dealt with the outage by rerouting traffic away from the CenturyLink network. That story alone is a great example of the modern protections that have been put in place to deal with major Internet traffic disruptions.
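
Cloudflare hasn’t published the exact mechanics of its failover, but the basic idea is easy to sketch. Below is a toy Python model of steering traffic away from a failing transit provider – the provider names and the loss threshold are invented for illustration, not taken from Cloudflare’s systems:

```python
# Toy model of steering traffic away from a failing transit provider.
# An illustrative sketch, not Cloudflare's actual failover logic; the
# provider names and the loss threshold are invented for the example.

LOSS_THRESHOLD = 0.05  # withdraw traffic from any provider above 5% packet loss

providers = {
    "transit-a": {"measured_loss": 0.001, "weight": 40},
    "transit-b": {"measured_loss": 0.002, "weight": 40},
    "transit-c": {"measured_loss": 0.35,  "weight": 20},  # the failing network
}

def rebalance(providers):
    """Drop unhealthy providers and spread traffic across the survivors."""
    alive = {name: p for name, p in providers.items()
             if p["measured_loss"] <= LOSS_THRESHOLD}
    total = sum(p["weight"] for p in alive.values())
    return {name: p["weight"] / total for name, p in alive.items()}

print(rebalance(providers))
# {'transit-a': 0.5, 'transit-b': 0.5} – transit-c no longer carries traffic
```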

The second report comes from ThousandEyes, which is now owned by Cisco. The company is similar to Cloudflare in that it helps clients deal with security issues and network disruptions. The ThousandEyes report was published the day after the outage and discusses the likely causes. Again, it’s an interesting read for those who don’t know much about the operation of the large fiber networks that constitute the Internet. ThousandEyes confirms the suspicion Cloudflare voiced the day before: the outage was caused by a powerful network command CenturyLink issued using Flowspec, which created a logic loop that shut down and restarted BGP (Border Gateway Protocol) over and over again.
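
Neither company published CenturyLink’s actual configuration, but the failure mode ThousandEyes describes – a rule that keeps killing the very sessions that distribute it – can be sketched as a toy simulation. Everything below is invented for illustration; it is conceptual, not router code:

```python
# Toy simulation of a Flowspec rule that keeps tearing down the BGP
# sessions that distribute it. The rule contents and router behavior
# are invented; this is a conceptual sketch, not router code.

class Router:
    def __init__(self, name):
        self.name = name
        self.bgp_up = True
        self.stored_rules = []

    def receive_flowspec(self, rule):
        self.stored_rules.append(rule)
        if rule["malformed"]:
            self.bgp_up = False  # the bad rule knocks the BGP session over

    def restart_bgp(self):
        # On restart the router re-processes the same stored rule, which
        # kills the session again – a self-sustaining loop.
        self.bgp_up = True
        if any(rule["malformed"] for rule in self.stored_rules):
            self.bgp_up = False

router = Router("core-1")
router.receive_flowspec({"match": "wildcard", "action": "police", "malformed": True})
for attempt in range(3):
    router.restart_bgp()
    print(f"restart {attempt + 1}: BGP is {'up' if router.bgp_up else 'down again'}")
```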

It’s reassuring to know that there are companies like Cloudflare and ThousandEyes that can stop network outages from spreading into other networks. But what is also clear from the reporting on the event is that a single incident or bad command can take out huge portions of the Internet.

That is something worth examining from a policy perspective. It’s easy to understand how this happens at a company like CenturyLink, which has acquired numerous networks over the years, from the old Qwest network up through Level 3, and has integrated them all into one giant platform. The company touts its large global network to business customers as a huge positive – but is it?

Network owners like CenturyLink have consolidated and concentrated control of their networks into a few key hubs run by a relatively small staff of network engineers. ThousandEyes says that the CenturyLink Network Operations Center in Denver is one of the best in existence, and I’m sure they are right. But that one center controls a huge piece of the country’s Internet backbone.

I can’t find that CenturyLink ever explained why it issued the faulty Flowspec command. It may have been an attempt to tamp down a problem at a single customer, or part of routine network upgrades implemented early on a Sunday morning when the Internet is at its quietest. From a policy perspective, it doesn’t matter – what matters is that a single faulty command could take down such a large part of the Internet.

This should cause concern for several reasons. First, if one unintentional faulty command can cause this much damage, then the network is susceptible to somebody doing the same thing deliberately. I’m sure the network engineers running the Internet will say that’s not likely to happen, but they also would have expected this particular outage to be stopped much sooner and more easily.

I think the biggest concern is that the big network owners have embraced centralization to such an extent that outages like this one are increasingly likely. Centralizing big networks means that outages can now reach globally, not just locally as they did a decade ago. Our desire to be as efficient as possible through centralization has increased the risk to the Internet, not decreased it.

A good analogy for understanding the risk in our Internet networks is the nationwide electric grid. It used to be routine to let neighboring grids interact automatically, until it became obvious after some giant rolling blackouts that we needed firewalls between grids. The electric industry reworked the way grids interact, and the big rolling regional outages disappeared. It’s time to have that same discussion about Internet infrastructure. Right now, the security of the Internet is in the hands of a few corporations that stress the bottom line first and have willingly accepted increased risk to our Internet backbones as the price of cost efficiency.

Our Aging Internet Protocols

The Internet has changed massively over the last decade. We now see it doing amazing things compared to what it was first designed to do, which was to provide communications within the government and between universities. But the underlying protocols that are still the core of the Internet were designed in an online world of emails and bulletin boards.

Those base protocols are under constant attack from hackers because they were never designed with security in mind, or for the kinds of uses we see on the Internet today. The original founders of the Internet never foresaw that people with malicious intent would attack the underlying protocols and wreak havoc. In fact, they never expected the Internet to grow much outside their cozy little world.

There is one group now looking at these base protocols. The Core Infrastructure Initiative (CII) was launched in April 2014 after the Heartbleed bug wreaked havoc across the Internet by exploiting OpenSSL. There are huge corporations behind the initiative, though unfortunately not yet huge dollars. Companies like Amazon, Adobe, Cisco, Dell, Facebook, Google, HP, IBM, Microsoft, and about every other big name in computing and networking are members of the group. The group currently funds proposals from researchers who want to find ways to upgrade and protect the core protocols underlying the Internet. There is not yet a specific agenda or plan to fix all of the protocols, just some ad hoc projects, but the hope is that somebody will step up over time to overhaul these old protocols and create a more modern and safer web.

The idea behind the CII is to be able to marshal major resources after the next Heartbleed-like attack. It took the industry too long to fix Heartbleed, and the concept is that if all the members of the organization mobilize, major web disruptions can be diagnosed and fixed quickly.

Following are some of the base protocols that have been around since the genesis of the Internet. At times each of these has been the target of hackers and malicious software.

IPv4 to IPv6. I wrote just last week about the depletion of IPv4 addresses. At some point the industry will throw the switch and kill IPv4, and there is major concern that hackers have already written malicious code to pounce on networks the first day they rely solely on IPv6. Hackers have had years to think about how to exploit the change, while companies have instead been busy figuring out how to get through the conversion.
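
For anyone curious where a given service stands today, here’s a minimal sketch using only Python’s standard library to compare the IPv4 and IPv6 addresses a hostname publishes; the hostname is just an example of a dual-stacked service:

```python
# Quick dual-stack check using only the standard library: which IPv4 (A)
# and IPv6 (AAAA) addresses does a hostname publish? Substitute any host.
import socket

def addresses(host, family):
    try:
        infos = socket.getaddrinfo(host, 443, family, socket.SOCK_STREAM)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []  # no records of this type (or no connectivity to resolve)

host = "www.google.com"
print("IPv4:", addresses(host, socket.AF_INET))
print("IPv6:", addresses(host, socket.AF_INET6))
```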

BGP: Border Gateway Protocol. BGP is used to coordinate changes in Internet topology and routing. The problem with the protocol is that it’s easily spoofed, because there is no built-in way to verify that a given network is actually entitled to announce a given block of addresses. Fixing BGP is a current priority at the Core Infrastructure Initiative.
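
Separate from the CII’s work, the fix the industry has been slowly deploying is route origin validation through RPKI, where cryptographically signed records state which network may announce which prefixes. Here’s a stripped-down sketch of that idea – the ASNs and prefixes are invented, and the signature checking that real RPKI performs is skipped:

```python
# Stripped-down sketch of route origin validation, the idea behind RPKI:
# accept a BGP announcement only if a signed record (an ROA) says the
# originating network may announce that prefix. ASNs and prefixes are
# invented; real validation involves cryptographic checks this toy skips.
import ipaddress

# (covering prefix, max announced length, authorized origin AS)
roas = [
    (ipaddress.ip_network("203.0.113.0/24"), 24, 64500),
]

def validate(prefix_str, origin_as):
    prefix = ipaddress.ip_network(prefix_str)
    for roa_prefix, max_len, roa_as in roas:
        if prefix.subnet_of(roa_prefix) and prefix.prefixlen <= max_len:
            return "valid" if origin_as == roa_as else "invalid – wrong origin"
    return "unknown – no covering ROA"

print(validate("203.0.113.0/24", 64500))  # valid
print(validate("203.0.113.0/24", 64999))  # invalid – what a hijack looks like
```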

DNS: Domain Name System. This is the system that translates domain names into IP addresses. DNS is a frequent hacking target and is how the Syrian Electronic Army hacked the New York Times. There are serious flaws in the DNS protocol that have been hastily patched but never truly fixed.
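
To see why classic DNS was so easy to spoof, consider that a resolver accepted the first UDP reply whose 16-bit transaction ID matched its outstanding query, so an off-path attacker could simply flood forged guesses. A few lines of arithmetic show how small that guessing space is and why the stopgap patch was to randomize source ports as well:

```python
# Why classic DNS was spoofable: replies were matched to queries by a
# 16-bit transaction ID alone. The field sizes below are the protocol's
# real ones; source-port randomization was the stopgap patch, and DNSSEC
# (still far from universal) is what actually signs the answers.

transaction_ids = 2 ** 16        # the 16-bit query ID
source_ports = 2 ** 16 - 1024    # roughly the usable ephemeral ports

print("guesses needed, ID only:  ", transaction_ids)                  # 65,536
print("guesses needed, ID + port:", transaction_ids * source_ports)
# 65k guesses per query window is trivial at modern packet rates, which
# is why the unpatched protocol fell to cache-poisoning attacks.
```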

NTP: Network Time Protocol. NTP keeps clocks in sync across computer networks. In the past, flaws in the protocol have been used to launch denial-of-service attacks. Those appear to be fixed for now, but the protocol was not designed for safety and could be exploited again.
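
The best-known NTP abuse was the old ‘monlist’ amplification attack, where a tiny spoofed query made a server send a huge reply – a list of up to 600 recent clients – at a victim. The back-of-the-envelope math below uses commonly cited approximate packet sizes, not exact on-the-wire values:

```python
# Back-of-the-envelope math on the old NTP 'monlist' amplification attacks.
# The packet sizes are commonly cited approximations, not exact values.

request_bytes = 234           # one spoofed monlist query
response_bytes = 100 * 482    # ~100 UDP packets of ~482 bytes each

print(f"amplification: ~{response_bytes / request_bytes:.0f}x")  # ~206x
# Every 1 Gbps of spoofed queries became roughly 200 Gbps of attack
# traffic, which is why monlist now ships disabled.
```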

SMTP: Simple Mail Transfer Protocol. SMTP is the protocol used to transfer email between mail servers. It has no inherent security features and was an early target of hackers. Various add-ons now patch over the protocol, but any server not using those patches (and many don’t) can put other networks at risk. Probably the only way to truly fix this is to find an alternative to email.
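
To see how trusting the base protocol is, here’s a sketch that hands an SMTP server a forged envelope sender. It’s written to run only against a local test server (for example aiosmtpd’s debug server), and the addresses use reserved example domains:

```python
# Core SMTP takes whatever envelope sender it is given – nothing in the
# base protocol verifies it. Run this only against a local test server
# such as aiosmtpd's debug server (`python -m aiosmtpd -n`, which listens
# on localhost:8025); the addresses are fictitious example domains.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "ceo@example.com"      # a sender we do not control
msg["To"] = "victim@example.net"
msg["Subject"] = "Unverified by SMTP itself"
msg.set_content("SPF, DKIM, and DMARC were bolted on later to catch this.")

with smtplib.SMTP("localhost", 8025) as server:
    # sendmail() sets the envelope MAIL FROM; the protocol never checks
    # whether we are entitled to use that domain.
    server.sendmail("ceo@example.com", ["victim@example.net"], msg.as_string())
```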

SSL: Secure Sockets Layer. SSL was designed to provide encryption for application-layer connections like HTTP. Interestingly, the protocol has had a replacement since 1999 – Transport Layer Security (TLS). But SSL is still enabled on many networks for backward compatibility, and roughly 0.3% of web traffic still uses it. SSL was exploited in the infamous POODLE attack, and the easiest way to make the web more secure would be to finally shut it down.
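
Shutting SSL down is mostly a configuration exercise at this point. Modern TLS stacks refuse SSLv3 by default, and a minimum protocol version can be pinned explicitly; here’s what that looks like in Python’s ssl module:

```python
# The practical lesson of POODLE: refuse SSLv3 outright. Modern Python
# contexts already exclude SSLv2 and SSLv3; pinning a minimum TLS
# version makes the intent explicit.
import ssl

context = ssl.create_default_context()            # SSLv2/SSLv3 already off
context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse old TLS as well

print(context.minimum_version)  # TLSVersion.TLSv1_2
```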