The Huge CenturyLink Outage

At the end of December CenturyLink had a widespread network outage that lasted over two days. The outage disrupted voice and broadband service across the company’s wide service territory.

Probably the most alarming aspect of the outage is that it knocked out the 911 systems in parts of fourteen states. It was reported that calls to 911 might get a busy signal or a recording saying that “all circuits are busy”. In other cases, 911 calls were routed to the wrong 911 center. Some jurisdictions responded to the 911 problems by sending out emergency text messages to citizens providing alternate telephone numbers to dial during an emergency. The 911 service outages prompted FCC Chairman Ajit Pai to call CenturyLink and to open a formal investigation into the outage.

I talked last week to a resident of a small town in Montana who said that the outage was locally devastating. Credit cards wouldn’t work for most of the businesses in town, including at gas stations. Businesses like hotels that rely on software in the cloud for daily operations were unable to function. Bank ATMs weren’t working. Customers with CenturyLink landlines had spotty service and mostly could not make or receive phone calls. Worse yet, cellular service in the area largely died, meaning that CenturyLink must have been supplying the broadband circuits supporting the cellular towers.

CenturyLink reported that the outage was caused by a faulty networking management card in a Colorado data center that was “propagating invalid frame packets across devices”. It took the company a long time to isolate the problem, and the final fix involved rebooting much of the network electronics.

Every engineer I’ve spoken to about this says that in today’s world it’s hard to believe that it would take two days to isolate and fix a network problem caused by a faulty card. Most network companies operate a system of alarms that instantly notify them when any device or card is having problems. Further, complex networks today are generally built with significant redundancy that allows troubled components to be isolated in order to stop the kind of cascading outage that occurred in this case. The engineers all said that it’s almost inconceivable that a single component like a card could cause such a huge problem in a modern network. While network centralization can save money, few companies route their whole network through choke points – there are a dozen different strategies to create redundancy and protect against this kind of outage.
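As a rough illustration of the choke-point concern, the sketch below brute-forces single points of failure in a small topology: it removes each node in turn and checks whether the remaining nodes can still reach one another. The node names and links are entirely hypothetical and say nothing about how CenturyLink's network is actually built.

```python
from collections import deque


def is_connected(links, nodes):
    """Return True if every node in `nodes` can reach every other one."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, ()):
            # Only follow links to nodes that are still in service.
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == nodes


def single_points_of_failure(links):
    """List nodes whose loss would split the rest of the network."""
    all_nodes = set(links)
    return sorted(
        node for node in all_nodes
        if not is_connected(links, all_nodes - {node})
    )


# Hypothetical topologies (assumed for illustration; links are two-way).
HUB_AND_SPOKE = {
    "core": {"east", "west", "south"},
    "east": {"core"},
    "west": {"core"},
    "south": {"core"},
}
RING = {
    "core": {"east", "west"},
    "east": {"core", "south"},
    "south": {"east", "west"},
    "west": {"south", "core"},
}

print(single_points_of_failure(HUB_AND_SPOKE))  # ['core']
print(single_points_of_failure(RING))           # []
```

In the hub-and-spoke layout, losing the core strands every other site, while the ring survives the loss of any one node. Real networks are far larger, but the same question applies to them.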

Obviously none of us knows any of the facts beyond the short notifications issued by CenturyLink at the end of the outage, so we can only speculate about what happened. Hopefully the FCC enquiry will uncover the facts – and it’s important that they do so, because it’s always possible that the cause of the outage is something that others in the industry need to be concerned about.

I’m only speculating, but my guess is that we are going to find that the company has not implemented best network practices in the legacy telco network. We know that CenturyLink and the other big telcos have been ignoring the legacy networks for decades. We see this all of the time when looking at the conditions of the last mile network, and we’ve always figured that the telcos were also not making the needed investments at the network core.

If this outage was caused by outdated technology and legacy network practices then such outages are likely to recur. Interestingly, CenturyLink also operates one of the more robust enterprise cloud services in the country. That business got a huge shot in the arm through the merger with Level 3, with new management saying that all of their future focus is going to be on the enterprise side of the house. I have to think that this outage didn’t much touch that network, and more likely hit only the legacy network.

One thing for sure is that this outage is making CenturyLink customers look for an alternative. A decade ago the local government in Cook County, Minnesota – the northern-most county in the state – was so frustrated by continued prolonged CenturyLink network outages that they finally built their own fiber-to-the-home network and found alternate routing into and out of the County. I talked to one service provider in Montana who said they’ve been inundated after this recent outage by businesses looking for an alternative to CenturyLink.

We have become so reliant on the Internet that major outages are unacceptable. Much of what we do every day relies on the cloud. The fact that this outage extended to cellular outages, a crash of 911 systems and the failure of credit card processing demonstrates how pervasive the network is in the background of our daily lives. It’s frightening to think that there are legacy telco networks that have been poorly maintained that can still cause these kinds of widespread problems.

I’m not sure what the fix is for this problem. The FCC supposedly washed their hands of the responsibility for broadband networks – so they might not be willing to tackle any meaningful solutions to prevent future network crashes. Ultimately the fix might be the one found by Cook County, Minnesota – communities finding their own network solutions that bypass the legacy networks.

FCC’s Recommendations to Avoid Network Outages

The FCC’s Public Safety and Homeland Security Bureau just released a list of recommended network practices. These recommendations are not a comprehensive list of good network practices, but rather are compiled by analyzing the actual network outages reported to the FCC over the last five years. Telcos are required to notify the FCC of significant network outages and every item on this list represents multiple actual network outages. It’s easy to look at some of the items on the list and think they are common sense, but obviously there are regulated telcos that had outages due to ignoring each of these network practices.

Following are some of the more interesting recommendations on the list:

Network Operators, Service Providers and Property Managers together with the Power Company and other tenants in the location, should verify that aerial power lines are not in conflict with hazards that could produce a loss of service during high winds or icy conditions. This speaks to having a regular inspection and tree trimming process to minimize damage from bad storms.

Network Operators and Property Managers should consider pre-arranging contact information and access to restoral information with local power companies. This seems like common sense, but I’ve been involved in outages where the technicians did not know how to immediately contact other utilities.

Network Operators, Service Providers and Public Safety should establish a routing plan so that in the case of lost connectivity or disaster impact affecting a Public Safety Answering Point (PSAP), 9-1-1 calls are routed to an alternate PSAP answering point. A lot of the recommendations on the FCC’s list involve 9-1-1 and involve having contingency plans in place to keep 9-1-1 working in the case of network failures.

Network Operators, Public Safety, and Property Managers should consider conducting physical site audits after a major event (e.g., weather, earthquake, auto wreck) to ensure the physical integrity and orientation of hardware has not been compromised. It’s easy to assume that sites that look undamaged after big storms are okay. But damage often doesn’t manifest as outages until days, weeks or months later.

Network Operators and Service Providers should verify both local and remote alarms and remote network element maintenance access on all new critical equipment installed in the network, before it is placed into service. I’ve seen outages where equipment was installed but the alarms were not tested. You don’t want to find out that an alarm isn’t working when it’s needed.

Network Operators, Service Providers, Public Safety and Property Managers should engage in preventative maintenance programs for network site support systems including emergency power generators, UPS, DC plant (including batteries), HVAC units, and fire suppression systems. This might easily be the biggest cause of network outages. ISPs get busy and don’t test all of the components critical to maintaining systems. A lot of outages I’ve been involved with were due to failures of minor components like fans or air conditioning compressors.

Network Operators, Service Providers, Public Safety, and Equipment Suppliers should consider the development of a vital records program to protect vital records that may be critical to restoration efforts. Today there is often software, databases and other vital records that must be restored first in order to get equipment up and functioning. Electronic records of this type need to be kept in a secure system that is separate and doesn’t rely on the network to be functioning, but that also can be accessed easily when needed.

Network Operators, Service Providers, Public Safety and Property Managers should take appropriate precautions to ensure that fuel supplies and alternate sources of power are available for critical installations in the event of major disruptions in a geographic area (e.g., hurricane, earthquake, pipeline disruption). Consider contingency contracts in advance with clear terms and conditions (e.g., Delivery time commitments, T&Cs). This lesson was driven home after the recent hurricanes, when local gasoline supplies dried up and several utilities without their own private fuel supply were stranded along with the rest of the public.
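The alternate-PSAP routing recommendation in the list above boils down to an ordered fallback plan that is agreed on before anything fails. Here is a minimal sketch of that idea, assuming a hypothetical plan table and a simple up/down status map. None of these names come from the FCC list or from any real 9-1-1 system.

```python
def route_911_call(origin_area, routing_plan, psap_is_up):
    """Return the first reachable PSAP for an area, or None if all are down.

    routing_plan: area -> PSAPs in priority order (hypothetical structure).
    psap_is_up:   PSAP -> bool health flag; unknown PSAPs count as down.
    """
    for psap in routing_plan.get(origin_area, []):
        if psap_is_up.get(psap, False):
            return psap
    return None


# Hypothetical plan: primary PSAP first, then neighbor, then statewide backup.
ROUTING_PLAN = {"county_a": ["psap_a", "psap_b", "psap_state"]}
STATUS = {"psap_a": False, "psap_b": True, "psap_state": True}

print(route_911_call("county_a", ROUTING_PLAN, STATUS))  # psap_b
```

The lookup itself is simple. What the recommendation really asks for is that the priority list exists, and has been agreed with the alternate answering points, before the outage.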

This FCC list is a great reminder that it’s always a good idea to periodically assess your disaster and outage readiness. You don’t want to discover gaps in your processes during the middle of an outage.