At the end of December CenturyLink had a widespread network outage that lasted over two days. The outage disrupted voice and broadband service across the company’s wide service territory.
Probably the most alarming aspect of the outage is that it knocked out the 911 systems in parts of fourteen states. It was reported that calls to 911 might get a busy signal or a recording saying that "all circuits are busy". In other cases, 911 calls were routed to the wrong 911 center. Some jurisdictions responded to the 911 problems by sending out emergency text messages to citizens providing alternate telephone numbers to dial during an emergency. The 911 service outages prompted FCC Chairman Ajit Pai to call CenturyLink and to open a formal investigation into the outage.
I talked last week to a resident of a small town in Montana who said that the outage was locally devastating. Credit cards wouldn’t work for most of the businesses in town, including at gas stations. Businesses that rely on software in the cloud for daily operations, like hotels, were unable to function. Bank ATMs weren’t working. Customers with CenturyLink landlines had spotty service and mostly could not make or receive phone calls. Worse yet, cellular service in the area largely died, meaning that CenturyLink must have been supplying the broadband circuits supporting the cellular towers.
CenturyLink reported that the outage was caused by a faulty networking management card in a Colorado data center that was “propagating invalid frame packets across devices”. It took the company a long time to isolate the problem, and the final fix involved rebooting much of the network electronics.
Every engineer I’ve spoken to about this says that in today’s world it’s hard to believe that it would take two days to isolate and fix a network problem caused by a faulty card. Most network companies operate a system of alarms that instantly notifies them when any device or card is having problems. Further, complex networks today are generally built with significant redundancy that allows troubled components to be isolated in order to stop the kind of cascading outage that occurred in this case. The engineers all said that it’s almost inconceivable for a single component like a card in a modern network to cause such a huge problem. While network centralization can save money, few companies route their whole network through choke points – there are a dozen different strategies to create redundancy and protect against this kind of outage.
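The alarm-and-isolate pattern the engineers describe can be sketched in a few lines. This is a toy illustration only – the card names, counters, and threshold are all hypothetical, not anything from CenturyLink's actual systems – but it shows the basic idea: poll per-card error counters and take a card out of service the moment its invalid-frame count spikes, before bad frames can cascade across the network.

```python
# Toy sketch of device-level alarming and isolation. All names and
# thresholds here are invented for illustration; real carrier-grade
# monitoring (e.g. SNMP polling plus an alarm correlation system) is
# far more involved.
from dataclasses import dataclass


@dataclass
class Card:
    name: str
    invalid_frames: int = 0   # error counter read from the device
    isolated: bool = False    # whether we've pulled it from service


class Monitor:
    def __init__(self, threshold: int = 100):
        self.threshold = threshold
        self.alarms = []

    def poll(self, cards):
        """Check each card's error counter; isolate anything over threshold."""
        for card in cards:
            if not card.isolated and card.invalid_frames > self.threshold:
                card.isolated = True   # remove the card from the forwarding path
                self.alarms.append(
                    f"ALARM: {card.name} isolated "
                    f"({card.invalid_frames} invalid frames)")


cards = [Card("slot-1"), Card("slot-2", invalid_frames=5000)]
mon = Monitor()
mon.poll(cards)
print(mon.alarms)         # the faulty card is flagged
print(cards[1].isolated)  # True: slot-2 is out of service
```

The point of the sketch is that when this loop works, the blast radius of one bad card is a single alarm and a failover to a redundant path – not a two-day, multi-state outage.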
Obviously none of us knows any of the facts beyond the short notifications issued by CenturyLink at the end of the outage, so we can only speculate about what happened. Hopefully the FCC inquiry will uncover the facts – and it’s important that it does, because it’s always possible that the cause of the outage is something that others in the industry need to be concerned about.
I’m only speculating, but my guess is that we are going to find that the company has not implemented best network practices in the legacy telco network. We know that CenturyLink and the other big telcos have been ignoring the legacy networks for decades. We see this all of the time when looking at the conditions of the last mile network, and we’ve always figured that the telcos were also not making the needed investments at the network core.
If this outage was caused by outdated technology and legacy network practices, then such outages are likely to recur. Interestingly, CenturyLink also operates one of the more robust enterprise cloud services in the country. That business got a huge shot in the arm through the merger with Level 3, with new management saying that all of their future focus is going to be on the enterprise side of the house. I have to think that this outage didn’t touch that network much and more likely hit the legacy network.
One thing is for sure: this outage is making CenturyLink customers look for an alternative. A decade ago the local government in Cook County, Minnesota – the northernmost county in the state – was so frustrated by continued prolonged CenturyLink network outages that it finally built its own fiber-to-the-home network and found alternate routing into and out of the county. I talked to one service provider in Montana who said they’ve been inundated after this recent outage by businesses looking for an alternative to CenturyLink.
We have become so reliant on the Internet that major outages are unacceptable. Much of what we do every day relies on the cloud. The fact that this outage extended to cellular outages, a crash of 911 systems and the failure of credit card processing demonstrates how pervasive the network is in the background of our daily lives. It’s frightening to think that there are legacy telco networks that have been poorly maintained that can still cause these kinds of widespread problems.
I’m not sure what the fix is for this problem. The FCC has supposedly washed its hands of responsibility for broadband networks – so it might not be willing to tackle any meaningful solutions to prevent future network crashes. Ultimately the fix might be the one found by Cook County, Minnesota – communities finding their own network solutions that bypass the legacy networks.
This goes to show two things that are blatantly obvious to anyone with common sense, and it coincides with the troubles of PG&E in California:
We CANNOT rely on private corporations that are beholden to shareholders who put profits ahead of the safety of the public.
These companies are way too large and need to be broken up into much smaller entities to better serve their customers.
Doug: This outage impacted a great many customers outside of the CenturyLink legacy footprint (former U S WEST) who didn’t know they had any reliance on CenturyLink services. This was due to services they did subscribe to – e.g., hosted VoIP providers and POS terminal vendors with cloud connections that were reliant on CenturyLink. There were significant impacts to enterprise customers as well as small business and consumer customers in rural areas. It will be interesting to see what the investigation finds.
Hearing the phrase “propagating invalid frames….” makes me wonder about two possibilities. Possibility 1 is that a router was sending out invalid (but correctly formatted) route information and the rest of the network was sending traffic into oblivion. Possibility 2 is that a router was telling the rest of the network it was fine, but in fact was not and traffic was not being forwarded by the bad router.
Both of these cases are failures that would very likely not cause the problem device to set an alarm.
This is where a “super extreme amazing wizard” level of knowledge of the network, its protocols, and network test equipment is needed by the troubleshooters to correct the problem.
I think it took two days to fix the problem because the network was unfamiliar to the folks troubleshooting it. This could be a new network that is unfamiliar to the tech staff, or a legacy network where all the “oldsters” have left…