Big Regional Network Outages

T-Mobile had a major network outage last week that cut off some voice calls and most texting for nearly a whole day. The company’s explanation of the outage was provided by Neville Ray, the president of technology.

The trigger event is known to be a leased fiber circuit failure from a third party provider in the Southeast. This is something that happens on every mobile network, so we’ve worked with our vendors to build redundancy and resiliency to make sure that these types of circuit failures don’t affect customers. This redundancy failed us and resulted in an overload situation that was then compounded by other factors. This overload resulted in an IP traffic storm that spread from the Southeast to create significant capacity issues across the IMS (IP multimedia Subsystem) core network that supports VoLTE calls.

In plain English, the electronics failed on a leased circuit, and then the back-up circuit also failed. This then caused a cascade that brought down a large part of the T-Mobile network.

You may recall that something similar happened to CenturyLink about two years ago. At the time the company blamed the outage on a bad circuit card in Denver that somehow cascaded to bring down a large swath of fiber networks in the West, including numerous 911 centers. Since that outage, there have been numerous regional outages, which is one of the reasons that Project THOR recently launched in Colorado – the cities in that region could no longer tolerate the recurring multi-hour or even day-long regional network outages.

Having electronics fail is a somewhat common event. This is particularly true on circuits provided by the big carriers, which tend to push the electronics to the max and keep equipment running to the last possible moment of its useful life. Anybody visiting a major telecom hub would likely be aghast at the age of some of the electronics still being used to transmit voice and data traffic.

I can recall two of my clients that have had similar experiences in the last few years. They had a leased circuit fail and then saw the redundant path fail as well. In both cases, it turns out that the culprit was the provider of the leased circuits, which did not provide true redundancy. Although my clients had paid for redundancy, the carrier had sold them primary and backup circuits that shared some of the same electronics at key points in the network – and when those key points failed their whole network went down.
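For anybody who wants to check their own leased circuits, the idea of true redundancy can be sketched in a few lines of Python. This is only an illustration: the element names and the `shared_risk` helper are made up for the example, and it assumes the carrier will give you an ordered list of the equipment each circuit passes through, which is often the hard part.

```python
def shared_risk(primary, backup):
    """Return the intermediate elements that both paths pass through, sorted."""
    # The two endpoints are shared by definition, so only compare
    # the equipment that sits between them.
    return sorted(set(primary[1:-1]) & set(backup[1:-1]))


# Hypothetical circuit inventories, listed in order from one end to the other.
primary_path = ["site-a", "mux-7", "hub-central", "site-b"]
backup_path = ["site-a", "mux-9", "hub-central", "site-b"]

overlap = shared_risk(primary_path, backup_path)
if overlap:
    print("Not truly redundant, both paths depend on:", ", ".join(overlap))
else:
    print("No shared intermediate equipment found")
```

Any element that shows up in the overlap is a single point of failure for both circuits, which is exactly what bit my two clients.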

However, what is unusual about the two big carrier outages is that the outages somehow cascaded into big regional outages. That was largely unheard of a decade ago. This reminds me more of what we saw in the past in the power grid, when power outages in one town could cascade over large areas. The power companies have been trying to remedy this situation by breaking the power grid into smaller regional networks and putting in protection so that failures can’t overwhelm the interfaces between regional networks. In essence, the power companies have been trying to introduce some of the good lessons learned over time by the big telecom companies.

But it seems that the big telecom carriers are going in the opposite direction. I talked to several retired telecom network engineers and they all made the same guess about why we are seeing big regional outages. The telecom network used to be comprised of hundreds of regional hubs. Each hub had its own staff and operations and it was physically impossible for a problem from one hub to somehow take down a neighboring hub. The worst that would happen is that routes between hubs could go dark, but the problem never moved past the original hub.

The big telcos have all had huge numbers of layoffs over the last decade, and those purges have emptied out the big companies of the technicians that built and understood the networks. Meanwhile, the companies are trying to find efficiencies to get by with smaller staffing. It appears that the efficiencies that have been found are to introduce network solutions that cover large areas or even the whole nation. This means that the identical software and technicians are now being used to control giant swaths of the network. This homogenization and central control of a network means that a failure in any one place in the network might cascade into a larger problem if the centralized software and/or technicians react improperly to a local outage. It’s likely that the big outages we’re starting to routinely see are caused by a combination of the failure of people and software systems.

A few decades ago we had somewhat regular power outages that affected multiple states. At the prodding of the government, the power companies undertook a nationwide effort to stop cascading outages, and in doing so they effectively emulated the old telecom network world. They ended the ability for an electric grid to automatically interface with neighboring grids, and the last major power outage that wasn’t due to weather happened in the West in 2011.

I’ve seen absolutely no regulatory recognition of the major telecom outages we’ve been seeing. Without the FCC pushing the big telcos, it’s highly likely nothing will change. It’s frustrating to watch the telecom networks deteriorate at the same time that electric companies got together and fixed their issues.