How Safe is Your Network?

Last week Comcast suffered a major broadband outage. The worst imaginable set of events occurred when there were two simultaneous fiber cuts on major legs of their backbone – one between Chicago and New York and one between Ashburn, Virginia and South Carolina. In case you don’t know, Ashburn is the home of the major Internet POP serving Washington DC and surrounding cities.

This is a network planner’s worst nightmare. Planners always try to build redundancy into fiber routes so that the network won’t crash from a single fiber cut. Modern backbone electronics can be set to automatically forward traffic in both directions around a ring so that service isn’t interrupted in the case of a fiber cut or failure of ring electronics somewhere along the ring. But rings using this technology can’t withstand two simultaneous cuts.
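The single-cut versus double-cut behavior of a ring can be checked with a few lines of code. What follows is only a minimal sketch: the five-node ring and the node names A through E are hypothetical, and it models nothing about real ring electronics beyond whether every node can still reach every other node.

```python
from collections import deque
from itertools import combinations

def reachable(nodes, links, start):
    """Return the set of nodes reachable from start over the given links."""
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def survives(nodes, links, cuts):
    """True if every node can still reach every other node after the cuts."""
    remaining = [link for link in links if link not in cuts]
    return reachable(nodes, remaining, nodes[0]) == set(nodes)

# Hypothetical five-node ring, for illustration only.
RING_NODES = ["A", "B", "C", "D", "E"]
RING_LINKS = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")]

# Does the ring stay whole after every possible single cut?
single_cut_ok = all(survives(RING_NODES, RING_LINKS, [link]) for link in RING_LINKS)

# Is there any pair of simultaneous cuts the ring can survive?
double_cut_ok = any(
    survives(RING_NODES, RING_LINKS, list(pair))
    for pair in combinations(RING_LINKS, 2)
)

print(single_cut_ok, double_cut_ok)
```

In this toy model every single cut leaves the ring connected, while every pair of cuts splits it into two pieces, which is the situation the ring technology can’t handle.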

What was a bit surprising to me was the failure of a large part of the Comcast network with fiber cuts that were so far apart. It seems unlikely that the company has a fiber ring that sends all Internet traffic in such a large circle. It’s more likely that the company has centralized one or more of their routing functions, such as DNS routing, in one place on the network, and that the fiber cuts isolated that key function, which would shut down their Internet product.

Redundancy is a big concern for most smaller network owners. Lack of redundancy was one of the major issues that drove Cook County, Minnesota to build their own fiber network. There is no cable provider in the county and their entire telecom network was provided by CenturyLink. Tourism is the major driver of the economy and a decade ago there was a cut in the CenturyLink fiber from Duluth that isolated the county during peak tourist season. That meant that the Internet, telephones, and cell phones didn’t work. Businesses couldn’t take credit cards, restaurants and hotels couldn’t take reservations, and family members on vacation couldn’t communicate with each other. This prompted the County to pursue a fiber network that included creating redundant routes for traffic in and out of the county. The network was ultimately built and operated by the local power cooperative, and today there is a greatly reduced chance of a major telecom outage.

Even where there is redundancy there can be outages. One of my clients operates a large statewide fiber network that stretches for hundreds of miles. They followed good engineering practice and scheduled an upgrade of the ring electronics after midnight. While one of the nodes was being upgraded the fiber was cut on a different part of the network when a truck knocked down a telephone pole, and the whole network went dark. Fiber cuts in the middle of the night are somewhat rare, but they happen.

Whenever possible fiber engineers also build redundancy into a local fiber network. They might build a ring connecting the fiber huts serving neighborhoods so that the network keeps functioning with a cut along the ring. It’s nearly impossible to design such redundancy in the last-mile loop, but fiber cuts in the last mile only isolate homes associated with the specific fiber.

But just like with Comcast and Cook County, many local networks have a hard time creating redundancy outside of their immediate network. In geographically remote areas it’s often impossible to find a second secure route to the Internet, leaving a network, or whole communities, vulnerable to a fiber cut somewhere outside their area.

Unfortunately, it’s getting easier for fiber providers to run into the same kind of issue that hit Comcast. We are migrating numerous functions to the cloud and having redundant fiber routing does not always mean that there is an automatic redundant connection made to a key cloud server. I have clients that are now relying on the cloud for all sorts of services such as VoIP, cable TV programming, DNS routing for the Internet, the use of cloud-based operational software, etc. These ISPs may have a redundant path to the Internet, but still have only one path to get to the company providing their cable TV signal or DNS routing.
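One way to look for this kind of hidden weak spot is to list, for each cloud service, the links whose failure alone would cut the ISP off from it. The sketch below assumes a made-up topology: the names ISP, TransitA, TransitB, Internet, and DNSVendor are hypothetical and stand in for whatever paths a real network has.

```python
def connected(links, source, target):
    """True if target can be reached from source over the given links."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return target in seen

def single_points_of_failure(links, source, target):
    """Return every link whose loss alone disconnects source from target."""
    return [
        cut for cut in links
        if not connected([link for link in links if link != cut], source, target)
    ]

# Hypothetical topology: two diverse routes to the Internet,
# but the DNS vendor hangs off a single link.
LINKS = [
    ("ISP", "TransitA"),
    ("ISP", "TransitB"),
    ("TransitA", "Internet"),
    ("TransitB", "Internet"),
    ("TransitA", "DNSVendor"),
]

internet_risks = single_points_of_failure(LINKS, "ISP", "Internet")
dns_risks = single_points_of_failure(LINKS, "ISP", "DNSVendor")

print("Internet:", internet_risks)
print("DNS vendor:", dns_risks)
```

In this example no single cut takes down the path to the Internet, but the one link into the DNS vendor shows up as a single point of failure even though the ISP itself is on redundant routes.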

The Comcast outage should prompt companies to look again at redundancy. Don’t assume that every function in the cloud is redundant even if you have a redundant connection to the Internet.

The Need for Fiber Redundancy

I just read a short article that mentioned that 30,000 customers in Corvallis, Oregon lost broadband and cable service when a car struck a utility pole and cut a fiber. It took Comcast 23 hours to restore service. There is nothing unusual about this outage and such outages happen every day across the country. I’m not even sure why this incident made the news other than that the number of customers that lost service from a single incident was larger than normal.

But this incident points to the issue of network redundancy – the ability of a network to keep working after a fiber gets cut. Since broadband is now becoming a necessity and not just a nice-to-have thing we are going to be hearing a lot more about redundancy in the future.

Lack of redundancy can strike anywhere, in big cities or small – but the effects in rural areas can be incredibly devastating. A decade ago I worked with Cook County, Minnesota, which is a county in the far north of the state. The economy of the county is driven by recreation and they were interested in getting better broadband. But what drove them to get serious about finding a solution was an incident that knocked out broadband and telephone to the entire county for several days. The County has now built their own fiber network that includes diverse, redundant routes to the rest of the world.

We used to have this same concern about the telephone networks and smaller towns often got isolated from making or receiving calls when there was a cable cut. But as cellphones have become prevalent the cries about losing landline telephone have diminished. But the cries about lack of redundancy are back after communities suffer the kinds of outages just experienced by Corvallis. Local officials and the public want to know why our networks can’t be protected against these kinds of outages.

The simple answer is money. It often means building more fiber, and at a minimum it takes a lot more expensive electronics to create network redundancy. The way that redundancy works is simple – there must be separate fiber or electronic paths to provide service to an area in order to provide two broadband feeds. This can be created in two ways. On larger networks it’s created with fiber rings. In a ring configuration two sets of electronics are used to send every signal in both directions around the ring. In that configuration, when a fiber is cut the signal is still being received from the opposite direction. The other (and even more expensive) way to create diversity is to lay two separate fiber networks to reach a given location.

Route redundancy tends to diminish as a network gets closer to customers. In the US we have many different types of fiber networks. The long-haul fiber networks that connect the NFL cities are largely on rings. From the major cities there are then regional fiber networks that are built to reach surrounding communities. Some of these networks are also on fiber rings, but a surprising number are not and face the same kind of outages that Cook County had. Finally, there are local networks built of fiber, telephone copper, or coaxial cable that are built to get to customers. It’s rare to see route diversity at the local level.

But redundancy can be added anywhere in the network, at a cost. For example, it is not unusual for large businesses to seek local route diversity. They most often achieve this by buying broadband from more than one provider. But sometimes this doesn’t work if those providers are sharing the same poles to reach the business. I’ve also seen fiber providers create a local ring for large businesses willing to pay the high price for redundancy. But most of the last mile that we all live and work on has no protection. We are always one local disaster away from losing service, as happened in Corvallis.
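The shared-pole problem is easy to check once the physical routes are written down. Here is a minimal sketch, assuming each provider’s route to the building can be listed as named physical segments; the segment names and both routes are invented for illustration.

```python
def shared_segments(route_a, route_b):
    """Return the physical segments that appear in both routes."""
    return sorted(set(route_a) & set(route_b))

# Hypothetical physical routes from each provider to the same business.
provider_1_route = ["pole-line-main-st", "pole-line-oak-ave", "conduit-1"]
provider_2_route = ["conduit-7", "pole-line-oak-ave"]

shared = shared_segments(provider_1_route, provider_2_route)

# Two providers only give real diversity if their routes share nothing.
truly_diverse = not shared

print("Shared segments:", shared)
print("Truly diverse:", truly_diverse)
```

Here the two providers both ride the same pole line on one street, so one downed pole there would take out both connections even though the business is paying two bills.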

But the Corvallis outage was not an outage where a cut wire knocked out a dozen homes on a street. The fiber that got cut was obviously one that was being used to provide coverage to a wide area. A lot of my clients would not design a network where an outage could affect so many customers. If they served a town the size of Corvallis they would build some local rings to significantly reduce the number of customers that could be knocked out by an outage.

But the big ISPs like Comcast have taken shortcuts over the years and they have not spent the money to build local rings. I am not singling out Comcast here, because I think this is largely true of all of the big ISPs.

The consequences of a fiber cut like the one in Corvallis are huge. That outage had to include numerous businesses that lost their broadband connection for a day – and many businesses today cannot function without broadband. Businesses that are run out of homes lost service. And the cut disrupted homework, training, shopping, medical monitoring, security alarms, banking – you name it – for 30,000 homes and businesses.

There is no easy fix for this, but as broadband continues to become essential in our lives these kinds of outages are going to become less acceptable. We are going to start to hear people, businesses, and local governments shouting for better network redundancy, just as Cook County did a decade ago. And that clamor is going to drive some of these communities to seek their own fiber solution to protect from the economic devastation that can come with even moderate network outages. And to some degree, if this happens the carriers will have brought this upon themselves due to pinching pennies and not making redundancy a higher priority in network design.