When Broadband Doesn’t Work

I recently lost my company email service. We have been using Rackspace to host our email for a decade, and we loved the customer service. The company responded immediately to our questions and was one of our most satisfactory tech vendors.

But then our email went dead, and we got the worst imaginable response. Rackspace posted about every twelve hours for days letting its millions of users know that it was investigating the issue, but there were no explanations or communications beyond the periodic uninformative web postings.

My blog today is not to complain about Rackspace, although I would rate their responsiveness to the outage as a one out of ten. Losing email is no fun, but we muddled through. I can imagine how devastating this incident was for retail businesses that lost email during the heart of the Christmas shopping season.

Losing email reminded me of how reliant businesses are on technology and on the web platforms that underlie our businesses. One part of my consulting practice is conducting surveys and interviews of businesses across the country for communities that are trying to understand the broadband environment. The number one issue I hear from businesses is how devastating it is to lose a broadband connection. A lot of businesses basically go dead when losing broadband.

I don’t think the average person realizes how reliant businesses are on broadband. People are not surprised when broadband shuts down consultants, engineers, or architects who rely on broadband to exchange data and files on projects. But the loss of broadband today can shut down businesses that the public doesn’t think of as broadband intensive.

An example I ran into recently was a sports bar. The business had a multiple-day broadband outage that the ISP blamed on a cable cut. The business relied on broadband for a huge number of functions. The bar lost its automated reservation system. The bar made a lot of money from arcade games that shut down because they were controlled in the cloud. The bar could no longer take credit card payments. It lost its automated accounting system that logged every transaction into the books. The online payroll system was gone, and there was no way to easily track hours, or tips, or pay employees. It became difficult to order supplies since the ordering systems for food and drink had largely been automated. The business lost access to the online banking it used every day. And customers lost the free WiFi.

One of the goals that most communities have is to make sure that the business community has good broadband. That used to mean making sure that every business could buy broadband from at least one fast ISP. But I’ve talked to businesses of all sizes that would gladly buy broadband from two ISPs to protect against losing connectivity. A decade ago, only the largest businesses were concerned about redundancy – today, a large percentage of businesses want a backup broadband connection using a different physical path. Businesses have heartily adopted technology but, by doing so, are vulnerable in ways they never were before. A decade ago, this bar would not have had all of these functions online.
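The value of a second connection is easy to put rough numbers on. The sketch below is only an illustration: the 99.5% availability figure is a made-up assumption, not a measurement of any ISP, and the math assumes the two connections fail independently. That assumption only holds if the backup really does use a different physical path.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def expected_downtime_hours(availability: float) -> float:
    """Hours per year a connection with this availability is expected to be down."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availabilities) -> float:
    """Availability of several connections used as backups for each other.

    Assumes failures are independent: the business is offline only
    when every connection is down at the same time.
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


if __name__ == "__main__":
    single = 0.995  # hypothetical availability of one ISP connection
    both = combined_availability([single, single])
    print(f"one ISP:  {expected_downtime_hours(single):.1f} hours down per year")
    print(f"two ISPs: {expected_downtime_hours(both) * 60:.0f} minutes down per year")
```

With those assumed numbers, one connection is down about 44 hours a year, while two independent connections are both down for only about 13 minutes. If both ISPs ride the same pole line or the same conduit, a single cut takes out both and the second connection buys almost nothing.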

My email outage highlights the other kind of outage that worries businesses. Every one of the functions utilized by the bar is provided by a different online vendor. Every online system a business uses is only as good as the ability of the underlying tech vendor to keep the systems running.

There are too many ways for a business to lose functionality. We’ve seen widespread broadband outages that cascade across the country and temporarily knock the biggest tech companies offline. Every system that this bar uses is susceptible to the underlying vendor having a system crash, being hacked, or getting hit with malware. Technology is great, and it has made businesses far more efficient – until it stops working.

I’ll probably never know why I lost my email. But this reminded me that I can’t fully take any online technology for granted – everything is going to fail periodically, whether due to losing broadband or due to the underlying tech company having an issue. It’s not a comfortable feeling knowing that things can go bad instantly when you least expect it.

Embracing Resiliency

For years the industry used the word redundancy when talking about how we protected our networks. The primary aspects of redundancy are having multiple fiber routes in place so that areas don’t become isolated if a fiber is cut and having enough spare electronics to quickly recover from problems.

But in recent years, we’ve started to talk about resiliency, which encompasses redundancy but means a whole lot more. Resiliency means taking proactive steps to prepare against reasonably expected problems of all sorts. There are many examples of how network owners are thinking in terms of resiliency.

For example, we’ve recently started seeing prolonged power outages or brownouts in Texas, parts of California, and elsewhere. This is due to a number of reasons like aging electrical grids, hotter temperatures putting stress on local electric networks or worsening winter ice storms. We’ve seen fiber network owners deal with this problem in several ways. One is to design networks with fewer powered locations. This is one of the biggest benefits of fiber PON networks compared to active electronics – but this can be applied to any network design with good planning. Fewer powered nodes mean fewer sites that need backup power and generators.

Another way to increase resiliency is the increased use of solar power. For small devices needing power, solar is a good alternative to wiring to the grid. But even for larger devices and locations, a good solar array can provide enough power to keep batteries charged.

One of the newest problems hitting networks in many parts of the country is the increase in average temperature and an increase in hot days each year. There are some clever solutions to the heat problem. One is to use reflective paint on huts and other devices to keep heat out – this can be very effective for larger air-conditioned huts. Another strategy for smaller network elements like cabinets is to install shade over the unit by deploying what looks like a sail. In some extreme cases, we’re seeing new kinds of cabinets on the market that come with air-conditioned doors to hold down the heat inside of a unit.

Much of the west has seen a lot more fires in recent years. One of the most commonsense strategies being used is to severely cut back on vegetation near huts and cabinets to decrease the vulnerability to fire damage. I also have clients that have become more aggressive about keeping up with tree-trimming programs in areas with aerial wires.

We’ve also seen larger and more frequent floods in recent decades, including in areas that never had bad floods before. The most immediate step to protect against flooding is to make sure to have no electronics in basements or even on first floors if avoidable. I’ve not seen it yet, but I expect more network owners will consider a step taken for many years by telcos located in hurricane areas, which is putting huts and cabinets on stilts to keep them out of range of floodwaters.

I’ve also been having a lot more discussions with clients in recent years about burying networks. Most network builders have elected the lowest-cost option when building a network, and this has often meant putting fiber on poles. But when considering the total life cycle cost of operating the network, it’s becoming clear that in many cases buried fiber is the lower-cost option. I have one client that lost a new fiber network to fires last year and is replacing all routes with buried fiber even though the cost is significantly higher.

Another aspect of resiliency that is becoming more important is to have a mutual aid plan – to be part of a group that will respond when there is a network disaster. This means providing aid to others when there are problems, but having a swarm of technicians to help fix problems in your own network can be a lifesaver.

Relying on Cellular Broadband (Part II)

One of my recent blogs talked about the reliability of cellular data as a substitute for wireline broadband. Almost immediately I had an example of a wireless outage shoved in my face. I was in Phoenix at an all-day meeting. When I left at about 4:00 I tried my Uber app and it wasn’t working. The app cycled through but would not find a driver. This was inconvenient because I was standing in the 100-degree sun, so I immediately looked for shade. I tried a few more times. Giving up on Uber I tried Lyft and got the same results. Now I’m figuring a data outage, but since Android phones are sometimes squirrelly, to be safe I rebooted my phone.

That didn’t work and I was standing waiting in hot weather to get a ride to my hotel, which was 20 miles away. Uber, Lyft and taxis were out of the question. Luckily my voice was still working, so I called my wife who ordered an Uber for me. But had she not been available I’m not sure how I would have gotten to my hotel. I’m picturing the huge number of other people this also inconvenienced. How many people landed at an airport and couldn’t get a ride? How many people were driving and suddenly lost access to their mapping software? How many businessmen were traveling and couldn’t read or respond to email?

When I got back to a landline connection I looked at the AT&T outage website and it was lit up like a Christmas tree. It looked like the east coast was totally out, but almost every other NFL city also showed an outage. Phoenix, which I knew to be out, didn’t even show on the map as having a problem, and it’s possible that the whole nationwide AT&T network had a data outage. A few days later I checked and AT&T had said nothing about the cause of the outage. Their outage website shows a 17-hour outage that day, without specifying the extent or the reason for the outage.

There is obviously something shoddy in the AT&T national network if an event of any kind can knock out the whole nationwide data network for that long. It’s hard to believe that the company would not have redundant backup for every critical system that is needed to keep the network functioning. There are only a few possible explanations. Possibly some critical component of data routing failed, such as the DNS system that cellphones rely on to look up Internet destinations. The company might also have gone too far with software defined networking and created some new points of failure that could affect the whole network. Or the company had a major cut in the fiber that feeds the site of one of those key network systems. There is no excuse for any of these possibilities, and a company with nearly 160 million customers ought to have redundancy for every critical component of its wireless network.

I contrast this to the hundreds of companies I know with landline broadband networks. All of my clients worry about total network failure and they work hard to avoid it. Unless they are geographically isolated, most of my clients have redundant routes between their network and the Internet. They generally have redundancy of key routers and switches to keep critical functions operational. Most of my clients have almost no outages that are not caused in the last mile. Local broadband networks are always susceptible to cable cuts in the last mile. But those cuts, by design, only knock out customers who are ‘downstream’ from the cut. It’s becoming extremely rare for my clients to have a total network outage, and if they do they usually take steps to stop it from happening a second time.

The press is in love with wireless right now and there are dozens of articles every month declaring how wireless is our future. Cellphones are going to become blazingly fast and 5G will fill in the gaps where cellular isn’t good enough. I’ve written enough blogs about this that you probably know that I think we are still a number of years away from seeing such wireless technologies.

But this outage makes me wonder about whether people will ever fully trust wireless technologies if they are operated by the big ISPs. The big ISPs are cavalier about network outages and they seem to suppose that their customers will just accept them. If my ISP clients had a 17-hour outage they would have taken steps after the outage to make amends with customers. They would have explained the cause of the outage and talked about their plans to make sure that it didn’t happen again. They likely would have given every customer a day’s credit on their bill for the downtime.

It astounds me that something like this outage could happen. If I were the head of AT&T, heads would have rolled after this was fixed. There is no excuse for a company with a $23 billion annual capital budget to have a network that is vulnerable to a widespread outage. The only reason the company could have such outages is that they don’t place value on redundancy. Until the big ISPs can make their wireless networks as reliable as landline networks I will never consider using them for broadband. I can’t see customers sticking with a 5G network that has a 17-hour outage. Broadband is now critical to many of us and I expect outages to be measured in minutes, not in hours or days.
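To put "minutes, not hours" in perspective, here is a small back-of-the-envelope sketch. It assumes, purely for illustration, that the 17-hour outage was the only downtime in a full year, and it compares that against the "five nines" (99.999%) target that is commonly quoted for carrier-grade networks. The function names are my own.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def availability_from_downtime(downtime_hours: float,
                               period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the network was up."""
    return 1.0 - downtime_hours / period_hours


def downtime_budget_minutes(availability: float,
                            period_hours: float = HOURS_PER_YEAR) -> float:
    """Minutes of downtime allowed in the period at a given availability target."""
    return (1.0 - availability) * period_hours * 60


if __name__ == "__main__":
    outage_hours = 17  # the outage length shown on the AT&T outage website
    print(f"availability for the year: {availability_from_downtime(outage_hours):.3%}")
    print(f"five-nines budget: {downtime_budget_minutes(0.99999):.2f} minutes per year")
```

Under that assumption, one 17-hour outage works out to roughly 99.8% availability for the year, while a five-nines network gets a little over five minutes of downtime in the same period.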

Why No Redundancy?

I usually load a blog every morning between 7:00 and 8:00 eastern. But today my Internet was down. I first noticed this when I woke up around 2:30. Don’t even ask why I was up then, but that is not unusual for me. My Internet outage was also not that unusual. I have Comcast as my ISP and they seem to go out a few times per month. I’ve always given them the benefit of the doubt and assumed that a few of the late night outages are due to routine network maintenance.

So I grab my cell phone to turn on my mobile hot spot. Most of the outages here last an hour or two and that is the easy way to get through outages. But bam! – AT&T is out too. I have no bars on my LTE network. So my first thought is cable cut. The only realistic way that both carriers go out in this area is if the whole area is isolated by a downed fiber.

I check back and hit a few web sites and I find at about 3:00 that I have a very slow Facebook connection, but that it’s working. I can get Facebook updates and I can post to Facebook, but none of the links outside of Facebook work. And nothing else seems to be working. This tells me that Facebook has a peering arrangement of some kind with Comcast and must come into the area by a different fiber than the one that was cut.

So I start looking around. The first thing I find is that Netflix is working normally, just as fast as ever. So now I have a slow Facebook feed and fast Netflix and still nothing else. After a while Google starts working. It wasn’t working earlier, but it seems that I can search Google, although none of the links work. This tells me that Comcast peers with Google but that the Google links use the open Internet. I force a few links back through the Google URL just to see if that will work and I find that I can read links through Google. No other search engines seem to be working.

The only other thing I found that worked was the NFL highlight films, and I was able to see the walk-off blocked punt in last night’s Ravens – Browns game. It’s highly unlikely that the NFL has a peering relationship with anybody and they must have a deal with Google.

So now I know a bit about the Comcast Network. They peer with Netflix, Google and Facebook – and since these are three of the largest traffic producers on the web that is not unusual. And at least in my area the peering comes into the area on a different fiber path than the normal Internet backbone that has knocked out both Comcast and AT&T.

But I also now know that in my area Comcast has no redundancy in the network. I find this interesting because most of my small clients insist on having redundancy in their networks. Of course, most of them operate in rural areas that are used to getting isolated when cables get cut – it happened for many years with telephone lines and now with the Internet.

But I can see that Comcast hasn’t bothered creating a redundant network. This particular outage went for 7 or 8 hours which is a bit long, so this must be from a major fiber cut. But I look at a map of Florida and it is a natural candidate to have rings. Everybody lives on one of the two coasts and there are several major east-west connector roads. This makes for natural rings. And if our backbone was on a ring we wouldn’t even know there was an outage. But with all of their billions of dollars of profits, neither Comcast nor AT&T wireless cares enough about redundancy to have put our area backbone on a ring.

And I also don’t understand why they don’t have automatic alternate routing to bypass a fiber cut. If Netflix, Facebook and Google were connected everything else could have been routed along those same other fibers. That is something else my clients would have done to minimize outages for customers.
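The ring argument is easy to show with a toy model. The sketch below is not anybody's real network: the node names ("pop", "town_a" and so on) are hypothetical, and it only checks whether each node can still reach the Internet handoff after a single fiber segment is cut. It compares a backbone built as a straight spur with the same backbone closed into a ring.

```python
from collections import deque


def reachable(edges, source):
    """Return the set of nodes reachable from source over undirected fiber segments."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def isolated_by_cut(edges, cut, source):
    """Nodes that lose their path to source when one segment is cut."""
    nodes = {n for edge in edges for n in edge}
    remaining = [e for e in edges if set(e) != set(cut)]
    return nodes - reachable(remaining, source)


if __name__ == "__main__":
    # Hypothetical backbone: one Internet handoff ("pop") feeding three towns.
    spur = [("pop", "town_a"), ("town_a", "town_b"), ("town_b", "town_c")]
    ring = spur + [("town_c", "pop")]  # close the loop back to the handoff

    cut = ("town_a", "town_b")
    print("spur, isolated:", sorted(isolated_by_cut(spur, cut, "pop")))
    print("ring, isolated:", sorted(isolated_by_cut(ring, cut, "pop")))
```

On the spur, cutting the segment between town_a and town_b strands everything past the cut. On the ring, the same cut strands nobody, because traffic can go around the other way. A real network also needs the electronics and routing protocols to make that switch automatically, which this sketch does not model.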

This is honestly unconscionable and perhaps it’s time we start clamoring to the FCC to require the big companies to plow some of their profits back into a better network. This same sort of outage happened a few times to the power grid a decade ago and the federal response was that the electric companies had to come up with a better network that could stop rolling outages. I know some of my clients that are electric companies spent some significant dollars towards that effort, and it seems to have worked. Considering how important the Internet has become for our daily lives and for commerce perhaps it’s time for the FCC to do the same thing.

Our Internet Infrastructure

Paul Barford, a professor at the University of Wisconsin, led an effort to map the major routes used by the Internet in the U.S. He believes that making the map widely known can help us plan better to make the Internet less susceptible to natural disasters, accidents, or intentional sabotage.

I can remember two times when the Internet backbone took a serious hit in this country and they were both in 2001. First, a 60-car CSX train derailed in the Howard Street tunnel in Baltimore, and the resulting fire melted a lot of fiber cables that were on the east coast north-south route. Then later that year on 9/11, the twin towers of the World Trade Center collapsed, taking out the main carrier hotel and data center in Manhattan.

And there is no reason to think that we won’t have more disasters. When I look at the map, my first reaction is how few routes there are in the main backbone.

[Figure: map of the Internet’s backbone]

Professor Barford hopes the map will spur conversation about the need for more route diversity. The Department of Homeland Security agrees and is publishing the map and making the details of the routes available to government, public, and private researchers.

Some might say that publishing such a map makes us more vulnerable. I don’t think it does. Everybody in the industry knows the addresses of the main Internet POPs since those are the end points of the data connections that ISPs buy to connect to the Internet. And I didn’t really need this map to know that the major routes of fiber mostly follow the Interstate highways. In Florida, where I live, there is a route on I-95 on one side of the state and I-75 on the other with a spur to Orlando. I doubt that anybody here in the industry didn’t already know that.

The one thing that strikes me about the map is that once you get off the major big-city routes, many of the smaller US markets have only one route into and out of their hub; it doesn’t look that hard to isolate some markets with a couple of fiber cuts. I know that some of the carriers involved in the backbone have contingency plans that don’t show up on this map, and there are other fiber routes that can pick up the slack fairly soon after a major Internet outage in most places.
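Spotting those single-route markets is a classic graph problem: a fiber segment whose loss disconnects part of the network is called a bridge. The sketch below uses the standard depth-first-search approach to finding bridges. The tiny network in it is invented for illustration and has nothing to do with the actual routes on the map.

```python
def find_bridges(edges):
    """Return the segments whose loss would disconnect part of the network.

    Standard DFS low-link method: a tree edge (u, v) is a bridge when
    nothing in v's subtree has a back edge reaching u or anything above u.
    Recursive, so intended for small illustrative graphs.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    disc = {}   # discovery order of each node
    low = {}    # earliest discovery order reachable from the node's subtree
    bridges = []
    counter = 0

    def dfs(node, parent):
        nonlocal counter
        disc[node] = low[node] = counter
        counter += 1
        skipped_parent = False
        for nxt in adj[node]:
            if nxt == parent and not skipped_parent:
                # Skip the edge we arrived on once; a second parallel
                # route to the parent still counts as a back edge.
                skipped_parent = True
                continue
            if nxt in disc:
                low[node] = min(low[node], disc[nxt])
            else:
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                if low[nxt] > disc[node]:
                    bridges.append(tuple(sorted((node, nxt))))

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return sorted(bridges)


if __name__ == "__main__":
    # Hypothetical network: three big hubs on a ring, plus a spur
    # running out to a smaller market and then on to an even smaller one.
    network = [
        ("hub1", "hub2"), ("hub2", "hub3"), ("hub3", "hub1"),
        ("hub3", "town"), ("town", "village"),
    ]
    for a, b in find_bridges(network):
        print(f"single point of failure: {a} <-> {b}")
```

In this made-up network the three hubs protect each other, but the two segments on the spur are both flagged, which is the same pattern I see on the map once you leave the big-city routes.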

The other thing you realize about this network is that it wasn’t really designed—it grew organically. The network takes the shortest path between major markets using major roads and thus follows the routes built by the first fiber pioneers in the 80s and early 90s.

Hopefully this map spurs the carriers to get together and plan a more robust backbone going into the future. It’s very easy to get complacent about a network that is functioning, but this map highlights a number of vulnerable points in the network that could be improved. This kind of planning was undertaken by the large electric grids after a number of power outages a decade ago. Let’s not wait for major Internet outages to get us to pay attention to making the network safer and more redundant.