Network Outages Go Global

On August 30, CenturyLink experienced a major network outage that lasted more than five hours and disrupted CenturyLink customers nationwide as well as many other networks. What was unique about the outage was its scope: it affected video streaming services, game platforms, and even webcasts of European soccer.

This is an example of how telecom network outages have expanded in size and scope and can now be global in scale. This is a development that I find disturbing because it means that our telecom networks are growing more vulnerable over time.

The story of what happened that day is fascinating and I’m including two links for those who want to peek into how the outages were viewed by outsiders who are engaged in monitoring Internet traffic flow. First is this report from a Cloudflare blog that was written on the day of the outage. Cloudflare is a company that specializes in protecting large businesses and networks from attacks and outages. The blog describes how Cloudflare dealt with the outage by rerouting traffic away from the CenturyLink network. This story alone is a great example of modern network protections that have been put into place to deal with major Internet traffic disruptions.

The second report comes from ThousandEyes, which is now owned by Cisco. The company is similar to Cloudflare and helps clients deal with security issues and network disruptions. The ThousandEyes report comes from the day after the outage and discusses the likely reasons for the outage. Again, this is an interesting story for those who don’t know much about the operations of the large fiber networks that constitute the Internet. ThousandEyes confirms the suspicions that were expressed the day before by Cloudflare that the issue was caused by a powerful network command issued by CenturyLink using Flowspec that resulted in a logic loop that turned off and restarted BGP (Border Gateway Protocol) over and over again.

It’s reassuring to know that there are companies like Cloudflare and ThousandEyes that can stop network outages from permeating into other networks. But what is also clear from the reporting of the event is that a single incident or bad command can take out huge portions of the Internet.

That is something worth examining from a policy perspective. It’s easy to understand how this happens at companies like CenturyLink. The company has acquired numerous networks over the years from the old Qwest network up to the Level 3 networks and has integrated them all into a giant platform. The idea that the company owns a large global network is touted to business customers as a huge positive – but is it?

Network owners like CenturyLink have consolidated and concentrated the control of the network to a few key network hubs controlled by a relatively small staff of network engineers. ThousandEyes says that the CenturyLink Network Operation Center in Denver is one of the best in existence, and I’m sure they are right. But that network center controls a huge piece of the country’s Internet backbone.

I can’t find where CenturyLink ever gave the exact reason why the company issued a faulty Flowspec command. It may have been used to try to tamp down a problem at one customer or have been part of more routine network upgrades implemented early on a Sunday morning when the Internet is at its quietest. From a policy perspective, it doesn’t matter – what matters is that a single faulty command could take down such a large part of the Internet.

This should cause concern for several reasons. First, if one unintentional faulty command can cause this much damage, then the network is susceptible to this being done deliberately. I’m sure that the network engineers running the Internet will say that’s not likely to happen, but they also would have expected this particular outage to have been stopped much sooner and more easily.

I think the biggest concern is that the big network owners have adopted the idea of centralization to such an extent that outages like this one are more and more likely. Centralization of big networks means that outages can now reach globally and not just locally, as was the case just a decade ago. Our desire to be as efficient as possible through centralization has increased the risk to the Internet, not decreased it.

A good analogy for understanding the risk in our Internet networks comes from looking at the nationwide electric grid. It used to be routine to purposefully allow neighboring grids to automatically interact until it became obvious after some giant rolling blackouts that we needed firewalls between grids. The electric industry reworked the way that grids interact, and the big rolling regional outages disappeared. It’s time to have that same discussion about the Internet infrastructure. Right now, the security of the Internet is in the hands of a few corporations that stress the bottom line first, and which have willingly accepted increased risk to our Internet backbones as a price to pay for cost efficiency.

The Battle of the Routers

There are several simultaneous forces tugging at companies like Cisco that make network routers. Cloud providers like Amazon and CloudFlare are successfully luring large businesses to move their IT functions from local routers to large data centers. Meanwhile, other companies like Facebook are pushing small cheap routers using open source software. But Cisco is fighting back with its push for fog computing, which will place smaller function-specific routers near the source of data at the edge.

Cloud Computing.

Companies like Amazon and CloudFlare have been very successful at luring companies to move their IT functions into the cloud. It’s incredibly expensive for small and medium companies to afford an IT staff or outsourced IT consultants, and the cloud is reducing both hardware and people costs for companies. CloudFlare alone last year announced that it was adding 5,000 new business customers per day to its cloud services.

There are several trends that are driving this shift to data centers. First, the cloud companies have been able to emulate with software what formerly took expensive routers at a customer’s location. This means that companies can get the same functions done for a fraction of the cost of doing IT functions in-house. The cloud companies are using simpler, cheaper routers that offer brute computing power and that are also becoming more energy efficient. For example, Amazon has designed all of the routers used in its data centers and doesn’t buy boxes from the traditional router manufacturers.

Businesses are also using this shift as an opportunity to unbundle from the traditional large software packages. Businesses historically have signed up for a suite of software from somebody like Microsoft or Oracle and would live with whatever those companies offered. But today there is a mountain of specialty software that outperforms the big software packages for specific functions like sales or accounting. Both the hardware and the new software are easier to use at the big data centers and companies no longer need to have staff or consultants who are Cisco certified to sit between users and the network.

Cheap Servers with Open Source Software.

Not every company wants to use the cloud and Cisco has new competition for businesses that want to keep local servers. Just during this last week both Facebook and HP announced that they are going to start marketing their cheaper routers to enterprise customers. Like most of the companies today with huge data centers, Facebook has developed its own hardware that is far cheaper than traditional routers. These cheaper routers are brute-force computers stripped of everything extraneous and that have all of their functionality defined by free open source software; customers are able to run any software they want. HP’s new router is an open source Linux-based router from their long-time partner Accton.

Cisco and the other router manufacturers today sell a bundled package of hardware and software and Facebook’s goal is to break the bundle. Traditional routers are not only more expensive than the new generation of equipment, but because of the bundle there is an ongoing ‘maintenance fee’ for keeping the router software current. This fee runs as much as 20% of the cost of the original hardware annually. Companies feel like they are paying for traditional routers over and over again, and to some extent they are.
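The arithmetic behind that feeling is easy to sketch. The short Python example below is only an illustration: the $10,000 hardware price is a made-up number, and it assumes the fee is a flat 20% of the original hardware cost each year, which is the upper end of the range mentioned above.

```python
def cumulative_maintenance(hardware_cost, annual_rate_percent, years):
    """Total maintenance fees paid after `years`, assuming a flat fee
    equal to a fixed percentage of the original hardware cost each year."""
    return hardware_cost * annual_rate_percent * years / 100


def years_to_repay_hardware(annual_rate_percent):
    """Years of fees needed for maintenance to equal the original hardware price."""
    return 100 / annual_rate_percent


if __name__ == "__main__":
    hardware_cost = 10_000  # hypothetical router price, for illustration only
    rate = 20               # percent per year, the high end cited in the text

    for years in (1, 5, 10):
        fees = cumulative_maintenance(hardware_cost, rate, years)
        print(f"After {years} years: ${fees:,.0f} in maintenance fees")

    print(f"Fees equal the hardware price after {years_to_repay_hardware(rate):.0f} years")
```

Under those assumptions the maintenance fees alone match the original purchase price after five years, and double it after ten.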

These are the same kinds of fees that were common in the telecom industry historically with companies like Nortel and AT&T / Lucent. Those companies made far more money off of maintenance after the sale than they did from the original sales. But when hungry new competitors came along with a cheaper pricing model, the profits of those two companies collapsed over a few years and brought down the two largest companies in the telecom space.

Fog Computing.

Cisco is fighting back by pushing an idea called fog computing. This means having limited-function routers on the edge of the network to avoid having to ship all data to some remote cloud. The fog computing concept is that most of the data that will be collected by the Internet of Things will not necessarily need to be sent to a central depository for processing.

As an example, a factory might have dozens of industrial robots, and there will be sensors that constantly monitor them to spot troubles before they happen. The local fog computing routers would process a mountain of data over time, but would only communicate with a central hub when they sense some change in operations. With fog computing the local routers would process data for the one very specific purpose of spotting problems, which would save the factory owner from paying for terabits of data transmission, while still getting the advantage of being connected to a cloud.
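For readers who like to see the idea in code, here is a minimal Python sketch of that kind of local filtering. It is not how Cisco or anybody else actually implements fog computing; the function name, the five-reading window, the 10% tolerance, and the vibration numbers are all assumptions made up for illustration. The local box compares each new reading against the recent average and only flags the readings that drift too far from it.

```python
from statistics import mean


def readings_to_report(readings, window=5, tolerance=0.10):
    """Return (index, value) pairs for readings that deviate from the mean
    of the previous `window` readings by more than `tolerance` (a fraction).
    Everything else stays on the local device and is never transmitted."""
    reported = []
    for i in range(window, len(readings)):
        baseline = mean(readings[i - window:i])
        if baseline != 0 and abs(readings[i] - baseline) / abs(baseline) > tolerance:
            reported.append((i, readings[i]))
    return reported


if __name__ == "__main__":
    # Hypothetical vibration readings from one robot; one spike at index 6.
    vibration = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0, 1.0, 1.0]
    to_send = readings_to_report(vibration)
    print(f"Collected {len(vibration)} readings, sending {len(to_send)} to the hub: {to_send}")
```

In this toy run only the single spike would be sent to the central hub. A real deployment would need a smarter test than a simple moving average, but the principle of processing locally and transmitting only the exceptions is the same.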

Fog computing also makes sense for applications that need instantaneous feedback, such as with an electric smart grid. When something starts going wrong in an electric grid, taking action immediately can prevent cascading failures, and microseconds can make a difference. Fog computing also makes sense for applications where the local device isn’t connected to the cloud 100% of the time, such as with a smart car or a monitor on a locomotive.

Leave it to Cisco to find a whole new application for boxes in a market that is otherwise attacking the boxes they have historically built. Fog computing routers are mostly going to be smaller and cheaper than the historical Cisco products, but there is going to be a need for a whole lot of them when the IoT becomes pervasive.

Beyond a Tipping Point

A few weeks ago I wrote a blog called A Tipping Point for the Telecom Industry that looked at the consequences of the revolution in technology that is sweeping our industry. In that blog I made a number of predictions about the natural consequences of drastically cheaper cloud services, such as the mass migration of IT services to the cloud, massive consolidation of switch and router makers, a shift to software defined networks and the consequent explosion in specialized cloud software.

I recently read an interview in Business Insider with Matthew Prince, the founder of CloudFlare. It’s a company that many of you will never have heard of, but which today is carrying 5% of the traffic on the web and growing rapidly. CloudFlare started as a cyber-security service for businesses and its primary product helped companies fend off hacker attacks. But the company has also developed a suite of other cloud services. The combination of services has been so effective that the company says it has recently been adding 5,000 new customers per day and is growing at an annual rate of 450%.

In that interview Prince pointed out two trends that define how quickly the traditional market is changing. The first trend is that the functions served traditionally by hardware from companies like Cisco and HP are moving to the cloud, to companies like Amazon and CloudFlare. The second is that companies are quickly unbundling from traditional software packages.

CloudFlare is directly taking on the router and switching functions that have been served most successfully by Cisco. CloudFlare offers services such as routing and switching, load balancing, security, DDoS mitigation and performance acceleration. But by being cloud-based, the CloudFlare services are less expensive, nimbler and don’t require detailed knowledge of Cisco’s proprietary software. Cisco has had an amazing run in the industry and has had huge earnings for decades. Its model has been based upon performing network functions very well, but at a cost. Cisco sells fairly expensive boxes that then come with even more expensive annual maintenance agreements. Companies also need to hire technicians and engineers with Cisco certifications in order to operate a Cisco network.

But the same trends that are dropping the cost of cloud services exponentially are going to kill Cisco’s business model. It’s now possible for a company like CloudFlare to use brute computing power in data centers to perform the same functions as Cisco. Companies no longer need to buy boxes and only need to pay for the specific network functions they need. And companies no longer need to rely on expensive technicians with a Cisco bias. Companies can also be nimble and can change the network on the fly as needed without having to wait for boxes and having to plan for expensive network cutovers.

This change is a direct result of cheaper computing resources. The relentless exponential improvements in most of the major components of the computer world have resulted in a new world order where centralized computing in the cloud is now significantly cheaper than local computing. I summed it up in my last blog saying that 2014 will be remembered as the year the cloud won. It will take a few years, but a cloud that is cheaper today and that is going to continue to get exponentially cheaper will break the business models for companies like Cisco, HP, Dell and IBM. Where there were hundreds of companies making routers and other network components there will soon be only a few companies – those that are the preferred vendors of the companies that control the cloud.

The reverse is happening with software. Large corporations for the last few decades have largely used giant software packages from SAP, Oracle and Microsoft. These huge packages integrated all of the software functions of a business, from database and CRM to accounting, sales and operations. But these software packages were incredibly expensive. They were proprietary and cumbersome to learn. And they never exactly fit what a company wanted, so it was typical for the company to bend to meet the limitations of the software instead of changing the software to fit the company.

But this is rapidly changing because the world is being flooded by a generation of new software that handles the individual functions better than was done by the big packages. There are now dozens of different collaboration platforms available. There are numerous packages for the sales and CRM function. There are specialized packages for accounting, human resources and operations.

All of these new software packages are made for the cloud. This makes them cheaper to use and for the most part easier to learn and more intuitive to use. They are readily customizable by each company to fit their culture and needs. For the most part the new world of software is built from the user interface backwards, meaning that the user interface is made as easy and intuitive as possible. The older platforms were built with centralized functions in mind first and ended up with a lot of training required for users.

All of this means that over the next decade we are going to see a huge shift in the corporate landscape. We are going to see a handful of cloud providers performing all of the network functions instead of hundreds of box makers. And in place of a few huge software companies we are going to see thousands of specialized software companies selling into niche markets and giving companies cheaper and better software solutions.