A Three Nines World?

FierceNetwork recently published a thought-provoking article by Steve Saunders that asks, “is four nines the new five nines?”  That’s a question that only network engineers will understand, but it is a shorthand way to talk about the reliability of our networks.

The phrase five nines refers to a goal of keeping a network in service 99.999% of the time. That’s an incredible level of uptime: a five nines network is expected to be out of service no more than about five minutes in a year. A four nines network would have the goal of not being out of service more than 53 minutes per year, and three nines would lower the goal to 526 minutes, or just under nine hours per year.
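The arithmetic behind those numbers is simple enough to check. Here is a minimal sketch, assuming a 365-day year; the function name is just an illustration, not anything from the FierceNetwork article.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an uptime goal of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. three nines (99.9%) -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes per year")
```

Each added nine cuts the yearly downtime budget by a factor of ten, which is why moving from five nines to three is such a large step backward.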

I have a lot of clients who have signed contracts with large data customers to meet four or five nines of reliability. The only way to make that guarantee is to have a lot of redundancy. That means physically redundant fiber routes to protect against fiber cuts. It means self-healing electronics that quickly adapt to fiber outages or the loss of a key set of electronics. It means having software that can quickly be reset as needed.
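To see why redundancy is what makes these guarantees possible, here is a rough sketch of the standard availability math for redundant paths. It rests on two big assumptions that real networks rarely meet in full: the routes fail independently of each other, and failover is instant. The function name and the 99.9% figure are illustrative only.

```python
def parallel_availability(single_path: float, paths: int) -> float:
    """Availability of `paths` redundant routes when any one route is enough.

    Assumes independent failures and instant failover (both are simplifications).
    """
    # The service is down only when every route is down at the same time.
    return 1 - (1 - single_path) ** paths


one_route = 0.999  # hypothetical single fiber route at three nines
print(f"one route:  {one_route:.6f}")
print(f"two routes: {parallel_availability(one_route, 2):.6f}")
```

Under those assumptions, two three-nines routes together reach roughly six nines on paper. In practice, shared conduits, shared electronics, and shared software keep failures from being truly independent, so the real-world gain is smaller.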

In the last few years, we’ve seen network outages of major proportions. The latest outage by Verizon knocked a lot of customers out for half a day. There have been multiple regional and national outages due to problems in the Amazon AWS data centers. The breadth and magnitude of these outages are making it hard for any ISP to guarantee that networks will be reliable, due to problems caused upstream by larger industry players.

As Saunders points out, the culprit in most of the big outages is software. The software that controls the Internet has grown increasingly complex. Saunders says that communications networks have grown as complex as the systems that operate a nuclear submarine.

The article points to the complexity associated with the recent big Verizon outage. The problem was something that affected the standalone 5G core network. Verizon’s core network includes electronics and software from five vendors (Casa Systems, Ericsson, Nokia, Oracle, and Red Hat / OpenShift), along with Verizon’s own software.

Saunders says the issue is structural. While Verizon network engineers are elite, they are expected to operate networks that have grown too complex for any technician to fully understand. I’m sure Verizon still has an internal goal of five nines, but the company can no longer realistically understand the complexity of its network and the interplay of the many diverse components.

The problems and the outages are likely to grow worse as we continue to convert to software-defined networks, and as big companies consolidate network operations and eliminate technicians to cut costs. We are also increasingly using AI to write complex software, which is reducing our ability to fully understand and debug problems during a crisis.

Saunders points to another issue, which is the erosion of the separation between LAN and WAN. For decades, businesses have been secure behind firewalls since they ran different software inside the company than what was used to communicate outside the company. But that distinction has become blurred as a lot of software now reaches across that barrier.

The article’s conclusion is that we are probably going to have to learn to live with big outages. The day of expecting to be connected to super-safe networks is gone. The Verizon outage shows that we might already be living in a three nines world, something that makes every network engineer cringe.

3 thoughts on “A Three Nines World?”

  1. At my last position, mgmt pushed every day for systems that would provide 5-nine level uptime, even if it sacrificed hardware organization, temp/space issues, or insane cabling mgmt.

  2. “..hard for any ISP to guarantee that networks will be reliable due to problems cause upstream by larger industry players.”
    From my perspective, any ISP can & should only guarantee reliability of their “on-net”, not in what lies beyond that is out of their control.
    100% agree, with the larger scope, considering the complexity of the larger carriers & the continuously expanding complexities of not only backbone transport, SDN, etc.
    In the future, AI engineered specifically to support ISP networks at large will be able to mitigate the continuously expanding requirements, foresee potential bottlenecks and necessary software/firmware updates, model all transitional or generational cutovers & recognize incompatibilities. AI will be a huge step forward, as specific LLMs are able to learn & provide insight beyond what manpower can achieve at this stage.

    • As an ISP owner for the last 13 years, I can tell you our SLA does indeed state that our uptime is only guaranteed from the client’s location to our nearest edge. The problem is making the client aware of that.

      The chances of an outage being something on either side of the part we guarantee have gotten much, much larger: WiFi or a device on the local client network, or failing services on the other side, outside our ISP core.

      Our technicians are increasingly tasked with “defending” our service by digging into another company’s failure on behalf of our client to prove it’s, yet again, not our ISP network.

      Meanwhile, the government-subsidized ISPs continue to have price wars, dropping prices lower and lower, and we all know how their tech support works.

      Sometime this year I realized we likely could take a successful position as the “elite ISP” that charges a very nice premium for our services. No, we won’t get everyone, but we’ll get those clients that understand the situation, and we’ll make the same income with fewer connections.
