Limiting Large Network Outages

Ookla recently published an interesting article that emphasizes what I have been telling folks for a long time. Not that many years ago, telephone and broadband networks were structured in such a way that most outages were local events. A fiber cut might kill service to a neighborhood; an electronics failure might kill service to a larger area, but for the most part, outages were contained within a discrete and local area.

There were exceptions. Rural areas have long been susceptible to cuts in the fiber routes that provide their Internet backbone. Years ago, I worked with Cook County, Minnesota, which would lose voice and broadband every time there was a cut in the single fiber between Minneapolis and northern Minnesota that supported the area. A public-private partnership was created to build the THOR network to solve backhaul failures in a large chunk of southeastern Colorado.

https://www.ookla.com/articles/building-digital-resilience-strategies-2025

As the article points out, this has all changed as network operators have consolidated and interconnected networks across large geographic areas. Ookla attributes the new phenomenon of large-scale outages directly to digital transformation. As carriers, companies, and governments have grown increasingly reliant on cloud services, managed providers, and interconnected networks, what used to be a local problem can now cascade across a region, or even across the country.

The article looks at the recent power outage in Spain and Portugal that quickly grew from a local event into an outage across much of the Iberian Peninsula. Ookla points out that in today’s world, there is not that much difference between outages of a power grid, a cellular network, or a fiber network.

The article points out that outages can cascade much faster than anybody expects. The difference between a temporary disruption and a system-wide crisis depends on how quickly network operators can recognize and analyze the causes of a problem. Ookla says there are five key steps needed to keep disruptions from escalating, and argues that most major network outages can be traced to operators failing at one of the early steps in this process.

  • Detection: Spot the first signs of trouble across multiple data sources, from outage reports to operator dashboards.
  • Attribution: Identify the root cause of the problem, whether it’s an internal software bug, a fiber cut, or a regional power failure.
  • Communication: Share timely, accurate information with stakeholders and the public to reduce confusion.
  • Remediation: Act quickly to contain damage, restore critical services, and prevent cascading failures.
  • Learning: Capture lessons from each event and feed them back into playbooks, exercises, and long-term resilience planning.
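The five steps above amount to an incident-handling loop. The sketch below is a minimal illustration of that loop, not any operator's actual tooling; all of the function and data names are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    signals: list                 # raw anomaly reports from multiple data sources
    root_cause: str = ""
    notified: bool = False
    contained: bool = False
    lessons: list = field(default_factory=list)

def detect(feeds):
    """Detection: merge early warning signals from several data sources."""
    signals = [s for feed in feeds for s in feed]
    return Incident(signals=signals) if signals else None

def attribute(incident):
    """Attribution: treat the most frequently reported cause as the root cause."""
    counts = {}
    for s in incident.signals:
        counts[s] = counts.get(s, 0) + 1
    incident.root_cause = max(counts, key=counts.get)

def communicate(incident):
    """Communication: flag that stakeholders have been informed."""
    incident.notified = True

def remediate(incident):
    """Remediation: flag that the damage has been contained."""
    incident.contained = True

def learn(incident):
    """Learning: feed the event back into the playbook."""
    incident.lessons.append(f"playbook updated for: {incident.root_cause}")

# Example: customer outage reports and a dashboard alert both point at a fiber cut.
feeds = [["fiber cut", "fiber cut"], ["power failure"]]
incident = detect(feeds)
if incident:
    attribute(incident)
    communicate(incident)
    remediate(incident)
    learn(incident)
```

The point of the structure is the one Ookla makes: each later step depends on the earlier ones, so a failure at detection or attribution stalls everything downstream.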

Ookla believes that the local reaction within the first hour can make a huge difference in the extent and length of an outage. One power company in Iberia was able to isolate itself from the cascading shutdown because it was prepared to react quickly. I wonder how many local ISPs are ready to react quickly to problems caused outside their local network. The Ookla article suggests that local operators can do a lot more to protect themselves and their customers against major outages.

One thought on “Limiting Large Network Outages”

  1. In my view, if there were one valuable AI application that ISPs could use, it would be the monitoring and (faster) detection of outages. Years ago, as a user, I made this suggestion to the CEO of an MVPD/BSP after a long outage. Today, I wonder to what extent ISPs use this type of AI application in their networks.
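As a toy illustration of the kind of automated detection the commenter describes, a monitor could flag measurements that deviate sharply from a recent baseline. The metric, sample values, and threshold below are invented for the example:

```python
import statistics

def is_anomalous(history, sample, z_threshold=3.0):
    """Flag a sample that deviates more than z_threshold standard
    deviations from the recent baseline of measurements."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return sample != mean
    return abs(sample - mean) / stdev > z_threshold

baseline = [20, 21, 19, 20, 22, 20, 21]   # recent latency samples in ms
normal = is_anomalous(baseline, 21)       # → False: within normal variation
spike = is_anomalous(baseline, 120)       # → True: worth an operator alert
```

A real system would track many metrics across many network elements and use far more sophisticated models, but the underlying idea, learning what normal looks like and alerting on deviations, is the same.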
