The Causes of Network Outages

The Uptime Institute (UI) is an IT industry research firm that is best known for certifying that data centers meet industry standards. UI issues an annual report that analyzes the causes of data center outages. The causes of data center outages are relevant to the broadband industry, because the same kinds of issues shut down switching hubs and Network Operations Centers.

The following table shows the underlying cause of the network outages in 2022 that were severe enough to be publicly reported.

Cabling 9%
Capacity Issues 6%
Cooling 6%
Fiber Cut 17%
Fire 7%
IT Software Issues 18%
Network Connectivity 12%
Power 9%
Cyberattack 11%
Third Party 7%

UI cautions that it is somewhat skeptical of the stated reasons for outages, since data center owners are notorious for a lack of transparency. The report says that a large portion of outages are caused by human error and management failures. An increasing number of major outages are caused by cloud, colocation, ISP, or hosting companies.

Of growing concern are cyberattacks and ransomware, which accounted for 11% of major outages. These outages are often lengthy and can lead to contamination and loss of integrity of stored data. UI says that one of the reasons for increased security breaches is an increasing reliance on industry-standard operating systems and remote monitoring, which both create homogeneous systems that are easier for hackers to understand and breach.

6% of known outages in 2022 were ranked by UI as severe, meaning there was a major and damaging disruption of services, including large financial losses, compliance breaches, customer losses, and safety issues. Another 8% were considered to be serious, meaning the outage was still damaging.

The overall number of major outages is about the same as in 2019, when the annual reports were first generated – in spite of the huge amounts of resources being spent to improve technology, software, and physical redundancy.

One interesting warning from the report is that UI cautions businesses not to rely on Service Level Agreements (SLAs) that promise 99.9%+ reliability. Many data center providers fall short of that level of performance, which is possibly the reason for the lack of transparency.

Following are the major outages in 2022 by sector. Of particular concern to readers of this blog is telecommunications, which has seen a huge increase in the percentage of total outages since 2019.

Cloud / Internet Giant 19%
Digital Services 30%
Financial Services 7%
Government 7%
Telecommunications 32%
Transportation 5%

7 thoughts on “The Causes of Network Outages”

  1. Telecommunications as the big data center outage factor should not be a surprise. It is, after all, the infrastructure that ties data centers together. Telecom infra should be more reliable but it keeps getting beat up by Mother Nature and by humans, both intentionally and unintentionally. That number should not be that high, though.

  2. No surprise ‘IT Issues’ are at the top of the list.

    This does reinforce my assertion that the internet is too concentrated, with too much data flowing through too few primary exchanges. There is low tolerance of a single fiber cut, because the next hop down the line is an ISP that gets all its connectivity through that cable. The servers are separated from a lot of customers by a single fiber.

    That also figures into the capacity issue: a lot of data not destined for a given data center/IX runs through it, so instead of needing 10/40/100G ports they need to keep ramping higher and higher. Smaller regional IXes would allow a lot of traffic to bypass the large IXes and give an alternate path to redundant resources in a different DC.

    I’m mixing DC and IX here because the large sites are essentially joined at the hip. There’s little difference between the DCs in Seattle and the SIX, because they all connect through the SIX, and that’s mostly true at all major DCs and IXes.

  3. Most of the outages that I was called in to “figure out” were the result of misconfiguration regarding the selection of backup systems/routes.

    An example.

    Newark sends traffic directly to Memphis. If that circuit fails, Newark is configured to send traffic to Atlanta and Atlanta is expected to forward the traffic to Memphis.

    If Atlanta isn’t configured to deal with Newark to Memphis traffic…..oops….

    More insidious are the capacity issues. If someone forgets to provision the Atlanta to Memphis route to handle normal traffic, plus traffic imposed in a failure situation…..oops…
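    The two failure modes described above can be sketched in a few lines of Python. This is a toy model, not any real router configuration: the city names come from the comment, and the next-hop tables, link capacities, and traffic numbers are all hypothetical. The first "oops" is Atlanta missing an entry for rerouted Newark-to-Memphis traffic; the second is the backup path being provisioned only for its normal load.

    ```python
    # Next-hop tables: (current node, destination) -> next hop.
    PRIMARY = {
        ("Newark", "Memphis"): "Memphis",
        ("Newark", "Atlanta"): "Atlanta",
        ("Atlanta", "Memphis"): "Memphis",
    }
    # Backup route used only when the primary hop's circuit is down.
    BACKUP = {
        ("Newark", "Memphis"): "Atlanta",
    }

    def route(src, dst, failed_links, tables):
        """Walk the next-hop tables from src to dst; return the path or None."""
        path, node = [src], src
        while node != dst:
            nxt = tables.get((node, dst))
            if nxt is None or (node, nxt) in failed_links:
                nxt = BACKUP.get((node, dst))  # fall back to the backup route
            if nxt is None or (node, nxt) in failed_links or nxt in path:
                return None  # ....oops: no configured way forward
            path.append(nxt)
            node = nxt
        return path

    # Normal operation: the direct circuit carries the traffic.
    print(route("Newark", "Memphis", set(), PRIMARY))

    # Direct circuit fails, Atlanta is configured correctly: traffic detours.
    print(route("Newark", "Memphis", {("Newark", "Memphis")}, PRIMARY))

    # The misconfiguration: Atlanta has no entry for Newark-to-Memphis traffic,
    # so failover silently dead-ends.
    MISCONFIGURED = {k: v for k, v in PRIMARY.items() if k != ("Atlanta", "Memphis")}
    print(route("Newark", "Memphis", {("Newark", "Memphis")}, MISCONFIGURED))

    # The capacity trap: Atlanta-Memphis provisioned for normal load only.
    CAPACITY_GBPS = {("Atlanta", "Memphis"): 10}  # hypothetical numbers
    normal_load, rerouted_load = 8, 6
    assert normal_load <= CAPACITY_GBPS[("Atlanta", "Memphis")]  # fine day to day
    congested = normal_load + rerouted_load > CAPACITY_GBPS[("Atlanta", "Memphis")]
    print(congested)  # ....oops: the backup path congests exactly when it's needed
    ```

    The point of the sketch is that both failures are invisible in normal operation; only a failover test (or a real outage) exercises the backup table and the backup capacity.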

      • That is true, but a lot of things have changed since then. BBN (1822, if I recall correctly), IMPs, NCP, et al. are now long gone. šŸ™‚

        Robust has, of necessity, given way to efficiency.

      • First, a design goal doesn’t always become perfectly implemented.

        Second, I guess it depends on how progress is defined.

        In this case, progress is scaling the network from 100s of devices, each with a connection speed of 110bps (or maybe even as fast as 300bps), to billions of devices, each with a connection speed in the Mbps range…and not spending a nearly infinite amount of money on the network.

        BTW, even in the 1970s and early 1980s, there were lots of outages. Some (many? most?) were the result of several computer science/engineering grad students getting together to try something new.

        There would be a short outage while the “something new” was put into place (rebooting the computer-router). The outage was a bit longer if, “#*#^, it’s not working. Reboot it with the old floppy.” The outage was extended if, “$$*##*@)^@)@@@@!!!!, the old floppy won’t boot. Go get the backup.” The outage was even longer if the nearest backup was on one of the CS faculty’s desks and that desk was piled halfway to the ceiling with papers.

        I will neither confirm, nor deny, being one of those grad students. šŸ™‚
