The 2024 AT&T Outage

On February 22 of this year AT&T had a massive cellular outage. Customers started noticing that they couldn’t make calls, send texts, or reach the Internet at 3:30 AM Eastern time. Ookla’s Downdetector service said it recorded over 1.8 million reports of customer problems during the outage, making this the most widespread outage since a big T-Mobile outage in June 2020.

The outage had a big impact on customers who found themselves unable to communicate. I was one of the many people affected that morning, but since I work from home I was able to move most of my planned calls for the day to video conference. That’s a much smaller impact than being away from home with your smartphone as your only means of communication.

The biggest immediate concern is always 911 and emergency services. Many local 911 centers issued an alert about the problem and warned people to use alternate ways to reach 911. But these alerts are largely worthless to somebody connected to AT&T on a smartphone since it’s unlikely they’d get the alert.

These kinds of outages prompt some folks to prepare for the next one. I recently talked to a real estate agent who carries three phones subscribed separately to AT&T, Verizon, and T-Mobile. The main reason for doing this is that she works in a rural area where each carrier has dead zones, but this also means always being able to reach 911.

AT&T was slow to acknowledge the problem and didn’t make a public announcement until noon Eastern. Service wasn’t fully restored until more than twelve hours after the outage started. As usual, the company announcements both during and after the outage were vague. The first thing AT&T said was that the outage was not due to a cyberattack. That seems like an odd thing to say, but social media had settled on the cause as either terrorism or solar flares. AT&T said that the outage was due to “the application and execution of an incorrect process” during a network expansion.

I’m sure most folks had no idea what that meant, but industry folks understood that to mean there was human error in implementing an upgrade – more likely software than hardware.

A decade ago, this kind of outage was unthinkable. The cellular network used to be arranged in small regions that coordinated roaming, but each cell site was a standalone entity, largely operating using its own software and hardware. I have a friend who made a great living for many years doing cell site upgrades. He and his crew would travel for months on end going from cell site to cell site to repeatedly make the same upgrade. But the upgrades weren’t identical, because each cell site differed in the exact configuration of software and hardware installed. It took up to a year to make a nationwide upgrade to the cellular network.

The cellular network still has some of that regional flavor, but software is now upgraded nationally or by large region, as was being done in the “network expansion” at the time of the AT&T outage. A mistake in coding or a mistake in the sequence of the update steps can now crash a huge number of cell sites simultaneously. In the old configuration, any update bugs would get worked out in the first few cell sites that got updated.

There is no question that consolidation is better for the public in the long run because it means that people get to take advantage of upgrades sooner, instead of waiting for up to a year to see improvements. We would not have seen the big upgrades in cellular speeds over the last few years if cell carriers were still traveling to cell sites one by one to make the many upgrades on the path to 5G.

But one of the big drawbacks of a modernized and centralized network is these big outages. There is always going to be human error when updating networks. It seems like there must be a way for big carriers to test upgrades on a few cell sites before updating many at the same time. In the old days, we called this having a test lab, where updates could be tested in a way that didn’t impact customers if something went wrong. Unfortunately, while we have gained huge efficiency, our networks are now susceptible to widespread damage from cyberattacks, solar flares, and technician errors.
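For readers who like to see the idea spelled out, here is a minimal sketch of what testing an upgrade on a few cell sites first might look like in software. This is purely illustrative and is not how AT&T or any other carrier actually pushes updates. The function name `staged_rollout`, the `apply_update` and `is_healthy` callbacks, and the 1% / 10% / 100% stage sizes are all assumptions made up for this example.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    sites: Sequence[str],
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one site
    is_healthy: Callable[[str], bool],     # hypothetical: checks a site after the update
    stages: Sequence[float] = (0.01, 0.10, 1.0),  # assumed cumulative fractions of sites
) -> Dict[str, object]:
    """Update sites in growing batches and stop as soon as a batch looks unhealthy."""
    sites = list(sites)
    updated: List[str] = []
    done = 0  # how many sites have been updated so far

    for fraction in stages:
        # Cumulative number of sites that should be updated by the end of this stage.
        target = max(1, int(len(sites) * fraction))
        batch = sites[done:target]

        for site in batch:
            apply_update(site)
            updated.append(site)

        # Check the batch before touching any more of the network.
        failed = [site for site in batch if not is_healthy(site)]
        if failed:
            return {"completed": False, "updated": updated, "failed": failed}

        done = max(done, target)

    return {"completed": True, "updated": updated, "failed": []}
```

With 1,000 sites and these assumed stage sizes, a bad update would be caught after the first 10 sites rather than after all 1,000, which is roughly the protection the old site-by-site upgrades provided by accident.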

One thought on “The 2024 AT&T Outage”

  1. This outage reminds me of the AT&T long-distance landline nationwide outage in the 1990s, when an upgrade was made to the Signaling System 7 (SS7) software. I was working at AT&T at the time. The upgrade unexpectedly created conditions in which the telephony switches thought they were full (even though they weren’t) and therefore wouldn’t process calls, instead passing them on to other switches experiencing the same problem. It lasted for hours.
