2024 was like most recent years: there were a few major broadband outages and a lot of smaller regional ones. Most carriers claim to be investing more in redundancy to avoid major outages, and one hopes that is cutting down on outages.
AT&T suffered a big outage in February when it lost cellular coverage in markets like Dallas, Houston, Los Angeles, and Atlanta. The outage particularly affected first responders served by AT&T’s FirstNet network. The company said the outage was “caused by the application and execution of an incorrect process used as we were expanding our network, not a cyberattack.” Basically, the company messed something up during a network update.
The biggest telco outages for the year came from hurricanes. In western North Carolina alone, 80% of cell sites went out of service by the day after Hurricane Helene hit. Fiber networks were severed as entire roads washed away, and something like a million trees were damaged. I live in Asheville, and we experienced a total communications blackout with no cellular or landline broadband. It took about a week to get a partial cell signal back and over three weeks to get broadband. Some rural areas were out much longer.
Hurricane Milton caused broadband outages as well, but they were more related to power outages than to destroyed telecom networks. A lot of places didn’t lose cell coverage, and most people were back in service within a few days.
The other big outages in 2024 were not network outages but service provider outages.
- Microsoft Teams had a seven-hour outage on January 26. The cause of the outage was never disclosed but seems to have been internal to Microsoft.
- On March 5, Meta had an outage that blocked users from accessing Facebook, Instagram, Messenger, and Threads. The reason for the outage was a glitch in the login process.
- Google lost service for an hour on May 1. The problem was a failure in the verification process that couldn’t identify users.
- The biggest outage of the year happened on July 19 and affected 8.5 million Microsoft Windows devices. The outage was worldwide. Flights were canceled, customers couldn’t access banks, surgeries were canceled, and there were widespread 911 outages. The cause of the problem was a section of code at CrowdStrike, the cybersecurity firm that many large Windows customers were using to protect their devices. In retrospect, the outage was blamed on the lack of testing from CrowdStrike before implementing a software update.
- Microsoft had an outage on November 25 that intermittently left users unable to use Outlook or reach the web. Microsoft admitted the source of the problem was a configuration change, which makes it another software update problem.
- On December 11, OpenAI had an outage of its video service Sora. This was caused by a cascading error when a telemetry service overwhelmed the platform.
Interestingly, most of the service outages were the result of configuration changes, meaning software upgrades.
These big companies should learn a lesson from smaller telcos. I’ve had many clients who learned the hard way to never push new software to customers without first testing the update. That means NEVER EVER, NEVER EVER, NEVER EVER (did I say that enough?). Many telcos have software test labs set up to mimic the network. They try updates in the test lab before ever exposing their customers to them. This is software update 101 stuff, but apparently, the smart guys at some of the biggest companies don’t think they need to take this extra precaution.
Of course, testing in the lab is important, but I also think that, if possible, updates should be rolled out slowly and halted on the first failure.
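For anybody who wants to picture what a slow rollout that halts on the first failure might look like, here is a minimal Python sketch. It is only an illustration under my own assumptions, not how any of the companies above deploy software. The function name `staged_rollout`, the `apply_update` and `is_healthy` callbacks, and the wave sizes (1%, 10%, 50%, 100%) are all hypothetical placeholders for whatever a real network would use.

```python
from typing import Callable, Sequence


def staged_rollout(
    devices: Sequence[str],
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one device
    is_healthy: Callable[[str], bool],     # hypothetical: post-update health check
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # assumed wave sizes
) -> dict:
    """Update devices in growing waves and stop at the first unhealthy device."""
    updated: list[str] = []
    done = 0
    for fraction in stage_fractions:
        # Cumulative number of devices that should be updated after this wave,
        # always advancing by at least one device and never past the end.
        target = min(len(devices), max(done + 1, int(len(devices) * fraction)))
        for device in devices[done:target]:
            apply_update(device)
            updated.append(device)
            if not is_healthy(device):
                # Halt immediately: no further devices are touched.
                return {"completed": False, "failed_on": device, "updated": updated}
        done = target
        if done == len(devices):
            break
    return {"completed": True, "failed_on": None, "updated": updated}


# Example: a made-up fleet of 200 devices where one device fails after updating.
fleet = [f"router-{i}" for i in range(200)]
bad_devices = {"router-5"}
result = staged_rollout(
    fleet,
    apply_update=lambda device: None,
    is_healthy=lambda device: device not in bad_devices,
)
print(result["completed"], result["failed_on"], len(result["updated"]))
```

In this made-up example, the rollout stops in the second wave as soon as the bad device fails its check, so only a handful of devices ever see the update instead of all 200. A real process would likely also wait between waves and roll back the devices already updated, which this sketch leaves out.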