We’ve seen some spectacular software upgrades gone wrong recently. The one that got the most press was a software problem at the FAA that halted nationwide flights by corrupting the NOTAM system that transmits real-time information to pilots about flight hazards and airspace restrictions. The FAA said the outage occurred when personnel unintentionally deleted files while working to correct synchronization between the live primary database and a backup database.
Internet platforms regularly suffer outages caused by software issues. In the last year, there have been outages at Google, Facebook, Twitter, and dozens of other software platforms. The telecom industry has had plenty of similar outages that have knocked out large chunks of the Internet backbone or various data centers. The software-related outages all have one thing in common – with good software upgrade procedures, they likely could have been prevented.
Telephone companies have the longest history of working with software upgrades that are capable of knocking out networks. The possibility of big voice outages crept into the industry when we replaced electromechanical switches with electronic switches. The big telephone companies developed software upgrade protocols that were designed to minimize outages due to software upgrades. Even when upgrades went poorly, smart telcos adopted processes for quickly flipping back to the original configuration. The frequency and the size of the software outages we keep seeing today are good indicators that a lot of companies are not following the safe practices that have been around for decades.
One of the first things that anybody who touches the core of a network should understand is that there is no such thing as a casual upgrade or casual maintenance of a mission-critical system. It’s obvious there are bad practices in place when one technician can delete or modify a file and cause a major outage – it should be impossible for anybody to have the access to casually do that.
The processes for safe software upgrades are well known. They require a lot more discipline than many network engineers want to use – but they are safe. The tried-and-true way to make a software upgrade is as follows:
Have a Project Manager for the Upgrade. It is vital to have one person in charge of the upgrade. They can get assistance in planning and doing the upgrade, but they need to be ready and authorized to react if things don’t go as planned.
Develop a Checklist. There should be a checklist covering every aspect of the upgrade. Make sure to understand every piece of equipment and software that will be affected by the upgrade. Then, most importantly, lay out the exact sequence of steps required to perform it.
Break the Upgrade into Manageable Steps. If possible, the upgrade should be done in stages where progress can be measured and tested after each step.
Establish a Baseline / Establish a Go-Back Process. Establishing a baseline means understanding the current network configuration in detail. It means understanding the exact settings of every piece of software and equipment. Once the baseline is in place, there should be a go-back process. This is the process of returning software and hardware to the original configuration if something goes wrong during the upgrade. Ideally, the go-back would be something that can be implemented quickly, and if designed well, can be done in minutes. A simple sketch of how a checklist, staged steps, and a go-back fit together follows this list.
Make Sure to Have Vendor Support. It’s worth considering having a vendor representative on site for major upgrades, or on alert for minor ones. I have seen clients schedule an upgrade over a holiday, not realizing that the needed expertise at the vendor is probably not going to be available.
Pre-test Every Component before the Cut. Safe practice is to establish a test lab for a complicated upgrade, where the new software and/or hardware is tested first in a lab setting rather than on the live network.
Take Every Upgrade Seriously. I often see companies follow most of the above steps for major upgrades, only to knock out their network during what they think of as a simple upgrade or routine maintenance.
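To make the checklist, the staged steps, and the go-back process concrete, here is a minimal sketch in Python of how a staged upgrade runner might look. Everything in it is hypothetical – the step names, the verify scripts, and the export/import commands stand in for whatever your vendors’ tools actually provide. The point is the structure: capture a baseline first, run one step at a time, verify after every step, and fall back to the baseline the moment a verification fails.

```python
# Hypothetical sketch of a staged upgrade runner: capture a baseline,
# run each checklist step in order, verify after every step, and go
# back to the baseline the moment anything fails. The shell scripts
# named here are placeholders for your vendor's real tools.
import subprocess
import sys
from datetime import datetime

BASELINE_FILE = f"baseline-{datetime.now():%Y%m%d-%H%M%S}.cfg"

# The step-by-step checklist: (description, upgrade command, verification command).
# A verification command must exit non-zero if the network is unhealthy.
CHECKLIST = [
    ("Push new software image",  ["./push_image.sh", "v2.4.1"], ["./verify_image.sh"]),
    ("Migrate configuration",    ["./migrate_config.sh"],       ["./verify_config.sh"]),
    ("Restart routing services", ["./restart_routing.sh"],      ["./verify_routing.sh"]),
]

def run(cmd):
    """Run a command and return True only if it exits cleanly."""
    return subprocess.run(cmd).returncode == 0

def capture_baseline():
    """Record the current configuration so we can go back to it quickly."""
    if not run(["./export_config.sh", BASELINE_FILE]):
        sys.exit("Could not capture a baseline - do not start the upgrade.")

def go_back():
    """Restore the pre-upgrade configuration captured in the baseline."""
    print("Upgrade failed - restoring baseline", BASELINE_FILE)
    run(["./import_config.sh", BASELINE_FILE])
    run(["./verify_routing.sh"])  # confirm the network is healthy again

def main():
    capture_baseline()
    for description, upgrade_cmd, verify_cmd in CHECKLIST:
        print("Step:", description)
        if not (run(upgrade_cmd) and run(verify_cmd)):
            go_back()
            sys.exit(f"Stopped at step '{description}' and rolled back.")
    print("Upgrade completed - all steps verified.")

if __name__ == "__main__":
    main()
```

In a real network, the verification step would be a suite of health checks – interface status, routing tables, call completion – and the go-back path should be rehearsed in the lab so you know it really does take minutes, not hours.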
It’s easy to define a bad upgrade process as one where a single technician can unilaterally change files and settings in a mission-critical system without going through any of the above processes. Every time there is a bad outage, we hear reasons like a corrupted file or bad hardware – nobody ever admits they were too casual with an upgrade, although that’s probably the real cause of the outage.