Last year I wrote about big disruptive outages on the T-Mobile and the CenturyLink networks. Those outages demonstrate how a single circuit failure on a transport route or a single software error in a data center can spread quickly and cause big outages. I join a lot of the industry in blaming the spread of these outages on the concentration and centralization of networks where the nationwide routing of big networks is now controlled by only a handful of technicians in a few locations.
In early October, we saw the granddaddy of all network outages when Facebook, WhatsApp, and Instagram all crashed for much of a day. This was a colossal crash because the Facebook apps have billions of users worldwide. It’s easy to think of Facebook as just a social media company, but the suite of apps is far more than that. Much of the third world uses WhatsApp instead of text messaging to communicate. Small businesses all over the world communicate with customers through Facebook and WhatsApp. The Facebook crash also affected many other apps. Anybody who automatically logs into other apps using Facebook login credentials was also locked out since Facebook couldn’t verify their credentials.
Facebook blamed the outage on what it called routine software maintenance. I had to laugh the second I saw that announcement and the word ‘routine’. Facebook would have been well advised to have hired a few grizzled telecom technicians when it set up its data centers. We learned in the telecom industry many decades ago that there is no such thing as a routine software upgrade.
The telecom industry has long been at the mercy of telecom vendors that rush hardware and software into the real world without fully testing them. An ISP comes to expect issues and glitches when it is taking part in a technology beta test. But during the heyday of the telecom industry throughout the 80s and 90s, practically every system small telcos operated was in beta test mode. Technology was changing quickly, and vendors rushed new and improved features onto the market without first testing them in real-life networks. The telcos and their end-user customers were the guinea pigs for vendor testing.
I feel bad for the Facebook technician who introduced the software problem that crashed the network. But I can’t blame him for making a mistake – I blame Facebook for not having basic protocols in place that would have made it impossible for the technician to crash the network.
I bet that Facebook has world-class physical security in its data centers. I’m sure the company has redundant fiber transport, layers of physical security to keep out intruders, and fire suppression systems to limit the damage if something goes wrong. But Facebook didn’t learn the basic Telecom 101 lesson that any general manager of a small telco or cable company could have told them. The biggest danger to your network is not from physical damage – that happens only rarely. The biggest danger is from software upgrades.
We learned in the telecom industry to never trust vendor software upgrades. Instead, we implemented protocols where we created a test lab to try each software upgrade on a tiny piece of the network before inflicting a faulty upgrade on the whole customer base. (The even better lesson most of us learned was to let the telcos with the smartest technicians in the state tackle the upgrade first before the rest of us considered it.)
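For readers who think in code, the lab-first idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any telco or Facebook actually does it: the function `staged_rollout` and its callbacks `apply_upgrade`, `is_healthy`, and `rollback` are hypothetical names made up for this sketch.

```python
def staged_rollout(nodes, apply_upgrade, is_healthy, rollback, lab_size=1):
    """Upgrade a small lab slice first; touch the rest only if the lab stays healthy.

    All callbacks are hypothetical stand-ins for whatever a real network
    would use to push, verify, and undo an upgrade on a single node.
    Returns True if every stage passed, False if the rollout was undone.
    """
    nodes = list(nodes)
    # Stage 1 is the tiny test slice; stage 2 is everything else.
    stages = [nodes[:lab_size], nodes[lab_size:]]
    upgraded = []
    for stage in stages:
        for node in stage:
            apply_upgrade(node)
            upgraded.append(node)
        # Verify the whole stage before moving on to the next one.
        if not all(is_healthy(node) for node in stage):
            # Undo everything upgraded so far, most recent first.
            for node in reversed(upgraded):
                rollback(node)
            return False
    return True
```

The point of the structure is that a bad upgrade fails on the lab slice and gets rolled back before any customer-facing node is touched.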
Shame on Facebook for having a network where a technician can implement a software change directly without first testing it and verifying it a dozen times. It was inevitable that a network without a prudent upgrade and testing process would eventually suffer the big crash we saw. It’s not too late for Facebook – there are still a few telco old-timers around who could teach them to do this right.