It's funny - I spent a half hour the other day on the phone sharing some insights with a team from the New York City Economic Development Corporation, the city agency that is developing a telecommunications plan for the Big Apple. They are focusing on ways to assess and ultimately improve network reliability so that even when terrorists fly jetliners into skyscrapers, blackouts occur, or steam pipes explode, the investment banks and global media companies can stay connected.
My advice was: telecom networks are among the most resilient infrastructures we have. Let the carriers worry about that, as their customers will demand what they think they need. But those networks depend on two critical underlying infrastructures that are old, weak, decaying, and highly vulnerable: the power grid and physical conduits (tunnels, roads, even sewers and water mains). These are also areas where local governments have some regulatory pressure to apply (versus telecom, where they have almost none).
Two days later, Pacific Gas & Electric in San Francisco proved my point for me by letting a transformer explosion knock some important web services offline. GigaOM has great coverage:
This resulted in a transformer blowing up, and causing even more disruptions, especially at 365 Main, one of the large co-lo/data center facilities situated in the SOMA area of San Francisco.
This resulted in massive outages at some of Web 2.0’s brand name companies - Six Apart, Facebook, Technorati and Yelp - knocking their systems and web services out flat. Whatever the reasons behind the failure might be, yesterday was a rude reminder of how fragile our digital lives are.
The seemingly invincible web services (not to mention the notional wealth they signify) vanish within a blink of the eye. It was also a reminder, that all the hoopla around web services is just noise - for in the end the hardware, the plumbing, the pipes and more importantly, the power grid is the real show.
Full article. Also see O'Reilly Radar blog's coverage.
Update: Sean Ness of IFTF forwarded me a blog comment by somafm that has a fairly detailed technical explanation of the chain of events that brought down 365 Main, the AboveNet data center that was at the heart of web outages caused by the San Francisco blackout:
365 Main, like all facilities built by AboveNet back in the day, doesn't have a battery backup UPS. Instead, they have these things called "CPS," or continuous power systems. What they are is very very large flywheels that sit between electric motors and generators. So the power from PG&E never directly touches 365 Main. PG&E power drives the motors which turn the flywheels which then turn the generators (or alternators, I don't remember the exact details) which in turn power the facility. There are 10 of these on their roof.
The flywheels (the CPS system) can run the generator at full load for up to 60 seconds according to the specs.
There are also 10 large diesel engines up on the roof as well, connected to these flywheels. If the power is out for more than 15 seconds, the generators start up, and clutch in and drive the flywheels. There are no generators in the basement. (There is a large fuel storage in the basement, and the fuel is pumped up to the roof. There are smaller fuel tanks on the roof as well.)
Here's what I think happened. Since there were several brief outages in a row before the power went out for good, it seems that the CPS (flywheel) systems weren't fully back up to speed when the next outage occurred. Since several of these grid power interruptions happened in a row, and were shorter than the time required to trigger generator startup, the generators were not automatically started, BUT the CPS didn't have time to get back up to full capacity. By the 6th power glitch, there wasn't enough energy stored in the flywheels to keep the system going long enough for the diesel generators to start up and come to speed before switching over.
Why they just didn't manually switch on the generators at that point is beyond me.
So they had a brief power outage. By our logs, it looks like it was at the most 2 minutes, but probably closer to 20 seconds or so.
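To make somafm's theory concrete, here is a rough toy model in Python of a flywheel buffer being drained by a string of short glitches. It is only a sketch, not a description of 365 Main's actual equipment. The 60-second ride-through and the 15-second diesel trigger come from the comment above; the recharge ratio, the diesel spin-up time, and the glitch timings are numbers I made up for illustration.

```python
# Figures taken from somafm's comment.
FULL_RIDE_THROUGH_S = 60.0   # flywheel carries full load for up to 60 seconds
DIESEL_START_DELAY_S = 15.0  # diesels only start after 15 seconds without grid power

# Assumptions for illustration only (not from the comment).
RECHARGE_RATIO = 0.1    # seconds of ride-through regained per second back on grid power
DIESEL_SPINUP_S = 10.0  # time for a started diesel to reach speed and clutch in


def first_failed_outage(events):
    """Return the index of the first outage the flywheel cannot cover, or None.

    `events` is a list of (outage_seconds, grid_seconds_afterwards) pairs.
    """
    stored = FULL_RIDE_THROUGH_S
    for index, (outage_s, grid_s) in enumerate(events):
        # The flywheel has to carry the load for the whole outage, or until
        # the diesel has started and come up to speed, whichever is shorter.
        needed = min(outage_s, DIESEL_START_DELAY_S + DIESEL_SPINUP_S)
        if stored < needed:
            return index
        stored -= needed
        # Grid power is back: the motor slowly spins the flywheel up again.
        stored = min(FULL_RIDE_THROUGH_S, stored + grid_s * RECHARGE_RATIO)
    return None


# One long outage from a fully charged flywheel: the diesel takes over in time.
single_long_outage = [(120.0, 0.0)]

# Five 12-second glitches (too short to trigger the diesels), 5 seconds apart,
# followed by the outage that sticks.
repeated_glitches = [(12.0, 5.0)] * 5 + [(120.0, 0.0)]

print(first_failed_outage(single_long_outage))
print(first_failed_outage(repeated_glitches))
```

With these invented numbers the single long outage is survivable, while the same outage arriving after five short glitches is not, because each glitch drains the flywheel without ever starting the diesels. Different assumed recharge and spin-up values would move the failure point, so treat the exact glitch count as illustrative.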
Update #2: From Good Morning Silicon Valley:
OK, it's all happenstance and coincidence, but those inclined to see omens and portents might be excused for thinking that the gods are displeased with the Net. Almost a month ago, a huge fire in downtown Palo Alto came within an alley of PAIX, the Palo Alto Internet Exchange, one of the major crossroads in the country for data traffic, threatening vast disruption. Then yesterday, a series of electrical outages and fluctuations left a good-size chunk of San Francisco powerless for several hours during the middle of the business day, including hosting service 365 Main, which powers many of the Web's most popular sites and boasts of doubly redundant backup in case of blackouts.
The effect rippled through the wired world. LiveJournal and Second Life went dead, AdBrite dimmed, Craigslist became unlisted, the 1Up gaming network went down, Facebook turned blank, Six Apart couldn't get it together, and Yelp was rendered silent. Unable to work, Web 2.0 programmers slathered themselves with sunscreen and stumbled into the unfamiliar daylight. Families were reunited as thousands of idled bloggers pushed away from the keyboard and were greeted by loved ones. Global temperature dropped as servers and PCs rested silently.
Soon enough, though, normality was restored, and the words "wake-up call" were zipping across the Net. "It exposed a larger vulnerability," said Technorati exec Derek Gordon. "If this could happen to such a collection of major websites, what would happen if this was part of a major catastrophe? This was sort of a wake-up call." And Don Dodge notes that this is exactly why companies that can afford it, like Microsoft and Google, are building their own multiple data centers.
Still, alarms about something as daunting and expensive as replacing aging infrastructure tend to get the snooze treatment until something truly calamitous happens. For a preview, see the Onion's report on the Great Web Crash of '07.
Is it feasible that someone could disable computers globally, through a virus or other means? What would the consequences be? Could it be fatal, or would they get things up and running again quickly?
Posted by: Carly Tinkler | February 10, 2008 at 12:43 PM