Like He-Who-Must-Not-Be-Named in the Harry Potter books, I find that there is something that must-not-be-named in the world of computing. It is That-Which-Must-Not-Be-Discussed. It’s not so much that people fear it (although many of them do), it is more that it is (for reasons that mystify me) incredibly low prestige. It’s kind of like maintenance or janitorial services in an office building. The higher prestige you are, the less likely you are to mention the subject – it is simply beneath notice. Why, people who wear uniforms, for goodness sake, do the work.
Nonetheless, this subject is incredibly high value. In fact, it’s hard to argue there is anything more valuable in the computing world. What is the subject? Well, think Haiti. Think Chile. Think Mexico. Think earthquake. Yes, earth-shaking, building-crashing, crevice-opening earthquake. What is the equivalent of an earthquake in the world of computing? You know, yes you do. If it’s a series of scary tremors, then it’s a site slow-down. If it brings down buildings, then it’s a site crash. If it brings down buildings and cars and people disappear into newly appeared holes in the ground, it’s a major outage with data loss.
Where does making buildings earthquake-proof stand in the overall priority of things?
It couldn’t have been too high on the priority list at RIM in the time leading up to the earthquake that struck them last December. Here is a representative story:
What a horrible thing to happen to their business! Lots of free publicity of exactly the kind they don’t want.
The story reminded me of similar events, less public, that have taken place at a couple of companies I know very well. The story also led me to reflect on how data centers (and related development) issues are typically left to fester until they blow up. Then the alarms ring, everyone runs around, the immediate problem is fixed. What is unusual is for management to take the systemic action that is required to greatly reduce the chance of the failure recurring. Some data center operations resemble a coal-fired heating furnace – they require constant care and feeding, are cranky and don’t like change of any kind, but people just don’t want to think about it. “Upgrade to gas? I don’t have time to think about it. Maybe in next year’s budget.”
Here’s what I find: the more august the group of people, the higher their status, the less willing they often seem to be to devote real time, effort and brain cycles to That-Which-Must-Not-Be-Discussed. This is wrong. It is so wrong, it is perverse. Change your priorities! Take a look at it now (or at least soon), when (I hope) the alarm bells are not ringing.
Even though the articles about the RIM debacle don’t go into detail, it is reasonable to guess a couple things about the RIM operation from facts that have been revealed. Here are some of the warning signs, most of which applied to the RIM case, and some of which may or may not have. I list them here as a quick check-list to see how vulnerable your operation may be.
- Highly complex system. RIM’s data center was said by several people to be highly complex. This is almost always a bad sign. Systems naturally become complex over time (kind of like entropy), and sometimes smart people insist for plausible-at-the-time reasons on adding complexity. The trouble is that, for a variety of understandable reasons, the more complex a system is, the more likely it is to fail when changed. It is worth working hard to reduce the number of elements in your data center and generally make it simpler.
- Change management risk. The failure at RIM is said to have been a result of a software “upgrade.” This is one of the most common opportunities for embarrassment. All too often, people respond by reducing the frequency of change, which actually increases the chance that any one change will cause a disaster (because it is likely to be a larger, more complex change). There are methods of reducing this risk to near zero.
- “us vs. them.” Most data center disasters I have seen happen when there is a (typically well-intentioned) strict separation between data center operations and the rest of the world.
At minimum, it is worth a quick, objective look at the “machine room” of your operation to see if it looks and feels like the kind of place to which disasters are naturally attracted.
Above all, get over That-Which-Must-Not-Be-Discussed – yes (dare I say), be like Harry Potter – call Voldemort by his proper name! Talk about earthquake vulnerability! And above all, do what you have to do so that, when things go wrong, your service keeps working.