Like He-Who-Must-Not-Be-Named
in the Harry Potter books, there is something in the world of computing that
must not be named. It is That-Which-Must-Not-Be-Discussed.
It’s not so much that people fear it (although many of them do); it is more
that it is (for reasons that mystify me) incredibly low prestige. It’s kind of
like maintenance or janitorial services in an office building. The higher
prestige you are, the less likely you are to mention the subject – it is simply
beneath notice. Why, people who wear
uniforms, for goodness’ sake, do the work.
Nonetheless, this subject is incredibly high value. In fact, it’s
hard to argue there is anything more
valuable in the computing world. What is the subject? Well, think Haiti. Think
Chile. Think Mexico. Think earthquake. Yes, earth-shaking,
building-crashing, crevice-opening earthquake.
What is the equivalent of an earthquake in the world of computing? You know,
yes you do. If it’s a series of scary tremors, then it’s a site slow-down. If it brings down buildings, then it’s a site crash. If it brings down buildings
and cars and people disappear into newly appeared holes in the ground, it’s a major outage with data loss.
Where does making buildings earthquake-proof
stand in the overall priority of things?
It couldn’t have been too high on
the priority list at RIM in the time leading up to the earthquake that struck
them last December. Here is a representative story:
What a horrible thing to
happen to their business! Lots of free publicity of exactly the kind they don’t
want.
The story reminded me of
similar events, less public, that have taken place at a couple of companies I
know very well. The story also led me to reflect on how data center (and
related development) issues are typically left to fester until they blow up.
Then the alarms ring, everyone runs around, and the immediate problem is fixed.
What is unusual is for management to take the systemic action that is
required to greatly reduce the chance of the failure recurring. Some data
center operations resemble a coal-fired heating furnace – they require constant
care and feeding, are cranky and don’t like change of any kind, but people just
don’t want to think about it. “Upgrade to gas? I don’t have time to think about
it. Maybe in next year’s budget.”
Here’s what I find: the more
august the group of people, the higher their status, the less willing they often seem to be to devote real time, effort and
brain cycles to That-Which-Must-Not-Be-Discussed.
This is wrong. It is so wrong, it is perverse. Change your priorities! Take a look at it now (or
at least soon), when (I hope) the alarm bells are not ringing.
Even though the articles
about the RIM debacle don’t go into detail, it is reasonable to guess a couple
of things about the RIM operation from facts that have been revealed. Here are
some of the warning signs, most of which applied to the RIM case, and some of
which may or may not have. I list them here as a quick check-list to see how
vulnerable your operation may be.
- Highly complex system. RIM’s data center was said by several people to be highly
complex. This is almost always a bad sign. Systems naturally become complex
over time (kind of like entropy), and sometimes smart people insist for
plausible-at-the-time reasons on adding complexity. The trouble is that, for a
variety of understandable reasons, the more complex a system is, the more
likely it is to fail when changed. It is worth working hard to reduce the
number of elements in your data center and generally make it simpler.
- Change management risk. The failure at RIM is said to have been a result of a software
“upgrade.” This is one of the most common opportunities for embarrassment. All
too often, people respond by reducing the frequency of change, which actually
increases the chance that any one change will cause a disaster (because it is
likely to be a larger, more complex change). There are methods of reducing this
risk to near zero.
- “Us vs. them.” Most data center disasters I have seen happen when there is a
(typically well-intentioned) strict separation between data center operations
and the rest of the world.
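To put rough numbers on the change-management point in the list above, here is a back-of-the-envelope sketch. It rests on a simplifying assumption that is not from the RIM reports: each individual change carries the same small, independent chance of introducing a fault. The function name and the 1% figure are hypothetical, chosen only for illustration.

```python
def batch_failure_probability(p_fault: float, changes_per_release: int) -> float:
    """Chance that a release bundling `changes_per_release` independent changes
    contains at least one faulty change (assumed model, for illustration)."""
    return 1.0 - (1.0 - p_fault) ** changes_per_release


if __name__ == "__main__":
    p = 0.01  # assumed per-change fault probability, purely illustrative
    for n in (1, 10, 100):
        # Bundling more changes into one release raises the chance
        # that this particular release goes wrong.
        print(f"{n:>3} changes per release: {batch_failure_probability(p, n):.3f}")
```

Under this toy model, releasing less often does not lower the total number of faulty changes; it concentrates them into fewer, bigger releases, each of which is more likely to fail and harder to diagnose when it does.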
At minimum, it is worth a
quick, objective look at the “machine room” of your operation to see if it
looks and feels like the kind of place to which disasters are naturally
attracted.
Above all, get over That-Which-Must-Not-Be-Discussed
– yes (dare I say), be like Harry Potter – call Voldemort by his proper
name! Talk about earthquake vulnerability! Then do what you have to
do so that, when things go wrong, your service keeps working.