People who want to build a highly available computer system tend to focus on eliminating single points of failure. This is a good thing. But we tend to focus only on the physical layer. We don't even notice the single points of failure at the logical layer. Logical single points of failure are just as likely to result in catastrophe as physical ones, and it's high time we started paying attention to them!
Why? Keeping your system up and running is the most important job of an organization's techies.
Physical redundancy
We eliminate single points of failure by having more than one of every component, and a structure that enables the system to keep running with a failed component, while allowing it to be repaired or replaced. For example, here is a typical redundant system diagram (credit Wikipedia).
And here are instructions for replacing a hot-swap drive in a redundant design (credit IBM):
Logical vs. Physical
We are familiar with the concept of logical and physical in computing. All of computing is built on layers and layers of logical structures. Frequently, we call something a "physical" layer when it is actually just the next layer down the stack of logical layers; we call it "physical" only because it is "closer" to the actual physical layer than the layer we call "logical." A good example is databases, where it is common to have a "physical" database design and a logical (higher-level) one; of course, calling it a "physical" design is a joke, because there's nothing physical about it.
Keep eliminating physical single points of failure
I am not arguing that physical redundancy isn't important or that we should stop eliminating physical single points of failure. When I see people running important computer systems that have single points of failure, I tend to wonder how often the people in charge were dropped on their heads onto concrete sidewalks as children, and how they manage to feed themselves.
The principle is simple: if you have just one of a thing, and that thing breaks, you're screwed. For example, if you have your database on one physical machine and that machine breaks, no more database.
What is a logical single point of failure?
I think people don't pay attention to logical single points of failure because it just isn't something anyone talks about. It's not part of the discourse. Let's change that!
A prime example of a logical single point of failure is a program. You create physical redundancy by running the program on several machines. Great. That covers machine (physical) failures. What about program (logical) failures? After all, program failures (i.e., bugs) hurt us far more often than machine failures. But somehow, we don't think of a program as a logical single point of failure. We think the program has bugs, that our QA and testing weren't good enough, and that we should redouble our QA efforts and somehow, miraculously, for the first time in history, create a 100% bug-free program. Ha, ha, ha.
Suppose you have version 3.2 of your program running in your data center. If that program is running on more than one machine, you have eliminated the physical single point of failure. If version 3.2 is the only version of the program that's running in the data center, then version 3.2 is a logical single point of failure. The only way to eliminate it is to have another version of the program also running in the data center!
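To make that concrete, here is a minimal sketch, in Python, of the kind of weighted traffic split that keeps a second version of the program serving real requests alongside version 3.2. The backend names and weights are made up for illustration.

```python
import random

# Hypothetical backend pools: most traffic stays on the proven version,
# a slice goes to the newer one, so no single version is the only one running.
BACKENDS = {"app-v3.1": 80, "app-v3.2": 20}

def pick_backend() -> str:
    """Weighted random choice between the two live versions of the program."""
    names = list(BACKENDS)
    weights = [BACKENDS[name] for name in names]
    return random.choices(names, weights=weights, k=1)[0]
```

If version 3.2 turns out to have a crippling bug, the traffic it was carrying can be shifted back to 3.1 instead of the whole site going down.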
Eliminate logical single points of failure, too!
Smart people already have a data center discipline that eliminates logical single points of failure.
Suppose there is a new version of Linux you think should be deployed. Do you stop the data center, upgrade all the machines, and start things up again? While that may be the most "efficient" thing to do, from a redundancy point of view it's completely insane. Smart people just don't do it. They put the new version of Linux on just one of their machines and see how it goes. If it runs well, they deploy it to another machine, and eventually all the machines have it. It's called a "rolling upgrade."
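Here is a minimal sketch of that discipline as a script. The host names, health endpoint, and upgrade command are all hypothetical, and real shops usually lean on orchestration tooling rather than a hand-rolled loop, but the shape is the same: touch one machine, verify it, then move on.

```python
#!/usr/bin/env python3
"""Minimal rolling-upgrade sketch: upgrade one host at a time,
health-check it, and stop the rollout if anything looks wrong."""

import subprocess
import sys
import time
import urllib.request

# Hypothetical host list, health-check URL, and upgrade command.
HOSTS = ["app01.example.com", "app02.example.com", "app03.example.com"]
HEALTH_URL = "http://{host}:8080/health"
UPGRADE_CMD = ["ssh", "{host}", "sudo apt-get install -y myapp=3.2"]

def healthy(host: str, retries: int = 5, delay: float = 5.0) -> bool:
    """Poll the host's health endpoint until it answers 200 or we give up."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(HEALTH_URL.format(host=host), timeout=3) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(delay)
    return False

def main() -> None:
    for host in HOSTS:
        print(f"Upgrading {host} ...")
        cmd = [part.format(host=host) for part in UPGRADE_CMD]
        if subprocess.run(cmd).returncode != 0:
            sys.exit(f"Upgrade command failed on {host}; halting the rollout.")
        if not healthy(host):
            sys.exit(f"{host} is unhealthy after upgrade; halting the rollout.")
        print(f"{host} looks good; moving on.")
    print("Rolling upgrade complete.")

if __name__ == "__main__":
    main()
```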
Same thing with the application. Great web sites change their applications frequently, but they do it using the rolling upgrade discipline and, if they're really smart, with live parallel testing as well.
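Live parallel testing can be as simple as shadowing a copy of each request to the new version and comparing its answer with the one from the version you trust, while the user only ever sees the trusted answer. A minimal sketch, with hypothetical handler functions standing in for the two versions:

```python
import logging
import threading

log = logging.getLogger("parallel-test")

def handle_with_parallel_test(request, stable_handler, candidate_handler):
    """Serve the request from the stable version, and shadow it to the
    candidate version in the background so differences can be logged
    without ever affecting the user."""
    stable_response = stable_handler(request)

    def shadow():
        try:
            candidate_response = candidate_handler(request)
            if candidate_response != stable_response:
                log.warning("Mismatch for %r: stable=%r candidate=%r",
                            request, stable_response, candidate_response)
        except Exception:
            log.exception("Candidate version blew up on %r", request)

    threading.Thread(target=shadow, daemon=True).start()
    return stable_response
```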
Getting into the details of application design, the best people go beyond this method to create another logical layer, so that the things that change most often are stored as "data," in a way that changes are highly unlikely to bring down the site. A simplistic example is content management systems, which are nothing more than a way of segregating the parts of a site that change often from those that don't, and keeping the frequently changed parts (the content) in a non-dangerous format.
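As a toy illustration of that separation (the file name and page template are made up): the template is code that changes rarely and goes through the full release discipline, while the content lives in a plain data file that editors can change all day without being able to bring down the site.

```python
import json
from string import Template

# Code: changes rarely, goes through the full release discipline.
PAGE = Template("<h1>$title</h1>\n<p>$body</p>")

def render_page(content_path: str = "content/home.json") -> str:
    """Load the frequently edited part of the page from a data file.
    A malformed file degrades to a fallback page instead of an outage."""
    try:
        with open(content_path) as f:
            content = json.load(f)
        return PAGE.substitute(title=content["title"], body=content["body"])
    except (OSError, KeyError, ValueError):
        return PAGE.substitute(title="Welcome", body="Content is being updated.")
```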
Conclusion
There is little that is more important than keeping your system available to your customers. Eliminating single points of failure is a cornerstone activity in this effort. Many of us are well aware of physical single points of failure, and eliminate them nicely. It's time for more of us to include logical single points of failure in our purview, and to eliminate them with the same vigor and thoroughness that we do the physical kind.
The redundant CPUs in the picture vote, and if they disagree then the system knows something is broken. Taken to the extreme, you could do this with software by having separate teams follow the same requirements to develop two programs, and then at runtime vote that the programs get the same results. Sounds very reliable, but also twice as expensive.
The most reliable software I ever worked on (the 5ESS telephony switch) had a way to apply software updates while the system was running. 5ESS actually kept the old and new versions of the code in memory at the same time. Old phone calls would use the old code and new phone calls would use the new code. If there was a software fault, 5ESS would automatically stop using the new code for new calls.
There is a Linux feature for applying kernel patches without rebooting.
Posted by: Brian Beuning | 05/04/2011 at 06:38 PM
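The voting scheme the comment describes (often called N-version programming) can be sketched in a few lines of Python. The majority rule and the idea of passing the independently developed implementations in as plain functions are illustrative, not how any particular fault-tolerant system actually does it.

```python
from collections import Counter

def voted_result(args, implementations):
    """Run independently developed implementations of the same spec on the
    same inputs and accept a result only if a majority agree; disagreement
    means something, somewhere, is broken."""
    results = [impl(*args) for impl in implementations]
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError(f"No majority among versions: {results}")
    return winner
```

With two implementations this degenerates to "they must agree"; with three or more you also get a tie-breaker, which is roughly what the voting CPUs in the diagram buy you.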
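And here is a rough sketch of the old-code/new-code idea from the 5ESS comment. This is not the actual 5ESS mechanism, just the general shape: keep both versions loaded, send new work to the new version, and fall back to the old version automatically if the new one faults.

```python
class DualVersionDispatcher:
    """Keep the old and new versions of a handler loaded side by side.
    New calls go to the new version; if it ever faults, stop trusting it
    and route everything to the old version instead."""

    def __init__(self, old_handler, new_handler):
        self.old_handler = old_handler
        self.new_handler = new_handler
        self.new_code_trusted = True

    def handle_new_call(self, call):
        if self.new_code_trusted:
            try:
                return self.new_handler(call)
            except Exception:
                # A fault in the new code: quarantine it for future calls.
                self.new_code_trusted = False
        return self.old_handler(call)
```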