Everyone knows losing your data is a bummer. If you're in charge of your organization's data, you know that losing data is the shortest path to "don't let the door hit you on the way out."
All the ways to ensure your data is still available when you wake up tomorrow share a common theme: "make a copy." This is such a popular theme that it has turned into a theme-and-variations: "make a copy; make another copy; copy the copy; etc."
This sounds simple, but we all know that in computing, stuff is supposed to be complicated. Sure enough, this simple "just copy it" theme has gotten mired in hotly competing ways to get it done. And of course, there are politics: whose responsibility is it to guard against loss?
So let me boil it down: there are two basic ways to do the copy:
- The guys in charge of the data, the storage guys, should copy the data from the original bunch of storage to a second bunch of storage.
- The guys who write the data, the applications or systems guys, should get their applications or systems to talk to each other and write the data twice.
The only reason this is hard is that politics and history are involved. If you had fresh, educated people starting from scratch, it would be no contest: way number 2 wins, almost every time. It's faster, cheaper and easier than way number 1. But since when can we wave a magic wand and eliminate politics and history? The reality is, storage guys own the data, they want to protect it, and so they (usually) really, really, REALLY want to be in charge.
Here's why they shouldn't be.
You've got two sites, number 1 and 2. Each one of them has a database and a bunch of storage. Transactions come into site 1 and get written to storage.
Here's a simple transaction that might be written to the database.
It's a SQL statement that says the DBMS should write the transaction into the transaction table. The transaction contains the usual fields, things like the unique ID for the transaction, the account number it's applied to, the amount of the transaction, etc. This is usually a simple string, a line or two long.
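To make that concrete, here's a minimal sketch of such a statement; the table and column names are invented for illustration, not taken from any particular system:

```sql
-- Hypothetical transaction table and columns, for illustration only.
INSERT INTO transactions (txn_id, account_number, amount, txn_date, memo)
VALUES ('TXN-000123', '4567-8901', 49.95, '2024-03-15', 'monthly subscription');
```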
When the database processes the transaction, it gets complicated, of course.
When the Insert statement goes to the DBMS, the DBMS has to write the transaction itself, but it also has to write at least a couple of the fields to index tables, kind of like card catalogs in old-style libraries that let you find where things are. Indices typically use well-known structures called b-trees, which may require a couple of writes to maintain a multi-level index, for the same reason you put related files into sub-folders so you have some chance of finding them later. There will certainly be an index for the transaction ID and one for the account number. Finally, there's a log that enables the DBMS to figure out what it did in case bad things happen.
All this happens when the Insert transaction comes in. One simple request to the DBMS, many writes and updates to the storage, usually involving reading in big blocks of data, modifying a small part of each block, and writing the whole thing out again.
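Here's a hedged sketch of the supporting structures behind the hypothetical transactions table above, just to show why one logical insert fans out into several physical writes:

```sql
-- Hypothetical schema: one logical table, several physical structures to keep current.
CREATE TABLE transactions (
    txn_id         VARCHAR(20)   PRIMARY KEY,  -- backed by b-tree index #1
    account_number VARCHAR(20)   NOT NULL,
    amount         DECIMAL(12,2) NOT NULL,
    txn_date       DATE          NOT NULL,
    memo           VARCHAR(200)
);

-- b-tree index #2: every insert updates this structure too.
CREATE INDEX idx_transactions_account ON transactions (account_number);

-- Not shown: the log the DBMS appends to before touching any of the above.
```

Every row inserted has to land in the table, in both b-tree indexes, and in the log.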
Now we come to the crux of the matter: how do we get the data over to site 2? Does the DBMS at site 1 talk to its counterpart at site 2 to get it done, or are the relevant storage blocks in site 1 copied over to site 2?
In the diagram, I show the DBMS doing the job in green and the storage doing the job in red.
You'll notice that the DBMS only has to send a tiny amount of data over to site 2, essentially the insert statement. Once it's there, DBMS #2 updates all the storage, something it's really good at doing.
To replicate the data once it's been stored (in red), HUGE amounts of data need to be sent over the network to site #2. It's not unusual for the ratio to be hundreds or thousands to one.
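To put rough, purely illustrative numbers on it: an insert statement like the one sketched above is a couple of hundred bytes. If servicing it dirties, say, five 8 KB blocks (table, two indexes, log, maybe a page split), block-level replication ships around 40 KB to convey the same logical change, a ratio of roughly 200 to 1 before any block gets rewritten and re-shipped more than once.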
Sending data between sites is a relatively slow and expensive operation. That's why, if you want replication that's fast, reliable and inexpensive, you want the application to do the job, not the storage.
The storage replication people don't like to talk about the things that go wrong, but of course they do. What happens if some of the blocks make it over but others don't? Or they arrive out of order? Or the copy doesn't sync up with the database? Or any number of other bad outcomes.
Other applications
I'm using a database application to illustrate the principle, but similar dynamics play out with other applications. All major databases can replicate (Oracle, MySQL, SQL Server, MongoDB, etc.), the major file systems can replicate (for example, Microsoft has VSS), and all the hypervisors can replicate.
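As a taste of what application-level replication looks like in practice, here's roughly what pointing a MySQL replica at a source involves; the host name and credentials are placeholders, and the exact statements vary by MySQL version. The replica then receives logical changes, much like the green path above:

```sql
-- On the source at site 1: an account allowed to read the binary log (placeholder credentials).
CREATE USER 'repl'@'%' IDENTIFIED BY 'change-me';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

-- On the replica at site 2: point it at site 1 and start pulling changes.
CHANGE MASTER TO
    MASTER_HOST = 'site1-db.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'change-me',
    MASTER_LOG_FILE = 'binlog.000001',
    MASTER_LOG_POS = 4;
START SLAVE;
```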
The hypervisors are amazing. The first thing the storage guys will come back with is how many different applications you have to fiddle with to protect their data. The answer of substance is that the incremental effort for each application is truly trivial, well under 1%. The quick answer is that hypervisors (VMware, Hyper-V, etc.) are universal, and their replication is superior to storage replication. This is exactly why, as organizations move their data centers to the cloud, they are abandoning expensive, inefficient, vendor-lock-in storage features like replication in favor of doing it in the hypervisor.
Conclusion
You have to protect and preserve your data. Non-negotiable. The storage guys used to have a monopoly on it. But their high-priced, inefficient copy methods are rapidly giving way to more effective, modern ways that save money and are nearly standard in the SLA-centric world of cloud computing.