Want to build a scalable application? Use a scalable architecture. What's a scalable architecture? Simple. A scalable architecture is "shared nothing," an architecture in which nothing is centralized. This seems to be harder to achieve the "deeper" you go into the stack; many software architects still seem to like centralized databases and storage. It's sad: centralized databases and/or storage are the most frequent cause of problems, both technical and financial, in the systems I see.
Scalability
Scaling is a simple concept. As your business grows, you should be able to grow your systems to match, with no trouble. Linear scalability is the goal: 11 servers should be able to do 10% more work than 10 servers. Adding a server gives you a whole server's worth of additional capacity. With anything less, you don't have linear scalability.
This is what we normally enjoy with web servers, due to the joys of web architecture and load balancers.
Sadly, this is often not what we enjoy with databases, because of mindless clinging to obsolete practices and concepts.
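The linear-scalability goal described above can be put as a small calculation. What follows is only a rough sketch: the function name `scaling_efficiency` and the throughput figures are made up for illustration and are not measurements from any real system. It compares the capacity you actually gained from adding servers with the capacity you would have gained if scaling were perfectly linear.

```python
def scaling_efficiency(servers_before, throughput_before,
                       servers_after, throughput_after):
    """Return the fraction of the ideal linear gain actually achieved.

    1.0 means each added server contributed a whole server's worth of
    capacity; anything lower means scaling is less than linear.
    """
    # Capacity that the extra servers would add under linear scaling.
    ideal_gain = throughput_before * (servers_after - servers_before) / servers_before
    # Capacity that was really gained.
    actual_gain = throughput_after - throughput_before
    return actual_gain / ideal_gain


# Hypothetical numbers: 10 servers handle 1000 requests/s.
linear = scaling_efficiency(10, 1000, 11, 1100)      # 11th server adds a full 100/s
sublinear = scaling_efficiency(10, 1000, 11, 1050)   # 11th server adds only 50/s
print(linear, sublinear)
```

In the first case the eleventh server delivers 10% more work, which is the goal; in the second it delivers half a server's worth, which is the kind of result a centralized bottleneck tends to produce.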
Databases
Databases are a wonderful example of a tool that was invented to solve a hard problem and has created a lot of value -- but has turned into a self-contained island of specialization that tends to cause more problems than it solves.
Databases are a classic example of a software layer
Most people in software seem to think that having layers is a good thing. Software layers are, with few exceptions, a thing that is very, very bad! The existence and necessity of the layer tends to be accepted by everyone. It's so complicated that it requires specialists. The specialists are special because they know all about the layer and what it can do. They compete with other specialists to make it do more and more. Their judgments are rarely questioned. Sadly, they are wrong all too often both on matters of strategy and detailed tactics. All these characteristics of software layers apply to the database.
Database pathology is a classic result of the speed of computer evolution
Databases were invented by smart people who had a hard problem to solve. But the fact that they have persisted as a standard part of the programmer's toolkit, essentially unchanged, is a classic side-effect of the fact that computer speed evolves much more quickly than the minds and practices of the programmers who use them. This concept is explained and illustrated here.
How to fix the problem
There are a couple of approaches, depending on how radical you are.
- Fix the scalability problem by moving beyond databases
If you have the chance, you should do yourself and everyone else a favor and move to the modern age. As I show in detail here, the fierce speed of computer evolution has solved most of the problems that databases were designed to solve. The problem no longer exists! Get over it and move on!
- Fix the scalability problem by moving to shared nothing
If you're not willing to risk being burned at a stake for the heresy of claiming that a problem involving a bunch of data can be solved nicely without a database, there are almost always things you can do to fix the typical centralized database pathologies.
The desire to have all the data in a single central DBMS is strong among database specialists. This desire is what fuels the incredible amount of money that goes to high-end solutions like Oracle RAC. The desire is completely understandable. It's not unlike what happens when a bunch of guys get together: bragging rights go to the one with the coolest car or truck.
However understandable, this desire is misguided, counter-productive and remarkably ignorant of fundamental DBMS concepts, like the difference between logical and physical embodiments of a schema. There is no question that there needs to be a single, central logical DBMS. But physical? Go back to database school, man! All you need to do is apply a simple concept like sharding, which in some variation is applicable to every commercial schema I've ever seen, and you've gone most of the way to the goal of a shared-nothing architecture, which gives you limitless linear scaling. Game over!
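To make the logical-versus-physical distinction concrete, here is a minimal sketch of hash-based sharding. It is an illustration under stated assumptions, not a production design: the class name `ShardedStore`, the choice of SHA-256 with a modulo, and the in-memory dicts standing in for separate database servers are all hypothetical. Callers see one logical store; the data lives on several physical shards that share nothing with each other.

```python
import hashlib


class ShardedStore:
    """One logical key-value store backed by independent physical shards."""

    def __init__(self, shard_count):
        # Assumption: each dict stands in for a separate database server.
        self.shards = [{} for _ in range(shard_count)]

    def shard_for(self, key):
        # Use a stable hash so a given key always routes to the same shard,
        # across processes and restarts.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        # Only the owning shard is touched; no shard coordinates with another.
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)


store = ShardedStore(shard_count=4)
for customer_id in range(1000):
    store.put(f"customer-{customer_id}", {"id": customer_id})

print(store.get("customer-42"))
print([len(shard) for shard in store.shards])
```

Every read or write for a key goes to exactly one shard, which is what lets capacity grow with the number of shards. One caveat of this particular sketch: plain modulo hashing remaps most keys when the shard count changes, so real deployments usually reach for something like consistent hashing or a lookup directory instead.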
Analysis
Computers evolve far more quickly than software, which itself evolves far more quickly than the vast majority of programmers. There is nothing in human experience that evolves so quickly. This fact explains a great deal of what goes on in computing.
I've found that the more layers a given computer technology is "away from" the user, the more slowly it tends to change, i.e., the farther in the past its "best practices" tend to be rooted. In these terms, databases are pretty deeply buried from normal users, metaphorically many archaeological layers below the surface. They are "older" in evolutionary terms than more modern things like browsers. Similarly, storage is buried pretty deep. That's why most of the people who devote their professional careers to them are mired in old concepts. If you think about it, you realize that DBMS and storage thinking strongly resembles thinking about those ancient beasts that used to rule the earth, mainframes!
Conclusion
Most software needs to be scalable. "Shared nothing" is the key architectural feature you need to achieve the gold standard of scalability, linear scalability. Shared nothing is common practice among layers of systems that are "close to" users, but relatively rare among the deeper layers, like database and storage. But by dragging the database function to within a decade or so of the present, and by applying concepts that are undisputed in the field, you can achieve linear scalability even for the database function, and usually save a pile of money and trouble to boot!
Infinitely easier said than done. The toolsets here require either a team of Java engineers to babysit them or huge design compromises. On top of that, business fundamentals like backups, recovery and DR are all still in the build-it-yourself-from-scratch realm. It's not that it can't be done; it's just a massive engineering time-suck to solve a problem you probably don't have. You'd have to be drowning in VC money you don't know what to do with to blow it on building your own datastore.
Hopefully this all changes quickly. Riak's addition of secondary indexes was a big step forward, and the various branches of Galera in the MySQL ecosystem look super promising. But right now, in 2014, you can solve these problems much cheaper and faster with a single physical instance (or a handful of them) using SSDs.
Posted by: Jim B. | 02/07/2014 at 02:29 PM
What I meant wasn't building your own data store from scratch, but using one of the many post-DBMS tools that are available, for example Redis. I also meant taking a post-rows-and-columns approach to data structure, for example documents in some format. I guess I wasn't clear enough.
Thanks for your comment, I couldn't agree more that evolving an existing place away from DBMS is a nightmare. I've done it in a couple places by applying the concepts to new projects.
Posted by: David B. Black | 02/08/2014 at 10:18 AM