I've already complained about the so-called Cloud, and I thought that would get it out of my system, but I've had it -- I'm up to here -- with blankety-blank "big data."
"Big Data" at the Harvard Library
The Harvard Library system is REALLY BIG. Here's Widener Library:
Widener has 57 miles of bookshelves with over 3 million volumes. But that's just the start. The Harvard University Library is the largest university library system in the world. There are more than 70 libraries in the system beyond Widener, holding a total of over 12 million books, maps, manuscripts, etc.
Here's the news: Harvard is putting it on-line! Well, not the actual books; there are little details like copyrights. But the metadata, about 100 attributes per object. According to David Weinberger, co-director of Harvard's Library Lab: "This is Big Data for books." A blog post described a day-long test run during which 15 hackers worked on a subset consisting of 600,000 items and produced various results.
Pretty amazing, huh? That's at least a couple miles of books!
"Big Data?" Give me a Break!
Apparently, no one does simple arithmetic anymore. Maybe it's the combined impact of reality TV shows, smartphones, global warming and Twitter. Who knows?
How much data does Harvard have? 12 million objects each with 100 attributes is 1.2 billion attributes. When I first started thinking about this, I gave generous estimates of the size of each metadata attribute. Then I got skeptical, dug a little, and found the actual data set. As of this month, there are 12,316,822 rows in the data set, with a compressed data size of 3,399,017,905 bytes. In case that appears to be a big number to you, it's less than 4GB.
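For anyone who wants to check the arithmetic, here it is as a minimal Python sketch. The row count and compressed byte count are the figures quoted above; the rest is plain multiplication and division.

```python
# Back-of-the-envelope arithmetic for the Harvard Library metadata,
# using the figures quoted above.

rows = 12_316_822              # objects (rows) in the data set
attrs_per_row = 100            # roughly 100 metadata attributes per object
compressed_bytes = 3_399_017_905

total_attributes = rows * attrs_per_row
print(f"{total_attributes:,} attributes")       # 1,231,682,200 -- about 1.2 billion

compressed_gb = compressed_bytes / 1e9
print(f"{compressed_gb:.2f} GB compressed")     # 3.40 GB -- i.e. less than 4GB
```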
4GB. Your smartphone probably has more than that, and your iPad certainly does. The laptop computer I'm using right now has more than that. Yes, yes, the data is compressed. How much will it be uncompressed? I could find out, but I'm lazy, and even in the very worst case the answer is going to be about the same: a very small amount of data.
"Big Data" usually isn't
The reason I'm really tired of hearing about Big Data is that, in the vast majority of cases, it isn't big. Not only isn't it big, it's usually kinda small. So the people who talk about it as "Big Data" are either stupid or they're liars. Either of which makes me irritated.
My acid test is simple: if a data set fits into the memory (as in the RAM-type memory, forget about disk) of a machine you can buy off the web for less than $15,000, it can't possibly be described as "Big Data" without embarrassment or shame. These days, that's a machine from (for example) Dell that has about 256GB of RAM.
This machine from Dell will fit more than 50 copies of the compressed Harvard Library data set. Who cares how big it is uncompressed? Loads of copies of it will still fit entirely in memory.
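If you want to run the acid test yourself, the arithmetic is just as unimpressive. Here's a minimal Python sketch; the 256GB figure is the Dell example above, and the 10x decompression factor is my own deliberately generous assumption, not a measurement.

```python
# The acid test: does the data set fit in RAM on a machine you can buy
# for under $15,000? 256GB of RAM per the Dell example above; the 10x
# decompression factor is a deliberately generous guess.

compressed_gb = 3.4
ram_gb = 256

print(f"{ram_gb / compressed_gb:.0f} compressed copies fit in RAM")      # ~75

uncompressed_gb = compressed_gb * 10       # assumed worst-case expansion
print(f"{ram_gb / uncompressed_gb:.1f} uncompressed copies fit in RAM")  # ~7.5
```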
Do not say "Hadoop" or "clusters." If you do, I swear I'll slap you.
Conclusion
The conclusion is obvious. Don't even start thinking the phrase "Big Data" unless you've first applied some common sense and performed some simple arithmetic. Definitely don't utter those words around me without first having applied the sobering tonic of arithmetic. And wake up and smell reality: "Big Data" and "Cloud" are little more than words that vendors use to trick unsuspecting victims into spending money on cool new things.
Then, of course, maybe you really do have big data -- like objectively, unusually BIG data. That's cool. It's easier to work with and get value out of large data sets than ever before, and I hope you can use the latest, most productive methods for doing so. Some of the Oak companies are doing just that, and it's exciting.