"Big Data" is everywhere. If only because of this, it is important, like the way Paris Hilton
What's included in "Big Data?"
If your concern is storing, serving or transmitting it, you don't care what kind of data it is -- data is data, a pile of bits.
But not all data is created equal. The easiest way to understand this is to break all the bits into relevant buckets. By far, the largest bucket is for image data (including both still and moving pictures, videos). While the ratios vary, it's not unusual for there to be 100 bits of image data for each bit of other data.
While there's not a commonly accepted terminology, all the rest of the data can be understood as "coded" data. This again falls into two categories. The larger portion is "unstructured" data, things like documents, blogs, e-mails and most web pages (except for the images and videos on them). The smaller portion is "structured" data, which includes all databases, forms and anything else that can show up in a report.
When people talk about "big data," they could be talking about any of the above, but mostly people talk about it because they want to extract actionable information from it, and the source of most actionable information is structured data. So in the vast majority of cases, when people talk about "big data," they're talking about structured data.
Did Data used to be Small and Now it's Big?
Think about a bank statement. There's a little information about you at the top, but most of the statement is probably taken up by the transactions -- money moving into and out of the account. In general terms, this is the action log, the transaction history. This pattern of having an account master and detail records is a common one.
Now think about a web site. The site itself is like the bank statement, and the record of people visiting and intereacting with it is like the transaction history, generally known as a web log.
People generate far more transaction records when interacting with the web than other human activities; for example, you probably click on hundreds of pages for each bank transaction you make. So the amount of data can be pretty big.
The simple answer is: before the web, transaction data wasn't very big, and with the web, there's a lot more of it than there was before. Of course big data isn't just about the web; but the web has certainly gotten people to pay attention.
So where did "Big Data" come from?
It would be interesting to do a cultural history, but I suspect that the current interest in "big data" stems from the following factors:
- Companies that pay attention to web logs get information about visitor behavior that can be used to make more money.
- Internet advertising companies have done exactly this for years, and are getting really good at it.
- Shockingly, most people don't analyze their data to improve their behaviors.
- A closed loop system in which the results of your actions are used to enhance future actions is the clear winning strategy.
- This requires (gulp) collecting and analyzing the relevant data, which is far larger than most people are used to dealing with.
Thus the term "big data," which currently applies to just about any body of transaction data.
What's "Big" about "Big Data?"
Let's start by applying one of the fundamental concepts of computing to the question: counting. One of the first disk drives I got to use was a twelve inch removable pack developed by IBM:
Its capacity was about 1MB. While that may sound small by today's standards, let's put it in perspective. Each byte is the equivalent of a character that you can type. Using a generous measure of 30 wpm and 5 cpw, that's 9,000 characters in an hour of continuous typing with no breaks, so the disk above has a capacity of more than 100 hours of continous typing. That's one reason I thought the disk's capacity was huge -- it easily held the source code for the FORTRAN compiler I wrote at the time, which was about a year's worth of work!
Now let's get modern. Drives have gotten smaller while holding more and more. Here's a good visualization of the progression:
We're now at the point where truly small drives (1 to 2.5 inches) hold massive amounts of data; 1TB or more is common.
How much is that? Remember, it would take 100 hours of continuous typing to fill up the large disk pictured earlier. How much space would those drives fill if you had 1TB to store? That's about 1 million of the older disks; if you packed them tightly, they would fill a room that was about 100 feet long, 100 feet wide and 10 feet high. And I would have to type for 100 million continuous hours to fill them up. Now, that's big data.
Now that we've got a sense of how big a TB is, let's get real.
On a good day, this blog might have 100 page views, each generating a server log record. Such records vary in length, but let's say they average 100 bytes in length each, or 10K bytes a day. Not much.
Let's say I caught up to the Washington Post, a site which is in the top 100 in the US. It gets about 1 million page views a day. That would be a mighty 100 million bytes a day of raw server log data. 10 days would add up to a GB of data, which means that ten thousand days, about 30 year's worth of data would fit on one of those physically little drives pictured above that holds just 1TB of data.
The Washington Post is a major site; top 100. Their web transaction logs are the biggest data for analysis they've got. And here's what 30 year's worth of their data will fit on:
That's what they call "big data." This is why I instinctively drop into cynical mode when the subject of "big data" comes up. It just isn't usually very big!
How much data do you need?
It depends on context. If you're a website like Facebook offering a free service holding user's data, the answer is simple: you keep as much of the user's data as you feel like. You can (and if you're Facebook, regularly do) throw out data any time you feel like it, or just drop it on the floor and lose it because your programmers weren't up to dealing with it.
If you're a money-making business that depends on data, you could probably run your business better if you
- Kept all the data
- Analyzed it
- Came up with useful observations, and
- Changed your behaviors accordingly.
But most businesses don't do this very well, if at all. And they are feeling increasingly guilty about it. Thus the marketing drum-beat for selling everything that can possibly be labelled "big data."
Sarcasm aside, the fact is that most businesses don't need much data in order to perform wonderfully useful analyses. The reasons are simple:
- The things that matter the most are things you're not doing yet. The data you've got is historic. It's like if you're a comedian and the audience doesn't laugh much; no amount of big data analysis of audience reaction will help you come up with better laughs.
- The impact of big potential changes will be seen in lots of your data. Go back to statistics 1.01. How much data do you need to see that the coin you're flipping isn't a fair one? Only enough to prove that the 2 out of 3 times it comes up heads isn't a fluke.
- In the end, how many changes can you realistically make? Hundreds? How about rank ordering them, finding the most important ones first, then moving on from there? You'll quickly get to diminishing returns.
Finally, more important than anything else, is getting into an experimental, data-driven, closed-loop system. This is always the key to success. It how organizations become successful, get more successful, recover from trouble, and stay on a winning path.
Conclusion
For better or worse, "big data" is likely to be with us for awhile, at least as a technology fashion trend. Like all such fashion trends, it's a useful occasion for getting us all to check if we're putting our transaction data to its most optimal use in keeping us on the track we're on and getting us onto improved ones.
Comments