Using advanced software techniques can have a dramatic positive impact on a business. It’s important to ensure that your software efforts aren’t stuck in out-moded, last-generation tools and techniques.
Nearly everyone, including me, agrees with this simple statement. Nearly everyone also agrees on at least the top members of a list of reasonable candidates of “advanced software techniques” that are not “out-moded” or “last-generation.” That last statement is where the best technical people part ways with the crowd.
No, I’m not talking about hard-to-understand weird-o’s babbling about esoterica in some corner. The best technical people understand and support using the best and most appropriate techniques for solving a given problem, regardless of the recency of the technique or its prevalence. Sadly, it is often the case that the most-talked-about hot trends in software should not be used in any business that actually wants to spend money wisely and get stuff done, quickly and well.
This is a BIG subject. It’s important. It’s deep. And it’s extensive. So let’s start with a big, fat, juicy example, one that was hot, hot, HOT but is now fading away, so it’s possible to talk about it somewhat more rationally. Maybe. I hope.
Big Data and Hadoop
There is little doubt that Big Data is a huge trend in software, though at least talking about it with that name appears to be undergoing a typical slow fade. Here is a review I wrote more than 5 years ago of the Big Data fashion trend as it existed at that time. It was everywhere you looked! Magazine covers! Ads! Conferences! Books! If you weren't somehow doing Big Data you were nothing and nobody.
I've been working with data my entire professional life. The data has been small, medium, large, big, huge, totally awesomely huge and even gi-normous. Since I've been faced with space and time constraints, I have long since settled on a fundamental concept of computing, simple but rarely done: Count the data! It sounds ridiculous, but it's almost a secret weapon. Here is some analysis of data sizes in the context of the big data trend. Here is a more detailed example of a big data set that Harvard bragged about. Hint: the data isn't very big.
I shouldn't have to say this, but here it is: For anything that people say is "big data," the very first step should be to ... count the data. Sounds simple, but apparently it's not. It's also not common to dig into the data a bit. In the rare cases where the data still looks big after you count it, digging in usually shows that most of it is just not needed or not relevant. Which makes it not big anymore. Which means you don't need Hadoop!
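As a rough sketch of what "count the data" can look like in practice, here is a back-of-the-envelope calculation in Python. The row count, bytes per row, and the 5% "actually needed" fraction are made-up numbers for illustration only, not measurements from any real data set, and the function names are hypothetical.

```python
def estimate_bytes(rows, bytes_per_row, needed_fraction=1.0):
    """Rough size of a data set: rows x bytes per row x fraction actually needed."""
    return rows * bytes_per_row * needed_fraction


def human(n):
    """Format a byte count using binary units (1 KB = 1024 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PB"


# Hypothetical example: one billion log records of about 200 bytes each.
raw = estimate_bytes(1_000_000_000, 200)
# Assume (for illustration) that only 5% of the bytes matter for the question.
needed = estimate_bytes(1_000_000_000, 200, needed_fraction=0.05)

print("raw:   ", human(raw))     # 186.3 GB
print("needed:", human(needed))  # 9.3 GB
```

With these assumed numbers, the part that matters is a few gigabytes, which is small enough to sit in the memory of a single ordinary machine.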
Hadoop
I know I'm being silly here. Hey, we're talking Big Data! Surely we've got some somewhere. We've got to get in an expert and crunch away so we can get those virtuous, business-enhancing juices flowing through our company.
In this situation, at least until recently, what that meant was that you dove into the next level of detail and found out that the go-to tool was Hadoop. Dig in some more, and it sounds great. It's scalable without limit. You build your Hadoop cluster, script up some calculations, tap into the ocean of data you've got somewhere, and hear about how Hadoop spins those computers up and down, and crunches all the data, using the computers that are available, and even working around ones that fail without anyone having to respond to some old-style beeper or something. No wonder Hadoop is the go-to tool for Big Data!
In the vast majority of situations, it's "decision made" time at this point. You get in your experts, they build their Hadoop cluster and away you go, climbing the Hadoop stairway to Big Data Heaven, with a glow of virtue surrounding everyone involved.
Very few people seem to dive in and understand what Hadoop and its main programming paradigm MapReduce are all about. The Hadoop "experts" don't seem to know what the reasonable alternatives are, and when they might be applicable.
Here's an example. In 2011, one of our large web companies had a huge problem caused by Google's move to a new search algorithm. The CTO grabbed a massive web log file, wrote some code to boil the terabytes (Big Data for sure!) of data down to the key data elements, and then loaded them into the 512GB of DRAM of his powerful laptop computer and ran some advanced machine learning against it. You can see the CTO doing the work here. A few days later he had figured out Google's algorithm, reflected it in the company's website family, and traffic increased back to nearly the pre-change norm. If he had taken the Hadoop path, he would have worked for months, spent huge amounts of money, and found that the cluster and Hadoop thing would have basically been irrelevant to the problem.
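To make "boil the data down to the key data elements" concrete, here is a minimal sketch of the general idea: stream through a log one line at a time and keep only a small in-memory summary. This is not the CTO's code. The log layout (timestamp, URL, referrer separated by spaces), the sample lines, and the function name are all assumptions invented for illustration.

```python
from collections import Counter


def boil_down(lines):
    """Stream over log lines, keeping only counts per (url, referrer) pair.

    Only one line is held at a time, so the input can be far larger than
    memory as long as the summary itself is small.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip malformed lines
        _timestamp, url, referrer = parts[:3]
        counts[(url, referrer)] += 1
    return counts


# Hypothetical sample; in real use `lines` could be an open file object.
sample = [
    "2011-03-01T00:00:01 /home google",
    "2011-03-01T00:00:02 /home google",
    "2011-03-01T00:00:03 /about direct",
    "garbage",
]
summary = boil_down(sample)
print(summary[("/home", "google")])  # 2
```

Because a Python file object can be iterated line by line, passing one in place of `sample` would process a large file without loading it all at once.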
Here are a few things to consider:
- Hadoop, by definition, spreads its computing over the many machines available to it in the cluster, using HDFS (the Hadoop file system) for reading and writing data.
- It is literally thousands of times faster to get data from local memory than it is to get it from a disk-based file system. The fewer file reads and writes needed to perform a computation, the faster it will be. Hadoop doesn't care.
- Using more computers in a cluster means that there will be more I/O than using fewer. Many important calculations can be performed on a single properly-configured machine!
- MapReduce, the key processing engine of Hadoop, is one of those cool-sounding ideas whose job can be done perfectly well with normal code, which can do way more, vastly more efficiently.
- Why would anyone consider such an insanely wasteful approach? Once you know the origins, it makes sense.
- If you're a big search engine company, you have to have loads of servers, enough to hold all the data and handle all the search queries at peak traffic times.
- As is typical in situations like this, loads of servers will be under-used a large fraction of the day. Why not write some code that sucks up these "free" cycles and puts them to work? Why not build a framework so you can just specify what you want done, without worrying about which resources on which machines are used? Who cares if it's inefficient? It gets stuff done with the computers I already have. Brilliant!
- Now it makes sense that Hadoop started and grew at Yahoo, copying some ideas about a narrowly-applicable (MapReduce) system and framework built at Google.
- Except that at Yahoo, they somehow decided to make the Hadoop machines dedicated! Last I heard, they were up to, get ready ... 40,000 servers. Wow.
- With such an investment in getting value out of Big Data, Yahoo must be booming, just sky-rocketing with all the juice that has come out of the investment. Not. Why would anyone want to use an expensive, strange tool that generated no value for its originator? One word: fashion.
- Yes, there are some narrow situations in which Hadoop might be applicable. But in the vast majority of cases, you'll spend too much time getting way too many computers to do too little processing on not all that much data, and taking way too much time to get it done.
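To illustrate the point above about MapReduce and normal code, here is a sketch of word counting, the usual teaching example for MapReduce, written as an ordinary single-process Python loop. It assumes the input can be streamed through one machine; the function name and sample text are made up for illustration.

```python
from collections import Counter


def word_count(lines):
    """Count words across all lines in a single pass.

    The "map" step (split a line into words) and the "reduce" step
    (sum the counts per word) collapse into one Counter update.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


# Hypothetical input; any iterable of strings works, including a file object.
text = ["the quick brown fox", "The lazy dog", "the end"]
result = word_count(text)
print(result["the"])  # 3
```

There is no cluster, no job scheduler, and no intermediate files here. Whether this is enough depends on the size you get when you count the data.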
Conclusion
There is no doubt -- none! -- that you should use advanced software techniques in your business, because they will give you a competitive edge over everyone else.
The trouble is telling the difference between (1) value-adding advanced software techniques and (2) hype and software fashion. Even most software people have trouble telling the difference! In fact, software people who insist that there is a difference between value-adding software techniques and the latest thing that everyone is talking about run the serious risk of being marginalized, and categorized as being old farts who are unable or unwilling to do the work to learn the new methods.
In sharp contrast to the general thinking, software is a pre-scientific, fashion-driven field that resists holding new ideas to reasonable standards of proof and evidence. This makes it tough for business executives to know what to do. There is only one approach that works: roll up your sleeves, put ego and pride to the side, and figure it out using evidence and common sense.