There’s a rapidly growing movement to take all the data that’s scattered throughout an organization, rationalize it, bring it together, and make it available for analytics that will help management to understand and ultimately to transform the business.
This movement is taking place today. It’s explosive. It’s called Big Data and Analytics.
This movement also started in the 1970s, took root in the 1980s, exploded in the 1990s and is with us today. It’s called Data Warehousing.
The fact that attention is slipping away from the thing called “data warehouse” and moving towards “Big Data” is a typical IT industry phenomenon. The problems are the same, the obstacles are the same and the solutions are nearly the same, but the rhetoric and software are entirely different. Only a few savvy industry insiders are aware of the game that’s being played.
The Enterprise Data Warehouse
The Enterprise Data Warehouse, EDW, is the industry’s holy grail. It’s the place where all an organization’s data is stored for reading and analysis. All the data from the various operational and transaction databases is extracted, transformed as required and loaded into this database. Once there, it provides a “single source of truth” for the enterprise. Since the EDW is not running transactions, reports and analytics can be run against it at will, without harming ongoing operations. Single-purpose extracts can easily be made from it to support various projects.
The EDW makes common sense. It became a major goal for many organizations, and many are still marching towards that goal. There’s just this one little problem: getting there. Then there’s a second problem: realizing the potential value.
There are lots and lots of organizations that don’t have to worry about having an EDW that fails to fulfill its promise, because they just get bogged down along the way and never really get there.
Why don’t you read much about this? Simple: who wants to admit it? And if the road to the EDW ends up trapping those marching down it in impassable mud, who outside the organization is ever going to know it?
There’s a simple little acronym in EDW that is the tip of the mud-trap in which EDW gets bogged: ETL, which means Extract, Transform and Load. That’s what you do to get the data from where it starts to where it needs to be, in the EDW. Simple, right? Oh, if only it were…
Extract, Transform and Load
Before you even get to the E of ETL, you have to find the data. Then you have to get access to the data, with a properly jealous operational management group anxious that you avoid screwing them up. You have to get the whole thing to start with, and then a stream of updates.
The “database” could be nearly anything. It could be a set of ISAM files running under CICS on an IBM mainframe, in which case you need to get your hands on the source copybooks that contain the data definitions to have any hope of making sense of the files.
It could be something nice and modern, like Oracle. But you’d better start by getting a full dump of the schemas to have any hope of navigating among what could be many hundreds of tables. Then, without an E-R diagram that’s up to date, you’ll have little chance of making sense of the tables. Then when you get down to it, you may discover a world of stored procedures initiated by access triggers, so that your innocent “just let me read the tables” turns out to have side effects. And then, getting the updates? You’ll soon find yourself either crawling to the DBA and begging for a change log to be shipped to you, or pleading to be allowed to program in some trigger-initiated stored procedures yourself, so you can get the updates without killing the performance of the DBMS, and avoid getting set upon by a mob of angry users.
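To make "get the whole thing, then a stream of updates" concrete, here is a minimal sketch in Python. It assumes a change log shaped as (operation, primary key, row) tuples; that shape, the function name apply_changes and the sample rows are all hypothetical, and whatever a real DBA ships you will look different and be messier.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical change-log entry: (operation, primary_key, new_row_or_None).
Change = Tuple[str, int, Optional[dict]]

def apply_changes(snapshot: Dict[int, dict], changes: Iterable[Change]) -> Dict[int, dict]:
    """Apply an ordered change log to a copy of the initial full extract."""
    current = dict(snapshot)  # leave the original extract untouched
    for op, key, row in changes:
        if op in ("insert", "update"):
            current[key] = row
        elif op == "delete":
            current.pop(key, None)  # tolerate deletes for rows we never saw
        else:
            raise ValueError(f"unknown operation: {op}")
    return current

# The initial full extract, keyed by primary key (made-up rows).
snapshot = {1: {"name": "Acme Corp"}, 2: {"name": "Globex"}}

# The stream of updates that arrives afterwards, in order.
changes = [
    ("update", 1, {"name": "Acme Corporation"}),
    ("delete", 2, None),
    ("insert", 3, {"name": "Initech"}),
]

result = apply_changes(snapshot, changes)
```

Even this toy version depends on the changes arriving complete and in order, which is exactly the part you end up negotiating with the operational group.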
Phew! Now you’ve gotten through the E part of one source. What if there are dozens, or hundreds?
And then the real fun begins. The also-innocent-sounding T phase of ETL. Because T doesn’t just mean simple, no-big-deal transforms, it also means take the customer names that are represented in different ways in different places, some of which have been updated or changed independently of the others, and make it so that when you end up with a customer in the EDW, that customer represents all of your relations with exactly one customer. Having three customer rows in the EDW for one customer kind of defeats the purpose of the EDW, after all.
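Here is a deliberately naive sketch of that customer-matching problem in Python. The source systems, IDs, names and the suffix list are all made up for illustration, and the matching rule (lowercase, strip punctuation, drop legal-form suffixes) is an assumption that real data will defeat quickly; serious efforts lean on fuzzy matching, addresses, tax IDs and human review.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form suffixes to ignore when matching.
SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "company", "llc", "ltd"}

def match_key(name: str) -> str:
    """Reduce a customer name to a crude matching key."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)

def consolidate(records):
    """Group (source, customer_id, name) records that share a matching key."""
    groups = defaultdict(list)
    for source, customer_id, name in records:
        groups[match_key(name)].append((source, customer_id))
    return dict(groups)

# The same customer as three different systems might record it (made-up data).
records = [
    ("billing", 17, "Acme Corp."),
    ("crm", "A-203", "ACME Corporation"),
    ("support", 9041, "Acme, Inc"),
    ("crm", "G-001", "Globex LLC"),
]

customers = consolidate(records)
```

With this sample, the three Acme records collapse into one customer and Globex stays separate. Add a typo, a trade name or a merger and the rule falls over, which is why the T is where the time goes.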
I’m just scratching the surface here, but perhaps you can get a feel for why the glowing promise of the Enterprise Data Warehouse so often ends up with the participants hungry and wounded in various ditches along the path to the promised land.
Big Data
Forget I ever mentioned “data warehouse”, ETL or any of that other stuff. Bzzzzzzttt! New subject!! Brand new!!! NOT related to anything else in computing, completely WITHOUT history, stemming from this brand-new EXPLOSION of data that’s just EVERYwhere. It’s the Big Data movement! Where we take all these mountains of data that are just piling up useless and turn them into business GOLD. You’re already late – everyone else is already with it. There are books, conferences, experts, the whole nine yards!
You’ve got to get a Data Lake, and fill it up with data. Then you’ve got to rev up your Hadoop cluster and start cranking out those nuggets of business gold from all that data.
Except, hmmm. I’ve got to find the data. Get access to it. Get it once and then get a feed of the updates. All this data from different places, it doesn’t match up well, I’ve got to clean it up. Well, maybe I’ll just dump it into the Data Lake and let the Hadoop nerds worry about it. They’ve got all these servers at their disposal, maybe the servers can work at night cleaning everything up.
Gulp. I just looked at my nice, fresh, clean Data Lake. It’s a Data Swamp! There are snapping turtles and water moccasins swimming in there. Don’t. Like. This. Maybe I can get a transfer.
Conclusion
If “data warehousing” had been a big success, it would have kept its name and would now handle what we call “big data.” But no. “Data warehousing” projects are often classic IT projects that drift on forever, confronting obstacles and rarely producing results. Big Data is the new kid on the block. There aren’t (yet) decades of frustration and broken promises associated with it. Give it time. Every obstacle that DW projects encounter also rears up to challenge Big Data projects, and until solutions are found, returns will be equally elusive. And even then, there are conceptual flaws in most Big Data efforts.