Nearly everyone professes to LOVE data. Just think about all the talk about Big Data, Data Lakes and the rest. Lies. Liars lying big LIES. Everyone says they like data ... until they get near it. Suddenly they develop fevers and rashes. They're allergic! Someone else will have to actually handle the data!
Data, the foundation of AI, ML, Analytics
All you have to do is get a job in one of these fancy subjects, and you quickly get hit with reality. When you were in school, you had wonderful exercises where you could develop your skills in deep learning, random forest, or whatever. Now in the real world, some older person assigns you to some juicy-sounding task where you'll get to use your skills. Where's the data? you ask. A wave of the hand tells you it's over there. You go over there and can't believe what you see. Why, it's nothing like it was in school! You try for a couple hours to clean things up. Then a couple days. It's still bad, but maybe good enough. So you run it through some models. Disaster. The system crashes and/or generates garbage. You complain. "Grow up," you're told. "This is what we've got to work with. Deal with it."
At the end of a year, you realize you've spent half your time in meetings of one kind or another, and 90% of the "working" time has been spent trying to get the data in order. With unsatisfying results. You've got some choices to make. You can lie. You can get into management, marketing or sales. You can roll up your sleeves, forget the fancy stuff you learned in school, and become a data clean-up specialist, which is actually more like a create-decent-data-from-scratch specialist. Which is NOT what you signed up for. Waaaaahhhhh.
What's maybe worst of all is the status. AI and machine learning are clearly the prestigious upper floors of a grand apartment building. Deep learning thinks it's the penthouse, but whatever. The lower floors are occupied by simple analytics. The ground floors are occupied by people managing the databases and Hadoop clusters, and maybe even some ETL tools.
And then there are the basements. The sub-basement where the garbage chutes end. Where the janitors live. Where the crap from the elegant apartments is taken to be discarded. Where the water and oil and natural gas enter the building -- the things the fancy people on the upper floors need to wash up, keep warm and prepare to dress elegantly. That's the floor ... and the status ... of the data specialists.
You can tell yourself until you're blue in the face that without good data, none of the fancy stuff would work. It's the foundation, dammit! The janitors probably tell themselves the same thing about the heat, cooling, hot and cold water, cleaning and garbage removal. True -- but they're still janitors, wearing a uniform and passed in the halls by the upper-floor people as though they don't exist.
Bad data equals bad results
There's a simple reason why the incredible potential of the Big Data movement has now morphed into AI/ML and is even incorporating Blockchain. The time passes, tick. Tick. Tick. Tick. No results! Uniform use of the future tense! Claimed successes aren't really, when you dig into them.
Some of the reason is typical organizational incompetence. But much is also due to the fact that we are swimming in a sea of big data and no one wants to clean it up! It's so bad, we mostly don't acknowledge it; much easier just to ignore it.
This problem isn't new. See this:
I've talked about the importance of data as the foundation of AI/ML here. I've illustrated the horrendous problem of bad medical data here. Even basic data, like what providers are where, is wrong too often. By the way, these illustrations should be considered informal tips of massive icebergs. When I talk with true experts who are themselves knee-deep in this stuff, I find the situation is ... even worse.
As today's illustration of the problem, let me show you a piece of mail I got. It's from a major corporation, one of the big regional cable companies and internet service providers. They've got decades of experience working with customers in their geography. They've got to know every address, every household, with complete histories of using their service, dropping it, signing up again. How could they not know the basic demographics and the kind of approaches that work and the ones that don't?
Here's the mail I got (I blocked out the street number):
Looks OK, right? Nice and clear. Specific to the town, so it feels personal. Lots of good things about it. They even designed the envelope so you could see the plastic card on the right, with an eye-catching banner over it.
There's just one little problem. JO Black died in May 2001. More than seventeen years ago.
I don't think there's anything else I need to say except, good job, Optimum! You're doing a great job illustrating the near-universal toxic, rotten ocean of data in which we swim, and doing your part in keeping it that way.
Wait, you might say, this is a trivial little problem. In a way, it is: one piece of mail that shouldn't have been sent. But it's an illustration of a problem that's broad and deep. The notion that a wrongly sent piece of mail "means nothing, is trivial" is an attitude that is EXACTLY why people who care about data metaphorically wear uniforms and work out of the basement. Maybe Optimum is worse than all the rest. Sorry, they're not. JO Black gets a VERY slowly diminishing stream of mail at this address from a wide variety of vendors, large and small. So does Mrs. Grace Black, who died 4 years ago. So does Ms. Jessica Black, who lived here for a while before moving 20 years ago. So do Mr. Samuel Black and Ms. Elspeth Black, who never lived at this address.
The Problem is Everywhere
To be totally clear: the problem isn't just wasteful mail solicitations. It's everywhere, at every stage of data collection and utilization. The problem with healthcare data is immense, for example, as I've illustrated often. Bad healthcare data, which is ubiquitous, has the direct result that normal, innocent people needlessly suffer and die. It doesn't get better, because all the smart people and the important decision-makers are busy attending conferences about how AI is transforming medicine and how blockchain will solve all the medical data problems -- leaving the ragged crew of people who are supposed to fix the problem ignored in the dank basement, spending their time scheming how they can at least get to the first floor, since it's perfectly obvious that no one is actually interested in ... fixing the data problem!!!
Conclusion
Everybody says they want data. BIG data. But what they really want is a springboard to do something prestigious, which turning a toxic stream of severely polluted data into something textbook-clean is not. While hardly the only factor, this is a major factor in the widespread untalked-about failures of fancy modern techniques to deliver practical results. The plain fact is, nobody cares about data.