Machine Learning is all the rage today. It's getting so I hear about it as often as I hear about Big Data! And that's a lot!
A Machine Learning Expert is supposed to be able to wield a magic wand and solve problems that have eluded apparently smart, motivated and educated humans for decades. It's said to be that good! It's knocking off previously unsolvable problems in fintech and healthcare left and right, with no end in sight.
I'm a big fan of machine learning and related analytical techniques. I'm delighted that it is finally being applied to some problems for which it is well-suited; this happy event is long overdue. But as with most magic wands, some words of qualification and caution are in order.
Do you have an ML problem?
While the definition of machine learning has been stretched and pulled in recent years (maybe forever!), there are important numerical methods that most people don't consider to be ML algorithms. One important category is the optimization techniques studied in Operations Research. For example, if you want to optimize running an oil refinery, you probably want goal-based optimization rather than machine learning.
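To make the contrast concrete, here's a toy sketch of goal-based optimization -- a linear program solved with scipy. Every product name and number below is invented for illustration; a real refinery model has thousands of variables.

```python
# A toy "refinery" blending problem solved as a linear program --
# goal-based optimization, not machine learning. All numbers invented.
from scipy.optimize import linprog

# Decision variables: barrels of gasoline and diesel to produce.
profit = [-30, -25]        # negated: linprog minimizes, we want max profit

A_ub = [[1.2, 1.0],        # barrels of crude consumed per barrel of product
        [0.5, 0.8]]        # refining hours per barrel of product
b_ub = [1000,              # barrels of crude available
        600]               # refining hours available

result = linprog(profit, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
gasoline, diesel = result.x
print(f"gasoline: {gasoline:.0f} bbl, diesel: {diesel:.0f} bbl, "
      f"profit: ${-result.fun:.0f}")
```

No training data in sight -- the constraints and the goal are the whole model.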
What kind of machine learning?
Given that you have a problem that may lend itself to a machine learning approach, it's important that you pick the right one to use -- that's right, I said "right one." Machine learning is a body of algorithms, in fact a large and growing body -- a snapshot of just part of one list of them covers less than a quarter of the whole.
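And even among the algorithms that fit your problem type, behavior varies. Here's a hedged sketch, assuming scikit-learn on toy data, of the kind of quick comparison you'd want to run before committing to one:

```python
# Comparing a handful of scikit-learn classifiers on the same toy data.
# The point is the variety of choices, not these particular scores --
# real model selection needs careful validation on your own data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest":       RandomForestClassifier(random_state=0),
    "SVM (RBF kernel)":    SVC(),
    "naive Bayes":         GaussianNB(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} mean accuracy: {scores.mean():.3f}")
```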
Do you have all the data?
Suppose you do in fact have a problem that lends itself to machine learning, and you have a sense of the appropriate technique to use. Even then, a machine can only learn if it's got the right stuff to learn from.
This is one of the many reasons why machine learning efforts that end up yielding practical, real-world results nearly always start out with humans examining the data in great detail. Often they find that data they know is going to be important just isn't present in the data set.
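That first look doesn't have to be fancy. Here's a minimal sketch with pandas -- the file name and columns are hypothetical -- of the kind of inspection that surfaces missing fields before any learning starts:

```python
# A first pass over a data set: what's there, what's missing, what's odd?
# The file name is hypothetical.
import pandas as pd

df = pd.read_csv("claims.csv")

print(df.shape)         # how many rows and columns?
print(df.dtypes)        # are the types what you expect?
print(df.isna().mean().sort_values(ascending=False))   # fraction missing, per column
print(df.nunique())     # suspiciously few (or many) distinct values?
```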
And then when all the relevant data is there, what to do with it is often blindingly obvious -- human learning works just great, thank you very much!
Is the data any good?
There's this phrase you may have heard; it goes something like "garbage in, garbage something-or-other." You can probably figure it out. Sounds simple. It isn't. Even finding out whether your data is any good can be a major challenge.
And then, once you're pretty sure it's good, does it stay good? This isn't a pie-in-the-sky problem. One company I know that processes massive amounts of credit card data has a few people assigned full time to detecting when a card company has changed something important about the data. When told about it, the card company would typically respond with skepticism. Then they would check. Then they'd say "oops." The "sorry" was usually implied, not stated.
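Some of that vigilance can be automated. Here's a rough sketch of the idea, assuming a pandas pipeline: profile each column of a fresh feed, compare against a saved baseline, and flag drift. The 5% threshold is an arbitrary illustration, not a recommendation.

```python
# Crude change detection on an incoming feed: compare per-column summary
# statistics against a saved baseline. The threshold is arbitrary.
import pandas as pd

def column_profile(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(),
    })

def detect_changes(baseline: pd.DataFrame, fresh: pd.DataFrame) -> list[str]:
    alerts = []
    base, new = column_profile(baseline), column_profile(fresh)
    for col in base.index:
        if col not in new.index:
            alerts.append(f"column disappeared: {col}")
        elif abs(new.loc[col, "missing_frac"] - base.loc[col, "missing_frac"]) > 0.05:
            alerts.append(f"missing-rate shift in column: {col}")
    for col in new.index.difference(base.index):
        alerts.append(f"new column appeared: {col}")
    return alerts
```

Real monitoring would track value distributions too, but even this much catches the "oops" cases early.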
Are the data types identified and in good order?
Take something simple, like dates. If a date is coded, it may be a Julian day number; many DBMSs do it this way. Or a date may be just a string, like 20161112. You have to know whether that's year-month-day (Nov 12 2016) or year-day-month (Dec 11 2016) or day-year-month (Dec 20 1611). The whole Y2K problem came from this kind of coding: with only two digits for the year, Sep 9 1999 would be represented in a string like 090999, and the century was left to guesswork. Life is complicated. Data is even more complicated, in new and different ways. You have to embrace the complication.
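To make the ambiguity concrete, here's a small Python sketch showing those same digit strings decoding to different dates depending on the format you assume:

```python
# One string, three different dates -- it all depends on the assumed format.
from datetime import datetime

raw = "20161112"
for fmt, label in [("%Y%m%d", "year-month-day"),
                   ("%Y%d%m", "year-day-month"),
                   ("%d%Y%m", "day-year-month")]:
    print(label, "->", datetime.strptime(raw, fmt).date())
# year-month-day -> 2016-11-12
# year-day-month -> 2016-12-11
# day-year-month -> 1611-12-20

# Two-digit years are worse: the parser has to guess the century,
# which is the heart of Y2K.
print(datetime.strptime("090999", "%m%d%y").date())   # 1999-09-09
```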
And BTW, dates are just the tip of the messy-data iceberg.
Is the data normalized?
This may sound esoteric, but getting good results can hang on it. It basically means: is all the data that refers to the same thing coded the same way? For example, here are a couple of street address variations: "Cedar Lake East," "Cedar Lake E," "Cedar Lk E." The guy driving the post office truck knows they're the same. Does the computer?
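One common fix is to canonicalize the text before comparing it. A minimal sketch -- the abbreviation table here is a tiny invented sample, nowhere near a full postal standard:

```python
# Rule-based normalization so that address variants compare equal.
# The abbreviation map is a toy sample, not a real postal rule set.
ABBREVIATIONS = {"e": "east", "w": "west", "n": "north", "s": "south",
                 "lk": "lake", "st": "street", "ave": "avenue"}

def normalize(address: str) -> str:
    words = address.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

variants = ["Cedar Lake East", "Cedar Lake E", "Cedar Lk E"]
print({normalize(v) for v in variants})   # {'cedar lake east'} -- one form
```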
I ran into a more complex problem of this kind trying to get data about doctors who happened to work at multiple office locations. How can you be sure it's the same doctor, particularly when there are doctors with similar names and different spellings of the same name?
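When exact normalization isn't enough -- misspellings, initials, transpositions -- you're down to fuzzy matching and thresholds, which is inherently error-prone. A minimal sketch using Python's standard difflib; the names are made up:

```python
# Fuzzy matching of doctor names. Any similarity threshold trades
# false matches against missed ones -- there is no safe setting.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("John Smyth MD", "Jon Smith MD"))     # high -- same person?
print(similarity("John Smyth MD", "Joan Smythe MD"))   # also high -- is it?
```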
Is the data coded where appropriate?
That's why you'd really like to have all your data be coded. You really don't want doctor names -- you want unique doctor IDs, something like social security numbers. But all too often, you just don't have them.
How much natural language is used?
Any of the problems above can sink a machine learning project into the morass of struggling with tools and bad results. But when natural language is involved, you reach a whole new level of horror.
The world has a lot of experience with this: smart people working with the best tools on massive data sets over many years. Even things that sound simple are far from solved, and it's doubtful they ever will be solved with high precision.
One simple example: spam filters. Do you get any spam in your email? Are any valid emails marked as spam? If your answer is "no," then it's obvious that you basically don't use email.
Another example: comment filtering. As commenting on the web has exploded, so has the number and vigor of people who write nasty, obscene comments. Lots of people, yes including all the tech giants, have put loads of resources into automatically identifying comments that are inappropriate. You wouldn't think it would be that hard. It's not hard -- for humans. For machines, well, yeah it's hard.
The problem of natural language processing (NLP) remains unsolved. Even something seemingly simple like "feature extraction" from natural language can be a nightmare. Suppose you want to automatically extract from clinical notes whether a patient is homeless. Consider these sentences:
"The possibility that the patient is homeless was raised several times. I examined it carefully. Not true."
This is easily understood by a human reader, but it's surprisingly difficult for NLP because it requires linking information across sentences: the pronoun "it" has to be tied back to the attribute "homeless," and "Not true" has to be interpreted as making that attribute false. And that is still a relatively simple case.
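To see why, look at what a naive keyword extractor does with exactly that passage -- the kind of first attempt that looks perfectly reasonable and gets it wrong:

```python
# A naive keyword-based extractor applied to the clinical-note example.
# It sees the word "homeless" and answers True; the negation lives two
# sentences away, connected only through the pronoun "it."
note = ("The possibility that the patient is homeless was raised several "
        "times. I examined it carefully. Not true.")

def is_homeless(text: str) -> bool:
    return "homeless" in text.lower()    # no negation or coreference handling

print(is_homeless(note))   # True -- but the note says the opposite
```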
Conclusion
I'm all for machine learning. But fashionable trends like this all too often result in spending lots of money with promising results perpetually just over the horizon, and then ... the subject gets changed.
There's good news. Machine learning used by the right people in the right way against properly constructed and understood data sets, with the right amounts of human learning added in, can achieve astounding results. And it has! A good example is the Oak HC/FT portfolio company Feedzai, which is blocking and tackling credit card fraudsters as we speak.