"No man ever steps in the same river twice, for it's not the same river and he's not the same man." - Heraclitus
Heraclitus was an ancient Greek philosopher.
The quote is intriguing when viewed through the lens of machine learning. Let us explain…
In machine learning, at least for some classes of problems, a predictive model is built by training on a dataset for which the outcomes are known, and the same model is then used to predict the outcome for actual input data. The model may be refined further when more data with known outcomes becomes available, or in a continual fashion, when 'training' occurs on an incoming data point for which the current model has failed to predict the correct outcome. That is, the newly 'trained' data point enlarges the training dataset, so the model is rebuilt with one more data point of known outcome than was previously available.
In a sense, machine learning provides a way to build a predictive model statistically from known data (data with a known outcome), which may then be applied to predict the outcome for input data whose outcome is unknown.
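The continual refinement described above can be sketched in a few lines. This is only a toy illustration under our own assumptions: the class name `NearestNeighbourModel`, the method `learn_if_wrong`, and the sample points are all made up for this example, and a 1-nearest-neighbour rule stands in for whatever model is actually used.

```python
from math import dist


class NearestNeighbourModel:
    """Toy 1-nearest-neighbour classifier whose training set can grow."""

    def __init__(self, training_data):
        # training_data: list of (features, outcome) pairs with known outcomes
        self.training_data = list(training_data)

    def predict(self, features):
        # Return the outcome of the closest data point seen so far
        nearest = min(self.training_data, key=lambda pair: dist(pair[0], features))
        return nearest[1]

    def learn_if_wrong(self, features, true_outcome):
        """Add the point to the training set only when the prediction failed."""
        if self.predict(features) == true_outcome:
            return False
        self.training_data.append((features, true_outcome))
        return True


# Hypothetical data: two known points, then one incoming point
model = NearestNeighbourModel([((1.0, 1.0), "small"), ((9.0, 9.0), "big")])
first_guess = model.predict((4.0, 4.0))                  # nearest known point is (1, 1)
was_updated = model.learn_if_wrong((4.0, 4.0), "big")    # the guess was wrong, so learn
second_guess = model.predict((4.0, 4.0))                 # the new point is now known
```

Here the model is 'rebuilt' simply by remembering one more data point; a real model would typically be refitted or updated incrementally instead.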
Now consider this in relation to the man stepping into the river in Heraclitus’ quote. Suppose ‘John’ is an angler. A river runs near his country house. He steps into the river. He can see the fish swimming around him (we can assume he has sufficient equipment and the water is clear enough to give him a good view of the fish as he stands in the river). Now let's say John is interested in sampling what kinds of fish are available in the river, their species and sizes, as he is preparing to invite his friend to go fishing a few months later.
On the first day he spots two types of fish, both of the smaller variety, say fish A (short for fish of type A) and fish B, and notes that down. The next day, when he steps into the river, he already has a summary of the previous day's findings in his diary. Now he spots a bigger variety of fish A and a new fish, type C, which is rather large (along with some small B fish). So, after the second day, his summary of the fish list looks like: fish A small, fish A big, fish B small, and fish C big.
Thus, little by little, over a matter of a few months, he builds up a rather big list of more than a hundred species of fish, some big, some small, some available in both big and small varieties. When he first stepped into the river, he had no idea of the kinds of fish available in it. But now he is not the same man; he is much better informed. This process of information enrichment is somewhat akin to maturing a machine learning model, as it were, where each data point in the training dataset (in a sense like a new day's experience) can refine the model slightly more. The model, for better or worse, has 'seen' (in a manner of speaking) a new data point and has become more informed (and possibly more refined) for it.
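As a small illustration, John's diary over the first two days can be mimicked with a set that grows with each day's sightings. The variable names are ours, chosen for this sketch only; the sightings are the ones from the story.

```python
# The diary starts empty: John knows nothing about the river yet
diary = set()

daily_sightings = [
    [("A", "small"), ("B", "small")],               # day 1
    [("A", "big"), ("C", "big"), ("B", "small")],   # day 2
]

for sightings in daily_sightings:
    # A repeated sighting (small B fish on day 2) adds nothing new
    diary.update(sightings)

summary = sorted(diary)
```

After the loop, `summary` holds the same four entries as John's list: A big, A small, B small, and C big.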
John decides to invite his friend Jake for a day of fishing. "Listen mate, this is a great river, with plenty of fish to catch. Are you game?" John is exuberant over the phone. Why shouldn't he be? He has built up a great model of the river (its utility in terms of fishing) in his diary (and in his mind). Jake is excited too, although he has never stepped into the river thus far.
They collect the appropriate equipment and go to the riverside. A lot of water has passed since John last stepped in.
Jake is too thrilled to wait, so he steps in first. And the first thing he notices is a full-grown crocodile.
The crocodile gives him a good chase, and both men run for their lives, leaving their fishing gear behind.
The nice model of the river that John had built by stepping in so many times in the past has been completely skewed by one single experience. He decides never to step back into the river again. In a machine learning context, we would like to call such experiences (i.e. data points) 'Heavy Outliers'.
A 'Heavy Outlier' is different from a normal outlier, which merely lies beyond the usual broad area of the data. The keyword here is 'Heavy': it not only lies outside the ordinary zone, but also carries such tremendous weight compared to other data points that it alone can completely change a model carefully built over many normal training instances. It is a shattering experience, as it were, although a heavy outlier can also occur in a positive sense, like someone winning a huge sum in a Lotto, which may drastically change the model of his household budgeting.
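One way to picture the 'heaviness' is to attach an impact weight to every experience and take a weighted average. The sketch below is only an illustration: the scores (+1 for a pleasant fishing day, -1 for the crocodile) and the weights (1 versus 1000) are numbers we have assumed, not measurements.

```python
def weighted_mean(points):
    """points: list of (value, weight) pairs."""
    total_weight = sum(weight for _, weight in points)
    return sum(value * weight for value, weight in points) / total_weight


# Assumed for illustration: 100 pleasant fishing days, each with weight 1
fishing_days = [(1.0, 1.0)] * 100
# Assumed for illustration: one crocodile encounter with weight 1000
crocodile = (-1.0, 1000.0)

before = weighted_mean(fishing_days)
after = weighted_mean(fishing_days + [crocodile])

# The same data with the weights flattened to 1, as for a normal outlier
ignoring_weight = weighted_mean([(value, 1.0) for value, _ in fishing_days + [crocodile]])
```

With equal weights the single bad day barely moves the average, whereas with its assumed weight it flips the sign of the whole model. That is the difference between a normal outlier and a heavy one.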
Consider this in another context (this too is a fictional account). Suppose a retailer is trying to model its sales of mineral water beverages across shops in different suburbs, across different brands and flavors of beverage, across different times of the year (close to Christmas, for instance, people may tend to buy more of certain items in anticipation of parties and festivities), and so on. The model may serve both to set the restocking frequency and to inform pricing for those items.
The retailer also tries to correlate the sales pattern with the discounts offered, i.e. by what percentage the quantity of a drink sold goes up for a given percentage discount.
But one day in one city, across multiple suburbs, sales skyrocketed, even though no discount was offered on any of those items that week. Almost all shops ran out of stock of those items in a matter of hours. What could be the reason behind the increase in sales? That day the temperature broke the record for the last 43 years. One suburb had its hottest day on record. Although this is perhaps not too heavy an outlier, the retailer should have considered the day's temperature forecast as a parameter to be fed into the model (especially for such items) for the purpose of continually updating it.
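To make the point about temperature concrete, here is a minimal sketch that compares a forecast that ignores temperature with one that uses it as a parameter. All the numbers are invented for illustration, and the straight-line fit (ordinary least squares) is our assumption, not a claim about how any real retailer models sales.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope


# Invented numbers: daily maximum temperature (degrees Celsius) and
# bottles sold in one shop on days without any discount
temperatures = [20, 22, 25, 28, 30]
bottles_sold = [100, 110, 125, 140, 150]

intercept, slope = fit_line(temperatures, bottles_sold)

forecast_temperature = 45  # hypothetical record-breaking forecast

# A model that ignores temperature can only offer the average day
temperature_blind = sum(bottles_sold) / len(bottles_sold)
# A model that takes the forecast as a parameter scales with the heat
temperature_aware = intercept + slope * forecast_temperature
```

The temperature-aware estimate comes out far above the average day, which is the direction the story needs. Extrapolating a straight line so far beyond the observed temperatures is itself a strong assumption, so the exact figure should not be trusted.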
So, what do we do with 'Heavy Outliers'? In normal modelling, if the outliers are few and far between, and they do not have significantly more impact than the normal data points, we may consider deleting them to make our model 'better' at predicting cases that are not outliers. However, if an outlier is hugely 'heavier', i.e. it has tremendous impact compared to the normal data points (think of seeing a fish versus being chased by a crocodile), then in our view such an outlier must be considered. Otherwise, if things go wrong, they will go wrong in a big way.
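The rule of thumb above (drop the light outliers, keep the heavy ones) might be sketched as follows. The function name `split_outliers`, both thresholds, and the sample points are assumptions made for this illustration; the median absolute deviation is simply one common way to decide what counts as 'far from the rest'.

```python
from statistics import median


def split_outliers(points, spread_factor=3.0, heavy_factor=10.0):
    """points: list of (value, impact) pairs. Returns (kept, dropped)."""
    values = [value for value, _ in points]
    centre = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect
    mad = median(abs(value - centre) for value in values) or 1.0
    typical_impact = median(impact for _, impact in points)

    kept, dropped = [], []
    for value, impact in points:
        is_outlier = abs(value - centre) > spread_factor * mad
        is_heavy = impact > heavy_factor * typical_impact
        if is_outlier and not is_heavy:
            dropped.append((value, impact))   # light outlier: safe to ignore
        else:
            kept.append((value, impact))      # normal point or heavy outlier
    return kept, dropped


# Hypothetical observations as (value, impact) pairs
observations = [(10, 1), (11, 1), (12, 1), (11, 1), (10, 1), (40, 1), (-50, 500)]
kept, dropped = split_outliers(observations)
```

In this made-up sample the point (40, 1) is dropped as an ordinary outlier, while (-50, 500) stays in the data because its impact dwarfs that of the typical point.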
One proposition for doing so involves a mix of heuristics and statistical measures in the process of prediction; another is based on parameterization. So, to end on a cautionary note: next time you have occasion to step into a river, do keep the possibility of crocodiles in mind.
This article was authored by Malay Mandal, Senior Developer at Sandstone Technology.