This blog post is about something that in machine learning is called data leakage. Do not confuse it with data being leaked to the public. In machine learning, data leakage occurs when a model uses a feature for predicting the output that cannot be available at the time of prediction. In many cases, that feature holds information about the value we are trying to predict.
“Feature: a feature is an individual characteristic that is used for predicting the output value. Vibration, temperature, and sound are examples of three features.” neurospace
Data leakage can occur in three situations:
- Working with time series
- Choosing the right features for prediction
- Labeling our output values
Consider the following case: we wish to build a machine learning model for predicting order demand. We have data available on product category, manufacturing plant, and product code. Our output value is the order demand.
We have a set of data gathered over a given period. The data is gathered at a fixed frequency, e.g. once a day or once a week. When the date and hour become important for the prediction, the data is called a time series. Time series are ordered chronologically: the oldest date is the first observation, just like when you list your company's revenue over the past 10 years. When training a machine learning model, we need a larger sample of this historical data.
Time series problems are prone to data leakage if we are not careful when designing the machine learning model. When leakage is introduced in a time series, we allow our model to use not only past but also future order demands for predicting today's order demand. Can you see the problem? If our model learns to use tomorrow's order demand for predicting today's, it cannot generalize and be used to predict future order demands, because tomorrow's value does not exist yet when the prediction is made.
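One way to keep future values out of a time series model is to build every feature from earlier observations only, and to split the data chronologically instead of randomly. The sketch below illustrates the idea in plain Python. The demand values, the helper names make_lag_rows and chronological_split, and the choice of three lags are all assumptions made for illustration; they are not part of the case described above.

```python
from datetime import date, timedelta

# Hypothetical daily order demand; the numbers are made up for illustration.
start = date(2019, 8, 1)
demand = [10, 12, 11, 15, 14, 18, 17, 20, 22, 21]
series = [(start + timedelta(days=i), d) for i, d in enumerate(demand)]


def make_lag_rows(series, n_lags=3):
    """Build (day, features, target) rows where the features are only the
    n_lags demands that come strictly before the target day."""
    rows = []
    for i in range(n_lags, len(series)):
        day, target = series[i]
        features = [d for _, d in series[i - n_lags:i]]  # past values only
        rows.append((day, features, target))
    return rows


def chronological_split(rows, train_fraction=0.7):
    """Split without shuffling, so every training row is older than
    every test row."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


rows = make_lag_rows(series)
train, test = chronological_split(rows)
```

A random shuffle before splitting would place later days in the training set and earlier days in the test set, which is exactly the situation where the model gets to peek at the future.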
“Generalization: Generalizability describes whether the predictions from a machine learning model can be applied and used in production.” neurospace
The table below is an example of a company's order demand. If you only have the following data available, what is your best guess for the order demand on the 29th of August 2019? As the pattern shows an increase of 5 products daily, a good estimate is 50 products.
Choosing the Right Features
We are going to build a machine learning model for predicting whether people have Mycoplasma pneumoniae. The general symptoms are sore throat, dry cough, headache, runny nose, and a slight fever. As mentioned in the previous case, we train our machine learning models on historical data. Our output value says whether the given symptoms are caused by Mycoplasma pneumoniae (yes/no). As an input feature, we add whether people were issued prescriptions for antibiotics (yes/no).
The problem is that people who are not sick with a specific set of bacterial infections do not get antibiotics. Getting a prescription for antibiotics is something that naturally happens after the prediction of our machine learning model, so using it as a feature introduces data leakage.
Extracting the right features for a machine learning model is important. You need to be aware of whether the given features are correlated with the output value, and make sure that they do not hold information about the output that we naturally cannot have available at the time of prediction.
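A simple habit that helps here is to write down, for every candidate feature, whether its value is actually known at the moment the model has to predict, and to filter on that before training. The sketch below assumes a hand-maintained catalogue; the dictionary FEATURES, the feature names, and the helpers usable_features and strip_leaky_features are hypothetical and only mirror the symptoms example above.

```python
# Hypothetical feature catalogue: True means the value is known at the
# moment of prediction, False means it only exists afterwards.
FEATURES = {
    "sore_throat": True,
    "dry_cough": True,
    "headache": True,
    "runny_nose": True,
    "slight_fever": True,
    "antibiotics_prescribed": False,  # issued after the diagnosis
}


def usable_features(catalogue):
    """Return the names of features that are known at prediction time."""
    return [name for name, available in catalogue.items() if available]


def strip_leaky_features(record, catalogue):
    """Drop every field from a record that is not marked as available."""
    allowed = set(usable_features(catalogue))
    return {key: value for key, value in record.items() if key in allowed}


# Made-up patient record for illustration.
patient = {
    "sore_throat": 1,
    "dry_cough": 1,
    "headache": 0,
    "runny_nose": 1,
    "slight_fever": 1,
    "antibiotics_prescribed": 1,
}
clean_patient = strip_leaky_features(patient, FEATURES)
```

The catalogue cannot be derived from the data itself; deciding what is known at prediction time takes domain knowledge about how and when each value is recorded.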
Labeling Output Value
Finally, we can introduce data leakage when labeling a dataset that is structured as a time series.
An example of data leakage caused by bad labeling is when we wish to detect abnormal values. If we also look at future values in the time series when determining whether a given observation should be labeled as a normal value or an outlier, the labels contain information that is not available at the time of prediction.
An example is visualized in the table below. We see an increase up until the 40,000, with a decrease in values afterwards. Now comes the question for you to decide: “Is the 40,000 an outlier? Or would 27,000 be evaluated as an outlier?” These are tough questions, and if we label this wrong, we will also pass this mistake on to our model when it is learning to detect the patterns.
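One possible way to keep future values out of the labels is to judge each observation only against the values that came before it. The sketch below labels a value as an outlier when it lies more than a chosen number of standard deviations away from the mean of a trailing window. The function name label_outliers_trailing, the window of 5, the threshold of 3, and the sample values are assumptions for illustration, not the labeling rule used in the example above.

```python
from statistics import mean, stdev


def label_outliers_trailing(values, window=5, threshold=3.0):
    """Label each value using only the `window` values before it.

    Returns a list of booleans, True meaning "outlier". Values without a
    full window of history are labeled False.
    """
    labels = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]  # strictly earlier values
        if len(history) < window:
            labels.append(False)  # not enough history to judge
            continue
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            labels.append(value != mu)
        else:
            labels.append(abs(value - mu) > threshold * sigma)
    return labels


# Made-up measurements with one obvious spike.
values = [100, 102, 98, 101, 99, 100, 180, 101, 99]
labels = label_outliers_trailing(values)
```

Because each label depends only on earlier values, labeling a shorter prefix of the series gives the same labels for that prefix. A centered window, or a mean computed over the whole series, would not have this property.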
The Pitfalls of Data Leakage
When data leakage is introduced to a machine learning model, you will get a high train and test accuracy, implying that the model is good enough for production. It will appear to neither underfit nor overfit. However, when the model is deployed in production, it will no longer receive one of its features, because that feature is not available when you need the model's predictions. The missing feature might even be the most important one for determining the right class: the leaked data.
Once the machine learning model is in production, you will see that the predictions are not reliable, forcing you to stop using it. This is why it is important to make sure that you do not introduce data leakage in your dataset. The model should only learn patterns for predicting a given outcome based on what is available at the time of prediction.
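The effect can be shown with a deliberately tiny example. The sketch below uses made-up records and a toy "model" that simply picks the single feature that best matches the label on the training rows. The data, the column layout, and the helpers accuracy and best_feature are all assumptions for illustration; a real model would be more complex, but the pattern is the same.

```python
# Made-up records: (dry_cough, antibiotics_prescribed, has_mycoplasma).
# The prescription column mirrors the label because it is issued after
# the diagnosis.
train = [(1, 1, 1), (1, 0, 0), (0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 0, 0)]
test = [(1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 1)]


def accuracy(rows, feature_index):
    """Share of rows where predicting label = feature value is correct."""
    hits = sum(1 for row in rows if row[feature_index] == row[-1])
    return hits / len(rows)


def best_feature(rows, candidate_indices):
    """Pick the single feature that matches the label best on the rows."""
    return max(candidate_indices, key=lambda i: accuracy(rows, i))


# With the leaky column allowed, the toy model latches onto it and the
# test accuracy looks perfect.
leaky_choice = best_feature(train, [0, 1])
leaky_test_accuracy = accuracy(test, leaky_choice)

# Restricting the candidates to what is known at prediction time gives
# a more honest estimate of production performance.
honest_choice = best_feature(train, [0])
honest_test_accuracy = accuracy(test, honest_choice)

print(leaky_test_accuracy, honest_test_accuracy)
```

The test set does not expose the problem, because it contains the leaked column too. Only evaluating with the features that will exist in production reveals how the model is likely to behave there.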
Data leakage is the introduction of a feature that cannot be available at the time of prediction. It often contains the information you are trying to predict. Data leakage can therefore suggest that a machine learning model is good enough for production when, in reality, it has learned patterns from information that is not available at the time we need it to predict an output. Data leakage can occur when working with time series, when selecting features, and when labeling a dataset for training the machine learning model. It is important that the machine learning model is only introduced to information that can be available at the time of prediction.
// Maria Jensen, Machine Learning Engineer @ neurospace