In this blog post, we discuss why you should stop focusing on Big Data, and especially on the "big" part, which is often interpreted as the number of observations in a dataset.
We will argue that the sheer size of your data matters far less than its representativeness.
“Representativeness is when all possible outcomes that may occur are represented in your dataset”
Whether a dataset is representative depends on the challenge you are trying to solve. If you want to identify something that almost never happens, it will take much longer to obtain a representative dataset than it would for something you can observe every day. Let’s look at some examples to explain this.
Estimate whether a tumor is benign or malignant
In one example, the goal was to predict whether a tumor is benign or malignant. For this, we had a dataset with only 569 observations: 357 showing benign tumors and 212 showing malignant tumors. A correlation analysis and descriptive statistics showed that the two groups had significantly different patterns, which is why 569 observations were sufficient to obtain a machine learning model with a test accuracy of 99.56% and an Area Under the Curve (AUC) score of 96.43%.
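To make "significantly different patterns" a little more concrete, here is a minimal sketch of one descriptive check: the standardized difference between the two group means for a single feature (Cohen's d). The function name and the numbers are hypothetical and only for illustration. They are not values from our dataset, and this is not necessarily the analysis we ran.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Difference between two group means, in units of the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical measurements of one feature (illustrative only, not real data).
benign = [11.2, 12.1, 12.8, 11.9, 13.0, 12.4]
malignant = [17.5, 18.9, 16.8, 19.4, 17.9, 18.2]

d = cohens_d(malignant, benign)
print(f"Cohen's d = {d:.2f}")
```

The further the value is from zero, the less the two groups overlap on that feature, and the fewer observations a model typically needs to tell them apart.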
What about the number of features we use?
Within the Big Data and Machine Learning community, it is often said that models get better with data from multiple sources. This may be true in some cases, but it is not a universal truth. So if you hear this, ask: “How do we know?” In the previous case, we reduced the number of features from 30 to 20 and improved the accuracy of our model. This shows the importance of choosing the right features for the model as opposed to just giving the model more data. It is better to go with 10 features that correlate strongly with what you want to predict than with 50 that correlate weakly.
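One simple way to act on this is to rank the features by how strongly they correlate with the target and keep only the top ones. The sketch below does that with the Pearson correlation on toy data. The feature names, the values and the helper `top_k_features` are hypothetical, and correlation ranking is just one of several possible selection methods, not necessarily the one used in the case above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (spread_x * spread_y)


def top_k_features(columns, labels, k):
    """Return the k feature names with the highest absolute correlation to the labels."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: 0 = benign, 1 = malignant.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "radius": [11.0, 12.0, 12.5, 17.0, 18.0, 19.0],   # follows the label closely
    "texture": [14.0, 20.0, 16.0, 19.0, 15.0, 21.0],  # weak relation to the label
    "sensor_id": [3.0, 1.0, 2.0, 2.0, 3.0, 1.0],      # unrelated to the label
}

selected = top_k_features(columns, labels, k=2)
print(selected)
```

Note that this only measures each feature against the target on its own. It does not detect features that are redundant with each other, so treat it as a first filter rather than a complete selection procedure.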
Estimate the Remaining Useful Life
When a company wants to estimate the remaining useful life of a piece of equipment, the dataset is representative when it contains enough data to show the pattern of breakdowns. In many cases, if data is retrieved from the right sources, 7-10 breakdowns are sufficient to start gaining value while collecting more data. We have seen this when estimating the remaining useful life of a water pump and of a turbofan engine. As more breakdowns occur, the accuracy of the model improves, but the improvements will stagnate at some point.
Predict credit card fraud
In another challenge we faced, we had to predict credit card fraud by estimating the probability that a transaction is normal or abnormal. The dataset consisted of 284,807 observations, of which only 0.172% were credit card fraud. Although the dataset only had 492 fraud observations, their patterns were significantly different from those of normal transactions, making it possible to build a model with a test accuracy of 99% and a ROC AUC score of 98.5%.
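With classes this unbalanced, it helps to know what accuracy a model would reach by doing nothing useful at all. The short calculation below uses only the two counts given above and computes the accuracy of a hypothetical baseline that labels every transaction as normal. The variable names are our own for this illustration.

```python
total = 284_807  # observations in the dataset, as stated above
fraud = 492      # fraud observations, as stated above

fraud_rate = fraud / total

# Hypothetical baseline: label every transaction as "normal".
# It is right on every normal transaction and wrong on every fraud.
baseline_accuracy = (total - fraud) / total
baseline_fraud_caught = 0  # by construction, no fraud is ever flagged

print(f"fraud rate: {fraud_rate:.4%}")
print(f"accuracy of always predicting 'normal': {baseline_accuracy:.2%}")
print(f"fraud cases caught by that baseline: {baseline_fraud_caught} of {fraud}")
```

Because such a baseline already scores very high on accuracy without catching a single fraud, the ROC AUC score is the more informative of the two figures for this kind of problem.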
Be aware of the following
When you work with data analysis in any form, both in statistics and machine learning, you need data that represents what you wish to predict. If your challenge is to estimate the remaining useful life of a machine, you must have experienced breakdowns while collecting data. If your data is too noisy for real patterns to be identified, you need more data, probably from other sources and at a different frequency. The moment your data is representative, you have a sufficient dataset to gain value.
Finally, be sure to identify the three to ten main data sources for your problem. It is better to have 4 sources that correlate strongly with your problem than 50 that correlate weakly.
We have given you three examples of challenges where we did not have much data to work with but still obtained a result that provides value. Representativeness is defined by your challenge, which in turn affects how much data is needed for patterns to be separated from each other. Many people aim for perfection when doing machine learning and consider a model that is not 100% correct to be not good enough. But why let the perfect stand in the way of getting better tomorrow? A machine learning model that is 10% better than what you have today is still 10% better. Furthermore, as you collect more data, the predictions of the machine learning model will only improve.
// Maria Jensen, Machine Learning Engineer @ neurospace