We have talked about it for years now. Consultancies have long been telling us that we need it. Maybe you are even aiming for it in your strategy? But what exactly is Big Data, and how did we come to talk so much about it? Before reading further, try answering the following question - no cheating: How many V's define Big Data?
Big Data originally started with three V's, as described in "Big Data, Right Data". Then there were five, and then ten. A research paper published in May 2019 suggests that Big Data contains 51 V's (Khan et al., 2019). We don't know about you, but who can really remember 10, let alone 51, V's? Maybe this is why most people focus on one specific V: Volume. It seems to have become the very definition of Big Data.
Some companies collect large amounts of data without knowing why they are collecting it. The statement “We might need it someday” is very common. On the road, we have met companies who say that they cannot start getting value from their data because their datasets are not large enough. Our response is always the same: when will your datasets be large enough? If we only look at size when deciding whether our data is sufficient, then who determines when the dataset is big enough, or rather, what number of bytes counts as big in Big Data?
Big Data is not Right Data
This is why we are sincerely asking you to stop thinking about Big Data, and definitely to stop thinking in terabytes or petabytes. Big Data should not be a strategic objective, and volume should not be the main driver when you assess the quality of your data. Instead, we ask you to think about Right Data. Right Data is about being strategic about what data you collect and having a purpose for collecting it. You start by looking at your strategy and the strategic goals you have. Based on those objectives, you determine what data to collect by following these simple steps.
Start by looking at what value you are searching for, and do not fall into the trap of “we might need the data someday” without having a purpose for collecting it. There has to be a business case: money saved, money earned, better performance, or better safety. If your objective is to reduce unplanned downtime, you could approach it in different ways: preventive maintenance, condition-based maintenance, or predictive maintenance. At this stage, you must determine which of these maintenance approaches fits your need, because they demand different types of data at different frequencies. Simply choosing the cheapest and fastest method is not a good solution. Instead, analyze the costs of maintenance, the costs of breakdowns, and the costs of spare parts. In that analysis you will find your business case and learn which maintenance approach suits your company.
When we have determined the value, it is time to determine what data to collect. This is where correlation becomes important. We need data that can say something about the outcome we wish to predict. Collecting data from 100 different sources is not necessarily better than collecting from just a handful: 10 features that are highly correlated with the output are better than 40 with very low correlation. This is not always straightforward for our human minds, so you will likely need to run some experiments to figure out what is important and what is not. One such experiment could be: collect a small amount of everything you think is important, and then check how each feature correlates with the outcome. This lets the data, rather than intuition, tell you what is important. We cannot stress enough the importance of being selective about what data to collect from the beginning. It is a shame to have four years of data that are worth nothing because one or two important features were never collected or the frequency of the data is wrong.
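The experiment described above can be sketched in a few lines of Python. This is only an illustration: the feature names, the pilot values, and the `rank_features` helper are hypothetical and not taken from a real project, and Pearson correlation only captures linear relationships, so treat the ranking as a first screening rather than a final verdict.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_x * var_y)

def rank_features(features, target):
    """Return (name, correlation) pairs, strongest absolute correlation first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical pilot data: a few readings per candidate feature.
pilot = {
    "vibration_mm_s": [1.0, 1.2, 1.1, 2.5, 2.8, 3.0],
    "ambient_temp_c": [20.0, 22.0, 19.0, 21.0, 20.0, 22.0],
    "motor_current_a": [5.0, 5.1, 5.0, 6.2, 6.5, 6.4],
}
# Hypothetical outcome: 1 if the machine failed within the following week.
failure_within_week = [0, 0, 0, 1, 1, 1]

ranking = rank_features(pilot, failure_within_week)
for name, r in ranking:
    print(f"{name}: {r:+.2f}")
```

In this made-up pilot, the ambient temperature ends up at the bottom of the ranking, which would be a hint that it may not be worth the cost of collecting it at high frequency.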
Speed and Frequency
The speed at which data should be retrieved is determined by the value you wish to create and the business case you discovered in the beginning. The reason you should spend time on determining the value you wish to obtain is this: even if you retrieve the right data for the right problem, it can be a huge waste of time, money, and energy if the frequency is too low for the data to be usable. Make sure you invest in the right sensors for your challenge, and be ambitious. It is easy to downsample data from once per minute to once per hour, but nearly impossible to go the other way.
In Right Data, we focus on whether your dataset is representative. A representative dataset is one in which all possible outcomes are represented. If your company is interested in predictive maintenance, a representative dataset means that you have observed 5-10 breakdowns in your data. You should additionally make sure that your data is unbiased and well documented. If you wish to predict the remaining useful life of an item, make sure you keep a well-documented maintenance journal: at what date and time did you perform maintenance, and what was the reason for maintaining the item? This is necessary in almost all cases for validating the new approach.
Size is the least important variable, as it follows from whether the data is representative. When you have a representative dataset, you have enough data to start gaining value from it. You do not need Big Data or large volumes. We have shown this several times in our previous blog posts and with our customers. In a dataset where we aimed to detect whether a tumor is benign or malignant, we had 569 observations and got train and test accuracies of 98.69% and 96.23%. In another example, we retrieved data from a water pump and experienced 7 breakdowns within the first five to six months, which gave us a representative dataset.
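To illustrate why the high-to-low direction is the easy one, here is a minimal sketch using only the Python standard library. The `downsample_hourly` helper and the synthetic readings are assumptions made for this example. Averaging is just one possible way to aggregate; depending on the problem, a maximum or a last value may be more appropriate.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def downsample_hourly(readings):
    """Average (timestamp, value) readings into one value per hour."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}

# Synthetic example: one reading per minute for two hours.
start = datetime(2019, 6, 1, 8, 0)
readings = [(start + timedelta(minutes=i), float(i)) for i in range(120)]

hourly = downsample_hourly(readings)
for hour, mean_value in hourly.items():
    print(hour.isoformat(), mean_value)
```

Going from the two hourly averages back to the 120 original readings is not possible, because the variation within each hour was discarded when the values were averaged.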
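A maintenance journal does not need to be complicated to be useful. The sketch below assumes a hypothetical journal stored as (timestamp, reason) entries and checks whether it contains at least five breakdowns, the lower end of the 5-10 range mentioned above. The entries, the reason labels, and the helper names are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical maintenance journal: when maintenance happened, and why.
journal = [
    (datetime(2019, 1, 14, 9, 30), "breakdown"),
    (datetime(2019, 2, 2, 13, 0), "scheduled"),
    (datetime(2019, 2, 20, 7, 45), "breakdown"),
    (datetime(2019, 3, 11, 16, 10), "breakdown"),
    (datetime(2019, 4, 5, 10, 0), "scheduled"),
    (datetime(2019, 4, 28, 22, 15), "breakdown"),
    (datetime(2019, 5, 19, 11, 5), "breakdown"),
]

def count_reasons(entries):
    """Count how often each maintenance reason occurs in the journal."""
    return Counter(reason for _, reason in entries)

def has_enough_breakdowns(entries, minimum=5):
    """Rough check: have we observed at least `minimum` breakdowns?"""
    return count_reasons(entries)["breakdown"] >= minimum

print(count_reasons(journal))
print(has_enough_breakdowns(journal))
```

A count like this is only a first check. Whether the dataset is truly representative also depends on whether the observed breakdowns cover the different failure types you care about.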
neurospace's AI Camp
If you are interested in learning more about the concepts of Big Data vs Right Data, and how this could work in your company, the AI Camp will be a good place to start your data journey.
// Maria Jensen, Machine Learning Engineer @ neurospace
Khan et al. (2019). The 51 V's of Big Data: Survey, Technologies, Characteristics, Opportunities, Issues and Challenges. COINS '19: Proceedings of the International Conference on Omni-Layer Intelligent Systems, pp. 19-24.