“Reduce energy consumption in your household, and you will live longer”.
This could be a headline in your favorite tabloid. People have a tendency to see a correlation between two values and immediately conclude that there is causality. But would you really live longer without energy in your household?
Correlation is key to any good statistical analysis and to solving problems with machine learning.
A correlation between two values, like household energy consumption and average life-span, does not necessarily imply that there is causation.
But what does it mean that two values are correlated, and why is it important in machine learning? This blog post aims to give you an understanding of what correlation and causation are.
First, we need to understand what correlation and causation mean. One way of doing this is through definitions:
A correlation describes a mutual relationship between two or more values.
A causation describes a relationship of cause and effect between values. In other words, you can explain one outcome based on another.
Roughly put, a correlation answers the question: “How much of the change in X can be seen in Y as well?”
If one value increases by a factor of 2, a correlation implies that we can see a corresponding change in the other value. Correlation is expressed as a value between -1 and +1, where:
| Correlation (r) | Interpretation |
|---|---|
| -1.0 | A perfect downhill (negative) linear relationship |
| -0.7 | A strong negative linear relationship |
| -0.5 | A moderate negative linear relationship |
| -0.3 | A weak negative linear relationship |
| 0.0 | No linear relationship |
| +0.3 | A weak positive linear relationship |
| +0.5 | A moderate positive linear relationship |
| +0.7 | A strong positive linear relationship |
| +1.0 | A perfect uphill (positive) linear relationship |
A correlation can be positive or negative, and the sign defines how the values are related. If the correlation is positive, x and y tend to increase or decrease together. If the correlation is negative, x and y tend to move in opposite directions: when one increases, the other decreases.
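To make the sign of a correlation concrete, here is a minimal sketch of how r can be computed. It assumes that the r in the table above is the Pearson correlation coefficient, which this post does not state explicitly. The function name `pearson_r` and the toy numbers are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of deviations from the mean (covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variance numerators)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up toy values, only to show the sign of r
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # y rises with x: positive r
print(pearson_r(x, [10, 8, 6, 4, 2]))  # y falls as x rises: negative r
```

Both toy series are perfectly linear, so they land at the two ends of the scale. Real data will normally fall somewhere in between.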
The connection between two variables means that one can be used to predict the other. If you know one value, you can say something about whether the other value increases or decreases. Correlations are key to solving problems with statistics as well as machine learning, because they are used to express and view which values are important for the problem we are trying to solve (e.g. predicting future machinery failures).
A correlation has a significance level as well, and forgetting to check it is a common pitfall. You can have a strong correlation between two values that is not significant. In that case, there is no evidence that the correlation we have identified is anything more than chance.
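One way to get a feeling for significance is a permutation test: shuffle one variable many times and count how often chance alone gives a correlation at least as strong as the one observed. The sketch below is one possible approach and not necessarily how the numbers in this post were calculated. The names `permutation_p_value`, `n_permutations` and `seed`, as well as the toy data, are assumptions made for the example.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient (assumes non-constant inputs)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(xs, ys, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for r by shuffling ys."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)

# Made-up toy data: y follows x exactly, so chance rarely matches it
x = list(range(10))
y = [2 * v + 1 for v in x]
print(permutation_p_value(x, y))
```

A small value means that random shuffles almost never reproduce the observed relationship. A value close to 1 means that the observed correlation is no stronger than what shuffled data typically gives.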
Back to where we started: “Reduce energy consumption in your household, and you will live longer”, according to numbers from Statistics Denmark. But is this really true?
If we look at the data available from Statistics Denmark for 2008-2017, there is a strong negative correlation (r = -0.83) between average life-span and household energy consumption. The correlation is also significant, with a p-value of 0.00317, where lower is better.
[Figure: average life span (in years) and total energy consumption in households (in GJ)]
According to our initial analysis, the two variables, average life-span and energy consumption in households, are strongly negatively correlated. However, there are some common pitfalls we need to consider before we conclude that there is causality. Moreover, we need to validate if we can trust this result.
1) The size of the dataset
Based on what is available from Statistics Denmark, we have a small dataset consisting of only 9 observations. The correlation seen here, though significant, might be a coincidence, because we do not have a large enough dataset.
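To illustrate how much uncertainty a small dataset leaves, the sketch below computes an approximate 95% confidence interval for r using the Fisher z-transform. This is a textbook approximation that assumes independent, roughly normally distributed observations, which yearly figures may well violate, so treat it as an illustration only. The function name and the comparison with 100 observations are hypothetical; only r = -0.83 and n = 9 come from the analysis above.

```python
import math

def fisher_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r based on n pairs.

    z_crit=1.96 corresponds to a 95% interval under a normal approximation.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)             # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)   # standard error on the z scale
    # Transform the interval limits back to the correlation scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

for n in (9, 100):  # 100 is a hypothetical larger dataset for comparison
    low, high = fisher_confidence_interval(-0.83, n)
    print(f"n={n}: r is plausibly between {low:.2f} and {high:.2f}")
```

With 9 observations the interval is wide, roughly from -0.96 to -0.37, so the data is compatible with anything from a fairly weak to a near-perfect negative relationship. With more observations the interval narrows.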
2) Are they connected in the real world?
How can it be true that the energy consumption in households is correlated with our average life-span? It most probably is not. A correlation says how values are mutually connected based on the tendency: if one value increases, does the other value increase or decrease?
So what we need are correlations that are connected to the problem we wish to solve. In this case: what impacts our average life-span? As can be seen in the graph, there is an anomaly in the third value. If there were a strong connection, this anomaly in energy consumption should be seen in the average life-span as well. However, it is not, which indicates that energy consumption in the household might not be the best way to predict the average life-span.
Many believe that correlation equals causation, hence the title of this blog post. However, it is not necessarily the truth. Causation answers if we can say something about cause and effect: “How much of the change in X can be explained by Y?”. In our case: How much can the change in household energy consumption (cause) explain the average life-span (effect)?
Causation is difficult to measure, and establishing it requires large datasets and thorough analysis. One way is to use an additive noise model.
Causation aims to identify the cause-and-effect relationship. One approach is to look at whether abnormal observations in one variable can be seen in the other variable as well. We observe a sharp increase in household energy consumption in 2010, while the average life-span also increases slightly. However, according to our correlation analysis, the average life-span should have decreased significantly that year as a consequence of the increased household energy consumption.
So do you really live longer by reducing the energy consumption in your household? No. There is a correlation between the two variables, but there is no causation. You cannot read the energy consumption in each household in your neighborhood and calculate the average life-span in your area. Additionally, the dataset gathered is too small. If we had a larger dataset, we might see that this correlation becomes insignificant.
How we use correlation and causation in Machine Learning
To be able to predict a given outcome, we need input data that in some way can explain this outcome.
If we are making a forecasting model for predicting the demand for ice cream, weather is considered an important parameter: when it is warm, we crave more ice cream and cold things.
When we predict the sales of Christmas trees, seasonality might be an important factor, because we use more Christmas trees during December than the rest of the year.
Predictive maintenance on rotating equipment can be used to predict breakdowns in the near future. To be able to do this, we need to retrieve relevant data that can say something about how the equipment is operating. We know that vibrations, temperature, and sound can indicate this.
When the machine starts to vibrate more than usual, there could be something wrong. Why? Because it is not normal operating behavior, which makes vibrations an important measurement for predicting our target outcome.
The same goes for temperature. If the machine is overheating, we know something is wrong, and so on.
Do not blindly trust a correlation. A correlation simply states whether there is a mutual upward or downward relationship between two values. A correlation must be significant before it is possible to say anything about the relationship, let alone about causation. Correlation does not imply that one value causes the other, which is why you need to check for causation! Do not automatically assume that correlation is causation.
// Maria Jensen, ML Engineer @ neurospace