Correlation measures the numerical relationship between two variables.
However, a high correlation coefficient (close to 1) does not necessarily mean that one variable directly causes the other.
For example:
During the summer, ice cream sales at the beach rise, and drowning accidents also increase.
Does this mean that higher ice cream sales directly cause more drowning accidents?
We have created a fictional dataset for you to experiment with:
    import pandas as pd
    import matplotlib.pyplot as plt

    # Fictional data: both lists hold the same values
    Drowning_Accident = [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]
    Ice_Cream_Sale = [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]

    Drowning = {"Drowning_Accident": Drowning_Accident, "Ice_Cream_Sale": Ice_Cream_Sale}
    Drowning = pd.DataFrame(data=Drowning)

    # Scatter plot of ice cream sales against drowning accidents
    Drowning.plot(x="Ice_Cream_Sale", y="Drowning_Accident", kind="scatter")
    plt.show()

    # Correlation matrix of the two columns
    correlation_beach = Drowning.corr()
    print(correlation_beach)
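Output: the points in the scatter plot fall on a straight line, and the correlation matrix shows a coefficient of 1.0 between Ice_Cream_Sale and Drowning_Accident.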
In other words, can ice cream sales be used to predict drowning accidents?
The answer is: probably not.
The two variables most likely rise together without one causing the other: both simply go up during the summer.
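A strong correlation without causation can be reproduced with simulated data. The sketch below is only an illustration under stated assumptions: it invents a hypothetical `temperature` variable that drives both ice cream sales and drowning accidents, and the coefficients and noise levels are made up. Neither simulated variable influences the other, yet the two correlate strongly.

```python
import numpy as np

# Hypothetical simulation: all numbers below are invented for illustration.
rng = np.random.default_rng(0)
n = 2000

# Assumed hidden factor: daily temperature in degrees Celsius.
temperature = rng.uniform(15, 35, size=n)

# Both variables depend on temperature plus independent noise.
# Neither one appears in the formula of the other.
ice_cream_sale = 10 * temperature + rng.normal(0, 20, size=n)
drowning_accident = 2 * temperature + rng.normal(0, 4, size=n)

# Raw Pearson correlation between the two variables.
raw_r = np.corrcoef(ice_cream_sale, drowning_accident)[0, 1]

def residuals(values, factor):
    """Subtract the best straight-line fit of `values` on `factor`."""
    slope, intercept = np.polyfit(factor, values, 1)
    return values - (slope * factor + intercept)

# Correlation that remains after the temperature trend is removed from both.
partial_r = np.corrcoef(
    residuals(ice_cream_sale, temperature),
    residuals(drowning_accident, temperature),
)[0, 1]

print("raw correlation:", round(raw_r, 2))
print("after removing temperature:", round(partial_r, 2))
```

The raw coefficient comes out high even though the simulation contains no link between sales and accidents. Once the temperature trend is subtracted from each variable, the remaining correlation is close to zero.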
So, what actually causes drowning?
Now, let’s reverse the argument:
Does a low correlation coefficient (close to zero) mean that changes in x don’t affect y?
Back to the question:
Can we conclude that Average Pulse does not affect Calorie Burnage because of a low correlation coefficient?
The answer is no.
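A small made-up example shows why. In the sketch below, the hypothetical variables `x` and `y` stand in for any pair of measurements; `y` is computed directly from `x`, so `x` fully determines `y`, but the relationship is curved rather than a straight line. The correlation coefficient that pandas computes by default (Pearson) only captures linear relationships.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
x = np.arange(-10, 11)   # -10, -9, ..., 10
y = x ** 2               # y is completely determined by x

df = pd.DataFrame({"x": x, "y": y})

# Pearson correlation between the two columns (the pandas default).
pearson_r = df["x"].corr(df["y"])
print("correlation:", round(pearson_r, 3))
```

The coefficient is essentially zero, even though every change in `x` changes `y`. A low coefficient therefore rules out a linear relationship, not an effect.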
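This highlights an important distinction between correlation and causality: correlation measures how closely two variables move together, while causality is the conclusion that a change in one variable produces a change in the other.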
Tip: Always carefully consider the concept of causality when making predictions!