Stat Correlation vs. Causality

Correlation Does Not Imply Causality

Correlation measures the numerical relationship between two variables.
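
For reference, and assuming the coefficient meant here is the Pearson correlation coefficient (the default that pandas `DataFrame.corr()` computes), it can be written for $n$ paired observations $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$ as:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The value always lies between -1 and 1. It describes how closely the points follow a straight line, and nothing in the formula says which variable, if either, drives the other.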

However, a high correlation coefficient (close to 1) does not necessarily mean that one variable directly causes the other.

For example:
During the summer, ice cream sales at the beach rise, and drowning accidents also increase.
Does this mean that higher ice cream sales directly cause more drowning accidents?

The Beach Example in Python

We have created a fictional dataset for you to experiment with:

Example

import pandas as pd
import matplotlib.pyplot as plt

# Fictional data: both columns hold identical values
Drowning_Accident = [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]
Ice_Cream_Sale = [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]

# Build the DataFrame from the two lists (straight quotes, not typographic ones)
Drowning = pd.DataFrame({"Drowning_Accident": Drowning_Accident,
                         "Ice_Cream_Sale": Ice_Cream_Sale})

# Scatter plot of drowning accidents against ice cream sales
Drowning.plot(x="Ice_Cream_Sale", y="Drowning_Accident", kind="scatter")
plt.show()

# Pairwise correlation matrix of the two columns
correlation_beach = Drowning.corr()
print(correlation_beach)

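Output: a scatter plot with all points on a straight line, and a correlation matrix in which every entry is 1.0, because the two columns in the fictional dataset are identical.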

Correlation vs Causality – The Beach Example

In other words, can ice cream sales be used to predict drowning accidents?

The answer is: probably not.

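The correlation does not arise because one variable causes the other. More likely, both rise for a shared reason: in summer, more people are at the beach, so more ice cream is sold and more people go swimming.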
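
To see how two variables can be strongly correlated without either one causing the other, it may help to simulate a shared driver. The sketch below is only an illustration under made-up assumptions: `temperature` is a hypothetical third variable, and the coefficients and noise levels are invented, not taken from real data. Both outcomes are generated from temperature alone, and the helper `residuals` (also hypothetical) removes the straight-line effect of temperature before the correlation is computed again.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical shared driver: daily temperature in degrees Celsius (made-up values)
temperature = rng.uniform(15, 35, size=500)

# Both outcomes depend on temperature plus independent noise;
# neither one appears in the formula for the other
ice_cream_sale = 10 * temperature + rng.normal(0, 20, size=500)
drowning_accident = 2 * temperature + rng.normal(0, 4, size=500)

raw_corr = np.corrcoef(ice_cream_sale, drowning_accident)[0, 1]


def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Correlation between the parts that temperature does not explain
adjusted_corr = np.corrcoef(
    residuals(ice_cream_sale, temperature),
    residuals(drowning_accident, temperature),
)[0, 1]

print(f"correlation, raw: {raw_corr:.2f}")
print(f"correlation, temperature removed: {adjusted_corr:.2f}")
```

In this simulation the raw correlation is high, while the correlation that remains after accounting for temperature is close to zero, which is what one would expect when a third variable is behind both.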

So, what actually causes drowning?

Unskilled swimmers
Strong waves
Cramps
Seizure disorders
Lack of supervision
Alcohol misuse
And more.

Now, let’s reverse the argument:

Does a low correlation coefficient (close to zero) mean that changes in $x$ don’t affect $y$?

Back to the question:

Can we conclude that Average Pulse does not affect Calorie Burnage because of a low correlation coefficient?
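The answer is no. A correlation coefficient close to zero only tells us that there is little linear relationship; $x$ can still affect $y$ in a way that a straight line does not capture.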
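
A small sketch of why a low coefficient does not rule out an effect: below, `y` is computed entirely from `x`, yet the correlation is essentially zero because the relationship is not a straight line. The data is invented for illustration and is not the Average Pulse and Calorie Burnage data.

```python
import numpy as np

# Made-up data: x is symmetric around zero
x = np.linspace(-10, 10, 201)

# y is completely determined by x, but not through a straight line
y = x ** 2

# Pearson correlation between x and y
corr = np.corrcoef(x, y)[0, 1]
print(f"correlation between x and y: {corr:.3f}")
```

Here every change in `x` changes `y`, but the rising and falling halves of the curve cancel out in the coefficient.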

This highlights an important distinction between correlation and causality:

Correlation is a number that measures how closely two variables are related.
Causality is the conclusion that $x$ directly causes $y$.

Tip: Always carefully consider the concept of causality when making predictions!