Pandas Correlations

Finding Relationships

One of the most useful features of the Pandas module is the corr() method, which calculates the pairwise correlation between the columns of a DataFrame (the Pearson correlation coefficient by default).

The examples on this page use a CSV file named “data.csv”.

Example

Display the correlation between the columns.

df.corr()

Result

          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155408  0.009403  0.922721
Pulse    -0.155408  1.000000  0.786535  0.025120
Maxpulse  0.009403  0.786535  1.000000  0.203814
Calories  0.922721  0.025120  0.203814  1.000000

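Note: Correlation can only be calculated for numeric columns. Older pandas versions silently skipped non-numeric columns. Since pandas 2.0, corr() typically raises an error if the DataFrame contains text columns, unless you call df.corr(numeric_only=True).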
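
The following is a minimal sketch of computing the table when the data also contains a text column. It assumes pandas 1.5 or newer, where corr() accepts the numeric_only argument. The inline CSV is an invented stand-in for “data.csv”, and the Activity column is hypothetical.

```python
import io

import pandas as pd

# Invented stand-in for data.csv; the values and the "Activity"
# column are made up purely for illustration.
csv_text = """Duration,Pulse,Maxpulse,Calories,Activity
60,110,130,409.1,run
60,117,145,479.0,bike
45,109,175,282.4,run
30,103,135,240.0,walk
"""

df = pd.read_csv(io.StringIO(csv_text))

# numeric_only=True restricts the calculation to numeric columns,
# so the text column "Activity" is left out of the result.
corr = df.corr(numeric_only=True)
print(corr)
```

With the real file, io.StringIO(csv_text) would simply be replaced by the path 'data.csv'.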

Result Explained

The result of the corr() method is a table of numbers representing the strength of the relationship between two columns.

The values range from -1 to 1.

  • A value of 1 indicates a perfect positive correlation: whenever one value increases, the other increases as well.
  • A value of 0.9 suggests a strong positive relationship: when one value increases, the other tends to increase too.
  • A value of -0.9 indicates a strong negative relationship: when one value increases, the other tends to decrease.
  • A value of 0.2 represents a weak relationship: changes in one value tell us little about changes in the other.

What counts as a good correlation depends on the context, but as a rule of thumb, a correlation of 0.6 or higher (or -0.6 or lower) can be considered strong.
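
As a rough sketch of how the 0.6 rule of thumb could be applied, the snippet below rebuilds the correlation table from the Result section and lists the column pairs whose absolute correlation reaches the threshold. The names THRESHOLD and strong_pairs are illustrative and not part of pandas.

```python
import pandas as pd

cols = ["Duration", "Pulse", "Maxpulse", "Calories"]

# Correlation table copied from the Result section.
corr = pd.DataFrame(
    [
        [1.000000, -0.155408, 0.009403, 0.922721],
        [-0.155408, 1.000000, 0.786535, 0.025120],
        [0.009403, 0.786535, 1.000000, 0.203814],
        [0.922721, 0.025120, 0.203814, 1.000000],
    ],
    index=cols,
    columns=cols,
)

THRESHOLD = 0.6  # rule-of-thumb cutoff; adjust for your context

# Visit each pair of different columns once (upper triangle only)
# and keep those whose absolute correlation meets the threshold.
strong_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) >= THRESHOLD
]

for a, b, r in strong_pairs:
    print(f"{a} / {b}: {r:.6f}")
```

Using the absolute value means strong negative correlations would be picked up as well.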

Perfect Correlation:

We can observe that “Duration” and “Duration” have a correlation of 1.000000, which is expected since any column always has a perfect correlation with itself.
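
Perfect correlations can be reproduced with a tiny made-up DataFrame. In this sketch the column names x, linear_up and mirrored are hypothetical: linear_up rises exactly in step with x, while mirrored falls exactly as x rises.

```python
import pandas as pd

# Made-up data: two columns that are exact linear functions of x.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["linear_up"] = 2 * df["x"] + 1  # increases whenever x increases
df["mirrored"] = -df["x"]          # decreases whenever x increases

corr = df.corr()

# The diagonal compares each column with itself.
# x vs linear_up is a perfect positive correlation,
# x vs mirrored is a perfect negative correlation.
print(corr.round(6))
```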

Good Correlation:

The correlation between “Duration” and “Calories” is 0.922721, indicating a strong relationship. This suggests that the longer the workout, the more calories are burned, and conversely, if many calories were burned, it’s likely the workout was lengthy.
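
If only one pair of columns is of interest, a single coefficient can be computed with Series.corr() instead of building the full table. A minimal sketch, using a few invented rows rather than the real “data.csv”:

```python
import pandas as pd

# Invented sample rows; the real lesson data lives in data.csv.
df = pd.DataFrame(
    {
        "Duration": [60, 60, 45, 30, 45],
        "Calories": [409.1, 479.0, 282.4, 240.0, 300.0],
    }
)

# Correlation between exactly two columns.
r = df["Duration"].corr(df["Calories"])

# The same number appears in the full correlation table.
r_from_table = df.corr().loc["Duration", "Calories"]

print(r, r_from_table)
```

Both approaches use the same default method, so the two values should agree.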

Bad Correlation:

The correlation between “Duration” and “Maxpulse” is 0.009403, indicating almost no linear relationship. This means the workout duration alone tells us very little about the maximum pulse, and the maximum pulse tells us very little about the duration.