One of the powerful features of the Pandas module is the corr() method, which calculates the pairwise correlation between the columns in a dataset (Pearson correlation by default).
The examples on this page use a CSV file named “data.csv”.
Display the correlation between the columns.
df.corr()
          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155408  0.009403  0.922721
Pulse    -0.155408  1.000000  0.786535  0.025120
Maxpulse  0.009403  0.786535  1.000000  0.203814
Calories  0.922721  0.025120  0.203814  1.000000
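Note: Older versions of Pandas silently exclude non-numeric columns from corr(). Since Pandas 2.0 the numeric_only argument defaults to False, so call df.corr(numeric_only=True) if the dataset contains non-numeric columns.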
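The call above can be sketched as a complete script. Since data.csv is not reproduced on this page, the DataFrame below is a stand-in with made-up values, and the Activity column is a hypothetical text column added only to show numeric_only=True in action. The sketch assumes Pandas 1.5 or later, where that argument is available.

```python
import pandas as pd

# Hypothetical stand-in for data.csv; all values are made up for illustration.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 30, 45],
    "Pulse": [110, 117, 103, 109, 97],
    "Maxpulse": [130, 145, 135, 175, 125],
    "Calories": [409.1, 479.0, 340.0, 282.4, 300.0],
    "Activity": ["run", "run", "walk", "walk", "row"],  # non-numeric column
})

# numeric_only=True leaves the text column out of the calculation.
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix)
```

With a real file, the DataFrame would instead come from pd.read_csv('data.csv').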
The result of the corr() method is a table of numbers representing the strength of the relationship between two columns.
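The values range from -1 to 1. A value of 1 means a perfect positive relationship (when one column goes up, the other goes up too), -1 means a perfect negative relationship (when one goes up, the other goes down), and a value near 0 means there is little or no linear relationship.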
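As a quick check of the two ends of that range, here is a small sketch with made-up columns: up rises in lockstep with x, and down falls in lockstep with it. The column names are illustrative only and do not come from data.csv.

```python
import pandas as pd

# Made-up data: "up" and "down" are exact linear functions of "x".
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["up"] = 2 * df["x"] + 1    # increases whenever x increases
df["down"] = -3 * df["x"]     # decreases whenever x increases

r_up = df["x"].corr(df["up"])      # correlation at the top of the range
r_down = df["x"].corr(df["down"])  # correlation at the bottom of the range

print(round(r_up, 6), round(r_down, 6))
```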
What constitutes a good correlation depends on the context, but generally, a correlation of 0.6 or higher (or -0.6 or lower) can be considered a strong correlation.
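To apply that rule of thumb in code, one option is to scan the matrix for pairs whose absolute value reaches the cutoff. The sketch below rebuilds the correlation table shown earlier by hand, so it runs without data.csv. The names THRESHOLD and strong_pairs are illustrative, and 0.6 is only the guideline mentioned above, not a universal cutoff.

```python
import pandas as pd

# The correlation table from this page, typed in by hand.
cols = ["Duration", "Pulse", "Maxpulse", "Calories"]
corr_matrix = pd.DataFrame(
    [
        [1.000000, -0.155408, 0.009403, 0.922721],
        [-0.155408, 1.000000, 0.786535, 0.025120],
        [0.009403, 0.786535, 1.000000, 0.203814],
        [0.922721, 0.025120, 0.203814, 1.000000],
    ],
    index=cols,
    columns=cols,
)

THRESHOLD = 0.6  # rule of thumb from the text

strong_pairs = []
for i, a in enumerate(cols):
    # Only look at columns after "a": skips self-pairs and mirrored duplicates.
    for b in cols[i + 1:]:
        r = corr_matrix.loc[a, b]
        if abs(r) >= THRESHOLD:  # abs() catches strong negative values too
            strong_pairs.append((a, b, r))

for a, b, r in strong_pairs:
    print(f"{a} / {b}: {r:.6f}")
```

For the table above this reports two pairs: Duration / Calories and Pulse / Maxpulse.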
We can observe that “Duration” and “Duration” have a correlation of 1.000000, which is expected since any column always has a perfect correlation with itself.
The correlation between “Duration” and “Calories” is 0.922721, indicating a strong relationship. This suggests that the longer the workout, the more calories are burned, and conversely, if many calories were burned, it’s likely the workout was lengthy.
The correlation between “Duration” and “Maxpulse” is 0.009403, indicating a very weak relationship. This means we cannot predict the maximum pulse based solely on the workout duration, nor can we predict the duration from the max pulse.
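Individual values like the ones discussed above can be read from the matrix by row and column label, or computed for a single pair of columns with Series.corr(). The sketch below uses made-up workout values rather than the contents of data.csv, so its numbers will differ from the table on this page.

```python
import pandas as pd

# Made-up workout values for illustration only; not the contents of data.csv.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 30, 45, 20],
    "Maxpulse": [130, 145, 135, 175, 125, 150],
    "Calories": [409.1, 479.0, 340.0, 282.4, 300.0, 150.5],
})

corr_matrix = df.corr()

# One cell of the matrix, read by row and column label.
from_matrix = corr_matrix.loc["Duration", "Calories"]

# The same number computed directly for one pair of columns.
direct = df["Duration"].corr(df["Calories"])

# A column compared with itself always gives 1.
self_corr = df["Duration"].corr(df["Duration"])

print(from_matrix, direct, self_corr)
```

The matrix is symmetric, so corr_matrix.loc["Calories", "Duration"] gives the same value as corr_matrix.loc["Duration", "Calories"].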