Cleaning Wrong Data

“Wrong data” does not have to mean empty cells or a wrong format. It can simply be an inaccurate value, like someone entering “199” instead of “1.99”.

You can sometimes identify wrong data by reviewing the dataset and comparing it with your expectations.

For example, in our dataset, row 7 has a duration of 450 minutes, while every other row has a duration between 30 and 60 minutes. A value of 450 is not impossible, but since this dataset tracks workout sessions, we can reasonably conclude that the person did not work out for 450 minutes.

    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
7        450  '2020/12/08'    104       134     253.3
8         30  '2020/12/09'    109       133     195.1
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'    100       120     300.0
18        45  '2020/12/18'     90       112       NaN
19        60  '2020/12/19'    103       123     323.0
20        45  '2020/12/20'     97       125     243.0
21        60  '2020/12/21'    108       131     364.2
22        45           NaN    100       119     282.0
23        60  '2020/12/23'    130       101     300.0
24        45  '2020/12/24'    105       132     246.0
25        60  '2020/12/25'    102       126     334.5
26        60      20201226    100       120     250.0
27        60  '2020/12/27'     92       118     241.0
28        60  '2020/12/28'    103       132       NaN
29        60  '2020/12/29'    100       132     280.0
30        60  '2020/12/30'    102       129     380.3
31        60  '2020/12/31'     92       115     243.0
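
Reviewing a table by eye works for 32 rows, but the same check can be done in code. The following is a minimal sketch, not part of the original lesson: it builds a small stand-in DataFrame from a few rows of the table above and assumes that any session longer than 120 minutes is suspicious. The 120-minute threshold and the variable name suspicious are assumptions chosen for illustration.

```python
import pandas as pd

# Small stand-in for the lesson's data: five rows copied from the table above.
df = pd.DataFrame(
    {
        "Duration": [60, 60, 45, 450, 30],
        "Pulse": [110, 117, 109, 104, 109],
    },
    index=[0, 1, 3, 7, 8],
)

# Summary statistics make an unusually large maximum easy to spot.
print(df["Duration"].describe())

# Assumed rule: a workout longer than 120 minutes is worth a second look.
suspicious = df[df["Duration"] > 120]
print(suspicious)
```

With the real dataset you would load the data into df first and apply the same filter. The rows it returns are candidates for review, not proof of an error.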

How can we correct incorrect values, such as the “Duration” in row 7?

Replacing Values

One way to fix incorrect values is by replacing them with the correct ones.

In our example, the value in row 7 is likely a typo, and it should be “45” instead of “450”. We can simply replace it with “45”.

Example

Update the “Duration” to 45 in row 7.

df.loc[7, 'Duration'] = 45
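
As a self-contained sketch of the same replacement, the snippet below uses a three-row stand-in DataFrame (an assumption, so that it runs on its own) and inspects the cell before overwriting it. The variable names before and after are only for illustration.

```python
import pandas as pd

# Three rows from the lesson's table, used as stand-in data.
df = pd.DataFrame({"Duration": [60, 450, 30]}, index=[6, 7, 8])

# Look at the value first so you know what you are about to overwrite.
before = df.loc[7, "Duration"]
print("before:", before)

# Replace the single cell at row label 7, column "Duration".
df.loc[7, "Duration"] = 45

after = df.loc[7, "Duration"]
print("after:", after)
```

Note that loc selects by index label, not by position, so 7 here means the row whose label is 7.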

For small datasets, you can manually replace incorrect data one by one, but this approach is impractical for larger datasets.

For bigger datasets, you can establish rules, such as defining boundaries for valid values, and replace any values that fall outside these boundaries.

Example

Iterate through all values in the “Duration” column.

If a value exceeds 120, set it to 120.

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120
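
The loop above visits one row at a time. Pandas can usually apply the same rule to the whole column at once. The sketch below shows two ways to do that, a boolean mask with loc and the clip method, using a small stand-in DataFrame and the same 120-minute limit. The names capped and clipped are hypothetical, and copies are used only so the two options can be compared side by side.

```python
import pandas as pd

# Stand-in data: four Duration values, one of them out of range.
df = pd.DataFrame({"Duration": [60, 45, 450, 30]}, index=[6, 4, 7, 8])

# Option 1: select the offending rows with a boolean mask and assign to them.
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

# Option 2: Series.clip limits every value to the given upper bound.
clipped = df.copy()
clipped["Duration"] = clipped["Duration"].clip(upper=120)

print(capped)
print(clipped)
```

Both options give the same result as the loop. On large datasets they are typically much faster, because the work is done on the whole column instead of row by row.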

Removing Rows

Another way to handle incorrect data is by removing the rows that contain it.

This way you do not have to work out what the correct value should be, and in many cases you can carry out your analysis without those rows.

Example

Remove rows where the “Duration” value exceeds 120.

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)
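
Rows can also be removed without a loop. The sketch below, again on stand-in data with the assumed 120-minute limit, shows two equivalent forms: keeping only the rows that pass the check, and dropping the rows that fail it. The names cleaned and dropped are hypothetical.

```python
import pandas as pd

# Stand-in data: three rows from the lesson's table.
df = pd.DataFrame(
    {"Duration": [60, 450, 30], "Calories": [374.0, 253.3, 195.1]},
    index=[6, 7, 8],
)

# Option 1: keep only the rows whose Duration is within the limit.
cleaned = df[df["Duration"] <= 120]

# Option 2: look up the labels of the bad rows and drop them.
dropped = df.drop(df[df["Duration"] > 120].index)

print(cleaned)
print(dropped)
```

Neither option changes df itself. Each returns a new DataFrame, so assign the result back to df if you want to keep working with the cleaned data. One difference to be aware of: a comparison with NaN is always False, so Option 1 would also remove rows with a missing Duration, while Option 2 would keep them.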