1 Goal

In this tutorial you will learn to quickly identify & remove outliers from your data set.

2 Preparing the data

# First of all, we insert a couple of outliers into the disp column of the mtcars dataset
# (mtcars ships with R's datasets package, which is loaded by default, so there is no need to import anything)
# To get a couple of outliers into this dataset, we simply multiply the values in mtcars$disp that are higher than 420 by 2

mtcars$disp[mtcars$disp > 420] <- mtcars$disp[mtcars$disp > 420] * 2

# (This is just an arbitrary way of inserting a couple of outlier values; you could assign a few high values in a million different ways)

# Now we have a look at the disp column of the mtcars dataset with a boxplot

boxplot(mtcars$disp)

# As you can see, there are three outliers, all above 800
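
# (Optional) To see why boxplot flags exactly these values, here is a minimal sketch of the rule it applies by default: a point counts as an outlier if it lies more than 1.5 times the hinge spread beyond the hinges (roughly the quartiles). The names cars, hinges, spread, lower_fence, upper_fence and manual_out are illustrative only, and the sketch rebuilds the data on a fresh copy so it can run on its own.

```r
# Work on a fresh copy so this sketch is independent of the steps above
cars <- datasets::mtcars
cars$disp[cars$disp > 420] <- cars$disp[cars$disp > 420] * 2

# fivenum() returns min, lower hinge, median, upper hinge, max
hinges <- fivenum(cars$disp)[c(2, 4)]
spread <- diff(hinges)

# Anything further than 1.5 * spread from the hinges is flagged
lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread
manual_out <- cars$disp[cars$disp < lower_fence | cars$disp > upper_fence]

print(manual_out)
```

# The values printed here should be the same ones that boxplot draws as individual points.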

3 Storing outliers into a vector

# You can get the actual values of the outliers with this

boxplot(mtcars$disp)$out

## [1] 944 920 880

# (Optional) If you don't need to see the plot again, you can hide it using plot=FALSE

boxplot(mtcars$disp, plot=FALSE)$out

## [1] 944 920 880

# Now you can assign the outlier values into a vector

outliers <- boxplot(mtcars$disp, plot=FALSE)$out

# Check the results

print(outliers)

## [1] 944 920 880
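
# (Optional) Which values end up in the vector depends on how strict the rule is. boxplot has a range argument (1.5 by default) that sets how far beyond the box a point must lie before it is reported in $out. The sketch below, run on a fresh copy with the illustrative names cars, default_out and strict_out, compares the default with a stricter setting; the value 3 is just an example, not a recommendation.

```r
# Rebuild the modified data on a fresh copy so this runs on its own
cars <- datasets::mtcars
cars$disp[cars$disp > 420] <- cars$disp[cars$disp > 420] * 2

# Default rule (range = 1.5) versus a stricter one (range = 3)
default_out <- boxplot(cars$disp, plot = FALSE)$out
strict_out  <- boxplot(cars$disp, plot = FALSE, range = 3)$out

print(default_out)
print(strict_out)
```

# With the stricter setting, fewer values (here possibly none) are treated as outliers, so it is worth checking the vector before removing anything.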

4 Removing the outliers

# First you need to find which rows contain the outliers

mtcars[which(mtcars$disp %in% outliers),]

##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  944 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  920 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8  880 230 3.23 5.345 17.42  0  0    3    4

# Now you can remove the rows containing the outliers. One possible option is logical indexing, which also leaves the data untouched when there are no outliers:

mtcars <- mtcars[!(mtcars$disp %in% outliers),]

# If you check now with boxplot, you will notice that those pesky outliers are gone

boxplot(mtcars$disp)
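
# (Optional) A caveat when removing rows by index: if the outlier vector is empty, negative indexing with which() drops every row instead of none, whereas logical indexing keeps them all. The sketch below shows this on the unmodified mtcars data, where boxplot finds no outliers in disp; cars, no_outliers, dropped_all and kept_all are illustrative names.

```r
# Unmodified data: boxplot reports no outliers for disp
cars <- datasets::mtcars
no_outliers <- boxplot(cars$disp, plot = FALSE)$out

# -which(...) on an empty match is integer(0), which selects no rows at all
dropped_all <- cars[-which(cars$disp %in% no_outliers), ]
print(nrow(dropped_all))   # 0

# Logical negation keeps every row when nothing matches
kept_all <- cars[!(cars$disp %in% no_outliers), ]
print(nrow(kept_all))      # 32
```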

5 Conclusion

In this tutorial we learnt how to remove outliers from a data set in our favourite manner, which would be quick & dirty!

Removing outliers - quick & dirty

Ubiqum Code Academy
