1 Goal


In this tutorial you will learn how to quickly identify and remove outliers from a data set using R's boxplot() function.


2 Preparing the data


# First of all, we insert a few outliers into the $disp column of the mtcars dataset
# (mtcars ships with the datasets package, which R loads by default, so there is no need to import anything)
# To get a few outliers into this dataset, we simply multiply every value in mtcars$disp that is higher than 420 by 2

mtcars$disp[mtcars$disp > 420] <- mtcars$disp[mtcars$disp > 420] * 2

# (This is just an arbitrary way of inserting a few outlier values; you could assign high values in a million other ways)

# Now we have a look at the $disp column of the mtcars dataset with boxplot()

boxplot(mtcars$disp)

# As you can see, there are three outliers of over 800
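
It may help to see what makes boxplot() draw a value as an outlier. With default settings, any value lying more than 1.5 times the distance between the hinges beyond a hinge is flagged. The following is a minimal sketch of that rule, not a replacement for boxplot(); the names disp, h, spread, lower_fence, upper_fence and manual_out are made up for this illustration, and the data is rebuilt from datasets::mtcars so the snippet runs on its own.

```r
# Work on a fresh copy so the example does not depend on earlier steps
disp <- datasets::mtcars$disp
disp[disp > 420] <- disp[disp > 420] * 2   # same modification as in this tutorial

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
h <- fivenum(disp)
spread <- h[4] - h[2]                      # distance between the two hinges

# Fences sit 1.5 times that distance beyond each hinge
lower_fence <- h[2] - 1.5 * spread
upper_fence <- h[4] + 1.5 * spread

# Values outside the fences are the ones drawn as individual points
manual_out <- disp[disp < lower_fence | disp > upper_fence]
print(manual_out)

# For comparison: the values flagged by boxplot.stats()
print(boxplot.stats(disp)$out)
```

Both print() calls should show the same three values.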

3 Storing outliers into a vector


# You can get the actual values of the outliers with this

boxplot(mtcars$disp)$out

## [1] 944 920 880
# (Optional) If you don't need to see the plot again, you can hide it using plot=FALSE

boxplot(mtcars$disp, plot=FALSE)$out
## [1] 944 920 880
# Now you can assign the outlier values to a vector

outliers <- boxplot(mtcars$disp, plot=FALSE)$out

# Check the results

print(outliers)
## [1] 944 920 880
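
If you only need the values and never the plot, boxplot.stats() from the grDevices package (loaded by default) returns the same $out component, and its coef argument controls how strict the rule is. A short sketch, assuming the same modified disp values as above; the names disp, default_out and strict_out are chosen here for illustration only.

```r
# Fresh copy of the data with the same modification as in this tutorial
disp <- datasets::mtcars$disp
disp[disp > 420] <- disp[disp > 420] * 2

# Default rule (coef = 1.5)
default_out <- boxplot.stats(disp)$out

# Stricter rule: only values further out than 3 hinge-spreads are flagged
strict_out <- boxplot.stats(disp, coef = 3)$out

print(default_out)
print(strict_out)
```

With coef = 3 nothing is flagged in this data, so strict_out is an empty vector. Whatever removal step comes next has to cope with that case.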

4 Removing the outliers


# First you need to find which rows contain the outliers

mtcars[which(mtcars$disp %in% outliers),]
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  944 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  920 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8  880 230 3.23 5.345 17.42  0  0    3    4
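# Now you can remove the rows containing the outliers; one possible option is:
# (a logical mask is used rather than -which(...), because a negative which() index drops every row when no outliers are found)

mtcars <- mtcars[!(mtcars$disp %in% outliers), ]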

# If you check now with boxplot, you will notice that those pesky outliers are gone

boxplot(mtcars$disp)
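
A note on dropping rows by index: when nothing matches, which() returns an empty vector, and negating an empty index selects zero rows instead of all of them. A logical mask does not have this problem. The sketch below shows the difference on a hypothetical toy data frame (toy, toy_out, dropped_all and kept_all are names invented for this example).

```r
# Hypothetical toy data with no outliers in column x
toy <- data.frame(id = 1:5, x = c(10, 11, 12, 13, 14))
toy_out <- boxplot.stats(toy$x)$out        # empty: nothing is flagged

# Negative which(): the index is empty, so zero rows are selected
dropped_all <- toy[-which(toy$x %in% toy_out), ]

# Logical mask: every row is kept when nothing matches
kept_all <- toy[!(toy$x %in% toy_out), ]

print(nrow(dropped_all))
print(nrow(kept_all))
```

The first count is 0 and the second is 5, so the logical form is the safer habit when the same code runs on data that may or may not contain outliers.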


5 Conclusion


In this tutorial we learnt how to remove outliers from a data set in our favourite manner: quick and dirty!