In this tutorial you will learn to quickly identify & remove outliers from your data set.
# First of all, we insert three outliers into the disp column of the mtcars dataset
# (mtcars ships with the datasets package, which R loads by default, so there is no need to import anything)
# To create the outliers, we simply multiply the values in mtcars$disp that are higher than 420 by 2
mtcars$disp[mtcars$disp > 420] <- mtcars$disp[mtcars$disp > 420] * 2
# (This is just an arbitrary way of inserting outlier values; you could assign a few high values in a million different ways)
# Now we have a look at the disp column of the mtcars dataset with boxplot
boxplot(mtcars$disp)
# As you can see, there are three outliers above 800
# boxplot() also returns a list of statistics; its $out element holds the actual outlier values
boxplot(mtcars$disp)$out
## [1] 944 920 880
# (Optional) If you don't need to see the plot again, you can hide it using plot=FALSE
boxplot(mtcars$disp, plot=FALSE)$out
## [1] 944 920 880
# Now you can assign the outlier values into a vector
outliers <- boxplot(mtcars$disp, plot=FALSE)$out
# Check the results
print(outliers)
## [1] 944 920 880
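# For context, here is a minimal sketch of the rule behind $out: with the default settings,
# boxplot() flags values lying more than 1.5 times the hinge spread (roughly the IQR) beyond the hinges.
# The sketch works on a fresh copy taken from datasets::mtcars so it can be run on its own;
# the names disp, fence_low, fence_high and manual_outliers are illustrative, not part of the tutorial.

```r
# Rebuild the modified disp column from a fresh copy of the data
disp <- datasets::mtcars$disp
disp[disp > 420] <- disp[disp > 420] * 2

# boxplot.stats() is the function boxplot() uses to compute its statistics
stats <- boxplot.stats(disp, coef = 1.5)

# Elements 2 and 4 of $stats are the lower and upper hinges
hinges <- stats$stats[c(2, 4)]
spread <- diff(hinges)

# Anything outside these fences is reported as an outlier
fence_low  <- hinges[1] - 1.5 * spread
fence_high <- hinges[2] + 1.5 * spread

manual_outliers <- disp[disp < fence_low | disp > fence_high]
print(manual_outliers)
```

# A larger coef (for example 3) makes the rule stricter, so fewer points are flagged.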
# First you need to find which rows contain the outliers
mtcars[which(mtcars$disp %in% outliers),]
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  944 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  920 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8  880 230 3.23 5.345 17.42  0  0    3    4
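# Now you can remove the rows containing the outliers. One possible option is a logical filter,
# which (unlike -which(...), which drops every row when nothing matches) is safe even if no outliers were found:
mtcars <- mtcars[!(mtcars$disp %in% outliers), ]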
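# If you repeat these steps often, they could be wrapped in a small function. The sketch below is one way to do it;
# remove_boxplot_outliers is a hypothetical helper (not part of base R), and it works on a copy named cars
# so that the original data stays untouched.

```r
# Hypothetical helper: drop the rows whose value in `column` is flagged by the boxplot rule
remove_boxplot_outliers <- function(data, column, coef = 1.5) {
  values <- data[[column]]
  out <- boxplot.stats(values, coef = coef)$out
  # Logical filtering keeps all rows when `out` is empty
  data[!(values %in% out), , drop = FALSE]
}

# Recreate the modified dataset on a fresh copy
cars <- datasets::mtcars
cars$disp[cars$disp > 420] <- cars$disp[cars$disp > 420] * 2

cleaned <- remove_boxplot_outliers(cars, "disp")
print(nrow(cars))     # rows before
print(nrow(cleaned))  # rows after
```

# Keep in mind that removing points shifts the quartiles, so it is worth re-checking the cleaned data for newly flagged values.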
# If you check now with boxplot, you will notice that those pesky outliers are gone
boxplot(mtcars$disp)
In this tutorial we learnt how to remove outliers from a data set in our favourite manner, which would be quick & dirty!