Univariate outliers can be found when looking at a distribution of values in a single variable.
On the other hand, multivariate outliers are found in an n-dimensional space (of n variables). To find them, we need to look at the joint distribution across multiple dimensions.
Outliers can drastically change the results of data analysis and statistical modelling. Among their unfavourable impacts, they inflate the error variance, reduce the power of statistical tests, and can bias estimates.
One of the simplest methods for detecting univariate outliers is the use of box plots.
In the box plot, “Tukey’s method of outlier detection” is used to detect outliers.
According to this method, outliers are the values in the data set that fall below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 − Q1 is the interquartile range.
These two limits are called “outlier fences”, and any values lying outside the outlier fences are depicted using an “o” or a similar symbol on the box plot.
# Diamonds$carat %>% boxplot(main="Box Plot of Diamond Carat", ylab="Carat", col = "grey")
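As a rough illustration of the fences described above, the following sketch computes them by hand. The vector x is hypothetical toy data, not the Diamonds data, and quantile() is used with its default settings. Note that boxplot() derives its fences from Tukey's hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Hypothetical toy data: nine typical values plus one extreme value.
x <- c(10, 12, 11, 13, 12, 11, 14, 12, 13, 40)

# First and third quartiles and the interquartile range.
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Tukey's outlier fences.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as possible outliers.
tukey_outliers <- x[x < lower_fence | x > upper_fence]
tukey_outliers

# boxplot.stats() returns the points that boxplot() would draw as outliers.
boxplot.stats(x)$out
```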
In the z-score method, a standardised score (z-score) is calculated for every observation.
An observation is regarded as an outlier if the absolute value of its z-score is greater than 3.
First, it is important to check whether the variable is approximately normally distributed. Use the hist() function to check the distribution.
From the max and min values in the summary we can see if the range of values goes beyond -3 and 3 in the z-scores.
# library(outliers)
# z.scores <- Diamonds$depth %>% scores(type = "z")
# z.scores %>% summary()
Using which(), we can also find the locations of z-scores whose absolute value is greater than 3.
# which( abs(z.scores) >3 )
Using length(), we can find how many outliers there are.
# length (which( abs(z.scores) >3 ))
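For readers who prefer not to depend on the outliers package, the z-score rule can be sketched in base R. The data here are simulated (an assumption made for illustration), with one extreme value planted at the end. The manual calculation should agree with what scale() returns.

```r
set.seed(1)

# Hypothetical data: 200 roughly normal values plus one planted extreme value.
x <- c(rnorm(200, mean = 60, sd = 1.5), 75)

# z-score: distance from the mean in units of the standard deviation.
z <- (x - mean(x)) / sd(x)

# Locations and count of observations with |z| > 3.
outlier_idx <- which(abs(z) > 3)
outlier_idx
length(outlier_idx)
```

Because the mean and standard deviation are themselves affected by extreme values, a very large outlier can mask smaller ones under this rule.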
A box plot can also be used for a numeric variable split across the levels of a categorical variable.
# boxplot(Diamonds$carat ~ Diamonds$cut, main="Diamond carat by cut", ylab = "Carat", xlab = "Cut")
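To list the points that a grouped box plot would flag, one option is to apply boxplot.stats() within each group. This is only a sketch: the built-in iris data stands in for the Diamonds data, which is not bundled with R, and the variable names are chosen for illustration.

```r
# Assumption: iris is used as a stand-in data set.
values <- iris$Sepal.Width   # numeric variable
groups <- iris$Species       # categorical variable

# boxplot.stats() applies the 1.5 x IQR rule that boxplot() uses when drawing.
outliers_by_group <- lapply(split(values, groups),
                            function(v) boxplot.stats(v)$out)
outliers_by_group
```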
A scatter plot can be used for two numeric variables.
According to this scatter plot, there are some possible outliers on the lower left and lower right-hand sides.
# Diamonds %>% plot(carat ~ depth, data = ., ylab="Carat", xlab="Depth", main="Carat by Depth")
The Mahalanobis distance is the most commonly used distance metric for detecting outliers in the multivariate setting.
It is an extension of the univariate z-score that also accounts for the correlation structure between all the variables.
For multivariate normal data, the squared Mahalanobis distance approximately follows a chi-squared distribution with n (number of variables) degrees of freedom, so any observation whose squared distance is greater than the critical chi-squared value is treated as an outlier.
# library(MVN)
# results <- mvn(data = versicolor, multivariateOutlierMethod = "quan", showOutliers = TRUE)
This plot suggests the existence of 2 outliers. To see the list of possible multivariate outliers:
# results$multivariateOutliers
This output provides the locations of the outliers. In this example, the 1st and 2nd observations are the suggested outliers for the data.
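The chi-squared rule can also be sketched with base R alone. This is a simplified illustration, not a reproduction of what mvn() does: it uses the classical mean and covariance, whereas the "quan" method is, to our understanding, based on robust estimates, so the rows flagged may differ. The definition of versicolor below and the 0.975 quantile are assumptions.

```r
# Assumption: versicolor holds the four numeric iris columns for that species.
versicolor <- iris[iris$Species == "versicolor", 1:4]

# Squared Mahalanobis distance of each row from the vector of column means.
d2 <- mahalanobis(versicolor,
                  center = colMeans(versicolor),
                  cov    = cov(versicolor))

# Critical value: 97.5th percentile of the chi-squared distribution,
# with degrees of freedom equal to the number of variables.
cutoff <- qchisq(0.975, df = ncol(versicolor))

# Row positions whose squared distance exceeds the critical value.
which(d2 > cutoff)
```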
When an outlier is due to a data entry error or a data processing error, or when the outlying observations are very few in number, leaving out or deleting the outliers can be used as a strategy.
Below, we filter out the observations whose absolute z-scores are greater than 3.
# library(outliers)
# z.scores <- Diamonds$carat %>% scores(type = "z")
# carat_clean <- Diamonds$carat[ abs(z.scores) <= 3 ]
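One pitfall is worth noting when excluding outliers by negative indexing with which(): if no observation exceeds the threshold, which() returns an empty vector and the negative index drops every value. Filtering with a logical condition avoids this. The sketch below uses hypothetical toy data that contain no outliers.

```r
# Hypothetical data with no outliers.
x <- c(5, 6, 7, 6, 5)
z <- (x - mean(x)) / sd(x)

# Negative indexing with an empty which() result drops everything.
dropped_all <- x[-which(abs(z) > 3)]
length(dropped_all)

# Logical indexing keeps all values when nothing is flagged.
kept <- x[abs(z) <= 3]
length(kept)
```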
The results showed that observations 1 and 2 were outliers, so we remove rows 1 and 2 below.
# results <- mvn(data = versicolor, multivariateOutlierMethod = "quan", showOutliers = TRUE)
# results$multivariateOutliers
# versicolor_clean <- versicolor[ -c(1,2), ]
showNewData=TRUE returns the new data with no outliers.
# versicolor_clean2 <- mvn(data = versicolor, multivariateOutlierMethod = "quan", showOutliers = TRUE, showNewData = TRUE)
Like imputation of missing values, we can also impute outliers. We can use mean or median imputation methods to replace outliers.
The following replaces the outliers with the mean; other values, such as the median, can be substituted in the same way.
# z.scores <- Diamonds$carat %>% scores(type = "z")
# Diamonds$carat[ which( abs(z.scores) > 3 ) ] <- mean(Diamonds$carat, na.rm = TRUE)
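Median imputation follows the same pattern. The sketch below uses hypothetical toy data and base R only; computing the median from the non-outlying values is a design choice made here for illustration, although the median is robust enough that it rarely makes a difference.

```r
# Hypothetical data: thirty typical values plus one extreme value.
x <- c(rep(c(0.3, 0.4, 0.5), times = 10), 9.0)

# Flag observations with |z| > 3.
z <- (x - mean(x)) / sd(x)
is_outlier <- abs(z) > 3

# Work on a copy so the original values stay available for comparison.
x_imputed <- x
x_imputed[is_outlier] <- median(x[!is_outlier])

summary(x_imputed)
```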
Capping involves replacing the outliers with the nearest values that are not outliers. For example, for outliers that lie outside the outlier fences on a box plot, we can cap them by replacing the observations below the lower fence with the value of the 5th percentile and those above the upper fence with the value of the 95th percentile.
The following defines a function to cap the values outside the fences.
# cap <- function(x){
# quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
# x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
# x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
# x}
# carat_capped <- Diamonds$carat %>% cap()
Take a subset of the Diamonds data containing only the quantitative variables.
# Diamonds_sub <- Diamonds %>% select(carat, depth, price)
Apply the user-defined function “cap” to each column of the data frame. Note that sapply() returns a matrix rather than a data frame.
# Diamonds_capped <- sapply(Diamonds_sub, FUN = cap)
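If the capped result should remain a data frame, one possible variation is to assign the output of lapply() back into a copy of the data. This is a sketch under stated assumptions: df is a hypothetical toy data frame, and cap() is redefined here with na.rm = TRUE so that the block is self-contained.

```r
# Cap values outside Tukey's fences at the 5th and 95th percentiles.
cap <- function(x) {
  q   <- quantile(x, c(0.05, 0.25, 0.75, 0.95), na.rm = TRUE, names = FALSE)
  iqr <- q[3] - q[2]
  x[which(x < q[2] - 1.5 * iqr)] <- q[1]   # below lower fence -> 5th percentile
  x[which(x > q[3] + 1.5 * iqr)] <- q[4]   # above upper fence -> 95th percentile
  x
}

# Hypothetical data frame with one extreme value in each column.
df <- data.frame(
  a = c(1:20, 100),
  b = c(-50, seq(10, 29))
)

# Assigning into df_capped[] keeps the data frame structure and column names.
df_capped <- df
df_capped[] <- lapply(df, cap)

str(df_capped)
```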