Outliers

Univariate outliers can be found when looking at a distribution of values in a single variable.

On the other hand, multivariate outliers can be found in an n-dimensional space (of n variables). To find them, we need to look at distributions in multiple dimensions.

Why are they bad?

Outliers can drastically change the results of data analysis and statistical modelling. Some of the unfavourable impacts of outliers are:

they increase the error variance,
they reduce the power of statistical tests,
they can bias or influence the estimates of model parameters that may be of substantive interest.

Univariate Detection

Box Plots

One of the simplest methods for detecting univariate outliers is the use of box plots.

In the box plot, “Tukey’s method of outlier detection” is used to detect outliers.

According to this method, outliers are defined as the values in the data set that fall below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 − Q1 is the interquartile range.

These Q1 − 1.5×IQR and Q3 + 1.5×IQR limits are called “outlier fences” and any values lying outside the outlier fences are depicted using an “o” or a similar symbol on the box plot.
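
As a rough illustration, the fences can be computed by hand in base R. This is a minimal sketch on made-up toy values (not the Diamonds data); the variable names are hypothetical.

```r
# Toy data (hypothetical values, not the Diamonds data)
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 21, 45)

q1  <- quantile(x, 0.25, names = FALSE)   # first quartile
q3  <- quantile(x, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1                            # same value as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
tukey_outliers <- x[x < lower_fence | x > upper_fence]
tukey_outliers

# boxplot.stats() reports the points a box plot would draw as outliers.
# It computes the quartiles as hinges, so its fences can differ slightly
# from the quantile()-based ones above.
boxplot.stats(x)$out
```

On this toy vector only the value 45 falls outside the fences.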

# Diamonds$carat %>%  boxplot(main="Box Plot of Diamond Carat", ylab="Carat", col = "grey")

Distance Based Methods

A standardised score (z-score) is calculated for every observation: its distance from the mean, measured in units of standard deviation.

An observation is regarded as an outlier if the absolute value of its z-score is greater than 3.
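
The rule can be sketched in base R without any package. The data below are hypothetical toy values chosen so that exactly one observation is extreme.

```r
# Toy data (hypothetical): twenty values of 10 and one extreme value
x <- c(rep(10, 20), 100)

# z-score: distance from the mean in units of standard deviation
z <- (x - mean(x)) / sd(x)

outlier_idx <- which(abs(z) > 3)   # positions of the flagged observations
n_outliers  <- sum(abs(z) > 3)     # how many observations were flagged

outlier_idx
n_outliers
```

One caveat with this rule: with the sample standard deviation, the largest possible |z| is (n − 1)/√n, so a cut-off of 3 cannot flag anything in samples of 10 or fewer observations.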

Normal Distribution

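First, it is important to check whether the variable is approximately normally distributed, because the cut-off of 3 for the z-score is motivated by the normal distribution. Use the hist() function to check the distribution.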
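
A minimal sketch of such a check, using simulated data as a hypothetical stand-in for a real variable. The Q-Q plot is an addition to the histogram mentioned above, not something the notes require.

```r
set.seed(1)                          # reproducible simulated data
x <- rnorm(500, mean = 60, sd = 2)   # hypothetical stand-in for a real variable

# Histogram: look for a roughly symmetric, bell-shaped distribution
hist(x, main = "Histogram of x", xlab = "x")

# Normal Q-Q plot: points close to the line suggest approximate normality
qqnorm(x)
qqline(x)
```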

Summarise Outliers in R

From the min and max values in the summary we can see whether any z-scores fall below −3 or above 3.

# library(outliers)

# z.scores <- Diamonds$depth %>% scores(type = "z")

# z.scores %>% summary()

Find Locations of Outliers in R

Using which(), we can also find the locations of z-scores whose absolute value is greater than 3.

# which( abs(z.scores) >3 )

Find Number of Outliers in R

Using length(), we can find how many outliers there are.

# length (which( abs(z.scores) >3 ))

Multivariate Methods

Box Plots

Side-by-side box plots work for a numeric variable split across the levels of a categorical variable.

# boxplot(Diamonds$carat ~ Diamonds$cut, main="Diamond carat by cut", ylab = "Carat", xlab = "Cut")

Scatter Plots

This works for two numeric variables.

According to this scatter plot, there are some possible outliers on the lower left and lower right hand side.

# Diamonds %>% plot(carat ~ depth, data = ., ylab="Carat", xlab="Depth", main="Carat by Depth")

Mahalanobis

This is the most commonly used distance metric for detecting outliers in the multivariate setting.

It is an extension of the univariate z-score, which also accounts for the correlation structure between all the variables.

Under multivariate normality, the squared Mahalanobis distance approximately follows a chi-squared distribution with n (number of variables) degrees of freedom, so any observation whose squared distance is greater than the critical chi-squared value is treated as an outlier.
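
The chi-squared rule can be sketched with base R's mahalanobis() and qchisq(). This sketch makes two assumptions: the data are the versicolor rows of the built-in iris data set, and the cut-off is the 97.5% quantile. It uses the classical mean and covariance, so it is not necessarily the same computation that a package's outlier method performs.

```r
# Versicolor rows of the built-in iris data, numeric columns only (assumed data)
versicolor_num <- iris[iris$Species == "versicolor", 1:4]

# Squared Mahalanobis distance of each row from the column means,
# using the classical (non-robust) mean and covariance
d2 <- mahalanobis(versicolor_num,
                  center = colMeans(versicolor_num),
                  cov    = cov(versicolor_num))

# Critical value: 97.5% quantile (an assumed cut-off) of the chi-squared
# distribution with df = number of variables
cutoff <- qchisq(0.975, df = ncol(versicolor_num))

# Row positions whose squared distance exceeds the critical value
flagged <- which(d2 > cutoff)
flagged
```

Because the mean and covariance are themselves affected by outliers, robust estimates are often preferred in practice; the classical version is shown only because it needs nothing beyond base R.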

MVN Function in R

# results <- mvn(data = versicolor, multivariateOutlierMethod = "quan", showOutliers = TRUE)

The resulting plot suggests the existence of 2 outliers. To see the list of possible multivariate outliers:

# results$multivariateOutliers

This output provides the locations of the outliers. In this example, the 1st and 2nd observations are the suggested outliers for the data.

Handling Outliers

Excluding/Deleting

When to exclude?

When an outlier is due to a data entry error or a data processing error, or when the outlying observations are very few in number, leaving them out or deleting them can be used as a strategy.

Excluding using the which() function

Below, we filter out the observations whose absolute z-scores are greater than 3.

# library(outliers)

# z.scores <- Diamonds$depth %>%  scores(type = "z")

# Carat_clean<- Diamonds$carat[ - which( abs(z.scores) >3 )]
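
One thing to watch with negative indexing on which(): when no observation is flagged, which() returns an empty vector and the negative index then drops everything. The sketch below, on hypothetical toy data, contrasts this with a logical index, which is one possible alternative.

```r
x <- c(4.1, 3.9, 4.0, 4.2, 3.8)        # toy data with no extreme values
z <- (x - mean(x)) / sd(x)

idx <- which(abs(z) > 3)               # integer(0): nothing is flagged

dropped_everything <- x[-idx]          # x[-integer(0)] selects nothing
kept_all           <- x[abs(z) <= 3]   # logical index keeps every value

length(dropped_everything)
length(kept_all)
```

If the z-scores can contain NA, x[which(abs(z) <= 3)] avoids NA entries in the result.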

Excluding using MVN

The results showed that observations 1 and 2 were outliers, so we filter out rows 1 and 2 below.

# results <- mvn(data = versicolor, multivariateOutlierMethod = "quan", showOutliers = TRUE)

# results$multivariateOutliers

# versicolor_clean <- versicolor[ -c(1,2), ]

Excluding using MVN 2

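Setting showNewData = TRUE additionally returns the data with the outliers removed. Note that mvn() returns a list of results rather than a data frame; in the MVN versions that accept this argument, the cleaned data are expected in the newData element of that list.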

# versicolor_clean2 <- mvn(data = versicolor, multivariateOutlierMethod = "quan", showOutliers = TRUE, showNewData = TRUE)

Imputing

Like imputation of missing values, we can also impute outliers. We can use mean or median imputation methods to replace outliers.

Replacing with the mean

The same pattern works for the mean, the median or any other replacement value: index the outliers and assign the replacement to them.

# Diamonds$carat[ which( abs(z.scores) >3 )] <- mean(Diamonds$carat, na.rm = TRUE)
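
A minimal sketch of the median variant on hypothetical toy data. Here the replacement is computed from the non-flagged values only, which is a design choice of this sketch, so that the outliers do not influence the value that replaces them.

```r
x <- c(rep(10, 20), 100)             # toy data (hypothetical) with one extreme value
z <- (x - mean(x)) / sd(x)
is_outlier <- abs(z) > 3

# Replace flagged values with the median of the values that were not flagged
x_imputed <- x
x_imputed[is_outlier] <- median(x[!is_outlier], na.rm = TRUE)

x_imputed
```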

Capping (aka Winsorising)

This involves replacing the outliers with the nearest values that are not outliers. For example, for outliers that lie outside the outlier fences on a box plot, we can cap them by replacing the observations below the lower fence with the value of the 5th percentile and those above the upper fence with the value of the 95th percentile.

Write a function

This defines a function to cap the values outside the limits.

# cap <- function(x){
#     quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
#     x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
#     x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
#     x}

Use the function

# carat_capped <- Diamonds$carat %>% cap()
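
To see what capping does, here is a small run on hypothetical toy data. The cap() logic is repeated from the definition above only so that the example runs on its own.

```r
# Same capping logic as the cap() function above
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]   # below lower fence -> 5th percentile
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]   # above upper fence -> 95th percentile
  x
}

x <- c(1:19, 100)      # toy data: 100 lies above the upper fence
x_capped <- cap(x)

range(x)
range(x_capped)
```

Only the extreme value changes; the other 19 values are returned untouched. Note that the 95th percentile is computed from the original data, so it is itself pulled upwards by the extreme value.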

Apply to dataframe using sapply

Take a subset of the Diamonds data containing only quantitative variables.

# library(dplyr)

# Diamonds_sub <- Diamonds %>%  select(carat, depth, price)

Apply the user-defined function “cap” to each column of the data frame.

# Diamonds_capped <- sapply(Diamonds_sub, FUN = cap)
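
Because every column has the same length, sapply() simplifies its result to a matrix rather than a data frame. The sketch below, on a hypothetical toy data frame, shows one way to keep the data-frame structure with lapply(); cap() is repeated so the example runs on its own.

```r
# Same capping logic as the cap() function above
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Toy data frame (hypothetical columns)
df <- data.frame(a = c(1:19, 100), b = c(-80, 2:20))

capped_matrix <- sapply(df, FUN = cap)   # simplified to a numeric matrix

capped_df <- df
capped_df[] <- lapply(df, cap)           # assigning into df[] keeps the data frame

class(capped_matrix)
class(capped_df)
```

Wrapping the sapply() result in as.data.frame() is another option when a data frame is needed afterwards.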