Outliers detection in R

Introduction

An outlier is a value or an observation that is distant from other observations, that is to say, a data point that differs significantly from other data points.
An observation must always be compared to other observations made on the same phenomenon before actually calling it an outlier.
Outliers can also arise due to an experimental, measurement or encoding error.

The dataset mpg from the {ggplot2} package will be used to illustrate the different approaches of outliers detection in R, and in particular we will focus on the variable hwy (highway miles per gallon).

1. Descriptive statistics

A. Minimum and maximum

The first step to detect outliers in R is to start with some descriptive statistics, and in particular with the minimum and maximum.

In R, this can easily be done with the summary() function:

library(ggplot2)
dat <- ggplot2::mpg
summary(dat$hwy)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   18.00   24.00   23.44   27.00   44.00

where the minimum and maximum are respectively the first and last values in the output above.

Alternatively, they can also be computed with the min() and max(), or range() functions:

min(dat$hwy)

## [1] 12

max(dat$hwy)

## [1] 44

range(dat$hwy)

## [1] 12 44

Some clear encoding mistake like a weight of 786 kg (1733 pounds) for a human will already be easily detected by this very simple technique.

B. Histogram

Another basic way to detect outliers is to draw a histogram of the data.

Using R base (with the number of bins corresponding to the square root of the number of observations in order to have more bins than the default option):

hist(dat$hwy,
  xlab = "hwy",
  main = "Histogram of hwy",
  breaks = sqrt(nrow(dat))
) # set number of bins

# histogram - 
ggplot(data = dat, mapping = aes(x = hwy))+       # set data and axes
  geom_histogram(              # display histogram
    color = "red",               # bin line color
    fill = "blue")          # bin interior color (fill)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

or using ggplot2 (learn how to create plots with this package via the esquisse addin or via this tutorial):

library(ggplot2)

ggplot(dat) +
  aes(x = hwy) +
  geom_histogram(bins = 30L, color = "red", fill = "#0c4c8a") +
  
  theme_minimal()

From the histogram, there seems to be a couple of observations higher than all other observations (see the bar on the right side of the plot).

C. Boxplot

In addition to histograms, boxplots are also useful to detect potential outliers.

Using R base:

boxplot(dat$hwy,
  ylab = "hwy"
)

or using ggplot2:

ggplot(dat) +
  aes(x = "", y = hwy) +
  geom_boxplot(fill = "blue") +
  theme_minimal()

A boxplot helps to visualize a quantitative variable by displaying five common location summary (minimum, median, first and third quartiles and maximum) and any observation that was classified as a suspected outlier using the interquartile range (IQR) criterion.

Observations considered as potential outliers by the IQR criterion are displayed as points in the boxplot. Based on this criterion, there are 2 potential outliers (see the 2 points above the vertical line, at the top of the boxplot).

Remember that it is not because an observation is considered as a potential outlier by the IQR criterion that you should remove it.

*Removing or keeping an outlier* depends on

the context of your analysis,
whether the tests you are going to perform on the dataset are robust to outliers or not, and
how far is the outlier from other observations.

It is also possible to extract the values of the potential outliers based on the IQR criterion thanks to the boxplot.stats()$out function:

boxplot.stats(dat$hwy)$out

## [1] 44 44 41

As you can see, there are actually 3 points considered as potential outliers:

2 observations with a value of 44 and 1 observation with a value of 41.

Thanks to the which() function it is possible to extract the row number corresponding to these outliers:

out <- boxplot.stats(dat$hwy)$out
out_ind <- which(dat$hwy %in% c(out))
out_ind

## [1] 213 222 223

With this information you can now easily go back to the specific rows in the dataset to verify them, or print all variables for these outliers:

dat[out_ind, ]

## # A tibble: 3 × 11
##   manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 volkswagen   jetta        1.9  1999     4 manua… f        33    44 d     comp…
## 2 volkswagen   new beetle   1.9  1999     4 manua… f        35    44 d     subc…
## 3 volkswagen   new beetle   1.9  1999     4 auto(… f        29    41 d     subc…

It is also possible to print the values of the outliers directly on boxplot with the mtext() function:

boxplot(dat$hwy, col = "blue",
  ylab = "hwy",
  main = "Boxplot of highway miles per gallon")

mtext(paste("Outliers: ", paste(out, collapse = ", ")))

D. Percentiles

This method of outliers detection is based on the percentiles. With the percentiles method, all observations that lie outside the interval formed by the 2.5 and 97.5 percentiles will be considered as potential outliers.

Other percentiles such as the 1 and 99, or the 5 and 95 percentiles can also be considered to construct the interval.

The values of the lower and upper percentiles (and thus the lower and upper limits of the interval) can be computed with the quantile() function:

lower_bound <- quantile(dat$hwy, 0.025)
lower_bound

## 2.5% 
##   14

upper_bound <- quantile(dat$hwy, 0.975)
upper_bound

##  97.5% 
## 35.175

According to this method, all observations below 14 and above 35.175 will be considered as potential outliers.

The row numbers of the observations outside of the interval can then be extracted with the which() function:

outlier_ind <- which(dat$hwy < lower_bound | dat$hwy > upper_bound)
outlier_ind

##  [1]  55  60  66  70 106 107 127 197 213 222 223

Then their values of highway miles per gallon can be printed:

dat[outlier_ind, "hwy"]

## # A tibble: 11 × 1
##      hwy
##    <int>
##  1    12
##  2    12
##  3    12
##  4    12
##  5    36
##  6    36
##  7    12
##  8    37
##  9    44
## 10    44
## 11    41

Alternatively, all variables for these outliers can be printed:

dat[outlier_ind, ]

## # A tibble: 11 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 dodge        dakota pi…   4.7  2008     8 auto… 4         9    12 e     pick…
##  2 dodge        durango 4…   4.7  2008     8 auto… 4         9    12 e     suv  
##  3 dodge        ram 1500 …   4.7  2008     8 auto… 4         9    12 e     pick…
##  4 dodge        ram 1500 …   4.7  2008     8 manu… 4         9    12 e     pick…
##  5 honda        civic        1.8  2008     4 auto… f        25    36 r     subc…
##  6 honda        civic        1.8  2008     4 auto… f        24    36 c     subc…
##  7 jeep         grand che…   4.7  2008     8 auto… 4         9    12 e     suv  
##  8 toyota       corolla      1.8  2008     4 manu… f        28    37 r     comp…
##  9 volkswagen   jetta        1.9  1999     4 manu… f        33    44 d     comp…
## 10 volkswagen   new beetle   1.9  1999     4 manu… f        35    44 d     subc…
## 11 volkswagen   new beetle   1.9  1999     4 auto… f        29    41 d     subc…

There are 11 potential outliers according to the percentiles method.

To reduce this number, you can set the percentiles to 1 and 99:

lower_bound <- quantile(dat$hwy, 0.01)
upper_bound <- quantile(dat$hwy, 0.99)

outlier_ind <- which(dat$hwy < lower_bound | dat$hwy > upper_bound)

dat[outlier_ind, ]

## # A tibble: 3 × 11
##   manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 volkswagen   jetta        1.9  1999     4 manua… f        33    44 d     comp…
## 2 volkswagen   new beetle   1.9  1999     4 manua… f        35    44 d     subc…
## 3 volkswagen   new beetle   1.9  1999     4 auto(… f        29    41 d     subc…

Setting the percentiles to 1 and 99 gives the same potential outliers as with the IQR criterion.

E. Z-scores

If your data come from a normal distribution, you can use the z-scores, which can be done with the scale() function in R.

According to this method, any z-score:

below -2 or above 2 is considered as rare below -3 or above 3 is considered as extremely rare

Others also use a z-score below -3.29 or above 3.29 to detect outliers. This value of 3.29 comes from the fact that 1 observation out of 1000 is out of this interval if the data follow a normal distribution.

In our case:

dat$z_hwy <- scale(dat$hwy)

hist(dat$z_hwy, col = "blue")

library(ggplot2)

ggplot(dat) +
  aes(x = z_hwy) +
  geom_histogram(bins = 30L, color = "red", fill = "#0c4c8a") +
  
  theme_minimal()

summary(dat$z_hwy)

##        V1          
##  Min.   :-1.92122  
##  1st Qu.:-0.91360  
##  Median : 0.09402  
##  Mean   : 0.00000  
##  3rd Qu.: 0.59782  
##  Max.   : 3.45274

We see that there are some observations above 3.29, but none below -3.29.

To identify the line in the dataset of these observations:

which(dat$z_hwy > 3.29)

## [1] 213 222

We see that observations 213 and 222 can be considered as outliers according to this method.

F. Hampel filter

Another method, known as Hampel filter, consists of considering as outliers the values outside the interval (I) formed by the median, plus or minus 3 median absolute deviations.

For this method we first set the interval limits thanks to the median() and mad()functions:

lower_bound <- median(dat$hwy) - 3 * mad(dat$hwy, constant = 1)
lower_bound

## [1] 9

upper_bound <- median(dat$hwy) + 3 * mad(dat$hwy, constant = 1)
upper_bound

## [1] 39

According to this method, all observations below 9 and above 39 will be considered as potential outliers.

The row numbers of the observations outside of the interval can then be extracted with the which() function:

outlier_ind <- which(dat$hwy < lower_bound | dat$hwy > upper_bound)
outlier_ind

## [1] 213 222 223

According to the Hampel filter, there are 3 outliers for the hwy variable.

2. Statistical tests

In this section, I will present 3 more formal techniques to detect outliers:

Grubbs’s test Dixon’s test Rosner’s test

These 3 statistical tests are part of more formal techniques of outliers detection as they all involve the computation of a test statistic that is compared to tabulated critical values (that are based on the sample size and the desired confidence level).

Note that the 3 tests are appropriate only when the data (without any outliers) are approximately normally distributed. The normality assumption must thus be verified before applying these tests for outliers (see how to test the normality assumption in R).

Grubbs’s test

The Grubbs test allows to detect whether the highest or lowest value in a dataset is an outlier.

The Grubbs test detects one outlier at a time (highest or lowest value), so the null and alternative hypotheses are as follows:

Null Hypothesis: The highest value is not an outlier

Alternative hypothesis: The highest value is an outlier

if we want to test the highest value, or:

Null Hypothesis : The lowest value is not an outlier Alternative hypothesis : The lowest value is an outlier if we want to test the lowest value.

As for any statistical test, if the p-value is less than the chosen significance threshold (generally α = 0.05) then the null hypothesis is rejected and we will conclude that the lowest/highest value is an outlier.

On the contrary,if the p-value is greater or equal than the significance level, the null hypothesis is not rejected, and we will conclude that, based on the data, we do not reject the hypothesis that the lowest/highest value is not an outlier.

Note that the Grubbs test is not appropriate for sample size of 6 or less (n≤6).

To perform the Grubbs test in R, we use the grubbs.test() function from the {outliers} package:

#install.packages("outliers")

library(outliers)
test <- grubbs.test(dat$hwy)
test

## 
##  Grubbs test for one outlier
## 
## data:  dat$hwy
## G = 3.45274, U = 0.94862, p-value = 0.05555
## alternative hypothesis: highest value 44 is an outlier

The p-value is 0.056. At the 5% significance level, we do not reject the hypothesis that the highest value 44 is not an outlier.

By default,the test is performed on the highest value (as shown in the R output: alternative hypothesis: highest value 44 is an outlier).

If you want to do the test for the lowest value, simply add the argument opposite = TRUE in the grubbs.test() function:

test <- grubbs.test(dat$hwy, opposite = TRUE)
test

## 
##  Grubbs test for one outlier
## 
## data:  dat$hwy
## G = 1.92122, U = 0.98409, p-value = 1
## alternative hypothesis: lowest value 12 is an outlier

The R output indicates that the test is now performed on the lowest value (see alternative hypothesis: lowest value 12 is an outlier).

The p-value is 1. At the 5% significance level, we do not reject the hypothesis that the lowest value 12 is not an outlier.

For the sake of illustration, we will now replace an observation with a more extreme value and perform the Grubbs test on this new dataset. Let’s replace the 34th row with a value of 212:

dat[34, "hwy"] <- 212

And we now apply the Grubbs test to test whether the highest value is an outlier:

test <- grubbs.test(dat$hwy)
test

## 
##  Grubbs test for one outlier
## 
## data:  dat$hwy
## G = 13.72240, U = 0.18836, p-value < 2.2e-16
## alternative hypothesis: highest value 212 is an outlier

The p-value is < 0.001. At the 5% significance level, we conclude that the highest value 212 is an outlier.

Dixon’s test

Similar to the Grubbs test, Dixon test is used to test whether a single low or high value is an outlier. So if more than one outliers is suspected, the test has to be performed on these suspected outliers individually.

Note that Dixon test is most useful for small sample size (usually n≤25).

To perform the Dixon’s test in R, we use the dixon.test() function from the {outliers} package. However, we restrict our dataset to the 20 first observations as the Dixon test can only be done on small sample size (R will throw an error and accepts only dataset of 3 to 30 observations):

subdat <- dat[1:20, ]
test <- dixon.test(subdat$hwy)
test

## 
##  Dixon test for outliers
## 
## data:  subdat$hwy
## Q = 0.57143, p-value = 0.006508
## alternative hypothesis: lowest value 15 is an outlier

The results show that the lowest value 15 is an outlier (p-value = 0.007).

To test for the highest value, simply add the opposite = TRUE argument to the dixon.test() function:

test <- dixon.test(subdat$hwy,
  opposite = TRUE
)
test

## 
##  Dixon test for outliers
## 
## data:  subdat$hwy
## Q = 0.25, p-value = 0.8582
## alternative hypothesis: highest value 31 is an outlier

The results show that the highest value 31 is not an outlier (p-value = 0.858).

It is a good practice to always check the results of the statistical test for outliers against the boxplot to make sure we tested all potential outliers:

out <- boxplot.stats(subdat$hwy)$out
boxplot(subdat$hwy,col = "blue",
  ylab = "hwy"
)
mtext(paste("Outliers: ", paste(out, collapse = ", ")))

From the boxplot, we see that we could also apply the Dixon test on the value 20 in addition to the value 15 done previously.

This can be done by finding the row number of the minimum value, excluding this row number from the dataset and then finally apply the Dixon test on this new data set:

# find and exclude lowest value
remove_ind <- which.min(subdat$hwy)
subsubdat <- subdat[-remove_ind, ]

# Dixon test on dataset without the minimum
test <- dixon.test(subsubdat$hwy)
test

## 
##  Dixon test for outliers
## 
## data:  subsubdat$hwy
## Q = 0.44444, p-value = 0.1297
## alternative hypothesis: lowest value 20 is an outlier

The results show that the second lowest value 20 is not an outlier (p-value = 0.13).

Rosner’s test

Rosner’s test for outliers has the advantages that:

it is used to detect several outliers at once (unlike Grubbs and Dixon test which must be performed iteratively to screen for multiple outliers), and it is designed to avoid the problem of masking, where an outlier that is close in value to another outlier can go undetected. Unlike Dixon test, note that Rosner test is most appropriate when the sample size is large (n≥20). We therefore use again the initial dataset dat, which includes 234 observations.

To perform the Rosner test we use the rosnerTest() function from the {EnvStats} package. This function requires at least 2 arguments: the data and the number of suspected outliers k (with k = 3 as the default number of suspected outliers).

For this example, we set the number of suspected outliers to be equal to 3, as suggested by the number of potential outliers outlined in the boxplot at the beginning of the article.

#install.packages("EnvStats")
library(EnvStats)

## 
## Attaching package: 'EnvStats'

## The following objects are masked from 'package:stats':
## 
##     predict, predict.lm

## The following object is masked from 'package:base':
## 
##     print.default

test <- rosnerTest(dat$hwy,
  k = 3
)
test

## $distribution
## [1] "Normal"
## 
## $statistic
##       R.1       R.2       R.3 
## 13.722399  3.459098  3.559936 
## 
## $sample.size
## [1] 234
## 
## $parameters
## k 
## 3 
## 
## $alpha
## [1] 0.05
## 
## $crit.value
## lambda.1 lambda.2 lambda.3 
## 3.652091 3.650836 3.649575 
## 
## $n.outliers
## [1] 1
## 
## $alternative
## [1] "Up to 3 observations are not\n                                 from the same Distribution."
## 
## $method
## [1] "Rosner's Test for Outliers"
## 
## $data
##   [1]  29  29  31  30  26  26  27  26  25  28  27  25  25  25  25  24  25  23
##  [19]  20  15  20  17  17  26  23  26  25  24  19  14  15  17  27 212  26  29
##  [37]  26  24  24  22  22  24  24  17  22  21  23  23  19  18  17  17  19  19
##  [55]  12  17  15  17  17  12  17  16  18  15  16  12  17  17  16  12  15  16
##  [73]  17  15  17  17  18  17  19  17  19  19  17  17  17  16  16  17  15  17
##  [91]  26  25  26  24  21  22  23  22  20  33  32  32  29  32  34  36  36  29
## [109]  26  27  30  31  26  26  28  26  29  28  27  24  24  24  22  19  20  17
## [127]  12  19  18  14  15  18  18  15  17  16  18  17  19  19  17  29  27  31
## [145]  32  27  26  26  25  25  17  17  20  18  26  26  27  28  25  25  24  27
## [163]  25  26  23  26  26  26  26  25  27  25  27  20  20  19  17  20  17  29
## [181]  27  31  31  26  26  28  27  29  31  31  26  26  27  30  33  35  37  35
## [199]  15  18  20  20  22  17  19  18  20  29  26  29  29  24  44  29  26  29
## [217]  29  29  29  23  24  44  41  29  26  28  29  29  29  28  29  26  26  26
## 
## $data.name
## [1] "dat$hwy"
## 
## $bad.obs
## [1] 0
## 
## $all.stats
##   i   Mean.i      SD.i Value Obs.Num     R.i+1 lambda.i+1 Outlier
## 1 0 24.21795 13.684345   212      34 13.722399   3.652091    TRUE
## 2 1 23.41202  5.951835    44     213  3.459098   3.650836   FALSE
## 3 2 23.32328  5.808172    44     222  3.559936   3.649575   FALSE
## 
## attr(,"class")
## [1] "gofOutlier"

The interesting results are provided in the $all.stats table:

test$all.stats

##   i   Mean.i      SD.i Value Obs.Num     R.i+1 lambda.i+1 Outlier
## 1 0 24.21795 13.684345   212      34 13.722399   3.652091    TRUE
## 2 1 23.41202  5.951835    44     213  3.459098   3.650836   FALSE
## 3 2 23.32328  5.808172    44     222  3.559936   3.649575   FALSE

Based on the Rosner test, we see that there is only one outlier (see the Outlier column), and that it is the observation 34 (see Obs.Num) with a value of 212 (see Value).

LOF (Local Outlier Factor): Proximity (density) Based Outlier Detection Technique

The proximity based outlier detection techniques falls under two major categories namely the density-based and the distance-based outlier detection.

Generally, in the proximity-based outlier detection technique an object is considered to be an outlier if it is distant from most other points.

This approach is more general and simpler than the statistical approaches, since it is easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution.

In this lesson we will only focus on discussing about the density based outlier detection technique. A simplest way to measure whether an object is distant from most other points in the dataset is to use the distance to the k-nearest neighbor or the LOF (Local Outlier Factor).

The LOF method is based on scoring outliers on the basis of the density in the neighborhood. This technique is based on a parameter known asoutlier score. The outlier score of an object is thereciprocalof thedensity in the object’s neighborhoodwheredensityis theaverage distanceto thek-nearest neighbors`.

In the LOF technique the local density of a point is compared with that of its neighbors. If the former is significantly lower than the latter i.e. if LOF is greater than one, then the point is in a sparser region than its neighbors, which suggests it to be an outlier.

The only limitation of LOF is that it works on numeric data only.

The complexity of this algorithm is 0 (N2) and another challenge of this algorithm is to select a right value of k which is not very obvious.

In the R, in package DMwR there is a function lofactor() which computes the local outlier factors using the LOF algorithm.

Let us consider an example below. The data set used in this example is the dat data set. Click Here or more information about the dat data set.

Here using the statistical package R we will try to identify outliers in the dat data set using the LOF algorithm.

#remotes::install_github("cran/DMwR")
library(DMwR)

## Loading required package: lattice

## Loading required package: grid

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Now we need to obtain some numeric data from the Iris data set which can be obtained using the following commands

Now we will obtain the outlier score with k= 5.

library(ggplot2)
dat <- ggplot2::mpg
head(dat)

## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# get numeric columns using dplyr() function            
dat2 <- print(select_if(dat, is.numeric))

##     displ year cyl cty hwy
## 1     1.8 1999   4  18  29
## 2     1.8 1999   4  21  29
## 3     2.0 2008   4  20  31
## 4     2.0 2008   4  21  30
## 5     2.8 1999   6  16  26
## 6     2.8 1999   6  18  26
## 7     3.1 2008   6  18  27
## 8     1.8 1999   4  18  26
## 9     1.8 1999   4  16  25
## 10    2.0 2008   4  20  28
## 11    2.0 2008   4  19  27
## 12    2.8 1999   6  15  25
## 13    2.8 1999   6  17  25
## 14    3.1 2008   6  17  25
## 15    3.1 2008   6  15  25
## 16    2.8 1999   6  15  24
## 17    3.1 2008   6  17  25
## 18    4.2 2008   8  16  23
## 19    5.3 2008   8  14  20
## 20    5.3 2008   8  11  15
## 21    5.3 2008   8  14  20
## 22    5.7 1999   8  13  17
## 23    6.0 2008   8  12  17
## 24    5.7 1999   8  16  26
## 25    5.7 1999   8  15  23
## 26    6.2 2008   8  16  26
## 27    6.2 2008   8  15  25
## 28    7.0 2008   8  15  24
## 29    5.3 2008   8  14  19
## 30    5.3 2008   8  11  14
## 31    5.7 1999   8  11  15
## 32    6.5 1999   8  14  17
## 33    2.4 1999   4  19  27
## 34    2.4 2008   4  22  30
## 35    3.1 1999   6  18  26
## 36    3.5 2008   6  18  29
## 37    3.6 2008   6  17  26
## 38    2.4 1999   4  18  24
## 39    3.0 1999   6  17  24
## 40    3.3 1999   6  16  22
## 41    3.3 1999   6  16  22
## 42    3.3 2008   6  17  24
## 43    3.3 2008   6  17  24
## 44    3.3 2008   6  11  17
## 45    3.8 1999   6  15  22
## 46    3.8 1999   6  15  21
## 47    3.8 2008   6  16  23
## 48    4.0 2008   6  16  23
## 49    3.7 2008   6  15  19
## 50    3.7 2008   6  14  18
## 51    3.9 1999   6  13  17
## 52    3.9 1999   6  14  17
## 53    4.7 2008   8  14  19
## 54    4.7 2008   8  14  19
## 55    4.7 2008   8   9  12
## 56    5.2 1999   8  11  17
## 57    5.2 1999   8  11  15
## 58    3.9 1999   6  13  17
## 59    4.7 2008   8  13  17
## 60    4.7 2008   8   9  12
## 61    4.7 2008   8  13  17
## 62    5.2 1999   8  11  16
## 63    5.7 2008   8  13  18
## 64    5.9 1999   8  11  15
## 65    4.7 2008   8  12  16
## 66    4.7 2008   8   9  12
## 67    4.7 2008   8  13  17
## 68    4.7 2008   8  13  17
## 69    4.7 2008   8  12  16
## 70    4.7 2008   8   9  12
## 71    5.2 1999   8  11  15
## 72    5.2 1999   8  11  16
## 73    5.7 2008   8  13  17
## 74    5.9 1999   8  11  15
## 75    4.6 1999   8  11  17
## 76    5.4 1999   8  11  17
## 77    5.4 2008   8  12  18
## 78    4.0 1999   6  14  17
## 79    4.0 1999   6  15  19
## 80    4.0 1999   6  14  17
## 81    4.0 2008   6  13  19
## 82    4.6 2008   8  13  19
## 83    5.0 1999   8  13  17
## 84    4.2 1999   6  14  17
## 85    4.2 1999   6  14  17
## 86    4.6 1999   8  13  16
## 87    4.6 1999   8  13  16
## 88    4.6 2008   8  13  17
## 89    5.4 1999   8  11  15
## 90    5.4 2008   8  13  17
## 91    3.8 1999   6  18  26
## 92    3.8 1999   6  18  25
## 93    4.0 2008   6  17  26
## 94    4.0 2008   6  16  24
## 95    4.6 1999   8  15  21
## 96    4.6 1999   8  15  22
## 97    4.6 2008   8  15  23
## 98    4.6 2008   8  15  22
## 99    5.4 2008   8  14  20
## 100   1.6 1999   4  28  33
## 101   1.6 1999   4  24  32
## 102   1.6 1999   4  25  32
## 103   1.6 1999   4  23  29
## 104   1.6 1999   4  24  32
## 105   1.8 2008   4  26  34
## 106   1.8 2008   4  25  36
## 107   1.8 2008   4  24  36
## 108   2.0 2008   4  21  29
## 109   2.4 1999   4  18  26
## 110   2.4 1999   4  18  27
## 111   2.4 2008   4  21  30
## 112   2.4 2008   4  21  31
## 113   2.5 1999   6  18  26
## 114   2.5 1999   6  18  26
## 115   3.3 2008   6  19  28
## 116   2.0 1999   4  19  26
## 117   2.0 1999   4  19  29
## 118   2.0 2008   4  20  28
## 119   2.0 2008   4  20  27
## 120   2.7 2008   6  17  24
## 121   2.7 2008   6  16  24
## 122   2.7 2008   6  17  24
## 123   3.0 2008   6  17  22
## 124   3.7 2008   6  15  19
## 125   4.0 1999   6  15  20
## 126   4.7 1999   8  14  17
## 127   4.7 2008   8   9  12
## 128   4.7 2008   8  14  19
## 129   5.7 2008   8  13  18
## 130   6.1 2008   8  11  14
## 131   4.0 1999   8  11  15
## 132   4.2 2008   8  12  18
## 133   4.4 2008   8  12  18
## 134   4.6 1999   8  11  15
## 135   5.4 1999   8  11  17
## 136   5.4 1999   8  11  16
## 137   5.4 2008   8  12  18
## 138   4.0 1999   6  14  17
## 139   4.0 2008   6  13  19
## 140   4.6 2008   8  13  19
## 141   5.0 1999   8  13  17
## 142   2.4 1999   4  21  29
## 143   2.4 1999   4  19  27
## 144   2.5 2008   4  23  31
## 145   2.5 2008   4  23  32
## 146   3.5 2008   6  19  27
## 147   3.5 2008   6  19  26
## 148   3.0 1999   6  18  26
## 149   3.0 1999   6  19  25
## 150   3.5 2008   6  19  25
## 151   3.3 1999   6  14  17
## 152   3.3 1999   6  15  17
## 153   4.0 2008   6  14  20
## 154   5.6 2008   8  12  18
## 155   3.1 1999   6  18  26
## 156   3.8 1999   6  16  26
## 157   3.8 1999   6  17  27
## 158   3.8 2008   6  18  28
## 159   5.3 2008   8  16  25
## 160   2.5 1999   4  18  25
## 161   2.5 1999   4  18  24
## 162   2.5 2008   4  20  27
## 163   2.5 2008   4  19  25
## 164   2.5 2008   4  20  26
## 165   2.5 2008   4  18  23
## 166   2.2 1999   4  21  26
## 167   2.2 1999   4  19  26
## 168   2.5 1999   4  19  26
## 169   2.5 1999   4  19  26
## 170   2.5 2008   4  20  25
## 171   2.5 2008   4  20  27
## 172   2.5 2008   4  19  25
## 173   2.5 2008   4  20  27
## 174   2.7 1999   4  15  20
## 175   2.7 1999   4  16  20
## 176   3.4 1999   6  15  19
## 177   3.4 1999   6  15  17
## 178   4.0 2008   6  16  20
## 179   4.7 2008   8  14  17
## 180   2.2 1999   4  21  29
## 181   2.2 1999   4  21  27
## 182   2.4 2008   4  21  31
## 183   2.4 2008   4  21  31
## 184   3.0 1999   6  18  26
## 185   3.0 1999   6  18  26
## 186   3.5 2008   6  19  28
## 187   2.2 1999   4  21  27
## 188   2.2 1999   4  21  29
## 189   2.4 2008   4  21  31
## 190   2.4 2008   4  22  31
## 191   3.0 1999   6  18  26
## 192   3.0 1999   6  18  26
## 193   3.3 2008   6  18  27
## 194   1.8 1999   4  24  30
## 195   1.8 1999   4  24  33
## 196   1.8 1999   4  26  35
## 197   1.8 2008   4  28  37
## 198   1.8 2008   4  26  35
## 199   4.7 1999   8  11  15
## 200   5.7 2008   8  13  18
## 201   2.7 1999   4  15  20
## 202   2.7 1999   4  16  20
## 203   2.7 2008   4  17  22
## 204   3.4 1999   6  15  17
## 205   3.4 1999   6  15  19
## 206   4.0 2008   6  15  18
## 207   4.0 2008   6  16  20
## 208   2.0 1999   4  21  29
## 209   2.0 1999   4  19  26
## 210   2.0 2008   4  21  29
## 211   2.0 2008   4  22  29
## 212   2.8 1999   6  17  24
## 213   1.9 1999   4  33  44
## 214   2.0 1999   4  21  29
## 215   2.0 1999   4  19  26
## 216   2.0 2008   4  22  29
## 217   2.0 2008   4  21  29
## 218   2.5 2008   5  21  29
## 219   2.5 2008   5  21  29
## 220   2.8 1999   6  16  23
## 221   2.8 1999   6  17  24
## 222   1.9 1999   4  35  44
## 223   1.9 1999   4  29  41
## 224   2.0 1999   4  21  29
## 225   2.0 1999   4  19  26
## 226   2.5 2008   5  20  28
## 227   2.5 2008   5  20  29
## 228   1.8 1999   4  21  29
## 229   1.8 1999   4  18  29
## 230   2.0 2008   4  19  28
## 231   2.0 2008   4  21  29
## 232   2.8 1999   6  16  26
## 233   2.8 1999   6  18  26
## 234   3.6 2008   6  17  26

dat2

## # A tibble: 234 × 5
##    displ  year   cyl   cty   hwy
##    <dbl> <int> <int> <int> <int>
##  1   1.8  1999     4    18    29
##  2   1.8  1999     4    21    29
##  3   2    2008     4    20    31
##  4   2    2008     4    21    30
##  5   2.8  1999     6    16    26
##  6   2.8  1999     6    18    26
##  7   3.1  2008     6    18    27
##  8   1.8  1999     4    18    26
##  9   1.8  1999     4    16    25
## 10   2    2008     4    20    28
## # … with 224 more rows

outlier.scores <- lofactor(dat2, k=5)

Now that you have computed the outlier scores you need to plot the density graph by typing the command. The density graph is shown below

plot(density(outlier.scores))

outliers <- order(outlier.scores, decreasing=T)[1:5]

print(outliers)

## [1] 157 149  92  91 103

You should now be able to identify the outliers.

The five outliers obtained in the output are the row numbers in the dat2 data derived from the dat2 data set.

To visualize the outliers with a biplot of the first two principal components use the following commands provided below:

n <- nrow(dat2)

labels <- 1:n

labels[-outliers] <- "."

biplot(prcomp(dat2), cex=.8, xlabs=labels)

Now we can use the pairs plot to visualize the outliers which are marked with a “+” sign in red by typing in the following commands:

pch <- rep(".", n)

pch[outliers] <- "+"

col <- rep("black", n)


col[outliers] <- "red"

pairs(dat2, pch=pch, col=col)

#install.packages("rgl")
library(rgl)
plot3d(dat2$displ, dat2$cty, dat2$hwy, type="s", col=col)

On the other hand, in the package Rlof, the function lof() which is a parallel implementation of the algorithm lofactor() provides two additional features namely supporting multiple values of k and several choices of distance metrics.

Since this package is not available for the windows system we will not discuss about the lof() function here. Mac and Linux users, if interested, should refer to the tutorial for the usage of the lof() function in the Rlof package.

Outliers detection in R

Alema Fissuh

2023-02-08

Outliers detection in R

Introduction

1. Descriptive statistics

A. Minimum and maximum

B. Histogram

C. Boxplot

D. Percentiles

E. Z-scores

F. Hampel filter

2. Statistical tests

Grubbs’s test

Dixon’s test

Rosner’s test

LOF (Local Outlier Factor): Proximity (density) Based Outlier Detection Technique