An outlier
is a value or an observation
that is distant from other observations, that is to say, a data point
that differs significantly from other data points.
An observation must always be compared to other observations made on the same phenomenon before actually calling it an outlier.
Outliers can also arise due to an experimental, measurement or encoding error.
The dataset mpg
from the {ggplot2}
package will be used to illustrate the different approaches of outliers
detection in R, and in particular we will focus on the variable hwy
(highway miles per gallon).
The first step to detect outliers in R is to start with some
descriptive statistics
, and in particular with the
minimum
and maximum
.
In R
, this can easily be done with the
summary()
function:
library(ggplot2)
dat <- ggplot2::mpg
summary(dat$hwy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 18.00 24.00 23.44 27.00 44.00
where the minimum
and maximum
are
respectively the first
and last values
in the
output above.
Alternatively, they can also be computed with the min()
and max()
, or range()
functions:
min(dat$hwy)
## [1] 12
max(dat$hwy)
## [1] 44
range(dat$hwy)
## [1] 12 44
Some clear encoding mistake
like a weight of 786 kg
(1733 pounds) for a human
will already
be easily detected
by this
very simple technique
.
Another basic way to detect outliers is
to draw a histogram
of the data.
Using R base (with the number of bins corresponding to the square root of the number of observations in order to have more bins than the default option):
hist(dat$hwy,
xlab = "hwy",
main = "Histogram of hwy",
breaks = sqrt(nrow(dat))
) # set number of bins
# histogram -
ggplot(data = dat, mapping = aes(x = hwy))+ # set data and axes
geom_histogram( # display histogram
color = "red", # bin line color
fill = "blue") # bin interior color (fill)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
or using ggplot2 (learn how to create plots with this package via the esquisse addin or via this tutorial):
library(ggplot2)
ggplot(dat) +
aes(x = hwy) +
geom_histogram(bins = 30L, color = "red", fill = "#0c4c8a") +
theme_minimal()
From the
histogram
, there seems to be a couple of
observations higher than
all other observations
(see the bar on the right side of
the plot).
In addition to histograms, boxplots
are also
useful to detect potential outliers
.
Using R base:
boxplot(dat$hwy,
ylab = "hwy"
)
or
using ggplot2
:
ggplot(dat) +
aes(x = "", y = hwy) +
geom_boxplot(fill = "blue") +
theme_minimal()
A
boxplot
helps to visualize a quantitative variable by
displaying five common location summary
(minimum, median,
first and third quartiles and maximum) and
any observation that was classified as a suspected outlier
using the interquartile range (IQR) criterion
.
Observations considered as potential outliers by the IQR criterion are displayed as points in the boxplot. Based on this criterion, there are 2 potential outliers (see the 2 points above the vertical line, at the top of the boxplot).
Remember that it is not because an observation is considered as a potential outlier by the IQR criterion that you should remove it.
*Removing or keeping an outlier*
depends on
the context of your analysis,
whether the tests you are going to perform on the dataset are robust to outliers or not, and
how far is the outlier from other observations.
It is also possible to extract the values of the potential outliers
based on the IQR criterion thanks to the
boxplot.stats()$out
function:
boxplot.stats(dat$hwy)$out
## [1] 44 44 41
As you can see, there are actually 3 points considered as potential
outliers
:
2 observations with a value of 44
and
1 observation with a value of 41
.
Thanks to the which()
function it is possible to
extract the row number
corresponding to
these outliers
:
out <- boxplot.stats(dat$hwy)$out
out_ind <- which(dat$hwy %in% c(out))
out_ind
## [1] 213 222 223
With this information you can now easily go back to the
specific rows in the
dataset
to verify them,
or print all variables for these outliers
:
dat[out_ind, ]
## # A tibble: 3 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 volkswagen jetta 1.9 1999 4 manua… f 33 44 d comp…
## 2 volkswagen new beetle 1.9 1999 4 manua… f 35 44 d subc…
## 3 volkswagen new beetle 1.9 1999 4 auto(… f 29 41 d subc…
It is also possible to print the values of the outliers
directly on boxplot
with the mtext()
function:
boxplot(dat$hwy, col = "blue",
ylab = "hwy",
main = "Boxplot of highway miles per gallon")
mtext(paste("Outliers: ", paste(out, collapse = ", ")))
This method of outliers detection
is based on the
percentiles
. With the percentiles method,
all observations that lie outside the interval
formed by
the 2.5 and 97.5 percentiles
will be considered as
potential outliers
.
Other percentiles such as the 1 and 99, or the 5 and 95
percentiles can also be considered
to construct the interval
.
The values of the lower and upper percentiles (and thus the lower and
upper limits of the interval) can be computed with the
quantile()
function:
lower_bound <- quantile(dat$hwy, 0.025)
lower_bound
## 2.5%
## 14
upper_bound <- quantile(dat$hwy, 0.975)
upper_bound
## 97.5%
## 35.175
According to this method,
all observations below 14 and above 35.175
will be
considered as potential outliers
.
The row numbers of the observations outside of the interval
can then be extracted with the which()
function:
outlier_ind <- which(dat$hwy < lower_bound | dat$hwy > upper_bound)
outlier_ind
## [1] 55 60 66 70 106 107 127 197 213 222 223
Then their values of highway miles per gallon
can be
printed
:
dat[outlier_ind, "hwy"]
## # A tibble: 11 × 1
## hwy
## <int>
## 1 12
## 2 12
## 3 12
## 4 12
## 5 36
## 6 36
## 7 12
## 8 37
## 9 44
## 10 44
## 11 41
Alternatively, all variables for these outliers
can be
printed
:
dat[outlier_ind, ]
## # A tibble: 11 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 dodge dakota pi… 4.7 2008 8 auto… 4 9 12 e pick…
## 2 dodge durango 4… 4.7 2008 8 auto… 4 9 12 e suv
## 3 dodge ram 1500 … 4.7 2008 8 auto… 4 9 12 e pick…
## 4 dodge ram 1500 … 4.7 2008 8 manu… 4 9 12 e pick…
## 5 honda civic 1.8 2008 4 auto… f 25 36 r subc…
## 6 honda civic 1.8 2008 4 auto… f 24 36 c subc…
## 7 jeep grand che… 4.7 2008 8 auto… 4 9 12 e suv
## 8 toyota corolla 1.8 2008 4 manu… f 28 37 r comp…
## 9 volkswagen jetta 1.9 1999 4 manu… f 33 44 d comp…
## 10 volkswagen new beetle 1.9 1999 4 manu… f 35 44 d subc…
## 11 volkswagen new beetle 1.9 1999 4 auto… f 29 41 d subc…
There are 11 potential outliers
according to the
percentiles method
.
To reduce this number
, you can
set the percentiles to 1 and 99
:
lower_bound <- quantile(dat$hwy, 0.01)
upper_bound <- quantile(dat$hwy, 0.99)
outlier_ind <- which(dat$hwy < lower_bound | dat$hwy > upper_bound)
dat[outlier_ind, ]
## # A tibble: 3 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 volkswagen jetta 1.9 1999 4 manua… f 33 44 d comp…
## 2 volkswagen new beetle 1.9 1999 4 manua… f 35 44 d subc…
## 3 volkswagen new beetle 1.9 1999 4 auto(… f 29 41 d subc…
Setting the percentiles to 1 and 99
gives the
same potential outliers
as with the
IQR criterion
.
If your data come from a normal distribution
, you can
use the z-scores
, which can be done with the
scale()
function in R
.
According to this method, any z-score
:
below -2 or above 2
is considered as rare
below -3 or above 3
is considered as
extremely rare
Others also use a z-score below -3.29
or
above 3.29
to detect outliers
. This value of
3.29
comes from the fact that
1 observation out of 1000
is
out of this interval
if the
data follow a normal distribution
.
In our case:
dat$z_hwy <- scale(dat$hwy)
hist(dat$z_hwy, col = "blue")
library(ggplot2)
ggplot(dat) +
aes(x = z_hwy) +
geom_histogram(bins = 30L, color = "red", fill = "#0c4c8a") +
theme_minimal()
summary(dat$z_hwy)
## V1
## Min. :-1.92122
## 1st Qu.:-0.91360
## Median : 0.09402
## Mean : 0.00000
## 3rd Qu.: 0.59782
## Max. : 3.45274
We see that there are some observations
above 3.29
, but none below -3.29
.
To identify the
line in the dataset of these observations
:
which(dat$z_hwy > 3.29)
## [1] 213 222
We see that observations 213 and 222
can be considered
as outliers
according to this method.
Another method, known as Hampel filter, consists of considering as outliers the values outside the interval (I) formed by the median, plus or minus 3 median absolute deviations.
For this method we first set the interval limits thanks to the
median()
and mad()
functions:
lower_bound <- median(dat$hwy) - 3 * mad(dat$hwy, constant = 1)
lower_bound
## [1] 9
upper_bound <- median(dat$hwy) + 3 * mad(dat$hwy, constant = 1)
upper_bound
## [1] 39
According to this method,
all observations below 9 and above 39
will be considered as
potential outliers
.
The
row numbers of the observations outside of the interval
can
then be extracted with the which()
function:
outlier_ind <- which(dat$hwy < lower_bound | dat$hwy > upper_bound)
outlier_ind
## [1] 213 222 223
According to the Hampel filter
, there are
3 outliers
for the hwy variable.
In this section, I will present
3 more formal techniques to detect outliers
:
Grubbs’s test
Dixon’s test
Rosner’s test
These 3 statistical tests are part of
more formal techniques of outliers detection
as they all
involve the computation of a test statistic that is compared to
tabulated critical values (that are based on the sample size and the
desired confidence level).
Note that the 3 tests are appropriate only when the data (without any outliers) are approximately normally distributed. The normality assumption must thus be verified before applying these tests for outliers (see how to test the normality assumption in R).
The Grubbs test allows to detect whether the highest or lowest value in a dataset is an outlier.
The Grubbs test detects one outlier at a time (highest or lowest value), so the null and alternative hypotheses are as follows:
Null Hypothesis
:
The highest value is not an outlier
Alternative hypothesis
:
The highest value is an outlier
if we want to test the highest value, or:
Null Hypothesis
:
The lowest value is not an outlier
Alternative hypothesis
:
The lowest value is an outlier
if we want to test the
lowest value.
As for any statistical test, if the p-value is less than
the chosen significance threshold (generally α = 0.05)
then
the null hypothesis is rejected
and we will conclude that
the lowest/highest value is an outlier
.
On the
contrary,if the p-value is greater or equal than the significance level
,
the null hypothesis is not rejected
, and
we will conclude
that, based on the data,
we do not reject the hypothesis that the lowest/highest value
is not an outlier
.
Note that the
Grubbs test is not appropriate for sample size of 6 or less (n≤6)
.
To perform the Grubbs test in R
, we use the
grubbs.test()
function from the {outliers}
package:
#install.packages("outliers")
library(outliers)
test <- grubbs.test(dat$hwy)
test
##
## Grubbs test for one outlier
##
## data: dat$hwy
## G = 3.45274, U = 0.94862, p-value = 0.05555
## alternative hypothesis: highest value 44 is an outlier
The p-value is 0.056
. At the
5% significance level
, we do not reject
the
hypothesis that the highest value 44 is not an outlier
.
By default,the test is performed on the highest value (as shown in the R output: alternative hypothesis: highest value 44 is an outlier).
If you want to do the test for the lowest value, simply add the argument opposite = TRUE in the grubbs.test() function:
test <- grubbs.test(dat$hwy, opposite = TRUE)
test
##
## Grubbs test for one outlier
##
## data: dat$hwy
## G = 1.92122, U = 0.98409, p-value = 1
## alternative hypothesis: lowest value 12 is an outlier
The R output indicates that the test is now performed on the lowest value (see alternative hypothesis: lowest value 12 is an outlier).
The p-value is 1
.
At the 5% significance level, we do not reject the hypothesis
that the lowest value 12 is not an outlier
.
For the sake of illustration, we will now replace an observation with a more extreme value and perform the Grubbs test on this new dataset. Let’s replace the 34th row with a value of 212:
dat[34, "hwy"] <- 212
And we now apply the Grubbs test to test whether the highest value is an outlier:
test <- grubbs.test(dat$hwy)
test
##
## Grubbs test for one outlier
##
## data: dat$hwy
## G = 13.72240, U = 0.18836, p-value < 2.2e-16
## alternative hypothesis: highest value 212 is an outlier
The p-value is < 0.001
.
At the 5% significance level
, we conclude that the highest
value 212
is an outlier
.
Similar to the Grubbs test, Dixon test
is used to
test
whether a single low or high value
is an
outlier
. So
if more than one outliers is suspected
, the test
has to be performed on these suspected outliers individually
.
Note that Dixon test is most useful for small sample size (usually n≤25).
To perform the Dixon’s test in R
, we use the
dixon.test()
function from the
{outliers} package
. However, we restrict our
dataset to the 20 first observations as the Dixon test can only be done on small sample size
(R will throw an error and accepts only dataset of 3 to 30
observations):
subdat <- dat[1:20, ]
test <- dixon.test(subdat$hwy)
test
##
## Dixon test for outliers
##
## data: subdat$hwy
## Q = 0.57143, p-value = 0.006508
## alternative hypothesis: lowest value 15 is an outlier
The results show that
the lowest value 15 is an outlier (p-value = 0.007)
.
To test for the highest value
, simply
add the opposite = TRUE
argument to the
dixon.test()
function:
test <- dixon.test(subdat$hwy,
opposite = TRUE
)
test
##
## Dixon test for outliers
##
## data: subdat$hwy
## Q = 0.25, p-value = 0.8582
## alternative hypothesis: highest value 31 is an outlier
The results show that
the highest value 31 is not an outlier (p-value = 0.858)
.
It is a good practice to always check the results of the statistical test for outliers against the boxplot to make sure we tested all potential outliers:
out <- boxplot.stats(subdat$hwy)$out
boxplot(subdat$hwy,col = "blue",
ylab = "hwy"
)
mtext(paste("Outliers: ", paste(out, collapse = ", ")))
From the
boxplot
, we see that we could also apply the
Dixon test
on the value 20 in addition to the value 15 done
previously.
This can be done by
finding the row number of the minimum value
,
excluding this row number from the dataset
and then
finally apply the Dixon test
on this new data set:
# find and exclude lowest value
remove_ind <- which.min(subdat$hwy)
subsubdat <- subdat[-remove_ind, ]
# Dixon test on dataset without the minimum
test <- dixon.test(subsubdat$hwy)
test
##
## Dixon test for outliers
##
## data: subsubdat$hwy
## Q = 0.44444, p-value = 0.1297
## alternative hypothesis: lowest value 20 is an outlier
The results show that the
second lowest value 20 is not an outlier
(p-value =
0.13).
Rosner’s test for outliers has the advantages that:
it is used to detect several outliers at once (unlike Grubbs and Dixon test which must be performed iteratively to screen for multiple outliers), and it is designed to avoid the problem of masking, where an outlier that is close in value to another outlier can go undetected. Unlike Dixon test, note that Rosner test is most appropriate when the sample size is large (n≥20). We therefore use again the initial dataset dat, which includes 234 observations.
To perform the Rosner test we use the rosnerTest() function from the {EnvStats} package. This function requires at least 2 arguments: the data and the number of suspected outliers k (with k = 3 as the default number of suspected outliers).
For this example, we set the number of suspected outliers to be equal to 3, as suggested by the number of potential outliers outlined in the boxplot at the beginning of the article.
#install.packages("EnvStats")
library(EnvStats)
##
## Attaching package: 'EnvStats'
## The following objects are masked from 'package:stats':
##
## predict, predict.lm
## The following object is masked from 'package:base':
##
## print.default
test <- rosnerTest(dat$hwy,
k = 3
)
test
## $distribution
## [1] "Normal"
##
## $statistic
## R.1 R.2 R.3
## 13.722399 3.459098 3.559936
##
## $sample.size
## [1] 234
##
## $parameters
## k
## 3
##
## $alpha
## [1] 0.05
##
## $crit.value
## lambda.1 lambda.2 lambda.3
## 3.652091 3.650836 3.649575
##
## $n.outliers
## [1] 1
##
## $alternative
## [1] "Up to 3 observations are not\n from the same Distribution."
##
## $method
## [1] "Rosner's Test for Outliers"
##
## $data
## [1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23
## [19] 20 15 20 17 17 26 23 26 25 24 19 14 15 17 27 212 26 29
## [37] 26 24 24 22 22 24 24 17 22 21 23 23 19 18 17 17 19 19
## [55] 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16 12 15 16
## [73] 17 15 17 17 18 17 19 17 19 19 17 17 17 16 16 17 15 17
## [91] 26 25 26 24 21 22 23 22 20 33 32 32 29 32 34 36 36 29
## [109] 26 27 30 31 26 26 28 26 29 28 27 24 24 24 22 19 20 17
## [127] 12 19 18 14 15 18 18 15 17 16 18 17 19 19 17 29 27 31
## [145] 32 27 26 26 25 25 17 17 20 18 26 26 27 28 25 25 24 27
## [163] 25 26 23 26 26 26 26 25 27 25 27 20 20 19 17 20 17 29
## [181] 27 31 31 26 26 28 27 29 31 31 26 26 27 30 33 35 37 35
## [199] 15 18 20 20 22 17 19 18 20 29 26 29 29 24 44 29 26 29
## [217] 29 29 29 23 24 44 41 29 26 28 29 29 29 28 29 26 26 26
##
## $data.name
## [1] "dat$hwy"
##
## $bad.obs
## [1] 0
##
## $all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 24.21795 13.684345 212 34 13.722399 3.652091 TRUE
## 2 1 23.41202 5.951835 44 213 3.459098 3.650836 FALSE
## 3 2 23.32328 5.808172 44 222 3.559936 3.649575 FALSE
##
## attr(,"class")
## [1] "gofOutlier"
The interesting results are provided in the $all.stats table
:
test$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 24.21795 13.684345 212 34 13.722399 3.652091 TRUE
## 2 1 23.41202 5.951835 44 213 3.459098 3.650836 FALSE
## 3 2 23.32328 5.808172 44 222 3.559936 3.649575 FALSE
Based on the Rosner test
, we see that
there is only one outlier
(see the Outlier column), and
that it is the observation 34
(see Obs.Num) with a value of
212 (see Value).
The proximity based outlier detection techniques
falls
under two major categories namely the density-based
and the
distance-based outlier detection
.
Generally, in the
proximity-based outlier detection technique
an object is
considered to be
an outlier if it is distant from most other points
.
This approach is more general and simpler than the statistical approaches, since it is easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution.
In this lesson we will only focus on discussing about
the density based
outlier detection technique
.
A simplest way to measure whether an object is distant from most other
points in the dataset is to use the distance
to the
k-nearest neighbor
or the
LOF (Local Outlier Factor)
.
The LOF method is based on scoring outliers
on the basis
of the density in the
neighborhood. This technique is based on a parameter known as
outlier
score. The outlier score of an object is the
reciprocalof the
density
in the object’s
neighborhoodwhere
densityis the
average
distanceto the
k-nearest neighbors`.
In the LOF technique the local density of a point is compared with that of its neighbors. If the former is significantly lower than the latter i.e. if LOF is greater than one, then the point is in a sparser region than its neighbors, which suggests it to be an outlier.
The only limitation of LOF is that it works on numeric data only
.
The complexity of this algorithm is 0 (N2)
and another
challenge of this algorithm is
to select a right value of k which is not very obvious
.
In the R, in package DMwR
there is a function
lofactor()
which computes the
local outlier factors
using the
LOF algorithm
.
Let us consider an example below. The data set used in this example is the dat data set. Click Here or more information about the dat data set.
Here using the statistical package R we will try to identify outliers in the dat data set using the LOF algorithm.
#remotes::install_github("cran/DMwR")
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
Now we need to obtain some numeric data from the Iris data set which can be obtained using the following commands
Now we will obtain the outlier score with k= 5.
library(ggplot2)
dat <- ggplot2::mpg
head(dat)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# get numeric columns using dplyr() function
dat2 <- print(select_if(dat, is.numeric))
## displ year cyl cty hwy
## 1 1.8 1999 4 18 29
## 2 1.8 1999 4 21 29
## 3 2.0 2008 4 20 31
## 4 2.0 2008 4 21 30
## 5 2.8 1999 6 16 26
## 6 2.8 1999 6 18 26
## 7 3.1 2008 6 18 27
## 8 1.8 1999 4 18 26
## 9 1.8 1999 4 16 25
## 10 2.0 2008 4 20 28
## 11 2.0 2008 4 19 27
## 12 2.8 1999 6 15 25
## 13 2.8 1999 6 17 25
## 14 3.1 2008 6 17 25
## 15 3.1 2008 6 15 25
## 16 2.8 1999 6 15 24
## 17 3.1 2008 6 17 25
## 18 4.2 2008 8 16 23
## 19 5.3 2008 8 14 20
## 20 5.3 2008 8 11 15
## 21 5.3 2008 8 14 20
## 22 5.7 1999 8 13 17
## 23 6.0 2008 8 12 17
## 24 5.7 1999 8 16 26
## 25 5.7 1999 8 15 23
## 26 6.2 2008 8 16 26
## 27 6.2 2008 8 15 25
## 28 7.0 2008 8 15 24
## 29 5.3 2008 8 14 19
## 30 5.3 2008 8 11 14
## 31 5.7 1999 8 11 15
## 32 6.5 1999 8 14 17
## 33 2.4 1999 4 19 27
## 34 2.4 2008 4 22 30
## 35 3.1 1999 6 18 26
## 36 3.5 2008 6 18 29
## 37 3.6 2008 6 17 26
## 38 2.4 1999 4 18 24
## 39 3.0 1999 6 17 24
## 40 3.3 1999 6 16 22
## 41 3.3 1999 6 16 22
## 42 3.3 2008 6 17 24
## 43 3.3 2008 6 17 24
## 44 3.3 2008 6 11 17
## 45 3.8 1999 6 15 22
## 46 3.8 1999 6 15 21
## 47 3.8 2008 6 16 23
## 48 4.0 2008 6 16 23
## 49 3.7 2008 6 15 19
## 50 3.7 2008 6 14 18
## 51 3.9 1999 6 13 17
## 52 3.9 1999 6 14 17
## 53 4.7 2008 8 14 19
## 54 4.7 2008 8 14 19
## 55 4.7 2008 8 9 12
## 56 5.2 1999 8 11 17
## 57 5.2 1999 8 11 15
## 58 3.9 1999 6 13 17
## 59 4.7 2008 8 13 17
## 60 4.7 2008 8 9 12
## 61 4.7 2008 8 13 17
## 62 5.2 1999 8 11 16
## 63 5.7 2008 8 13 18
## 64 5.9 1999 8 11 15
## 65 4.7 2008 8 12 16
## 66 4.7 2008 8 9 12
## 67 4.7 2008 8 13 17
## 68 4.7 2008 8 13 17
## 69 4.7 2008 8 12 16
## 70 4.7 2008 8 9 12
## 71 5.2 1999 8 11 15
## 72 5.2 1999 8 11 16
## 73 5.7 2008 8 13 17
## 74 5.9 1999 8 11 15
## 75 4.6 1999 8 11 17
## 76 5.4 1999 8 11 17
## 77 5.4 2008 8 12 18
## 78 4.0 1999 6 14 17
## 79 4.0 1999 6 15 19
## 80 4.0 1999 6 14 17
## 81 4.0 2008 6 13 19
## 82 4.6 2008 8 13 19
## 83 5.0 1999 8 13 17
## 84 4.2 1999 6 14 17
## 85 4.2 1999 6 14 17
## 86 4.6 1999 8 13 16
## 87 4.6 1999 8 13 16
## 88 4.6 2008 8 13 17
## 89 5.4 1999 8 11 15
## 90 5.4 2008 8 13 17
## 91 3.8 1999 6 18 26
## 92 3.8 1999 6 18 25
## 93 4.0 2008 6 17 26
## 94 4.0 2008 6 16 24
## 95 4.6 1999 8 15 21
## 96 4.6 1999 8 15 22
## 97 4.6 2008 8 15 23
## 98 4.6 2008 8 15 22
## 99 5.4 2008 8 14 20
## 100 1.6 1999 4 28 33
## 101 1.6 1999 4 24 32
## 102 1.6 1999 4 25 32
## 103 1.6 1999 4 23 29
## 104 1.6 1999 4 24 32
## 105 1.8 2008 4 26 34
## 106 1.8 2008 4 25 36
## 107 1.8 2008 4 24 36
## 108 2.0 2008 4 21 29
## 109 2.4 1999 4 18 26
## 110 2.4 1999 4 18 27
## 111 2.4 2008 4 21 30
## 112 2.4 2008 4 21 31
## 113 2.5 1999 6 18 26
## 114 2.5 1999 6 18 26
## 115 3.3 2008 6 19 28
## 116 2.0 1999 4 19 26
## 117 2.0 1999 4 19 29
## 118 2.0 2008 4 20 28
## 119 2.0 2008 4 20 27
## 120 2.7 2008 6 17 24
## 121 2.7 2008 6 16 24
## 122 2.7 2008 6 17 24
## 123 3.0 2008 6 17 22
## 124 3.7 2008 6 15 19
## 125 4.0 1999 6 15 20
## 126 4.7 1999 8 14 17
## 127 4.7 2008 8 9 12
## 128 4.7 2008 8 14 19
## 129 5.7 2008 8 13 18
## 130 6.1 2008 8 11 14
## 131 4.0 1999 8 11 15
## 132 4.2 2008 8 12 18
## 133 4.4 2008 8 12 18
## 134 4.6 1999 8 11 15
## 135 5.4 1999 8 11 17
## 136 5.4 1999 8 11 16
## 137 5.4 2008 8 12 18
## 138 4.0 1999 6 14 17
## 139 4.0 2008 6 13 19
## 140 4.6 2008 8 13 19
## 141 5.0 1999 8 13 17
## 142 2.4 1999 4 21 29
## 143 2.4 1999 4 19 27
## 144 2.5 2008 4 23 31
## 145 2.5 2008 4 23 32
## 146 3.5 2008 6 19 27
## 147 3.5 2008 6 19 26
## 148 3.0 1999 6 18 26
## 149 3.0 1999 6 19 25
## 150 3.5 2008 6 19 25
## 151 3.3 1999 6 14 17
## 152 3.3 1999 6 15 17
## 153 4.0 2008 6 14 20
## 154 5.6 2008 8 12 18
## 155 3.1 1999 6 18 26
## 156 3.8 1999 6 16 26
## 157 3.8 1999 6 17 27
## 158 3.8 2008 6 18 28
## 159 5.3 2008 8 16 25
## 160 2.5 1999 4 18 25
## 161 2.5 1999 4 18 24
## 162 2.5 2008 4 20 27
## 163 2.5 2008 4 19 25
## 164 2.5 2008 4 20 26
## 165 2.5 2008 4 18 23
## 166 2.2 1999 4 21 26
## 167 2.2 1999 4 19 26
## 168 2.5 1999 4 19 26
## 169 2.5 1999 4 19 26
## 170 2.5 2008 4 20 25
## 171 2.5 2008 4 20 27
## 172 2.5 2008 4 19 25
## 173 2.5 2008 4 20 27
## 174 2.7 1999 4 15 20
## 175 2.7 1999 4 16 20
## 176 3.4 1999 6 15 19
## 177 3.4 1999 6 15 17
## 178 4.0 2008 6 16 20
## 179 4.7 2008 8 14 17
## 180 2.2 1999 4 21 29
## 181 2.2 1999 4 21 27
## 182 2.4 2008 4 21 31
## 183 2.4 2008 4 21 31
## 184 3.0 1999 6 18 26
## 185 3.0 1999 6 18 26
## 186 3.5 2008 6 19 28
## 187 2.2 1999 4 21 27
## 188 2.2 1999 4 21 29
## 189 2.4 2008 4 21 31
## 190 2.4 2008 4 22 31
## 191 3.0 1999 6 18 26
## 192 3.0 1999 6 18 26
## 193 3.3 2008 6 18 27
## 194 1.8 1999 4 24 30
## 195 1.8 1999 4 24 33
## 196 1.8 1999 4 26 35
## 197 1.8 2008 4 28 37
## 198 1.8 2008 4 26 35
## 199 4.7 1999 8 11 15
## 200 5.7 2008 8 13 18
## 201 2.7 1999 4 15 20
## 202 2.7 1999 4 16 20
## 203 2.7 2008 4 17 22
## 204 3.4 1999 6 15 17
## 205 3.4 1999 6 15 19
## 206 4.0 2008 6 15 18
## 207 4.0 2008 6 16 20
## 208 2.0 1999 4 21 29
## 209 2.0 1999 4 19 26
## 210 2.0 2008 4 21 29
## 211 2.0 2008 4 22 29
## 212 2.8 1999 6 17 24
## 213 1.9 1999 4 33 44
## 214 2.0 1999 4 21 29
## 215 2.0 1999 4 19 26
## 216 2.0 2008 4 22 29
## 217 2.0 2008 4 21 29
## 218 2.5 2008 5 21 29
## 219 2.5 2008 5 21 29
## 220 2.8 1999 6 16 23
## 221 2.8 1999 6 17 24
## 222 1.9 1999 4 35 44
## 223 1.9 1999 4 29 41
## 224 2.0 1999 4 21 29
## 225 2.0 1999 4 19 26
## 226 2.5 2008 5 20 28
## 227 2.5 2008 5 20 29
## 228 1.8 1999 4 21 29
## 229 1.8 1999 4 18 29
## 230 2.0 2008 4 19 28
## 231 2.0 2008 4 21 29
## 232 2.8 1999 6 16 26
## 233 2.8 1999 6 18 26
## 234 3.6 2008 6 17 26
dat2
## # A tibble: 234 × 5
## displ year cyl cty hwy
## <dbl> <int> <int> <int> <int>
## 1 1.8 1999 4 18 29
## 2 1.8 1999 4 21 29
## 3 2 2008 4 20 31
## 4 2 2008 4 21 30
## 5 2.8 1999 6 16 26
## 6 2.8 1999 6 18 26
## 7 3.1 2008 6 18 27
## 8 1.8 1999 4 18 26
## 9 1.8 1999 4 16 25
## 10 2 2008 4 20 28
## # … with 224 more rows
outlier.scores <- lofactor(dat2, k=5)
Now that you have computed the outlier scores you need to plot the density graph by typing the command. The density graph is shown below
plot(density(outlier.scores))
outliers <- order(outlier.scores, decreasing=T)[1:5]
print(outliers)
## [1] 157 149 92 91 103
You should now be able to identify the outliers.
The five outliers obtained in the output are the row numbers in the dat2 data derived from the dat2 data set.
To visualize the outliers with a biplot of the first two principal components use the following commands provided below:
n <- nrow(dat2)
labels <- 1:n
labels[-outliers] <- "."
biplot(prcomp(dat2), cex=.8, xlabs=labels)
Now we can use the pairs plot to visualize the outliers which are marked
with a “+” sign in red by typing in the following commands:
pch <- rep(".", n)
pch[outliers] <- "+"
col <- rep("black", n)
col[outliers] <- "red"
pairs(dat2, pch=pch, col=col)
#install.packages("rgl")
library(rgl)
plot3d(dat2$displ, dat2$cty, dat2$hwy, type="s", col=col)
On the other hand, in the package Rlof, the function lof() which is a parallel implementation of the algorithm lofactor() provides two additional features namely supporting multiple values of k and several choices of distance metrics.
Since this package is not available for the windows system we will not discuss about the lof() function here. Mac and Linux users, if interested, should refer to the tutorial for the usage of the lof() function in the Rlof package.