Outlier Detection

One of the first steps towards a coherent analysis is the detection of outlying observations. Although outliers are often treated as errors or noise, they may carry important information (see Mandelbrot/Taleb).

Detected outliers are candidates for aberrant data which, if left unexamined, may lead to model misspecification, biased parameter estimates and incorrect results. It is therefore important to identify them prior to modelling and analysis.

Applications of Outlier Detection

Outlier detection methods have been suggested for numerous applications, such as credit card fraud detection, clinical trials, voting irregularity analysis, data cleansing, network intrusion, severe weather prediction, geographic information systems and athlete performance analysis.

Grubbs’ Test

Hypotheses

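Grubbs’ test detects a single outlier in a univariate data set that is assumed to be approximately normally distributed. It is defined for the following hypotheses: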

[H0] : There are no outliers in the data set
[Ha] : There is exactly one outlier in the data set


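    # Install once, then load the package
    # (package author: Lukasz Komsta, UMLUB, Poland)
    install.packages("outliers")
    library(outliers)

    # myData is a placeholder for a numeric vector of observations
    grubbs.test(myData)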

    library(outliers)
    set.seed(1234)
    # 99 draws from N(15, 1) plus one planted extreme value of 20
    X <- c(rnorm(99, 15, 1), 20)
    grubbs.test(X)
    ## 
    ##  Grubbs test for one outlier
    ## 
    ## data:  X
    ## G = 4.63470, U = 0.78083, p-value = 4.517e-05
    ## alternative hypothesis: highest value 20 is an outlier
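
The G statistic reported above can be checked by hand. The following is a minimal sketch in base R, assuming the one-sided form of the test: G is the distance of the most extreme observation from the sample mean, measured in sample standard deviations, and the critical value uses the usual t-distribution formula for Grubbs' test. The names idx, G, t_crit and G_crit are illustrative, and alpha = 0.05 is an assumed significance level, not one taken from the text.

```r
# Recreate the example data: 99 draws from N(15, 1) plus one value of 20
set.seed(1234)
X <- c(rnorm(99, 15, 1), 20)
n <- length(X)

# Index of the observation furthest from the sample mean
idx <- which.max(abs(X - mean(X)))

# Grubbs statistic: that distance measured in sample standard deviations
G <- abs(X[idx] - mean(X)) / sd(X)

# One-sided critical value at an assumed significance level alpha
alpha <- 0.05
t_crit <- qt(1 - alpha / n, df = n - 2)
G_crit <- ((n - 1) / sqrt(n)) * sqrt(t_crit^2 / (n - 2 + t_crit^2))

# Reject H0 (no outliers) when G exceeds the critical value
print(c(G = G, G_crit = G_crit))
print(G > G_crit)
```

If the sketch matches what the package does, G should agree with the value printed by grubbs.test(X).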

Outliers on Boxplots

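Boxplots can be used to identify potential outliers. However, a boxplot classifies outliers by a different criterion from Grubbs’ test (by default, R draws as individual points any observations lying more than 1.5 times the interquartile range beyond the edges of the box), so the two approaches may not always agree on particular cases.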

    boxplot(X, col = "lightblue", pch = 16, horizontal = TRUE)
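
To see which observations the boxplot draws as individual points, the values can be listed with boxplot.stats() from base R. This is a minimal sketch using the same simulated data; the names bs, flagged and whiskers are illustrative, and coef = 1.5 is simply the function's default written out.

```r
# Recreate the example data: 99 draws from N(15, 1) plus one value of 20
set.seed(1234)
X <- c(rnorm(99, 15, 1), 20)

# boxplot.stats() computes the statistics that boxplot() draws
bs <- boxplot.stats(X, coef = 1.5)

# Observations lying beyond the whiskers
flagged <- bs$out
print(flagged)

# Whisker ends: the most extreme observations not flagged as outliers
whiskers <- bs$stats[c(1, 5)]
print(whiskers)
```

The boxplot rule may flag several points at once, whereas Grubbs' test as stated above considers exactly one outlier, which is one reason the two approaches can disagree.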