Introduction

In this document, we use two popular algorithms, Local Outlier Factor (LOF) and Random Forest, to detect outliers in a dataset and plot the results. We also apply the Grubbs test to the individual columns of the generated data.

Setting Up Environment

We use the ‘DMwR’ package for the LOF function, the ‘outliers’ package for the Grubbs test, and the ‘CORElearn’ package for random forest. If you have not installed these three packages, please install them first.
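As a minimal sketch, the following base R snippet reports which of the three packages are not yet installed. The variable names `pkgs` and `missing_pkgs` are only illustrative, and the install line is left commented out so that nothing is downloaded by accident. If `install.packages` reports a package as unavailable for your R version, it may have to be installed from the CRAN archive instead.

```r
# Packages used in this document
pkgs <- c("DMwR", "outliers", "CORElearn")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All three packages are installed.")
}
```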

After that, we have to load them.

library(DMwR)

## Loading required package: lattice

## Loading required package: grid

library(outliers)
library(CORElearn)

Use LOF Algorithm to Detect Outliers and Plot them

Generation of the DataSet

We randomly generate three groups of points, where each group contains 150 points drawn with its own mean and standard deviation. Together, the points are represented as a 450x3 numeric matrix.

set.seed(1234)
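# Draw n points in 3D; coordinate j is normal with mean[j] and sd[j]
gen.xyz <- function(n, mean, sd) {
    cbind(rnorm(n, mean[1], sd[1]),
          rnorm(n, mean[2], sd[2]),
          rnorm(n, mean[3], sd[3]))
}

# Stack three groups of 150 points each into one 450x3 matrix
ArtifData <- rbind(gen.xyz(150, c(0, 0, 0), c(.2, .2, .2)),
                   gen.xyz(150, c(2.5, 0, 1), c(.4, .2, .6)),
                   gen.xyz(150, c(1.25, .5, .1), c(.3, .2, .5)))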
str(ArtifData)

##  num [1:450, 1:3] -0.2414 0.0555 0.2169 -0.4691 0.0858 ...

head(ArtifData)

##             [,1]        [,2]         [,3]
## [1,] -0.24141315 -0.07544753 -0.115991398
## [2,]  0.05548585  0.01952389 -0.190655740
## [3,]  0.21688824  0.32774893 -0.035885717
## [4,] -0.46913954 -0.17511849  0.201961643
## [5,]  0.08582494  0.02435200  0.004725323
## [6,]  0.10121118  0.27242613 -0.129805644

Use LOF algorithm to calculate the Outlier score

With the dataset in place, we use the LOF algorithm to compute a local outlier factor for each case (row), based on its k = 5 nearest neighbours, and plot the density of the resulting scores.

outlier.scores <- lofactor(ArtifData, k=5)
plot(density(outlier.scores));

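In the density plot above, most scores form a single peak and the outliers lie in one long tail on the right. A score close to 1 means that a point is about as densely surrounded as its neighbours, while larger scores indicate points in sparser regions.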
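To make the meaning of the score concrete, here is a minimal base R sketch of the LOF computation. It is not the ‘DMwR’ implementation: the function name `lof_sketch` is hypothetical, ties in the k-th neighbour distance are ignored, and duplicate points (zero distances) are not handled, so its values may differ slightly from those of `lofactor`.

```r
# Simplified LOF: x is a numeric matrix (one row per point), k >= 1
lof_sketch <- function(x, k = 5) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  diag(d) <- Inf            # a point is not its own neighbour

  # nn[i, ] holds the indices of the k nearest neighbours of point i
  nn <- matrix(0L, n, k)
  for (i in seq_len(n)) nn[i, ] <- order(d[i, ])[seq_len(k)]

  # k-distance: distance from each point to its k-th nearest neighbour
  kdist <- vapply(seq_len(n), function(i) d[i, nn[i, k]], numeric(1))

  # Local reachability density: inverse of the mean reachability distance,
  # where reach(i, j) = max(k-distance of j, distance between i and j)
  lrd <- vapply(seq_len(n), function(i) {
    reach <- pmax(kdist[nn[i, ]], d[i, nn[i, ]])
    1 / mean(reach)
  }, numeric(1))

  # LOF: mean density of the neighbours divided by the point's own density
  vapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i], numeric(1))
}

# Toy data: 20 tightly clustered points plus one point far away (row 21)
set.seed(1)
toy <- rbind(matrix(rnorm(40, sd = 0.1), ncol = 2), c(5, 5))
toy.scores <- lof_sketch(toy, k = 5)
which.max(toy.scores)
```

In the toy data, the isolated point in row 21 receives by far the largest score, because its neighbours are much more densely packed than it is.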

We can then list the row indices of the five points with the highest scores.

outliers <- order(outlier.scores, decreasing=TRUE)[1:5]
print(outliers)

## [1] 188 186  66  51 301
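The ranking step can be sketched on its own with a small vector of made-up scores. The values in `scores` and the `cutoff` of 1.5 are assumptions chosen for illustration and are not taken from the run above. The sketch shows two common ways of selecting outliers: keeping a fixed number of top scores, or flagging every score above a cutoff.

```r
# Hypothetical scores standing in for the output of lofactor()
scores <- c(1.02, 0.98, 3.40, 1.10, 2.25, 0.95, 1.75)

# Option 1: keep the top_n highest scores, together with their row indices
top_n <- 3
idx <- order(scores, decreasing = TRUE)[seq_len(top_n)]
ranked <- data.frame(row = idx, score = scores[idx])
print(ranked)

# Option 2: flag every point whose score exceeds a cutoff (assumed value)
cutoff <- 1.5
flagged <- which(scores > cutoff)
print(flagged)
```

A fixed count is convenient for plotting, while a cutoff adapts to how many unusual points the data actually contains.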

Use Grubbs tests for one outlier

Since the generated dataset has three columns, we can also examine each column separately. The Grubbs test with type = 10 checks whether the single most extreme value in a column is an outlier.

grubbs.test(ArtifData[,1], type = 10, opposite = FALSE, two.sided = FALSE)

## 
##  Grubbs test for one outlier
## 
## data:  ArtifData[, 1]
## G = 2.39270, U = 0.98722, p-value = 1
## alternative hypothesis: highest value 3.77836047914295 is an outlier

grubbs.test(ArtifData[,2], type = 10, opposite = FALSE, two.sided = FALSE)

## 
##  Grubbs test for one outlier
## 
## data:  ArtifData[, 2]
## G = 2.70820, U = 0.98363, p-value = 1
## alternative hypothesis: lowest value -0.646630426584628 is an outlier

grubbs.test(ArtifData[,3], type = 10, opposite = FALSE, two.sided = FALSE)

## 
##  Grubbs test for one outlier
## 
## data:  ArtifData[, 3]
## G = 3.61900, U = 0.97077, p-value = 0.06043
## alternative hypothesis: highest value 2.6234650283196 is an outlier

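None of the three tests is significant at the 5% level (p-values of 1, 1 and 0.06043), so no single value is flagged as an outlier within its column. This matches the generation settings above: every point was drawn from a normal distribution with one of the given means and standard deviations, and no extreme values were added.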
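For readers who want to see what the statistic G measures, the following sketch computes it by hand and compares it with the textbook two-sided critical value based on the t distribution. The function name `grubbs_sketch` and the sample vector are hypothetical, and the ‘outliers’ package computes its p-value with its own routines, so this is an illustration of the idea rather than a reimplementation of `grubbs.test`.

```r
# Textbook two-sided Grubbs test for a single outlier
grubbs_sketch <- function(x, alpha = 0.05) {
  n <- length(x)
  dev <- abs(x - mean(x))
  idx <- which.max(dev)          # most extreme value in either direction
  G <- dev[idx] / sd(x)          # deviation in units of the sample sd

  # Critical value: upper alpha / (2n) quantile of t with n - 2 df
  t_crit <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)
  G_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))

  list(index = idx, value = x[idx], G = G, G_crit = G_crit,
       reject = G > G_crit)
}

# Made-up sample: eight values near 10 and one value at 15
sample.x <- c(10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 15)
res <- grubbs_sketch(sample.x)
res$index    # position of the suspected outlier
res$reject   # TRUE when G exceeds the critical value
```

The test assumes that the data without the suspect value are roughly normal, which is worth keeping in mind when a column mixes several groups as in the generated data.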

Plot Outliers column-by-column

Next, we plot every pair of columns against each other. The top 5 outliers by LOF score are drawn as red ‘+’ and all other points as black ‘.’.

# One plotting symbol and colour per row (450 rows, not 150)
n <- nrow(ArtifData)
pch <- rep(".", n)
pch[outliers] <- "+"
col <- rep("black", n)
col[outliers] <- "red"
pairs(ArtifData, pch=pch, col=col)

Each panel shows the points projected onto one pair of columns, so the groups of points appear from a different angle in each panel, and the five outliers stand out as red ‘+’ marks.

Use Random Forest to Detect Outliers and Plot them

Import Iris DataSet and Explore it

We first load the Iris dataset, which ships with R, and explore it.

dataset <- iris
head(dataset)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(dataset$Species)

##  Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

From this output, we see that the Iris dataset contains information on three species.

Build Random Forest Model and Plot the Result

We then call the ‘CoreModel’ function to build a random forest classification model on the dataset, compute an outlier score for each instance with ‘rfOutliers’, and plot the absolute values of the scores.

md <- CoreModel(Species ~ ., dataset, model="rf", rfNoTrees=30)
outliers <- rfOutliers(md, dataset)
plot(abs(outliers))

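In the plot, each instance is drawn as an empty circle whose height is its absolute outlier score, in the same order as the rows of the dataset. The first species (setosa) looks clearly different from the other two, and within each species the instances with the highest scores are the outlier candidates.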
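The idea behind a random forest outlier score can be sketched with a small proximity matrix, where the proximity of two instances is the fraction of trees in which they end up in the same leaf. The sketch below assumes Breiman's proximity-based definition; the function name `rf_outlier_sketch` and the toy matrix are hypothetical, and the exact normalisation used by ‘CORElearn’ may differ.

```r
# prox: symmetric n x n proximity matrix, cls: class label of each instance
rf_outlier_sketch <- function(prox, cls) {
  n <- nrow(prox)

  # Raw score: large when proximities to the instance's own class are small
  raw <- numeric(n)
  for (i in seq_len(n)) {
    same <- setdiff(which(cls == cls[i]), i)
    raw[i] <- n / sum(prox[i, same]^2)
  }

  # Standardise within each class by the median and median absolute deviation
  out <- raw
  for (lev in unique(cls)) {
    idx <- which(cls == lev)
    out[idx] <- (raw[idx] - median(raw[idx])) / mad(raw[idx], constant = 1)
  }
  out
}

# Toy example: two classes of four instances; the fourth instance of each
# class has low proximity to its classmates
block <- matrix(c(1.0, 0.9, 0.8, 0.1,
                  0.9, 1.0, 0.7, 0.2,
                  0.8, 0.7, 1.0, 0.1,
                  0.1, 0.2, 0.1, 1.0), 4, 4)
zero <- matrix(0, 4, 4)
toy.prox <- rbind(cbind(block, zero), cbind(zero, block))
toy.cls <- rep(c("a", "b"), each = 4)
toy.out <- rf_outlier_sketch(toy.prox, toy.cls)
round(toy.out, 2)
```

Because scores are standardised per class, they are comparable only within a species, which is why outliers are read off separately for each group in the plot.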

Conclusion

In this document, Local Outlier Factor and Random Forest were applied to detect outliers, and the results were plotted above. Both algorithms can be demonstrated with only a few lines of R code and produce output that is easy to interpret.

Detecting Outlier by Local Outlier Factor and Random Forest

Cheng-Chung Li

2016/8/7