In this document, we use two popular algorithms, Local Outlier Factor and Random Forest, to detect outliers in the dataset, and plot the results.
We use the ‘DMwR’ package for the LOF function, the ‘outliers’ package for the Grubbs test, and ‘CORElearn’ for random forest. If you have not installed these three packages, please install them first.
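As a minimal sketch of the installation step, the snippet below lists which of the three packages are still missing. The helper `missing_packages` is our own illustrative name, not part of any package, and the install call is left commented out so nothing is downloaded by accident. Note that ‘DMwR’ may no longer be available from the main CRAN repository, in which case it has to be installed from the CRAN archive instead.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("DMwR", "outliers", "CORElearn")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```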
After that, we have to load them.
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
library(outliers)
library(CORElearn)
We randomly generate three groups of points, where each group contains 150 points and has its own mean and standard deviation in each dimension. Together, the points form a 450x3 numeric matrix.
set.seed(1234)
gen.xyz <- function(n, mean, sd) {
  cbind(rnorm(n, mean[1], sd[1]),
        rnorm(n, mean[2], sd[2]),
        rnorm(n, mean[3], sd[3]))
}
ArtifData <- rbind(gen.xyz(150, c(0, 0, 0), c(.2, .2, .2)),
                   gen.xyz(150, c(2.5, 0, 1), c(.4, .2, .6)),
                   gen.xyz(150, c(1.25, .5, .1), c(.3, .2, .5)))
str(ArtifData)
## num [1:450, 1:3] -0.2414 0.0555 0.2169 -0.4691 0.0858 ...
head(ArtifData)
## [,1] [,2] [,3]
## [1,] -0.24141315 -0.07544753 -0.115991398
## [2,] 0.05548585 0.01952389 -0.190655740
## [3,] 0.21688824 0.32774893 -0.035885717
## [4,] -0.46913954 -0.17511849 0.201961643
## [5,] 0.08582494 0.02435200 0.004725323
## [6,] 0.10121118 0.27242613 -0.129805644
With the dataset ready, we use the LOF algorithm to compute one local outlier factor per case (row), using k = 5 nearest neighbours.
outlier.scores <- lofactor(ArtifData, k=5)
plot(density(outlier.scores));
In the density plot above, the outliers form a single long tail on the right, at the highest scores.
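To show what a local outlier factor measures, here is a minimal base-R sketch of the LOF computation. It is not the implementation used by ‘DMwR’: the function name `lof_sketch` is ours, it always takes exactly k neighbours (ties at the k-th distance are ignored), and it assumes there are no duplicate points. A score close to 1 means a point is about as dense as its neighbours, while a much larger score marks an outlier.

```r
# Illustrative LOF sketch (assumption: no duplicate rows, ties ignored).
lof_sketch <- function(x, k = 5) {
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf            # a point is not its own neighbour

  # Column i holds the indices of the k nearest neighbours of point i.
  nn <- matrix(apply(d, 1, function(row) order(row)[1:k]), nrow = k)

  # k-distance: distance from each point to its k-th nearest neighbour.
  kdist <- vapply(seq_len(n), function(i) d[i, nn[k, i]], numeric(1))

  # Local reachability density: inverse of the mean reachability distance
  # from point i to its neighbours, where
  # reach-dist(i, o) = max(k-distance(o), d(i, o)).
  lrd <- vapply(seq_len(n), function(i) {
    o <- nn[, i]
    1 / mean(pmax(kdist[o], d[i, o]))
  }, numeric(1))

  # LOF: mean density of the neighbours divided by the point's own density.
  vapply(seq_len(n), function(i) mean(lrd[nn[, i]]) / lrd[i], numeric(1))
}

# Small example: 50 standard-normal points plus one point far away.
set.seed(1)
toy <- rbind(matrix(rnorm(100), ncol = 2), c(10, 10))
toy.scores <- lof_sketch(toy, k = 5)
print(which.max(toy.scores))
```

On the data above, `lof_sketch(ArtifData, k = 5)` should rank points similarly to `lofactor`, although we have not checked that the two agree exactly.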
We can further output the top 5 outliers.
outliers <- order(outlier.scores, decreasing=TRUE)[1:5]
print(outliers);
## [1] 188 186 66 51 301
Since the generated dataset has three columns, we can also test each column separately for a single outlier using the Grubbs test.
grubbs.test(ArtifData[,1], type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: ArtifData[, 1]
## G = 2.39270, U = 0.98722, p-value = 1
## alternative hypothesis: highest value 3.77836047914295 is an outlier
grubbs.test(ArtifData[,2], type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: ArtifData[, 2]
## G = 2.70820, U = 0.98363, p-value = 1
## alternative hypothesis: lowest value -0.646630426584628 is an outlier
grubbs.test(ArtifData[,3], type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: ArtifData[, 3]
## G = 3.61900, U = 0.97077, p-value = 0.06043
## alternative hypothesis: highest value 2.6234650283196 is an outlier
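None of the three tests is significant at the 5% level: the p-values are 1 for the first two columns and about 0.06 for the third. This is consistent with the generation settings, since every point was drawn from a normal distribution with the given means and standard deviations and no extreme values were injected.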
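To make the G statistic in the output above concrete, the following base-R sketch computes it by hand, together with a one-sided critical value. The helper names `grubbs_stat` and `grubbs_crit` are ours and are not part of the ‘outliers’ package. The critical value uses the standard formula based on the t distribution; we assume it is close to what the package uses for its p-value, but have not verified that digit for digit.

```r
# G statistic: largest absolute deviation from the mean, in units of
# the sample standard deviation.
grubbs_stat <- function(x) {
  max(abs(x - mean(x))) / sd(x)
}

# One-sided critical value for a sample of size n at level alpha
# (standard t-based formula; an assumption, not taken from the package).
grubbs_crit <- function(n, alpha = 0.05) {
  t2 <- qt(alpha / n, df = n - 2, lower.tail = FALSE)^2
  (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
}

# Tiny worked example: mean 0, sd 1, so G is exactly 1.
print(grubbs_stat(c(-1, 0, 1)))

# Critical value for a column of 450 values, as in the dataset above.
print(grubbs_crit(450))
```

A column is flagged only when its G exceeds the critical value, so comparing `grubbs_stat(ArtifData[, 3])` with `grubbs_crit(450)` gives a quick cross-check of the test output.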
Next, we plot the data as a scatterplot matrix, with one panel per pair of columns. The top 5 LOF outliers found above are drawn as red ‘+’ and all other points as black ‘.’.
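# One plotting symbol and colour per row; the dataset has 450 rows, not 150
pch <- rep(".", nrow(ArtifData))
pch[outliers] <- "+"
col <- rep("black", nrow(ArtifData))
col[outliers] <- "red"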
pairs(ArtifData, pch=pch, col=col)
Each panel plots one column against another, and the five outliers are clearly visible as red crosses.
For the random forest example, we first load the Iris dataset and explore it.
dataset <- iris
head(dataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(dataset$Species)
## Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
From this output, we see that the Iris dataset contains measurements for three species.
We then call the ‘CoreModel’ function to build a random forest classification model on the dataset, and plot the absolute outlier scores.
md <- CoreModel(Species ~ ., dataset, model="rf", rfNoTrees=30)
outliers <- rfOutliers(md, dataset)
plot(abs(outliers))
In the plot, each instance is drawn as an empty circle at the height of its absolute outlier score. The first species stands apart from the other two, and within each species the outliers are the points with the largest scores.
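Reading the outliers off the plot by eye can be complemented by ranking the scores within each species. The sketch below does this in base R. The helper `top_by_group` is a hypothetical name of ours, not a ‘CORElearn’ function, and the toy scores in the example are made up purely to show the behaviour.

```r
# Hypothetical helper: for each group, return the row indices of the
# n instances with the largest absolute score.
top_by_group <- function(scores, groups, n = 3) {
  idx <- split(seq_along(scores), groups)
  lapply(idx, function(i) {
    ranked <- i[order(abs(scores[i]), decreasing = TRUE)]
    ranked[seq_len(min(n, length(ranked)))]
  })
}

# Toy example with invented scores for two groups.
toy.scores <- c(0.1, 5, 0.2, -7, 0.3, 0.4)
toy.groups <- factor(c("a", "a", "a", "b", "b", "b"))
toy.top <- top_by_group(toy.scores, toy.groups, n = 1)
print(toy.top)
```

With the objects from this document, the call would be `top_by_group(outliers, dataset$Species)`, assuming `outliers` holds one score per row of `dataset`.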
In this document, we applied Local Outlier Factor and Random Forest outlier detection and plotted the results as above. Both algorithms can be demonstrated with a few lines of R code and produce clear, readable output.