LOF (Local Outlier Factor): Proximity (density) Based Outlier Detection Technique

Proximity-based outlier detection techniques fall into two major categories: density-based and distance-based outlier detection. In general, a proximity-based technique considers an object to be an outlier if it is distant from most other points.

This approach is more general and simpler than the statistical approaches, since it is easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution.

In this lesson we will focus only on the density-based outlier detection technique.

A simple way to measure whether an object is distant from most other points in the data set is to use the distance to its k-th nearest neighbor or its LOF (Local Outlier Factor).
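The k-th nearest neighbor distance can be sketched in base R as follows. This is a minimal illustration, assuming Euclidean distance, the built-in iris data and k = 5; the variable names x, d and kth.dist are chosen here for illustration only.

```r
# Numeric columns of the built-in iris data (assumed example data)
x <- as.matrix(iris[, 1:4])
k <- 5

# Full pairwise Euclidean distance matrix
d <- as.matrix(dist(x))

# Sort each row in ascending order. The first entry is the point's zero
# distance to itself, so entry k + 1 is the distance to the k-th nearest neighbor.
kth.dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Rows with the largest k-th neighbor distance are the most isolated
head(order(kth.dist, decreasing = TRUE))
```

A larger value of kth.dist means that the point is further from its neighbors, so the rows listed first are the strongest outlier candidates under this simple measure.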

The LOF method scores outliers on the basis of the density in their neighborhood. It relies on a quantity known as the outlier score. The outlier score of an object is the reciprocal of the density in the object’s neighborhood, where the density is in turn the reciprocal of the average distance to the k nearest neighbors.
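A rough sketch of this density and outlier score in base R, assuming Euclidean distance and k = 5 on the numeric iris columns, might look like the block below. The names avg.knn.dist, knn.density and knn.score are illustrative, and this is a simplified score rather than the exact computation performed by any particular LOF implementation.

```r
x <- as.matrix(iris[, 1:4])
k <- 5
d <- as.matrix(dist(x))

# Average distance from each point to its k nearest neighbors
# (the first sorted entry is the point's zero distance to itself, so skip it)
avg.knn.dist <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))

# Density is the reciprocal of the average neighbor distance.
# Note: a point with k exact duplicates would get an infinite density.
knn.density <- 1 / avg.knn.dist

# The outlier score is the reciprocal of the density
knn.score <- 1 / knn.density

head(order(knn.score, decreasing = TRUE))
```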

In the LOF technique, the local density of a point is compared with that of its neighbors. If the former is significantly lower than the latter, i.e. if the LOF is noticeably greater than one, then the point lies in a sparser region than its neighbors, which suggests that it is an outlier.
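The comparison of a point's density with that of its neighbors can be sketched as a ratio. The block below is a simplified relative-density score, assuming Euclidean distance and k = 5. It does not use the reachability distances of the full LOF definition, so its values will not match an actual LOF implementation exactly; the names knn.idx, dens and rel.score are illustrative.

```r
x <- as.matrix(iris[, 1:4])
k <- 5
d <- as.matrix(dist(x))
n <- nrow(d)
diag(d) <- Inf   # exclude each point from its own neighbor list

# Indices of the k nearest neighbors of each point (one column per point)
knn.idx <- apply(d, 1, function(row) order(row)[1:k])

# Simple density: reciprocal of the average distance to the k neighbors
dens <- sapply(1:n, function(i) 1 / mean(d[i, knn.idx[, i]]))

# Ratio of the neighbors' average density to the point's own density.
# Values close to 1 mean a density similar to the neighbors;
# values well above 1 mean the point sits in a sparser region.
rel.score <- sapply(1:n, function(i) mean(dens[knn.idx[, i]]) / dens[i])

head(order(rel.score, decreasing = TRUE))
```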

One limitation of LOF as used here is that it works on numeric data only.
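Because of this restriction, non-numeric columns have to be dropped before scoring. One possible way to do that in base R, shown here on the built-in iris data, is the following (the names numeric.cols and iris.numeric are illustrative):

```r
# Flag the columns that hold numeric values
numeric.cols <- sapply(iris, is.numeric)

# Keep only those columns; the factor column Species is dropped
iris.numeric <- iris[, numeric.cols]
names(iris.numeric)
```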

The complexity of this algorithm is O(N^2). Another challenge is selecting a suitable value of k, which is not obvious.
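The quadratic cost comes from comparing every point with every other point. As a small illustration, assuming a straightforward implementation that computes all pairwise distances, the number of distances for the 150 rows of the iris data can be counted like this:

```r
n <- nrow(iris)

# dist() stores one distance per unordered pair of rows
pair.dists <- dist(iris[, 1:4])

# n * (n - 1) / 2 pairs, which grows quadratically with n
length(pair.dists)
n * (n - 1) / 2
```

Both expressions give 11175 for n = 150; doubling the number of rows roughly quadruples this count.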

In R, the package DMwR provides a function lofactor(), which computes local outlier factors using the LOF algorithm.

Let us consider an example. The data set used in this example is the Iris data set.

Here, using the statistical package R, we will try to identify outliers in the Iris data set with the LOF algorithm.

library(DMwR)
## Warning: package 'DMwR' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: grid

Now we need to extract the numeric columns from the Iris data set, which can be done using the following commands:

iris2 <- iris[,1:4]
head(iris2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

Now we will obtain the outlier scores with k = 5:

outlier.scores <- lofactor(iris2, k=5)

Now that you have computed the outlier scores, plot their density by typing the following command:

plot(density(outlier.scores))

Now we can obtain the top 5 outliers in the data set by typing the following commands:

# Row numbers of the five largest outlier scores
outliers <- order(outlier.scores, decreasing=TRUE)[1:5]

print(outliers)
## [1]  42 107  23 110  63

You should now be able to identify the outliers.

The five values in the output are row numbers in the iris2 data frame derived from the Iris data set.
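To look at the flagged observations themselves, the row numbers can be used to index the original data frame. The sketch below hard-codes the row numbers from the output above so that it runs on its own; the exact rows may differ with other package versions or another value of k.

```r
# Row numbers copied from the printed output above
outliers <- c(42, 107, 23, 110, 63)

# Show the measurements and species of the flagged rows
iris[outliers, ]
```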

To visualize the outliers with a biplot of the first two principal components, use the following commands:

n <- nrow(iris2)
# Label outliers with their row number and every other point with "."
labels <- 1:n
labels[-outliers] <- "."
biplot(prcomp(iris2), cex=.8, xlabs=labels)

We can also use a pairs plot to visualize the outliers, which are marked with a red “+” sign, by typing the following commands:

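# Plot symbol: "+" for outliers, "." for all other points
pch <- rep(".", n)
pch[outliers] <- "+"
# Color: red for outliers, black for all other points
col <- rep("black", n)
col[outliers] <- "red"
pairs(iris2, pch=pch, col=col)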

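Finally, using the rgl package, the outliers can be shown in red in an interactive 3D scatter plot of three of the variables:

library(rgl)
## Warning: package 'rgl' was built under R version 3.3.2
plot3d(iris2$Petal.Width, iris2$Petal.Length, iris2$Sepal.Width, type="s", col=col)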

The package Rlof also provides a function lof(), a parallel implementation of the LOF algorithm. Compared with lofactor(), it offers two additional features: support for multiple values of k and several choices of distance metric.

Since this package is not available for Windows, we will not discuss the lof() function here. Mac and Linux users who are interested should refer to the tutorial for the usage of the lof() function in the Rlof package.