In this section we discuss how the k-means algorithm can be used to detect outliers.
The k-means clustering technique (reference: Lesson 6, Section 6.3.1)
In the k-means based outlier detection technique, the data are partitioned into k groups by assigning each object to the closest cluster center.
Once the objects are assigned, we can compute the distance (or dissimilarity) between each object and its cluster center, and pick the objects with the largest distances as outliers.
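The idea can be sketched on a toy example before turning to real data. The points and the two cluster centers below are made up for illustration (in practice k-means estimates the centers), and the names pts, ctrs, assigned and dist_to_center are hypothetical.

```r
# Toy data: five 2-D points; the last one is deliberately far from the others.
pts <- matrix(c(1.0, 1.0,
                1.2, 0.9,
                5.0, 5.1,
                5.2, 4.9,
                9.0, 1.0),
              ncol = 2, byrow = TRUE)

# Hand-picked cluster centers (an assumption; k-means would estimate these).
ctrs <- matrix(c(1.1, 1.0,
                 5.1, 5.0),
               ncol = 2, byrow = TRUE)

# Euclidean distance from every point to every center
# (rows = points, columns = centers).
d <- sapply(seq_len(nrow(ctrs)), function(j) {
  sqrt(rowSums(sweep(pts, 2, ctrs[j, ])^2))
})

# Index of the closest center for each point.
assigned <- apply(d, 1, which.min)

# Distance from each point to its own (closest) center.
dist_to_center <- d[cbind(seq_len(nrow(pts)), assigned)]

# The point with the largest distance is the outlier candidate.
toy_outlier <- which.max(dist_to_center)
toy_outlier
```

The fifth point is assigned to a cluster like every other point, but it lies much farther from its center than the rest, which is what marks it as an outlier candidate.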
Here we illustrate k-means based outlier detection with the Iris data set, the same data set we used to illustrate the proximity-based outlier detection technique.
First, create a subset of the Iris data set in R that keeps only the four numeric columns (the Species column is dropped):
iris2 <- iris[,1:4]
On this subset, perform k-means clustering using the kmeans() function with k = 3:
kmeans.result <- kmeans(iris2, centers=3)
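Because kmeans() picks its starting centers at random, the cluster numbering (and occasionally the clusters themselves) can differ from run to run, so your output may not match the listings in this section exactly. A minimal sketch of one way to make a run repeatable is shown here; the seed value 42, the choice nstart = 25, and the name kmeans.repro are arbitrary choices made for this illustration.

```r
iris2 <- iris[, 1:4]

# Fix the random seed so the same starting centers are drawn on every run.
# The value 42 is arbitrary.
set.seed(42)

# nstart = 25 runs k-means from 25 random starts and keeps the solution with
# the lowest total within-cluster sum of squares; 25 is an illustrative choice.
kmeans.repro <- kmeans(iris2, centers = 3, nstart = 25)

# Number of objects in each cluster; the cluster numbering itself is arbitrary.
kmeans.repro$size
```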
To view the cluster centers found by k-means, type the following command:
kmeans.result$centers
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     6.850000    3.073684     5.742105    2.071053
## 3     5.901613    2.748387     4.393548    1.433871
To obtain the cluster ID of each object, type the following command:
kmeans.result$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [71] 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
Next, we calculate the distance between each object and its cluster center, take the five objects with the largest distances as the outliers, and print them.
The R commands for these steps are:
centers <- kmeans.result$centers[kmeans.result$cluster, ] # "centers" is a matrix with one row per object (the center of that object's cluster), so the distances can be calculated row by row
distances <- sqrt(rowSums((iris2 - centers)^2)) # Euclidean distance from each object to its cluster center
outliers <- order(distances, decreasing=TRUE)[1:5] # row numbers of the 5 largest distances
print(outliers) # these rows are the top 5 outliers
## [1] 99 58 94 61 119
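The steps above can also be wrapped in a small function so that the number of outliers is easy to change. The following is only a sketch: kmeans_outliers() is a hypothetical helper written for this illustration (it is not part of base R or of any package), and the seed is set only so that the sketch is repeatable.

```r
# Hypothetical helper (not part of base R): returns the row indices of the
# n objects farthest from their assigned k-means cluster center.
kmeans_outliers <- function(data, fit, n = 5) {
  # One row of center coordinates per object, matched by cluster ID.
  centers <- fit$centers[fit$cluster, , drop = FALSE]
  # Euclidean distance from each object to its own cluster center.
  distances <- sqrt(rowSums((as.matrix(data) - centers)^2))
  # Indices of the n largest distances, largest first.
  idx <- order(distances, decreasing = TRUE)[seq_len(n)]
  list(index = idx, distance = distances[idx])
}

iris2 <- iris[, 1:4]
set.seed(42)  # arbitrary seed, only to make this sketch repeatable
fit <- kmeans(iris2, centers = 3)
res <- kmeans_outliers(iris2, fit, n = 5)
res$index     # row numbers of the top 5 outlier candidates
res$distance  # their distances to the assigned cluster center
```

Returning the distances together with the indices makes it easier to judge how far each candidate is from its cluster center, rather than relying on a fixed count alone.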
To print the details of the outliers, use the following command:
print(iris2[outliers,])
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 99           5.1         2.5          3.0         1.1
## 58           4.9         2.4          3.3         1.0
## 94           5.0         2.3          3.3         1.0
## 61           5.0         2.0          3.5         1.0
## 119          7.7         2.6          6.9         2.3
The following commands plot the clusters using the two sepal measurements, with "+" marking the outliers and large filled squares (pch=15) marking the cluster centers:
plot(iris2[,c("Sepal.Length", "Sepal.Width")], pch=19, col=kmeans.result$cluster, cex=1)
points(kmeans.result$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=15, cex=2)
points(iris2[outliers, c("Sepal.Length", "Sepal.Width")], pch="+", col=4, cex=3)