A simple k-means clustering using the iris data.

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

This document shows how k-means clustering is done in R and how to check the accuracy of the resulting clusters.

The iris data, taken from Ronald Fisher’s paper “The use of multiple measurements in taxonomic problems”, contains three plant species. Four features (sepal length, sepal width, petal length and petal width) are provided for 150 observations, 50 from each species.

In this document, I try to group the data into clusters using the four given features. I will then compare the clusters with the original plant species and see how accurate the grouping is.

K-means is an iterative algorithm that partitions the observations into clusters based on the given features. The user must specify how many clusters to divide the data into. The algorithm then randomly chooses an initial ‘centroid’ for each cluster and computes the squared Euclidean distance between every point and every centroid. Each point is allocated to the cluster whose centroid is nearest. Each centroid is then recomputed as the mean of the points allocated to it, and these two steps repeat until no point moves to a different cluster.
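
The loop described above can be sketched directly in R. This is only a minimal illustration: simple_kmeans is a hypothetical helper written for this document, it assumes all features are numeric, and it is not what the built-in kmeans function runs by default (that uses the Hartigan-Wong variant).

```r
# Minimal sketch of the assign/update loop; simple_kmeans is a hypothetical helper
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Pick k observations at random as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assignment step: index of the nearest centroid for each point
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when no point changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Update step: each centroid becomes the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
fit <- simple_kmeans(iris[, 1:4], 3)
table(fit$cluster)
```

Because the starting centroids are random, a different seed can give a different partition, and the cluster numbers themselves carry no meaning.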

The number of clusters to use for K-means is a tradeoff between the number of clusters and the variation within each cluster: more clusters give tighter groups, but the result becomes harder to interpret. K-means clustering is an unsupervised machine learning algorithm, but in this example we already know that there are three species, so we will ask the algorithm to partition the data into three clusters.
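
When the number of groups is not known in advance, one common way to look at this tradeoff is an elbow plot of the total within-cluster sum of squares against the number of clusters. The sketch below is an illustration under my own assumptions: the range 1 to 8 and nstart = 20 are arbitrary choices, and ks and wss are names made up for this example.

```r
# Total within-cluster sum of squares for k = 1..8 (range chosen arbitrarily)
set.seed(100)
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss
})

# Look for the "elbow" where adding clusters stops helping much
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The point where the curve flattens is usually taken as a reasonable number of clusters, although reading it is a judgment call.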

I will make a new data set that does not include the plant species, since that is what we shall be trying to find out.

irisset <- iris[-5]
head(irisset)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

Plotting the dataset with ggplot2, first by sepal length and width, then by petal length and width.

library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, shape = Species, col = Species)) + geom_point()

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, shape = Species, col = Species)) + geom_point()

Now, I shall run k-means to partition the irisset data.

set.seed(100)
irisCluster <- kmeans(irisset,3)
irisCluster$cluster
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
## [112] 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
## [149] 2 1
table(irisCluster$cluster)
## 
##  1  2  3 
## 62 38 50
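
Since the initial centroids are random, the result depends on the seed, and the cluster numbers are arbitrary labels. A hedged variant of the call above uses the nstart argument of kmeans, which runs several random starts and keeps the best one. The value 25 and the name irisCluster25 are my own choices for illustration, not part of the analysis above.

```r
set.seed(100)
# nstart = 25 is an illustrative choice: try 25 random starts, keep the best
irisCluster25 <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

irisCluster25$centers       # one row of feature means per cluster
irisCluster25$size          # number of observations in each cluster
irisCluster25$tot.withinss  # total within-cluster sum of squares
```

The centers are often the easiest way to see what distinguishes the clusters, for example which one has the shortest petals.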

The next plots show the observations as divided by the k-means clusters, compared to the original classification by species. The colors indicate the cluster that each observation is assigned to, whereas the shape of each point indicates the real plant species.

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, shape = Species, col = as.factor(irisCluster$cluster))) + geom_point()

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, shape = Species, col = as.factor(irisCluster$cluster))) + geom_point()

In these graphs, it can be seen that all the points of one color are circles (Setosa). However, each of the other two colors contains a combination of triangles (Versicolor) and squares (Virginica).

Judging by the graphs, most of the points are grouped correctly by our k-means algorithm. With the help of the table function, the accuracy of the model can be calculated.

table(irisCluster$cluster,iris$Species)
##    
##     setosa versicolor virginica
##   1      0         48        14
##   2      0          2        36
##   3     50          0         0

The table numerically illustrates what the graphs had shown. All Setosa are grouped into a single cluster. There is some misallocation in Versicolor and even more in Virginica. From the table, we know that 48 Versicolor and 14 Virginica were assigned to cluster one, whereas two Versicolor and 36 Virginica were assigned to cluster two. As the graphs show, the difference between Versicolor and Virginica is smaller than the difference between either of them and Setosa, which may explain the variation in the clustering.

From the table we can find that cluster one is mostly Versicolor, cluster two is mostly Virginica and cluster three is Setosa. With the help of this information, we will try to calculate the accuracy of our algorithm.

# Accuracy is the number of correctly grouped observations divided by the total number of observations
(50+48+36)/150
## [1] 0.8933333
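
The same figure can be computed from the table instead of typing the counts by hand. The sketch below is one possible way to do it, under the assumption that each cluster is labelled with its most common species. The names fit, tab, majority and accuracy are made up for this example, and the exact numbers may differ from those above depending on the R version and seed.

```r
set.seed(100)
fit <- kmeans(iris[, 1:4], centers = 3)
tab <- table(fit$cluster, iris$Species)

# Label each cluster with its most common species
majority <- colnames(tab)[apply(tab, 1, which.max)]
names(majority) <- rownames(tab)
majority

# Share of observations whose species matches their cluster's label
accuracy <- sum(apply(tab, 1, max)) / sum(tab)
accuracy
```

If two clusters happen to share the same most common species, this number is better described as cluster purity than as accuracy.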

The model has an accuracy of approximately 89%. Using only the four features, it can be concluded that the k-means algorithm did a great job of clustering the data.