A simple k-means clustering using the iris data.
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
This document shows how k-means clustering is done in R and checks how accurately the k-means algorithm recovers known groups.
The iris data, taken from Ronald Fisher’s paper “The use of multiple measurements in taxonomic problems”, contains three plant species. Four features (sepal length, sepal width, petal length and petal width) are provided for each of 150 observations.
In this document, I try to group the data into clusters using the four given features. I will then compare the clusters with the original plant species and see how accurate the grouping is.
K-means is an iterative algorithm that partitions the observations into clusters based on the given features. The user must specify how many clusters to divide the data into. The algorithm then randomly picks an initial ‘centroid’ for each cluster and computes the squared Euclidean distance between every point and every centroid. Each point is allocated to the cluster whose centroid is nearest. The centroids are then recomputed as the mean of the points allocated to them, and the allocation step is repeated. The algorithm continues until no point is reallocated to a different cluster.
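The loop described above can be sketched in a few lines of R. This is only an illustration of the assign-and-update idea (often called Lloyd's algorithm); it is not the routine that `kmeans()` runs by default, and `simple_kmeans` and `max_iter` are hypothetical names introduced here.

```r
# Minimal Lloyd-style k-means sketch (illustrative only).
# simple_kmeans is a hypothetical helper, not part of base R.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k observations chosen at random as the initial centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid (n x k).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    # Assign each point to its nearest centroid.
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when no point changes cluster.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Recompute each centroid as the mean of the points assigned to it.
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centroids)
}

set.seed(1)
fit <- simple_kmeans(iris[, 1:4], 3)
table(fit$cluster)
```

Because the starting centroids are random, different seeds can give different partitions, which is why the analysis below fixes a seed before calling `kmeans()`.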
Choosing the number of clusters for k-means is a tradeoff between the number of clusters and the variation within them: more clusters reduce the within-cluster variation but make the result harder to interpret. While k-means clustering is an unsupervised machine learning algorithm, in this example we already know that there are three species, so we will ask the algorithm to partition the data into three clusters.
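When the number of groups is not known in advance, one common way to look at this tradeoff is an "elbow" plot of the total within-cluster sum of squares against the number of clusters. The sketch below assumes a range of 1 to 10 clusters and `nstart = 20`, both arbitrary choices made for illustration; `features` and `wss` are names introduced here.

```r
# Total within-cluster sum of squares for k = 1..10 (range chosen arbitrarily).
set.seed(100)
features <- iris[, 1:4]
wss <- sapply(1:10, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})

# Look for the "elbow" where adding clusters stops reducing the variation much.
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```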
I will make a new data set which will not include plant species since that is what we shall be trying to find out.
irisset <- iris[-5]
head(irisset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
Plotting the dataset with ggplot2, first by sepal length and width, then by petal length and width
library(ggplot2)
ggplot(iris,aes(x = Sepal.Length, y = Sepal.Width, shape= Species, col= Species)) + geom_point()
ggplot(iris,aes(x = Petal.Length, y = Petal.Width, shape= Species, col= Species)) + geom_point()
Now, I shall run k-means to partition the irisset data.
set.seed(100)
irisCluster <- kmeans(irisset,3)
irisCluster$cluster
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
## [112] 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
## [149] 2 1
table(irisCluster$cluster)
##
## 1 2 3
## 62 38 50
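Since the result depends on the random starting centroids, it can help to try several starts and keep the best one. The following is a minimal sketch using the `nstart` argument of `kmeans()`; the value 25 and the name `irisClusterMulti` are assumptions made for illustration, and the cluster numbering it produces need not match the output above.

```r
# Run k-means from 25 random starts and keep the solution with the
# lowest total within-cluster sum of squares.
set.seed(100)
irisClusterMulti <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

irisClusterMulti$centers       # centroid coordinates, one row per cluster
irisClusterMulti$size          # number of observations in each cluster
irisClusterMulti$tot.withinss  # total within-cluster sum of squares
```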
The next plots show how k-means divided the observations, compared with the original classification by species. The color indicates the cluster an observation was assigned to, whereas the shape of the point indicates the real plant species.
ggplot(iris,aes(x = Sepal.Length, y = Sepal.Width, shape= Species, col= as.factor(irisCluster$cluster))) + geom_point()
ggplot(iris,aes(x = Petal.Length, y = Petal.Width, shape= Species, col= as.factor(irisCluster$cluster))) + geom_point()
In these graphs, it can be seen that all the green points are circles. However, the blue points are a mix of triangles and squares, and so are the orange points.
Judging by the graphs, most of the points are grouped correctly by our k-means algorithm. With the help of the table function, the accuracy of the clustering can be calculated.
table(irisCluster$cluster,iris$Species)
##
## setosa versicolor virginica
## 1 0 48 14
## 2 0 2 36
## 3 50 0 0
The table numerically illustrates what the graphs had shown. All Setosa observations fall into a single cluster. There is some variation in the clustering of Versicolor and even more in Virginica. From the table, we see that 48 Versicolor and 14 Virginica were placed in cluster one, whereas two Versicolor and 36 Virginica were placed in cluster two. As the graphs show, the difference between Versicolor and Virginica is smaller than the difference between either of them and Setosa, which may explain the variation in the clustering.
From the table we can see that cluster one is mostly Versicolor, cluster two is mostly Virginica and cluster three is Setosa. With the help of this information, we will calculate the accuracy of our algorithm.
# Accuracy is the number of correctly clustered observations divided by the total number of observations
(50+48+36)/150
## [1] 0.8933333
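The same figure can be derived from the contingency table instead of typing the counts by hand. The sketch below rebuilds the table shown above as a literal matrix so that it is self-contained; in a live session `table(irisCluster$cluster, iris$Species)` could be used in its place. It assumes each cluster is labelled with its most frequent species, which would break down if two clusters shared the same majority species. The names `tab`, `majority` and `accuracy` are introduced here.

```r
# Contingency table copied from the output above (rows = clusters, columns = species).
tab <- matrix(c( 0, 48, 14,
                 0,  2, 36,
                50,  0,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = c("1", "2", "3"),
                              species = c("setosa", "versicolor", "virginica")))

# Label each cluster with its most frequent species.
majority <- colnames(tab)[apply(tab, 1, which.max)]
majority

# Accuracy: observations in each cluster's majority species over all observations.
accuracy <- sum(apply(tab, 1, max)) / sum(tab)
accuracy
```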
The model has an accuracy of approximately 89%. Using the four features, it can be concluded that the k-means algorithm did a good job of clustering the data.