K-means clustering is a method of partitioning data into a predetermined number of clusters. It is an unsupervised learning algorithm that groups data points so as to minimize the sum of squared distances between each point and the centroid of its assigned cluster.
The algorithm works as follows:
1. Specify the number of clusters, k.
2. Initialize k centroids randomly within the data.
3. Assign each data point to the nearest centroid.
4. Recompute each centroid as the mean of the data points assigned to it.
5. Repeat steps 3 and 4 until the centroids stop moving or a maximum number of iterations is reached.

One limitation of K-means is that it assumes the clusters are spherical and roughly equal in size, which is not always the case in real-world data. Additionally, the initial placement of the centroids can affect the final clustering, so it may be necessary to run the algorithm multiple times with different initializations to obtain the best result.
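As a small illustration of that last point, base R's kmeans() has an nstart argument that repeats the random initialization several times and keeps the run with the lowest total within-cluster sum of squares. A minimal sketch (the use of the four numeric iris columns and the choice of 25 restarts are illustrative assumptions):

set.seed(42)                                  # make the random starts reproducible
x <- scale(iris[, 1:4])                       # example data: the four numeric iris columns
fit <- kmeans(x, centers = 3, nstart = 25)    # keep the best of 25 random initializations
fit$tot.withinss                              # objective value of the chosen run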
In this case study we will perform flower (iris) segmentation by applying K-means clustering. The objective is to understand K-means clustering.
The data attributes are sepal length and width, petal length and width, plus species. The aim is to build a model that groups the flowers into clusters corresponding to their species based on the given attributes.
Loading Important Libraries
library(data.table)
library(dplyr)
library(tidyverse)
library(ggplot2)

Load and view the data set
require("datasets")data("iris")Viewing with the structure of the data set
str(iris)

## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Statistical summary
summary(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
Data set preview
head(iris)

Since clustering is a type of unsupervised learning, we do not require the class label (output) while running the algorithm. We will therefore remove the class attribute "Species" and store it in another variable, and then normalize the remaining attributes to the range 0 to 1 using our own function.
iris.new<- iris[, c(1, 2, 3, 4)]
iris.class<- iris[, "Species"]
head(iris.new)

Previewing the class column
head(iris.class)

## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
Normalizing the data set so that no particular attribute has more impact on the clustering algorithm than the others.
normalize <- function(x){
return ((x-min(x)) / (max(x)-min(x)))
}
iris.new$Sepal.Length<- normalize(iris.new$Sepal.Length)
iris.new$Sepal.Width<- normalize(iris.new$Sepal.Width)
iris.new$Petal.Length<- normalize(iris.new$Petal.Length)
iris.new$Petal.Width<- normalize(iris.new$Petal.Width)
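Equivalently, the same normalization can be applied to all four columns in one step; this is just an alternative to the four assignments above, not a change in the result:

iris.new <- as.data.frame(lapply(iris[, 1:4], normalize))   # normalize every column at once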
head(iris.new)

Applying the K-means clustering algorithm with the number of centroids (k) = 3
result <- kmeans(iris.new, 3)

Previewing the number of records in each cluster
result$size

## [1] 50 39 61
Getting the cluster center values (3 centers for k = 3)
result$centers

## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 0.1961111 0.5950000 0.07830508 0.06083333
## 2 0.7072650 0.4508547 0.79704476 0.82478632
## 3 0.4412568 0.3073770 0.57571548 0.54918033
Getting the cluster vector that shows the cluster where each record falls
result$cluster

## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
## [112] 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 2 3 2
## [149] 2 3
Visualizing the clustering results
plot(iris[, 1:2], col = result$cluster)

par(mfrow = c(2, 2), mar = c(5, 4, 2, 2))

Plotting to see how the Sepal.Length and Sepal.Width data points have been distributed in clusters

plot(iris.new[c(1, 2)], col = result$cluster)

Plotting to see how the Sepal.Length and Sepal.Width data points are distributed originally as per the "Species" attribute in the data set

plot(iris.new[c(1, 2)], col = iris.class)

Plotting to see how the Petal.Length and Petal.Width data points have been distributed in clusters, and as per the "Species" attribute

plot(iris.new[c(3, 4)], col = result$cluster)
plot(iris.new[c(3, 4)], col = iris.class)

The table below shows that Cluster 1 corresponds to Setosa, Cluster 2 mostly to Virginica and Cluster 3 mostly to Versicolor.
table(result$cluster, iris.class)

## iris.class
## setosa versicolor virginica
## 1 50 0 0
## 2 0 3 36
## 3 0 47 14
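One simple way to quantify the agreement shown in this table is cluster purity: for each cluster, count the observations belonging to its majority species, then divide the total by the number of observations. A small sketch using the objects defined above (for this table, (50 + 36 + 47) / 150, roughly 0.89):

conf.mat <- table(result$cluster, iris.class)           # confusion matrix of clusters vs species
purity <- sum(apply(conf.mat, 1, max)) / sum(conf.mat)  # fraction of points matching their cluster's majority species
purity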
In order to improve this accuracy further, we will try a different value of k.
Applying the K-means clustering algorithm with the number of centroids (k) = 5
result <- kmeans(iris.new, 5)

Previewing the number of records in each cluster
result$size

## [1] 40 19 29 50 12
Getting the cluster center values (5 centers for k = 5)
result$centers

## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 0.5430556 0.3750000 0.65423729 0.63020833
## 2 0.6242690 0.4649123 0.76271186 0.89692982
## 3 0.3563218 0.2370690 0.50905903 0.47126437
## 4 0.1961111 0.5950000 0.07830508 0.06083333
## 5 0.8819444 0.4687500 0.89830508 0.81250000
Getting the cluster vector that shows the cluster where each record falls
result$cluster

## [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [38] 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 3 1 3 1 3 1 3 3 1 3 1 3 1 1 3 3 3 1 3 1 1
## [75] 1 1 1 1 1 3 3 3 3 1 3 1 1 3 3 3 3 1 3 3 3 3 3 1 3 3 2 1 5 1 2 5 3 5 1 5 2
## [112] 1 2 1 2 2 1 5 5 3 2 1 5 1 2 5 1 1 2 5 5 5 2 1 1 5 2 1 1 2 2 2 1 2 2 2 1 2
## [149] 2 1
Visualizing the clustering results
plot(iris[, 1:2], col = result$cluster)

par(mfrow = c(2, 2), mar = c(5, 4, 2, 2))

Plotting to see how the Sepal.Length and Sepal.Width data points have been distributed in clusters

plot(iris.new[c(1, 2)], col = result$cluster)

Plotting to see how the Sepal.Length and Sepal.Width data points are distributed originally as per the "Species" attribute in the data set

plot(iris.new[c(1, 2)], col = iris.class)

Plotting to see how the Petal.Length and Petal.Width data points have been distributed in clusters, and as per the "Species" attribute

plot(iris.new[c(3, 4)], col = result$cluster)
plot(iris.new[c(3, 4)], col = iris.class)

The table below shows that Cluster 4 corresponds to Setosa, Clusters 2 and 5 contain only Virginica, while Clusters 1 and 3 mix Versicolor and Virginica. Increasing k to 5 therefore splits the natural groups rather than improving the separation.
table(result$cluster, iris.class)

## iris.class
## setosa versicolor virginica
## 1 0 23 17
## 2 0 0 19
## 3 0 27 2
## 4 50 0 0
## 5 0 0 12
There are several metrics that can be used to evaluate the success of K-means clustering:
Within-cluster sum of squares (WCSS): This measures the sum of the squared distances between the data points in a cluster and the centroid of the cluster. A small WCSS indicates that the data points in the cluster are close to the centroid, and the cluster is therefore “tight”.
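In practice, kmeans() returns the WCSS as tot.withinss, and plotting it for a range of k values gives the familiar elbow curve. A hedged sketch (the range 1 to 10 and the 25 restarts are arbitrary choices):

set.seed(42)
wcss <- sapply(1:10, function(k) kmeans(iris.new, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wcss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")   # look for the "elbow" where the curve flattens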
Silhouette score: This measures how similar each data point is to its own cluster compared with the nearest neighbouring cluster. A high silhouette score indicates that the data points fit their assigned cluster well and are well-separated from other clusters.
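For example, the cluster package provides a silhouette() function that works directly with a kmeans cluster vector; a minimal sketch, assuming the cluster package is installed and result holds a kmeans fit as above:

library(cluster)
sil <- silhouette(result$cluster, dist(iris.new))   # silhouette width for every observation
mean(sil[, "sil_width"])                            # average silhouette width for the clustering
plot(sil)                                           # silhouette plot, one bar per observation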
Calinski-Harabasz index: This measures the ratio of between-cluster dispersion (how far the cluster centroids are from the centroid of the whole dataset) to within-cluster dispersion (how far the data points are from their own cluster centroid), each adjusted for its degrees of freedom. A high Calinski-Harabasz index indicates that the clusters are compact and well-separated from each other.
Dunn index: This measures the ratio of the minimum distance between points in different clusters to the maximum distance between points within the same cluster (the largest cluster diameter). A high Dunn index indicates that the clusters are well-separated from each other and compact internally.
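Both of these indices can be computed with the cluster.stats() function from the fpc package; a minimal sketch, assuming fpc is installed:

library(fpc)
d <- dist(iris.new)
k3 <- kmeans(iris.new, centers = 3, nstart = 25)
k5 <- kmeans(iris.new, centers = 5, nstart = 25)
stats3 <- cluster.stats(d, k3$cluster)
stats5 <- cluster.stats(d, k5$cluster)
c(ch_k3 = stats3$ch, ch_k5 = stats5$ch)          # Calinski-Harabasz index for each solution
c(dunn_k3 = stats3$dunn, dunn_k5 = stats5$dunn)  # Dunn index for each solution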
It is important to note that these metrics should not be used in isolation, as different metrics may be more or less relevant depending on the specific characteristics of the dataset. It is also important to consider the business context in which the clusters will be used, as the ultimate goal of clustering may be to solve a particular business problem.