GOAL: The purpose of clustering is to find natural grouping of data points in data set such that points in same cluster are more similar to each other than points in different cluster.
clustering can be an end goal of the project or results of clustering can be used for further research to explore the causes of similarities and dissimilarities. also it is a great data exploratoration technique in exploratory data analysis phase.
Process:
- decide no. of clusters K.
- randomly assign data points to these clusters.
- calculate the centroids of each cluster which is mean of observations in p-dimension.
- calculate distance of each data-point to each centroid based distance matrix selected (eucledian distance, manhattan distance, etc.)
- re-assign each data point to their nearest centroid.
- if changes in assigment, repeat from step-3, OR stop.
for categorical variables, we have to use K-medoids method. medoids are mode of a set. in k-medoids method, medoids are data-points itself.
IRIS dataset
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris2 = iris[,-5]
mean.iris = apply(iris2, 2, mean)
sd.iris = apply(iris2, 2, sd)
scaled.iris = scale(iris2, mean.iris, sd.iris)
3 methods of Clustering validation:
- Cohesion: measure of similarity between data-points in a cluster. measured by within cluster sum of square errors. minimum the better.
- Separation: measure of distinction between clusters. measured by between clusters sum of squares. Higher the better.
- Silhouette coefficient: a measure of combination of both cohesion and separation. a coefficient from 0 to 1. 1 being the best.
finding best K where wss is optimum
wss = (nrow(scaled.iris)-1) * sum(apply(scaled.iris,2,var))
wss# this is max wss possible for this dataset which we get when K=1.
## [1] 596
for (i in 1:15){
kmeans.model2 = kmeans(scaled.iris, i,nstart = 50)
wss[i] = kmeans.model2$tot.withinss
}
plot(1:15, wss, type="b",
xlab = "Number of Clusters",
ylab = "within cluster sum of squares")
at K=3 we get best clustering with minimum wss. after that the advantage is not that big.
bss = rep(0,15)
for (i in 1:15){
kmeans.model2 = kmeans(scaled.iris, i, nstart = 50)
bss[i] = kmeans.model2$betweenss
}
plot(1:15, bss, type="b",
xlab = "Number of Clusters",
ylab = "Between cluster sum of squares")
at K=3 we get best clustering with maximum bss. after that the advantage is not that big.
# calculating euclidean distance
distance = dist(scaled.iris)
library(cluster)
plot(silhouette(kmeans(scaled.iris, centers = 3, nstart = 50)$cluster, dist(scaled.iris)))
running k-means clustering with K=3 ………
Run multiple times and keep the one with minimum wss and aximum bss.
set.seed(1)
kmeans.model = kmeans(scaled.iris, centers = 3, nstart = 50)
wss = kmeans.model$tot.withinss
bss = kmeans.model$betweenss
wss
## [1] 138.8884
bss
## [1] 457.1116
plot(iris2$Petal.Length, iris2$Petal.Width,
col=kmeans.model$cluster,
cex=2,
pch=19,
xlab = "petal.length",
ylab = "petal.width")
points(kmeans.model$centers, pch=3, cex=5, col=1:3, lwd=4)
plot of Species type from original data.
plot(iris$Petal.Length, iris$Petal.Width,
col=iris$Species,
cex=2,
pch=19,
xlab = "petal.length",
ylab = "petal.width")
Limitations of k-means clustering:
- no. of iterations required to cnverge cannot be determined.
- poor choice of initial centroids may result in suboptimal cluster assigments.(solutions (1) performing multiple runs (2) choosing the centroids using heirarchical approach)
- cannot handle noisy data well.
- cannot handle missign data.
- not suitable to discover clusters with non-convex (irregular) or concentric (same center) shapes.
- does not work well with unequal sized clusters.
- not robust to changes in data set. even minor changes results in changes in cluster changes.
- every data point is forced to be a part of only one cluster, no overlap allowed.