Clustering can be an end goal of a project, or the clustering results can be used for further research to explore the causes of the similarities and dissimilarities. It is also a great data exploration technique during the exploratory data analysis phase.
Process (a rough from-scratch sketch of these steps follows the list):
- Decide the number of clusters K.
- Randomly assign the data points to these K clusters.
- Calculate the centroid of each cluster, which is the mean of its observations in p dimensions.
- Calculate the distance of each data point to each centroid using the selected distance metric (Euclidean distance, Manhattan distance, etc.).
- Re-assign each data point to its nearest centroid.
- If any assignment changed, repeat from the centroid-calculation step; otherwise stop.
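The sketch below translates those steps into rough R code. It is illustrative only (the function name my.kmeans is mine, and empty-cluster handling is omitted); in practice we use the built-in kmeans(), as in the rest of this page.
# rough from-scratch sketch of the steps above (illustrative only)
my.kmeans = function(x, k, max.iter = 100) {
  x = as.matrix(x)
  # step 2: randomly assign the points to k clusters
  assign = sample(rep(1:k, length.out = nrow(x)))
  for (iter in 1:max.iter) {
    # step 3: centroid of each cluster = column means of its members
    centroids = t(sapply(1:k, function(j) colMeans(x[assign == j, , drop = FALSE])))
    # steps 4-5: squared Euclidean distance of every point to every centroid,
    # then re-assign each point to its nearest centroid
    d = sapply(1:k, function(j) colSums((t(x) - centroids[j, ])^2))
    new.assign = apply(d, 1, which.min)
    # step 6: stop when no assignment changes (no handling of empty clusters here)
    if (all(new.assign == assign)) break
    assign = new.assign
  }
  list(cluster = assign, centers = centroids)
}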
For categorical (or mixed) variables we can use the K-medoids method instead. A medoid is an actual data point in the cluster, chosen so that its total dissimilarity to all other points in the cluster is minimal; because it only needs a dissimilarity measure rather than a mean, it works for non-numeric data.
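As a minimal sketch of k-medoids (PAM) on categorical/mixed data: the tiny data frame below is hypothetical, purely to illustrate the calls; daisy() with the Gower dissimilarity handles mixed variable types, and pam() then picks actual rows as medoids.
library(cluster)
# hypothetical mixed data, purely for illustration
toy = data.frame(color = factor(c("red", "red", "blue", "blue", "green", "green")),
                 size  = factor(c("S", "M", "S", "L", "L", "M")),
                 price = c(1.2, 1.4, 3.1, 3.3, 5.0, 5.2))
d = daisy(toy, metric = "gower")   # Gower dissimilarity works for mixed types
pam.model = pam(d, k = 3, diss = TRUE)
pam.model$id.med       # row indices of the observations chosen as medoids
pam.model$clustering   # cluster assignment for each row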
setwd("F:\\DATA SCIENCE\\PENN STATE\\Course work\\IE 575 - Predictive Analytics\\Lesson-11_K-means_clustering")
p2 = read.csv("IE575_HW11_p2.csv")
head(p2)
## Area Perimeter Compactness LengthOfKernel WidthOfKernel AsymmetryCoeff
## 1 15.26 14.84 0.8710 5.763 3.312 2.221
## 2 14.88 14.57 0.8811 5.554 3.333 1.018
## 3 14.29 14.09 0.9050 5.291 3.337 2.699
## 4 13.84 13.94 0.8955 5.324 3.379 2.259
## 5 16.14 14.99 0.9034 5.658 3.562 1.355
## 6 14.38 14.21 0.8951 5.386 3.312 2.462
## LengthOfKernelGroove
## 1 5.220
## 2 4.956
## 3 4.825
## 4 4.805
## 5 5.175
## 6 4.956
dim(p2)
## [1] 210 7
str(p2)
## 'data.frame': 210 obs. of 7 variables:
## $ Area : num 15.3 14.9 14.3 13.8 16.1 ...
## $ Perimeter : num 14.8 14.6 14.1 13.9 15 ...
## $ Compactness : num 0.871 0.881 0.905 0.895 0.903 ...
## $ LengthOfKernel : num 5.76 5.55 5.29 5.32 5.66 ...
## $ WidthOfKernel : num 3.31 3.33 3.34 3.38 3.56 ...
## $ AsymmetryCoeff : num 2.22 1.02 2.7 2.26 1.35 ...
## $ LengthOfKernelGroove: num 5.22 4.96 4.83 4.8 5.17 ...
# standardize each variable (mean 0, sd 1) so that no single variable
# dominates the distance calculation
mean.p2 = apply(p2, 2, mean)
sd.p2 = apply(p2, 2, sd)
scaled.p2 = scale(p2, mean.p2, sd.p2)
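As a quick sanity check on the objects created above, every scaled column should now have mean approximately 0 and standard deviation 1:
round(colMeans(scaled.p2), 10)   # ~0 for every column
apply(scaled.p2, 2, sd)          # 1 for every column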
Finding the best K, i.e. the point where the within-cluster sum of squares (WSS) stops improving much:
set.seed(1000)
wss = (nrow(scaled.p2) - 1) * sum(apply(scaled.p2, 2, var))
wss  # the total sum of squares: the maximum WSS for this dataset, obtained when K = 1
## [1] 1463
wss = rep(0, 10)
for (i in 1:10){
  kmeans.model = kmeans(scaled.p2, i, nstart = 50)
  wss[i] = kmeans.model$tot.withinss
}
plot(1:10, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "within cluster sum of squares")

This plot has the number of clusters (1 to 10) on the x-axis and the within-cluster sum of squares (WSS) on the y-axis.
WSS measures how similar the data points within the same cluster are: the smaller the WSS, the closer the points in a cluster are to their centroid.
As you increase the number of clusters, the points within each cluster are bound to get closer together because the within-cluster variability shrinks; the disadvantage is that too many clusters reduce interpretability.
That is exactly what this plot shows: as the number of clusters increases, the within-cluster variation decreases. We want low variation, meaning similar data points within each cluster.
For the first few clusters the drop in variability is large; after that the drops shrink, and adding more clusters no longer buys much reduction in variability.
The point that balances the reduction in variability against the number of clusters is a good choice for K (the number of clusters).
In our plot that elbow is at K = 3. The WSS keeps decreasing beyond K = 3 (it is always smallest at the largest K), but the additional gain is no longer worth the extra clusters.
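The same elbow can be read off numerically from the wss vector computed above by looking at how much the WSS drops each time a cluster is added; the drops after K = 3 are much smaller than the ones before it.
round(-diff(wss), 1)                       # absolute drop in WSS going from K to K+1
round(-diff(wss) / wss[-length(wss)], 3)   # drop as a fraction of the previous WSS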
set.seed(1000)
bss = rep(0, 10)
for (i in 1:10){
  kmeans.model = kmeans(scaled.p2, i, nstart = 50)
  bss[i] = kmeans.model$betweenss
}
plot(1:10, bss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Between cluster sum of squares")

Similarly, by K = 3 the between-cluster sum of squares (BSS) has already captured most of the possible separation; BSS keeps increasing beyond K = 3, but the additional gain is small.
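This is the mirror image of the WSS plot: for a fixed data set, total sum of squares = WSS + BSS, so any K with a low WSS automatically has a high BSS. A quick check on the kmeans output confirms the identity:
set.seed(1000)
km3 = kmeans(scaled.p2, centers = 3, nstart = 50)
km3$totss                          # total sum of squares (fixed for the data set)
km3$tot.withinss + km3$betweenss   # equals totss (up to rounding)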
# calculating the Euclidean distance matrix and plotting the silhouette for K = 3
set.seed(1000)
distance = dist(scaled.p2)
library(cluster)
plot(silhouette(kmeans(scaled.p2, centers = 3, nstart = 50)$cluster, distance))
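The silhouette plot shows, for each observation, how well it fits its own cluster compared with the next closest cluster (values near 1 are good, values near 0 or negative are poor). Averaging the silhouette width over all observations for several values of K is another way to choose K, as sketched below (reusing the distance object computed above; a higher average width is better):
set.seed(1000)
avg.sil = rep(0, 9)
for (i in 2:10) {   # the silhouette is not defined for K = 1
  km = kmeans(scaled.p2, centers = i, nstart = 50)
  avg.sil[i - 1] = mean(silhouette(km$cluster, distance)[, "sil_width"])
}
plot(2:10, avg.sil, type = "b",
     xlab = "Number of Clusters",
     ylab = "Average silhouette width")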

Running k-means clustering with K = 3.
Run the algorithm multiple times and keep the solution with the minimum WSS and maximum BSS; the nstart argument below takes care of this.
set.seed(1000)
kmeans.model = kmeans(p2, centers = 3, nstart = 50)  # note: this final model is fit on the original (unscaled) data
wss = kmeans.model$tot.withinss
bss = kmeans.model$betweenss
wss
## [1] 587.3186
bss
## [1] 2132.534
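On the point above about running the algorithm multiple times: nstart = 50 already does this internally, rerunning k-means from 50 random starts and keeping the solution with the lowest total WSS. A small sketch of the difference (the exact numbers depend on the seed):
set.seed(1)
kmeans(p2, centers = 3, nstart = 1)$tot.withinss    # a single random start
set.seed(1)
kmeans(p2, centers = 3, nstart = 50)$tot.withinss   # best of 50 random starts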
# centroids from the kmeans model:
centroids = kmeans.model$centers
centroids
## Area Perimeter Compactness LengthOfKernel WidthOfKernel
## 1 18.72180 16.29738 0.8850869 6.208934 3.722672
## 2 11.96442 13.27481 0.8522000 5.229286 2.872922
## 3 14.64847 14.46042 0.8791667 5.563778 3.277903
## AsymmetryCoeff LengthOfKernelGroove
## 1 3.603590 6.066098
## 2 4.759740 5.088519
## 3 2.648933 5.192319
p2$cluster = kmeans.model$cluster
library(reshape2)
# calculating centroids manually:
m = melt(p2, id.vars = c("cluster"))
p2.centroids = dcast(m, cluster ~ variable, mean)
p2.centroids
## cluster Area Perimeter Compactness LengthOfKernel WidthOfKernel
## 1 1 18.72180 16.29738 0.8850869 6.208934 3.722672
## 2 2 11.96442 13.27481 0.8522000 5.229286 2.872922
## 3 3 14.64847 14.46042 0.8791667 5.563778 3.277903
## AsymmetryCoeff LengthOfKernelGroove
## 1 3.603590 6.066098
## 2 4.759740 5.088519
## 3 2.648933 5.192319
Above are the 3 centroids for the 3 clusters. Because this final model was fit on the original (unscaled) data, the centroids are in the original units of each variable. Since the data set has 7 variables, each centroid is a 7-dimensional point.
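The same per-cluster means can also be computed with base R's aggregate() instead of reshape2, and the size component of the kmeans object gives the number of observations in each cluster:
aggregate(. ~ cluster, data = p2, FUN = mean)   # one row of per-variable means per cluster
kmeans.model$size                               # number of observations in each cluster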