GOAL: The purpose of clustering is to find natural groupings of data points in a data set such that points in the same cluster are more similar to each other than to points in different clusters.

Clustering can be an end goal of a project, or the clustering results can be used in further research to explore the causes of the similarities and dissimilarities. It is also a great exploration technique during the exploratory data analysis phase.

Process:

  1. Decide the number of clusters, K.
  2. Randomly assign the data points to these K clusters.
  3. Calculate the centroid of each cluster, which is the mean of its observations in p dimensions.
  4. Calculate the distance of each data point to each centroid, based on the distance metric selected (Euclidean distance, Manhattan distance, etc.).
  5. Re-assign each data point to its nearest centroid.
  6. If any assignments changed, repeat from step 3; otherwise stop. (A minimal sketch of this loop is shown after the list.)
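To make the steps above concrete, here is a minimal from-scratch sketch of the loop (my illustration only; it is not how R's kmeans() is implemented, and it does not handle clusters that become empty):

naive.kmeans = function(x, K, max.iter = 100){
        x = as.matrix(x)
        # step 2: random initial assignment of points to K clusters
        assignment = sample(1:K, nrow(x), replace = TRUE)
        for (iter in 1:max.iter){
                # step 3: centroid of each cluster = mean of its points in p dimensions
                centroids = t(sapply(1:K, function(k) colMeans(x[assignment == k, , drop = FALSE])))
                # step 4: Euclidean distance of every point to every centroid
                d = sapply(1:K, function(k) sqrt(colSums((t(x) - centroids[k, ])^2)))
                # step 5: re-assign each point to its nearest centroid
                new.assignment = apply(d, 1, which.min)
                # step 6: stop when no assignment changed
                if (all(new.assignment == assignment)) break
                assignment = new.assignment
        }
        list(cluster = assignment, centers = centroids)
}

Once the scaled data is built below, this could be run as naive.kmeans(scaled.p2, 3); in practice the built-in kmeans() is used instead.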

For categorical variables, K-means (which averages values) does not apply directly; the K-medoids method can be used instead. A medoid is the most centrally located observation of a cluster, so in K-medoids the cluster centers are actual data points, which allows any dissimilarity measure (not just Euclidean distance) to be used.
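As a minimal sketch (my illustration, with a made-up toy data frame), pam() from the cluster package runs K-medoids, and daisy() with the Gower metric produces dissimilarities that handle factor (categorical) columns:

library(cluster)

toy = data.frame(size  = c(1.2, 0.9, 5.1, 4.8),
                 color = factor(c("red", "red", "blue", "blue")))

d.gower = daisy(toy, metric = "gower")  # dissimilarities that handle mixed/categorical data
pam.fit = pam(d.gower, k = 2)           # K-medoids on those dissimilarities
pam.fit$medoids                         # the observations chosen as medoids
pam.fit$clustering                      # cluster assignment for each row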

setwd("F:\\DATA SCIENCE\\PENN STATE\\Course work\\IE 575 - Predictive Analytics\\Lesson-11_K-means_clustering")
p2 = read.csv("IE575_HW11_p2.csv")
head(p2)
##    Area Perimeter Compactness LengthOfKernel WidthOfKernel AsymmetryCoeff
## 1 15.26     14.84      0.8710          5.763         3.312          2.221
## 2 14.88     14.57      0.8811          5.554         3.333          1.018
## 3 14.29     14.09      0.9050          5.291         3.337          2.699
## 4 13.84     13.94      0.8955          5.324         3.379          2.259
## 5 16.14     14.99      0.9034          5.658         3.562          1.355
## 6 14.38     14.21      0.8951          5.386         3.312          2.462
##   LengthOfKernelGroove
## 1                5.220
## 2                4.956
## 3                4.825
## 4                4.805
## 5                5.175
## 6                4.956
dim(p2)
## [1] 210   7
str(p2)
## 'data.frame':    210 obs. of  7 variables:
##  $ Area                : num  15.3 14.9 14.3 13.8 16.1 ...
##  $ Perimeter           : num  14.8 14.6 14.1 13.9 15 ...
##  $ Compactness         : num  0.871 0.881 0.905 0.895 0.903 ...
##  $ LengthOfKernel      : num  5.76 5.55 5.29 5.32 5.66 ...
##  $ WidthOfKernel       : num  3.31 3.33 3.34 3.38 3.56 ...
##  $ AsymmetryCoeff      : num  2.22 1.02 2.7 2.26 1.35 ...
##  $ LengthOfKernelGroove: num  5.22 4.96 4.83 4.8 5.17 ...
# standardize the variables so that no single variable dominates the distance calculations
mean.p2 = apply(p2, 2, mean)
sd.p2 = apply(p2, 2, sd)

scaled.p2 = scale(p2, mean.p2, sd.p2)
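As a quick sanity check (my addition; output not shown), each scaled column should now have mean 0 and standard deviation 1:

round(apply(scaled.p2, 2, mean), 3)  # all approximately 0 after centering
apply(scaled.p2, 2, sd)              # all exactly 1 after scaling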

Three measures of clustering validation:

  1. Cohesion: a measure of similarity between data points within a cluster, measured by the within-cluster sum of squared errors (WSS). The lower, the better.
  2. Separation: a measure of distinction between clusters, measured by the between-cluster sum of squares (BSS). The higher, the better.
  3. Silhouette coefficient: a measure that combines both cohesion and separation. It is a coefficient from -1 to 1, with 1 being the best. (Its formula is given after the list.)
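For reference (this formula is not part of the original notes), the silhouette of a single point i is

  s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the average distance from i to the other points in its own cluster (cohesion) and b(i) is the average distance from i to the points in the nearest other cluster (separation). The silhouette coefficient of a clustering is the average of s(i) over all points.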

Finding the best K, where WSS is optimal relative to the number of clusters (the "elbow" approach):

set.seed(1000)

wss = (nrow(scaled.p2)-1) * sum(apply(scaled.p2,2,var))
wss # this is the maximum WSS possible for this dataset, which we get when K = 1 (the total sum of squares)
## [1] 1463
wss = rep(0,10)
# total within-cluster sum of squares for K = 1 to 10, each with 50 random starts
for (i in 1:10){
        kmeans.model = kmeans(scaled.p2, i, nstart = 50)
        wss[i] = kmeans.model$tot.withinss
}

plot(1:10, wss, type="b",
     xlab = "Number of Clusters",
     ylab = "within cluster sum of squares")

This plot has the number of clusters (1 to 10) on the x-axis and the within-cluster sum of squares on the y-axis.

The "within-cluster sum of squares" measures the similarity of the data points within the same cluster: how close the points in a cluster are to each other.
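In symbols (my notation, not from the original notes), for clusters C_1, ..., C_K with centroids c_1, ..., c_K:

  WSS = sum over k of ( sum over points x in C_k of ||x - c_k||^2 )

i.e., the total squared distance of every point to the centroid of its own cluster.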

It is intuitive that as you increase the number of clusters, the data points within each cluster become closer together, since the within-cluster variability is reduced. The disadvantage is that too many clusters reduce interpretability.

This phenomenon shows up in the plot: as the number of clusters increases, the within-cluster variation decreases. We want low variation, meaning similar data points within each cluster.

As you can see, for the first few clusters the drop in variability is large; after that the drop shrinks, and adding further clusters no longer buys much reduction in variability.

The point that achieves a good balance between the reduction in variability and the number of clusters (the "elbow" of the curve) is a good choice for K (the number of clusters).

In our plot, the elbow is at K = 3: WSS has already dropped sharply by then, and the additional advantage of more clusters is not that big.

set.seed(1000)

bss = rep(0,10)

for (i in 1:10){
        kmeans.model = kmeans(scaled.p2, i, nstart = 50)
        bss[i] = kmeans.model$betweenss
}

plot(1:10, bss, type="b",
     xlab = "Number of Clusters",
     ylab = "Between cluster sum of squares")

Likewise, at K = 3 the between-cluster sum of squares has already grown sharply; beyond that, the additional advantage is not that big, which again points to K = 3.
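One way to combine the two views (my addition, using the wss and bss vectors computed above) is the proportion of the total sum of squares explained by the clustering; for a given dataset, bss + wss is constant, so this ratio rises toward 1 as K grows:

explained = bss / (bss + wss)  # proportion of total variance explained, for K = 1..10
round(explained, 3)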

# calculating the Euclidean distance matrix between all observations
set.seed(1000)

distance = dist(scaled.p2)

library(cluster)

plot(silhouette(kmeans(scaled.p2, centers = 3, nstart = 50)$cluster, distance))
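The plot above shows the silhouette for K = 3. As a further check (my addition, a sketch using the same cluster package), the average silhouette width can be compared across values of K, with the largest average suggesting the best K:

avg.sil = rep(0, 10)
for (i in 2:10){        # the silhouette is undefined for K = 1
        km = kmeans(scaled.p2, centers = i, nstart = 50)
        avg.sil[i] = mean(silhouette(km$cluster, distance)[, "sil_width"])
}

plot(2:10, avg.sil[2:10], type="b",
     xlab = "Number of Clusters",
     ylab = "Average silhouette width")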

Running k-means clustering with K = 3 ………

Run it multiple times and keep the solution with the minimum WSS and maximum BSS (this is what nstart = 50 does: 50 random starts, reporting the best one).
set.seed(1000)

kmeans.model = kmeans(p2, centers = 3, nstart = 50) # fitted on the unscaled p2 here, so the centroids below are in the original units
wss = kmeans.model$tot.withinss
bss = kmeans.model$betweenss
wss
## [1] 587.3186
bss
## [1] 2132.534
# centroids from the kmeans model:
centroids = kmeans.model$centers
centroids
##       Area Perimeter Compactness LengthOfKernel WidthOfKernel
## 1 18.72180  16.29738   0.8850869       6.208934      3.722672
## 2 11.96442  13.27481   0.8522000       5.229286      2.872922
## 3 14.64847  14.46042   0.8791667       5.563778      3.277903
##   AsymmetryCoeff LengthOfKernelGroove
## 1       3.603590             6.066098
## 2       4.759740             5.088519
## 3       2.648933             5.192319
p2$cluster = kmeans.model$cluster
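The resulting cluster sizes can be checked quickly (my addition; output not shown):

kmeans.model$size  # number of observations in each cluster
table(p2$cluster)  # the same counts, from the column just added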

library(reshape2)

# calculating centroids manually:
m = melt(p2, id.vars = c("cluster"))
p2.centroids = dcast(m, cluster ~ variable, mean)
p2.centroids
##   cluster     Area Perimeter Compactness LengthOfKernel WidthOfKernel
## 1       1 18.72180  16.29738   0.8850869       6.208934      3.722672
## 2       2 11.96442  13.27481   0.8522000       5.229286      2.872922
## 3       3 14.64847  14.46042   0.8791667       5.563778      3.277903
##   AsymmetryCoeff LengthOfKernelGroove
## 1       3.603590             6.066098
## 2       4.759740             5.088519
## 3       2.648933             5.192319

Above are the 3 centroids for the 3 clusters, recomputed manually from the cluster assignments; they match the centers reported by kmeans(). Because this final model was fit on the unscaled data, the centroids are in the original units, and since this is 7-dimensional data, each centroid is also 7-dimensional.
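The same centroid table can also be obtained without reshape2, for example with base R's aggregate (an equivalent sketch):

aggregate(. ~ cluster, data = p2, FUN = mean)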

Limitations of k-means clustering:

  1. The number of iterations required to converge cannot be determined in advance.
  2. A poor choice of initial centroids may result in suboptimal cluster assignments. (Solutions: (1) performing multiple runs, e.g., via nstart; (2) choosing the initial centroids using a hierarchical approach, sketched after this list.)
  3. It cannot handle noisy data (outliers) well.
  4. It cannot handle missing data.
  5. It is not suitable for discovering clusters with non-convex (irregular) or concentric (same-center) shapes.
  6. It does not work well with clusters of very unequal sizes.
  7. It is not robust to changes in the data set; even minor changes can alter the cluster assignments.
  8. Every data point is forced to be part of exactly one cluster; no overlap is allowed.
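A minimal sketch of the hierarchical initialization mentioned in point 2 (my illustration; Ward linkage is an arbitrary choice here): cut a hierarchical clustering at K = 3, use the group means as starting centers, and pass them to kmeans().

hc = hclust(dist(scaled.p2), method = "ward.D2")
groups = cutree(hc, k = 3)

# group means of the scaled variables form the initial centers (a 3 x 7 matrix)
init.centers = apply(scaled.p2, 2, function(col) tapply(col, groups, mean))

kmeans.hc = kmeans(scaled.p2, centers = init.centers)
kmeans.hc$tot.withinss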