K-Means Clustering

Author

David McCabe

Published

September 27, 2024

K-means clustering is an unsupervised machine learning algorithm that groups similar data points into distinct clusters. It begins by placing k cluster centers at random, where k is chosen in advance. The algorithm then alternates two steps: each data point is assigned to its nearest center, and each center is moved to the centroid of the points currently assigned to it. This repeats until the cluster centers stabilize. The resulting decision boundaries form a Voronoi diagram whose cells are generated by the cluster centers: every point in a cell is closer to that cell's center than to any other.
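The inner loop is simple enough to sketch directly. Below is a minimal, illustrative implementation (the function name naive_kmeans is hypothetical; it assumes a numeric matrix X and a fixed k, and it ignores edge cases such as empty clusters; in practice use stats::kmeans(), as we do below):

naive_kmeans <- function(X, k, max.iter = 100) {
  # start from k randomly chosen points as the initial centers
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  labels  <- integer(nrow(X))
  for (iter in seq_len(max.iter)) {
    # assignment step: label each point with its nearest center
    d2 <- sapply(seq_len(k), function(j) colSums((t(X) - centers[j, ])^2))
    new.labels <- max.col(-d2)  # argmin of each row
    if (identical(new.labels, labels)) break  # assignments stable, so centers are too
    labels <- new.labels
    # update step: move each center to the centroid of its assigned points
    for (j in seq_len(k))
      centers[j, ] <- colMeans(X[labels == j, , drop = FALSE])
  }
  list(cluster = labels, centers = centers)
}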

Basic Workflow

0. Here we’ve prepared a trivial example

head(d)
   cluster.classification          x            y cluster.prediction
                   <char>      <num>        <num>             <fctr>
1:                   blue  5.1783354  0.008128035                  1
2:                  green -0.3572169  3.938524299                  2
3:                   blue  2.0661709  0.709043159                  1
4:                    red -0.1929758  1.477317927                  2
5:                  green  0.8282841  1.965905018                  2
6:                   blue  2.6999990 -0.096870995                  1

We’ve prepared three clusters of points, each generated from a normal distribution…
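The generating code isn’t shown, but data of this shape could be produced along these lines (the means, spreads, and counts here are illustrative guesses, not the ones actually used):

library(data.table)

set.seed(1)
n <- 100  # points per cluster (illustrative)
d <- rbindlist(list(
  data.table(cluster.classification = "blue",  x = rnorm(n, 3, 1.5), y = rnorm(n, 0, 1)),
  data.table(cluster.classification = "green", x = rnorm(n, 0, 1),   y = rnorm(n, 3, 1)),
  data.table(cluster.classification = "red",   x = rnorm(n, 0, 1),   y = rnorm(n, 1, 1))
))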

library(data.table)
library(cluster)  # silhouette()
library(viridis)  # colour palettes

# encode the true labels as integers so silhouette() can use them
d[, cluster.classification := as.numeric(as.factor(cluster.classification))]

# silhouette of the true labels, as a baseline for comparison
sscore <- silhouette(d$cluster.classification, dist(d[, .(x, y)]))

plot(sscore, col = viridis(3), border = NA)

1. We perform K-Means clustering analysis

We specify the number of clusters via the centers argument. It is also possible to supply starting guesses for the centres, choose the algorithm, and cap the number of iterations…

cluster.analysis <- kmeans(d[, .(x, y)],
                           centers = 3,
                           algorithm = "Hartigan-Wong")

predicted.centers <- as.data.table(cluster.analysis$centers)
print(predicted.centers)
            x          y
        <num>      <num>
1:  4.0555958 -0.1444632
2:  0.0890872  3.0258692
3: -0.3297954 -0.3242243
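For reference, the optional arguments mentioned above might be used like this (the starting centres below are made up for illustration):

# explicit starting centres, one row per cluster, plus a higher iteration cap
start.centers <- matrix(c(4, 0,
                          0, 3,
                          0, 0), ncol = 2, byrow = TRUE)
cluster.analysis <- kmeans(d[, .(x, y)],
                           centers   = start.centers,
                           iter.max  = 25,
                           algorithm = "Hartigan-Wong")

# when centers is a number rather than a matrix, nstart repeats the
# clustering from several random starts and keeps the best run:
# kmeans(d[, .(x, y)], centers = 3, nstart = 25)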

2. Add clustering prediction back into data.table

It’s handy to store the predictions alongside the original input data…

# add cluster prediction to data.table (:= modifies d in place)
d[, cluster.prediction := factor(cluster.analysis$cluster)]
head(d)
   cluster.classification          x            y cluster.prediction
                    <num>      <num>        <num>             <fctr>
1:                      1  5.1783354  0.008128035                  1
2:                      2 -0.3572169  3.938524299                  2
3:                      1  2.0661709  0.709043159                  1
4:                      3 -0.1929758  1.477317927                  2
5:                      2  0.8282841  1.965905018                  2
6:                      1  2.6999990 -0.096870995                  1
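Because kmeans() numbers its clusters arbitrarily, the predicted label need not match the original class label even for a perfect clustering. A quick cross-tabulation makes the mapping (and any misclassifications) visible:

# rows: original classification, columns: predicted cluster;
# a clean result has one dominant count per row, up to label permutation
table(d$cluster.classification, d$cluster.prediction)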

3. You’re done…

4. Compute Silhouette Score

For each observation \(i\), the silhouette width \(s(i)\) is a measure of parent cluster affinity and is given by: \[\boxed{s(i)\equiv\frac{b(i)-a(i)}{\text{max}(a(i),b(i))}}\] where…

  • \(a(i)\) is the average dissimilarity between \(i\) and all other points in the parent cluster \(A\) to which \(i\) belongs — generally computed as the average Euclidean distance to all other points in the parent cluster.
    • small for inliers (high cohesion with the parent cluster) - Think of a small circular boundary centered on a point well within the cluster
    • larger for outliers (low cohesion with the parent cluster) - Think of a large circle centered on a point at the very edge of the cluster
  • \(b(i)\) is the smallest average dissimilarity between \(i\) and the points of any cluster to which \(i\) does not belong — that is, the average distance to all points of the nearest foreign cluster \(B\)
    • smaller for a point stretching toward the foreign cluster (low separation) - Think of a point at the edge of a cluster in the direction of the foreign cluster, almost midway between the parent and foreign clusters, so its classification is more uncertain
    • larger for a point far from the foreign cluster (high separation) - Think of a point at the extreme edge of a cluster, on the far side, as distant as possible from the foreign cluster
  • the numerator \(b(i)-a(i)\) is a measure of parent cluster affinity: it combines foreign-cluster dissimilarity with parent-cluster similarity
  • the denominator \(\text{max}(a(i),b(i))\) is a normalisation factor that bounds \(s(i)\) to \([-1, 1]\); a hand-rolled computation is sketched below
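To make the definition concrete, here is a sketch of that computation; it assumes the numeric coordinates x, y and the predicted labels from step 2, it ignores singleton clusters (where \(s(i)\) is defined as 0), and cluster::silhouette() below remains the proper tool:

# hand-rolled silhouette widths for the predicted clusters
D    <- as.matrix(dist(d[, .(x, y)]))
labs <- as.integer(d$cluster.prediction)
n    <- nrow(d)
s <- sapply(seq_len(n), function(i) {
  a <- mean(D[i, labs == labs[i] & seq_len(n) != i])  # a(i): own-cluster cohesion
  b <- min(sapply(setdiff(unique(labs), labs[i]),     # b(i): nearest foreign cluster
                  function(k) mean(D[i, labs == k])))
  (b - a) / max(a, b)
})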
# silhouette() expects integer cluster labels, so drop the factor
d[, cluster.prediction := as.numeric(cluster.prediction)]

sscore <- silhouette(d$cluster.prediction, dist(d[, .(x, y)]))

plot(sscore, col = viridis(3), border = NA)
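For a single headline number alongside the plot, the average silhouette width can be pulled from the silhouette object (a matrix with columns cluster, neighbor, and sil_width):

# overall average silhouette width; closer to 1 means better-separated clusters
summary(sscore)$avg.width
mean(sscore[, "sil_width"])  # equivalent, averaging the per-point widths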

The term silhouette comes from the plot itself: within each cluster the bars are sorted by width, and the outline traced by the ends of the bars forms a silhouette-like shape.