4.2 Partitioning methods

Introduction

  • Partitioning methods take a different approach to finding clusters than hierarchical clustering: the number of clusters is fixed in advance, and points are iteratively reassigned.
  • First step: specify \(k\).
  • Randomly select \(k\) centroids.
  • Assign each point to the cluster whose centroid it is closest to.
  • Re-compute the centroids.
  • Iterate until convergence.
  • Two important optimization methods we will consider are k-means and partitioning around medoids (PAM).

The K-means clustering algorithm

  • First step: specify \(k\). Suppose we set \(k=4\).
  • Randomly select \(k\) points in \(p-\)dimensional space. These are the initial “centroids.”

Assign

  • Next step: Assign points to the centroid they are closest to

Recompute centroids

  • Next step: re-compute centroids as mean vectors of new clusters

Reassign points

  • Reassign points to nearest new centroid

Recompute centroids

  • Re-compute centroids


Convergence

  • Algorithm converges when:
    • Cluster assignments don’t change
    • Centroids stabilize (a from-scratch sketch of the full loop follows below)
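
To make the loop concrete, here is a minimal from-scratch sketch of k-means (illustrative only; use kmeans() in practice, and simple_kmeans and its internals are our own invention):

simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Randomly select k data points as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignments <- rep(0, nrow(x))
  for (i in seq_len(max_iter)) {
    # Assign each point to the nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_assignments <- apply(d, 1, which.min)
    # Converged: cluster assignments don't change
    if (all(new_assignments == assignments)) break
    assignments <- new_assignments
    # Re-compute centroids as mean vectors of the new clusters
    # (empty clusters are not handled -- this is only a sketch)
    centroids <- apply(x, 2, function(col) tapply(col, assignments, mean))
  }
  list(cluster = assignments, centers = centroids)
}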

Notes on k-means

  • Initial centroids matter!! It's a good idea to run the k-means algorithm multiple times to make sure it converges to the same solution.
  • K-means is good at finding convex clusters; will struggle identifying non-convex clusters
    • Convex - any two points in the cluster can be connected by a straight line segment that does not leave the cluster
    • Aka “spherical”
  • Only works for numeric data where Euclidean distance makes sense
    • Make sure to scale!! (see the toy example below)
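
A toy example of why scaling matters, with made-up values echoing the drug data used later in this section: a variable measured in dollars will swamp one measured in percentage points unless both are standardized.

toy <- data.frame(Income = c(22946, 28523, 25307),
                  HSdrop = c(12.6, 10.9, 14.4))
dist(toy)         # Euclidean distances driven almost entirely by Income
dist(scale(toy))  # after scaling, both variables contribute comparably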

K-means in R

  • kmeans(x, centers, iter.max = 10, nstart = 1)
    • x: a scaled numeric data matrix (does not work with distance object)
    • centers: integer value for number of clusters
    • iter.max = 10: maximum number of iterations to perform; the default of 10 is often too few, so we recommend increasing it
    • nstart = 1: number of random starting configurations; we recommend increasing this as well (see the example call below)
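
A quick illustration of those arguments, run on the built-in iris measurements (a sketch; the object names are our own):

iris_scaled <- scale(iris[, 1:4])  # numeric columns only, scaled first
km <- kmeans(iris_scaled, centers = 3, iter.max = 50, nstart = 25)
km$cluster       # cluster assignment for each observation
km$centers       # final centroids (in the scaled space)
km$tot.withinss  # total within-cluster sum of squares (WSS)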

Partitioning around medoids (PAM)

  • PAM works very similarly to k-means.
  • Rather than starting from arbitrary initial centroids, it selects \(k\) medoids: actual data points that serve as cluster centers.
  • It then iterates between choosing new medoids and reassigning points to clusters.

K-means vs PAM

Criterion            k-means                                                       PAM
Cluster center       Centroid (mean of points)                                     Medoid (an actual data point)
Optimization goal    Minimize sum of squared Euclidean distances to the centroid   Minimize sum of dissimilarities between points and their medoid
Speed                Fast                                                          Slow
Robust to outliers   No                                                            Yes
Distance metric      Euclidean only                                                Any
Cluster shapes       Spherical/convex                                              Spherical/convex

PAM in R

  • Must load the cluster package.
  • pam(x, k, nstart)
    • x: a data matrix or a dissimilarity object!
    • k: number of clusters to identify
    • nstart: number of random initial medoid sets used to start the algorithm
  • A sketch showing both input types follows below.
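
Unlike kmeans(), pam() also accepts a dissimilarity object, so distances other than Euclidean work. A short sketch of both input types, again using the built-in iris measurements (object names are our own):

library(cluster)
# On a raw (scaled) data matrix: Euclidean dissimilarities by default
pam_fit <- pam(scale(iris[, 1:4]), k = 3, nstart = 10)
pam_fit$medoids     # the k medoids are actual rows of the data

# On a dissimilarity object: any metric dist() (or cluster::daisy()) can produce
d <- dist(scale(iris[, 1:4]), method = "manhattan")
pam_fit2 <- pam(d, k = 3)
pam_fit2$clustering # cluster assignment for each observation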

multishapes data

  • The multishapes data set is in the factoextra package:
library(factoextra)
head(multishapes)
           x          y shape
1 -0.8037393 -0.8530526     1
2  0.8528507  0.3676184     1
3  0.9271795 -0.2749024     1
4 -0.7526261 -0.5115652     1
5  0.7068462  0.8106792     1
6  1.0346985  0.3946550     1
ggplot(data = multishapes, aes(x = x, y = y)) +
  geom_point() + 
  theme_classic(base_size = 16)
  • There appear to be 5 clusters:
    • 2 non-convex rings
    • 2 convex linear clusters
    • 1 convex blob

[Figure: scatterplot of the multishapes data, showing multiple differently shaped clusters]

Applying kmeans to multishapes data

  • Using kmeans to identify 5 clusters:
library(tidyverse)  # for %>% and select()
numeric_only <- multishapes %>%
  select(x, y)
kmeans_clusters <- kmeans(numeric_only,
                          centers = 5,
                          iter.max = 20,
                          nstart = 10)
multishapes$kmeans_clusters <- factor(kmeans_clusters$cluster)
ggplot(data = multishapes, aes(x = x, y = y, color=kmeans_clusters)) +
  geom_point() + 
  guides(color='none') + 
  theme_classic(base_size = 16)

[Figure: multishapes scatterplot colored by k-means cluster assignments]

Applying pam to multishapes data

  • Using pam to identify 5 clusters:
library(cluster)
pam_clusters <- pam(numeric_only,
                    k = 5,
                    nstart = 10)

# Add clusters to the data frame
multishapes$pam_clusters <- factor(pam_clusters$clustering)
ggplot(data = multishapes, aes(x = x, y = y, color=pam_clusters)) +
  geom_point() + 
  guides(color='none') + 
  theme_classic(base_size = 16)

[Figure: multishapes scatterplot colored by PAM cluster assignments]

Choosing \(k\)

  • The same approaches we reviewed for choosing \(k\) in hierarchical clustering apply to both k-means and PAM
    • WSS as a function of \(k\) available if clustering numeric data
    • Average silhouette widths as a function of \(k\) in general
  • The same caveats for these metrics still apply: they are good for evaluating cluster quality if the clusters are convex, not so much if they are non-convex.

Partitioning methods on the drug use data

  • Let’s revisit the U.S. city drug data:
library(tidyverse)
drugs <- read.csv('Data/IllicitDrug.csv') %>% 
  column_to_rownames('State')
head(drugs)
           DrugUse BingeDrink Poverty HSdrop Income
Alabama        3.3       15.6    14.5   12.6  22946
Alaska         8.5       19.8     9.4   10.9  28523
Arizona        5.4       17.4    16.6   14.4  25307
Arkansas       2.8       16.8    14.8   11.4  22114
California     6.2       16.7    15.4   14.2  29819
Colorado       6.5       19.8     9.2    9.8  31678
drugs_scaled <- scale(drugs)

Choosing \(k\)

library(factoextra)
library(patchwork)
library(cluster)
fviz_nbclust(drugs_scaled, 
             FUNcluster = kmeans,
             method = 'wss') + 
  labs(title = 'Plot of WSS vs k using kmeans') + 
fviz_nbclust(drugs_scaled, 
             FUNcluster = kmeans,
             method = 'silhouette') + 
  labs(title = 'Plot of avg silhouette vs k using kmeans') +
fviz_nbclust(drugs_scaled, 
             FUNcluster = pam,
             method = 'wss') + 
  labs(title = 'Plot of WSS vs k using PAM') + 
fviz_nbclust(drugs_scaled, 
             FUNcluster = pam,
             method = 'silhouette') + 
  labs(title = 'Plot of avg silhouette vs k using PAM')
  • The PAM and k-means plots look virtually identical.
  • The WSS plot has a slight elbow at \(k=4\) using k-means.
  • Average silhouette width is largest at \(k=2\) for both methods, with \(k=4\) second-best.
    • PAM actually has large average silhouettes at \(k=7\) and \(k=9\), but that many clusters seems like overkill.

Characterizing clusters

  • Proceeding with 4 clusters:
kmeans4 <- kmeans(drugs_scaled, centers = 4, nstart = 10)
pam4 <- pam(drugs_scaled, k = 4, nstart = 10)
  • PCA to visualize:
drug_pca <- prcomp(drugs, center=TRUE, scale. = TRUE)
kmeans_biplot <- fviz_pca(drug_pca, 
         habillage = factor(kmeans4$cluster),
         repel = TRUE) + 
      ggtitle('K-means 4-cluster solution') + 
      guides(color='none',shape='none')
pam_biplot <- fviz_pca(drug_pca, 
         habillage = factor(pam4$clustering),
         repel = TRUE) + 
      ggtitle('PAM 4-cluster solution') + 
      guides(color='none',shape='none')
kmeans_biplot + pam_biplot
  • The only notable differences are the cluster assignments of Indiana, Virginia, Maryland, and Nevada.
  • To see what the clusters represent, we can also summarize the original variables within each cluster, as sketched below.
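
One way to characterize the clusters numerically is to compare the means of the original (unscaled) variables within each cluster. A minimal sketch using the k-means solution (this dplyr pipeline is our own, not from the original analysis):

# Mean of each original variable within each k-means cluster
drugs %>%
  mutate(cluster = factor(kmeans4$cluster)) %>%
  group_by(cluster) %>%
  summarize(across(everything(), mean))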