4.2 Partitioning methods

Introduction

  • Partitioning methods take a different approach to finding clusters than hierarchical clustering: the number of clusters is fixed in advance, and points are iteratively reassigned.
  • First step: specify \(k\).
  • Randomly select \(k\) centroids.
  • Assign each point to the cluster whose centroid it is closest to.
  • Re-compute the centroids.
  • Iterate until convergence.
  • Two important optimization methods we will consider are k-means and partitioning around medoids (PAM).

The K-means clustering algorithm

  • First step: specify \(k\). Suppose we set \(k=4\).
  • Randomly select \(k\) points in \(p-\)dimensional space. These are the initial “centroids.”

Assign

  • Next step: Assign points to the centroid they are closest to

Recompute centroids

  • Next step: re-compute centroids as mean vectors of new clusters

Reassign points

  • Reassign points to nearest new centroid

Recompute centroids

  • Re-compute centroids


Convergence

  • Algorithm converges when:
    • Cluster assignments don’t change
    • Centroids stabilize (a from-scratch sketch of the full loop follows below)
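
To make the loop concrete, here is a minimal from-scratch sketch of k-means (illustrative only; use kmeans() in practice, and simple_kmeans and its internals are our own invention):

simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Randomly select k data points as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignments <- rep(0, nrow(x))
  for (i in seq_len(max_iter)) {
    # Assign each point to the nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_assignments <- apply(d, 1, which.min)
    # Converged: cluster assignments don't change
    if (all(new_assignments == assignments)) break
    assignments <- new_assignments
    # Re-compute centroids as mean vectors of the new clusters
    # (empty clusters are not handled -- this is only a sketch)
    centroids <- apply(x, 2, function(col) tapply(col, assignments, mean))
  }
  list(cluster = assignments, centers = centroids)
}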

Notes on k-means

  • Initial centroids matter!! It's a good idea to run the k-means algorithm multiple times to make sure it converges to the same solution.
  • K-means is good at finding convex clusters; will struggle identifying non-convex clusters
    • Convex - any two points in the cluster can be connected by a straight line segment that does not leave the cluster
    • Aka “spherical”
  • Only works for numeric data where Euclidean distance makes sense
    • Make sure to scale!! (see the toy example below)
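
A toy example of why scaling matters, with made-up values echoing the drug data used later in this section: a variable measured in dollars will swamp one measured in percentage points unless both are standardized.

toy <- data.frame(Income = c(22946, 28523, 25307),
                  HSdrop = c(12.6, 10.9, 14.4))
dist(toy)         # Euclidean distances driven almost entirely by Income
dist(scale(toy))  # after scaling, both variables contribute comparably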

K-means in R

  • kmeans(x, centers, iter.max = 10, nstart = 1)
    • x: a scaled numeric data matrix (does not work with distance object)
    • centers: integer value for number of clusters
    • iter.max = 10: maximum number of iterations to perform; the default of 10 is often too few, so we recommend increasing it
    • nstart = 1: number of random starting configurations; we recommend increasing this as well (see the example call below)
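
A quick illustration of those arguments, run on the built-in iris measurements (a sketch; the object names are our own):

iris_scaled <- scale(iris[, 1:4])  # numeric columns only, scaled first
km <- kmeans(iris_scaled, centers = 3, iter.max = 50, nstart = 25)
km$cluster       # cluster assignment for each observation
km$centers       # final centroids (in the scaled space)
km$tot.withinss  # total within-cluster sum of squares (WSS)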

Partitioning around medoids (PAM)

  • PAM works very similarly to k-means.
  • Rather than starting from arbitrary initial centroids, it selects \(k\) medoids: actual data points that serve as cluster centers.
  • It then iterates between choosing new medoids and reassigning points to clusters.

K-means vs PAM

Criterion            k-means                                                       PAM
Cluster center       Centroid (mean of points)                                     Medoid (an actual data point)
Optimization goal    Minimize sum of squared Euclidean distances to the centroid   Minimize sum of dissimilarities between points and their medoid
Speed                Fast                                                          Slow
Robust to outliers   No                                                            Yes
Distance metric      Euclidean only                                                Any
Cluster shapes       Spherical/convex                                              Spherical/convex

PAM in R

  • Must load the cluster package.
  • pam(x, k, nstart)
    • x: a data matrix or a dissimilarity object!
    • k: number of clusters to identify
    • nstart: number of random initial medoid sets used to start the algorithm
  • A sketch showing both input types follows below.
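
Unlike kmeans(), pam() also accepts a dissimilarity object, so distances other than Euclidean work. A short sketch of both input types, again using the built-in iris measurements (object names are our own):

library(cluster)
# On a raw (scaled) data matrix: Euclidean dissimilarities by default
pam_fit <- pam(scale(iris[, 1:4]), k = 3, nstart = 10)
pam_fit$medoids     # the k medoids are actual rows of the data

# On a dissimilarity object: any metric dist() (or cluster::daisy()) can produce
d <- dist(scale(iris[, 1:4]), method = "manhattan")
pam_fit2 <- pam(d, k = 3)
pam_fit2$clustering # cluster assignment for each observation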

multishapes data

  • The multishapes data set is in the factoextra package:
library(factoextra)
head(multishapes)
           x          y shape
1 -0.8037393 -0.8530526     1
2  0.8528507  0.3676184     1
3  0.9271795 -0.2749024     1
4 -0.7526261 -0.5115652     1
5  0.7068462  0.8106792     1
6  1.0346985  0.3946550     1
ggplot(data = multishapes, aes(x = x, y = y)) +
  geom_point() + 
  theme_classic(base_size = 16)
  • There appear to be 5 clusters:
    • 2 non-convex rings
    • 2 convex linear clusters
    • 1 convex blob

[Figure: scatterplot of the multishapes data, showing multiple differently shaped clusters]

Applying kmeans to multishapes data

  • Using kmeans to identify 5 clusters:
library(tidyverse)  # for %>% and select()
numeric_only <- multishapes %>%
  select(x, y)
kmeans_clusters <- kmeans(numeric_only,
                          centers = 5,
                          iter.max = 20,
                          nstart = 10)
multishapes$kmeans_clusters <- factor(kmeans_clusters$cluster)
ggplot(data = multishapes, aes(x = x, y = y, color=kmeans_clusters)) +
  geom_point() + 
  guides(color='none') + 
  theme_classic(base_size = 16)

[Figure: multishapes scatterplot colored by k-means cluster assignments]

Applying pam to multishapes data

  • Using pam to identify 5 clusters:
library(cluster)
pam_clusters <- pam(numeric_only,
                    k = 5,
                    nstart = 10)

# Add clusters to the data frame
multishapes$pam_clusters <- factor(pam_clusters$clustering)
ggplot(data = multishapes, aes(x = x, y = y, color=pam_clusters)) +
  geom_point() + 
  guides(color='none') + 
  theme_classic(base_size = 16)

[Figure: multishapes scatterplot colored by PAM cluster assignments]

Choosing \(k\)

  • The same approaches we reviewed for choosing \(k\) in hierarchical clustering apply to both k-means and PAM
    • WSS as a function of \(k\) available if clustering numeric data
    • Average silhouette widths as a function of \(k\) in general
  • The same caveats for these metrics still apply: they are good for evaluating cluster quality if the clusters are convex, not so much if they are non-convex.

Partitioning methods on the drug use data

  • Let’s revisit the U.S. city drug data:
library(tidyverse)
drugs <- read.csv('Data/IllicitDrug.csv') %>% 
  column_to_rownames('State')
head(drugs)
           DrugUse BingeDrink Poverty HSdrop Income
Alabama        3.3       15.6    14.5   12.6  22946
Alaska         8.5       19.8     9.4   10.9  28523
Arizona        5.4       17.4    16.6   14.4  25307
Arkansas       2.8       16.8    14.8   11.4  22114
California     6.2       16.7    15.4   14.2  29819
Colorado       6.5       19.8     9.2    9.8  31678
drugs_scaled <- scale(drugs)

Choosing \(k\)

library(factoextra)
library(patchwork)
library(cluster)
fviz_nbclust(drugs_scaled, 
             FUNcluster = kmeans,
             method = 'wss') + 
  labs(title = 'Plot of WSS vs k using kmeans') + 
fviz_nbclust(drugs_scaled, 
             FUNcluster = kmeans,
             method = 'silhouette') + 
  labs(title = 'Plot of avg silhouette vs k using kmeans') +
fviz_nbclust(drugs_scaled, 
             FUNcluster = pam,
             method = 'wss') + 
  labs(title = 'Plot of WSS vs k using PAM') + 
fviz_nbclust(drugs_scaled, 
             FUNcluster = pam,
             method = 'silhouette') + 
  labs(title = 'Plot of avg silhouette vs k using PAM')
  • The PAM and k-means plots look virtually identical.
  • The WSS plot has a slight elbow at \(k=4\) using k-means.
  • Average silhouette width is largest at \(k=2\) for both methods, with \(k=4\) second-best.
    • PAM actually has large average silhouettes at \(k=7\) and \(k=9\), but that many clusters seems like overkill.

Characterizing clusters

  • Proceeding with 4 clusters:
kmeans4 <- kmeans(drugs_scaled, centers = 4, nstart = 10)
pam4 <- pam(drugs_scaled, k = 4, nstart = 10)
  • PCA to visualize:
drug_pca <- prcomp(drugs, center=TRUE, scale. = TRUE)
kmeans_biplot <- fviz_pca(drug_pca, 
         habillage = factor(kmeans4$cluster),
         repel = TRUE) + 
      ggtitle('K-means 4-cluster solution') + 
      guides(color='none',shape='none')
pam_biplot <- fviz_pca(drug_pca, 
         habillage = factor(pam4$clustering),
         repel = TRUE) + 
      ggtitle('PAM 4-cluster solution') + 
      guides(color='none',shape='none')
kmeans_biplot + pam_biplot
  • The only notable differences are the cluster assignments of Indiana, Virginia, Maryland, and Nevada.
  • To see what the clusters represent, we can also summarize the original variables within each cluster, as sketched below.
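
One way to characterize the clusters numerically is to compare the means of the original (unscaled) variables within each cluster. A minimal sketch using the k-means solution (this dplyr pipeline is our own, not from the original analysis):

# Mean of each original variable within each k-means cluster
drugs %>%
  mutate(cluster = factor(kmeans4$cluster)) %>%
  group_by(cluster) %>%
  summarize(across(everything(), mean))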