12.3-class-yuhan wen

Load

# ============================================================
# 1. Load libraries and data
# ============================================================
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# ggplot2 is already attached by tidyverse; set the default plot theme
theme_set(theme_classic())
suppressPackageStartupMessages(library(cowplot))
suppressPackageStartupMessages(library(cluster))
suppressPackageStartupMessages(library(factoextra))
# USArrests is in base R
data("USArrests")

# Examine structure
str(USArrests)

## 'data.frame':    50 obs. of  4 variables:
##  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

# Data preparation: standardize each variable to mean 0, sd 1
usarrest_scaled <- scale(USArrests)

# Distance matrix
dist_mat <- dist(usarrest_scaled)
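
As a quick sanity check on the scaling step, the sketch below (base R only) prints the raw standard deviations and confirms that each scaled column has mean 0 and standard deviation 1. It illustrates why scaling matters here: without it, Assault would dominate the Euclidean distances. The names raw_sd, scaled_mean, and scaled_sd are illustrative and not part of the original analysis.

```r
data("USArrests")
usarrest_scaled <- scale(USArrests)

# Raw standard deviations: Assault is on a much larger scale than Murder
raw_sd <- apply(USArrests, 2, sd)
print(round(raw_sd, 2))

# After scaling, every column has mean ~0 and standard deviation 1
scaled_mean <- colMeans(usarrest_scaled)
scaled_sd   <- apply(usarrest_scaled, 2, sd)
print(round(scaled_mean, 10))
print(scaled_sd)
```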

Four Linkage Problem

We create dendrograms for four linkage methods: complete linkage, single linkage, average linkage, and Ward’s method.

# ============================================================
# 3a. Fit hierarchical clustering with 4 linkage methods
# ============================================================
hc_complete <- hclust(dist_mat, method = "complete")
hc_single   <- hclust(dist_mat, method = "single")
hc_average  <- hclust(dist_mat, method = "average")
hc_ward     <- hclust(dist_mat, method = "ward.D2")
# ============================================================
# 3b. Plot dendrograms
# ============================================================
par(mfrow = c(1,1))
plot(hc_complete, main="Complete Linkage", xlab="", sub="")

plot(hc_single,   main="Single Linkage",   xlab="", sub="")

plot(hc_average,  main="Average Linkage",  xlab="", sub="")

plot(hc_ward,     main="Ward’s Method",    xlab="", sub="")

par(mfrow = c(1,1))

Comparison of methods

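Comparison Table of Linkage Methods

We compare the heights at which clusters fuse in the last 10 merges. Within one method, lower fusion heights mean tighter clusters. Heights are not directly comparable across methods, because each linkage defines the distance between clusters differently, so the pattern of increases matters more than the absolute values.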

# ============================================================
# 4. Compare fusion heights (the last 10 merges)
# ============================================================
comparison_df <- tibble(
  Complete = tail(hc_complete$height, 10),
  Single   = tail(hc_single$height,   10),
  Average  = tail(hc_average$height,  10),
  Ward     = tail(hc_ward$height,     10)
)

comparison_df

## # A tibble: 10 × 4
##    Complete Single Average  Ward
##       <dbl>  <dbl>   <dbl> <dbl>
##  1     2.26   1.07    1.58  2.24
##  2     2.30   1.07    1.64  2.71
##  3     2.34   1.17    1.78  2.99
##  4     2.45   1.18    1.81  3.03
##  5     2.47   1.20    1.90  3.21
##  6     3.09   1.24    1.91  3.50
##  7     3.26   1.24    2.33  3.73
##  8     4.40   1.26    2.51  6.46
##  9     4.42   1.30    2.73  7.19
## 10     6.08   2.06    3.32 13.5

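Interpretation:

Single linkage has the smallest fusion heights, and they barely change (1.07 to 1.30) before the final merge at 2.06. Merges at nearly equal heights are typical of the “chaining effect,” where observations are absorbed one at a time into a long, loose cluster.

Complete linkage creates more compact clusters; its heights rise in clear steps (2.47 → 3.09, 3.26 → 4.40, 4.42 → 6.08).

Average linkage sits between single and complete, with heights from 1.58 to 3.32 and moderate cluster shapes.

Ward’s method shows the largest jumps in the final merges (3.73 → 6.46 → 7.19 → 13.5), which points to well-separated groups near the top of the tree. It tends to produce the most balanced clusters.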
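
One way to check the claims about chaining and balance is to cut each tree into the same number of clusters and compare cluster sizes. The sketch below does this with base R; the choice of k = 4 and the names linkages and size_by_method are assumptions made for illustration.

```r
data("USArrests")
dist_mat <- dist(scale(USArrests))

linkages <- c(complete = "complete", single = "single",
              average = "average", ward = "ward.D2")

# For each linkage, cut the tree into 4 clusters and record the
# cluster sizes, sorted from largest to smallest
size_by_method <- sapply(linkages, function(m) {
  hc <- hclust(dist_mat, method = m)
  sort(as.vector(table(cutree(hc, k = 4))), decreasing = TRUE)
})

# Rows are clusters (largest first), columns are linkage methods
print(size_by_method)
```

A chaining method shows up as one very large cluster plus a few tiny ones, while a balanced method spreads the states more evenly across its clusters.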

Decide how many clusters

Ward’s method dendrogram gives the cleanest separation.

# ============================================================
# 5. Examine Ward dendrogram to choose #clusters
# ============================================================
plot(hc_ward, main="Ward’s Method (Preferred)", xlab="", sub="")
abline(h = 6, col="red", lty=2)  # cut at height ~6 → 4 clusters

A horizontal line at height ≈ 6 falls between the Ward merges at 3.73 and 6.46 in the table above, so it crosses four branches and cuts the dendrogram into 4 clusters. Four clusters is also a common choice for USArrests in textbook examples.
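
The link between the cut height and the number of clusters can be checked directly. This is a minimal sketch using cutree() with h instead of k; the names by_height and by_count are illustrative.

```r
data("USArrests")
hc_ward <- hclust(dist(scale(USArrests)), method = "ward.D2")

# Cut once by height and once by requested number of clusters
by_height <- cutree(hc_ward, h = 6)
by_count  <- cutree(hc_ward, k = 4)

# Number of clusters produced by the cut at height 6
print(length(unique(by_height)))

# Check whether the two cuts assign every state to the same cluster
print(all(by_height == by_count))

# Same count from the merge heights: merges above the line, plus one
print(sum(hc_ward$height > 6) + 1)
```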

Tree cut

# ============================================================
# 6. Cut into 4 clusters
# ============================================================
clusters <- cutree(hc_ward, k = 4)

table(clusters)

## clusters
##  1  2  3  4 
##  7 12 19 12
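
As a rough numerical check on the choice of k, the sketch below computes the average silhouette width for several values of k using silhouette() from the cluster package. This is an additional diagnostic, not part of the original analysis, and the range k = 2 to 6 is an assumption. The k with the highest silhouette may not be 4, so treat the result as one input alongside the dendrogram.

```r
library(cluster)  # provides silhouette()

data("USArrests")
dist_mat <- dist(scale(USArrests))
hc_ward  <- hclust(dist_mat, method = "ward.D2")

# Average silhouette width for k = 2, ..., 6 clusters
# (values near 1 indicate well-separated clusters)
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  sil <- silhouette(cutree(hc_ward, k = k), dist_mat)
  mean(sil[, "sil_width"])
})

print(data.frame(k = k_values, avg_silhouette = round(avg_sil, 3)))
```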

# ============================================================
# 6b. Visualize clusters with PCA scatterplot
# ============================================================
library(ggplot2)

usarrest_pca <- prcomp(usarrest_scaled)

cluster_plot <- data.frame(
  PC1 = usarrest_pca$x[,1],
  PC2 = usarrest_pca$x[,2],
  Cluster = as.factor(clusters),
  State = rownames(USArrests)
)

ggplot(cluster_plot, aes(PC1, PC2, color = Cluster, label = State)) +
  geom_point(size=3) +
  geom_text(size=3, vjust = -0.4) +
  theme_minimal() +
  labs(title = "Clusters (Ward’s Method) Projected onto PCA Space")
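
Because the scatterplot shows only the first two principal components, it helps to check how much of the total variance they capture. A minimal sketch, with var_explained and pc12_share as illustrative names:

```r
data("USArrests")
usarrest_pca <- prcomp(scale(USArrests))

# Proportion of variance explained by each principal component
var_explained <- usarrest_pca$sdev^2 / sum(usarrest_pca$sdev^2)
print(round(var_explained, 3))

# Share of total variance captured by the PC1-PC2 plane used in the plot
pc12_share <- sum(var_explained[1:2])
print(round(pc12_share, 3))
```

The clusters were fitted on all four scaled variables, so groups that overlap in the 2-D plot can still be separated along the remaining components.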

Interpretation of each cluster:

Cluster 1: High Violent Crime “Deep South” States
States: Alabama, Georgia, Louisiana, Tennessee, Mississippi, South Carolina, North Carolina
(Grouped tightly in the lower-left region of the PCA plot.)
Profile:
- Highest Murder rates of any cluster; Assault rates are also high
- Rape arrest rates close to the national average
- Low to moderate UrbanPop (percent of the population living in urban areas) despite high violence
- A known criminology pattern: Southern states have historically exhibited higher violent crime rates

Cluster 2: Urban, High Population, Mixed Crime States
States: California, Nevada, New York, Illinois, Michigan, Texas, Arizona, Colorado, New Mexico, Florida, Maryland, Alaska
(The large green cluster in the PCA plot.)
Profile:
- Higher UrbanPop than the other clusters (large cities → a larger share of the population living in urban areas)
- Crime rates are moderate to high, but Murder rates are not as uniformly extreme as in Cluster 1
- Many states contain major metropolitan areas: LA, Chicago, NYC, Miami, Phoenix, Houston
- Some states (California, Nevada, Alaska) have distinctive PCA positions due to unusual arrest patterns

Cluster 3: Moderate Crime, Central & Midwestern States
States (partial list; the cluster table shows 19 members): Missouri, Oklahoma, Kansas, Indiana, Ohio, Virginia, Delaware, Washington, Oregon, Wyoming, Kentucky, Arkansas
(Clustered near the center of the PCA plot.)
Profile:
- Balanced, mid-range values on all four variables
- Neither especially violent nor notably safe
- Mix of suburban, rural, and mid-sized city populations
- Crime statistics are “middle of the pack”

Cluster 4: Low Crime, Small/Stable Northeastern & Upper Midwest States
States: Vermont, Maine, New Hampshire, North Dakota, South Dakota, Iowa, Minnesota, Wisconsin, Idaho, Montana, Nebraska, West Virginia
(Clearly grouped on the right side of the PCA plot.)
Profile:
- Lowest Murder and Assault rates of the four clusters
- Many are small-population, rural, relatively homogeneous states
- Known for historically low crime: Vermont, Maine, and New Hampshire rank among the lowest nationally
- Arrest patterns are consistently mild, which explains the tight clustering in PCA space
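
The profiles above can be checked against the data by averaging the original (unscaled) variables within each cluster and listing the member states. This is a minimal sketch using base R; the object name profiles and the added Size column are illustrative.

```r
data("USArrests")
hc_ward  <- hclust(dist(scale(USArrests)), method = "ward.D2")
clusters <- cutree(hc_ward, k = 4)

# Mean of each original (unscaled) variable within each cluster
profiles <- aggregate(USArrests, by = list(Cluster = clusters), FUN = mean)
profiles$Size <- as.vector(table(clusters))
print(profiles)

# States belonging to each cluster, for checking the lists above
print(split(rownames(USArrests), clusters))
```

Using the unscaled data for the means keeps the profile in the original units (arrests per 100,000 residents, and percent urban for UrbanPop), which is easier to read than z-scores.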