Introduction

This document provides an overview and practical examples of Unsupervised Learning methods using R, based on Chapter 10 of An Introduction to Statistical Learning with Applications in R. We explore the following techniques:

  • Principal Components Analysis (PCA)
  • K-Means Clustering
  • Hierarchical Clustering

Each section includes R code, explanations, and interpretation tips.

First things first: let's install and load packages.

We use base R functions for PCA and clustering, so no additional packages are needed at this point. Later sections also use ggplot2, dendextend, mclust, and knitr, which we load as they are needed.

# No packages needed for base R functions used in PCA and clustering

1. The Dataset

We will use the USArrests dataset, which contains arrest statistics (per 100,000 residents) for murder, assault, and rape in each of the 50 US states in 1973, together with the percentage of the population living in urban areas (UrbanPop).

head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

Explanation:

This step helps us get familiar with the dataset. We check the first few rows and generate summary statistics to understand variable ranges and distributions.
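
As a small addition to the code above, it is also worth checking how different the variables are on their raw scales; this motivates the standardisation step used in the next section.

# Quick check (an addition to the original code): the variables have very
# different means and variances on their raw scales, which is why we
# standardise before PCA and clustering.
apply(USArrests, 2, mean)
apply(USArrests, 2, var)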

2. Principal Components Analysis (PCA)

# Standardize the variables
USArrests_scaled <- scale(USArrests)

# Run PCA
pca_result <- prcomp(USArrests_scaled)

# PCA summary
summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
# Plot PCA
biplot(pca_result, scale = 0)

Explanation: We first scale the data using scale() because PCA is sensitive to the scale of the variables. If the variables are measured on different scales (e.g. "Assault" has a much larger range than "Murder"), PCA would be dominated by the variable with the largest variance. Scaling ensures that each variable has a mean of 0 and a standard deviation of 1, so all variables contribute equally to the analysis.
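
As a quick sanity check (not part of the original code), we can confirm that each standardised column does have mean approximately 0 and standard deviation 1:

# Sanity check: each standardised column should have mean ~0 and sd 1
round(colMeans(USArrests_scaled), 10)
apply(USArrests_scaled, 2, sd)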

The prcomp() function performs Principal Components Analysis; internally it uses a singular value decomposition, which for standardised data is equivalent to computing the covariance (here, correlation) matrix and extracting its eigenvalues and eigenvectors (the short check after the bullet list below confirms this numerically):

  • The eigenvectors define the principal components (i.e., new axes in the transformed space).
  • The eigenvalues indicate how much of the data’s total variance is explained by each component.
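
The check below is an addition to the original code: the squared component standard deviations should match the eigenvalues of the covariance matrix of the scaled data, and the loadings should match the eigenvectors up to sign.

# Check: prcomp() results agree with an explicit eigendecomposition of the
# covariance matrix of the scaled data.
eig <- eigen(cov(USArrests_scaled))

round(eig$values, 4)            # eigenvalues
round(pca_result$sdev^2, 4)     # squared standard deviations of the PCs

round(abs(eig$vectors), 3)      # eigenvectors (absolute values: signs are arbitrary)
round(abs(pca_result$rotation), 3)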

The summary() function shows:

  • The standard deviation of each principal component.
  • The proportion of variance explained by each.
  • The cumulative proportion, helping us decide how many components are meaningful (a scree plot, sketched just after this list, makes this easier to judge).
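
A scree plot is a common way to visualise this; the sketch below (an addition to the original code) computes the proportion of variance explained directly from the component standard deviations.

# Proportion of variance explained (PVE) by each principal component
pve <- pca_result$sdev^2 / sum(pca_result$sdev^2)

par(mfrow = c(1, 2))
plot(pve, type = "b", pch = 19,
     xlab = "Principal Component", ylab = "Proportion of Variance Explained",
     ylim = c(0, 1))
plot(cumsum(pve), type = "b", pch = 19,
     xlab = "Principal Component", ylab = "Cumulative Proportion",
     ylim = c(0, 1))
par(mfrow = c(1, 1))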

Finally, the biplot() visualises the data in the space of the first two principal components. It shows:

  • Observations (e.g., states) as points.
  • Variables as arrows, showing their direction and contribution to the components.

This allows us to see patterns or clusters in the data, and understand which variables are most influential.

Summary

Key Components of a Biplot

  1. Axes (PC1, PC2, etc.):
  • The axes represent the first two principal components (e.g., PC1 and PC2), which capture the maximum variance in the dataset.
  • Each point in the biplot is a projection of the original data onto these new axes.
  2. Points (Observations):
  • Each point corresponds to an observation (row) in the dataset.
  • Points that are close to each other are similar in terms of the variables captured by PC1 and PC2.
  3. Arrows (Features/Variables):
  • The arrows indicate the direction and strength (loadings) of each original feature (column) in the dataset; the sketch after this list prints these loadings directly.
  • Longer arrows mean the feature contributes more to the variation captured by the components.
  • The angle between arrows tells you about the correlation between variables:
    • Small angle: strong positive correlation.
    • Angle near 90 degrees: weak or no correlation.
    • Angle near 180 degrees: strong negative correlation.
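
Since the arrows are drawn from the loadings and the points from the scores, it can help to print both alongside the biplot; this is a small addition to the original code.

# The arrow coordinates in the biplot come from the loadings (rotation matrix)
pca_result$rotation

# The point coordinates come from the scores (the transformed observations)
head(pca_result$x[, 1:2])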

Interpretation

  • Use the summary() output to assess how many principal components capture most of the variance (here, the first two components explain about 87% of it).

  • In the biplot:

    • Longer arrows indicate variables that contribute more to the components.
    • Angles between arrows reflect correlations: small angles mean positive correlation, angles near 180 degrees mean negative correlation.
    • Clustering of points may reveal natural groupings of observations.

Interpretation of This Biplot (biplot(pca_result, scale = 0))

This biplot visualises the principal component analysis (PCA) of the USArrests dataset after scaling (standardisation). Here’s what you’re looking at:

  1. Points = US States
  • Each point represents one of the 50 US states.
  • The position is based on the first two principal components (PC1 and PC2), which summarise the most variance in the data.
  • States that are close together are similar in terms of their arrest rates and urban population.
  2. Arrows = Variables
  • The arrows represent the original variables:
    • Murder
    • Assault
    • UrbanPop
    • Rape
  • Arrow length = strength of contribution to the PCs:
    • Longer arrows indicate variables that contribute more to the variation.
  • Direction of arrows:
    • Arrows pointing in the same direction suggest positive correlation.
    • Arrows pointing in opposite directions suggest negative correlation.
    • Perpendicular arrows (≈ 90 degrees) suggest little or no correlation.
  3. Principal Components (PC1 & PC2):
  • PC1 (x-axis) likely captures overall crime rate: it is heavily influenced by Assault, Murder, and Rape.
  • PC2 (y-axis) may reflect the contrast between urbanisation (UrbanPop) and the crime variables.
  4. Insights We Might Notice:
  • If states in the top-right have high values on UrbanPop and Rape, they are likely more urban, with more reported sexual assaults.
  • States in the bottom-left might have lower urban populations and lower crime rates.
  • Assault often dominates PC1 due to its large variance in the dataset; the ranking sketched after this list shows which states sit at the PC1 extremes.
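
To make the PC1 story concrete, we can rank the states by their first-component scores (an addition to the original code). Note that the sign of a principal component is arbitrary, so check the loadings to see which end corresponds to higher crime.

# Rank states by their PC1 scores (sign is arbitrary; see pca_result$rotation)
pc1_scores <- sort(pca_result$x[, 1])
head(pc1_scores, 5)   # states at one extreme of PC1
tail(pc1_scores, 5)   # states at the other extreme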

3. K-Means Clustering

K-means clustering is an unsupervised learning technique that groups observations into a pre-defined number (K) of distinct, non-overlapping clusters. It aims to minimise the within-cluster variation, so that similar observations end up in the same group. It is particularly useful when we suspect natural groupings exist in the data but do not have pre-labelled classes.

The algorithm starts from K randomly chosen centroids, assigns each observation to its nearest centroid, recomputes each centroid as the mean of its assigned observations, and repeats until the assignments stop changing. Because the result depends on the random initialisation, we run the algorithm several times (the nstart argument) and keep the best solution. Unlike hierarchical clustering (Section 4), K-means requires the number of clusters K to be specified in advance.

Let's see how it works.

# Set seed for reproducibility
set.seed(2)

# Apply k-means clustering with 2 clusters
km_out <- kmeans(USArrests_scaled, centers = 2, nstart = 20)

# View cluster assignments
km_out$cluster
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              2              2              1              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              2              1              2              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              2              1              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              2              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              1              2              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              2              2              2
# Visualise clusters using PCA
plot(pca_result$x[,1:2], col = km_out$cluster, pch = 20, cex = 2)

Interpretation of the Results

  1. Number of Clusters:
  • The analysis defines 2 clusters. This means the states are grouped into two distinct sets based on similarities in their standardised crime statistics.
  2. Cluster Assignments:
  • You can view which states belong to which cluster using km_out$cluster.
  • These are arbitrary labels (1 and 2), but they separate states into relatively homogeneous groups; the sketch after this list summarises each cluster's average crime statistics.
  3. Reproducibility:
  • The use of set.seed(2) ensures that the random initialisation, and hence the clustering, is the same every time you run the code.
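
To see what actually distinguishes the two groups, we can summarise each cluster on the original (unscaled) variables; this is an addition to the original code.

# Average crime statistics per cluster, on the original scale
aggregate(USArrests, by = list(Cluster = km_out$cluster), FUN = mean)

# The centroids used by the algorithm (in standardised units)
km_out$centers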

Visualisation with PCA Plot

The last line:

plot(pca_result$x[,1:2], col = km_out$cluster, pch = 20, cex = 2)

projects the K-means clusters onto the first two principal components (from your earlier PCA).

This visualisation shows:

  • Each point = a state.
  • Colour = cluster assignment (Cluster 1 vs Cluster 2).
  • Points close together and sharing colour = similar arrest statistics and urban population.
  • Clusters may reveal geographical or socio-economic groupings (e.g. high-crime urban states vs low-crime rural states).
# Load ggplot2
library(ggplot2)

# Create a data frame with PCA results and cluster assignment
pca_df <- as.data.frame(pca_result$x[, 1:2])  # First two PCs
pca_df$State <- rownames(USArrests)
pca_df$Cluster <- factor(km_out$cluster)

# Plot with ggplot2
ggplot(pca_df, aes(x = PC1, y = PC2, colour = Cluster, label = State)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_text(vjust = -1.2, size = 3) +
  labs(title = "K-means Clustering of US States (k = 2)",
       subtitle = "Visualised using First Two Principal Components",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal() +
  scale_colour_brewer(palette = "Set1")

This Plot Shows:

  • Each point is a US state.
  • Colours show which K-means cluster the state belongs to.
  • The axes are the first two principal components, representing the strongest patterns in the data.
  • Labels make it easier to identify which state is where.
  • Cluster separation is now clearer than in the base plot().

Let us examine what happens when we introduce an additional cluster, that is, when we set K = 3.

Different K

Trying different values of K is a key step in unsupervised learning as it helps us uncover potentially more nuanced groupings in the data and assess how cluster structure evolves.
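
One common heuristic for choosing K, not used in the original analysis, is the elbow method: run K-means for a range of K values and plot the total within-cluster sum of squares, looking for the point where adding further clusters stops paying off.

# Elbow heuristic: total within-cluster sum of squares for K = 1, ..., 6
set.seed(2)
wss <- sapply(1:6, function(k) {
  kmeans(USArrests_scaled, centers = k, nstart = 20)$tot.withinss
})

plot(1:6, wss, type = "b", pch = 19,
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")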

In this example, we apply K-means clustering to the USArrests dataset (standardised), aiming to identify states with similar crime patterns. We choose K = 3 to form three broad groups of states.

# Apply K-means clustering with 3 clusters
set.seed(2)
km_out_3 <- kmeans(USArrests_scaled, centers = 3, nstart = 20)

# View cluster sizes
table(km_out_3$cluster)
## 
##  1  2  3 
## 17 13 20

To better understand the structure of the clusters, we visualise the K-means results using the first two principal components from PCA. This helps us see the separation in two dimensions while preserving most of the variation in the data.

library(ggplot2)

# Create data frame with PCA results and K=3 cluster assignments
pca_df_3 <- as.data.frame(pca_result$x[, 1:2])
pca_df_3$State <- rownames(USArrests_scaled)
pca_df_3$Cluster <- factor(km_out_3$cluster)

# Project K=3 centroids into PCA space
centroids_3 <- as.data.frame(km_out_3$centers %*% pca_result$rotation[, 1:2])
colnames(centroids_3) <- c("PC1", "PC2")  # ensure names match
centroids_3$Cluster <- factor(1:3)

# Plot
ggplot(pca_df_3, aes(x = PC1, y = PC2, colour = Cluster)) +
  geom_point(size = 3, alpha = 0.8) +
  geom_text(aes(label = State), vjust = -1, size = 3, show.legend = FALSE) +
  geom_point(data = centroids_3, aes(x = PC1, y = PC2), 
             colour = "black", shape = 4, size = 5, stroke = 1.5) +
  labs(title = "K-means Clustering of US States (K = 3)",
       subtitle = "Visualised using First Two Principal Components",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal() +
  scale_colour_brewer(palette = "Dark2")

When we increase the number of clusters from 2 to 3, we allow the algorithm to form a more nuanced separation of the states based on their standardised crime statistics.

In the visualisation:

  • Each point is a US state, positioned by its scores on the first two principal components (PC1 and PC2).
  • Colours represent the three cluster assignments.
  • Black Xs mark the cluster centroids, projected into PCA space.
  • State names help us identify which state is located where.

What We Can Observe:

  • The first cluster may still capture states with generally higher levels of violent crime, as indicated by strong PC1 values.
  • A second cluster might group lower-crime, less urbanised states.
  • A third cluster appears to split off some states that have moderate crime profiles, or specific patterns (e.g. high Rape but low Murder).
  • The separation is not just by overall crime levels but may involve specific combinations (e.g. urban population vs. Assault rates).

Comparison with K = 2 Clustering

  • Number of groups: K = 2 gives two broad clusters; K = 3 gives a more refined split with three distinct clusters.
  • Interpretability: K = 2 is easy to interpret ("high crime" vs "low crime"); K = 3 is more granular, with potential subgroups based on specific crime patterns.
  • Cluster centroids: with K = 2 the centroids are clearly separated; with K = 3 one centroid lies close to the others, suggesting some overlap or ambiguity.
  • State grouping: K = 2 groups the states coarsely; with K = 3, some states that were previously grouped together now form a third category.
  • Usefulness: K = 2 is good for simplicity and clarity; K = 3 is better at capturing more subtle trends in the data.
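
To make this comparison concrete, we can cross-tabulate the two sets of assignments; this is a small addition to the original code showing how the K = 2 groups are subdivided.

# How the K = 2 groups are subdivided when K = 3
table(K2 = km_out$cluster, K3 = km_out_3$cluster)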

Final Thoughts

  • K = 2 gives us a clear, macro-level picture, useful for a broad overview.
  • K = 3 helps us explore whether any states stand apart in a unique way and may reveal states that don’t fit neatly into a binary split.

Recap

So far we have explored PCA and K-means clustering on the USArrests dataset. These unsupervised learning methods help uncover structure in the data when we do not have a response variable, and they are useful for dimensionality reduction and exploratory data analysis. Next, we turn to hierarchical clustering, which builds a nested hierarchy of groups and does not require the number of clusters to be specified in advance.


4. Hierarchical Clustering

Hierarchical clustering is an unsupervised learning method that builds a hierarchy of clusters without requiring the number of clusters to be specified beforehand. It visualises nested groupings using a dendrogram, a tree-like diagram showing how states are merged step by step.

Different linkage methods define how distances between clusters are computed:

  • Complete linkage: max distance between observations in each cluster
  • Average linkage: average distance
  • Single linkage: min (nearest neighbour) distance

Each method produces a different clustering structure.

Distance Matrix and Clustering

# Compute distance matrix
dist_matrix <- dist(USArrests_scaled)

# Perform hierarchical clustering using different linkage methods
hc_complete <- hclust(dist_matrix, method = "complete")
hc_average  <- hclust(dist_matrix, method = "average")
hc_single   <- hclust(dist_matrix, method = "single")

Dendrogram Comparison

We now plot the dendrograms side by side to compare how different linkage methods group the US states.

par(mfrow = c(1, 3))  # 1 row, 3 columns

plot(hc_complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_complete, k = 3, border = "red")

plot(hc_average, main = "Average Linkage", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_average, k = 3, border = "blue")

plot(hc_single, main = "Single Linkage", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_single, k = 3, border = "forestgreen")

Interpretation and Comparison

  • In all three dendrograms, each leaf represents a US state.
  • The height at which two branches merge reflects the dissimilarity between the clusters.
  • Complete linkage: merges clusters based on the maximum distance between observations in each cluster; tends to form compact, evenly sized clusters.
  • Average linkage: uses the average distance between clusters; produces moderate, balanced cluster shapes.
  • Single linkage: merges based on the minimum (nearest-neighbour) distance; may result in a chaining effect (long, thin clusters).

Insights

  • Complete linkage gave the clearest, most interpretable clusters in the dendrogram.
  • Average linkage offers a smoother compromise and can be more robust.
  • Single linkage often merges many points early, leading to one dominant chain-like cluster; the quick check below compares cluster sizes to make this visible.
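
A quick way to see the chaining effect, added here as a small check beyond the original code, is to cut each tree into three groups and compare the resulting cluster sizes.

# Cluster sizes when each tree is cut into k = 3 groups
table(cutree(hc_complete, k = 3))
table(cutree(hc_average,  k = 3))
table(cutree(hc_single,   k = 3))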

Compared to K-means

  • Hierarchical clustering shows nested structure and dendrograms give insight into the merging sequence.
  • K-means forces flat partitions and optimises for compactness, making it more suitable for large datasets.

5. Comparing Clustering Methods

Comparing Cluster Assignments by State

We now create a table that shows the cluster assignments for each state using:

  • K-means clustering (k = 3)
  • Hierarchical clustering with complete linkage (k = 3)
# Assign cluster labels
kmeans_clusters <- km_out_3$cluster
hier_clusters <- cutree(hc_complete, k = 3)

# Create comparison table
cluster_comparison <- data.frame(
  State = rownames(USArrests),
  KMeans = factor(kmeans_clusters),
  Hierarchical = factor(hier_clusters)
)

# Preview
knitr::kable(head(cluster_comparison, 10), caption = "Cluster Assignments by Method (First 10 States)")
Cluster Assignments by Method (First 10 States)
State          KMeans   Hierarchical
Alabama        3        1
Alaska         3        1
Arizona        3        2
Arkansas       1        3
California     3        2
Colorado       3        2
Connecticut    1        3
Delaware       1        3
Florida        3        2
Georgia        3        1

Coloured Dendrogram (Complete Linkage)

Here we use the dendextend package to colour the dendrogram labels based on hierarchical clustering (k = 3).

# install.packages("dendextend")  # Uncomment if not installed
library(dendextend)

# Convert and colour the dendrogram
dend <- as.dendrogram(hc_complete)
dend_col <- color_branches(dend, k = 3)

# Plot coloured dendrogram
plot(dend_col, main = "Coloured Dendrogram - Complete Linkage", ylab = "Height")

Cluster Agreement (Adjusted Rand Index)

The Adjusted Rand Index (ARI) measures the agreement between two clusterings. It equals 1 for a perfect match and is close to 0 when the agreement is no better than chance (it can even be slightly negative).

# install.packages("mclust")  # Uncomment if not installed
library(mclust)

# Compute ARI between K-means and hierarchical clusters
ari <- adjustedRandIndex(kmeans_clusters, hier_clusters)
ari
## [1] 0.393779

Interpretation

The table above shows how each method assigns clusters to individual states. Differences in assignments arise because:

  • K-means partitions data to minimise within-cluster variance using centroids.
  • Hierarchical clustering merges or splits groups based on inter-cluster distances (in this case, using complete linkage).

The coloured dendrogram provides a cleaner visual summary of the hierarchical structure, showing how the clusters form step by step and where the k = 3 cut falls.

The Adjusted Rand Index (ARI) quantifies the similarity between the two sets of labels. A high ARI (close to 1) suggests that both methods yield consistent clusters, while a lower ARI implies they capture different aspects of the data structure. Here, an ARI of roughly 0.39 indicates moderate, but far from perfect, agreement.
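
A simple cross-tabulation of the two label sets, added here as a complementary check, shows where the overlaps and mismatches occur.

# Cross-tabulation of K-means (k = 3) vs hierarchical (complete linkage, k = 3)
table(KMeans = kmeans_clusters, Hierarchical = hier_clusters)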

Visualising Agreement and Disagreement on PCA

To better understand where clustering methods agree or differ, we highlight states on the PCA plot based on whether their K-means and hierarchical cluster assignments match.

# Build PCA plot data with cluster comparison
pca_df <- as.data.frame(pca_result$x[, 1:2])
pca_df$State <- rownames(USArrests)
pca_df$KMeans <- factor(kmeans_clusters)
pca_df$Hierarchical <- factor(hier_clusters)

# Add agreement flag
pca_df$Agreement <- ifelse(pca_df$KMeans == pca_df$Hierarchical, "Agree", "Disagree")

# Load ggplot2 if needed
library(ggplot2)

# Plot
ggplot(pca_df, aes(x = PC1, y = PC2, label = State)) +
  geom_point(aes(colour = Agreement), size = 3, alpha = 0.8) +
  geom_text(vjust = -1, size = 3, show.legend = FALSE) +
  scale_colour_manual(values = c("Agree" = "forestgreen", "Disagree" = "firebrick")) +
  labs(title = "Cluster Assignment Agreement (K-means vs Hierarchical)",
       subtitle = "Highlighted on First Two Principal Components",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal()

Interpretation

  • Green points show where both methods agree on the cluster assignment.
  • Red points highlight states where K-means and hierarchical clustering disagree.
  • This view helps you spot which states are ambiguous or lie near cluster boundaries; the snippet below lists them.
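
The disagreeing states can also be listed directly; this is a small addition to the code above.

# States assigned to different clusters by the two methods
pca_df$State[pca_df$Agreement == "Disagree"]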

6. Conclusion

In this notebook, we explored several core techniques in unsupervised learning using the USArrests dataset, including Principal Component Analysis (PCA), K-means clustering, and Hierarchical clustering with multiple linkage strategies.

PCA allowed us to reduce the dimensionality of the dataset and visualise structure in the data. K-means provided a straightforward partitioning of the states into clusters based on similarity, while hierarchical clustering offered a more nuanced, nested view of relationships between states.

By comparing cluster assignments and visualising areas of agreement and disagreement, we gained insights into how different algorithms interpret the underlying structure of the same data. No single method is universally superior; each reveals different aspects of the data, and their usefulness depends on the context and goals of the analysis.

These approaches are particularly valuable in exploratory settings where we seek to uncover latent patterns or groupings without predefined labels. Future directions could include evaluating other clustering methods (e.g. DBSCAN), using different distance metrics, or applying these techniques to other datasets.

Unsupervised learning remains a powerful and flexible tool in the data scientist’s toolkit, especially when used thoughtfully alongside visualisation and domain understanding.