This document provides an overview and practical examples of Unsupervised Learning methods using R, based on Chapter 10 of An Introduction to Statistical Learning with Applications in R. We explore the following techniques: Principal Component Analysis (PCA), K-means clustering, and hierarchical clustering.
Each section includes R code, explanations, and interpretation tips.
First things first… Let’s install and load packages.
We use base R functions for PCA and clustering, so no additional packages are needed at this point.
# No packages needed for base R functions used in PCA and clustering
We will use the USArrests dataset, which contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973, along with the percentage of the population living in urban areas (UrbanPop).
head(USArrests)
|            | Murder | Assault | UrbanPop | Rape |
|------------|--------|---------|----------|------|
| Alabama    | 13.2   | 236     | 58       | 21.2 |
| Alaska     | 10.0   | 263     | 48       | 44.5 |
| Arizona    | 8.1    | 294     | 80       | 31.0 |
| Arkansas   | 8.8    | 190     | 50       | 19.5 |
| California | 9.0    | 276     | 91       | 40.6 |
| Colorado   | 7.9    | 204     | 78       | 38.7 |
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
Explanation:
This step helps us get familiar with the dataset. We check the first few rows and generate summary statistics to understand variable ranges and distributions.
# Standardize the variables
USArrests_scaled <- scale(USArrests)
# Run PCA
pca_result <- prcomp(USArrests_scaled)
# PCA summary
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
# Plot PCA
biplot(pca_result, scale = 0)
Explanation: We first scale the data using scale() because PCA is sensitive to the scale of the variables. If the variables are measured on different scales (e.g. Assault has a much larger range than Murder), PCA would be dominated by the variable with the largest variance. Scaling ensures that each variable has a mean of 0 and a standard deviation of 1, allowing all variables to contribute equally to the analysis.
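As a quick optional check (not part of the original notebook), we can see why scaling matters: the raw variables have very different variances, and after scale() every column has mean 0 and standard deviation 1.
# Raw variances differ by orders of magnitude (Assault dominates)
apply(USArrests, 2, var)
# After scaling, each column has mean 0 and standard deviation 1
round(colMeans(USArrests_scaled), 3)
apply(USArrests_scaled, 2, sd)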
The prcomp() function performs principal component analysis. Because the data have been standardised, this amounts to extracting the eigenvalues and eigenvectors of the correlation matrix of the variables (internally, prcomp() computes them via a singular value decomposition of the data matrix).
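As a small sketch (added here, not part of the original code), we can verify this equivalence numerically. Since the data are standardised, the covariance matrix of the scaled data is also the correlation matrix of the original variables; the component variances should match its eigenvalues, and the loadings its eigenvectors up to sign.
# Optional sanity check: compare prcomp() with an explicit eigendecomposition
eig <- eigen(cov(USArrests_scaled))
round(eig$values, 4)                 # eigenvalues of the covariance matrix
round(pca_result$sdev^2, 4)          # variances of the PCs -- should match
# Loadings agree with the eigenvectors up to an arbitrary sign flip
all.equal(abs(unname(eig$vectors)), abs(unname(pca_result$rotation)))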
The summary() function reports, for each principal component, its standard deviation, the proportion of variance it explains, and the cumulative proportion of variance explained.
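The proportions reported by summary() can also be reproduced by hand from the component standard deviations; this is an optional sketch, not part of the original analysis.
# Proportion of variance explained (PVE) computed manually
pve <- pca_result$sdev^2 / sum(pca_result$sdev^2)
round(pve, 4)            # matches the "Proportion of Variance" row above
round(cumsum(pve), 4)    # matches the "Cumulative Proportion" row above
# Simple scree-style plot of the cumulative PVE
plot(cumsum(pve), type = "b", ylim = c(0, 1),
     xlab = "Principal Component", ylab = "Cumulative Proportion of Variance")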
Finally, the biplot() visualises the data in the space of the first two principal components, showing the observations and the variables on the same plot. This allows us to see patterns or clusters in the data and to understand which variables are most influential.
Summary
Together, these steps standardise the data, extract the principal components, and display the observations and variables in the space of the leading components; the first two components capture most of the variation in the crime statistics.
Key Components of a Biplot
Points represent the observations (the states), plotted by their scores on the first two principal components. Arrows represent the variables; the direction and length of each arrow indicate how strongly that variable contributes to the two components, and arrows pointing in similar directions suggest correlated variables.
Interpretation
Use the summary() output to assess how many principal components capture most of the variance; here the first two components explain roughly 87% of the variance. In the biplot, states that lie close together have similar crime profiles, and states positioned far along the direction of an arrow have relatively high values of that variable.
Interpretation of This Biplot (biplot(pca_result, scale = 0))
This biplot visualises the principal component analysis (PCA) of the USArrests dataset after scaling (standardisation). Here’s what you’re looking at:
- Each point is a state, and the arrows represent the four variables: Murder, Assault, UrbanPop and Rape.
- The first principal component is driven mainly by the crime variables Assault, Murder, and Rape.
- The second principal component contrasts the level of urbanisation (UrbanPop) and the types of crimes.
- If states lie in the direction of UrbanPop and Rape, they might be more urban and have higher reported sexual assaults.

K-means clustering is an unsupervised machine learning technique used to group data points into a pre-defined number (K) of clusters. It aims to minimise the within-cluster variation, grouping similar observations together.
K-means clustering is an unsupervised learning technique that groups observations into K distinct, non-overlapping clusters based on similarity. It is particularly useful when we suspect natural groupings exist in the data but do not have pre-labelled classes.
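Formally (following the formulation in ISLR Chapter 10, added here for reference), K-means with squared Euclidean distance seeks the partition $C_1, \ldots, C_K$ of the observations that minimises the total within-cluster variation:

$$\underset{C_1,\ldots,C_K}{\text{minimise}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2 \right\}$$

where $|C_k|$ is the number of observations in cluster $k$ and $p$ is the number of variables (four in USArrests).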
(For comparison: hierarchical clustering, covered later in this document, instead computes pairwise distances between states and builds a hierarchy of clusters; there, complete linkage means the distance between two clusters is defined as the furthest distance between any two points in the clusters.)
Let’s see how it works.
# Set seed for reproducibility
set.seed(2)
# Apply k-means clustering with 2 clusters
km_out <- kmeans(USArrests_scaled, centers = 2, nstart = 20)
# View cluster assignments
km_out$cluster
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
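As an optional check (not in the original code), we can recompute the objective that K-means minimises, the total within-cluster sum of squares, directly from the cluster assignments and compare it with the value reported by kmeans().
# Recompute the total within-cluster sum of squares from the assignments:
# for each cluster, centre its observations and sum the squared deviations
manual_wss <- sum(sapply(1:2, function(k) {
  cluster_k <- USArrests_scaled[km_out$cluster == k, , drop = FALSE]
  sum(scale(cluster_k, center = TRUE, scale = FALSE)^2)
}))
c(manual = manual_wss, reported = km_out$tot.withinss)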
# Visualise clusters using PCA
plot(pca_result$x[,1:2], col = km_out$cluster, pch = 20, cex = 2)
Interpretation of the Results
Each of the 50 states is assigned to one of the two clusters; the assignments are stored in km_out$cluster.
Visualisation with PCA Plot
The last line:
plot(pca_result$x[,1:2], col = km_out$cluster, pch = 20, cex = 2)
projects the K-means cluster assignments onto the first two principal components (from the earlier PCA).
This visualisation shows each state plotted by its scores on the first two principal components, coloured by its K-means cluster, so we can see how well the two clusters separate in the reduced space.
# Load ggplot2
library(ggplot2)
# Create a data frame with PCA results and cluster assignment
pca_df <- as.data.frame(pca_result$x[, 1:2]) # First two PCs
pca_df$State <- rownames(USArrests)
pca_df$Cluster <- factor(km_out$cluster)
# Plot with ggplot2
ggplot(pca_df, aes(x = PC1, y = PC2, colour = Cluster, label = State)) +
geom_point(size = 3, alpha = 0.7) +
geom_text(vjust = -1.2, size = 3) +
labs(title = "K-means Clustering of US States (k = 2)",
subtitle = "Visualised using First Two Principal Components",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal() +
scale_colour_brewer(palette = "Set1")
This plot shows the same two clusters as above, now with state labels added, making it easier to identify which states fall into each group.
Let us examine what happens when we introduce an additional cluster, that is, when we set K = 3.
Trying different values of K is a key step in unsupervised learning as it helps us uncover potentially more nuanced groupings in the data and assess how cluster structure evolves.
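One informal way to compare candidate values of K (an optional sketch, not part of the original workflow) is an “elbow” plot of the total within-cluster sum of squares: a pronounced bend suggests a value of K beyond which extra clusters add little.
# Total within-cluster sum of squares for K = 1 to 10 (elbow heuristic)
set.seed(2)
wss <- sapply(1:10, function(k) {
  kmeans(USArrests_scaled, centers = k, nstart = 20)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")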
In this example, we apply K-means clustering to the USArrests dataset (standardised), aiming to identify states with similar crime patterns. We choose K = 3 to form three broad groups of states.
# Apply K-means clustering with 3 clusters
set.seed(2)
km_out_3 <- kmeans(USArrests_scaled, centers = 3, nstart = 20)
# View cluster sizes
table(km_out_3$cluster)
##
## 1 2 3
## 17 13 20
To better understand the structure of the clusters, we visualise the K-means results using the first two principal components from PCA. This helps us see the separation in two dimensions while preserving most of the variation in the data.
library(ggplot2)
# Create data frame with PCA results and K=3 cluster assignments
pca_df_3 <- as.data.frame(pca_result$x[, 1:2])
pca_df_3$State <- rownames(USArrests_scaled)
pca_df_3$Cluster <- factor(km_out_3$cluster)
# Project K=3 centroids into PCA space
centroids_3 <- as.data.frame(km_out_3$centers %*% pca_result$rotation[, 1:2])
colnames(centroids_3) <- c("PC1", "PC2") # ensure names match
centroids_3$Cluster <- factor(1:3)
# Plot
ggplot(pca_df_3, aes(x = PC1, y = PC2, colour = Cluster)) +
geom_point(size = 3, alpha = 0.8) +
geom_text(aes(label = State), vjust = -1, size = 3, show.legend = FALSE) +
geom_point(data = centroids_3, aes(x = PC1, y = PC2),
colour = "black", shape = 4, size = 5, stroke = 1.5) +
labs(title = "K-means Clustering of US States (K = 3)",
subtitle = "Visualised using First Two Principal Components",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal() +
scale_colour_brewer(palette = "Dark2")
When we increase the number of clusters from 2 to 3, we allow the
algorithm to form a more nuanced separation of the states based on their
standardised crime statistics.
In the visualisation, each point is a state coloured by its K = 3 cluster assignment, and the black crosses mark the cluster centroids projected into the PCA space.
What We Can Observe: the three-cluster solution gives a more granular partition of the states, but some centroids now lie close to one another, suggesting a degree of overlap between neighbouring clusters (see the comparison below).
Comparison with K = 2 Clustering
| Feature | K = 2 Clustering | K = 3 Clustering |
|---|---|---|
| Number of groups | Two broad clusters | More refined, with three distinct clusters |
| Interpretability | Easy to interpret: “high crime” vs “low crime” | More granular: potential subgroups based on specific crime patterns |
| Cluster centroids | Clearly separated | One centroid appears close to others, suggesting overlap or ambiguity |
| State grouping | Large states grouped coarsely | Some states that were grouped together now form a third category |
| Usefulness | Good for simplicity and clarity | Better for capturing more subtle trends in the data |
Final Thoughts on PCA and K-means
So far this document has explored PCA and K-means clustering on the USArrests dataset. These unsupervised learning methods help uncover structure in the data when we do not have a response variable, and they are useful for dimensionality reduction and exploratory data analysis. We now turn to the third technique, hierarchical clustering.
Hierarchical clustering is an unsupervised learning method that builds a hierarchy of clusters without requiring the number of clusters to be specified beforehand. It visualises nested groupings using a dendrogram, a tree-like diagram showing how states are merged step by step.
Different linkage methods define how the distance between two clusters is computed: complete linkage uses the largest pairwise distance between points in the two clusters, average linkage uses the average pairwise distance, and single linkage uses the smallest pairwise distance.
Each method produces a different clustering structure.
# Compute distance matrix
dist_matrix <- dist(USArrests_scaled)
# Perform hierarchical clustering using different linkage methods
hc_complete <- hclust(dist_matrix, method = "complete")
hc_average <- hclust(dist_matrix, method = "average")
hc_single <- hclust(dist_matrix, method = "single")
We now plot the dendrograms side by side to compare how different linkage methods group the US states.
par(mfrow = c(1, 3)) # 1 row, 3 columns
plot(hc_complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_complete, k = 3, border = "red")
plot(hc_average, main = "Average Linkage", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_average, k = 3, border = "blue")
plot(hc_single, main = "Single Linkage", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_single, k = 3, border = "forestgreen")
Interpretation and Comparison
| Linkage Method | Description | Observations |
|---|---|---|
| Complete | Merges clusters based on the maximum distance | Tends to form compact, evenly sized clusters |
| Average | Uses the average distance between clusters | Produces moderate, balanced cluster shapes |
| Single | Merges based on minimum distance (nearest neighbours) | May result in a chaining effect (long, thin clusters) |
Insights
Complete and average linkage produce relatively balanced groupings of the states, whereas single linkage tends to merge observations one at a time, which leads to the chaining effect noted above.
Compared to K-means
Hierarchical clustering does not require the number of clusters to be fixed in advance: we can cut the dendrogram at any height (for example with cutree()) to obtain a chosen number of clusters, and the dendrogram shows how the groupings are nested.
We now create a table that shows the cluster assignments for each state using K-means (K = 3) and hierarchical clustering with complete linkage, cut at k = 3.
# Assign cluster labels
kmeans_clusters <- km_out_3$cluster
hier_clusters <- cutree(hc_complete, k = 3)
# Create comparison table
cluster_comparison <- data.frame(
State = rownames(USArrests),
KMeans = factor(kmeans_clusters),
Hierarchical = factor(hier_clusters)
)
# Preview
knitr::kable(head(cluster_comparison, 10), caption = "Cluster Assignments by Method (First 10 States)")
|  | State | KMeans | Hierarchical |
|---|---|---|---|
| Alabama | Alabama | 3 | 1 |
| Alaska | Alaska | 3 | 1 |
| Arizona | Arizona | 3 | 2 |
| Arkansas | Arkansas | 1 | 3 |
| California | California | 3 | 2 |
| Colorado | Colorado | 3 | 2 |
| Connecticut | Connecticut | 1 | 3 |
| Delaware | Delaware | 1 | 3 |
| Florida | Florida | 3 | 2 |
| Georgia | Georgia | 3 | 1 |
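Each state name appears twice in the table above, most likely because the named cluster vectors supply row names to the data frame in addition to the State column. If preferred, this can be avoided by stripping the names (an optional tweak, not in the original code):
# Optional: drop the vector names so kable() shows each state only once
cluster_comparison <- data.frame(
  State = rownames(USArrests),
  KMeans = factor(unname(kmeans_clusters)),
  Hierarchical = factor(unname(hier_clusters))
)
knitr::kable(head(cluster_comparison, 10),
             caption = "Cluster Assignments by Method (First 10 States)")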
Coloured Dendrogram (Complete Linkage)
Here we use the dendextend package to colour the dendrogram branches according to the k = 3 hierarchical clusters.
# install.packages("dendextend") # Uncomment if not installed
library(dendextend)
# Convert and colour the dendrogram
dend <- as.dendrogram(hc_complete)
dend_col <- color_branches(dend, k = 3)
# Plot coloured dendrogram
plot(dend_col, main = "Coloured Dendrogram - Complete Linkage", ylab = "Height")
Cluster Agreement (Adjusted Rand Index)
The Adjusted Rand Index (ARI) measures the agreement between two clusterings. It is close to 0 (and can even be slightly negative) when the agreement is no better than chance, and equals 1 for a perfect match.
# install.packages("mclust") # Uncomment if not installed
library(mclust)
# Compute ARI between K-means and hierarchical clusters
ari <- adjustedRandIndex(kmeans_clusters, hier_clusters)
ari
## [1] 0.393779
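Because the ARI compares partitions rather than label values, it is unaffected by how the clusters happen to be numbered. A quick illustrative check (not in the original code):
# Relabel the K-means clusters (1 -> 2, 2 -> 3, 3 -> 1): the ARI is unchanged,
# since it depends only on which states are grouped together
relabelled <- c(2, 3, 1)[kmeans_clusters]
adjustedRandIndex(relabelled, hier_clusters)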
Interpretation
The table above shows how each method assigns clusters to individual states. Differences in assignments arise because the two algorithms optimise different criteria: K-means partitions the states by minimising within-cluster variation around K centroids, whereas hierarchical clustering merges states step by step according to pairwise distances and the chosen linkage. In addition, the numeric labels each method assigns to its clusters are arbitrary.
The coloured dendrogram offers a cleaner view of the hierarchical structure, showing how the clusters form step by step and where the k = 3 cut falls.
The Adjusted Rand Index (ARI) quantifies the similarity between the two clusterings. A high ARI (close to 1) suggests that both methods yield consistent clusters, while a lower ARI implies they capture different aspects of the data structure. Here, the ARI of roughly 0.39 indicates only moderate agreement between K-means and hierarchical clustering.
To better understand where clustering methods agree or differ, we highlight states on the PCA plot based on whether their K-means and hierarchical cluster assignments match.
# Build PCA plot data with cluster comparison
pca_df <- as.data.frame(pca_result$x[, 1:2])
pca_df$State <- rownames(USArrests)
pca_df$KMeans <- factor(kmeans_clusters)
pca_df$Hierarchical <- factor(hier_clusters)
# Add agreement flag
pca_df$Agreement <- ifelse(pca_df$KMeans == pca_df$Hierarchical, "Agree", "Disagree")
# Load ggplot2 if needed
library(ggplot2)
# Plot
ggplot(pca_df, aes(x = PC1, y = PC2, label = State)) +
geom_point(aes(colour = Agreement), size = 3, alpha = 0.8) +
geom_text(vjust = -1, size = 3, show.legend = FALSE) +
scale_colour_manual(values = c("Agree" = "forestgreen", "Disagree" = "firebrick")) +
labs(title = "Cluster Assignment Agreement (K-means vs Hierarchical)",
subtitle = "Highlighted on First Two Principal Components",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
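A simple cross-tabulation (an optional addition) summarises the overlap between the two solutions without relying on the arbitrary cluster numbers lining up:
# Cross-tabulate the two cluster solutions; large counts concentrated in a few
# cells mean the methods group many of the same states together
table(KMeans = kmeans_clusters, Hierarchical = hier_clusters)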
Interpretation
States coloured green receive the same numeric label from both methods, while red states are labelled differently. Because each algorithm numbers its clusters arbitrarily, this label-matching view is only a rough guide; the label-invariant Adjusted Rand Index above gives a more reliable summary of overall agreement.
In this notebook, we explored several core techniques in unsupervised
learning using the USArrests
dataset, including
Principal Component Analysis (PCA), K-means
clustering, and Hierarchical clustering with
multiple linkage strategies.
PCA allowed us to reduce the dimensionality of the dataset and visualise structure in the data. K-means provided a straightforward partitioning of the states into clusters based on similarity, while hierarchical clustering offered a more nuanced, nested view of relationships between states.
By comparing cluster assignments and visualising areas of agreement and disagreement, we gained insights into how different algorithms interpret the underlying structure of the same data. No single method is universally superior; each reveals different aspects of the data, and their usefulness depends on the context and goals of the analysis.
These approaches are particularly valuable in exploratory settings where we seek to uncover latent patterns or groupings without predefined labels. Future directions could include evaluating other clustering methods (e.g. DBSCAN), using different distance metrics, or applying these techniques to other datasets.
Unsupervised learning remains a powerful and flexible tool in the data scientist’s toolkit, especially when used thoughtfully alongside visualisation and domain understanding.