Statistical learning lab 11
question 8
8a
# Load data
data("USArrests")
# Perform PCA with scaling and centering
pca_result <- prcomp(USArrests, scale. = TRUE)
# Compute PVE using the sdev output
sdev <- pca_result$sdev
pve_a <- sdev^2 / sum(sdev^2)
# Print PVE from method (a)
print("PVE using sdev:")
## [1] "PVE using sdev:"
print(pve_a)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
8b
# Standardize the data (same as the scale. = TRUE option in prcomp)
scaled_data <- scale(USArrests)
# Get the PCA loadings (phi)
loadings <- pca_result$rotation
# Compute numerator: squared projection of data onto each PC
scores <- scaled_data %*% loadings # PC scores: z_im = sum_j phi_jm * x_ij
numerator <- colSums(scores^2)
# Compute denominator: total variance in scaled data
denominator <- sum(scaled_data^2)
# Compute PVE using Equation 12.10
pve_b <- numerator / denominator
# Print PVE from method (b)
print("PVE using Equation 12.10:")
## [1] "PVE using Equation 12.10:"
print(pve_b)
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
results
- Both methods gave identical results, confirming that Equation 12.10
is consistent with the PCA implementation in R when the data are
centered and scaled.
- PC1 and PC2 together explain ~86.75% of the total variance (both
points are verified in the sketch below).
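A brief check of both claims, using only objects already defined above
(a sketch, not part of the required output):

# The two PVE vectors agree up to the component names
all.equal(unname(pve_b), pve_a)
# Cumulative PVE: the first two entries sum to about 0.8675
cumsum(pve_a)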
question 9
9a
# Load data
data("USArrests")
# Compute Euclidean distance
dist_usarrests <- dist(USArrests)
# Hierarchical clustering using complete linkage
hc_complete <- hclust(dist_usarrests, method = "complete")
# Plot dendrogram
plot(hc_complete, main = "Hierarchical Clustering (Unscaled Data)",
xlab = "", sub = "", cex = 0.7)

9b
# Cut tree to form 3 clusters
clusters_unscaled <- cutree(hc_complete, k = 3)
# Show clusters
print(clusters_unscaled)
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
# Create a list of states per cluster
split(names(clusters_unscaled), clusters_unscaled)
## $`1`
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
##
## $`2`
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
##
## $`3`
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
results
- Cluster 1 – higher-crime states, likely driven by high violent crime
rates (especially Assault): Alabama, Alaska, Arizona, California,
Delaware, Florida, Illinois, Louisiana, Maryland, Michigan, Mississippi,
Nevada, New Mexico, New York, North Carolina, South Carolina.
- Cluster 2 – moderate-crime states, a mixed and transitional group in
terms of crime levels: Arkansas, Colorado, Georgia, Massachusetts,
Missouri, New Jersey, Oklahoma, Oregon, Rhode Island, Tennessee, Texas,
Virginia, Washington, Wyoming.
- Cluster 3 – lower-crime states, generally lower in violent crime and
possibly more rural or less densely populated: Connecticut, Hawaii,
Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Minnesota, Montana,
Nebraska, New Hampshire, North Dakota, Ohio, Pennsylvania, South Dakota,
Utah, Vermont, West Virginia, Wisconsin. The per-cluster means computed
in the sketch below support this reading.
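To check the interpretation above, a short sketch (not required by the
exercise) that averages the original variables within each unscaled
cluster:

# Mean Murder, Assault, UrbanPop and Rape per cluster; cluster 1 should
# show the highest Assault values and cluster 3 the lowest
aggregate(USArrests, by = list(cluster = clusters_unscaled), FUN = mean)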
9c
# Scale the variables (mean 0, sd 1)
scaled_data <- scale(USArrests)
# Compute distances on scaled data
dist_scaled <- dist(scaled_data)
# Hierarchical clustering on scaled data
hc_complete_scaled <- hclust(dist_scaled, method = "complete")
# Plot dendrogram
plot(hc_complete_scaled, main = "Hierarchical Clustering (Scaled Data)",
xlab = "", sub = "", cex = 0.7)

# Cut tree for 3 clusters
clusters_scaled <- cutree(hc_complete_scaled, k = 3)
# Show clusters
split(names(clusters_scaled), clusters_scaled)
## $`1`
## [1] "Alabama" "Alaska" "Georgia" "Louisiana"
## [5] "Mississippi" "North Carolina" "South Carolina" "Tennessee"
##
## $`2`
## [1] "Arizona" "California" "Colorado" "Florida" "Illinois"
## [6] "Maryland" "Michigan" "Nevada" "New Mexico" "New York"
## [11] "Texas"
##
## $`3`
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Missouri" "Montana" "Nebraska" "New Hampshire"
## [17] "New Jersey" "North Dakota" "Ohio" "Oklahoma"
## [21] "Oregon" "Pennsylvania" "Rhode Island" "South Dakota"
## [25] "Utah" "Vermont" "Virginia" "Washington"
## [29] "West Virginia" "Wisconsin" "Wyoming"
9d
- Effect of scaling the variables: scaling changes the results of
hierarchical clustering substantially.
- Before scaling: variables with larger variances or measured on
larger scales (like Assault) dominate the Euclidean distances. In the
USArrests dataset, Assault has by far the highest variance and strongly
influences the clusters. As a result, the unscaled clustering mostly
reflects differences in Assault rates, potentially ignoring meaningful
patterns in Murder, Rape, or UrbanPop.
- After scaling: all variables contribute equally to the distance
calculations. This results in more balanced clusters that reflect a
combination of Murder, Assault, Rape, and UrbanPop, rather than just the
most dominant variable (see the sketch after this list).
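Two quick checks consistent with this explanation (a sketch using
objects defined above):

# Variance of each raw variable: Assault dominates, so it drives the
# unscaled distance matrix
apply(USArrests, 2, var)
# Cross-tabulation of the 3-cluster solutions before and after scaling
table(unscaled = clusters_unscaled, scaled = clusters_scaled)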
question 10
10a
set.seed(123)
# Number of observations per class
n <- 20
# Total observations = 60 (3 classes)
# 50 variables (features)
p <- 50
# Create 3 classes with different means
class1 <- matrix(rnorm(n * p, mean = 0), nrow = n)
class2 <- matrix(rnorm(n * p, mean = 3), nrow = n)
class3 <- matrix(rnorm(n * p, mean = -3), nrow = n)
# Combine into one dataset
X <- rbind(class1, class2, class3)
# Create class labels
true_labels <- rep(1:3, each = n)
10b
# Perform PCA
pca_result <- prcomp(X)
# Plot the first two principal components
pc_scores <- pca_result$x
# Color by true class
colors <- c("red", "blue", "darkgreen")
plot(pc_scores[,1:2], col = colors[true_labels], pch = 19,
xlab = "PC1", ylab = "PC2", main = "PCA: First Two Principal Components")
legend("topright", legend = paste("Class", 1:3), col = colors, pch = 19)

10c
set.seed(123)
kmeans_3 <- kmeans(X, centers = 3, nstart = 20)
# Compare clusters to true classes
table(true_labels, kmeans_3$cluster)
##
## true_labels 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
results
- K-means with K = 3 recovered the true class structure exactly (up to
a permutation of the cluster labels). This confirms that the data have
well-separated clusters and that the algorithm works as intended; a
label-invariant check is sketched below.
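Because cluster labels are arbitrary, a label-invariant agreement
measure such as the adjusted Rand index can also be reported. This
optional sketch assumes the mclust package is available (it is not used
elsewhere in the lab):

# An adjusted Rand index of 1 means perfect agreement up to label permutation
if (requireNamespace("mclust", quietly = TRUE)) {
  print(mclust::adjustedRandIndex(true_labels, kmeans_3$cluster))
}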
10d
set.seed(123)
kmeans_2 <- kmeans(X, centers = 2, nstart = 20)
table(true_labels, kmeans_2$cluster)
##
## true_labels 1 2
## 1 0 20
## 2 0 20
## 3 20 0
results
- Cluster 1 contains only Class 3. Cluster 2 contains both Class 1 and
Class 2.
- With K = 2, K-means cannot capture all three classes: two of them are
merged into a single cluster and the third is kept separate. Because
Class 1 (mean 0) lies midway between Class 2 (mean 3) and Class 3
(mean -3), merging Class 1 with either extreme gives a similar
within-cluster sum of squares, so which pair ends up merged is
essentially a tie; here Classes 1 and 2 were merged (see the sketch
below).
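A sketch of the geometry behind this: the pairwise distances between the
empirical class centers show that Classes 2 and 3 are roughly twice as
far from each other as either is from Class 1.

# Distances between class mean vectors; 1-2 and 1-3 are about equal, 2-3 is
# about twice as large, so K = 2 merges the middle class with one extreme
class_centers <- rbind(colMeans(class1), colMeans(class2), colMeans(class3))
dist(class_centers)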
10e
set.seed(123)
kmeans_4 <- kmeans(X, centers = 4, nstart = 20)
table(true_labels, kmeans_4$cluster)
##
## true_labels 1 2 3 4
## 1 0 0 20 0
## 2 8 0 0 12
## 3 0 20 0 0
results
- Class 1 is perfectly captured by Cluster 3, and Class 3 by Cluster 2.
Class 2, however, is split between Cluster 1 (8 observations) and
Cluster 4 (12 observations).
- K = 4 results in over-clustering: two of the true classes are still
perfectly recovered, but one (Class 2) is artificially split.
- This supports K = 3 as the best match for the true class structure;
the within-cluster sum of squares sketch below makes the same point.
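A common way to make this comparison explicit is an elbow plot of the
total within-cluster sum of squares across values of K (a sketch using
the same X):

# Total within-cluster SS for K = 1..6; the decrease flattens sharply after
# K = 3, consistent with three true classes
set.seed(123)
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")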
10f
set.seed(123)
kmeans_pc <- kmeans(pc_scores[, 1:2], centers = 3, nstart = 20)
table(true_labels, kmeans_pc$cluster)
##
## true_labels 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
results
- K-means clustering on the reduced 2D PCA representation perfectly
recovers the original class labels.
- This result is identical to part (c), where clustering was applied to
the full 50-dimensional data.
- K-means on the first two principal components is therefore just as
effective as clustering on the full dataset here. This illustrates that
PCA can be a useful preprocessing step, reducing dimensionality and
noise while preserving the class structure; the sketch below shows how
much of the total variance the first two PCs retain.
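Because of the strong mean shift, the class separation falls mainly
along the leading principal component, so very little is lost by keeping
only two PCs. A sketch using pca_result from part (b):

# PVE of the simulated data; the first two components retain most of the
# total variance, which is why clustering on them works so well
pve_sim <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cumsum(pve_sim)[1:2]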
10g
# Scale each variable (standard deviation = 1)
X_scaled <- scale(X)
set.seed(123)
kmeans_scaled <- kmeans(X_scaled, centers = 3, nstart = 20)
table(true_labels, kmeans_scaled$cluster)
##
## true_labels 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
results
- In part (b) we performed PCA, and in part (c) we applied K-means
directly to the raw data; there, we also obtained perfect clustering.
- In (g), after scaling each variable to have standard deviation 1, we
again obtain perfect clustering.
- Scaling the data before clustering gives the same perfect result as
in part (c) because the simulated features already have similar
variances (checked in the sketch below). In most practical settings,
however, scaling is important to ensure that all variables contribute
fairly to the distance calculations.
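A quick check of the similar-variances claim (a sketch using the
simulated X):

# Standard deviation of each of the 50 features: all are of similar magnitude
# (each feature mixes N(0,1), N(3,1) and N(-3,1)), so scaling changes little
summary(apply(X, 2, sd))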