Statistical learning lab 11

question 8

8a

# Load data
data("USArrests")

# Perform PCA with scaling and centering
pca_result <- prcomp(USArrests, scale. = TRUE)

# Compute PVE using the sdev output
sdev <- pca_result$sdev
pve_a <- sdev^2 / sum(sdev^2)

# Print PVE from method (a)
print("PVE using sdev:")
## [1] "PVE using sdev:"
print(pve_a)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

8b

# Standardize the data (equivalent to scale. = TRUE in prcomp())
scaled_data <- scale(USArrests)

# Get the PCA loadings (phi)
loadings <- pca_result$rotation

# Compute numerator: squared projection of data onto each PC
scores <- scaled_data %*% loadings  # PC scores: z_im = sum_j x_ij * phi_jm
numerator <- colSums(scores^2)

# Compute denominator: total variance in scaled data
denominator <- sum(scaled_data^2)

# Compute PVE using Equation 12.10
pve_b <- numerator / denominator

# Print PVE from method (b)
print("PVE using Equation 12.10:")
## [1] "PVE using Equation 12.10:"
print(pve_b)
##        PC1        PC2        PC3        PC4 
## 0.62006039 0.24744129 0.08914080 0.04335752

results

  • Both methods give identical results, confirming that Equation 12.10 matches the PCA implementation in prcomp() when the data are centered and scaled.
  • PC1 and PC2 together explain about 86.75% of the total variance (a scree and cumulative-PVE plot is sketched below).
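
As a quick visual check (a minimal sketch reusing pve_a from part (a)), a scree plot and a cumulative-PVE plot make the ~86.75% figure easy to read off:

# Sketch: scree plot and cumulative PVE, reusing pve_a from above
par(mfrow = c(1, 2))
plot(pve_a, type = "b", pch = 19, ylim = c(0, 1),
     xlab = "Principal Component", ylab = "PVE", main = "Scree Plot")
plot(cumsum(pve_a), type = "b", pch = 19, ylim = c(0, 1),
     xlab = "Principal Component", ylab = "Cumulative PVE", main = "Cumulative PVE")
par(mfrow = c(1, 1))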

question 9

9a

# Load data
data("USArrests")

# Compute Euclidean distance
dist_usarrests <- dist(USArrests)

# Hierarchical clustering using complete linkage
hc_complete <- hclust(dist_usarrests, method = "complete")

# Plot dendrogram
plot(hc_complete, main = "Hierarchical Clustering (Unscaled Data)",
     xlab = "", sub = "", cex = 0.7)

9b

# Cut tree to form 3 clusters
clusters_unscaled <- cutree(hc_complete, k = 3)

# Show clusters
print(clusters_unscaled)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2
# Create a list of states per cluster
split(names(clusters_unscaled), clusters_unscaled)
## $`1`
##  [1] "Alabama"        "Alaska"         "Arizona"        "California"    
##  [5] "Delaware"       "Florida"        "Illinois"       "Louisiana"     
##  [9] "Maryland"       "Michigan"       "Mississippi"    "Nevada"        
## [13] "New Mexico"     "New York"       "North Carolina" "South Carolina"
## 
## $`2`
##  [1] "Arkansas"      "Colorado"      "Georgia"       "Massachusetts"
##  [5] "Missouri"      "New Jersey"    "Oklahoma"      "Oregon"       
##  [9] "Rhode Island"  "Tennessee"     "Texas"         "Virginia"     
## [13] "Washington"    "Wyoming"      
## 
## $`3`
##  [1] "Connecticut"   "Hawaii"        "Idaho"         "Indiana"      
##  [5] "Iowa"          "Kansas"        "Kentucky"      "Maine"        
##  [9] "Minnesota"     "Montana"       "Nebraska"      "New Hampshire"
## [13] "North Dakota"  "Ohio"          "Pennsylvania"  "South Dakota" 
## [17] "Utah"          "Vermont"       "West Virginia" "Wisconsin"

results

  • Cluster 1 (higher-crime states): likely driven by high violent crime rates, especially Assault. States: Alabama, Alaska, Arizona, California, Delaware, Florida, Illinois, Louisiana, Maryland, Michigan, Mississippi, Nevada, New Mexico, New York, North Carolina, South Carolina.
  • Cluster 2 (moderate-crime states): a mixed, transitional group in terms of crime levels. States: Arkansas, Colorado, Georgia, Massachusetts, Missouri, New Jersey, Oklahoma, Oregon, Rhode Island, Tennessee, Texas, Virginia, Washington, Wyoming.
  • Cluster 3 (lower-crime states): generally lower in violent crime, possibly more rural or less densely populated. States: Connecticut, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Minnesota, Montana, Nebraska, New Hampshire, North Dakota, Ohio, Pennsylvania, South Dakota, Utah, Vermont, West Virginia, Wisconsin. (A quick check of per-cluster means is sketched after this list.)
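
As a quick check on this interpretation (a minimal sketch reusing clusters_unscaled and USArrests from above), the per-cluster means of the four variables can be compared; cluster 1 should show the highest average Assault:

# Sketch: mean of each variable within each unscaled-data cluster
aggregate(USArrests, by = list(cluster = clusters_unscaled), FUN = mean)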

9c

# Scale the variables (mean 0, sd 1)
scaled_data <- scale(USArrests)

# Compute distances on scaled data
dist_scaled <- dist(scaled_data)

# Hierarchical clustering on scaled data
hc_complete_scaled <- hclust(dist_scaled, method = "complete")

# Plot dendrogram
plot(hc_complete_scaled, main = "Hierarchical Clustering (Scaled Data)",
     xlab = "", sub = "", cex = 0.7)

# Cut tree for 3 clusters
clusters_scaled <- cutree(hc_complete_scaled, k = 3)

# Show clusters
split(names(clusters_scaled), clusters_scaled)
## $`1`
## [1] "Alabama"        "Alaska"         "Georgia"        "Louisiana"     
## [5] "Mississippi"    "North Carolina" "South Carolina" "Tennessee"     
## 
## $`2`
##  [1] "Arizona"    "California" "Colorado"   "Florida"    "Illinois"  
##  [6] "Maryland"   "Michigan"   "Nevada"     "New Mexico" "New York"  
## [11] "Texas"     
## 
## $`3`
##  [1] "Arkansas"      "Connecticut"   "Delaware"      "Hawaii"       
##  [5] "Idaho"         "Indiana"       "Iowa"          "Kansas"       
##  [9] "Kentucky"      "Maine"         "Massachusetts" "Minnesota"    
## [13] "Missouri"      "Montana"       "Nebraska"      "New Hampshire"
## [17] "New Jersey"    "North Dakota"  "Ohio"          "Oklahoma"     
## [21] "Oregon"        "Pennsylvania"  "Rhode Island"  "South Dakota" 
## [25] "Utah"          "Vermont"       "Virginia"      "Washington"   
## [29] "West Virginia" "Wisconsin"     "Wyoming"

9d

  • Effect of scaling the variables: scaling substantially changes the results of hierarchical clustering (both points below are illustrated in the sketch after this list).
  • Before scaling: variables with larger variances or measured on larger scales (such as Assault) dominate the Euclidean distance calculation. In the USArrests dataset, Assault has by far the highest variance and strongly influences the clusters, so the unscaled clustering mostly reflects differences in Assault rates and can obscure meaningful patterns in Murder, Rape, or UrbanPop.
  • After scaling: all variables contribute equally to the distance calculation, giving more balanced clusters that reflect a combination of Murder, Assault, Rape, and UrbanPop rather than just the most dominant variable.
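
Both points can be illustrated with a minimal base-R sketch: the raw variances show how strongly Assault dominates the unscaled distances, and a cross-tabulation shows how many states move to a different cluster once the variables are scaled.

# Sketch: raw variable variances and a comparison of the two cluster assignments
apply(USArrests, 2, var)   # Assault has by far the largest variance
table(unscaled = clusters_unscaled, scaled = clusters_scaled)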

question 10

10a

set.seed(123)

# Number of observations per class
n <- 20

# Total observations = 60 (3 classes)
# 50 variables (features)
p <- 50

# Create 3 classes with different means
class1 <- matrix(rnorm(n * p, mean = 0), nrow = n)
class2 <- matrix(rnorm(n * p, mean = 3), nrow = n)
class3 <- matrix(rnorm(n * p, mean = -3), nrow = n)

# Combine into one dataset
X <- rbind(class1, class2, class3)

# Create class labels
true_labels <- rep(1:3, each = n)

10b

# Perform PCA
pca_result <- prcomp(X)

# Plot the first two principal components
pc_scores <- pca_result$x

# Color by true class
colors <- c("red", "blue", "darkgreen")
plot(pc_scores[,1:2], col = colors[true_labels], pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA: First Two Principal Components")
legend("topright", legend = paste("Class", 1:3), col = colors, pch = 19)

10c

set.seed(123)
kmeans_3 <- kmeans(X, centers = 3, nstart = 20)

# Compare clusters to true classes
table(true_labels, kmeans_3$cluster)
##            
## true_labels  1  2  3
##           1  0 20  0
##           2 20  0  0
##           3  0  0 20

results

  • K-means with K = 3 recovered the true class structure exactly; note that cluster numbers are arbitrary, so the table is a permuted diagonal rather than a diagonal. This confirms that the simulated classes are well separated and that the algorithm behaves as intended (a label-invariant check is sketched below).
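
Because cluster numbers are arbitrary, the confusion table above need not be diagonal. A minimal base-R sketch of a label-invariant check: recovery is perfect if every true class falls entirely into a single cluster.

# Sketch: TRUE if each true class maps to exactly one cluster
all(apply(table(true_labels, kmeans_3$cluster), 1, function(r) sum(r > 0) == 1))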

10d

set.seed(123)
kmeans_2 <- kmeans(X, centers = 2, nstart = 20)
table(true_labels, kmeans_2$cluster)
##            
## true_labels  1  2
##           1  0 20
##           2  0 20
##           3 20  0

results

  • Cluster 1 contains only Class 3, while Cluster 2 merges Class 1 and Class 2.
  • With K = 2, K-means cannot capture all three classes. It still partially reflects the structure of the data by isolating one class intact and merging the other two.

10e

set.seed(123)
kmeans_4 <- kmeans(X, centers = 4, nstart = 20)
table(true_labels, kmeans_4$cluster)
##            
## true_labels  1  2  3  4
##           1  0  0 20  0
##           2  8  0  0 12
##           3  0 20  0  0

results

  • Class 1 is perfectly captured by Cluster 3, and Class 3 by Cluster 2. Class 2, however, is split between Cluster 1 (8 observations) and Cluster 4 (12 observations).
  • K = 4 results in over-clustering: two of the true classes are still clustered perfectly, but Class 2 is artificially split in two.
  • This supports K = 3 as the best match for the true class structure (see the within-cluster sum-of-squares sketch after this list).
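
As a supporting check (a sketch using the same simulated X), the total within-cluster sum of squares can be compared across K = 2, 3, 4; the decrease should flatten noticeably after K = 3.

# Sketch: total within-cluster sum of squares for K = 2, 3, 4
set.seed(123)
wss <- sapply(2:4, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
names(wss) <- paste0("K = ", 2:4)
wss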

10f

set.seed(123)
kmeans_pc <- kmeans(pc_scores[, 1:2], centers = 3, nstart = 20)
table(true_labels, kmeans_pc$cluster)
##            
## true_labels  1  2  3
##           1  0 20  0
##           2 20  0  0
##           3  0  0 20

results

  • K-means clustering on the reduced two-dimensional PCA representation perfectly recovers the original class labels.
  • The result is identical to the one in (c), where clustering was applied to the full 50-dimensional data.
  • For this simulated data, clustering on the first two principal components is just as effective as clustering on the full dataset, suggesting that PCA can be a useful preprocessing step that reduces noise and dimensionality while preserving the class structure (see the sketch after this list).
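
One way to see why two components suffice here (a sketch reusing pca_result from part (b), fitted to the simulated X): the three class means are shifted along essentially a single direction, so PC1 should already carry most of the variance and separate the classes, with PC2 adding little.

# Sketch: proportion of variance explained by the first two PCs of the simulated data
pve_sim <- pca_result$sdev^2 / sum(pca_result$sdev^2)
round(cumsum(pve_sim)[1:2], 3)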

10g

# Scale each variable (standard deviation = 1)
X_scaled <- scale(X)

set.seed(123)
kmeans_scaled <- kmeans(X_scaled, centers = 3, nstart = 20)
table(true_labels, kmeans_scaled$cluster)
##            
## true_labels  1  2  3
##           1  0 20  0
##           2 20  0  0
##           3  0  0 20

results

  • In part (b) we performed PCA, and in (c) we applied K-means directly to the raw data and obtained perfect clustering.
  • In (g), after scaling each variable to have standard deviation 1, K-means again produces perfect clustering.
  • Scaling the data before clustering gives the same perfect result as in part (c) because the simulated features already have similar spreads (checked in the sketch below). In most practical settings, however, scaling is important so that all variables contribute fairly to the distance calculation.
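
The claim that the simulated features already have similar spreads can be checked directly with a minimal sketch:

# Sketch: column standard deviations of X before scaling; they should be roughly
# equal across all 50 features, which is why scaling changes little here
summary(apply(X, 2, sd))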