In Section 12.2.3, a formula for calculating the PVE was given in Equation 12.10. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate the PVE in two ways:
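For reference, Equation 12.10 defines the PVE of the m-th principal component (for centered, and here also scaled, observations $x_{ij}$ with loadings $\phi_{jm}$) as

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}\,x_{ij}\right)^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}},$$

which is exactly the quantity the manual calculation in part (b) below computes.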
# Load the data
data("USArrests")
# Center and scale the data
scaled_data <- scale(USArrests)
# (a) Using prcomp()$sdev
prcomp_result <- prcomp(USArrests, scale. = TRUE)
# Standard deviations of PCs
pc_sdev <- prcomp_result$sdev
# PVE: variance of each PC / total variance
pve_a <- pc_sdev^2 / sum(pc_sdev^2)
print("PVE using prcomp()$sdev:")
## [1] "PVE using prcomp()$sdev:"
print(pve_a)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
# (b) Using Equation 12.10 directly
# PC loadings (the phi values)
loadings <- prcomp_result$rotation
# Get the PC scores (Z = X %*% loadings)
scores <- as.matrix(scaled_data) %*% loadings
# Numerators: sum of squared scores for each PC
numerators <- colSums(scores^2)
# Denominator: total sum of squares of the data matrix
total_variance <- sum(scaled_data^2)
# Apply Equation 12.10
pve_b <- numerators / total_variance
print("PVE using Equation 12.10:")
## [1] "PVE using Equation 12.10:"
print(pve_b)
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Both methods yielded the same PVEs (approximately 0.62, 0.25, 0.09, and 0.04), confirming that the variance explained by each principal component is the same whether it is computed from prcomp()'s sdev output or manually via Equation 12.10.
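As an extra sanity check (not required by the exercise), the same proportions can be read directly from the summary of the prcomp object, whose importance matrix contains a "Proportion of Variance" row:

# Cross-check: summary() reports the proportion of variance for each PC
summary(prcomp_result)$importance["Proportion of Variance", ]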
Consider the USArrests data. We will now perform hierarchical clustering on the states. (a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
# Load data
data("USArrests")
# Compute the Euclidean distance matrix (dist() uses Euclidean distance by default)
dist_matrix <- dist(USArrests)
# Perform hierarchical clustering with complete linkage
hc_complete <- hclust(dist_matrix, method = "complete")
# Plot dendrogram
plot(hc_complete, main = "Hierarchical Clustering (Complete Linkage)")
# (b) Cut the dendrogram so that it yields three distinct clusters
clusters <- cutree(hc_complete, k = 3)
# Create a table of clusters
cluster_groups <- split(rownames(USArrests), clusters)
print(cluster_groups)
## $`1`
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
##
## $`2`
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
##
## $`3`
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
# (c) Scale the variables to have mean 0 and standard deviation 1
scaled_data <- scale(USArrests)
# Distance matrix for scaled data
dist_matrix_scaled <- dist(scaled_data)
# Hierarchical clustering
hc_scaled <- hclust(dist_matrix_scaled, method = "complete")
# Plot dendrogram
plot(hc_scaled, main = "Hierarchical Clustering (Scaled Data)")
# Cut into 3 clusters
clusters_scaled <- cutree(hc_scaled, k = 3)
# List of clusters
cluster_groups_scaled <- split(rownames(USArrests), clusters_scaled)
print(cluster_groups_scaled)
## $`1`
## [1] "Alabama" "Alaska" "Georgia" "Louisiana"
## [5] "Mississippi" "North Carolina" "South Carolina" "Tennessee"
##
## $`2`
## [1] "Arizona" "California" "Colorado" "Florida" "Illinois"
## [6] "Maryland" "Michigan" "Nevada" "New Mexico" "New York"
## [11] "Texas"
##
## $`3`
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Missouri" "Montana" "Nebraska" "New Hampshire"
## [17] "New Jersey" "North Dakota" "Ohio" "Oklahoma"
## [21] "Oregon" "Pennsylvania" "Rhode Island" "South Dakota"
## [25] "Utah" "Vermont" "Virginia" "Washington"
## [29] "West Virginia" "Wisconsin" "Wyoming"
Scaling had a clear impact on the clustering. Before scaling, the variable with by far the largest variance, Assault, dominated the Euclidean distances; after scaling, all four variables contribute equally, and the cluster assignments change substantially. In my view, scaling is essential when the variables are measured on different scales, because it prevents any single variable from dominating the distance computation and yields more balanced, meaningful clusters.
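Two quick checks (not part of the exercise) make this concrete: the raw variances show how much larger the Assault scale is, and a cross-tabulation shows how the unscaled and scaled three-cluster solutions disagree:

# Variance of each variable on the original scale; Assault dominates
apply(USArrests, 2, var)
# Cross-tabulate the unscaled vs. scaled three-cluster assignments
table(unscaled = clusters, scaled = clusters_scaled)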
In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
set.seed(123)
# 20 observations per class, 50 variables
n <- 20
p <- 50
# Class 1: centered at 0
class1 <- matrix(rnorm(n * p), nrow = n)
# Class 2: shifted by +2
class2 <- matrix(rnorm(n * p, mean = 2), nrow = n)
# Class 3: shifted by -2
class3 <- matrix(rnorm(n * p, mean = -2), nrow = n)
# Combine into one dataset
X <- rbind(class1, class2, class3)
# True labels
labels <- c(rep(1, n), rep(2, n), rep(3, n))
# PCA
pca_out <- prcomp(X)
# Plot PC1 vs PC2
plot(pca_out$x[,1:2], col = labels, pch = 19,
main = "PCA: First Two Principal Components")
legend("topright", legend = c("Class 1", "Class 2", "Class 3"), col = 1:3, pch = 19)
Observation: The three classes are clearly separated in the PC1 vs. PC2 plot, forming three distinct groupings.
set.seed(123)
km_out <- kmeans(X, centers = 3, nstart = 20)
# Compare clusters to true labels
table(km_out$cluster, labels)
## labels
## 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
Observation: The K-means clustering aligns closely with the true labels, although cluster numbers are arbitrary. There’s a strong match between clusters and actual classes, confirming K-means effectively identified the underlying groups.
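Since the cluster numbers are arbitrary permutations of the class labels, one simple way to quantify the agreement (a small optional check) is to match each cluster to its majority class and compute the proportion of correctly grouped observations:

# Proportion of observations assigned to the cluster matching their majority class
conf_mat <- table(km_out$cluster, labels)
sum(apply(conf_mat, 1, max)) / length(labels)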
set.seed(123)
km_out2 <- kmeans(X, centers = 2, nstart = 20)
table(km_out2$cluster, labels)
## labels
## 1 2 3
## 1 0 0 20
## 2 20 20 0
Observation: With K = 2, the clustering tends to merge two classes together, losing the clear distinction between all three true groups. This highlights that K = 2 is too coarse for this dataset.
set.seed(123)
km_out4 <- kmeans(X, centers = 4, nstart = 20)
table(km_out4$cluster, labels)
## labels
## 1 2 3
## 1 0 8 0
## 2 0 0 20
## 3 20 0 0
## 4 0 12 0
Observation: For K = 4, the algorithm splits one or more true classes into multiple clusters, over-segmenting the data. This shows that K = 4 is too granular for the true structure of the data.
set.seed(123)
km_pc <- kmeans(pca_out$x[, 1:2], centers = 3, nstart = 20)
table(km_pc$cluster, labels)
## labels
## 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
Observation: Clustering on the first two principal components produced results very similar to clustering on the full data, showing that the main structure is well captured by the first two PCs.
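To see why two components are enough here, it helps to check (an optional aside) how much of the total variance the first two PCs capture:

# PVE of each PC in the simulated data, and the cumulative share of PC1 and PC2
pve_sim <- pca_out$sdev^2 / sum(pca_out$sdev^2)
cumsum(pve_sim)[2]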
# Scale the data
X_scaled <- scale(X)
set.seed(123)
km_scaled <- kmeans(X_scaled, centers = 3, nstart = 20)
table(km_scaled$cluster, labels)
## labels
## 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
Observation: After scaling, clustering results remained strong, similar to part (c). Scaling ensures that each variable contributes equally, which is especially important if some variables have different variances.
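As a quick optional check, cross-tabulating the K = 3 solutions on the raw and scaled data shows whether scaling changed any assignments (identical structure appears as a single nonzero count per row, up to label permutation):

# Compare K = 3 clusterings on raw vs. scaled data (cluster labels may be permuted)
table(raw = km_out$cluster, scaled = km_scaled$cluster)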
Overall, PCA effectively captured the class separation, and K-means worked well when the correct K was chosen. Scaling and dimensionality reduction via PCA both helped maintain meaningful clustering by balancing the influence of each variable.