8

(a): Using sdev from prcomp()

data("USArrests")

pca_result <- prcomp(USArrests, center = TRUE, scale. = TRUE)


sdev <- pca_result$sdev


pve_a <- sdev^2 / sum(sdev^2)


print("PVE using sdev from prcomp():")
## [1] "PVE using sdev from prcomp():"
print(pve_a*100)
## [1] 62.006039 24.744129  8.914080  4.335752

PC1 captures 62.01% of the total variability in the dataset. It is the most important linear combination of variables.

PC2 adds another 24.74%, so together PC1 and PC2 explain 86.75% of the variation.

PC3 and PC4 contribute only minor additional variation (8.91% and 4.34%, about 13.25% combined).

This rapid drop-off in PVE means the data can be represented reasonably well in 2 dimensions: reducing dimensionality from 4 to 2 retains about 86.75% of the total variance.
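The cumulative figures quoted above can be checked with cumsum(). This is a minimal sketch that assumes the same prcomp() call as in part (a); the name cum_pve is introduced here only for illustration.

```r
# Recompute the PVE from the standardized PCA of USArrests
pca_result <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pve <- pca_result$sdev^2 / sum(pca_result$sdev^2)

# Running total of variance explained by PC1, PC1 + PC2, and so on
cum_pve <- cumsum(pve)
print(round(100 * cum_pve, 2))
```

The second entry of cum_pve is the share of variance kept when only PC1 and PC2 are retained.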

(b): Using Equation 10.8

Z <- scale(…) standardizes each variable to zero mean and unit variance.

PC_scores gives the projection of the data onto the principal component axes.

Squaring and summing each component’s scores gives the sum of squares (proportional to the variance) explained by that component.

The denominator is the total sum of squares of the standardized data, which is proportional to the total variance in the dataset.

Result: PVE of each PC as a percentage of total variance.
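For reference, the steps above correspond to Equation 10.8 as it is usually written. The symbols follow the textbook's notation rather than anything defined in this write-up: x_ij is the centered and scaled data (Z in the code), phi_jm is the loading of variable j on component m, n is the number of observations, and p is the number of variables.

```latex
\mathrm{PVE}_m = \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}\,x_{ij}\right)^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}}
```

The inner sum in the numerator is the score of observation i on component m, which is what PC_scores holds; the denominator is sum(Z^2).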

Z <- scale(as.matrix(USArrests))

# Perform PCA
pca1 <- prcomp(USArrests, scale. = TRUE)

loadings <- as.matrix(pca1$rotation)  


PC_scores <- Z %*% loadings 

numerators <- apply(PC_scores^2, 2, sum)


denominator <- sum(Z^2)


PVE_eq10_8 <- 100 * numerators / denominator


print("PVE using Equation 10.8:")
## [1] "PVE using Equation 10.8:"
print(PVE_eq10_8)
##       PC1       PC2       PC3       PC4 
## 62.006039 24.744129  8.914080  4.335752

Introduction

In this analysis, we use the USArrests dataset to compute the Proportion of Variance Explained (PVE) by each principal component using Equation 10.8 from An Introduction to Statistical Learning. We also validate that the result matches the sdev output from prcomp().

PCA reduces the original 4D space to 2D while preserving ~87% of the total variance (PC1 + PC2).

This shows the dataset is highly compressible, meaning much of the structure can be understood by plotting or analyzing just the first 2 components.

Because the PVE values from the sdev-based method and from Equation 10.8 are identical, the manual calculation is implemented correctly and agrees with the output of prcomp().
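The agreement between the two methods can also be tested numerically rather than by comparing printed values. A minimal sketch, assuming the same scaled PCA as above; the variable names pve_sdev and pve_eq are illustrative.

```r
# Method (a): PVE from the standard deviations returned by prcomp()
pca <- prcomp(USArrests, scale. = TRUE)
pve_sdev <- pca$sdev^2 / sum(pca$sdev^2)

# Method (b): PVE from Equation 10.8 using scores computed by hand
Z <- scale(USArrests)
scores <- Z %*% pca$rotation
pve_eq <- colSums(scores^2) / sum(Z^2)

# all.equal() compares within numerical tolerance; unname() drops the PC labels
print(all.equal(unname(pve_eq), pve_sdev))
```

all.equal() is used instead of == because the two routes differ by floating-point rounding even when they are mathematically identical.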

9

(a) Hierarchical clustering (Unscaled)

data("USArrests")



dist_usarrests <- dist(USArrests)


hc_complete <- hclust(dist_usarrests, method = "complete")


plot(hc_complete, main = "Dendrogram - Unscaled Data", xlab = "States", sub = "", cex = 0.6)

(b) Cut tree into 3 clusters

cluster_unscaled <- cutree(hc_complete, k = 3)


print("Unscaled Clusters (k = 3):")
## [1] "Unscaled Clusters (k = 3):"
print(cluster_unscaled)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2
table(cluster_unscaled)
## cluster_unscaled
##  1  2  3 
## 16 14 20

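Explanation: Cutting the complete-linkage dendrogram into three clusters assigns 16 states to cluster 1, 14 to cluster 2, and 20 to cluster 3.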

(c) Hierarchical clustering (Scaled)

usarrests_scaled <- scale(USArrests)

dist_scaled <- dist(usarrests_scaled)
hc_complete_scaled <- hclust(dist_scaled, method = "complete")

plot(hc_complete_scaled, main = "Dendrogram - Scaled Data", xlab = "States", sub = "", cex = 0.6)

Explanation:

scale(USArrests) standardizes each variable (column) to have a mean of 0 and standard deviation of 1. This ensures that each feature contributes equally to the distance calculations, regardless of its original scale or unit. After scaling, we repeat the distance calculation and clustering as in part (a).

When the new dendrogram is plotted and cut into clusters, many states change cluster assignments, because Murder, Rape, and UrbanPop now carry the same weight as Assault.
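The effect of scale() described above can be verified directly. A minimal sketch; col_means and col_sds are names chosen here for illustration.

```r
# Standardize the four USArrests variables
Z <- scale(USArrests)

# After scaling, every column should have mean 0 and standard deviation 1
col_means <- colMeans(Z)
col_sds <- apply(Z, 2, sd)

print(round(col_means, 10))
print(col_sds)
```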

(d) Cut scaled tree into 3 clusters and compare with unscaled

cluster_scaled <- cutree(hc_complete_scaled, k = 3)

print("Scaled Clusters (k = 3):")
## [1] "Scaled Clusters (k = 3):"
print(cluster_scaled)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              2              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              3              1              3 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              2              3              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              1              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              3              3              3
table(scaled = cluster_scaled, unscaled = cluster_unscaled)
##       unscaled
## scaled  1  2  3
##      1  6  2  0
##      2  9  2  0
##      3  1 10 20

Effect of Scaling: Scaling standardizes all variables to have equal variance. In USArrests, some features (like Assault) have much higher variance than others (like Murder). Without scaling, these high-variance features dominate the distance calculation, biasing the clustering. After scaling, each feature contributes equally, resulting in a very different cluster structure.

Assault has values ranging from ~45 to 337, while Murder ranges from ~0.8 to 17.4. Without scaling, Assault dominates the distance metric.
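The difference in scale between the variables can be inspected with a short check. A minimal sketch; var_by_col and range_by_col are illustrative names.

```r
# Variance and range of each raw (unscaled) variable
var_by_col <- sapply(USArrests, var)
range_by_col <- sapply(USArrests, range)

print(round(var_by_col, 2))
print(range_by_col)

# Share of the total raw variance carried by each variable
print(round(var_by_col / sum(var_by_col), 3))
```

Because Euclidean distance sums squared differences across variables, the variable with the largest raw variance contributes most to the unscaled distances.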

After scaling, each variable contributes equally, and clustering becomes more balanced and reflective of all features.

22 of the 50 states (the off-diagonal counts in the table above) receive a different cluster label after scaling, which is a substantial shift.
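The number of states whose label differs can be counted in code instead of read off the table. A minimal sketch that assumes complete linkage and k = 3 as in parts (b) and (d). Note that cutree() labels are arbitrary, so comparing labels directly is only a rough measure of how much the clustering changed.

```r
# Repeat both clusterings: raw data and standardized data
hc_unscaled <- hclust(dist(USArrests), method = "complete")
hc_scaled <- hclust(dist(scale(USArrests)), method = "complete")

cl_unscaled <- cutree(hc_unscaled, k = 3)
cl_scaled <- cutree(hc_scaled, k = 3)

# States whose cluster label differs between the two solutions
changed <- cl_scaled != cl_unscaled
n_changed <- sum(changed)

print(n_changed)
print(names(cl_scaled)[changed])
```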

Conclusion: Yes, the variables should be scaled before computing distances in hierarchical clustering, especially when the features are on different scales.

10

(a) Generate a Simulated Dataset with 3 Classes

set.seed(1234)

class1 <- matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20)
class2 <- matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20)
class3 <- matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)


sim_data <- rbind(class1, class2, class3)
dfClust <- data.frame(sim_data)

true_labels <- rep(1:3, each = 20)

(b) Perform PCA and Plot First 2 PCs

library(ggplot2)


pcaGen1 <- prcomp(dfClust, scale. = TRUE)


pc1p2 <- pcaGen1$x[, 1:2]


dfClust2 <- cbind(dfClust, PC1 = pc1p2[, 1], PC2 = pc1p2[, 2])
dfClust2$grp <- as.factor(true_labels)


ggplot(dfClust2, aes(x = PC1, y = PC2, color = grp)) +
  geom_point(size = 2) +
  labs(title = "PCA: First 2 Principal Components", color = "Group") +
  theme_minimal()

(c) K-means Clustering with K = 3

set.seed(1234)


km1Gen <- kmeans(dfClust[, 1:50], centers = 3, nstart = 20)


table(observed = dfClust2$grp, predicted = km1Gen$cluster)
##         predicted
## observed  1  2  3
##        1  0 20  0
##        2 20  0  0
##        3  0  0 20

(d) K-means Clustering with K = 2

set.seed(1234)


km2Gen <- kmeans(dfClust[, 1:50], centers = 2, nstart = 20)


table(observed = dfClust2$grp, predicted = km2Gen$cluster)
##         predicted
## observed  1  2
##        1  0 20
##        2 20  0
##        3 20  0

(e) K-means Clustering with K = 4

set.seed(1234)



km4Gen <- kmeans(dfClust[, 1:50], centers = 4, nstart = 20)


table(observed = dfClust2$grp, predicted = km4Gen$cluster)
##         predicted
## observed  1  2  3  4
##        1  0  0  0 20
##        2 20  0  0  0
##        3  0 10 10  0
library(ggplot2)
ggplot(data = dfClust, aes(x = X1, y = X2, color = as.factor(km4Gen$cluster))) +
  geom_point(size = 2) +
  labs(title = "K-means Clustering (K = 4) on Raw Data", color = "Cluster") +
  theme_minimal()

apply(dfClust, 2, var)
##       X1       X2       X3       X4       X5       X6       X7       X8 
## 75.56377 90.24011 89.88951 68.41239 81.37546 71.96201 80.45309 76.21942 
##       X9      X10      X11      X12      X13      X14      X15      X16 
## 70.80045 82.61500 75.43233 70.58546 77.28182 74.54818 78.60427 76.15304 
##      X17      X18      X19      X20      X21      X22      X23      X24 
## 77.69029 84.47877 75.17476 79.37404 89.22844 75.86312 72.09148 81.38697 
##      X25      X26      X27      X28      X29      X30      X31      X32 
## 70.76246 69.58550 67.08389 85.20998 80.86051 78.00331 72.57235 85.86888 
##      X33      X34      X35      X36      X37      X38      X39      X40 
## 79.01548 80.86763 78.75439 82.00076 75.56149 75.02611 71.13518 84.67417 
##      X41      X42      X43      X44      X45      X46      X47      X48 
## 81.06913 72.53952 85.59838 73.76235 72.76319 79.41029 75.26227 84.10153 
##      X49      X50 
## 86.84702 68.58006

(f) K-means on First Two Principal Components (K = 3)

set.seed(1234)


kmpc2 <- kmeans(pc1p2, centers = 3, nstart = 20)


table(predicted = kmpc2$cluster, observed = dfClust2$grp)
##          observed
## predicted  1  2  3
##         1  0 20  0
##         2 20  0  0
##         3  0  0 20

(g) K-means After Scaling the Variables (K = 3)

set.seed(1234)

scaled_df <- scale(dfClust[, 1:50])
km1GenSc <- kmeans(scaled_df, centers = 3, nstart = 20)


table(km1GenSc$cluster, km1Gen$cluster)
##    
##      1  2  3
##   1 20  0  0
##   2  0 20  0
##   3  0  0 20

Explanation:

(a) Generate a simulated dataset

We generated 60 observations (20 per class) on 50 variables using rnorm(), giving each class a distinct mean (0, 10, 20) and a common standard deviation of 3. Purpose: Creating three clearly separated classes in a high-dimensional space allows for meaningful dimensionality reduction and cluster testing. Interpretation: The mean shift ensures the groups are distinct, which is critical for PCA and K-means to work well.
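A quick sanity check of the simulated data confirms the dimensions and the class-wise mean shift. A minimal sketch using the same seed and rnorm() calls as in part (a); class_means is an illustrative name.

```r
set.seed(1234)

# Three classes of 20 observations on 50 variables, with means 0, 10, and 20
class1 <- matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20)
class2 <- matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20)
class3 <- matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
sim_data <- rbind(class1, class2, class3)
true_labels <- rep(1:3, each = 20)

# Expect 60 rows and 50 columns
print(dim(sim_data))

# Average value per class, taken over all 50 variables
class_means <- tapply(rowMeans(sim_data), true_labels, mean)
print(round(class_means, 2))
```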

(b) Perform PCA and plot the first two principal components

PCA was applied to the scaled data, and the first two principal components were plotted using ggplot2.

Output Interpretation: The plot shows clear separation between the three classes in 2D space. This indicates that the first two PCs capture most of the variance responsible for class distinction. Conclusion: Since PCA confirms visible class separation, we move on to clustering.

(c) Perform K-means clustering with K = 3

K-means was applied to the raw (unscaled) dataset with 3 clusters. Result: The clustering matched the true class labels perfectly, although the cluster numbers are arbitrary (here true class 1 is cluster 2 and true class 2 is cluster 1). Interpretation: Since the class structure was well-separated and balanced, K-means could recover the clusters effectively. Conclusion: K-means is successful when the groups are well-separated in the original space.
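Because cluster numbers are arbitrary, agreement with the true classes is better judged from the contingency table than from the labels themselves. The helper below, is_perfect_match(), is a hypothetical function written for this illustration (it is not part of base R): it returns TRUE when every true class maps to exactly one cluster and vice versa.

```r
# Hypothetical helper: TRUE if the table of truth vs. cluster is a
# permutation pattern (one non-zero cell per row and per column)
is_perfect_match <- function(truth, cluster) {
  tab <- table(truth, cluster)
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

# Toy example: labels are permuted but the grouping is identical
truth <- rep(1:3, each = 2)
relabelled <- c(2, 2, 1, 1, 3, 3)
merged <- c(1, 1, 1, 1, 2, 2)

print(is_perfect_match(truth, relabelled))  # TRUE
print(is_perfect_match(truth, merged))      # FALSE
```

Applied to this exercise, the call would be is_perfect_match(true_labels, km1Gen$cluster), assuming those objects from parts (a) and (c) are in the workspace.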

(d) Perform K-means with K = 2

K-means was run with 2 clusters. Output: True classes 2 and 3 were merged into a single cluster of 40 observations, while class 1 formed its own cluster. Interpretation: Choosing fewer clusters than actual groups causes under-segmentation. Conclusion: This merging highlights the importance of selecting the right number of clusters (K).
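One common heuristic for choosing K, not used in the original analysis and offered here only as a sketch, is to compare the total within-cluster sum of squares across several values of K and look for the point where the improvement levels off. The data are regenerated with the same seed as in part (a); k_values and wss are illustrative names.

```r
set.seed(1234)

# Same simulated data as in part (a)
sim_data <- rbind(
  matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
)

# Total within-cluster sum of squares for K = 1, ..., 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(sim_data, centers = k, nstart = 20)$tot.withinss
})

print(data.frame(K = k_values, tot_withinss = round(wss)))
```

With three well-separated classes, the drop in tot_withinss should be large up to K = 3 and small afterwards.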

(e) Perform K-means with K = 4, visualize, and check variance

K-means with 4 clusters was applied. A plot of X1 vs X2 showed cluster distribution. Variable-wise variances were computed.

Output Interpretation: True class 3 was split into two clusters of 10 observations each (over-segmentation), while classes 1 and 2 were recovered intact. The variances ranged from about 67 to 90, so the variables are on comparable scales and scaling is not expected to change the result much for this dataset. Conclusion: Using too many clusters leads to unnecessary splitting of well-defined classes.

(f) K-means clustering on first two PCs

K-means with K = 3 was applied to only the first two PC score vectors.

Output Interpretation: The clustering was still perfect, with each cluster containing exactly one true class. This confirms that PC1 and PC2 captured most of the useful variation for distinguishing groups. Conclusion: Dimensionality reduction via PCA before clustering is effective and computationally efficient.
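How much of the variation the leading components carry in the simulated data can be checked the same way as in exercise 8. A minimal sketch, regenerating the data with the seed from part (a); pca_sim and pve_sim are illustrative names.

```r
set.seed(1234)

# Same simulated data as in part (a)
sim_data <- rbind(
  matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
)

# PVE of each principal component of the scaled data
pca_sim <- prcomp(sim_data, scale. = TRUE)
pve_sim <- pca_sim$sdev^2 / sum(pca_sim$sdev^2)

# Percentage of variance explained by the first five components
print(round(100 * pve_sim[1:5], 2))
```

Since all 50 variables share the same class-mean shift, most of the between-class variation is expected to fall on the first component.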

(g) K-means on scaled data (K = 3)

Variables were standardized using scale(), and K-means clustering was repeated with K = 3.

Output Interpretation: The scaled and unscaled clusterings agree exactly: the cross-tabulation has 20 observations in each diagonal cell. This is expected, because apply(dfClust, 2, var) showed that all 50 variables have similar variances (about 67 to 90), so none of them dominated the unscaled distances. Conclusion: Scaling made no difference here, but it is important when variables have unequal variance, ensuring no single variable dominates the distance measure used in clustering.
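Whether scaling is likely to matter can be gauged by comparing the largest and smallest variable variances. The helper var_ratio() below is a hypothetical function written for this illustration; the simulated data are regenerated with the seed from part (a).

```r
set.seed(1234)

# Same simulated data as in part (a)
sim_data <- rbind(
  matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
)

# Hypothetical helper: ratio of the largest to the smallest column variance
var_ratio <- function(x) {
  v <- apply(x, 2, var)
  max(v) / min(v)
}

ratio_sim <- var_ratio(sim_data)   # simulated data: variables on similar scales
ratio_us <- var_ratio(USArrests)   # USArrests: Assault dwarfs Murder

print(c(simulated = ratio_sim, USArrests = ratio_us))
```

A ratio close to 1 suggests scaling will change little, as in this exercise; a very large ratio, as in USArrests, suggests unscaled distances are driven by one variable.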