Question 8:

data(USArrests)
X <- scale(USArrests)

(a) Using sdev:

# PVE via the prcomp() “sdev” output:

pc <- prcomp(X, center = FALSE, scale. = FALSE)  # X is already centered and scaled by scale()
sdev  <- pc$sdev
pve_a <- sdev^2 / sum(sdev^2)

(b) Using Equation (12.10):

# PVE via Equation (12.10):

V       <- pc$rotation                # columns are v_1, v_2, …
scores  <- X %*% V                    # matrix of principal‐component scores
ss_pc   <- colSums(scores^2)          # numerator: sum of squares of each PC
ss_X    <- sum(X^2)                   # denominator: sum of squares of all entries
pve_b   <- ss_pc / ss_X
# Comparing both:

pve <- cbind(
  PVE_via_sdev = pve_a,
  PVE_via_formula = pve_b
)
round(pve, 4)
##     PVE_via_sdev PVE_via_formula
## PC1       0.6201          0.6201
## PC2       0.2474          0.2474
## PC3       0.0891          0.0891
## PC4       0.0434          0.0434

Comparing part (a) (via sdev) with part (b) (via Equation 12.10), we obtain identical results. This is expected because the variables were centered and scaled up front, so both computations operate on exactly the same standardized data matrix X.
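
As a quick numerical check (a small sketch using the pve_a and pve_b vectors computed above), the two approaches can be compared directly; unname() strips the PC names so that only the values are compared:

# Both PVE vectors should agree up to floating-point error:
all.equal(unname(pve_a), unname(pve_b))
max(abs(pve_a - pve_b))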

Question 9:

(a) Clustering:

data("USArrests")

# Euclidean distance between rows (states):
d_raw <- dist(USArrests, method = "euclidean")

# Hierarchical clustering with complete linkage:
hc_raw <- hclust(d_raw, method = "complete")

# Plotting the dendrogram:
plot(hc_raw, main = "Complete Linkage on Raw USArrests", xlab = "", sub = "")

(b) Cutting Dendrogram:

# Cutting into 3 clusters:
clusters_raw <- cutree(hc_raw, k = 3)
table(clusters_raw)
## clusters_raw
##  1  2  3 
## 16 14 20
split(rownames(USArrests), clusters_raw)
## $`1`
##  [1] "Alabama"        "Alaska"         "Arizona"        "California"    
##  [5] "Delaware"       "Florida"        "Illinois"       "Louisiana"     
##  [9] "Maryland"       "Michigan"       "Mississippi"    "Nevada"        
## [13] "New Mexico"     "New York"       "North Carolina" "South Carolina"
## 
## $`2`
##  [1] "Arkansas"      "Colorado"      "Georgia"       "Massachusetts"
##  [5] "Missouri"      "New Jersey"    "Oklahoma"      "Oregon"       
##  [9] "Rhode Island"  "Tennessee"     "Texas"         "Virginia"     
## [13] "Washington"    "Wyoming"      
## 
## $`3`
##  [1] "Connecticut"   "Hawaii"        "Idaho"         "Indiana"      
##  [5] "Iowa"          "Kansas"        "Kentucky"      "Maine"        
##  [9] "Minnesota"     "Montana"       "Nebraska"      "New Hampshire"
## [13] "North Dakota"  "Ohio"          "Pennsylvania"  "South Dakota" 
## [17] "Utah"          "Vermont"       "West Virginia" "Wisconsin"

(c) After scaling:

# Scaling each variable to have SD = 1, then re‐cluster:
US_scaled <- scale(USArrests)
d_scl   <- dist(US_scaled, method = "euclidean")
hc_scl  <- hclust(d_scl, method = "complete")

# Plot:
plot(hc_scl, main = "Complete Linkage on Scaled USArrests", xlab = "", sub = "")

# Cutting into 3 clusters:
clusters_scl <- cutree(hc_scl, k = 3)
table(clusters_scl)
## clusters_scl
##  1  2  3 
##  8 11 31
split(rownames(USArrests), clusters_scl)
## $`1`
## [1] "Alabama"        "Alaska"         "Georgia"        "Louisiana"     
## [5] "Mississippi"    "North Carolina" "South Carolina" "Tennessee"     
## 
## $`2`
##  [1] "Arizona"    "California" "Colorado"   "Florida"    "Illinois"  
##  [6] "Maryland"   "Michigan"   "Nevada"     "New Mexico" "New York"  
## [11] "Texas"     
## 
## $`3`
##  [1] "Arkansas"      "Connecticut"   "Delaware"      "Hawaii"       
##  [5] "Idaho"         "Indiana"       "Iowa"          "Kansas"       
##  [9] "Kentucky"      "Maine"         "Massachusetts" "Minnesota"    
## [13] "Missouri"      "Montana"       "Nebraska"      "New Hampshire"
## [17] "New Jersey"    "North Dakota"  "Ohio"          "Oklahoma"     
## [21] "Oregon"        "Pennsylvania"  "Rhode Island"  "South Dakota" 
## [25] "Utah"          "Vermont"       "Virginia"      "Washington"   
## [29] "West Virginia" "Wisconsin"     "Wyoming"

(d) Effect of Scaling:

On the raw data, the clustering is dominated by the variables with the largest variances. Here Assault (variance of roughly 6,900) dwarfs the other three variables, so the inter-state distances, and hence the early merges in the dendrogram, are driven almost entirely by Assault.
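
To see this concretely, the column variances of the raw data can be inspected (a quick check, not required by the question):

# Variance of each raw variable; Assault is larger by an order of magnitude:
apply(USArrests, 2, var)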

After scaling, each variable contributes equally (unit variance). As a result, states that are similar on, say, Murder or Rape may now end up in the same cluster, whereas before those variables were drowned out by Assault.

In short, the variables should be scaled when they are measured on very different scales or have very different spreads (e.g. Murder ranges roughly 0.8–17.4, Assault 45–337, and UrbanPop 32–91); without scaling, the distance metric is dominated by the variables with the largest spread. Scaling ensures each feature contributes proportionately to the Euclidean distance and hence to the clustering.
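
One way to see the practical effect (a small check using the cluster vectors computed above) is to cross-tabulate the raw-data and scaled-data assignments:

# How the 3-cluster solutions on raw vs. scaled data overlap:
table(Raw = clusters_raw, Scaled = clusters_scl)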

However, if there is a substantive reason why some variables should carry more weight (for example, if Murder really is inherently more important for the application), we could choose not to scale or else apply a weighted distance. But in most exploratory settings with mixed‐scale variables, standardizing to unit variance is the recommended practice.
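
As an illustration only (the weights below are hypothetical, chosen just to show the mechanics), a weighted Euclidean distance can be obtained by multiplying each standardized column by the square root of its weight before calling dist():

# Hypothetical weights: up-weight Murder relative to the other variables
w    <- c(Murder = 2, Assault = 1, UrbanPop = 1, Rape = 1)
X_w  <- sweep(scale(USArrests), 2, sqrt(w[colnames(USArrests)]), "*")
hc_w <- hclust(dist(X_w, method = "euclidean"), method = "complete")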

Question 10:

(a) Generate data:

set.seed(1349)

# Generate 60×50 data: 3 classes of size 20, with mean shifts:
n_per_class <- 20
p           <- 50

# Class 1: centered at 0
X1 <- matrix(rnorm(n_per_class * p), nrow = n_per_class)

# Class 2: shift all variables up by +3
X2 <- matrix(rnorm(n_per_class * p), nrow = n_per_class) + 3

# Class 3: shift all variables down by -3
X3 <- matrix(rnorm(n_per_class * p), nrow = n_per_class) - 3

# Combining:
X  <- rbind(X1, X2, X3)
y  <- factor(rep(1:3, each = n_per_class), labels = c("C1","C2","C3"))

(b) PCA:

# PCA and scatter of PC1 vs PC2:
pca <- prcomp(X, center = TRUE, scale. = FALSE)

scores <- pca$x[,1:2]
library(ggplot2)
ggplot(data.frame(scores, Class = y), aes(PC1, PC2, color = Class)) +
  geom_point(size=2, alpha=0.8) +
  theme_minimal() +
  ggtitle("PCA: First Two PCs, colored by true class")

(c) K-Means:

# k-means with K = 3 on raw data:
set.seed(2025)
km3_raw <- kmeans(X, centers = 3, nstart = 50)
table(True = y, KMeans = km3_raw$cluster)
##     KMeans
## True  1  2  3
##   C1  0 20  0
##   C2  0  0 20
##   C3 20  0  0

From the table above, each true class (C1, C2, C3) has all 20 of its observations assigned to a single cluster: C1 to cluster 2, C2 to cluster 3, and C3 to cluster 1. This is a one-to-one mapping between classes and clusters, i.e. zero misclassification; the cluster numbers themselves are arbitrary labels.
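
Because the cluster labels are arbitrary, one way to confirm zero misclassification (a small sketch based on the confusion table above) is to map each cluster to its majority class and count the matches:

# Fraction of observations falling in the majority class of their cluster (1 = perfect):
conf <- table(True = y, KMeans = km3_raw$cluster)
sum(apply(conf, 2, max)) / length(y)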

(d) K-Means (K=2):

set.seed(2025)
km2_raw <- kmeans(X, centers = 2, nstart = 50)
table(True = y, KMeans = km2_raw$cluster)
##     KMeans
## True  1  2
##   C1  0 20
##   C2  0 20
##   C3 20  0

The results show that true classes C1 and C2 have been merged into the same cluster (cluster 2), while class C3 keeps all of its observations in cluster 1. With only two clusters available, K-means is forced to combine two of the three true classes; here C1 and C2 end up together.
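
A quick cross-tabulation of the K = 3 and K = 2 solutions (using the objects fitted above) confirms which clusters were merged:

# Which K = 3 clusters map into each K = 2 cluster?
table(K3 = km3_raw$cluster, K2 = km2_raw$cluster)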

(e) K-Means (K=4):

set.seed(2025)
km4_raw <- kmeans(X, centers = 4, nstart = 50)
table(True = y, KMeans = km4_raw$cluster)
##     KMeans
## True  1  2  3  4
##   C1  9 11  0  0
##   C2  0  0  0 20
##   C3  0  0 20  0

The results show that true class C1 has been split across clusters 1 and 2 (9 and 11 observations, respectively), while class C2 maps entirely to cluster 4 and class C3 to cluster 3. With K larger than the number of true classes, K-means subdivides one of the classes rather than mixing classes together.
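
Similarly, cross-tabulating the K = 3 and K = 4 solutions (a quick check) shows that the extra cluster comes entirely from splitting one of the original clusters:

# Which K = 3 cluster is split when K = 4?
table(K3 = km3_raw$cluster, K4 = km4_raw$cluster)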

(f) K-Means on 2 PC Vectors:

set.seed(2025)
km3_pcs <- kmeans(scores, centers = 3, nstart = 50)
table(True = y, KMeans = km3_pcs$cluster)
##     KMeans
## True  1  2  3
##   C1  0 20  0
##   C2  0  0 20
##   C3 20  0  0

The results show that K-means applied to the first two principal-component score vectors recovers the three classes perfectly: C1 maps to cluster 2, C2 to cluster 3, and C3 to cluster 1. The resulting partition is identical to the K = 3 solution on the full data in part (c), which makes sense because the first two PCs already separate the three classes cleanly.
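
To verify that the two partitions coincide (a quick check using the fits above), the assignments can be compared directly:

# A single nonzero entry per row/column means the partitions are identical:
table(FullData = km3_raw$cluster, PCScores = km3_pcs$cluster)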

(g) After scaling K-Means:

X_scaled <- scale(X)
set.seed(2025)
km3_scl <- kmeans(X_scaled, centers = 3, nstart = 50)
table(True = y, KMeans = km3_scl$cluster)
##     KMeans
## True  1  2  3
##   C1  0  0 20
##   C2  0 20  0
##   C3 20  0  0

After scaling, the one-to-one mapping between the true classes and the clusters is still obtained; the only difference is that the cluster labels are permuted. Before, C1 mapped to cluster 2, C2 to cluster 3, and C3 to cluster 1; now C1 maps to cluster 3, C2 to cluster 2, and C3 to cluster 1. Since K-means returns arbitrary integer labels, this is not a problem as long as the partition itself is unchanged and no observations are misclassified. Scaling has little effect here because all 50 variables were generated with the same variance, so standardizing barely changes the geometry.
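
If the mclust package is available, the adjusted Rand index gives a label-invariant check of this claim (an optional sketch; a value of 1 means the two partitions are identical up to relabelling):

# Agreement between the unscaled and scaled K = 3 partitions, ignoring labels:
library(mclust)
adjustedRandIndex(km3_raw$cluster, km3_scl$cluster)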