if (!require(factoextra)) {
  install.packages("factoextra")
}
## Loading required package: factoextra
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
if (!require(ggfortify)) {
  install.packages("ggfortify")
}
## Loading required package: ggfortify
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(cluster)
library(factoextra)
library(tidyr)
library(ggfortify)

8. In Section 12.2.3, a formula for calculating PVE was given in Equation 12.10. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate PVE in two ways:

(a) Using the sdev output of the prcomp() function, as was done in Section 12.2.3.

data("USArrests")
USArrests_scaled <- scale(USArrests)
pca_result <- prcomp(USArrests_scaled)

pve <- (pca_result$sdev)^2 / sum(pca_result$sdev^2)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

(b) By applying Equation 12.10 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 12.10 to obtain the PVE.
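
For reference, Equation 12.10 expresses the PVE of the m-th principal component in terms of the loadings $\phi_{jm}$ and the centered (and here also scaled) observations $x_{ij}$:

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}$$

The numerator is the sum of squared scores for component m, and the denominator is the total sum of squares of the data.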

# Apply Equation 12.10 directly: project the scaled data onto the loadings,
# then sum the squared scores within each component
loadings <- pca_result$rotation                     # loading matrix (phi_jm)
scores <- as.matrix(USArrests_scaled) %*% loadings  # component scores
pc_var <- apply(scores^2, 2, sum)                   # numerator of Eq. 12.10 for each component
pve_manual <- pc_var / sum(pc_var)                  # sum(pc_var) equals the total sum of squares
pve_manual
##        PC1        PC2        PC3        PC4 
## 0.62006039 0.24744129 0.08914080 0.04335752
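
As a sanity check on the denominator: Equation 12.10 divides by the total sum of squares of the (scaled) data, which equals sum(pc_var) because the orthogonal rotation preserves total variance. A minimal check of this equivalence (the object name pve_eq_1210 is just for this sketch):

pve_eq_1210 <- pc_var / sum(USArrests_scaled^2)  # denominator taken directly from the data
all.equal(pve_manual, pve_eq_1210)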

Interpretation: Both methods yield identical PVE values:
- PC1 explains ~62.0% of the variance
- PC2 explains ~24.7%
- PC3 explains ~8.9%
- PC4 explains ~4.3%

This confirms the manual application of Equation 12.10 matches the built-in computation.

9. Consider the USArrests data. We will now perform hierarchical clustering on the states.

(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.

dist_matrix <- dist(USArrests, method = "euclidean")
hc_complete <- hclust(dist_matrix, method = "complete")
plot(hc_complete, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "", cex = 0.6)
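
Optionally, since factoextra is already loaded above, the same tree can be drawn with fviz_dend(), colouring the branches for a chosen number of clusters (here k = 3, anticipating part (b)); this is an optional sketch, assuming the installed factoextra version provides fviz_dend():

fviz_dend(hc_complete, k = 3, cex = 0.6)  # same dendrogram, branches coloured by cluster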

(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?

clusters <- cutree(hc_complete, k = 3)
cluster_assignments <- data.frame(State = rownames(USArrests), Cluster = clusters)
head(cluster_assignments, 10)
##                   State Cluster
## Alabama         Alabama       1
## Alaska           Alaska       1
## Arizona         Arizona       1
## Arkansas       Arkansas       2
## California   California       1
## Colorado       Colorado       2
## Connecticut Connecticut       3
## Delaware       Delaware       1
## Florida         Florida       1
## Georgia         Georgia       2
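
head() only previews the first ten states. To answer which states belong to which clusters, one simple way is to split all state names by their cluster label:

# List every state under its cluster label (1, 2, or 3)
split(rownames(USArrests), clusters)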

(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.

USArrests_scaled <- scale(USArrests)
dist_matrix_scaled <- dist(USArrests_scaled, method = "euclidean")
hc_complete_scaled <- hclust(dist_matrix_scaled, method = "complete")
plot(hc_complete_scaled, main = "Hierarchical Clustering Dendrogram (Scaled Data)", xlab = "", sub = "", cex = 0.6)
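
To see how much scaling actually moves states between clusters (anticipating part (d)), an optional check is to cross-tabulate the two three-cluster solutions:

# Compare the 3-cluster solutions with and without scaling
table(Unscaled = cutree(hc_complete, k = 3),
      Scaled = cutree(hc_complete_scaled, k = 3))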

(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

Scaling changes the clustering noticeably. On the raw scale, Assault has by far the largest variance, so it dominates the Euclidean distances and the unscaled tree largely reflects Assault alone (see the variance check below). The variables are also measured in different units (rates per 100,000 residents versus the percentage of the population living in urban areas), so none of them should carry extra weight simply because of its scale. For these reasons, the variables should be standardized before the inter-observation dissimilarities are computed.
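
A quick look at the raw variances supports this: Assault is measured on a much larger scale than the other variables.

# Per-variable variances on the raw scale; Assault's variance dwarfs the rest
apply(USArrests, 2, var)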

10. In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.

(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.

set.seed(123)
# Three classes of 20 observations each, 50 variables; the mean shifts
# (0, +3, -3) create separation between the classes
class1 <- matrix(rnorm(20 * 50, mean = 0), ncol = 50)
class2 <- matrix(rnorm(20 * 50, mean = 3), ncol = 50)
class3 <- matrix(rnorm(20 * 50, mean = -3), ncol = 50)
X <- rbind(class1, class2, class3)
true_labels <- factor(rep(1:3, each = 20))

(b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.

pca_result <- prcomp(X, scale. = TRUE)
pca_df <- data.frame(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  Class = true_labels
)

ggplot(pca_df, aes(x = PC1, y = PC2, color = Class)) +
  geom_point(size = 2.5) +
  labs(title = "PCA: First Two Principal Components",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal()

Interpretation: The three classes are clearly separated in the plot of the first two principal component scores, so we can continue on to part (c).

(c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?

set.seed(123)
kmeans_result <- kmeans(X, centers = 3, nstart = 25)
table(True_Class = true_labels, KMeans_Cluster = kmeans_result$cluster)
##           KMeans_Cluster
## True_Class  1  2  3
##          1  0 20  0
##          2 20  0  0
##          3  0  0 20

Interpretation: Although K-means assigns cluster numbers arbitrarily, the confusion matrix shows perfect recovery of the true classes:
- all 20 class 1 observations fall in a single cluster,
- all 20 class 2 observations fall in another,
- and likewise for class 3.

This confirms that K-means with K = 3 on the raw data recovers the true class structure of this well-separated simulation.
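
Because the cluster labels are arbitrary, a label-invariant summary such as the adjusted Rand index can also be reported; the sketch below assumes the mclust package is installed (it is not loaded above):

# Adjusted Rand index: 1 means perfect agreement up to relabelling
# (assumes the mclust package is available)
mclust::adjustedRandIndex(true_labels, kmeans_result$cluster)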

(d) Perform K-means clustering with K = 2. Describe your results.

kmeans_k2 <- kmeans(X, centers = 2, nstart = 25)
table(True_Class = true_labels, KMeans_Cluster = kmeans_k2$cluster)
##           KMeans_Cluster
## True_Class  1  2
##          1 20  0
##          2 20  0
##          3  0 20

Interpretation: With only two clusters, K-means is forced to merge classes: the confusion matrix shows classes 1 and 2 combined into a single cluster, while class 3 retains its own. This under-clustering produces one heterogeneous group with much higher within-cluster variance.

(e) Now perform K-means clustering with K = 4, and describe your results.

kmeans_k4 <- kmeans(X, centers = 4, nstart = 25)
table(True_Class = true_labels, KMeans_Cluster = kmeans_k4$cluster)
##           KMeans_Cluster
## True_Class  1  2  3  4
##          1  0  0 20  0
##          2  0 12  0  8
##          3 20  0  0  0

Interpretation: K = 4 over-clusters the data: the confusion matrix shows class 2 split across two clusters (12 and 8 observations), while classes 1 and 3 remain intact. The extra cluster adds complexity without improving the separation of the true classes.
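
One way to quantify the trade-off across K = 2, 3, 4 is to compare the total within-cluster sum of squares returned by kmeans(); a minimal sketch using the three fits above:

# Total within-cluster sum of squares for K = 2, 3, 4
c(K2 = kmeans_k2$tot.withinss,
  K3 = kmeans_result$tot.withinss,
  K4 = kmeans_k4$tot.withinss)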

(f) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.

kmeans_pc <- kmeans(pca_df[, 1:2], centers = 3, nstart = 25)
table(True_Class = true_labels, KMeans_PC = kmeans_pc$cluster)
##           KMeans_PC
## True_Class  1  2  3
##          1 20  0  0
##          2  0  0 20
##          3  0 20  0

Interpretation: K-means on just PC1 and PC2 again recovers the three classes perfectly (up to relabelling), matching the clustering on the raw 50-dimensional data. PCA preserved the class separation in two dimensions, so the first two score vectors contain essentially all of the clustering-relevant structure.
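
A quick check of how much variance the first two score vectors retain, using the pca_result fitted on the simulated data above (pve_sim is just a name for this sketch):

# Cumulative proportion of variance explained by PC1 and PC2
pve_sim <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cumsum(pve_sim)[1:2]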

(g) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.

X_scaled <- scale(X)
kmeans_scaled <- kmeans(X_scaled, centers = 3, nstart = 25)
table(True_Class = true_labels, KMeans_Scaled = kmeans_scaled$cluster)
##           KMeans_Scaled
## True_Class  1  2  3
##          1  0  0 20
##          2 20  0  0
##          3  0 20  0

Interpretation: Scaling hardly changes the results here: every variable was simulated with the same standard deviation, so standardization has little effect, and the three classes are again recovered perfectly, consistent with the separation seen in the plot from (b). In real data, however, scaling is important because it prevents variables with large units or variances from dominating the distance calculations.
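
A direct comparison of the scaled and unscaled K = 3 assignments makes this concrete; an optional check:

# Cross-tabulate the unscaled (part (c)) and scaled (part (g)) K = 3 solutions
table(Unscaled = kmeans_result$cluster, Scaled = kmeans_scaled$cluster)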