if (!require(factoextra)) {
  install.packages("factoextra")
}
## Loading required package: factoextra
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
if (!require(ggfortify)) {
  install.packages("ggfortify")
}
## Loading required package: ggfortify
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(cluster)
library(factoextra)
library(tidyr)
library(ggfortify)
data("USArrests")
USArrests_scaled <- scale(USArrests)
pca_result <- prcomp(USArrests_scaled)
pve <- (pca_result$sdev)^2 / sum(pca_result$sdev^2)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
# Recompute the PVE manually by applying Equation 12.10 to the loadings
loadings <- pca_result$rotation
scores <- as.matrix(USArrests_scaled) %*% loadings
pc_var <- apply(scores^2, 2, sum)
pve_manual <- pc_var / sum(pc_var)
pve_manual
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Interpretation: Both methods yield the same PVE values:
- PC1 explains ~62% of the variance
- PC2 ~25%
- PC3 ~8.9%
- PC4 ~4.3%
This confirms the manual application of Equation 12.10 matches the built-in computation.
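As a visual cross-check of the same quantities, the PVE can be drawn as a scree plot with factoextra (already loaded above). This is a minimal sketch, not part of the original analysis; fviz_eig() plots the percentage of variance explained per component, and the plot title is illustrative.
# Scree plot of the proportion of variance explained by each component
fviz_eig(pca_result, addlabels = TRUE) +
  labs(title = "Scree Plot: USArrests PCA")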
dist_matrix <- dist(USArrests, method = "euclidean")
hc_complete <- hclust(dist_matrix, method = "complete")
plot(hc_complete, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "", cex = 0.6)
clusters <- cutree(hc_complete, k = 3)
cluster_assignments <- data.frame(State = rownames(USArrests), Cluster = clusters)
head(cluster_assignments, 10)
## State Cluster
## Alabama Alabama 1
## Alaska Alaska 1
## Arizona Arizona 1
## Arkansas Arkansas 2
## California California 1
## Colorado Colorado 2
## Connecticut Connecticut 3
## Delaware Delaware 1
## Florida Florida 1
## Georgia Georgia 2
USArrests_scaled <- scale(USArrests)
dist_matrix_scaled <- dist(USArrests_scaled, method = "euclidean")
hc_complete_scaled <- hclust(dist_matrix_scaled, method = "complete")
plot(hc_complete_scaled, main = "Hierarchical Clustering Dendrogram (Scaled Data)", xlab = "", sub = "", cex = 0.6)
Scaling the data substantially changes the clustering outcome. Without scaling, the variable with by far the largest variance (Assault) dominates the Euclidean distance computation, so the dendrogram mostly reflects differences in assault rates. Scaling gives every feature equal weight in the dissimilarity, so variables should generally be standardized before inter-observation distances are computed, as the cross-tabulation below illustrates.
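To make the effect of scaling concrete, the two k = 3 cuts can be cross-tabulated. This is a minimal sketch using the hc_complete and hc_complete_scaled objects defined above; off-diagonal counts correspond to states whose cluster changes once the variables are standardized.
# Cross-tabulate the k = 3 cluster assignments before and after scaling
clusters_unscaled <- cutree(hc_complete, k = 3)
clusters_scaled <- cutree(hc_complete_scaled, k = 3)
table(Unscaled = clusters_unscaled, Scaled = clusters_scaled)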
set.seed(123)
class1 <- matrix(rnorm(20 * 50, mean = 0), ncol = 50)
class2 <- matrix(rnorm(20 * 50, mean = 3), ncol = 50)
class3 <- matrix(rnorm(20 * 50, mean = -3), ncol = 50)
X <- rbind(class1, class2, class3)
true_labels <- factor(rep(1:3, each = 20))
pca_result <- prcomp(X, scale. = TRUE)
pca_df <- data.frame(
PC1 = pca_result$x[, 1],
PC2 = pca_result$x[, 2],
Class = true_labels
)
ggplot(pca_df, aes(x = PC1, y = PC2, color = Class)) +
geom_point(size = 2.5) +
labs(title = "PCA: First Two Principal Components",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal()
Interpretation: The three classes are clearly separated in PC1-PC2 space, which satisfies the separation requirement for proceeding to the clustering steps.
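To quantify how much of the simulated variation the plotted components capture, the cumulative PVE of the first two PCs can be checked. A short sketch using the pca_result object from the chunk above; the exact value depends on the simulated data.
# Cumulative proportion of variance explained by the first two PCs
pve_sim <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cumsum(pve_sim)[2]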
set.seed(123)
kmeans_result <- kmeans(X, centers = 3, nstart = 25)
table(True_Class = true_labels, KMeans_Cluster = kmeans_result$cluster)
## KMeans_Cluster
## True_Class 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
Interpretation: Although K-means assigns cluster labels arbitrarily, the confusion matrix shows perfect separation:
- all 20 class 1 observations fall in a single cluster,
- all 20 class 2 observations in another,
- and the same for class 3.
This confirms that K-means with K = 3 recovers the true class structure exactly for this data configuration.
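One way to back this up numerically is the between-cluster share of the total sum of squares reported by kmeans(); values close to 1 indicate tight, well-separated clusters. A minimal sketch using the kmeans_result object above:
# Proportion of total variation captured between clusters (closer to 1 is better)
kmeans_result$betweenss / kmeans_result$totss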
kmeans_k2 <- kmeans(X, centers = 2, nstart = 25)
table(True_Class = true_labels, KMeans_Cluster = kmeans_k2$cluster)
## KMeans_Cluster
## True_Class 1 2
## 1 20 0
## 2 20 0
## 3 0 20
Interpretation: With only 2 clusters, the algorithm merges two of the true classes (here, classes 1 and 2) into a single cluster. This under-clustering produces mixed groupings and a higher within-cluster variance, as the comparison below shows.
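The higher within-cluster variance can be verified directly by comparing the total within-cluster sum of squares for K = 2 and K = 3. A short sketch using the objects already fitted above:
# Total within-cluster sum of squares: K = 2 vs. K = 3
c(K2 = kmeans_k2$tot.withinss, K3 = kmeans_result$tot.withinss)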
kmeans_k4 <- kmeans(X, centers = 4, nstart = 25)
table(True_Class = true_labels, KMeans_Cluster = kmeans_k4$cluster)
## KMeans_Cluster
## True_Class 1 2 3 4
## 1 0 0 20 0
## 2 0 12 0 8
## 3 20 0 0 0
Interpretation: K = 4 over-clusters the data. Class 2 is split across two clusters (12 and 8 observations), while the other two classes remain intact. The extra cluster adds complexity without improving separation; the elbow plot sketched below makes the comparison across values of K explicit.
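A standard way to weigh these choices of K is an elbow plot of the total within-cluster sum of squares. This is a minimal sketch using factoextra::fviz_nbclust(); the k.max and nstart values are illustrative choices, not taken from the original analysis.
# Elbow plot: total within-cluster sum of squares as a function of K
set.seed(123)
fviz_nbclust(X, kmeans, method = "wss", k.max = 6, nstart = 25) +
  labs(title = "Elbow Plot for Choosing K")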
kmeans_pc <- kmeans(pca_df[, 1:2], centers = 3, nstart = 25)
table(True_Class = true_labels, KMeans_PC = kmeans_pc$cluster)
## KMeans_PC
## True_Class 1 2 3
## 1 20 0 0
## 2 0 0 20
## 3 0 20 0
Interpretation: K-means on PC1 and PC2 still produces strong clustering, similar to clustering on raw data. PCA preserved the class separability in fewer dimensions, indicating that the first two PCs contain most of the clustering-relevant structure.
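To see the agreement directly, the two sets of assignments can be cross-tabulated; because cluster labels are arbitrary, agreement shows up as one dominant count per row. A short sketch using the objects fitted above:
# Agreement between K-means on the raw data and K-means on the first two PCs
table(Full_Data = kmeans_result$cluster, First_Two_PCs = kmeans_pc$cluster)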
X_scaled <- scale(X)
kmeans_scaled <- kmeans(X_scaled, centers = 3, nstart = 25)
table(True_Class = true_labels, KMeans_Scaled = kmeans_scaled$cluster)
## KMeans_Scaled
## True_Class 1 2 3
## 1 0 0 20
## 2 20 0 0
## 3 0 20 0
Interpretation: Scaling does not drastically change the clustering results here, as the synthetic data already had well-separated means. However, in real-world data, scaling ensures variables contribute equally to distance calculations, preventing any single variable from dominating the clustering process.
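The same kind of cross-tabulation confirms that standardization leaves the partition essentially unchanged here; labels may be permuted between the two runs. A minimal sketch comparing the unscaled and scaled K = 3 solutions:
# Compare the K = 3 partitions on raw vs. standardized features
table(Unscaled = kmeans_result$cluster, Scaled = kmeans_scaled$cluster)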