data("USArrests")
pca_result <- prcomp(USArrests, center = TRUE, scale. = TRUE)
sdev <- pca_result$sdev
pve_a <- sdev^2 / sum(sdev^2)
print("PVE using sdev from prcomp():")
## [1] "PVE using sdev from prcomp():"
print(pve_a*100)
## [1] 62.006039 24.744129 8.914080 4.335752
PC1 captures 62.01% of the total variability in the dataset. It is the most important linear combination of variables.
PC2 adds another 24.74%, so together PC1 and PC2 explain 86.75% of the variation.
PC3 and PC4 contribute the remaining 13.25% of the variation (8.91% and 4.34%, respectively).
This rapid drop-off in PVE means the data can be reasonably well represented in 2 dimensions, reducing dimensionality from 4 to 2 while retaining about 87% of the variance.
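The cumulative figures quoted above can be reproduced with a running sum. This is a minimal sketch; the 85% cutoff and the names `cum_pve` and `n_keep` are illustrative assumptions, not part of the original analysis.

```r
# Recompute the PVE so this sketch runs on its own
pca_result <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pve <- pca_result$sdev^2 / sum(pca_result$sdev^2)

# Running total of variance explained, in percent
cum_pve <- cumsum(pve) * 100
round(cum_pve, 2)

# Smallest number of components reaching a hypothetical 85% threshold
n_keep <- which(cum_pve >= 85)[1]
n_keep
```

With the PVE values printed above, the second entry of `cum_pve` is about 86.75, so two components are enough to pass this threshold.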
Z <- scale(…) ensures zero mean and unit variance.
PC_scores gives the projection of the data onto the principal component axes.
Squaring and summing each component's scores gives that component's sum of squares, which is proportional to the variance it explains.
The denominator is the total sum of squares of the scaled data, which is proportional to the total variance in the dataset.
Result: PVE of each PC as a percentage of total variance.
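For reference, the quantity described in these steps can be written out as below. The notation is an assumption borrowed from the textbook: x_ij is the centered and scaled value for observation i and variable j (an entry of Z), phi_jm is the loading of variable j on component m (an entry of the rotation matrix), with n = 50 states and p = 4 variables.

```latex
\mathrm{PVE}_m
  = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm}\, x_{ij} \right)^{2}}
         {\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^{2}},
  \qquad m = 1, \dots, p
```

The inner sum in the numerator is the score of observation i on component m, so the numerator corresponds to the column sums of PC_scores^2 and the denominator to sum(Z^2).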
Z <- scale(as.matrix(USArrests))
# Perform PCA
pca1 <- prcomp(USArrests, scale. = TRUE)
loadings <- as.matrix(pca1$rotation)
PC_scores <- Z %*% loadings
numerators <- apply(PC_scores^2, 2, sum)
denominator <- sum(Z^2)
PVE_eq10_8 <- 100 * numerators / denominator
print("PVE using Equation 10.8:")
## [1] "PVE using Equation 10.8:"
print(PVE_eq10_8)
## PC1 PC2 PC3 PC4
## 62.006039 24.744129 8.914080 4.335752
In this analysis, we use the USArrests dataset to compute the Proportion of Variance Explained (PVE) by each principal component using Equation 10.8 from An Introduction to Statistical Learning. We also validate that the result matches the sdev output from prcomp().
PCA reduces the original 4D space to 2D while preserving ~87% of the total variance (PC1 + PC2).
This shows the dataset is highly compressible, meaning much of the structure can be understood by plotting or analyzing just the first 2 components.
Because the PVE values from the sdev-based method and from Equation 10.8 agree to the printed precision, the manual calculation is implemented correctly and is consistent with prcomp().
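The agreement between the two methods can also be checked numerically instead of by comparing printed output. The sketch below recomputes both versions; the names `pve_sdev` and `pve_manual` are illustrative.

```r
# Standardize the data and fit PCA on the same scaled variables
Z <- scale(as.matrix(USArrests))
pca1 <- prcomp(USArrests, scale. = TRUE)

# Method 1: PVE from the standard deviations returned by prcomp()
pve_sdev <- 100 * pca1$sdev^2 / sum(pca1$sdev^2)

# Method 2: PVE from the projected scores (Equation 10.8)
scores <- Z %*% pca1$rotation
pve_manual <- 100 * colSums(scores^2) / sum(Z^2)

# all.equal() compares within floating-point tolerance
all.equal(unname(pve_manual), pve_sdev)

# The manually projected scores should also match prcomp()'s own scores
all.equal(scores, pca1$x, check.attributes = FALSE)
```

Using all.equal() rather than == avoids false mismatches caused by floating-point rounding.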
data("USArrests")
dist_usarrests <- dist(USArrests)
hc_complete <- hclust(dist_usarrests, method = "complete")
plot(hc_complete, main = "Dendrogram - Unscaled Data", xlab = "States", sub = "", cex = 0.6)
cluster_unscaled <- cutree(hc_complete, k = 3)
print("Unscaled Clusters (k = 3):")
## [1] "Unscaled Clusters (k = 3):"
print(cluster_unscaled)
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
table(cluster_unscaled)
## cluster_unscaled
## 1 2 3
## 16 14 20
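Explanation: dist(USArrests) computes Euclidean distances between states on the raw variables, hclust() with method = "complete" builds the dendrogram using complete linkage, and cutree(k = 3) cuts it into three clusters.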
usarrests_scaled <- scale(USArrests)
dist_scaled <- dist(usarrests_scaled)
hc_complete_scaled <- hclust(dist_scaled, method = "complete")
plot(hc_complete_scaled, main = "Dendrogram - Scaled Data", xlab = "States", sub = "", cex = 0.6)
Explanation:
scale(USArrests) standardizes each variable (column) to have a mean of 0 and standard deviation of 1. This ensures that each feature contributes equally to the distance calculations, regardless of its original scale or unit. After scaling, we repeat the distance calculation and clustering as in part (a).
When we plot the new dendrogram and cut it into clusters, many states change cluster assignments. This is because Murder, Rape, and UrbanPop now carry equal weight with Assault.
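A quick way to confirm what scale() does is to compare the column standard deviations before and after standardization. This is a small sketch that uses only base R.

```r
# Standard deviations on the original measurement scales
round(apply(USArrests, 2, sd), 2)

# Standardize each column to mean 0 and standard deviation 1
usarrests_scaled <- scale(USArrests)

# After scaling, means are zero up to rounding and sds are one
round(colMeans(usarrests_scaled), 10)
apply(usarrests_scaled, 2, sd)
```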
cluster_scaled <- cutree(hc_complete_scaled, k = 3)
print("Scaled Clusters (k = 3):")
## [1] "Scaled Clusters (k = 3):"
print(cluster_scaled)
## Alabama Alaska Arizona Arkansas California
## 1 1 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 2 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 3 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 2 3 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 3 3 3
table(scaled = cluster_scaled, unscaled = cluster_unscaled)
## unscaled
## scaled 1 2 3
## 1 6 2 0
## 2 9 2 0
## 3 1 10 20
Effect of Scaling: Scaling standardizes all variables to have equal variance. In USArrests, some features (like Assault) have much higher variance than others (like Murder). Without scaling, these high-variance features dominate the distance calculation, biasing the clustering. After scaling, each feature contributes equally, resulting in a very different cluster structure.
Assault has values ranging from ~45 to 337, while Murder ranges from ~0.8 to 17.4. Without scaling, Assault dominates the distance metric.
After scaling, each variable contributes equally, and clustering becomes more balanced and reflective of all features.
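To make "dominates the distance metric" concrete: the sum of squared Euclidean distances over all pairs of states splits across variables in proportion to their variances. The sketch below computes each variable's share before and after scaling; the names `share_raw` and `share_std` are illustrative.

```r
n <- nrow(USArrests)

# Per-variable variances on the raw and standardized scales
vars_raw <- apply(USArrests, 2, var)
vars_std <- apply(scale(USArrests), 2, var)

# Check the identity: total squared pairwise distance = n * (n - 1) * sum of variances
all.equal(sum(dist(USArrests)^2), n * (n - 1) * sum(vars_raw))

# Share of total squared distance contributed by each variable, in percent
share_raw <- vars_raw / sum(vars_raw)
share_std <- vars_std / sum(vars_std)
round(100 * rbind(raw = share_raw, scaled = share_std), 1)
```

On the raw data Assault accounts for the large majority of the squared distance, while after scaling each of the four variables contributes an equal quarter.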
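Comparing the cluster labels as printed, 22 of the 50 states fall off the diagonal of the cross-tabulation (50 minus the 28 on the diagonal). Because cluster numbers are arbitrary, the exact count depends on how the labels are matched, but under any matching a substantial number of states are reassigned.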
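The number of reassigned states can be computed from the cross-tabulation directly. Since cutree() numbers clusters arbitrarily, the sketch below also tries every relabelling of the three unscaled clusters and reports the best agreement; this permutation step is an added illustration, not part of the original analysis.

```r
# Complete-linkage clustering on raw and standardized data
cl_raw <- cutree(hclust(dist(USArrests), method = "complete"), k = 3)
cl_std <- cutree(hclust(dist(scale(USArrests)), method = "complete"), k = 3)
tab <- table(scaled = cl_std, unscaled = cl_raw)

# States off the diagonal when labels are compared as-is
# (22 according to the cross-tabulation shown earlier)
changed_as_is <- sum(tab) - sum(diag(tab))
changed_as_is

# Agreement under every relabelling of the unscaled clusters
perms <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
              c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))
agree <- sapply(perms, function(p) sum(diag(tab[, p])))

# Fewest reassigned states over all label matchings
changed_best <- sum(tab) - max(agree)
changed_best
```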
Conclusion: Yes, the variables should be scaled before computing distances in hierarchical clustering, especially when the features are on different scales.
set.seed(1234)
class1 <- matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20)
class2 <- matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20)
class3 <- matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
sim_data <- rbind(class1, class2, class3)
dfClust <- data.frame(sim_data)
true_labels <- rep(1:3, each = 20)
library(ggplot2)
pcaGen1 <- prcomp(dfClust, scale. = TRUE)
pc1p2 <- pcaGen1$x[, 1:2]
dfClust2 <- cbind(dfClust, PC1 = pc1p2[, 1], PC2 = pc1p2[, 2])
dfClust2$grp <- as.factor(true_labels)
ggplot(dfClust2, aes(x = PC1, y = PC2, color = grp)) +
  geom_point(size = 2) +
  labs(title = "PCA: First 2 Principal Components", color = "Group") +
  theme_minimal()
set.seed(1234)
km1Gen <- kmeans(dfClust[, 1:50], centers = 3, nstart = 20)
table(observed = dfClust2$grp, predicted = km1Gen$cluster)
## predicted
## observed 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
set.seed(1234)
km2Gen <- kmeans(dfClust[, 1:50], centers = 2, nstart = 20)
table(observed = dfClust2$grp, predicted = km2Gen$cluster)
## predicted
## observed 1 2
## 1 0 20
## 2 20 0
## 3 20 0
set.seed(1234)
km4Gen <- kmeans(dfClust[, 1:50], centers = 4, nstart = 20)
table(observed = dfClust2$grp, predicted = km4Gen$cluster)
## predicted
## observed 1 2 3 4
## 1 0 0 0 20
## 2 20 0 0 0
## 3 0 10 10 0
library(ggplot2)
ggplot(data = dfClust, aes(x = X1, y = X2, color = as.factor(km4Gen$cluster))) +
  geom_point(size = 2) +
  labs(title = "K-means Clustering (K = 4) on Raw Data", color = "Cluster") +
  theme_minimal()
apply(dfClust, 2, var)
## X1 X2 X3 X4 X5 X6 X7 X8
## 75.56377 90.24011 89.88951 68.41239 81.37546 71.96201 80.45309 76.21942
## X9 X10 X11 X12 X13 X14 X15 X16
## 70.80045 82.61500 75.43233 70.58546 77.28182 74.54818 78.60427 76.15304
## X17 X18 X19 X20 X21 X22 X23 X24
## 77.69029 84.47877 75.17476 79.37404 89.22844 75.86312 72.09148 81.38697
## X25 X26 X27 X28 X29 X30 X31 X32
## 70.76246 69.58550 67.08389 85.20998 80.86051 78.00331 72.57235 85.86888
## X33 X34 X35 X36 X37 X38 X39 X40
## 79.01548 80.86763 78.75439 82.00076 75.56149 75.02611 71.13518 84.67417
## X41 X42 X43 X44 X45 X46 X47 X48
## 81.06913 72.53952 85.59838 73.76235 72.76319 79.41029 75.26227 84.10153
## X49 X50
## 86.84702 68.58006
set.seed(1234)
kmpc2 <- kmeans(pc1p2, centers = 3, nstart = 20)
table(predicted = kmpc2$cluster, observed = dfClust2$grp)
## observed
## predicted 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
set.seed(1234)
scaled_df <- scale(dfClust[, 1:50])
km1GenSc <- kmeans(scaled_df, centers = 3, nstart = 20)
table(km1GenSc$cluster, km1Gen$cluster)
##
## 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 20
Explanation: (a) Generate a simulated dataset
We generated 60 observations (20 per class) on 50 variables using rnorm(), assigning each class a distinct mean (0, 10, 20) with a common standard deviation of 3. Purpose: Creating three clearly separated classes in a high-dimensional space allows for meaningful dimensionality reduction and cluster testing. Interpretation: The mean shift ensures the groups are distinct, which is critical for PCA and K-means to work well.
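A short sanity check on the simulated data confirms its shape and the intended mean shift. The sketch rebuilds the data with the same seed; `class_means` is an illustrative name.

```r
set.seed(1234)
# 20 observations x 50 variables per class, with means 0, 10 and 20
class1 <- matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20)
class2 <- matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20)
class3 <- matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
sim_data <- rbind(class1, class2, class3)
true_labels <- rep(1:3, each = 20)

# Expect 60 rows and 50 columns
dim(sim_data)

# Average value per class, which should sit close to 0, 10 and 20
class_means <- tapply(rowMeans(sim_data), true_labels, mean)
round(class_means, 2)
```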
PCA was applied to the scaled data, and the first two principal components were plotted using ggplot2.
Output Interpretation: The plot shows clear separation between the three classes in 2D space. This indicates that the first two PCs capture the variation that distinguishes the classes. Conclusion: Since PCA confirms visible class separation, we move on to clustering.
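The visual impression can be backed up by the PVE of the first two components on the simulated data. This sketch regenerates the data so it runs independently; `pve_sim` is an illustrative name.

```r
set.seed(1234)
sim_data <- rbind(
  matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
)

# PCA on standardized variables, as in the plot above
pca_sim <- prcomp(sim_data, scale. = TRUE)
pve_sim <- pca_sim$sdev^2 / sum(pca_sim$sdev^2)

# Percent of variance explained by PC1 and PC2, individually and combined
round(100 * pve_sim[1:2], 1)
round(100 * sum(pve_sim[1:2]), 1)
```

Because all 50 variables share the same class mean shift, most of the variance is expected to load on PC1, with PC2 adding comparatively little.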
K-means was applied to the raw (unscaled) dataset with 3 clusters. Result: The clustering matched the true class labels perfectly, though cluster labels may differ (e.g., 1 ≠ “class 1”). Interpretation: Since the class structure was well-separated and balanced, K-means could recover the clusters effectively. Conclusion: K-means is successful when the groups are linearly separable and well-separated in the original space.
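Since K-means numbers its clusters arbitrarily, a perfect match is easier to verify with a label-invariant check than by reading the table. In the sketch below, `is_perfect_match()` is a hypothetical helper that tests whether every row and column of the cross-tabulation has exactly one non-empty cell.

```r
set.seed(1234)
sim_data <- rbind(
  matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
)
true_labels <- rep(1:3, each = 20)

# Hypothetical helper: TRUE when clusters and classes map one-to-one
is_perfect_match <- function(tab) {
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

km3 <- kmeans(sim_data, centers = 3, nstart = 20)
tab3 <- table(observed = true_labels, predicted = km3$cluster)
perfect3 <- is_perfect_match(tab3)
perfect3
```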
K-means was run with 2 clusters. Output: Two true classes got merged, and the algorithm failed to distinguish all three. Interpretation: Choosing fewer clusters than actual groups causes under-segmentation. Conclusion: This misclassification highlights the importance of selecting the right number of clusters (K).
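One common heuristic for choosing K, not used in the original analysis, is to compare the total within-cluster sum of squares across several values of K and look for the point where the improvement levels off. The range K = 2 to 6 below is an arbitrary choice for illustration.

```r
set.seed(1234)
sim_data <- rbind(
  matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
)

# Total within-cluster sum of squares for each candidate K
ks <- 2:6
wss <- sapply(ks, function(k) {
  kmeans(sim_data, centers = k, nstart = 20)$tot.withinss
})
names(wss) <- paste0("K=", ks)
round(wss)

# Drop in within-cluster SS when adding one more cluster
round(-diff(wss))
```

For data with three well-separated groups, the drop from K = 2 to K = 3 should be much larger than any later drop.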
K-means with 4 clusters was applied. A plot of X1 vs X2 showed cluster distribution. Variable-wise variances were computed.
Output Interpretation: One true group (class 3) was split into two clusters of 10 observations each (over-segmentation). The per-variable variances range from about 67 to 90, so the 50 variables are on broadly similar scales. Conclusion: Using too many clusters leads to unnecessary splitting of well-defined classes.
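The variances printed above are much larger than the within-class variance of 9 (sd = 3) because they also include the spread between class means. The sketch below separates the two; `total_var` and `within_var` are illustrative names.

```r
set.seed(1234)
sim_data <- rbind(
  matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
)
true_labels <- rep(1:3, each = 20)

# Overall variance of each variable, ignoring class membership
total_var <- apply(sim_data, 2, var)

# Average within-class variance of each variable
within_var <- apply(sim_data, 2, function(x) mean(tapply(x, true_labels, var)))

# Ranges across the 50 variables
round(range(total_var), 1)
round(range(within_var), 1)
```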
K-means was applied with K = 3 to only the first two PC score vectors.
Output Interpretation: The clustering was again perfect: each predicted cluster contains exactly one true class. This confirms that PC1 and PC2 captured the variation needed to distinguish the groups. Conclusion: Dimensionality reduction via PCA before clustering is effective and computationally efficient.
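As a further illustration that goes slightly beyond the original analysis, the same clustering can be repeated using one and then two PC score vectors to see how few components are needed. `is_perfect_match()` is a hypothetical helper defined in the sketch.

```r
set.seed(1234)
sim_data <- rbind(
  matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
)
true_labels <- rep(1:3, each = 20)
pc_scores <- prcomp(sim_data, scale. = TRUE)$x

# Hypothetical helper: TRUE when clusters and classes map one-to-one
is_perfect_match <- function(tab) {
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

# Cluster on the first q score vectors, for q = 1 and q = 2
recovered <- sapply(1:2, function(q) {
  km <- kmeans(pc_scores[, 1:q, drop = FALSE], centers = 3, nstart = 20)
  is_perfect_match(table(true_labels, km$cluster))
})
names(recovered) <- c("PC1 only", "PC1 + PC2")
recovered
```

The drop = FALSE argument keeps the single-column case as a matrix, which is the input form kmeans() expects.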
Variables were standardized using scale(), and K-means clustering was repeated with K = 3.
Output Interpretation: The cross-tabulation shows that the scaled and unscaled runs produce the same partition, with all 60 observations on the diagonal. The variances from apply(dfClust, 2, var) range from about 67 to 90, so the variables are already on comparable scales and scaling changes little here. Conclusion: Scaling matters when variables have unequal variance, because it ensures no single variable dominates the distance measure used in clustering.
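Whether scaling changed the partition can be tested without relying on cluster numbers lining up. In this sketch `same_partition()` is a hypothetical helper that treats two clusterings as identical when their cross-tabulation is one-to-one.

```r
set.seed(1234)
sim_data <- rbind(
  matrix(rnorm(1000, mean = 0, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 10, sd = 3), nrow = 20),
  matrix(rnorm(1000, mean = 20, sd = 3), nrow = 20)
)

# Hypothetical helper: TRUE when two label vectors define the same groups
same_partition <- function(a, b) {
  tab <- table(a, b)
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

km_raw <- kmeans(sim_data, centers = 3, nstart = 20)
km_std <- kmeans(scale(sim_data), centers = 3, nstart = 20)

# Compare the two partitions regardless of how clusters are numbered
unchanged <- same_partition(km_std$cluster, km_raw$cluster)
unchanged
```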