8. In Section 12.2.3, a formula for calculating PVE was given
in Equation 12.10. We also saw that the PVE can be obtained using the
sdev output of the prcomp() function. On the USArrests data, calculate
PVE in two ways:
(a) Using the sdev output of the prcomp() function, as was done in
Section 12.2.3.
# Perform PCA with scaling
pr.out <- prcomp(USArrests, scale = TRUE)
# Calculate PVE using sdev
pr.var <- pr.out$sdev^2
pve <- pr.var / sum(pr.var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
Explanation: This code performs PCA on the USArrests data with scaling and calculates PVE using the standard deviations.
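As an optional visual check (not required by the exercise), the PVE and cumulative PVE can be plotted against the component number:
# Optional: scree plot and cumulative PVE
par(mfrow = c(1, 2))
plot(pve, type = "b", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", ylim = c(0, 1))
plot(cumsum(pve), type = "b", xlab = "Principal Component",
     ylab = "Cumulative PVE", ylim = c(0, 1))
par(mfrow = c(1, 1))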
(b) By applying Equation 12.10 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 12.10 to obtain the PVE.
# Scale the data
scaled_data <- scale(USArrests)
# Get loadings from prcomp
loadings <- pr.out$rotation
# Calculate PVE using equation 12.10
pve2 <- colSums((scaled_data %*% loadings)^2) / sum(scaled_data^2)
pve2
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Explanation: This code projects the scaled data onto the principal component loadings to obtain the scores, then applies Equation 12.10: the sum of squared scores for each component divided by the total sum of squares of the scaled data. The resulting PVE values match those from part (a).
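As a quick sanity check (not part of the exercise), the two computations can be compared directly:
# Confirm that both approaches give the same PVE (up to names)
all.equal(pve, unname(pve2))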
9. Consider the USArrests data. We will now perform hierarchical clustering on the states.
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
# Compute distance matrix
dist_matrix <- dist(USArrests)
# Perform hierarchical clustering with complete linkage
hc <- hclust(dist_matrix, method = "complete")
# Plot dendrogram
plot(hc, main = "Hierarchical Clustering with Complete Linkage",
xlab = "States", sub = "", cex = 0.6)
Explanation: This performs hierarchical clustering using complete linkage and Euclidean distance.
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
# Cut tree to get 3 clusters
clusters <- cutree(hc, k = 3)
# See which states belong to which clusters
states_clusters <- data.frame(State = rownames(USArrests), Cluster = clusters)
# Number of states in each cluster
table(clusters)
## clusters
## 1 2 3
## 16 14 20
states_clusters[order(states_clusters$Cluster), ]
## State Cluster
## Alabama Alabama 1
## Alaska Alaska 1
## Arizona Arizona 1
## California California 1
## Delaware Delaware 1
## Florida Florida 1
## Illinois Illinois 1
## Louisiana Louisiana 1
## Maryland Maryland 1
## Michigan Michigan 1
## Mississippi Mississippi 1
## Nevada Nevada 1
## New Mexico New Mexico 1
## New York New York 1
## North Carolina North Carolina 1
## South Carolina South Carolina 1
## Arkansas Arkansas 2
## Colorado Colorado 2
## Georgia Georgia 2
## Massachusetts Massachusetts 2
## Missouri Missouri 2
## New Jersey New Jersey 2
## Oklahoma Oklahoma 2
## Oregon Oregon 2
## Rhode Island Rhode Island 2
## Tennessee Tennessee 2
## Texas Texas 2
## Virginia Virginia 2
## Washington Washington 2
## Wyoming Wyoming 2
## Connecticut Connecticut 3
## Hawaii Hawaii 3
## Idaho Idaho 3
## Indiana Indiana 3
## Iowa Iowa 3
## Kansas Kansas 3
## Kentucky Kentucky 3
## Maine Maine 3
## Minnesota Minnesota 3
## Montana Montana 3
## Nebraska Nebraska 3
## New Hampshire New Hampshire 3
## North Dakota North Dakota 3
## Ohio Ohio 3
## Pennsylvania Pennsylvania 3
## South Dakota South Dakota 3
## Utah Utah 3
## Vermont Vermont 3
## West Virginia West Virginia 3
## Wisconsin Wisconsin 3
Explanation: This code cuts the dendrogram to create three clusters and shows which states belong to each.
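Optionally, the three-cluster cut can be drawn directly on the dendrogram (a sketch reusing the hc object from part (a)):
# Optional: outline the three clusters on the dendrogram
plot(hc, main = "Complete Linkage, Cut into 3 Clusters",
     xlab = "States", sub = "", cex = 0.6)
rect.hclust(hc, k = 3, border = 2:4)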
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
# Scale the data
scaled_arrests <- scale(USArrests)
# Compute distance matrix on scaled data
dist_scaled <- dist(scaled_arrests)
# Perform hierarchical clustering with complete linkage
hc_scaled <- hclust(dist_scaled, method = "complete")
# Plot dendrogram
plot(hc_scaled, main = "Hierarchical Clustering with Scaling",
xlab = "States", sub = "", cex = 0.6)
Explanation: This performs hierarchical clustering after scaling the variables to have standard deviation one.
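To make the effect of scaling easier to see in part (d), the two dendrograms can optionally be drawn side by side (a sketch reusing hc and hc_scaled from above):
# Optional: compare the unscaled and scaled dendrograms
par(mfrow = c(1, 2))
plot(hc, main = "Unscaled", xlab = "", sub = "", cex = 0.5)
plot(hc_scaled, main = "Scaled", xlab = "", sub = "", cex = 0.5)
par(mfrow = c(1, 1))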
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
# Cut scaled tree to get 3 clusters
clusters_scaled <- cutree(hc_scaled, k = 3)
# Compare clusters from scaled and unscaled data
table(unscaled = clusters, scaled = clusters_scaled)
## scaled
## unscaled 1 2 3
## 1 6 9 1
## 2 2 2 10
## 3 0 0 20
Justification:
Scaling has a substantial effect on hierarchical clustering with Euclidean distance. When variables are measured in different units or on different scales, as in the USArrests data, the variables with the largest numerical ranges dominate the distance calculations, and the cross-tabulation above shows that the cluster assignments change considerably once the variables are scaled.
In the USArrests data:
Murder, Assault, and Rape are arrest rates per 100,000 residents, with Assault taking much larger values than the other two
UrbanPop is the percentage of the population living in urban areas
Without scaling, the Euclidean distances are driven almost entirely by Assault, which has by far the largest variance. Murder, whose values are much smaller but arguably just as informative about a state's crime profile, has almost no influence on the clustering.
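The dominance of Assault can be verified by inspecting the variable variances on the original scale (an optional check, not part of the exercise):
# Optional: variances of the raw variables (Assault is by far the largest)
apply(USArrests, 2, var)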
Scaling puts all variables on equal footing, ensuring each contributes equally to the distance calculations regardless of their original units or variances. This is generally appropriate when:
There’s no substantive reason for one variable to be weighted more heavily than others
The variables represent different types of measurements with different natural scales
We want to identify patterns based on relative differences across all variables
For the USArrests data, scaling is appropriate because we want each crime statistic to contribute equally to our understanding of similarities between states, rather than letting the clustering be dominated by whichever crime happens to have the largest numerical values.
10. In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables. Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
# Set seed for reproducibility
set.seed(123)
# Generate data for 3 classes with mean shifts
class1 <- matrix(rnorm(20 * 50), nrow = 20, ncol = 50)
class2 <- matrix(rnorm(20 * 50, mean = 3), nrow = 20, ncol = 50)
class3 <- matrix(rnorm(20 * 50, mean = 6), nrow = 20, ncol = 50)
# Combine into one dataset
sim_data <- rbind(class1, class2, class3)
# Create true class labels
true_labels <- rep(1:3, each = 20)
Explanation: This generates simulated data with 20 observations in each of three classes with 50 variables.
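As an optional sanity check (not required by the exercise), the dimensions and per-class means can be inspected to confirm the mean shift:
# Optional: check dimensions and per-class means (should be near 0, 3, and 6)
dim(sim_data)
round(c(mean(class1), mean(class2), mean(class3)), 2)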
(b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.
# Perform PCA
pr_sim <- prcomp(sim_data)
# Plot first two principal components
plot(pr_sim$x[,1], pr_sim$x[,2], col = true_labels, pch = 19,
xlab = "PC1", ylab = "PC2", main = "PCA of Simulated Data")
legend("topright", legend = paste("Class", 1:3), col = 1:3, pch = 19)
Explanation: This performs PCA and plots the first two principal components with different colors for each class.
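Because the classes were generated with large mean shifts, the leading components should capture most of the variation. As an optional check, the cumulative PVE of the first two components can be computed the same way as in Exercise 8:
# Optional: cumulative PVE of the first two principal components
pve_sim <- pr_sim$sdev^2 / sum(pr_sim$sdev^2)
cumsum(pve_sim)[1:2]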
(c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels? Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
# Perform K-means with K=3
km3 <- kmeans(sim_data, centers = 3, nstart = 25)
# Compare with true labels
table(km3$cluster, true_labels)
## true_labels
## 1 2 3
## 1 0 0 20
## 2 0 20 0
## 3 20 0 0
Explanation: The K-means clusters match the true classes perfectly: each cluster contains exactly the 20 observations from one class. The cluster numbers differ from the class labels only because K-means numbers its clusters arbitrarily.
(d) Perform K-means clustering with K = 2. Describe your results.
# Perform K-means with K=2
km2 <- kmeans(sim_data, centers = 2, nstart = 25)
# Compare with true labels
table(km2$cluster, true_labels)
## true_labels
## 1 2 3
## 1 20 0 0
## 2 0 20 20
Explanation: With K = 2, class 1 forms its own cluster while classes 2 and 3 are merged into a single cluster; with only two clusters available, two of the three true groups must be combined.
(e) Now perform K-means clustering with K = 4, and describe your results.
# Perform K-means with K=4
km4 <- kmeans(sim_data, centers = 4, nstart = 25)
# Compare with true labels
table(km4$cluster, true_labels)
## true_labels
## 1 2 3
## 1 0 0 20
## 2 0 12 0
## 3 20 0 0
## 4 0 8 0
Explanation: With K = 4, classes 1 and 3 are each recovered as their own cluster, while class 2 is split across two clusters of 12 and 8 observations; the extra cluster forces one true class to be divided.
(f) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
# Perform K-means on first two PC scores
km_pc <- kmeans(pr_sim$x[,1:2], centers = 3, nstart = 25)
# Compare with true labels
table(km_pc$cluster, true_labels)
## true_labels
## 1 2 3
## 1 0 0 20
## 2 20 0 0
## 3 0 20 0
# Visualize the clustering results in the space of the first two PCs
plot(pr_sim$x[, 1:2], col = km_pc$cluster, pch = 19,
xlab = "PC1", ylab = "PC2", main = "K-means on PC Scores")
points(km_pc$centers, col = 1:3, pch = 8, cex = 2)
Explanation: This code performs K-means clustering on just the first two principal component scores rather than the original 50-dimensional data. Because these two components capture the separation between the classes, clustering in the reduced space again recovers the three true classes exactly (up to arbitrary cluster numbering), while working with only a 60 × 2 matrix.
(g) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.
# Scale the data
scaled_sim_data <- scale(sim_data)
# Perform K-means on scaled data
km_scaled <- kmeans(scaled_sim_data, centers = 3, nstart = 25)
# Compare with true labels and unscaled results
table(km_scaled$cluster, true_labels)
## true_labels
## 1 2 3
## 1 0 0 20
## 2 0 20 0
## 3 20 0 0
table(km_scaled$cluster, km3$cluster)
##
## 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 20
Explanation: After scaling each variable to have standard deviation one, K-means with K = 3 again recovers the three true classes exactly, and the second table shows that the scaled and unscaled cluster assignments agree perfectly. Scaling makes little difference here because every variable was simulated with the same noise standard deviation, so the variables were already on comparable scales; the clear class separation seen in the PCA plot from (b) is preserved.