We compute the proportion of variance explained (PVE) in two ways, making sure to use centered and scaled data in both approaches for consistency. First, using the sdev output from prcomp():
# Load data
data("USArrests")
# Perform PCA with scaling (important!)
pca_result <- prcomp(USArrests, center = TRUE, scale. = TRUE)
# Calculate PVE from sdev
pve_sdev <- pca_result$sdev^2 / sum(pca_result$sdev^2)
pve_sdev
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
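Summing these sequentially gives the cumulative PVE; from the output above, the first two components already explain about 87% of the variance:
# Cumulative proportion of variance explained
cumsum(pve_sdev)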
Equation 12.10 states: \[ \text{PVE}_m = \frac{\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{jm}x_{ij} \right)^2}{\sum_{j=1}^p \sum_{i=1}^n x_{ij}^2} \] where \(\phi_{jm}\) are the loadings and \(x_{ij}\) are the (scaled) observations.
# Center and scale the data
scaled_data <- scale(USArrests, center = TRUE, scale = TRUE)
# Get PC scores (Z = X * phi)
scores <- scaled_data %*% pca_result$rotation
# Calculate numerator (sum of squared scores for each PC)
numerator <- colSums(scores^2)
# Calculate denominator (total sum of squares of scaled data)
denominator <- sum(scaled_data^2) # total sum of squares; equals (n - 1) * p for centered, scaled data
# Calculate PVE
pve_eq12.10 <- numerator / denominator
pve_eq12.10
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Let’s compare both methods:
# Compare results
data.frame(
  PC = 1:4,
  PVE_sdev = pve_sdev,
  PVE_eq12.10 = pve_eq12.10,
  Difference = pve_sdev - pve_eq12.10
)
## PC PVE_sdev PVE_eq12.10 Difference
## PC1 1 0.62006039 0.62006039 3.330669e-16
## PC2 2 0.24744129 0.24744129 1.110223e-16
## PC3 3 0.08914080 0.08914080 0.000000e+00
## PC4 4 0.04335752 0.04335752 4.163336e-17
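As an optional cross-check (nothing here is required by the exercise), the built-in summary() reports the same proportions, and sdev^2 matches the eigenvalues of the correlation matrix since the data were scaled:
# summary() reports the same proportions of variance
summary(pca_result)$importance["Proportion of Variance", ]
# sdev^2 equals the eigenvalues of the correlation matrix
all.equal(pca_result$sdev^2, eigen(cor(USArrests))$values)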
Key Points:
- sdev^2 gives the eigenvalues (the variances of the principal components), so dividing by their sum yields the PVE directly.
- The two approaches agree to within floating-point error (differences on the order of 1e-16).
Next, we apply hierarchical clustering with complete linkage and Euclidean distance to the states, first without scaling the variables:
# Load data
data("USArrests")
# Compute distances (Euclidean by default)
d <- dist(USArrests)
# Perform hierarchical clustering with complete linkage
hc_complete <- hclust(d, method = "complete")
# Plot dendrogram
plot(hc_complete, main = "Complete Linkage", xlab = "", sub = "")
# Cut dendrogram to get 3 clusters
cluster_cut <- cutree(hc_complete, k = 3)
# Identify which states belong to which clusters
clusters <- data.frame(State = rownames(USArrests), Cluster = cluster_cut)
clusters[order(clusters$Cluster), ]
## State Cluster
## Alabama Alabama 1
## Alaska Alaska 1
## Arizona Arizona 1
## California California 1
## Delaware Delaware 1
## Florida Florida 1
## Illinois Illinois 1
## Louisiana Louisiana 1
## Maryland Maryland 1
## Michigan Michigan 1
## Mississippi Mississippi 1
## Nevada Nevada 1
## New Mexico New Mexico 1
## New York New York 1
## North Carolina North Carolina 1
## South Carolina South Carolina 1
## Arkansas Arkansas 2
## Colorado Colorado 2
## Georgia Georgia 2
## Massachusetts Massachusetts 2
## Missouri Missouri 2
## New Jersey New Jersey 2
## Oklahoma Oklahoma 2
## Oregon Oregon 2
## Rhode Island Rhode Island 2
## Tennessee Tennessee 2
## Texas Texas 2
## Virginia Virginia 2
## Washington Washington 2
## Wyoming Wyoming 2
## Connecticut Connecticut 3
## Hawaii Hawaii 3
## Idaho Idaho 3
## Indiana Indiana 3
## Iowa Iowa 3
## Kansas Kansas 3
## Kentucky Kentucky 3
## Maine Maine 3
## Minnesota Minnesota 3
## Montana Montana 3
## Nebraska Nebraska 3
## New Hampshire New Hampshire 3
## North Dakota North Dakota 3
## Ohio Ohio 3
## Pennsylvania Pennsylvania 3
## South Dakota South Dakota 3
## Utah Utah 3
## Vermont Vermont 3
## West Virginia West Virginia 3
## Wisconsin Wisconsin 3
table(clusters$Cluster)
##
## 1 2 3
## 16 14 20
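To see these three groups directly on the dendrogram, we can optionally outline them with rect.hclust():
# Redraw the dendrogram and box the k = 3 clusters
plot(hc_complete, main = "Complete Linkage", xlab = "", sub = "")
rect.hclust(hc_complete, k = 3, border = 2:4)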
# Center and scale the variables to mean 0 and SD 1 (scale() does both by default)
scaled_data <- scale(USArrests)
# Compute distances and perform clustering
d_scaled <- dist(scaled_data)
hc_scaled <- hclust(d_scaled, method = "complete")
# Plot dendrogram
plot(hc_scaled, main = "Complete Linkage (Scaled Data)", xlab = "", sub = "")
# Cut dendrogram to get 3 clusters
cluster_scaled_cut <- cutree(hc_scaled, k = 3)
# Identify which states belong to which clusters
clusters_scaled <- data.frame(State = rownames(USArrests), Cluster = cluster_scaled_cut)
clusters_scaled[order(clusters_scaled$Cluster), ]
## State Cluster
## Alabama Alabama 1
## Alaska Alaska 1
## Georgia Georgia 1
## Louisiana Louisiana 1
## Mississippi Mississippi 1
## North Carolina North Carolina 1
## South Carolina South Carolina 1
## Tennessee Tennessee 1
## Arizona Arizona 2
## California California 2
## Colorado Colorado 2
## Florida Florida 2
## Illinois Illinois 2
## Maryland Maryland 2
## Michigan Michigan 2
## Nevada Nevada 2
## New Mexico New Mexico 2
## New York New York 2
## Texas Texas 2
## Arkansas Arkansas 3
## Connecticut Connecticut 3
## Delaware Delaware 3
## Hawaii Hawaii 3
## Idaho Idaho 3
## Indiana Indiana 3
## Iowa Iowa 3
## Kansas Kansas 3
## Kentucky Kentucky 3
## Maine Maine 3
## Massachusetts Massachusetts 3
## Minnesota Minnesota 3
## Missouri Missouri 3
## Montana Montana 3
## Nebraska Nebraska 3
## New Hampshire New Hampshire 3
## New Jersey New Jersey 3
## North Dakota North Dakota 3
## Ohio Ohio 3
## Oklahoma Oklahoma 3
## Oregon Oregon 3
## Pennsylvania Pennsylvania 3
## Rhode Island Rhode Island 3
## South Dakota South Dakota 3
## Utah Utah 3
## Vermont Vermont 3
## Virginia Virginia 3
## Washington Washington 3
## West Virginia West Virginia 3
## Wisconsin Wisconsin 3
## Wyoming Wyoming 3
table(clusters_scaled$Cluster)
##
## 1 2 3
## 8 11 31
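Cross-tabulating the two assignments (derived from the listings above) quantifies how much scaling moved the labels:
# How do the unscaled and scaled assignments overlap?
table(Unscaled = cluster_cut, Scaled = cluster_scaled_cut)
## Scaled
## Unscaled 1 2 3
## 1 6 9 1
## 2 2 2 10
## 3 0 0 20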
Clearly, scaling has changed the hierarchical clustering from before. Examining the cluster sizes alongside the variable summaries explains why:
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
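The summary hints at the scale problem; the standard deviations make it explicit:
# Assault's spread dwarfs that of the other variables,
# so it dominates unscaled Euclidean distances
apply(USArrests, 2, sd)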
Effect of Scaling:
- Without scaling, variables with larger magnitudes (like Assault) dominate the distance calculations.
- After scaling, all variables contribute equally to the distance measure.
- Cluster assignments often change substantially after scaling.
Should variables be scaled?
Yes, in most cases including this one, because:
- The variables are not on comparable scales: Murder, Assault, and Rape are arrests per 100,000 residents but with very different typical magnitudes, while UrbanPop is the percentage of urban population.
- Assault values are an order of magnitude larger than those of the other variables.
- Without scaling, Assault would therefore dominate the clustering.
- We typically want all variables to contribute equally to the similarity measure.
Justification: The purpose of clustering here is to group states based on similar crime patterns, not based on which crimes happen to have numerically larger values. Scaling ensures each crime type contributes equally to determining similarity between states.
- Unscaled clustering tends to group states primarily by Assault rates.
- Scaled clustering reveals more nuanced patterns by weighing all crime types equally.
- For this dataset, scaling produces more meaningful clusters that better represent overall crime profiles.
Let’s tackle each part of this problem systematically, starting by simulating three classes of 20 observations on 50 variables, separated by shifting the class means:
set.seed(123)
# Create 3 classes with 20 observations each and 50 variables
class1 <- matrix(rnorm(20*50, mean=0), nrow=20)
class2 <- matrix(rnorm(20*50, mean=3), nrow=20)
class3 <- matrix(rnorm(20*50, mean=6), nrow=20)
# Combine into one dataset
data <- rbind(class1, class2, class3)
true_classes <- rep(1:3, each=20)
pca_result <- prcomp(data, center=TRUE, scale.=TRUE)
pca_scores <- pca_result$x
# Plot first two PCs
plot(pca_scores[,1], pca_scores[,2], col=true_classes,
     pch=19, xlab="PC1", ylab="PC2",
     main="First Two Principal Components")
legend("topright", legend=paste("Class",1:3), col=1:3, pch=19)
kmeans_result <- kmeans(data, centers=3, nstart=20)
# Compare with true classes
table(true_classes, kmeans_result$cluster)
##
## true_classes 1 2 3
## 1 0 0 20
## 2 0 20 0
## 3 20 0 0
kmeans_k2 <- kmeans(data, centers=2, nstart=20)
table(true_classes, kmeans_k2$cluster)
##
## true_classes 1 2
## 1 0 20
## 2 20 0
## 3 20 0
# With K = 2, two of the three true classes are merged (here, classes 2 and 3)
kmeans_k4 <- kmeans(data, centers=4, nstart=20)
table(true_classes, kmeans_k4$cluster)
##
## true_classes 1 2 3 4
## 1 0 0 20 0
## 2 8 12 0 0
## 3 0 0 0 20
# With K = 4, one of the true classes is split across two clusters (here, class 2 splits 8/12)
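As an optional aside, an elbow plot of the total within-cluster sum of squares is a common heuristic for choosing K; a minimal sketch:
# Total within-cluster SS for K = 1..6; look for the "elbow"
wss <- sapply(1:6, function(k) kmeans(data, centers = k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters K", ylab = "Total within-cluster SS")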
kmeans_pca <- kmeans(pca_scores[,1:2], centers=3, nstart=20)
table(true_classes, kmeans_pca$cluster)
##
## true_classes 1 2 3
## 1 0 0 20
## 2 20 0 0
## 3 0 20 0
# Results should be similar to part (c) if PC1 and PC2 capture class separation well
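We can verify that claim by checking how much of the total variance the first two components retain:
# Cumulative proportion of variance captured by PC1 and PC2
summary(pca_result)$importance["Cumulative Proportion", 2]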
scaled_data <- scale(data)
kmeans_scaled <- kmeans(scaled_data, centers=3, nstart=20)
table(true_classes, kmeans_scaled$cluster)
##
## true_classes 1 2 3
## 1 20 0 0
## 2 0 0 20
## 3 0 20 0
Scaling shouldn’t make much difference here because:
- All variables were generated with the same variance.
- Class separation comes from mean shifts, not differences in spread.
- In real data with variables on different scales, scaling would be crucial.
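As an optional extension, assuming the mclust package is installed, the adjusted Rand index condenses each comparison table into a single agreement score (1 = perfect recovery, near 0 = random labeling):
# Assumes the mclust package is available: install.packages("mclust")
library(mclust)
adjustedRandIndex(true_classes, kmeans_result$cluster) # K = 3, raw data
adjustedRandIndex(true_classes, kmeans_scaled$cluster) # K = 3, scaled data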