(8)

We use centered and scaled data in both approaches for consistency.

(a) PVE using sdev output from prcomp()

# Load data
data("USArrests")

# Perform PCA with scaling (important!)
pca_result <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Calculate PVE from sdev
pve_sdev <- pca_result$sdev^2 / sum(pca_result$sdev^2)
pve_sdev
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

(b) PVE using Equation 12.10 directly

Equation 12.10 states: \[ \text{PVE}_m = \frac{\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{jm}x_{ij} \right)^2}{\sum_{j=1}^p \sum_{i=1}^n x_{ij}^2} \] where \(\phi_{jm}\) are the loadings and \(x_{ij}\) are the (scaled) observations.

# Center and scale the data
scaled_data <- scale(USArrests, center = TRUE, scale = TRUE)

# Get PC scores (Z = X * phi)
scores <- scaled_data %*% pca_result$rotation

# Calculate numerator (sum of squared scores for each PC)
numerator <- colSums(scores^2)

# Calculate denominator (total sum of squares of scaled data)
denominator <- sum(scaled_data^2)  # For centered data this equals (n-1) times the sum of the column variances

# Calculate PVE
pve_eq12.10 <- numerator / denominator
pve_eq12.10
##        PC1        PC2        PC3        PC4 
## 0.62006039 0.24744129 0.08914080 0.04335752

Verification:

Let’s compare both methods:

# Compare results
data.frame(
  PC = 1:4,
  PVE_sdev = pve_sdev,
  PVE_eq12.10 = pve_eq12.10,
  Difference = pve_sdev - pve_eq12.10
)
##     PC   PVE_sdev PVE_eq12.10   Difference
## PC1  1 0.62006039  0.62006039 3.330669e-16
## PC2  2 0.24744129  0.24744129 1.110223e-16
## PC3  3 0.08914080  0.08914080 0.000000e+00
## PC4  4 0.04335752  0.04335752 4.163336e-17
  • Both methods give the same results; the differences are zero up to floating-point rounding error.

Key Points:

  1. Scaling is crucial: Both methods must use the same scaled data
  2. Equation 12.10 breakdown:
    • Numerator: Sum of squares of PC scores for each component
    • Denominator: Total variance in the original (scaled) data
  3. Equivalence:
    • sdev^2 gives the eigenvalues (variances of PCs)
    • Sum of eigenvalues equals total variance in scaled data
    • Therefore both methods are mathematically equivalent, as the quick check below confirms
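
A minimal numerical check of this equivalence, assuming pca_result and scaled_data from the chunks above are still in the workspace:

# Sum of the eigenvalues (variances of the PCs)
sum(pca_result$sdev^2)

# Total variance of the scaled data (each scaled column has variance 1, so this is p = 4)
sum(apply(scaled_data, 2, var))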

(9)

(a) Hierarchical Clustering with Complete Linkage and Euclidean Distance

# Load data
data("USArrests")

# Compute distances (Euclidean by default)
d <- dist(USArrests)

# Perform hierarchical clustering with complete linkage
hc_complete <- hclust(d, method = "complete")

# Plot dendrogram
plot(hc_complete, main = "Complete Linkage", xlab = "", sub = "")

(b) Cutting the Dendrogram for Three Clusters

# Cut dendrogram to get 3 clusters
cluster_cut <- cutree(hc_complete, k = 3)

# Identify which states belong to which clusters
clusters <- data.frame(State = rownames(USArrests), Cluster = cluster_cut)
clusters[order(clusters$Cluster), ]
##                         State Cluster
## Alabama               Alabama       1
## Alaska                 Alaska       1
## Arizona               Arizona       1
## California         California       1
## Delaware             Delaware       1
## Florida               Florida       1
## Illinois             Illinois       1
## Louisiana           Louisiana       1
## Maryland             Maryland       1
## Michigan             Michigan       1
## Mississippi       Mississippi       1
## Nevada                 Nevada       1
## New Mexico         New Mexico       1
## New York             New York       1
## North Carolina North Carolina       1
## South Carolina South Carolina       1
## Arkansas             Arkansas       2
## Colorado             Colorado       2
## Georgia               Georgia       2
## Massachusetts   Massachusetts       2
## Missouri             Missouri       2
## New Jersey         New Jersey       2
## Oklahoma             Oklahoma       2
## Oregon                 Oregon       2
## Rhode Island     Rhode Island       2
## Tennessee           Tennessee       2
## Texas                   Texas       2
## Virginia             Virginia       2
## Washington         Washington       2
## Wyoming               Wyoming       2
## Connecticut       Connecticut       3
## Hawaii                 Hawaii       3
## Idaho                   Idaho       3
## Indiana               Indiana       3
## Iowa                     Iowa       3
## Kansas                 Kansas       3
## Kentucky             Kentucky       3
## Maine                   Maine       3
## Minnesota           Minnesota       3
## Montana               Montana       3
## Nebraska             Nebraska       3
## New Hampshire   New Hampshire       3
## North Dakota     North Dakota       3
## Ohio                     Ohio       3
## Pennsylvania     Pennsylvania       3
## South Dakota     South Dakota       3
## Utah                     Utah       3
## Vermont               Vermont       3
## West Virginia   West Virginia       3
## Wisconsin           Wisconsin       3
table(clusters$Cluster)
## 
##  1  2  3 
## 16 14 20
  • The table shows which states belong to each cluster: 16 states in Cluster 1, 14 in Cluster 2, and 20 in Cluster 3.
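
A more compact way to list the members of each cluster is to split the state names by their cluster label (a small sketch reusing cluster_cut from above):

# State names grouped by cluster
split(rownames(USArrests), cluster_cut)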

(c) Hierarchical Clustering After Scaling Variables

# Center variables to mean 0 and scale to SD 1 (scale() does both by default)
scaled_data <- scale(USArrests)

# Compute distances and perform clustering
d_scaled <- dist(scaled_data)
hc_scaled <- hclust(d_scaled, method = "complete")

# Plot dendrogram
plot(hc_scaled, main = "Complete Linkage (Scaled Data)", xlab = "", sub = "")
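
For an easier visual comparison with part (a), the two dendrograms can also be drawn side by side; an optional sketch reusing hc_complete and hc_scaled:

# Side-by-side dendrograms: unscaled vs scaled data
par(mfrow = c(1, 2))
plot(hc_complete, main = "Complete Linkage (Unscaled)", xlab = "", sub = "", cex = 0.6)
plot(hc_scaled, main = "Complete Linkage (Scaled)", xlab = "", sub = "", cex = 0.6)
par(mfrow = c(1, 1))  # reset the plotting layout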

(d) Effect of Scaling and Justification

# Cut dendrogram to get 3 clusters
cluster_scaled_cut <- cutree(hc_scaled, k = 3)

# Identify which states belong to which clusters
clusters_scaled <- data.frame(State = rownames(USArrests), Cluster = cluster_scaled_cut)
clusters_scaled[order(clusters$Cluster), ]  # listed in the same order as part (b) to ease comparison
##                         State Cluster
## Alabama               Alabama       1
## Alaska                 Alaska       1
## Arizona               Arizona       2
## California         California       2
## Delaware             Delaware       3
## Florida               Florida       2
## Illinois             Illinois       2
## Louisiana           Louisiana       1
## Maryland             Maryland       2
## Michigan             Michigan       2
## Mississippi       Mississippi       1
## Nevada                 Nevada       2
## New Mexico         New Mexico       2
## New York             New York       2
## North Carolina North Carolina       1
## South Carolina South Carolina       1
## Arkansas             Arkansas       3
## Colorado             Colorado       2
## Georgia               Georgia       1
## Massachusetts   Massachusetts       3
## Missouri             Missouri       3
## New Jersey         New Jersey       3
## Oklahoma             Oklahoma       3
## Oregon                 Oregon       3
## Rhode Island     Rhode Island       3
## Tennessee           Tennessee       1
## Texas                   Texas       2
## Virginia             Virginia       3
## Washington         Washington       3
## Wyoming               Wyoming       3
## Connecticut       Connecticut       3
## Hawaii                 Hawaii       3
## Idaho                   Idaho       3
## Indiana               Indiana       3
## Iowa                     Iowa       3
## Kansas                 Kansas       3
## Kentucky             Kentucky       3
## Maine                   Maine       3
## Minnesota           Minnesota       3
## Montana               Montana       3
## Nebraska             Nebraska       3
## New Hampshire   New Hampshire       3
## North Dakota     North Dakota       3
## Ohio                     Ohio       3
## Pennsylvania     Pennsylvania       3
## South Dakota     South Dakota       3
## Utah                     Utah       3
## Vermont               Vermont       3
## West Virginia   West Virginia       3
## Wisconsin           Wisconsin       3
table(clusters_scaled$Cluster)
## 
##  1  2  3 
##  8 11 31
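
To see directly how the assignments moved between the unscaled and scaled solutions, the two cutree results can be cross-tabulated (a quick check reusing cluster_cut and cluster_scaled_cut):

# Rows: clusters from the unscaled data; columns: clusters from the scaled data
table(Unscaled = cluster_cut, Scaled = cluster_scaled_cut)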

Clearly, scaling has changed the cluster assignments. Examining the cluster tables alongside the variable summaries below helps explain why:

summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

Effect of Scaling:

  1. Without scaling, variables with larger variances (like Assault) dominate the Euclidean distance calculations

  2. After scaling, all variables contribute equally to the distance measure

  3. Cluster assignments often change significantly after scaling

Should variables be scaled?

Yes, in most cases including this one, because:

  1. The variables are not all on the same scale: Murder, Assault, and Rape are arrest rates per 100,000 residents, while UrbanPop is the percentage of the population living in urban areas

  2. Assault values are numerically much larger than those of the other variables

  3. Without scaling, Assault would dominate the clustering (see the variance check below)

  4. We typically want all variables to contribute equally to the similarity measure
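
A quick way to verify point 3: with Euclidean distance, a variable's influence grows with its variance, and on the original scale Assault's variance dwarfs the others. Assuming the raw USArrests data is still loaded:

# Per-variable variances on the original (unscaled) measurements
apply(USArrests, 2, var)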

Justification: The purpose of clustering here is to group states based on similar crime patterns, not based on which crimes happen to have numerically larger values. Scaling ensures each crime type contributes equally to determining similarity between states.

Key Observations:

  1. Unscaled clustering tends to group states primarily by Assault rates

  2. Scaled clustering reveals more nuanced patterns considering all crime types equally

  3. For this dataset, scaling produces more meaningful clusters that better represent overall crime profiles

(10)

Let’s tackle each part of this problem systematically.

(a) Generate Simulated Data

set.seed(123)
# Create 3 classes with 20 observations each and 50 variables
class1 <- matrix(rnorm(20*50, mean=0), nrow=20)
class2 <- matrix(rnorm(20*50, mean=3), nrow=20)
class3 <- matrix(rnorm(20*50, mean=6), nrow=20)

# Combine into one dataset
data <- rbind(class1, class2, class3)
true_classes <- rep(1:3, each=20)
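
The mean shifts of 0, 3, and 6 are applied to all 50 variables, so the three classes are far apart in the 50-dimensional space. An optional sanity check (not required by the exercise) of the distances between the empirical class centroids:

# Euclidean distances between the three class centroids
centroids <- rbind(colMeans(class1), colMeans(class2), colMeans(class3))
dist(centroids)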

(b) Perform PCA and Plot

pca_result <- prcomp(data, center=TRUE, scale.=TRUE)
pca_scores <- pca_result$x

# Plot first two PCs
plot(pca_scores[,1], pca_scores[,2], col=true_classes, 
     pch=19, xlab="PC1", ylab="PC2", 
     main="First Two Principal Components")
legend("topright", legend=paste("Class",1:3), col=1:3, pch=19)

  • The classes are well separated
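
As a rough check that the plot is representative, one can also look at how much of the total variance the first two components capture (reusing pca_result from above):

# Cumulative proportion of variance explained by PC1 and PC2
pve <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cumsum(pve)[1:2]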

(c) K-means with K=3

kmeans_result <- kmeans(data, centers=3, nstart=20)

# Compare with true classes
table(true_classes, kmeans_result$cluster)
##             
## true_classes  1  2  3
##            1  0  0 20
##            2  0 20  0
##            3 20  0  0
  • Each true class maps perfectly to one cluster

(d) K-means with K=2

kmeans_k2 <- kmeans(data, centers=2, nstart=20)
table(true_classes, kmeans_k2$cluster)
##             
## true_classes  1  2
##            1  0 20
##            2 20  0
##            3 20  0
# Typically merges two of the three original classes
  • Two of the three true classes are merged into a single cluster, since only K = 2 clusters are available

(e) K-means with K=4

kmeans_k4 <- kmeans(data, centers=4, nstart=20)
table(true_classes, kmeans_k4$cluster)
##             
## true_classes  1  2  3  4
##            1  0  0 20  0
##            2  8 12  0  0
##            3  0  0  0 20
# Typically splits one of the original classes into two
  • One of the original classes is split across two clusters

(f) K-means on PC Scores (K=3)

kmeans_pca <- kmeans(pca_scores[,1:2], centers=3, nstart=20)
table(true_classes, kmeans_pca$cluster)
##             
## true_classes  1  2  3
##            1  0  0 20
##            2 20  0  0
##            3  0 20  0
# Results should be similar to part (c) if PC1 and PC2 capture class separation well
  • The results are essentially the same as in (c), meaning PC1 and PC2 capture the class separation very well

(g) Scaled Data K-means (K=3)

scaled_data <- scale(data)
kmeans_scaled <- kmeans(scaled_data, centers=3, nstart=20)
table(true_classes, kmeans_scaled$cluster)
##             
## true_classes  1  2  3
##            1 20  0  0
##            2  0  0 20
##            3  0 20  0

Comparison with (b):

Scaling shouldn’t make much difference here because:

  1. All variables were generated with the same within-class variance, so every column ends up on roughly the same overall scale (checked below)

  2. Class separation comes from mean shifts

In real data with variables measured on different scales, scaling would be crucial.
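
A quick check of point 1, using the simulated data matrix from (a): every column has roughly the same standard deviation (the same unit within-class noise plus the same between-class mean shifts), so scaling divides every variable by nearly the same constant.

# Column standard deviations of the simulated data - all roughly equal
range(apply(data, 2, sd))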

Key Observations:

  1. PCA Visualization: The first two PCs should show clear separation if the classes are well-separated in the original space
  2. K-means Performance: With proper separation, K=3 should recover the true classes well
  3. Under/Over-clustering (compare the within-cluster sums of squares in the sketch below):
    • K=2 merges classes
    • K=4 splits classes
  4. PCA vs Raw Data: Using top PCs can sometimes improve clustering by focusing on most informative dimensions
  5. Scaling Impact: Less critical for this simulated data but crucial for real data with variables in different units
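
For the under/over-clustering point, a small sketch comparing the total within-cluster sum of squares of the three fits above; the large drop from K = 2 to K = 3 and the much smaller drop from K = 3 to K = 4 reflect the merging and splitting seen in the tables.

# Total within-cluster sum of squares for K = 2, 3, and 4
c(K2 = kmeans_k2$tot.withinss,
  K3 = kmeans_result$tot.withinss,
  K4 = kmeans_k4$tot.withinss)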