In Section 12.2.3, a formula for calculating the PVE was given in Equation 12.10. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate the PVE in two ways:
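For reference, Equation 12.10 defines the PVE of the m-th principal component (for centered, and here also scaled, observations $x_{ij}$ with loadings $\phi_{jm}$) as

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}\,x_{ij}\right)^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}},$$

which is exactly the quantity the manual calculation in part (b) below computes.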
# Load the data
data("USArrests")
# Center and scale the data
scaled_data <- scale(USArrests)
# (a) Using prcomp()$sdev
prcomp_result <- prcomp(USArrests, scale. = TRUE)
# Standard deviations of PCs
pc_sdev <- prcomp_result$sdev
# PVE: variance of each PC / total variance
pve_a <- pc_sdev^2 / sum(pc_sdev^2)
print("PVE using prcomp()$sdev:")
## [1] "PVE using prcomp()$sdev:"
print(pve_a)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
# (b) Using Equation 12.10 directly
# PC loadings (the phi values)
loadings <- prcomp_result$rotation
# Get the PC scores (Z = X %*% loadings)
scores <- as.matrix(scaled_data) %*% loadings
# Numerators: sum of squared scores for each PC
numerators <- colSums(scores^2)
# Denominator: total sum of squares of the data matrix
total_variance <- sum(scaled_data^2)
# Apply Equation 12.10
pve_b <- numerators / total_variance
print("PVE using Equation 12.10:")
## [1] "PVE using Equation 12.10:"
print(pve_b)
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Both methods yielded the same PVEs (approximately 0.62, 0.25, 0.09, and 0.04), confirming that the variance explained by each principal component is the same whether it is computed from prcomp()'s sdev output or manually via Equation 12.10.
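As an extra sanity check (not required by the exercise), the same proportions can be read directly from the summary of the prcomp object, whose importance matrix contains a "Proportion of Variance" row:

# Cross-check: summary() reports the proportion of variance for each PC
summary(prcomp_result)$importance["Proportion of Variance", ]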
Consider the USArrests data. We will now perform hierarchical clustering on the states. (a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
# Load data
data("USArrests")
# Compute the Euclidean distance matrix (dist() uses Euclidean distance by default)
dist_matrix <- dist(USArrests)
# Perform hierarchical clustering with complete linkage
hc_complete <- hclust(dist_matrix, method = "complete")
# Plot dendrogram
plot(hc_complete, main = "Hierarchical Clustering (Complete Linkage)")
# (b) Cut the dendrogram so that it yields three distinct clusters
clusters <- cutree(hc_complete, k = 3)
# Create a table of clusters
cluster_groups <- split(rownames(USArrests), clusters)
print(cluster_groups)
## $`1`
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
##
## $`2`
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
##
## $`3`
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
# (c) Scale the variables to have mean 0 and standard deviation 1
scaled_data <- scale(USArrests)
# Distance matrix for scaled data
dist_matrix_scaled <- dist(scaled_data)
# Hierarchical clustering
hc_scaled <- hclust(dist_matrix_scaled, method = "complete")
# Plot dendrogram
plot(hc_scaled, main = "Hierarchical Clustering (Scaled Data)")
# Cut into 3 clusters
clusters_scaled <- cutree(hc_scaled, k = 3)
# List of clusters
cluster_groups_scaled <- split(rownames(USArrests), clusters_scaled)
print(cluster_groups_scaled)
## $`1`
## [1] "Alabama" "Alaska" "Georgia" "Louisiana"
## [5] "Mississippi" "North Carolina" "South Carolina" "Tennessee"
##
## $`2`
## [1] "Arizona" "California" "Colorado" "Florida" "Illinois"
## [6] "Maryland" "Michigan" "Nevada" "New Mexico" "New York"
## [11] "Texas"
##
## $`3`
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Missouri" "Montana" "Nebraska" "New Hampshire"
## [17] "New Jersey" "North Dakota" "Ohio" "Oklahoma"
## [21] "Oregon" "Pennsylvania" "Rhode Island" "South Dakota"
## [25] "Utah" "Vermont" "Virginia" "Washington"
## [29] "West Virginia" "Wisconsin" "Wyoming"
Scaling had a clear impact on the clustering. Before scaling, the variable with by far the largest variance, Assault, dominated the Euclidean distances; after scaling, all four variables contribute equally, and the cluster assignments change substantially. In my view, scaling is essential when the variables are measured on different scales, because it prevents any single variable from dominating the distance computation and yields more balanced, meaningful clusters.
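Two quick checks (not part of the exercise) make this concrete: the raw variances show how much larger the Assault scale is, and a cross-tabulation shows how the unscaled and scaled three-cluster solutions disagree:

# Variance of each variable on the original scale; Assault dominates
apply(USArrests, 2, var)
# Cross-tabulate the unscaled vs. scaled three-cluster assignments
table(unscaled = clusters, scaled = clusters_scaled)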
In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
set.seed(123)
# 20 observations per class, 50 variables
n <- 20
p <- 50
# Class 1: centered at 0
class1 <- matrix(rnorm(n * p), nrow = n)
# Class 2: shifted by +2
class2 <- matrix(rnorm(n * p, mean = 2), nrow = n)
# Class 3: shifted by -2
class3 <- matrix(rnorm(n * p, mean = -2), nrow = n)
# Combine into one dataset
X <- rbind(class1, class2, class3)
# True labels
labels <- c(rep(1, n), rep(2, n), rep(3, n))
# PCA
pca_out <- prcomp(X)
# Plot PC1 vs PC2
plot(pca_out$x[,1:2], col = labels, pch = 19,
main = "PCA: First Two Principal Components")
legend("topright", legend = c("Class 1", "Class 2", "Class 3"), col = 1:3, pch = 19)
Observation: The three classes are clearly separated in the PC1 vs. PC2 plot, forming three distinct groupings.
set.seed(123)
km_out <- kmeans(X, centers = 3, nstart = 20)
# Compare clusters to true labels
table(km_out$cluster, labels)
## labels
## 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
Observation: The K-means clustering aligns closely with the true labels, although cluster numbers are arbitrary. There’s a strong match between clusters and actual classes, confirming K-means effectively identified the underlying groups.
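Since the cluster numbers are arbitrary permutations of the class labels, one simple way to quantify the agreement (a small optional check) is to match each cluster to its majority class and compute the proportion of correctly grouped observations:

# Proportion of observations assigned to the cluster matching their majority class
conf_mat <- table(km_out$cluster, labels)
sum(apply(conf_mat, 1, max)) / length(labels)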
set.seed(123)
km_out2 <- kmeans(X, centers = 2, nstart = 20)
table(km_out2$cluster, labels)
## labels
## 1 2 3
## 1 0 0 20
## 2 20 20 0
Observation: With K = 2, the clustering tends to merge two classes together, losing the clear distinction between all three true groups. This highlights that K = 2 is too coarse for this dataset.
set.seed(123)
km_out4 <- kmeans(X, centers = 4, nstart = 20)
table(km_out4$cluster, labels)
## labels
## 1 2 3
## 1 0 8 0
## 2 0 0 20
## 3 20 0 0
## 4 0 12 0
Observation: For K = 4, the algorithm splits one or more true classes into multiple clusters, over-segmenting the data. This shows that K = 4 is too granular for the true structure of the data.
set.seed(123)
km_pc <- kmeans(pca_out$x[, 1:2], centers = 3, nstart = 20)
table(km_pc$cluster, labels)
## labels
## 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
Observation: Clustering on the first two principal components produced results very similar to clustering on the full data, showing that the main structure is well captured by the first two PCs.
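To see why two components are enough here, it helps to check (an optional aside) how much of the total variance the first two PCs capture:

# PVE of each PC in the simulated data, and the cumulative share of PC1 and PC2
pve_sim <- pca_out$sdev^2 / sum(pca_out$sdev^2)
cumsum(pve_sim)[2]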
# Scale the data
X_scaled <- scale(X)
set.seed(123)
km_scaled <- kmeans(X_scaled, centers = 3, nstart = 20)
table(km_scaled$cluster, labels)
## labels
## 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
Observation: After scaling, clustering results remained strong, similar to part (c). Scaling ensures that each variable contributes equally, which is especially important if some variables have different variances.
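As a quick optional check, cross-tabulating the K = 3 solutions on the raw and scaled data shows whether scaling changed any assignments (identical structure appears as a single nonzero count per row, up to label permutation):

# Compare K = 3 clusterings on raw vs. scaled data (cluster labels may be permuted)
table(raw = km_out$cluster, scaled = km_scaled$cluster)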
Overall, PCA effectively captured the class separation, and K-means worked well when the correct K was chosen. Scaling and dimensionality reduction via PCA both helped maintain meaningful clustering by balancing the influence of each variable.