In Section 12.2.3, a formula for calculating PVE was given in Equation 12.10. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate PVE in two ways:
- Using the sdev output of the prcomp() function, as was done in Section 12.2.3.
pr <- prcomp(USArrests, scale = TRUE)
# Calculate Proportion of Variance Explained (PVE) using sdev
pve_a <- pr$sdev^2 / sum(pr$sdev^2)
print(pve_a)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
This code computes the proportion of variance explained (PVE) by each principal component using the squared standard deviations (sdev) of the components obtained from prcomp(). Each squared standard deviation corresponds to the variance captured by the respective principal component. The PVE is then the variance explained by each component divided by the total variance.
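As an optional cross-check, summary() on the prcomp object reports the same proportions in its "Proportion of Variance" row; the sketch below reuses the pr object fitted above.
# Cross-check: the "Proportion of Variance" row of summary(pr) should match pve_a
summary(pr)$importance["Proportion of Variance", ]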
- By applying Equation 12.10 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 12.10 to obtain the PVE. These two approaches should give the same results.
# Scale the data as in the PCA process
scaled_data <- scale(USArrests)
# Calculate squared scores per component, then normalize by total variance
pve_b <- colSums(pr$x^2) / sum(colSums(scaled_data^2))
print(pve_b)
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Here we replicate Equation 12.10 from the textbook, which states that the PVE of the \(m\)-th principal component is the sum of its squared component scores divided by the total sum of squares of the (scaled) data:
\[
\text{PVE}_m = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.
\]
This approach verifies the computation done via sdev, ensuring correctness.
As expected, both methods give the same PVE values, reaffirming that PCA decomposes the total variance consistently whether it is calculated from the eigenvalues (via sdev^2) or from the projected scores (via Equation 12.10).
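For completeness, here is a sketch that follows the wording of the exercise more literally: it rebuilds the component scores from the loading matrix pr$rotation before applying Equation 12.10. It reuses the pr and scaled_data objects defined above.
# Compute the scores explicitly from the loadings, then apply Equation 12.10
scores_from_loadings <- scaled_data %*% pr$rotation
pve_c <- colSums(scores_from_loadings^2) / sum(scaled_data^2)
# Should agree with pve_a and pve_b up to numerical precision
all.equal(unname(pve_c), pve_a)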
Consider the USArrests data. We will now perform hierarchical clustering on the states.
- Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
set.seed(42)
hc <- hclust(dist(USArrests), method = "complete")
plot(hc, main = "Hierarchical Clustering - Complete Linkage", xlab = "States", sub = "", cex = 0.6)
We use the hclust() function with complete linkage to cluster the states based on their arrest statistics; dist() computes Euclidean distances by default. The dendrogram gives a visual representation of how the states group together based on their similarity across the four variables.
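One way to judge where to cut the tree is to inspect the heights of the last few merges; a large jump suggests a natural number of clusters. This is a small optional sketch using the hc object from above.
# Heights at which the final five merges occur
tail(sort(hc$height), 5)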
- Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
ct <- cutree(hc, 3)
sapply(1:3, function(i) names(ct)[ct == i])
## [[1]]
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
##
## [[2]]
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
##
## [[3]]
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
We apply cutree() to divide the states into three clusters and print out the states in each. This shows how the states are grouped based on similar levels of violent crime and urban population.
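For a more compact summary than the full state lists, a frequency table of the cutree() output gives the cluster sizes:
# Number of states in each of the three clusters
table(ct)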
- Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
hc2 <- hclust(dist(scale(USArrests)), method = "complete")
plot(hc2, main = "Hierarchical Clustering (Scaled Data) - Complete Linkage", xlab = "States", sub = "", cex = 0.6)
This time, we standardize the variables using scale() so that each has standard deviation one and no single variable dominates the distance computation simply because of its units. This matters because features like Assault and UrbanPop are measured on very different scales.
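A quick look at the per-variable variances of the raw data shows why this matters: Assault has by far the largest variance, so without scaling it would dominate the Euclidean distances.
# Variances of the raw variables; Assault dwarfs the others
apply(USArrests, 2, var)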
- What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
ct_scaled <- cutree(hc2, 3)
sapply(1:3, function(i) names(ct_scaled)[ct_scaled == i])
## [[1]]
## [1] "Alabama" "Alaska" "Georgia" "Louisiana"
## [5] "Mississippi" "North Carolina" "South Carolina" "Tennessee"
##
## [[2]]
## [1] "Arizona" "California" "Colorado" "Florida" "Illinois"
## [6] "Maryland" "Michigan" "Nevada" "New Mexico" "New York"
## [11] "Texas"
##
## [[3]]
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Missouri" "Montana" "Nebraska" "New Hampshire"
## [17] "New Jersey" "North Dakota" "Ohio" "Oklahoma"
## [21] "Oregon" "Pennsylvania" "Rhode Island" "South Dakota"
## [25] "Utah" "Vermont" "Virginia" "Washington"
## [29] "West Virginia" "Wisconsin" "Wyoming"
After scaling, the clusters differ noticeably from the unscaled result, because scaling removes the dominance of variables with larger numerical ranges. Since USArrests mixes variables with different units and ranges (UrbanPop is a percentage, while Murder, Assault, and Rape are arrest rates per 100,000 residents), the variables should be scaled before the inter-observation dissimilarities are computed, so that no single variable, Assault in particular, dominates the clustering outcome.
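To see how much the partition actually changes, we can cross-tabulate the unscaled and scaled cluster assignments (ct and ct_scaled from above):
# Cross-tabulation of the two three-cluster solutions
table(Unscaled = ct, Scaled = ct_scaled)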
In this problem, you will generate simulated data, and then perform PCA and \(K\)-means clustering on the data.
- Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.
set.seed(42)
data <- matrix(rnorm(60 * 50), ncol = 50)
classes <- rep(c("A", "B", "C"), each = 20)
data[classes == "B", 1:10] <- data[classes == "B", 1:10] + 1.2
data[classes == "C", 5:30] <- data[classes == "C", 5:30] + 1
Here we simulate a data set of 60 observations (20 per class) and 50 variables. Classes B and C are shifted away from class A (and from each other) by adding mean shifts to different, partially overlapping subsets of the variables, which gives the three classes visible separation.
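As a sanity check on the simulated structure, we can compare class-wise means for a few columns; the choice of columns 2, 20, and 40 below is just for illustration (column 2 is shifted for class B only, column 20 for class C only, and column 40 for neither).
# Class-wise means of selected columns to confirm the intended mean shifts
round(sapply(split(as.data.frame(data[, c(2, 20, 40)]), classes), colMeans), 2)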
- Perform PCA on the 60 observations and plot the first two principal component score vectors.
library(ggplot2)
pca <- prcomp(data)
ggplot(
  data.frame(Class = classes, PC1 = pca$x[, 1], PC2 = pca$x[, 2]),
  aes(x = PC1, y = PC2, color = Class)
) +
  geom_point(size = 2) +
  labs(title = "PCA - First Two Components") +
  theme_minimal()
PCA reduces the dimensionality of the data so that the class structure can be visualized in two dimensions. The clear separation of the three classes in the PC1-PC2 plane confirms that the simulated mean shifts produce distinguishable groups.
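It is also worth checking how much of the total variance the first two components capture; this optional sketch uses the pca object fitted above.
# Standard deviation, proportion of variance, and cumulative proportion for PC1 and PC2
summary(pca)$importance[, 1:2]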
- Perform \(K\)-means clustering of the observations with \(K = 3\). How well do the clusters that you obtained in \(K\)-means clustering compare to the true class labels?
km <- kmeans(data, centers = 3)$cluster
table(km, classes)
## classes
## km A B C
## 1 1 20 1
## 2 0 0 19
## 3 19 0 0
The confusion matrix shows how the unsupervised \(K\)-means clusters align with the true class labels (the cluster numbers themselves are arbitrary). Here all but two observations fall into the cluster dominated by their own class, so \(K\)-means recovers the three classes almost perfectly.
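Since the cluster labels are arbitrary, a simple way to put a number on the agreement is cluster purity: match each cluster to its majority class and count the matches. The helper below (cluster_accuracy is our own hypothetical name, not from the text) sketches this.
# Hypothetical helper: proportion of observations whose cluster's majority
# class equals their true class (i.e. cluster purity)
cluster_accuracy <- function(cluster, truth) {
  tab <- table(cluster, truth)
  sum(apply(tab, 1, max)) / length(truth)
}
cluster_accuracy(km, classes)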
- Perform \(K\)-means clustering with \(K = 2\). Describe your results.
km2 <- kmeans(data, centers = 2)$cluster
table(km2, classes)
## classes
## km2 A B C
## 1 18 20 1
## 2 2 0 19
With \(K = 2\), the clusters no longer match the three true classes: one cluster captures nearly all of class C, while the other absorbs classes A and B together. This highlights a limitation of \(K\)-means when \(K\) is smaller than the true number of groups.
- Now perform \(K\)-means clustering with \(K = 4\), and describe your results.
km4 <- kmeans(data, centers = 4)$cluster
table(km4, classes)
## classes
## km4 A B C
## 1 0 7 2
## 2 18 1 0
## 3 0 0 18
## 4 2 12 0
With \(K = 4\), the algorithm over-segments the data: class B is split across two clusters, while classes A and C are still largely recovered. This reinforces that adding clusters does not necessarily lead to better interpretation.
- Now perform \(K\)-means clustering with \(K = 3\) on the first two principal component score vectors.
km_pca <- kmeans(pca$x[, 1:2], centers = 3)$cluster
table(km_pca, classes)
## classes
## km_pca A B C
## 1 0 20 2
## 2 20 0 0
## 3 0 0 18
Clustering on the first two principal component score vectors can work as well as clustering on all 50 variables, since the leading components retain the between-class signal while discarding much of the noise in the remaining dimensions. The confusion matrix shows essentially the same near-perfect recovery of the three classes as before.
- Using the scale() function, perform \(K\)-means clustering with \(K = 3\) on the data after scaling each variable to have standard deviation one.
km_scaled <- kmeans(scale(data), centers = 3)$cluster
table(km_scaled, classes)
## classes
## km_scaled A B C
## 1 1 20 1
## 2 19 0 0
## 3 0 0 19
Scaling equalizes the influence of each variable, which can improve or worsen clustering depending on whether the class signal lies in high- or low-variance variables. Here the confusion matrix is essentially the same as for the unscaled data with \(K = 3\): all of the simulated variables were generated with standard-normal noise, so they are already on comparable scales and scaling changes the solution very little.
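To check directly how similar the scaled and unscaled solutions are, we can cross-tabulate the two sets of cluster assignments (km and km_scaled from above); since the cluster numbers are arbitrary, agreement shows up as each row being concentrated in a single column.
# Cross-tabulation of the unscaled and scaled K = 3 cluster assignments
table(Unscaled = km, Scaled = km_scaled)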