1. In Section 12.2.3, a formula for calculating PVE was given in Equation 12.10. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate PVE in two ways:
  (a) Using the sdev output of the prcomp() function, as was done in Section 12.2.3.
data("USArrests")

USArrests_scaled <- scale(USArrests)

pca_result <- prcomp(USArrests_scaled)
sdev <- pca_result$sdev
pve_a <- sdev^2 / sum(sdev^2)
pve_a
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
  (b) By applying Equation 12.10 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 12.10 to obtain the PVE. These two approaches should give the same results. Hint: You will only obtain the same results in (a) and (b) if the same data is used in both cases. For instance, if in (a) you performed prcomp() using centered and scaled variables, then you must center and scale the variables before applying Equation 12.10 in (b).
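Recall that Equation 12.10 defines the PVE of the m-th principal component in terms of the loadings $\phi_{jm}$ and the (centered, and here also scaled) observations $x_{ij}$:

$$
\mathrm{PVE}_m \;=\; \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}x_{ij}\right)^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}}
$$

The code below computes exactly this ratio: the numerator is the sum of squared scores for each component and the denominator is the total sum of squares of the scaled data.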
loadings <- pca_result$rotation          # principal component loadings (phi)
scores <- USArrests_scaled %*% loadings  # principal component scores
pc_var <- colSums(scores^2)              # numerator of Eq. 12.10: sum of squared scores per PC
total_var <- sum(USArrests_scaled^2)     # denominator: total sum of squares of the scaled data
pve_b <- pc_var / total_var

pve_a
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
pve_b
##        PC1        PC2        PC3        PC4 
## 0.62006039 0.24744129 0.08914080 0.04335752

Both approaches give identical PVE values, confirming that Equation 12.10 reproduces the prcomp() result when the same centered and scaled data are used in both calculations.
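As an additional cross-check (not required by the exercise), the same proportions are reported by prcomp()'s summary method:

summary(pca_result)$importance["Proportion of Variance", ]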

2. Consider the USArrests data. We will now perform hierarchical clustering on the states.
  (a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
dist_usarrests <- dist(USArrests)

hc_complete <- hclust(dist_usarrests, method = "complete")

plot(hc_complete, main = "Dendrogram - Complete Linkage", xlab = "", sub = "", cex = 0.6)

  (b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
clusters_divided <- cutree(hc_complete, k = 3)

split(rownames(USArrests), clusters_divided)
## $`1`
##  [1] "Alabama"        "Alaska"         "Arizona"        "California"    
##  [5] "Delaware"       "Florida"        "Illinois"       "Louisiana"     
##  [9] "Maryland"       "Michigan"       "Mississippi"    "Nevada"        
## [13] "New Mexico"     "New York"       "North Carolina" "South Carolina"
## 
## $`2`
##  [1] "Arkansas"      "Colorado"      "Georgia"       "Massachusetts"
##  [5] "Missouri"      "New Jersey"    "Oklahoma"      "Oregon"       
##  [9] "Rhode Island"  "Tennessee"     "Texas"         "Virginia"     
## [13] "Washington"    "Wyoming"      
## 
## $`3`
##  [1] "Connecticut"   "Hawaii"        "Idaho"         "Indiana"      
##  [5] "Iowa"          "Kansas"        "Kentucky"      "Maine"        
##  [9] "Minnesota"     "Montana"       "Nebraska"      "New Hampshire"
## [13] "North Dakota"  "Ohio"          "Pennsylvania"  "South Dakota" 
## [17] "Utah"          "Vermont"       "West Virginia" "Wisconsin"

Cutting the dendrogram into three clusters produces the groups of states listed above.
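Equivalently, the cut can be specified by height rather than by number of clusters; a small sketch (any height between the second- and third-largest merge heights gives three clusters):

# Cut by height instead of by k: pick a height between the 3rd- and 2nd-largest merges.
merge_heights <- sort(hc_complete$height, decreasing = TRUE)
h_cut <- mean(merge_heights[2:3])
clusters_by_height <- cutree(hc_complete, h = h_cut)
table(clusters_by_height, clusters_divided)   # should match the k = 3 cut exactly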

  (c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
USArrests_scaled <- scale(USArrests)

dist_scaled <- dist(USArrests_scaled)

hc_scaled <- hclust(dist_scaled, method = "complete")

plot(hc_scaled, main = "Dendrogram - Scaled Complete Linkage", xlab = "", sub = "", cex = 0.6)

clusters_scaled <- cutree(hc_scaled, k = 3)

split(rownames(USArrests_scaled), clusters_scaled)
## $`1`
## [1] "Alabama"        "Alaska"         "Georgia"        "Louisiana"     
## [5] "Mississippi"    "North Carolina" "South Carolina" "Tennessee"     
## 
## $`2`
##  [1] "Arizona"    "California" "Colorado"   "Florida"    "Illinois"  
##  [6] "Maryland"   "Michigan"   "Nevada"     "New Mexico" "New York"  
## [11] "Texas"     
## 
## $`3`
##  [1] "Arkansas"      "Connecticut"   "Delaware"      "Hawaii"       
##  [5] "Idaho"         "Indiana"       "Iowa"          "Kansas"       
##  [9] "Kentucky"      "Maine"         "Massachusetts" "Minnesota"    
## [13] "Missouri"      "Montana"       "Nebraska"      "New Hampshire"
## [17] "New Jersey"    "North Dakota"  "Ohio"          "Oklahoma"     
## [21] "Oregon"        "Pennsylvania"  "Rhode Island"  "South Dakota" 
## [25] "Utah"          "Vermont"       "Virginia"      "Washington"   
## [29] "West Virginia" "Wisconsin"     "Wyoming"

This is the dendrogram for the scaled data. Because scaling gives every variable equal weight in the distance calculation, the resulting three-cluster solution differs from the one obtained on the raw data.

  (d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

Scaling changes the clustering results substantially. Without scaling, the variable with by far the largest variance (Assault) dominates the Euclidean distance calculation, so the clusters largely reflect assault rates alone. After scaling, all four variables contribute comparably. Yes, in my opinion the variables should be scaled before computing dissimilarities when they are measured on different scales or have very different variances; otherwise high-variance variables effectively determine the clustering.

Therefore, scaling is essential when variables are measured in different units or have vastly different variances, as in the USArrests dataset.
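To make the variance disparity concrete, one can inspect the per-variable variances of the raw data (the exact values below are approximate from memory and worth re-running):

# Variances of the unscaled variables: Assault (~6945) dwarfs UrbanPop (~210),
# Rape (~88), and Murder (~19), so it dominates unscaled Euclidean distances.
apply(USArrests, 2, var)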

3. In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
  (a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables. Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
set.seed(1234)

class1 <- matrix(rnorm(20 * 50, mean = 0), nrow = 20)
class2 <- matrix(rnorm(20 * 50, mean = 3), nrow = 20)
class3 <- matrix(rnorm(20 * 50, mean = -3), nrow = 20)

data <- rbind(class1, class2, class3)
true_labels <- rep(1:3, each = 20)
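
A quick sanity check (not part of the exercise) confirms the dimensions and the three mean shifts:

dim(data)                                            # 60 observations, 50 variables
round(tapply(rowMeans(data), true_labels, mean), 1)  # class means near 0, 3, and -3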
  (b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.
pca_result <- prcomp(data)

plot(pca_result$x[, 1:2], col = true_labels, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA - First Two Components")

The plot shows three clearly separated groups, so we can proceed to part (c).

  (c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels? Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
set.seed(1234)
kmeans_3 <- kmeans(data, centers = 3, nstart = 20)

table(Cluster = kmeans_3$cluster, Truth = true_labels)
##        Truth
## Cluster  1  2  3
##       1  0 20  0
##       2 20  0  0
##       3  0  0 20

K-means with K = 3 recovers the true classes perfectly; the confusion table is a permuted diagonal, reflecting only the arbitrary numbering of the clusters.
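Because the cluster numbers are arbitrary, a label-agnostic summary is convenient; one simple, package-free sketch matches each cluster to its majority true class:

# Fraction of observations assigned to their cluster's majority class
# (equals 1 here because the table above is a permuted diagonal).
tab <- table(kmeans_3$cluster, true_labels)
sum(apply(tab, 1, max)) / length(true_labels)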

  (d) Perform K-means clustering with K = 2. Describe your results.
set.seed(1234)
kmeans_2 <- kmeans(data, centers = 2, nstart = 20)

table(Cluster = kmeans_2$cluster, Truth = true_labels)
##        Truth
## Cluster  1  2  3
##       1  0 20  0
##       2 20  0 20

With K = 2, K-means keeps class 2 as its own cluster but merges classes 1 and 3 into a single cluster, so one of the true groupings is lost.

  (e) Now perform K-means clustering with K = 4, and describe your results.
set.seed(1234)
kmeans_4 <- kmeans(data, centers = 4, nstart = 20)

table(Cluster = kmeans_4$cluster, Truth = true_labels)
##        Truth
## Cluster  1  2  3
##       1 20  0  0
##       2  0 20  0
##       3  0  0 10
##       4  0  0 10

With K = 4, K-means recovers classes 1 and 2 but splits class 3 into two clusters of 10 observations each, creating an unnecessary partition; so K = 4 is also not preferable here.
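Comparing the total within-cluster sum of squares across the three fits gives a rough elbow-style check; it should show a large drop from K = 2 to K = 3 and only a small further drop from K = 3 to K = 4, pointing to K = 3:

# Total within-cluster sum of squares for each choice of K
c(K2 = kmeans_2$tot.withinss,
  K3 = kmeans_3$tot.withinss,
  K4 = kmeans_4$tot.withinss)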

  (f) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
set.seed(1234)
pc_data <- pca_result$x[, 1:2]
kmeans_pc <- kmeans(pc_data, centers = 3, nstart = 20)

table(Cluster = kmeans_pc$cluster, Truth = true_labels)
##        Truth
## Cluster  1  2  3
##       1  0 20  0
##       2 20  0  0
##       3  0  0 20

Clustering on the first two principal component scores also recovers the true classes perfectly: the first two PCs capture the between-class mean shifts, so reducing the data from 50 variables to 2 loses essentially none of the class structure.
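As an optional visual check, the same two-dimensional scores can be re-plotted with points colored by the K-means assignment rather than by the true labels:

plot(pc_data, col = kmeans_pc$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "K-means (K = 3) on First Two PCs")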

  (g) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.
set.seed(12345)
scaled_data <- scale(data)
kmeans_scaled <- kmeans(scaled_data, centers = 3, nstart = 20)

table(Cluster = kmeans_scaled$cluster, Truth = true_labels)
##        Truth
## Cluster  1  2  3
##       1  0 20  0
##       2  0  0 20
##       3 20  0  0

Because every variable in this simulation already has standard deviation close to one, scaling has little effect: the clustering again recovers the three true classes perfectly, consistent with the clear separation seen in the PCA plot of (b). In general, though, scaling equalizes each variable's influence and can change K-means results substantially when the variables are measured on very different scales.

Overall, clustering on the first two principal component scores captures the underlying class structure with essentially no information loss, showing that PCA-based dimensionality reduction can simplify the data without degrading K-means performance.