#8

# Load USArrests dataset
data("USArrests")

# Apply PCA
pca_result <- prcomp(USArrests, scale. = TRUE)

# Extract the standard deviations of principal components
sdev <- pca_result$sdev

# Compute variance explained by each principal component
pve <- (sdev^2) / sum(sdev^2)

# Display PVE
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
# Cumulative PVE
cumulative_pve <- cumsum(pve)

# Display cumulative PVE
cumulative_pve
## [1] 0.6200604 0.8675017 0.9566425 1.0000000

PC1 alone explains about 62.0% of the total variance. PC1 and PC2 together explain about 86.8%, so just two dimensions already give a strong summary of the data. Adding PC3 raises the cumulative PVE to about 95.7%, nearly all of the variability, and all four components together explain 100%, as expected.
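
For reference, here is a quick scree plot of these quantities (a minimal sketch using the pve and cumulative_pve objects computed above; the side-by-side layout is just a display choice):

# Scree plot: PVE and cumulative PVE for the four components
par(mfrow = c(1, 2))
plot(pve, type = "b", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", ylim = c(0, 1))
plot(cumulative_pve, type = "b", xlab = "Principal Component",
     ylab = "Cumulative PVE", ylim = c(0, 1))
par(mfrow = c(1, 1))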

#b

# Load data and perform PCA
data("USArrests")
x <- scale(USArrests)  # Standardize the data
pca <- prcomp(x)

# Get PCA scores (Z matrix)
z <- pca$x

# Numerator: sum of squared scores for each PC (across all observations)
num <- colSums(z^2)

# Denominator: total sum of squares of original (standardized) data
denom <- sum(x^2)

# Compute PVE using Equation 12.10
pve_manual <- num / denom

# Print manually computed PVE
pve_manual
##        PC1        PC2        PC3        PC4 
## 0.62006039 0.24744129 0.08914080 0.04335752

PC1 (the first principal component) explains about 62.0% of the total variance in the USArrests data, PC2 about 24.7%, PC3 about 8.9%, and PC4 about 4.3%. These values agree with the PVE obtained from pca_result$sdev in part (a), exactly as Equation 12.10 requires.
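
As a quick sanity check (not required by the exercise), the Equation 12.10 result should match the sdev-based PVE from part (a) up to floating-point tolerance; a minimal sketch:

# Compare the manual Equation 12.10 PVE with the sdev-based PVE from #8(a)
all.equal(unname(pve_manual), pve)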

#9a

# Load the data
data("USArrests")

# Standardize the variables
x <- scale(USArrests)

# Compute the Euclidean distance matrix
d <- dist(x)

# Perform hierarchical clustering using complete linkage
hc_complete <- hclust(d, method = "complete")

# Plot the dendrogram
plot(hc_complete, main = "Hierarchical Clustering with Complete Linkage", xlab = "", sub = "", cex = 0.7)

Complete linkage defines the distance between two clusters as the largest pairwise distance between their members, which tends to produce compact, well-separated clusters. The vertical axis (height) represents the dissimilarity at which clusters are joined: the higher the fusion, the more dissimilar the groups being merged.

Left side of the dendrogram: states like North Dakota, South Dakota, West Virginia, Iowa, and Maine merge early (at low height), indicating similar crime profiles, most likely low overall crime rates.

Right side: South Carolina, Mississippi, Georgia, and North Carolina merge much later, suggesting they are more distinct and likely have higher arrest rates for violent crime.

Alaska stands out: it joins the rest of its branch only at a relatively large height, indicating that its standardized crime profile differs from most states, likely because it combines high assault and rape rates with a comparatively low urban population.

Middle of the tree (e.g., Missouri, Oregon, Washington): these states form a moderate-sized cluster that is distinct from both extremes, likely representing average or mixed crime rates.
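
To see the fusion heights referred to above, one can inspect hc_complete$height directly; a minimal sketch using the objects already defined (hc_average and hc_single are new objects introduced here only for comparison):

# Largest fusion heights: the final, most dissimilar merges in the tree
tail(sort(hc_complete$height), 5)

# Optional comparison with other linkage rules on the same distance matrix
hc_average <- hclust(d, method = "average")
hc_single <- hclust(d, method = "single")
par(mfrow = c(1, 2))
plot(hc_average, main = "Average Linkage", xlab = "", sub = "", cex = 0.7)
plot(hc_single, main = "Single Linkage", xlab = "", sub = "", cex = 0.7)
par(mfrow = c(1, 1))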

#b

# Cut the dendrogram into 3 clusters
clusters <- cutree(hc_complete, k = 3)

# Assign states to their respective clusters
cluster1 <- names(clusters[clusters == 1])
cluster2 <- names(clusters[clusters == 2])
cluster3 <- names(clusters[clusters == 3])

# Display the clusters
cat("Cluster 1:\n", paste(cluster1, collapse = ", "), "\n\n")
## Cluster 1:
##  Alabama, Alaska, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee
cat("Cluster 2:\n", paste(cluster2, collapse = ", "), "\n\n")
## Cluster 2:
##  Arizona, California, Colorado, Florida, Illinois, Maryland, Michigan, Nevada, New Mexico, New York, Texas
cat("Cluster 3:\n", paste(cluster3, collapse = ", "), "\n")
## Cluster 3:
##  Arkansas, Connecticut, Delaware, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Massachusetts, Minnesota, Missouri, Montana, Nebraska, New Hampshire, New Jersey, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Dakota, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming
# Plot the dendrogram with clusters highlighted
plot(hc_complete, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_complete, k = 3, border = "red")

Cluster 1: High Crime States
States: Alabama, Alaska, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee

These states are grouped together because they tend to have higher rates of violent crime (murder and assault). Alaska is the most distinctive member, likely due to its very high assault and rape rates; it is still grouped with the other high-crime states, but only at a relatively large dendrogram height.

Cluster 2: Moderate Crime States
States: Arizona, California, Colorado, Florida, Illinois, Maryland, Michigan, Nevada, New Mexico, New York, Texas

These states exhibit moderate levels of crime, likely driven by large urban centers and higher population density. Many are large, diverse, and urbanized, contributing to variation in crime statistics without being as extreme as Cluster 1.

Cluster 3: Low Crime States
States: Arkansas, Connecticut, Delaware, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Massachusetts, Minnesota, Missouri, Montana, Nebraska, New Hampshire, New Jersey, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Dakota, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming

This is the largest cluster, representing states with relatively low overall crime rates. Many are rural or smaller states, or ones with strong law enforcement and low urban crime. These states tend to cluster early (low height) in the dendrogram, indicating high similarity.
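
One way to support these high/moderate/low characterizations is to look at the average of the original (unscaled) variables within each cluster; a minimal sketch using the clusters vector from above:

# Mean of each original variable within the three clusters
aggregate(USArrests, by = list(Cluster = clusters), FUN = mean)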

#c

# Load USArrests data
data("USArrests")

# Standardize variables: mean = 0, sd = 1
scaled_data <- scale(USArrests)

# Compute Euclidean distance on standardized data
d_scaled <- dist(scaled_data)

# Perform hierarchical clustering using complete linkage
hc_scaled <- hclust(d_scaled, method = "complete")

# Plot dendrogram
plot(hc_scaled, main = "Hierarchical Clustering (Complete Linkage, Scaled Data)",
     xlab = "", sub = "", cex = 0.7)

This dendrogram shows hierarchical clustering using complete linkage on data where each variable has been standardized (mean = 0, sd = 1). This prevents any variable (e.g., Assault) from dominating due to larger scale.

Cluster 1: Low Crime States
States: South Dakota, West Virginia, North Dakota, Vermont, Maine, Iowa, New Hampshire, Idaho, Montana, Nebraska, Kentucky, Arkansas, Wyoming
Interpretation: These states cluster early, suggesting similar, lower levels of violent crime across the standardized variables. They likely share low murder and assault rates.

Cluster 2: Moderate Crime States
States: Connecticut, Delaware, Massachusetts, New Jersey, Rhode Island, Minnesota, Wisconsin, Indiana, Ohio, Pennsylvania, Hawaii, Utah, Colorado, California, Nevada, Oregon, Washington, Missouri, Kansas, Oklahoma
Interpretation: These states likely have mixed crime profiles: moderate assault, urban populations, and arrest rates. Urbanized states like California and New Jersey appear here, suggesting that scaling equalized their influence.

Cluster 3: High Crime States
States: Alabama, Louisiana, Georgia, Mississippi, North Carolina, South Carolina, Tennessee, Texas, Florida, Illinois, New York, Arizona, Michigan, Maryland, New Mexico, Alaska
Interpretation: These states likely have higher standardized crime rates, especially in murder and assault, leading to their separation and late merging in the dendrogram. Alaska still appears distinct, but closer to these states than before scaling.
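
The memberships described above can be extracted from the scaled dendrogram the same way as in part (b); a minimal sketch (clusters_scaled is a new object introduced here):

# Cut the scaled-data dendrogram into 3 clusters and list the members
clusters_scaled <- cutree(hc_scaled, k = 3)
split(names(clusters_scaled), clusters_scaled)
table(clusters_scaled)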

#d

Effect of scaling: standardizing the variables to mean 0 and standard deviation 1 has a significant effect on the hierarchical clustering results:

Before scaling: variables with larger numerical ranges (like Assault or UrbanPop) dominate the Euclidean distance calculation, which skews the clustering toward those variables.

After scaling: all variables contribute equally to the distance calculations, so the clusters reflect overall similarity across all features rather than just the high-magnitude ones. For example, when the raw data are clustered, a state with an extreme Assault value such as Alaska tends to merge very late, whereas it integrates more naturally after scaling (see the sketch below).
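
To make this comparison concrete, one could also cluster the raw data and cross-tabulate the two 3-cluster solutions; a minimal sketch (hc_unscaled is a new object introduced here):

# Complete linkage on the raw (unscaled) USArrests data
hc_unscaled <- hclust(dist(USArrests), method = "complete")

# Cross-tabulate the 3-cluster solutions with and without scaling
table(Unscaled = cutree(hc_unscaled, k = 3),
      Scaled = cutree(hc_scaled, k = 3))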

Equal contribution: scaling ensures that all variables contribute equally to the distance calculations, rather than letting variables with larger units dominate.

Meaningful clusters: the clustering should reflect overall patterns across all variables, not just those with large numerical magnitude.

PCA analogy: just as in principal component analysis, scaling is critical when variables are measured on very different scales; the same logic applies to clustering.

Scaling is essential for fair and interpretable clustering in datasets like USArrests, where features like “Assault” and “UrbanPop” vary on very different scales. Without scaling, you risk clustering on magnitude rather than structure.

#10a

set.seed(1)  # for reproducibility

# Number of observations per class
n <- 20

# Number of variables (features)
p <- 50

# Class 1: centered at 0
class1 <- matrix(rnorm(n * p), nrow = n)

# Class 2: centered at 2
class2 <- matrix(rnorm(n * p, mean = 2), nrow = n)

# Class 3: centered at -2
class3 <- matrix(rnorm(n * p, mean = -2), nrow = n)

# Combine all data
X <- rbind(class1, class2, class3)

# Create true class labels (1, 2, 3)
true_labels <- c(rep(1, n), rep(2, n), rep(3, n))

# Optional: convert to data frame
df <- data.frame(X)
head(df)
##           X1          X2         X3          X4         X5          X6
## 1 -0.6264538  0.91897737 -0.1645236  2.40161776 -0.5686687 -0.62036668
## 2  0.1836433  0.78213630 -0.2533617 -0.03924000 -0.1351786  0.04211587
## 3 -0.8356286  0.07456498  0.6969634  0.68973936  1.1780870 -0.91092165
## 4  1.5952808 -1.98935170  0.5566632  0.02800216 -1.5235668  0.15802877
## 5  0.3295078  0.61982575 -0.6887557 -0.74327321  0.5939462 -0.65458464
## 6 -0.8204684 -0.05612874 -0.7074952  0.18879230  0.3329504  1.76728727
##           X7         X8         X9        X10        X11         X12        X13
## 1 -0.5059575 -1.9143594  0.4251004 -1.2313234  0.4094018 -1.73321841  0.7073107
## 2  1.3430388  1.1765833 -0.2386471  0.9838956  1.6888733  0.00213186  1.0341077
## 3 -0.2145794 -1.6649724  1.0584830  0.2199248  1.5865884 -0.63030033  0.2234804
## 4 -0.1795565 -0.4635304  0.8864227 -1.4672500 -0.3309078 -0.34096858 -0.8787076
## 5 -0.1001907 -1.1159201 -0.6192430  0.5210227 -2.2852355 -1.15657236  1.1629646
## 6  0.7126663 -0.7508190  2.2061025 -0.1587546  2.4976616  1.80314191 -2.0001649
##          X14        X15        X16        X17        X18        X19         X20
## 1  0.9510128  0.3981302  0.8936737  1.3079015  0.7395892 -2.5923277  0.76258651
## 2 -0.3892372 -0.4075286 -1.0472981  1.4970410 -1.0634574  1.3140022  1.11143108
## 3 -0.2843307  1.3242586  1.9713374  0.8147027  0.2462108 -0.6355430 -0.92320695
## 4  0.8574098 -0.7012317 -0.3836321 -1.8697888 -0.2894994 -0.4299788  0.16434184
## 5  1.7196273 -0.5806143  1.6541453  0.4820295 -2.2648894 -0.1693183  1.15482519
## 6  0.2700549 -1.0010722  1.5122127  0.4561356 -1.4088505  0.6122182 -0.05652142
##          X21         X22         X23         X24        X25         X26
## 1  1.0744410  1.43506957 -0.43383274 -0.45303708 -0.3572989  0.07730312
## 2  1.8956548 -0.71037115  1.77261118  2.16536850 -1.1468141 -0.29686864
## 3 -0.6029973 -0.06506757 -0.01825971  1.24574667 -0.5174205 -1.18324224
## 4 -0.3908678 -1.75946874  0.85281499  0.59549803 -0.3621238  0.01129269
## 5 -0.4162220  0.56972297  0.20516290  0.00488445  2.3505543  0.99160104
## 6 -0.3756574  1.61234680 -3.00804860  0.27936078  2.4465314  1.59396745
##           X27        X28         X29        X30        X31        X32
## 1 -0.73732753 -0.4184181 -1.21536404 -1.3765192 -0.3410670 -0.2555104
## 2  0.29066665  0.3551355 -0.02255863  0.1676799  1.5024245 -1.7869381
## 3 -0.88484957  0.5134811  0.70123930  1.5846291  0.5283077  1.7846628
## 4  0.20800648  0.0186074 -0.58748203  1.6778890  0.5421914  1.7635863
## 5 -0.04773017  1.3184490 -0.60672794  0.4882967 -0.1366734  0.6896002
## 6 -1.68452065 -0.0658320  1.09664022  0.8786733 -1.1367339 -1.1007406
##           X33        X34        X35         X36         X37        X38
## 1 -2.43263975  0.3309763  1.0778503 -0.70756823  0.48934096 -2.2451526
## 2 -0.34048493  0.9763275 -1.1989744  1.97157201 -0.77891030 -1.3353714
## 3  0.71303319 -0.8433399  0.2166370 -0.08999868  1.74355935  1.2827752
## 4 -0.65903739 -0.9705799  0.1430870 -0.01401725 -0.07838729  0.6907959
## 5 -0.03640262 -1.7715313 -1.0657501 -1.12345694 -0.97555379 -0.9670627
## 6 -1.59328630 -0.3224703 -0.4286234 -1.34413012  0.07065982 -1.3457937
##           X39        X40         X41         X42        X43        X44
## 1 -0.01128123 -0.7598457 -1.08690882 -0.17405549  2.0057186  0.7960927
## 2  0.61967726  1.1489591 -1.82608301  0.96129056 -2.0705715  0.9864283
## 3 -1.28123874 -0.8424763  0.99528181  0.29382666  3.0557424 -0.7945317
## 4 -0.12426133  0.3914133 -0.01186178  0.08099936 -0.2613506 -0.3088180
## 5  0.17574165  0.8913772 -0.59962839  0.18366184 -0.4543933  0.3614448
## 6  1.69277379 -1.3352587 -0.17794799  0.16625504  0.1575606  1.3987911
##           X45        X46        X47         X48         X49         X50
## 1 -0.94061130 -1.5414026  0.7372137 -0.04163922 -0.79506319  0.46365865
## 2  0.63470293  0.1943211  2.3213339  0.67611201 -0.01995512  0.07477833
## 3 -0.06248848  0.2644225  0.3489093  0.86643615 -2.51442512 -0.48683624
## 4  0.18283787 -1.1187352 -1.1339167  0.23517502  2.21095203  0.74891082
## 5  1.10364102  0.6509530  0.4213353 -0.93397013 -1.48876223  0.46423458
## 6  1.75203562 -1.0329002 -0.9245563  0.81325217 -1.16075188  0.12942046

Each row represents one observation (from one of the three classes) and each column is one of the 50 features. Since the class labels have not yet been attached, the data frame df contains only the raw variables; the next step is to add the class labels so we can color and interpret the clusters in the PCA and K-means steps that follow.

#b

# Ensure Class is added correctly to the data frame
df$Class <- factor(true_labels)

# Perform PCA only on numeric columns (exclude Class column)
pca_result <- prcomp(df[, -ncol(df)], scale. = TRUE)

# Create data frame of first two PCs and Class
pca_df <- data.frame(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  Class = df$Class
)

# Plot using ggplot2
library(ggplot2)

ggplot(pca_df, aes(x = PC1, y = PC2, color = Class)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(title = "PCA: First Two Principal Components",
       x = "PC1", y = "PC2") +
  theme_minimal()

Each point represents one of the 60 simulated observations (in 50D), now reduced to 2D using PCA. The colors indicate the true class labels (Class 1 = red, Class 2 = green, Class 3 = blue). The three classes are clearly separated along the first principal component (PC1), which captures the majority of variance due to the class-wise mean shifts.

The separation is very clear — there’s no significant overlap between the three groups. Therefore, you can proceed to part (c) (K-means clustering), as the PCA plot confirms the clusters are distinguishable in the feature space.
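
The claim that PC1 carries most of the class-driven variation can be checked from the PCA summary; a minimal sketch using the pca_result object from this part:

# Proportion of variance explained by the leading principal components
summary(pca_result)$importance[, 1:5]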

#c

set.seed(2)  # For reproducibility

# Perform K-means clustering on the PCA input data (50 features, not labels)
kmeans_result <- kmeans(df[, -ncol(df)], centers = 3, nstart = 20)

# Predicted cluster labels
kmeans_labels <- kmeans_result$cluster

# Compare with true class labels
table(True = df$Class, Clustered = kmeans_labels)
##     Clustered
## True  1  2  3
##    1 20  0  0
##    2  0  0 20
##    3  0 20  0

Class 1 (True label 1) was perfectly clustered into Cluster 1 (20/20 correct). Class 2 (True label 2) was entirely assigned to Cluster 3. Class 3 (True label 3) was entirely assigned to Cluster 2.

K-means perfectly identified all three classes; it simply assigned different numeric labels than the true classes, which is expected since K-means cluster IDs are arbitrary and do not preserve label order. The clustering is 100% accurate up to a relabeling (here, K-means cluster 3 corresponds to true Class 2 and cluster 2 to true Class 3).
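
If a single accuracy number is wanted, the arbitrary cluster IDs can be mapped to the majority true class within each cluster; a minimal sketch (conf_tab, mapping, and relabeled are helper objects introduced here for illustration):

# Map each K-means cluster to the true class it mostly contains,
# then compute the resulting agreement rate
conf_tab <- table(True = df$Class, Clustered = kmeans_labels)
mapping <- apply(conf_tab, 2, which.max)      # cluster ID -> majority true class
relabeled <- mapping[as.character(kmeans_labels)]
mean(relabeled == as.integer(df$Class))       # proportion of observations recovered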

#d

set.seed(3)  # For reproducibility

# Perform K-means with 2 clusters
kmeans_k2 <- kmeans(df[, -ncol(df)], centers = 2, nstart = 20)

# Compare with true class labels
table(True = df$Class, Clustered = kmeans_k2$cluster)
##     Clustered
## True  1  2
##    1 20  0
##    2  0 20
##    3 20  0

The K-means algorithm was forced to split the data into only 2 clusters, even though the true structure has 3 well-separated groups. It merged Class 1 (centered at 0) and Class 3 (centered at -2) into a single cluster, while Class 2 (centered at +2) was kept separate. Class 1 sits roughly midway between the other two groups in the 50-dimensional feature space, so with only two centroids it must be absorbed into one of its neighbors; in this run it was paired with Class 3. The class-centroid distances below make this geometry explicit.
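
A minimal sketch of that geometry, using the class matrices generated in part (a) (centroids is a new object introduced here):

# Pairwise Euclidean distances between the true class centroids in 50 dimensions
centroids <- rbind(Class1 = colMeans(class1),
                   Class2 = colMeans(class2),
                   Class3 = colMeans(class3))
dist(centroids)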

#e

set.seed(4)  # For reproducibility

# Perform K-means clustering with 4 centers
kmeans_k4 <- kmeans(df[, -ncol(df)], centers = 4, nstart = 20)

# Compare with true class labels
table(True = df$Class, Clustered = kmeans_k4$cluster)
##     Clustered
## True  1  2  3  4
##    1  9 11  0  0
##    2  0  0  0 20
##    3  0  0 20  0

Overestimating the number of clusters (K = 4 instead of the true K = 3) forced K-means to split one natural class, in this case Class 1, which was divided into groups of 9 and 11 observations along some direction of random within-class spread. The other two classes were distinct enough to remain intact.

K = 4 therefore causes over-segmentation: one class is split purely to satisfy the requested number of clusters, since all three classes were generated with the same spread. This reinforces that K = 3, which matches the true structure of the data, is the best choice; a quick check using the total within-cluster sum of squares is sketched below.
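
A minimal sketch of that check, comparing the total within-cluster sum of squares across several values of K (wss is a new object, and the seed is chosen arbitrarily for this sketch):

# Elbow plot: total within-cluster SS for K = 1..6
set.seed(7)
wss <- sapply(1:6, function(k)
  kmeans(df[, -ncol(df)], centers = k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b",
     xlab = "Number of clusters K", ylab = "Total within-cluster SS")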

#f

set.seed(5)  # For reproducibility

# Use the PCA result from earlier (already done in part b)
# Extract first two principal components
pca_data <- pca_result$x[, 1:2]

# Perform K-means clustering on PC1 and PC2
kmeans_pca <- kmeans(pca_data, centers = 3, nstart = 20)

# Compare with true labels
table(True = df$Class, Clustered = kmeans_pca$cluster)
##     Clustered
## True  1  2  3
##    1  0 20  0
##    2 20  0  0
##    3  0  0 20

The first two principal components retained enough of the structure of the original high-dimensional data to distinguish the three classes perfectly. This confirms that PCA followed by K-means is a powerful combination for dimensionality reduction and clustering, especially when the main variation lies along a few directions, as the mean-shifted classes guarantee here.
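
One way to visualize this agreement is to redraw the PC1/PC2 scatterplot colored by the K-means assignment and shaped by the true class; a minimal sketch reusing pca_df and kmeans_pca from above:

# Add the K-means (on PC1/PC2) cluster assignment to the plotting data frame
pca_df$Cluster <- factor(kmeans_pca$cluster)

# Perfect agreement shows up as each color containing exactly one shape
ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster, shape = Class)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(title = "K-means on PC1/PC2 vs. True Classes") +
  theme_minimal()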

#g

set.seed(6)  # For reproducibility

# Scale all 50 features to have mean 0 and standard deviation 1
scaled_data <- scale(df[, -ncol(df)])  # exclude the Class label

# Apply K-means clustering with K = 3 on scaled data
kmeans_scaled <- kmeans(scaled_data, centers = 3, nstart = 20)

# Compare with true class labels
table(True = df$Class, Clustered = kmeans_scaled$cluster)
##     Clustered
## True  1  2  3
##    1 20  0  0
##    2  0  0 20
##    3  0 20  0

In part (c) (K-means with K = 3 on the raw, unscaled features), we also obtained perfect clustering. So, in this specific simulation, scaling had no negative effect and did not change the clustering quality, because the signal (the class mean shift) was strong and spread uniformly across all 50 features.

Scaling neither harmed nor improved the clustering in this controlled case because all 50 features were generated on the same scale (standard deviation 1), every feature carried the same class mean shift, and therefore every feature contributed equally to the class separation even before standardizing.
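
To confirm that scaling left the assignments essentially unchanged, one can cross-tabulate the clusterings from parts (c) and (g) directly; a minimal sketch (note that the cluster IDs may be permuted, so perfect agreement appears as one large count in each row and column):

# Agreement between K-means on the raw features and on the scaled features
table(Unscaled = kmeans_result$cluster, Scaled = kmeans_scaled$cluster)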