8. Proportion of Variance Explained (PVE) in two ways for the
USArrests dataset
(a) Using the sdev Output of prcomp()
To calculate the Proportion of Variance Explained (PVE) using the
built-in PCA function, I applied prcomp() to the USArrests dataset with
both centering and scaling enabled. I then used the squared standard
deviations (sdev^2) to compute the PVE for each principal component.
# Load data
data("USArrests")
# Perform PCA with centering and scaling
pca_out <- prcomp(USArrests, scale. = TRUE)
# Calculate the proportion of variance explained
eigenvalues <- pca_out$sdev^2
pve_a <- eigenvalues / sum(eigenvalues)
# Display the PVE
pve_a
[1] 0.62006039 0.24744129 0.08914080 0.04335752
Interpretation:
The first principal component explains 62.0% of the total variance,
the second 24.7%, the third 8.9%, and the fourth 4.3%. Together, the
first two components account for about 86.8% of the variance.
(b) Using Equation 12.10 Directly
To calculate the Proportion of Variance Explained (PVE) manually
using Equation 12.10, I first scaled the USArrests dataset so that all
variables have mean zero and standard deviation one, just like I did in
part (a). Then I used the principal component loadings obtained from the
prcomp() function and followed these steps:
1. I calculated the principal component scores manually by multiplying
the scaled data matrix with the loading matrix.
2. I computed the numerator of Equation 12.10 as the sum of squared
scores for each principal component.
3. I computed the denominator as the total sum of squared values from
the scaled dataset.
4. I divided the numerator by the denominator to get the PVE for each
principal component.
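For reference, the quantity computed in these steps can be written out as follows. This is a transcription under assumed notation (x_ij for the centered and scaled data, phi_jm for the loadings, z_im for the scores, n observations and p variables), so the symbols may differ slightly from the textbook's.

```latex
\mathrm{PVE}_m
  = \frac{\sum_{i=1}^{n} z_{im}^{2}}
         {\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^{2}}
  = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm}\, x_{ij} \right)^{2}}
         {\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^{2}}
```

The numerator corresponds to colSums(scores^2) and the denominator to sum(scaled_data^2) in the code for this part.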
# Scale the data
scaled_data <- scale(USArrests)
# Perform PCA
pca_out <- prcomp(scaled_data)
# Extract loadings
loadings <- pca_out$rotation
# Calculate scores manually
scores <- scaled_data %*% loadings
# Compute numerator and denominator
numerator <- colSums(scores^2)
denominator <- sum(scaled_data^2)
# Calculate PVE
pve_b <- numerator / denominator
pve_b
PC1 PC2 PC3 PC4
0.62006039 0.24744129 0.08914080 0.04335752
This result matches exactly with the output from part (a), confirming
that both methods give the same PVE values when the data is properly
centered and scaled.
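The agreement can also be checked programmatically rather than by eye. The snippet below is a minimal sketch that recomputes both vectors so it runs on its own; it assumes only the base R datasets and stats packages.

```r
# Recompute both PVE vectors so this snippet is self-contained
scaled_data <- scale(USArrests)
pca_out <- prcomp(scaled_data)

# (a) From the standard deviations reported by prcomp()
pve_a <- pca_out$sdev^2 / sum(pca_out$sdev^2)

# (b) From the manually computed scores, following Equation 12.10
scores <- scaled_data %*% pca_out$rotation
pve_b <- colSums(scores^2) / sum(scaled_data^2)

# pve_b carries the names PC1..PC4, so drop them before comparing
all.equal(pve_a, unname(pve_b))
```

all.equal() compares within a small numerical tolerance, which is the appropriate notion of "the same" for floating-point results.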
9(a) Hierarchical Clustering Using Complete Linkage and Euclidean
Distance
I performed hierarchical clustering on the USArrests dataset using
complete linkage and Euclidean distance. I did not scale the variables
in this part.
# Load data
data("USArrests")
# Compute the distance matrix
dist_matrix <- dist(USArrests)
# Perform hierarchical clustering with complete linkage
hc_complete <- hclust(dist_matrix, method = "complete")
# Plot the dendrogram
plot(hc_complete, main = "Dendrogram - Complete Linkage (Unscaled)")

# Dendrogram with clear labels
plot(hc_complete,
main = "Dendrogram - Complete Linkage (Unscaled)",
xlab = "",
sub = "",
cex = 0.6, # Shrink label size
las = 2) # Rotate labels for better spacing

(b) Cut the Dendrogram to Get 3 Clusters
I cut the dendrogram at a height that gives three clusters using
cutree(), and then listed which states belong to each cluster.
# Cut tree into 3 clusters
clusters_unscaled <- cutree(hc_complete, k = 3)
# Group states by clusters
split(names(clusters_unscaled), clusters_unscaled)
$`1`
[1] "Alabama" "Alaska" "Arizona" "California" "Delaware" "Florida" "Illinois" "Louisiana"
[9] "Maryland" "Michigan" "Mississippi" "Nevada" "New Mexico" "New York" "North Carolina" "South Carolina"
$`2`
[1] "Arkansas" "Colorado" "Georgia" "Massachusetts" "Missouri" "New Jersey" "Oklahoma" "Oregon" "Rhode Island"
[10] "Tennessee" "Texas" "Virginia" "Washington" "Wyoming"
$`3`
[1] "Connecticut" "Hawaii" "Idaho" "Indiana" "Iowa" "Kansas" "Kentucky" "Maine" "Minnesota"
[10] "Montana" "Nebraska" "New Hampshire" "North Dakota" "Ohio" "Pennsylvania" "South Dakota" "Utah" "Vermont"
[19] "West Virginia" "Wisconsin"
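A quick way to summarize the listing above is to count the states in each cluster. This is a minimal sketch that rebuilds the same clustering so it can be run on its own.

```r
# Rebuild the unscaled complete-linkage clustering
hc_complete <- hclust(dist(USArrests), method = "complete")
clusters_unscaled <- cutree(hc_complete, k = 3)

# Number of states assigned to each of the three clusters
table(clusters_unscaled)
```

The counts should agree with the lists printed above (16, 14, and 20 states).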
(c) Hierarchical Clustering After Scaling the Variables
I repeated the clustering after scaling all variables to have
standard deviation one.
# Scale the variables
scaled_data <- scale(USArrests)
# Compute new distance matrix
dist_scaled <- dist(scaled_data)
# Perform clustering
hc_scaled <- hclust(dist_scaled, method = "complete")
# Plot the scaled dendrogram
plot(hc_scaled, main = "Dendrogram - Complete Linkage (Scaled)")

# Dendrogram with clear labels
plot(hc_scaled,
main = "Dendrogram - Complete Linkage (Scaled)",
xlab = "",
sub = "",
cex = 0.6, # Shrink label size
las = 2) # Rotate labels for better spacing

Then I again cut the dendrogram into 3 clusters:
clusters_scaled <- cutree(hc_scaled, k = 3)
split(names(clusters_scaled), clusters_scaled)
$`1`
[1] "Alabama" "Alaska" "Georgia" "Louisiana" "Mississippi" "North Carolina" "South Carolina" "Tennessee"
$`2`
[1] "Arizona" "California" "Colorado" "Florida" "Illinois" "Maryland" "Michigan" "Nevada" "New Mexico" "New York" "Texas"
$`3`
[1] "Arkansas" "Connecticut" "Delaware" "Hawaii" "Idaho" "Indiana" "Iowa" "Kansas" "Kentucky"
[10] "Maine" "Massachusetts" "Minnesota" "Missouri" "Montana" "Nebraska" "New Hampshire" "New Jersey" "North Dakota"
[19] "Ohio" "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island" "South Dakota" "Utah" "Vermont" "Virginia"
[28] "Washington" "West Virginia" "Wisconsin" "Wyoming"
(d) Effect of Scaling and Justification
Scaling had a substantial effect on the clustering results.
Before scaling, ‘Assault’ dominated the distance calculations because
its variance is far larger than that of the other variables. As a
result, the clustering mostly reflected differences in assault rates
rather than differences across all four variables.
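Because Euclidean distance sums squared differences across variables, a variable's share of the total variance is a rough indicator of how much it drives the distances. The sketch below computes those shares on the original scale; the name vars is just an illustrative choice.

```r
# Variance of each variable on its original scale
vars <- sapply(USArrests, var)
round(vars, 2)

# Share of the total variance contributed by each variable
round(vars / sum(vars), 3)
```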
After scaling, all variables contributed equally to the distance
calculation. This gave a more balanced view, and the cluster assignments
changed accordingly.
In my opinion, the variables should be scaled before computing
inter-observation dissimilarities, especially when they are measured in
different units (e.g., ‘Murder’ is arrests per 100,000 residents, while
‘UrbanPop’ is a percentage). Without scaling, the clustering is driven
by the high-variance variables rather than by all features equally.
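One way to see how the assignments changed is to cross-tabulate the two clusterings. This is a minimal sketch that recomputes both cuts so it stands alone; cluster numbers are arbitrary, so only the pattern of overlap matters.

```r
# Three-cluster cuts from the unscaled and scaled data
clusters_unscaled <- cutree(hclust(dist(USArrests), method = "complete"), k = 3)
clusters_scaled <- cutree(hclust(dist(scale(USArrests)), method = "complete"), k = 3)

# Rows: unscaled clusters; columns: scaled clusters
table(unscaled = clusters_unscaled, scaled = clusters_scaled)
```

If scaling had no effect, each row would have a single nonzero entry; states spread across several columns indicate reassignment.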
10 (a) Generate Simulated Data for 3 Classes
A simulated dataset was generated with 60 observations (20 in each of
3 distinct classes) and 50 variables. A mean shift was added to each
class to ensure clear separation between them.
set.seed(1)
# Generate 20 observations for each class
x1 <- matrix(rnorm(20 * 50, mean = 0), nrow = 20)
x2 <- matrix(rnorm(20 * 50, mean = 3), nrow = 20)
x3 <- matrix(rnorm(20 * 50, mean = 6), nrow = 20)
# Combine all observations
x <- rbind(x1, x2, x3)
# Create true class labels
labels <- c(rep(1, 20), rep(2, 20), rep(3, 20))
Interpretation:
The three classes appeared well-separated in the PCA plot, confirming
that the mean shifts were sufficient for visual class distinction.
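The code for this step is not shown above, so the following is only a sketch of how such a plot could be produced. It regenerates the simulated data with the same seed, and assumes PCA on the unscaled data; pr_out and pc_scores are illustrative names.

```r
# Regenerate the simulated data (same seed and order as above)
set.seed(1)
x <- rbind(matrix(rnorm(20 * 50, mean = 0), nrow = 20),
           matrix(rnorm(20 * 50, mean = 3), nrow = 20),
           matrix(rnorm(20 * 50, mean = 6), nrow = 20))
labels <- rep(1:3, each = 20)

# PCA and the first two score vectors (assumption: no scaling)
pr_out <- prcomp(x)
pc_scores <- pr_out$x[, 1:2]

# Color each observation by its true class
plot(pc_scores, col = labels, pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = paste("Class", 1:3), col = 1:3, pch = 19)
```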
Interpretation:
The clustering closely aligned with the actual class labels. Where the
cluster numbers differ from the class numbers, this reflects only the
arbitrary numbering of clusters by K-means.
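A comparison of this kind could be run as sketched below. The snippet regenerates the data so it is self-contained; the object name km3 and the choice of nstart = 20 (borrowed from the scaled-data code later in this section) are assumptions.

```r
# Regenerate the simulated data (same seed and order as above)
set.seed(1)
x <- rbind(matrix(rnorm(20 * 50, mean = 0), nrow = 20),
           matrix(rnorm(20 * 50, mean = 3), nrow = 20),
           matrix(rnorm(20 * 50, mean = 6), nrow = 20))
labels <- rep(1:3, each = 20)

# K-means with K = 3, using several random starts
set.seed(1)
km3 <- kmeans(x, centers = 3, nstart = 20)

# Rows: true classes; columns: K-means clusters
table(labels, km3$cluster)
```

A table with exactly one nonzero entry per row and per column indicates a perfect match up to relabeling.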
Interpretation:
Two clusters resulted in the merging of two actual classes, reducing
clustering accuracy and underrepresenting the data’s true structure.
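The two-cluster case can be sketched the same way; km2 is an illustrative name, and nstart = 20 is again an assumption.

```r
# Regenerate the simulated data (same seed and order as above)
set.seed(1)
x <- rbind(matrix(rnorm(20 * 50, mean = 0), nrow = 20),
           matrix(rnorm(20 * 50, mean = 3), nrow = 20),
           matrix(rnorm(20 * 50, mean = 6), nrow = 20))
labels <- rep(1:3, each = 20)

# K-means with K = 2
set.seed(1)
km2 <- kmeans(x, centers = 2, nstart = 20)

# A row pair sharing one column shows which classes were merged
table(labels, km2$cluster)
```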
Interpretation:
Using four clusters over-partitioned the data: one of the original
classes was split into two clusters unnecessarily.
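A sketch of the four-cluster case follows, under the same assumptions as before (km4 is an illustrative name; nstart = 20 is assumed).

```r
# Regenerate the simulated data (same seed and order as above)
set.seed(1)
x <- rbind(matrix(rnorm(20 * 50, mean = 0), nrow = 20),
           matrix(rnorm(20 * 50, mean = 3), nrow = 20),
           matrix(rnorm(20 * 50, mean = 6), nrow = 20))
labels <- rep(1:3, each = 20)

# K-means with K = 4
set.seed(1)
km4 <- kmeans(x, centers = 4, nstart = 20)

# A row with two nonzero entries is a class split across clusters
table(labels, km4$cluster)
```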
Interpretation:
Clustering on the first two principal components preserved much of
the structure and yielded results similar to those obtained on the
original dataset.
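Clustering on the score vectors rather than the raw data could look like the sketch below. It assumes PCA without scaling and three clusters; pr_out and km_pca are illustrative names.

```r
# Regenerate the simulated data (same seed and order as above)
set.seed(1)
x <- rbind(matrix(rnorm(20 * 50, mean = 0), nrow = 20),
           matrix(rnorm(20 * 50, mean = 3), nrow = 20),
           matrix(rnorm(20 * 50, mean = 6), nrow = 20))
labels <- rep(1:3, each = 20)

# Keep only the first two principal component score vectors (60 x 2)
pr_out <- prcomp(x)
set.seed(1)
km_pca <- kmeans(pr_out$x[, 1:2], centers = 3, nstart = 20)

# Rows: true classes; columns: clusters found on the PC scores
table(labels, km_pca$cluster)
```

Working with two columns instead of fifty discards most of the noise dimensions while keeping the direction along which the class means differ.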
Perform K-means Clustering on Scaled Data
The dataset was scaled using scale(), and K-means clustering was
performed again with 3 clusters.
x_scaled <- scale(x)
set.seed(1)
km_scaled <- kmeans(x_scaled, centers = 3, nstart = 20)
table(labels, km_scaled$cluster)
labels  1  2  3
     1  0  0 20
     2  0 20  0
     3 20  0  0
Interpretation:
Scaling the data before clustering helped standardize variable
influence, although in this simulated case the improvement was minimal.
In real-world datasets where variables differ in scale, scaling is
essential to prevent dominance by high-variance features.
