In Section 12.2.3, a formula for calculating PVE was given in Equation 12.10. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate PVE in two ways:
- Using the sdev output of the prcomp() function, as was done in Section 12.2.3.
pr <- prcomp(USArrests, scale = TRUE)
# Calculate Proportion of Variance Explained (PVE) using sdev
pve_a <- pr$sdev^2 / sum(pr$sdev^2)
print(pve_a)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
This code computes the proportion of variance explained (PVE) by each principal component using the squared standard deviations (sdev) of the components obtained from prcomp(). Each squared standard deviation corresponds to the variance captured by the respective principal component. The PVE is then the variance explained by each component divided by the total variance.
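As an optional cross-check, summary() on the prcomp object reports the same proportions in its "Proportion of Variance" row; the sketch below reuses the pr object fitted above.
# Cross-check: the "Proportion of Variance" row of summary(pr) should match pve_a
summary(pr)$importance["Proportion of Variance", ]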
- By applying Equation 12.10 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 12.10 to obtain the PVE. These two approaches should give the same results.
# Scale the data as in the PCA process
scaled_data <- scale(USArrests)
# Calculate squared scores per component, then normalize by total variance
pve_b <- colSums(pr$x^2) / sum(colSums(scaled_data^2))
print(pve_b)
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Here we replicate Equation 12.10 from the textbook, which states that the PVE of the \(m\)-th principal component is the sum of its squared component scores divided by the total sum of squares of the (scaled) data:
\[
\text{PVE}_m = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.
\]
This approach verifies the computation done via sdev, ensuring correctness.
As expected, both methods give the same PVE values, reaffirming that PCA decomposes the total variance consistently whether it is calculated from the eigenvalues (via sdev^2) or from the projected scores (via Equation 12.10).
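For completeness, here is a sketch that follows the wording of the exercise more literally: it rebuilds the component scores from the loading matrix pr$rotation before applying Equation 12.10. It reuses the pr and scaled_data objects defined above.
# Compute the scores explicitly from the loadings, then apply Equation 12.10
scores_from_loadings <- scaled_data %*% pr$rotation
pve_c <- colSums(scores_from_loadings^2) / sum(scaled_data^2)
# Should agree with pve_a and pve_b up to numerical precision
all.equal(unname(pve_c), pve_a)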
Consider the USArrests data. We will now perform hierarchical clustering on the states.
- Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
set.seed(42)
hc <- hclust(dist(USArrests), method = "complete")
plot(hc, main = "Hierarchical Clustering - Complete Linkage", xlab = "States", sub = "", cex = 0.6)
We use the hclust() function with complete linkage to cluster the states based on their arrest statistics; dist() computes Euclidean distances by default. The dendrogram gives a visual representation of how the states group together based on their similarity across the four variables.
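One way to judge where to cut the tree is to inspect the heights of the last few merges; a large jump suggests a natural number of clusters. This is a small optional sketch using the hc object from above.
# Heights at which the final five merges occur
tail(sort(hc$height), 5)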
- Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
ct <- cutree(hc, 3)
sapply(1:3, function(i) names(ct)[ct == i])
## [[1]]
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
##
## [[2]]
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
##
## [[3]]
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
We apply cutree() to divide the states into three clusters and print out the states in each. This shows how the states are grouped based on similar levels of violent crime and urban population.
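For a more compact summary than the full state lists, a frequency table of the cutree() output gives the cluster sizes:
# Number of states in each of the three clusters
table(ct)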
- Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
hc2 <- hclust(dist(scale(USArrests)), method = "complete")
plot(hc2, main = "Hierarchical Clustering (Scaled Data) - Complete Linkage", xlab = "States", sub = "", cex = 0.6)
This time, we standardize the variables using scale() so that each has standard deviation one and no single variable dominates the distance computation simply because of its units. This matters because features like Assault and UrbanPop are measured on very different scales.
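A quick look at the per-variable variances of the raw data shows why this matters: Assault has by far the largest variance, so without scaling it would dominate the Euclidean distances.
# Variances of the raw variables; Assault dwarfs the others
apply(USArrests, 2, var)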
- What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
ct_scaled <- cutree(hc2, 3)
sapply(1:3, function(i) names(ct_scaled)[ct_scaled == i])
## [[1]]
## [1] "Alabama" "Alaska" "Georgia" "Louisiana"
## [5] "Mississippi" "North Carolina" "South Carolina" "Tennessee"
##
## [[2]]
## [1] "Arizona" "California" "Colorado" "Florida" "Illinois"
## [6] "Maryland" "Michigan" "Nevada" "New Mexico" "New York"
## [11] "Texas"
##
## [[3]]
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Missouri" "Montana" "Nebraska" "New Hampshire"
## [17] "New Jersey" "North Dakota" "Ohio" "Oklahoma"
## [21] "Oregon" "Pennsylvania" "Rhode Island" "South Dakota"
## [25] "Utah" "Vermont" "Virginia" "Washington"
## [29] "West Virginia" "Wisconsin" "Wyoming"
After scaling, the clusters differ noticeably from the unscaled result, because scaling removes the dominance of variables with larger numerical ranges. Since USArrests mixes variables with different units and ranges (UrbanPop is a percentage, while Murder, Assault, and Rape are arrest rates per 100,000 residents), the variables should be scaled before the inter-observation dissimilarities are computed, so that no single variable, Assault in particular, dominates the clustering outcome.
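To see how much the partition actually changes, we can cross-tabulate the unscaled and scaled cluster assignments (ct and ct_scaled from above):
# Cross-tabulation of the two three-cluster solutions
table(Unscaled = ct, Scaled = ct_scaled)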
In this problem, you will generate simulated data, and then perform PCA and \(K\)-means clustering on the data.
- Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.
set.seed(42)
data <- matrix(rnorm(60 * 50), ncol = 50)
classes <- rep(c("A", "B", "C"), each = 20)
data[classes == "B", 1:10] <- data[classes == "B", 1:10] + 1.2
data[classes == "C", 5:30] <- data[classes == "C", 5:30] + 1
Here we simulate a data set of 60 observations (20 per class) and 50 variables. Classes B and C are shifted away from class A (and from each other) by adding mean shifts to different, partially overlapping subsets of the variables, which gives the three classes visible separation.
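As a sanity check on the simulated structure, we can compare class-wise means for a few columns; the choice of columns 2, 20, and 40 below is just for illustration (column 2 is shifted for class B only, column 20 for class C only, and column 40 for neither).
# Class-wise means of selected columns to confirm the intended mean shifts
round(sapply(split(as.data.frame(data[, c(2, 20, 40)]), classes), colMeans), 2)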
- Perform PCA on the 60 observations and plot the first two principal component score vectors.
library(ggplot2)
pca <- prcomp(data)
ggplot(
  data.frame(Class = classes, PC1 = pca$x[, 1], PC2 = pca$x[, 2]),
  aes(x = PC1, y = PC2, color = Class)
) +
  geom_point(size = 2) +
  labs(title = "PCA - First Two Components") +
  theme_minimal()
PCA reduces the dimensionality of the data so that the class structure can be visualized in two dimensions. The clear separation of the three classes in the PC1-PC2 plane confirms that the simulated mean shifts produce distinguishable groups.
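It is also worth checking how much of the total variance the first two components capture; this optional sketch uses the pca object fitted above.
# Standard deviation, proportion of variance, and cumulative proportion for PC1 and PC2
summary(pca)$importance[, 1:2]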
- Perform \(K\)-means clustering of the observations with \(K = 3\). How well do the clusters that you obtained in \(K\)-means clustering compare to the true class labels?
km <- kmeans(data, centers = 3)$cluster
table(km, classes)
## classes
## km A B C
## 1 1 20 1
## 2 0 0 19
## 3 19 0 0
The confusion matrix shows how the unsupervised \(K\)-means clusters align with the true class labels (the cluster numbers themselves are arbitrary). Here all but two observations fall into the cluster dominated by their own class, so \(K\)-means recovers the three classes almost perfectly.
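Since the cluster labels are arbitrary, a simple way to put a number on the agreement is cluster purity: match each cluster to its majority class and count the matches. The helper below (cluster_accuracy is our own hypothetical name, not from the text) sketches this.
# Hypothetical helper: proportion of observations whose cluster's majority
# class equals their true class (i.e. cluster purity)
cluster_accuracy <- function(cluster, truth) {
  tab <- table(cluster, truth)
  sum(apply(tab, 1, max)) / length(truth)
}
cluster_accuracy(km, classes)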
- Perform \(K\)-means clustering with \(K = 2\). Describe your results.
km2 <- kmeans(data, centers = 2)$cluster
table(km2, classes)
## classes
## km2 A B C
## 1 18 20 1
## 2 2 0 19
With \(K = 2\), the clusters no longer match the three true classes: one cluster captures nearly all of class C, while the other absorbs classes A and B together. This highlights a limitation of \(K\)-means when \(K\) is smaller than the true number of groups.
- Now perform \(K\)-means clustering with \(K = 4\), and describe your results.
km4 <- kmeans(data, centers = 4)$cluster
table(km4, classes)
## classes
## km4 A B C
## 1 0 7 2
## 2 18 1 0
## 3 0 0 18
## 4 2 12 0
With \(K = 4\), the algorithm over-segments the data: class B is split across two clusters, while classes A and C are still largely recovered. This reinforces that adding clusters does not necessarily lead to better interpretation.
- Now perform \(K\)-means clustering with \(K = 3\) on the first two principal component score vectors.
km_pca <- kmeans(pca$x[, 1:2], centers = 3)$cluster
table(km_pca, classes)
## classes
## km_pca A B C
## 1 0 20 2
## 2 20 0 0
## 3 0 0 18
Clustering on the first two principal component score vectors can work as well as clustering on all 50 variables, since the leading components retain the between-class signal while discarding much of the noise in the remaining dimensions. The confusion matrix shows essentially the same near-perfect recovery of the three classes as before.
- Using the scale() function, perform \(K\)-means clustering with \(K = 3\) on the data after scaling each variable to have standard deviation one.
km_scaled <- kmeans(scale(data), centers = 3)$cluster
table(km_scaled, classes)
## classes
## km_scaled A B C
## 1 1 20 1
## 2 19 0 0
## 3 0 0 19
Scaling equalizes the influence of each variable, which can improve or worsen clustering depending on whether the class signal lies in high- or low-variance variables. Here the confusion matrix is essentially the same as for the unscaled data with \(K = 3\): all of the simulated variables were generated with standard-normal noise, so they are already on comparable scales and scaling changes the solution very little.
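To check directly how similar the scaled and unscaled solutions are, we can cross-tabulate the two sets of cluster assignments (km and km_scaled from above); since the cluster numbers are arbitrary, agreement shows up as each row being concentrated in a single column.
# Cross-tabulation of the unscaled and scaled K = 3 cluster assignments
table(Unscaled = km, Scaled = km_scaled)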