data("USArrests")
USArrests_scaled <- scale(USArrests)
pca_result <- prcomp(USArrests_scaled)
sdev <- pca_result$sdev
pve_a <- sdev^2 / sum(sdev^2)
pve_a
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
loadings <- pca_result$rotation # principal component loadings
scores <- USArrests_scaled %*% loadings # principal component scores
pc_var <- apply(scores^2, 2, sum) # Numerator: sum of squared scores for each PC
total_var <- sum(USArrests_scaled^2) # Denominator: total sum of squares in the scaled data
pve_b <- pc_var / total_var
pve_a
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
pve_b
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Both methods give identical PVE values, confirming that the sdev-based and score-based formulations are equivalent when the data are centered and scaled consistently.
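As an optional sanity check (a small sketch beyond the original output), the two PVE vectors can be compared directly and summarized with a cumulative scree plot:
all.equal(as.numeric(pve_a), as.numeric(pve_b)) # should return TRUE
plot(cumsum(pve_a), type = "b", ylim = c(0, 1),
     xlab = "Principal Component", ylab = "Cumulative PVE",
     main = "Cumulative Proportion of Variance Explained")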
dist_usarrests <- dist(USArrests)
hc_complete <- hclust(dist_usarrests, method = "complete")
plot(hc_complete, main = "Dendrogram - Complete Linkage", xlab = "", sub = "", cex = 0.6)
clusters_divided <- cutree(hc_complete, k = 3)
split(rownames(USArrests), clusters_divided)
## $`1`
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
##
## $`2`
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
##
## $`3`
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
Cutting the dendrogram into three clusters yields the groups above, with the states belonging to each cluster listed.
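A quick tabulation (an optional check, not part of the original output) shows how many states fall into each cluster:
table(clusters_divided) # cluster sizes: 16, 14, and 20 states, per the listing above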
USArrests_scaled <- scale(USArrests)
dist_scaled <- dist(USArrests_scaled)
hc_scaled <- hclust(dist_scaled, method = "complete")
plot(hc_scaled, main = "Dendrogram - Scaled Complete Linkage", xlab = "", sub = "", cex = 0.6)
clusters_scaled <- cutree(hc_scaled, k = 3)
split(rownames(USArrests_scaled), clusters_scaled)
## $`1`
## [1] "Alabama" "Alaska" "Georgia" "Louisiana"
## [5] "Mississippi" "North Carolina" "South Carolina" "Tennessee"
##
## $`2`
## [1] "Arizona" "California" "Colorado" "Florida" "Illinois"
## [6] "Maryland" "Michigan" "Nevada" "New Mexico" "New York"
## [11] "Texas"
##
## $`3`
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Missouri" "Montana" "Nebraska" "New Hampshire"
## [17] "New Jersey" "North Dakota" "Ohio" "Oklahoma"
## [21] "Oregon" "Pennsylvania" "Rhode Island" "South Dakota"
## [25] "Utah" "Vermont" "Virginia" "Washington"
## [29] "West Virginia" "Wisconsin" "Wyoming"
This is the dendrogram for the scaled data. Scaling ensures that all variables contribute equally to the distance calculations, which produces a different set of clusters.
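To see exactly how the memberships shift, the unscaled and scaled assignments can be cross-tabulated (an additional check, not part of the original output):
table(Unscaled = clusters_divided, Scaled = clusters_scaled) # counts of states moving between clusters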
Scaling changes the clustering results substantially. Without scaling, the variable with the largest variance (Assault) dominates the distance calculation and biases the clusters; after scaling, all variables contribute equally. In my view, variables should be scaled before computing dissimilarities whenever they are measured on different scales, so that no single high-variance variable dominates.
Scaling is therefore essential when variables are measured in different units or have vastly different variances, as in the USArrests dataset.
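The per-variable variances in the raw data illustrate why scaling matters here (a quick check on the unscaled USArrests data frame):
apply(USArrests, 2, var) # Assault's variance is far larger than that of the other variables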
set.seed(1234)
class1 <- matrix(rnorm(20 * 50, mean = 0), nrow = 20) # 20 observations, 50 features, mean shift 0
class2 <- matrix(rnorm(20 * 50, mean = 3), nrow = 20) # 20 observations, 50 features, mean shift +3
class3 <- matrix(rnorm(20 * 50, mean = -3), nrow = 20) # 20 observations, 50 features, mean shift -3
data <- rbind(class1, class2, class3)
true_labels <- rep(1:3, each = 20)
pca_result <- prcomp(data)
plot(pca_result$x[, 1:2], col = true_labels, pch = 19,
xlab = "PC1", ylab = "PC2", main = "PCA - First Two Components")
The plot shows clear separation among the three classes along the first two principal components.
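Because the class means differ strongly, the first two components should capture most of the between-class variation; this can be checked from the PCA importance summary (an optional check):
summary(pca_result)$importance[, 1:2] # standard deviation, proportion, and cumulative PVE for PC1 and PC2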
set.seed(1234)
kmeans_3 <- kmeans(data, centers = 3, nstart = 20)
table(Cluster = kmeans_3$cluster, Truth = true_labels)
## Truth
## Cluster 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
K-means with K = 3 recovers the true classes exactly; only the arbitrary cluster labels are permuted relative to the class labels.
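Because the labels are arbitrary, a permutation-invariant agreement measure such as the adjusted Rand index gives a cleaner summary; a minimal sketch, assuming the mclust package is available:
library(mclust)
adjustedRandIndex(kmeans_3$cluster, true_labels) # a value of 1 indicates perfect agreement up to relabeling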
set.seed(1234)
kmeans_2 <- kmeans(data, centers = 2, nstart = 20)
table(Cluster = kmeans_2$cluster, Truth = true_labels)
## Truth
## Cluster 1 2 3
## 1 0 20 0
## 2 20 0 20
With K = 2, K-means forces two of the true classes (here classes 1 and 3) into a single cluster, so the class structure is only partially recovered.
set.seed(1234)
kmeans_4 <- kmeans(data, centers = 4, nstart = 20)
table(Cluster = kmeans_4$cluster, Truth = true_labels)
## Truth
## Cluster 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 10
## 4 0 0 10
With K = 4, K-means splits one true class (class 3) into two clusters of 10 observations each, creating an unnecessary partition, so this choice is also not preferable.
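Comparing the total within-cluster sum of squares across the three fitted models provides an elbow-style check supporting K = 3 (an additional comparison, reusing the objects above):
c(K2 = kmeans_2$tot.withinss,
  K3 = kmeans_3$tot.withinss,
  K4 = kmeans_4$tot.withinss) # expect a large drop from K = 2 to K = 3 and only a small one from K = 3 to K = 4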
set.seed(1234)
pc_data <- pca_result$x[, 1:2]
kmeans_pc <- kmeans(pc_data, centers = 3, nstart = 20)
table(Cluster = kmeans_pc$cluster, Truth = true_labels)
## Truth
## Cluster 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
Clustering on the first two principal components preserves the class structure while greatly reducing the dimensionality (from 50 variables to 2), which can also make the results easier to visualize.
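A side-by-side plot of the PC scores colored by the K-means assignment and by the true labels makes this agreement visible (an optional visualization sketch):
par(mfrow = c(1, 2))
plot(pc_data, col = kmeans_pc$cluster, pch = 19, main = "K-means on first two PCs")
plot(pc_data, col = true_labels, pch = 19, main = "True classes")
par(mfrow = c(1, 1)) # reset the plotting layout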
set.seed(12345)
scaled_data <- scale(data)
kmeans_scaled <- kmeans(scaled_data, centers = 3, nstart = 20)
table(Cluster = kmeans_scaled$cluster, Truth = true_labels)
## Truth
## Cluster 1 2 3
## 1 0 20 0
## 2 0 0 20
## 3 20 0 0
Scaling equalizes each variable's influence on the distance computation. Here the clusters match the unscaled K = 3 solution (up to label permutation) because every feature was simulated with the same standard deviation, but scaling can matter substantially when feature variances differ.
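As a final check, cross-tabulating the scaled and unscaled K = 3 assignments confirms that the partition itself is unchanged for these simulated data (an optional comparison):
table(Scaled = kmeans_scaled$cluster, Unscaled = kmeans_3$cluster) # a permutation matrix indicates identical partitions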