# Proportion of variance explained (PVE), first from the sdev output of prcomp()
pr.out <- prcomp(USArrests, scale = TRUE)
pr.var <- pr.out$sdev^2
pve <- pr.var / sum(pr.var)
sum(pr.var) # equals 4: each of the four standardized variables has unit variance
## [1] 4
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
# Verify the PVE directly from the loadings: project the scaled data onto
# each principal component and divide by the total sum of squares
loadings <- pr.out$rotation
USArrests2 <- scale(USArrests)
sumvar <- sum(USArrests2^2) # total sum of squares of the scaled data
apply((USArrests2 %*% loadings)^2, 2, sum) / sumvar
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
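As a quick visual check (a minimal sketch reusing the pve vector computed above), we can plot the PVE and the cumulative PVE against the component number:
par(mfrow = c(1, 2))
plot(pve, xlab = "Principal Component", ylab = "PVE", type = "b", ylim = c(0, 1))
plot(cumsum(pve), xlab = "Principal Component", ylab = "Cumulative PVE", type = "b", ylim = c(0, 1))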
Consider the USArrests data. We will now perform hierarchical clustering on the states.
set.seed(2)
hc.complete <- hclust(dist(USArrests), method = "complete")
plot(hc.complete)
ct <- cutree(hc.complete, 3)
sapply(1:3, function(i) names(ct)[ct == i])
## [[1]]
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
##
## [[2]]
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
##
## [[3]]
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
sd.data <- scale(USArrests)
hc.complete.sd <- hclust(dist(sd.data), method = "complete")
plot(hc.complete.sd)
ct2 <- cutree(hc.complete.sd, 3)
sapply(1:3, function(i) names(ct2)[ct2 == i])
## [[1]]
## [1] "Alabama" "Alaska" "Georgia" "Louisiana"
## [5] "Mississippi" "North Carolina" "South Carolina" "Tennessee"
##
## [[2]]
## [1] "Arizona" "California" "Colorado" "Florida" "Illinois"
## [6] "Maryland" "Michigan" "Nevada" "New Mexico" "New York"
## [11] "Texas"
##
## [[3]]
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Missouri" "Montana" "Nebraska" "New Hampshire"
## [17] "New Jersey" "North Dakota" "Ohio" "Oklahoma"
## [21] "Oregon" "Pennsylvania" "Rhode Island" "South Dakota"
## [25] "Utah" "Vermont" "Virginia" "Washington"
## [29] "West Virginia" "Wisconsin" "Wyoming"
table(cutree(hc.complete, 3), cutree(hc.complete.sd, 3))
##
## 1 2 3
## 1 6 9 1
## 2 2 2 10
## 3 0 0 20
Scaling the variables affects the clusters obtained, although the two trees are broadly similar. The variables should be scaled before the inter-observation dissimilarities are computed because they are measured on different units: Murder, Assault, and Rape are arrests per 100,000 residents, while UrbanPop is a percentage. Without scaling, Assault dominates the Euclidean distances simply because it has by far the largest variance.
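To see this concretely, compare the column variances of the raw data; Assault's variance (roughly 6945) is orders of magnitude larger than the others (roughly 19 to 210):
apply(USArrests, 2, var)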
In this problem, we will generate simulated data and then perform PCA and K-means clustering on it.
set.seed(2)
# 60 observations (three classes of 20) on 50 variables, starting as small noise
x <- matrix(rnorm(20 * 3 * 50, mean = 0, sd = 0.001), ncol = 50)
# Separate the classes in the first two coordinates
x[1:20, 2] <- 1 # class 1 centered near (0, 1)
x[21:40, 1] <- 2 # class 2 centered near (2, 2)
x[21:40, 2] <- 2
x[41:60, 1] <- 1 # class 3 centered near (1, 0)
true.labels <- c(rep(1, 20), rep(2, 20), rep(3, 20))
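As a sanity check on the design, the three class centroids in the first two coordinates should be approximately (0, 1), (2, 2), and (1, 0):
round(rbind(colMeans(x[1:20, 1:2]), colMeans(x[21:40, 1:2]), colMeans(x[41:60, 1:2])), 3)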
pr.out <- prcomp(x)
plot(pr.out$x[, 1:2], col = true.labels, xlab = "Z1", ylab = "Z2", pch = 19) # color by true class
km.out <- kmeans(x, 3, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3
## 1 0 0 20
## 2 20 0 0
## 3 0 20 0
K-means recovers the three true classes perfectly; the cluster numbers are just an arbitrary relabeling (class 1 maps to cluster 3, class 2 to cluster 1, and class 3 to cluster 2).
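Because the cluster numbering is arbitrary, a label-permutation-invariant score (a minimal sketch) is the purity: match each cluster to its majority class and compute the fraction of observations that agree.
tab <- table(true.labels, km.out$cluster)
sum(apply(tab, 2, max)) / length(true.labels) # equals 1 for a perfect clustering, as here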
km.out <- kmeans(x, 2, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2
## 1 20 0
## 2 0 20
## 3 20 0
With K = 2 there are only two clusters for three true classes, so the classes cannot all be recovered. Here all of the class 2 observations form their own cluster (cluster 2), while classes 1 and 3 are merged into cluster 1.
km.out <- kmeans(x, 4, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3 4
## 1 11 9 0 0
## 2 0 0 20 0
## 3 0 0 0 20
With K = 4 the clusters again cannot align with the three true classes. Class 1 is split between clusters 1 and 2 (11 and 9 observations), while class 2 maps entirely to cluster 3 and class 3 maps entirely to cluster 4.
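One way to compare different choices of K (a minimal sketch using the tot.withinss component returned by kmeans()) is to plot the total within-cluster sum of squares against K and look for an elbow:
wss <- sapply(1:6, function(k) kmeans(x, k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")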
km.out <- kmeans(pr.out$x[, 1:2], 3, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3
## 1 0 0 20
## 2 0 20 0
## 3 20 0 0
The clusters again align perfectly with the true class labels (up to relabeling), so K-means clustering on the first two principal components is as effective as on the original data. This is expected: the class means differ only in the first two coordinates, so a two-dimensional principal-component subspace captures essentially all of the between-class signal.
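As a quick check (reusing pr.out fitted on the simulated data above), the first two components should account for nearly all of the variance:
pve2 <- pr.out$sdev^2 / sum(pr.out$sdev^2)
cumsum(pve2)[1:2]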
km.out <- kmeans(scale(x), 3, nstart = 20)
table(true.labels, km.out$cluster)
##
## true.labels 1 2 3
## 1 9 2 9
## 2 2 18 0
## 3 7 1 12
After scaling, no class is recovered cleanly: class 2 is mostly assigned to cluster 2 (18 of 20 observations), while classes 1 and 3 are split across all three clusters. This is much worse than the perfect result on the unscaled data. The reason is that scale() gives every column unit variance, so the 48 pure-noise columns (original sd = 0.001) are inflated until they swamp the two columns that carry the class means.
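A quick look at the column standard deviations makes this concrete; only the first two columns carry signal before scaling, and scaling erases that distinction:
round(apply(x, 2, sd)[1:4], 3) # signal columns 1-2 vs. noise columns 3-4
range(apply(scale(x), 2, sd)) # after scaling, every column has sd 1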