Consider the USArrests data. We will now perform hierarchical clustering on the states.
arrests <- USArrests
hc_complete <- hclust(dist(arrests), method = 'complete')
plot(hc_complete, main = 'Complete Linkage Dendrogram')
We can use the cutree() function with k = 3 to see which states belong to which cluster, and then use a plot and a table to show how the states are distributed across the three clusters.
clusters_b <- cutree(hc_complete, k=3)
clusters_b
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
plot(clusters_b)
table(clusters_b)
## clusters_b
## 1 2 3
## 16 14 20
scaled_arrests <- scale(arrests)
hc_scaled <- hclust(dist(scaled_arrests), method = 'complete')
plot(hc_scaled, main = 'Complete Linkage Dendrogram')
clusters_c <- cutree(hc_scaled, k=3)
clusters_c
## Alabama Alaska Arizona Arkansas California
## 1 1 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 2 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 3 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 2 3 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 3 3 3
plot(clusters_c)
table(clusters_c)
## clusters_c
## 1 2 3
## 8 11 31
table(clusters_b)
## clusters_b
## 1 2 3
## 16 14 20
table(clusters_c)
## clusters_c
## 1 2 3
## 8 11 31
We can see that scaling the variables leads to different clusters being obtained and a different distribution of states across them. The variables should be scaled because they are measured on very different scales: UrbanPop is a percentage, while the other three variables are arrest rates per 100,000, and Assault in particular has a much larger variance than the rest. Without scaling, the variables with the largest values dominate the Euclidean distances and play an exaggerated role in the clustering.
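A quick check of the raw variances shows how unbalanced the variables are (a minimal sketch):
# Assault has a far larger variance than the other variables, so it would
# dominate the unscaled Euclidean distances used by hclust()
apply(arrests, 2, var)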
We use the following R code to simulate a data set and then perform PCA and K-means clustering on it.
set.seed(1)
# 60 observations on 50 variables; the baseline values are small uniform noise
x = matrix(runif(n=3000, 0, 0.01), ncol=50)
# shift the first two columns to create three classes of 20 observations each
x[1:20, 2] = 2    # Class 1: second column set to 2
x[21:40, 1] = 5   # Class 2: first two columns set to 5
x[21:40, 2] = 5
x[41:60, 1] = 2   # Class 3: first column set to 2
# making class labels for the 60 observations
class_labels <- factor(rep(c("Class 1", "Class 2", "Class 3"), each = 20))
library(ggplot2)
pca <- prcomp(x, rank. = 2)
pc_scores <- as.data.frame(pca$x[, 1:2])
pc_scores$class <- class_labels
ggplot(pc_scores, aes(x = PC1, y = PC2, color = class)) +
geom_point(size = 3, alpha = 0.8) +
labs(title = "Two Principal Components",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal() +
scale_color_manual(values = c("pink", "lightskyblue", "palegreen")) # Custom colors for classes
Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
kcluster <- kmeans(x, 3, nstart = 10)
table(kcluster$cluster, class_labels)
## class_labels
## Class 1 Class 2 Class 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
We see that K-means clustering with k = 3 perfectly recovers the true class labels.
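Because K-means numbers its clusters arbitrarily, a label-invariant summary such as the adjusted Rand index reaches the same conclusion without having to match cluster numbers by eye (a minimal sketch, assuming the mclust package is installed):
library(mclust)  # provides adjustedRandIndex()
adjustedRandIndex(kcluster$cluster, class_labels)  # a value of 1 indicates perfect agreement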
kcluster <- kmeans(x, 2, nstart = 10)
table(kcluster$cluster, class_labels)
## class_labels
## Class 1 Class 2 Class 3
## 1 0 20 0
## 2 20 0 20
As expected, this version does not work as well. Classes 1 and 3 are grouped together in one cluster, while class 2 remains in its own cluster. This is likely because class 2 was created by setting both of the first two columns to 5, whereas classes 1 and 3 each had only one of the first two columns set to 2, so they lie much closer to each other than to class 2.
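A quick look at the fitted cluster centers for the two informative columns makes this concrete (a minimal check using the k = 2 fit above):
# One center sits near (5, 5) (class 2); the other sits near (1, 1), the
# average of the class 1 pattern (~0, 2) and the class 3 pattern (2, ~0)
round(kcluster$centers[, 1:2], 2)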
kcluster <- kmeans(x, 4, nstart = 10)
table(kcluster$cluster, class_labels)
## class_labels
## Class 1 Class 2 Class 3
## 1 0 0 9
## 2 0 0 11
## 3 20 0 0
## 4 0 20 0
In this setting, classes 1 and 2 are clustered perfectly, but when forced to form a fourth, nonexistent cluster, K-means splits class 3 across two clusters.
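One way to check that three clusters is the natural choice is to compare the total within-cluster sum of squares across several values of k; a minimal sketch, which should show a clear elbow at k = 3:
# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(x, k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")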
kmean_pca <- kmeans(pca$x, 3, nstart = 10)
table(kmean_pca$cluster, class_labels)
## class_labels
## Class 1 Class 2 Class 3
## 1 0 0 20
## 2 0 20 0
## 3 20 0 0
plot(pca)
Using K-means clustering on the first two principal component score vectors, we see that each class is clustered perfectly. The scree plot confirms that this makes sense: the first two components account for essentially all of the variation in the data.
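We can make this precise by computing the proportion of variance explained; a quick sketch (recomputing the full PCA so that all components are available):
# Proportion of variance explained by each component of the unscaled PCA
pca_full <- prcomp(x)
pve <- pca_full$sdev^2 / sum(pca_full$sdev^2)
round(cumsum(pve)[1:2], 4)  # cumulative proportion explained by the first two components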
k_scale <- kmeans(scale(x), 3, nstart = 10)
table(k_scale$cluster, class_labels)
## class_labels
## Class 1 Class 2 Class 3
## 1 9 4 9
## 2 1 13 1
## 3 10 3 10
We see that when the data are scaled, K-means clustering performs significantly worse, with many more misclassifications than when the unscaled data or the unscaled principal component scores were used above. Before scaling, the shifts in the first two columns are huge relative to the tiny uniform noise in the remaining columns, so those two columns dominate the distances; after scaling, every column has unit variance, so the 48 noise columns carry as much weight as the two informative ones and the class structure is largely drowned out.
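A quick look at the column standard deviations shows why (a minimal check):
col_sds <- apply(x, 2, sd)
round(col_sds[1:4], 4)   # columns 1-2 carry the signal; columns 3-4 are pure noise
range(col_sds[-(1:2)])   # the remaining 48 columns all have tiny spread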
We can also demonstrate this phenomenon by using PCA with scaling:
pca <- prcomp(x, scale. = T, rank. = 2)
pc_scores <- as.data.frame(pca$x[, 1:2])
pc_scores$class <- class_labels
ggplot(pc_scores, aes(x = PC1, y = PC2, color = class)) +
geom_point(size = 3, alpha = 0.8) +
labs(title = "Two Principal Components",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal() +
scale_color_manual(values = c("pink", "lightskyblue", "palegreen")) # Custom colors for classes
plot(pca)
We can see that the classes are not as cleanly separated as in the first PCA plot, and the proportion of variance explained by the first two principal components is much smaller than when the variables are left unscaled.
On the book website, https://www.statlearning.com/resources-second-edition, there is a gene expression data set (Ch12Ex13.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group. The data set is also available on Blackboard.
genes <- read.csv('Ch12Ex13.csv', header = F)
dd <- as.dist((1 - cor(genes))/2) # correlation based distance
hc_complete <- hclust(dd, method = 'complete')
plot(hc_complete, main = 'Complete')
hc_single <- hclust(dd, method = 'single')
plot(hc_single, main = 'Single')
hc_average <- hclust(dd, method = 'average')
plot(hc_average, main = 'Average')
hc_centroid <- hclust(dd, method = 'centroid')
plot(hc_centroid, main = 'Centroid')
Complete, single, and centroid linkage generally separate the samples into two similarly sized groups, while average linkage appears to produce three groups, one of which is smaller than the other two. So the samples do tend to split into two groups, but the exact split depends on the type of linkage used.
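To see whether a two-cluster cut lines up with the known health status, we can cross-tabulate the complete-linkage cut against the true group labels (a minimal sketch; the first 20 columns are the healthy samples, as described above):
# True group membership: columns 1-20 are healthy, columns 21-40 are diseased
true_groups <- rep(c("Healthy", "Diseased"), each = 20)
table(cutree(hc_complete, k = 2), true_groups)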
To determine which genes differ the most across the two groups, we can use PCA. Because the first principal component explains the largest share of the variance in the data, the genes that differ the most should be the ones contributing most heavily to it, so we look at the 10 largest absolute loadings on the first PC.
pca <- prcomp(genes, scale. = T, center = T)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5401 1.1455 1.12727 1.10915 1.09034 1.06729 1.06464
## Proportion of Variance 0.1613 0.0328 0.03177 0.03076 0.02972 0.02848 0.02834
## Cumulative Proportion 0.1613 0.1941 0.22587 0.25663 0.28635 0.31483 0.34316
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.05514 1.04623 1.03430 1.0218 1.0040 1.0000 0.99680
## Proportion of Variance 0.02783 0.02736 0.02674 0.0261 0.0252 0.0250 0.02484
## Cumulative Proportion 0.37100 0.39836 0.42510 0.4512 0.4764 0.5014 0.52625
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.97913 0.9674 0.96207 0.95900 0.94880 0.93664 0.91466
## Proportion of Variance 0.02397 0.0234 0.02314 0.02299 0.02251 0.02193 0.02092
## Cumulative Proportion 0.55021 0.5736 0.59675 0.61974 0.64225 0.66418 0.68509
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.90370 0.89974 0.87668 0.86341 0.85710 0.8532 0.84544
## Proportion of Variance 0.02042 0.02024 0.01921 0.01864 0.01837 0.0182 0.01787
## Cumulative Proportion 0.70551 0.72575 0.74496 0.76360 0.78196 0.8002 0.81803
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.83839 0.82676 0.81452 0.79681 0.7924 0.78095 0.77890
## Proportion of Variance 0.01757 0.01709 0.01659 0.01587 0.0157 0.01525 0.01517
## Cumulative Proportion 0.83561 0.85269 0.86928 0.88515 0.9009 0.91610 0.93126
## PC36 PC37 PC38 PC39 PC40
## Standard deviation 0.77086 0.7509 0.74522 0.72987 0.70944
## Proportion of Variance 0.01486 0.0141 0.01388 0.01332 0.01258
## Cumulative Proportion 0.94612 0.9602 0.97410 0.98742 1.00000
plot(pca)
We can see from the summary and the scree plot that the first PC explains by far the most variance (about 16%), which supports using its loadings to identify the genes that differ the most.
loadings <- pca$rotation
most_different_genes <- order(abs(loadings[,1]), decreasing = T) # gets the ones contributing most to the first PC
most_different_genes[1:10]
## [1] 28 22 39 34 21 31 38 30 36 24
The top 10 most different genes, in order, are: 28, 22, 39, 34, 21, 31, 38, 30, 36, and 24.
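As a simple cross-check, we could also compare each gene's mean expression between the two groups directly (a minimal sketch; columns 1-20 are the healthy samples, as described above):
# Difference in mean expression: diseased samples (cols 21-40) minus healthy samples (cols 1-20)
mean_diff <- rowMeans(genes[, 21:40]) - rowMeans(genes[, 1:20])
head(order(abs(mean_diff), decreasing = TRUE), 10)  # genes with the largest mean shift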