Problem 1

Consider the USArrests data. We will now perform hierarchical clustering on the states.

arrests <- USArrests
  (a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
hc_complete <- hclust(dist(arrests), method = 'complete')
plot(hc_complete, main = 'Complete Linkage Dendrogram')

  (b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?

We can use the cutree() function with k = 3 to see which states belong to which cluster, and then plot and tabulate the resulting cluster assignments.

clusters_b <- cutree(hc_complete, k=3)
clusters_b
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2
plot(clusters_b)

table(clusters_b)
## clusters_b
##  1  2  3 
## 16 14 20
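
As an optional visual check (not required by the question), we can also outline the three clusters directly on the dendrogram with rect.hclust():

plot(hc_complete, main = 'Complete Linkage Dendrogram')
rect.hclust(hc_complete, k = 3) # draw a box around each of the three clusters from part (b)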
  (c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
scaled_arrests <- scale(arrests)

hc_scaled <- hclust(dist(scaled_arrests), method = 'complete')
plot(hc_scaled, main = 'Complete Linkage Dendrogram')

clusters_c <- cutree(hc_scaled, k=3)
clusters_c
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              2              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              3              1              3 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              2              3              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              1              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              3              3              3
plot(clusters_c)

table(clusters_c)
## clusters_c
##  1  2  3 
##  8 11 31
  (d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
table(clusters_b)
## clusters_b
##  1  2  3 
## 16 14 20
table(clusters_c)
## clusters_c
##  1  2  3 
##  8 11 31

We can see that scaling the variables changes both the cluster assignments and the cluster sizes (16/14/20 unscaled versus 8/11/31 scaled). The variables should be scaled before the dissimilarities are computed, because they are measured on very different scales: Murder, Assault, and Rape are rates per 100,000 residents of very different magnitudes, and UrbanPop is a percentage. Without scaling, the variable with the largest spread (Assault) dominates the Euclidean distances and plays an exaggerated role in the clustering.
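
A quick way to confirm this is to compare the standard deviations of the four variables before and after scaling; Assault's spread dwarfs the others on the raw scale:

apply(arrests, 2, sd)        # Assault has a much larger standard deviation than the other variables
apply(scaled_arrests, 2, sd) # after scaling, every variable has standard deviation one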

Problem 2

We use the following R code to simulate a data set and then perform PCA and K-means clustering on it.

set.seed(1)
x = matrix(runif(n=3000, 0, 0.01), ncol=50)
x[1:20, 2] = 2
x[21:40, 1] = 5
x[21:40, 2] = 5
x[41:60, 1] = 2

# making labels
class_labels <- factor(rep(c("Class 1", "Class 2", "Class 3"), each = 20))
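
Before running PCA, it is worth confirming the structure of the simulated data (a quick check, not part of the assignment):

dim(x)              # 60 observations in 50 dimensions
table(class_labels) # 20 observations in each of the three classes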
  (a) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes.
library(ggplot2)
pca <- prcomp(x, rank. = 2)

pc_scores <- as.data.frame(pca$x[, 1:2])

pc_scores$class <- class_labels


ggplot(pc_scores, aes(x = PC1, y = PC2, color = class)) +
  geom_point(size = 3, alpha = 0.8) + 
  labs(title = "Two Principal Components",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal() +
  scale_color_manual(values = c("pink", "lightskyblue", "palegreen"))  # Custom colors for classes

  (b) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?

Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.

kcluster <- kmeans(x, 3, nstart = 10)
table(kcluster$cluster, class_labels)
##    class_labels
##     Class 1 Class 2 Class 3
##   1       0      20       0
##   2      20       0       0
##   3       0       0      20

We see that K-means clustering with K = 3 recovers the true class labels perfectly: each cluster contains exactly the 20 observations of one class (the cluster numbers themselves are arbitrary).
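
Because K-means numbers its clusters arbitrarily, a convenient way to summarize the agreement is to count, for each cluster, the observations belonging to its most common true class. The helper below is a small sketch of our own, not part of the assignment:

cluster_agreement <- function(clusters, labels) {
  # proportion of observations that fall in their cluster's majority class
  tab <- table(clusters, labels)
  sum(apply(tab, 1, max)) / length(labels)
}
cluster_agreement(kcluster$cluster, class_labels) # equals 1 here, a perfect match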

  (c) Perform K-means clustering with K = 2. Describe your results.
kcluster <- kmeans(x, 2, nstart = 10)
table(kcluster$cluster, class_labels)
##    class_labels
##     Class 1 Class 2 Class 3
##   1       0      20       0
##   2      20       0      20

As expected, this version does not work as well. Classes 1 and 3 get grouped together in one cluster, while class 2 keeps its own cluster. This is because class 2 had both of the first two columns set to 5, while classes 1 and 3 each had only one of those columns set to 2, so they sit much closer to each other than either does to class 2.
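
We can check this intuition by computing the class means of the two informative columns and the distances between them (a quick sketch of our own):

centers <- apply(x[, 1:2], 2, function(col) tapply(col, class_labels, mean))
centers       # class 1 is roughly (0, 2), class 2 is (5, 5), class 3 is roughly (2, 0)
dist(centers) # classes 1 and 3 are much closer to each other than either is to class 2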

  (d) Now perform K-means clustering with K = 4, and describe your results.
kcluster <- kmeans(x, 4, nstart = 10)
table(kcluster$cluster, class_labels)
##    class_labels
##     Class 1 Class 2 Class 3
##   1       0       0       9
##   2       0       0      11
##   3      20       0       0
##   4       0      20       0

With K = 4, classes 1 and 2 are still clustered perfectly, but when forced to form a fourth, nonexistent cluster, K-means splits class 3 into two separate clusters (of 9 and 11 observations).
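
One way to see that K = 3 is the natural number of clusters here is to track the total within-cluster sum of squares as K increases (a quick check of our own); it should drop sharply up to K = 3 and barely change afterwards:

sapply(2:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)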

  (e) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
kmean_pca <- kmeans(pca$x, 3, nstart = 10)
table(kmean_pca$cluster, class_labels)
##    class_labels
##     Class 1 Class 2 Class 3
##   1       0       0      20
##   2       0      20       0
##   3      20       0       0
plot(pca)

Using K-means clustering on the first two principal component score vectors, every class is again clustered perfectly. The scree plot shows why this makes sense: the first two components account for essentially all of the variation in the data, so almost no information is lost by clustering in two dimensions instead of fifty.
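
To make the scree-plot argument concrete, we can compute the proportion of the total variance captured by the first two components; pca$sdev still holds the standard deviations of all of the components even though rank. = 2 kept only two score vectors:

sum(pca$sdev[1:2]^2) / sum(pca$sdev^2) # share of variance explained by PC1 and PC2 (essentially all of it here)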

  (f) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (a)? Explain.
k_scale <- kmeans(scale(x), 3, nstart = 10)
table(k_scale$cluster, class_labels)
##    class_labels
##     Class 1 Class 2 Class 3
##   1       9       4       9
##   2       1      13       1
##   3      10       3      10

We see that when the data are scaled, K-means clustering performs much worse, with many more misclassifications than in the unscaled analysis of part (a). This is because the class structure lives entirely in the first two columns, whose values (2s and 5s) are large relative to the uniform noise in the other 48 columns. Scaling every column to standard deviation one shrinks that signal while inflating the noise, so the 48 noise columns carry just as much weight as the two informative ones in the distance calculations.
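
We can see the mechanism directly by comparing column standard deviations before and after scaling (a quick check of our own): the two informative columns dwarf the noise columns before scaling, but contribute no more than any other column afterwards.

apply(x, 2, sd)[1:4]        # columns 1 and 2 have large standard deviations; the noise columns are tiny
apply(scale(x), 2, sd)[1:4] # after scaling, every column has standard deviation one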

We can also demonstrate this phenomenon by using PCA with scaling:

pca <- prcomp(x, scale. = T, rank. = 2)

pc_scores <- as.data.frame(pca$x[, 1:2])

pc_scores$class <- class_labels


ggplot(pc_scores, aes(x = PC1, y = PC2, color = class)) +
  geom_point(size = 3, alpha = 0.8) + 
  labs(title = "Two Principal Components",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal() +
  scale_color_manual(values = c("pink", "lightskyblue", "palegreen"))  # Custom colors for classes

plot(pca)

We can see that the classes are not as cleanly separated as in the first PCA plot, and the proportion of variance explained by the first two principal components is much smaller than it was without scaling.

Problem 3

On the book website, https://www.statlearning.com/resources-second-edition, there is a gene expression data set (Ch12Ex13.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group. The data are also available on Blackboard.

  (a) Load in the data using read.csv(). You will need to select header = F.
genes <- read.csv('Ch12Ex13.csv', header = F)
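
As a quick check, the file should load as 1,000 rows (genes) by 40 columns (tissue samples):

dim(genes) # 1000 genes (rows) by 40 samples (columns)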
  (b) Apply hierarchical clustering to the samples using correlation-based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?
dd <- as.dist((1 - cor(genes))/2) # correlation-based distance between the 40 samples (columns)

hc_complete <- hclust(dd, method = 'complete')
plot(hc_complete, main = 'Complete')

hc_single <- hclust(dd, method = 'single')
plot(hc_single, main = 'Single')

hc_average <- hclust(dd, method = 'average')
plot(hc_average, main = 'Average')

hc_centroid <- hclust(dd, method = 'centroid')
plot(hc_centroid, main = 'Centroid')

It appears that complete, single, and centroid linkage generally split the samples into two similarly sized groups, while average linkage produces three groups, one smaller than the other two. So the samples do largely separate into two groups, but the result depends on the type of linkage used.
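
Since the true grouping is known (the first 20 samples are healthy and the second 20 are diseased), we could also tabulate how each linkage splits the samples when its tree is cut into two clusters (an extra check, not required by the question):

true_groups <- rep(c('Healthy', 'Diseased'), each = 20)
table(cutree(hc_complete, k = 2), true_groups)
table(cutree(hc_single, k = 2), true_groups)
table(cutree(hc_average, k = 2), true_groups)
table(cutree(hc_centroid, k = 2), true_groups)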

  (c) Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.

To see which genes differ the most, we can use PCA. Because the first principal component is the direction that explains the largest share of the variance in the data, the genes that contribute most heavily to it are the ones that differ the most. We can therefore look at the 10 genes with the largest absolute loadings on the first principal component.

pca <- prcomp(genes, scale. = T, center = T)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.5401 1.1455 1.12727 1.10915 1.09034 1.06729 1.06464
## Proportion of Variance 0.1613 0.0328 0.03177 0.03076 0.02972 0.02848 0.02834
## Cumulative Proportion  0.1613 0.1941 0.22587 0.25663 0.28635 0.31483 0.34316
##                            PC8     PC9    PC10   PC11   PC12   PC13    PC14
## Standard deviation     1.05514 1.04623 1.03430 1.0218 1.0040 1.0000 0.99680
## Proportion of Variance 0.02783 0.02736 0.02674 0.0261 0.0252 0.0250 0.02484
## Cumulative Proportion  0.37100 0.39836 0.42510 0.4512 0.4764 0.5014 0.52625
##                           PC15   PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.97913 0.9674 0.96207 0.95900 0.94880 0.93664 0.91466
## Proportion of Variance 0.02397 0.0234 0.02314 0.02299 0.02251 0.02193 0.02092
## Cumulative Proportion  0.55021 0.5736 0.59675 0.61974 0.64225 0.66418 0.68509
##                           PC22    PC23    PC24    PC25    PC26   PC27    PC28
## Standard deviation     0.90370 0.89974 0.87668 0.86341 0.85710 0.8532 0.84544
## Proportion of Variance 0.02042 0.02024 0.01921 0.01864 0.01837 0.0182 0.01787
## Cumulative Proportion  0.70551 0.72575 0.74496 0.76360 0.78196 0.8002 0.81803
##                           PC29    PC30    PC31    PC32   PC33    PC34    PC35
## Standard deviation     0.83839 0.82676 0.81452 0.79681 0.7924 0.78095 0.77890
## Proportion of Variance 0.01757 0.01709 0.01659 0.01587 0.0157 0.01525 0.01517
## Cumulative Proportion  0.83561 0.85269 0.86928 0.88515 0.9009 0.91610 0.93126
##                           PC36   PC37    PC38    PC39    PC40
## Standard deviation     0.77086 0.7509 0.74522 0.72987 0.70944
## Proportion of Variance 0.01486 0.0141 0.01388 0.01332 0.01258
## Cumulative Proportion  0.94612 0.9602 0.97410 0.98742 1.00000
plot(pca)

We can see from this that the first principal component explains by far the largest share of the variance, confirming that it is the natural place to look for the genes that differ the most.

loadings <- pca$rotation
most_different_genes <- order(abs(loadings[,1]), decreasing = T) # gets the ones contributing most to the first PC
most_different_genes[1:10]
##  [1] 28 22 39 34 21 31 38 30 36 24

The ten genes that differ the most, in decreasing order of absolute loading on the first principal component, are: 28, 22, 39, 34, 21, 31, 38, 30, 36, 24.