Advanced Statistical Learning Homework 1

Juan Olazaba 2/3/2023

Question 1

Consider the USArrests data. We will now perform hierarchical clustering on the states.

  1. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
df = USArrests
hc.complete=hclust(dist(df), method="complete")
hc.single=hclust(dist(df), method="single")

par(mfrow = c(1,2))

plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.9)

# Draw rectangles around the clusters formed by cutting the complete-linkage tree at height 3
rect.hclust(hc.complete, h = 3, border = "red")

# Mark the cut height with a dashed red line
abline(h = 3, col = "red", lty = 2)

plot(hc.single, main = "Single Linkage", xlab = "", sub = "", cex = .9)

  2. If we cut the complete linkage dendrogram at a height of 3, nearly every state ends up in its own cluster, because the unscaled Euclidean distances between states are mostly far larger than 3. However, if we instead use cutree() and specify the number of clusters desired, we can separate the states into the following three clusters:

a: (Alabama, Alaska, Arizona, California, Delaware, Florida, Illinois, Louisiana, Maryland, Michigan, Mississippi, Nevada, New Mexico, New York, North Carolina, South Carolina)

b: (Colorado, Georgia, Massachusetts, Missouri, New Jersey, Oklahoma, Oregon, Rhode Island, Tennessee, Texas, Virginia, Washington, Wyoming)

c: (Connecticut, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Minnesota, Montana, Nebraska, New Hampshire, North Dakota, Ohio, Pennsylvania, South Dakota, Utah, Vermont, West Virginia, Wisconsin)

Please see below for additional analysis.

# Cut the complete-linkage dendrogram into k = 3 clusters
clusters <- cutree(hc.complete, k = 3)

# Display cluster membership
clusters
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2
# Visualize the clusters
plot(df, col = clusters, pch = 19, main = "Clusters")

  3. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
scaled <- scale(df)
complete = hclust(dist(scaled), method="complete")

plot(complete, main = "Complete Linkage (Scaled Data)", xlab = "", sub="", cex=0.9)

# Draw rectangles around the clusters formed by cutting the scaled tree at height 3
rect.hclust(complete, h = 3, border = "lightblue")

# Mark the cut height with a dashed line
abline(h = 3, col = "lightblue", lty = 2)
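
As a quick comparison (a small sketch reusing the trees fit above), we can cross-tabulate the three-cluster assignments obtained with and without scaling:

# Compare the 3-cluster solutions from the unscaled and scaled complete-linkage trees
table(unscaled = cutree(hc.complete, k = 3), scaled = cutree(complete, k = 3))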

  4. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

The variables should be scaled prior to performing hierarchical clustering. In our first attempt, cutting the complete-linkage tree at a height of 3 placed nearly every state in its own cluster, because the variables are measured in very different units (the crime variables are arrests per 100,000 residents while UrbanPop is a percentage), so Assault's much larger variance dominates the unscaled Euclidean distances. Scaling the variables to standard deviation one before computing the dissimilarities gives every variable comparable weight, which produces more balanced and interpretable clusters.
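
A quick way to see why scaling matters here (a small sketch using the columns of the built-in USArrests data) is to compare the standard deviations of the variables before and after scaling; Assault's standard deviation is much larger than the others, so it dominates the unscaled distances:

# Standard deviations before scaling: Assault dominates
apply(df, 2, sd)

# After scaling, every variable has standard deviation one
apply(scale(df), 2, sd)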

Question 2

We simulate a data set with three classes of 20 observations each (60 observations on 50 variables) and then perform PCA and K-means clustering on the data.
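
Since the generating code is not echoed here, the block below is only a minimal sketch consistent with the dimensions used in the rest of this question (60 observations, 50 variables, three classes of 20); the seed and the class mean shifts are assumptions, not the values behind the output shown below.

set.seed(2)

# 60 observations on 50 variables, in three classes of 20
x <- matrix(rnorm(60 * 50), ncol = 50)

# Assumed mean shifts (for illustration only): class 2 shifted up, class 3 shifted down,
# class 1 left near the middle
x[21:40, ] <- x[21:40, ] + 2
x[41:60, ] <- x[41:60, ] - 1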

  1. Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes.
# Standardize the features (optional)
scaled_features <- scale(x)

# Perform PCA
pca_result <- prcomp(scaled_features)
summary(pca_result)
## Importance of components:
##                            PC1     PC2     PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.89855 1.79565 1.69444 1.6355 1.55177 1.51767 1.45409
## Proportion of Variance 0.07209 0.06449 0.05742 0.0535 0.04816 0.04607 0.04229
## Cumulative Proportion  0.07209 0.13658 0.19400 0.2475 0.29566 0.34173 0.38401
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.43949 1.40333 1.37831 1.36463 1.31258 1.28962 1.24317
## Proportion of Variance 0.04144 0.03939 0.03799 0.03724 0.03446 0.03326 0.03091
## Cumulative Proportion  0.42546 0.46484 0.50284 0.54008 0.57454 0.60780 0.63871
##                           PC15    PC16   PC17    PC18    PC19    PC20    PC21
## Standard deviation     1.21324 1.17179 1.1467 1.09815 1.09131 1.01812 0.95896
## Proportion of Variance 0.02944 0.02746 0.0263 0.02412 0.02382 0.02073 0.01839
## Cumulative Proportion  0.66815 0.69561 0.7219 0.74603 0.76985 0.79058 0.80897
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.94701 0.91875 0.87787 0.85368 0.82025 0.78397 0.72699
## Proportion of Variance 0.01794 0.01688 0.01541 0.01458 0.01346 0.01229 0.01057
## Cumulative Proportion  0.82691 0.84379 0.85920 0.87378 0.88723 0.89953 0.91010
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.71670 0.67290 0.63895 0.62285 0.60539 0.58271 0.56877
## Proportion of Variance 0.01027 0.00906 0.00817 0.00776 0.00733 0.00679 0.00647
## Cumulative Proportion  0.92037 0.92942 0.93759 0.94535 0.95268 0.95947 0.96594
##                           PC36   PC37    PC38    PC39   PC40    PC41   PC42
## Standard deviation     0.54157 0.5246 0.45787 0.43383 0.4060 0.34415 0.3242
## Proportion of Variance 0.00587 0.0055 0.00419 0.00376 0.0033 0.00237 0.0021
## Cumulative Proportion  0.97181 0.9773 0.98150 0.98527 0.9886 0.99093 0.9930
##                           PC43    PC44    PC45    PC46   PC47    PC48    PC49
## Standard deviation     0.31167 0.27195 0.23054 0.20144 0.1876 0.15925 0.13987
## Proportion of Variance 0.00194 0.00148 0.00106 0.00081 0.0007 0.00051 0.00039
## Cumulative Proportion  0.99498 0.99646 0.99752 0.99833 0.9990 0.99954 0.99993
##                           PC50
## Standard deviation     0.05808
## Proportion of Variance 0.00007
## Cumulative Proportion  1.00000
colors <- c(rep("red", 20), rep("lightgreen", 20), rep("black", 20))

# Extract the scores for the first two principal components
PC_scores <- pca_result$x[, 1:2]

# Plot the scores
plot(PC_scores, pch = 19, col = colors, main = "PC1 vs PC2", xlab = "PC1", ylab = "PC2")
legend("topright", legend = c("Class 1", "Class 2", "Class 3"), col = 1:3, pch = 19, title = "Species")

  2. Perform K-means clustering of the observations with K = 3.

How well do the clusters obtained from K-means compare to the true class labels? K-means with K = 3 separates the observations perfectly: as the table below shows, each cluster contains exactly the 20 observations of one true class (the cluster numbers themselves are arbitrary labels, so only the grouping matters).

set.seed(4)
k = kmeans(x, centers = 3)
true_labels = c(rep(1,20), rep(2,20), rep(3,20))

table(k$cluster, true_labels)
##    true_labels
##      1  2  3
##   1  0  0 20
##   2 20  0  0
##   3  0 20  0
plot(x, col=(k$cluster+1), main="K-Means Clustering Results with K=3", xlab="", ylab="", pch=20, cex=2)

  3. Perform K-means clustering with K = 2. Describe your results.

With K = 2, K-means still does not split any class across clusters, but the middle class is merged into the same cluster as one of the other classes, so the three-group structure can no longer be recovered.

set.seed(40)
k2 = kmeans(x, centers = 2)

table(k2$cluster, true_labels)
##    true_labels
##      1  2  3
##   1  0 20  0
##   2 20  0 20
  4. Now perform K-means clustering with K = 4, and describe your results.

With K = 4, the algorithm splits the middle class into two clusters (of 11 and 9 observations), while the two extreme classes each keep their own intact cluster.

set.seed(60)
k4 = kmeans(x, centers = 4)

table(k4$cluster, true_labels)
##    true_labels
##      1  2  3
##   1  0  0 20
##   2 11  0  0
##   3  9  0  0
##   4  0 20  0
  5. Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.

Clustering on only the first two principal component score vectors performed poorly here: the table shows every cluster mixing observations from all three classes. Although PCA can help by reducing dimensionality and noise, the first two components explain only about 14% of the variance in this data set (see the PCA summary above), so much of the class-separating signal may be lost before K-means is run. A quick visual check of the assignments in PC score space is sketched after the table below.

set.seed(1)

PCAs = kmeans(PC_scores, centers = 3)
true_labels = c(rep(1,20), rep(2,20), rep(3,20))

table(PCAs$cluster, true_labels)
##    true_labels
##      1  2  3
##   1  9  8  9
##   2  8  2  8
##   3  3 10  3
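
As a visual check (a sketch reusing objects already defined above), we can color the PC score plot by the K-means assignments and compare it with the true-class coloring from part 1:

par(mfrow = c(1, 2))

# True classes in PC space (colors defined in part 1)
plot(PC_scores, col = colors, pch = 19, main = "True Classes", xlab = "PC1", ylab = "PC2")

# K-means (K = 3) assignments on the same score vectors
plot(PC_scores, col = PCAs$cluster + 1, pch = 19, main = "K-means on PC Scores", xlab = "PC1", ylab = "PC2")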
  6. Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (a)? Explain.

Scaling each feature to have standard deviation one also changed the results. One cluster now captures most of class 2, but classes 1 and 3 are largely merged into a single cluster and the remaining small cluster mixes all three classes, so the class recovery is much worse than with the raw data in part 2. Because the simulated features were already on a similar scale, standardizing mainly rescales the class mean shifts relative to the noise, which in this case hurt the separation.

set.seed(10)

sf = kmeans(scaled_features, centers = 3)
true_labels = c(rep(1,20), rep(2,20), rep(3,20))

table(sf$cluster, true_labels)
##    true_labels
##      1  2  3
##   1  0 16  0
##   2 17  1 15
##   3  3  3  5

Question 3

On the book website, https://www.statlearning.com/resources-second-edition, there is a gene expression data set (Ch12Ex13.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group. The data set is also available on Blackboard.
  1. Load in the data using read.csv(). You will need to select header = F.
library(dplyr)
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(readr)
Ch12Ex13 <- read.csv("Ch12Ex13.csv", header = F)
View(Ch12Ex13)
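
Although View() opens the data interactively, a quick dim() check (a small sketch) confirms the orientation used in the rest of this question: the 1,000 genes are the rows and the 40 tissue samples are the columns.

# 1000 genes (rows) by 40 samples (columns)
dim(Ch12Ex13)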
  2. Apply hierarchical clustering to the samples using correlation-based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?

Yes, with complete linkage the samples separate into two groups. Using average linkage, however, the samples separate into three groups, so the results do depend on the type of linkage used. As an additional check, a single-linkage fit is sketched after the plots below.

cor_distance <- as.dist(1 - cor(Ch12Ex13))
par(mfrow=c(1,2))

plot(hclust(cor_distance, method="complete"), main="Complete Linkage with Correlation-Based Distance", xlab="", sub="")

plot(hclust(cor_distance, method="average"), main="Average Linkage with Correlation-Based Distance", xlab="", sub="")
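
To see whether the grouping changes under yet another linkage (a small sketch reusing cor_distance from above), single linkage can be fit the same way:

# Single linkage on the same correlation-based distance
plot(hclust(cor_distance, method = "single"), main = "Single Linkage with Correlation-Based Distance", xlab = "", sub = "")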

  3. Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.

One way to identify the genes that differ the most across the two groups is to perform principal component analysis and examine the loadings: the loadings measure how much each original feature contributes to each principal component, so ranking features by their total absolute loading highlights the ones contributing most to the principal components. (See the note after the output below on how the data are oriented.)

pr.out=prcomp(Ch12Ex13, scale=TRUE)
plot(pr.out, main="Principal Component Analysis")

# Sum the loadings across all principal components for each row of the rotation matrix,
# then rank by absolute total loading
cor.load <- apply(pr.out$rotation, 1, sum)
index <- order(abs(cor.load), decreasing = TRUE)
index[1:10]
##  [1] 18 35 28 29 31 11 12 34 32  6
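
One caveat: read.csv() loads the 1,000 genes as rows and the 40 samples as columns, so pr.out$rotation above has one row per sample and the indices printed above refer to samples. If the goal is to rank genes, one option (a sketch applying the same loading-sum idea after transposing so that the genes become the variables) is:

# Transpose so that the 1,000 genes become the variables
pr.genes <- prcomp(t(Ch12Ex13), scale = TRUE)

# Total loading of each gene across all principal components
gene.load <- apply(pr.genes$rotation, 1, sum)

# The ten genes with the largest absolute total loading
order(abs(gene.load), decreasing = TRUE)[1:10]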