Juan Olazaba 2/3/2023
Consider the USArrests data. We will now perform hierarchical clustering on the states.
df = USArrests
hc.complete=hclust(dist(df), method="complete")
hc.single=hclust(dist(df), method="single")
par(mfrow = c(1,2))
plot(hc.complete, main = "comlete linkage", xlab = "", sub="", cex=0.9)
plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)
# Draw red rectangles around the clusters formed by cutting at height 3
rect.hclust(hc.complete, h = 3, border = "red")
# Mark the cut height with a dashed red line
abline(h = 3, col = "red", lty = 2)
Cutting the complete-linkage dendrogram into three clusters produces the following groups of states:
a: (Alabama, Alaska, Arizona, California, Delaware, Florida, Illinois, Louisiana, Maryland, Michigan, Mississippi, Nevada, New Mexico, New York, North Carolina, South Carolina)
b: (Colorado, Georgia, Massachusetts, Missouri, New Jersey, Oklahoma, Oregon, Rhode Island, Tennessee, Texas, Virginia, Washington, Wyoming)
c: (Connecticut, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Minnesota, Montana, Nebraska, New Hampshire, North Dakota, Ohio, Pennsylvania, South Dakota, Utah, Vermont, West Virginia, Wisconsin)
Please see below for additional analysis.
# Cut the complete-linkage dendrogram into k = 3 clusters
# (cutree(hc.complete, h = 3) would instead cut at a fixed height)
clusters <- cutree(hc.complete, k = 3)
# Display cluster membership for each state
clusters
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
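To reproduce the three groups of states listed above (a, b, c) directly from these assignments, one simple option is to split the state names by cluster label:
# Group state names by cluster; clusters 1, 2, 3 correspond to groups a, b, c above
split(names(clusters), clusters)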
scaled <- scale(df)
complete = hclust(dist(scaled), method="complete")
plot(complete, main = "comlete linkage", xlab = "", sub="", cex=0.9)
# Add a red rectangle to indicate the cut height
rect.hclust(hc.complete, h = 3, border = "lightblue")
# Print the cut height
abline(h = 3, col = "lightblue", lty = 2)The variables should be scaled prior to performing hierarchical clustering.In our first attempt in cutting the complete linkage model at a height of 3, it output all states; each state as one cluster. On the contrary, scaling the variables before performing hierarchical clustering help in more accurate clustering several variables into a group.
We use the following R code to simulate a data set and then perform PCA and K-means clustering on it.
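The code that generates the matrix x is not shown in this excerpt. A minimal sketch of a comparable setup, assuming 60 observations in three classes of 20 with 50 features and a mean shift per class (the seed and shift sizes are illustrative and will not reproduce the exact output below), is:
set.seed(123)                            # illustrative seed, not the original
x <- matrix(rnorm(60 * 50), nrow = 60)   # 60 observations, 50 features
x[1:20, ] <- x[1:20, ] + 1               # shift the mean of class 1
x[41:60, ] <- x[41:60, ] - 1             # shift the mean of class 3; class 2 stays centered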
# Standardize the features (optional)
scaled_features <- scale(x)
# Perform PCA
pca_result <- prcomp(scaled_features)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.89855 1.79565 1.69444 1.6355 1.55177 1.51767 1.45409
## Proportion of Variance 0.07209 0.06449 0.05742 0.0535 0.04816 0.04607 0.04229
## Cumulative Proportion 0.07209 0.13658 0.19400 0.2475 0.29566 0.34173 0.38401
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.43949 1.40333 1.37831 1.36463 1.31258 1.28962 1.24317
## Proportion of Variance 0.04144 0.03939 0.03799 0.03724 0.03446 0.03326 0.03091
## Cumulative Proportion 0.42546 0.46484 0.50284 0.54008 0.57454 0.60780 0.63871
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.21324 1.17179 1.1467 1.09815 1.09131 1.01812 0.95896
## Proportion of Variance 0.02944 0.02746 0.0263 0.02412 0.02382 0.02073 0.01839
## Cumulative Proportion 0.66815 0.69561 0.7219 0.74603 0.76985 0.79058 0.80897
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.94701 0.91875 0.87787 0.85368 0.82025 0.78397 0.72699
## Proportion of Variance 0.01794 0.01688 0.01541 0.01458 0.01346 0.01229 0.01057
## Cumulative Proportion 0.82691 0.84379 0.85920 0.87378 0.88723 0.89953 0.91010
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.71670 0.67290 0.63895 0.62285 0.60539 0.58271 0.56877
## Proportion of Variance 0.01027 0.00906 0.00817 0.00776 0.00733 0.00679 0.00647
## Cumulative Proportion 0.92037 0.92942 0.93759 0.94535 0.95268 0.95947 0.96594
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.54157 0.5246 0.45787 0.43383 0.4060 0.34415 0.3242
## Proportion of Variance 0.00587 0.0055 0.00419 0.00376 0.0033 0.00237 0.0021
## Cumulative Proportion 0.97181 0.9773 0.98150 0.98527 0.9886 0.99093 0.9930
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.31167 0.27195 0.23054 0.20144 0.1876 0.15925 0.13987
## Proportion of Variance 0.00194 0.00148 0.00106 0.00081 0.0007 0.00051 0.00039
## Cumulative Proportion 0.99498 0.99646 0.99752 0.99833 0.9990 0.99954 0.99993
## PC50
## Standard deviation 0.05808
## Proportion of Variance 0.00007
## Cumulative Proportion 1.00000
colors <- c(rep("red", 20), rep("lightgreen", 20), rep("black", 20))
# Extract the scores for the first two principal components
PC_scores <- pca_result$x[, 1:2]
# Plot the scores
plot(PC_scores, pch = 19, col = colors, main = "PC1 vs PC2", xlab = "PC1", ylab = "PC2")
legend("topright", legend = c("Class 1", "Class 2", "Class 3"), col = 1:3, pch = 19, title = "Species")How well do the clusters that you obtained in K-means clustering compare to the true class labels? When compared to the true class labels, k-means clustering did really well at separating the observations. We can see from the table that they are equally separated.
set.seed(4)
k = kmeans(x, centers = 3)
true_labels = c(rep(1, 20), rep(2, 20), rep(3, 20))
table(k$cluster, true_labels)
## true_labels
## 1 2 3
## 1 0 0 20
## 2 20 0 0
## 3 0 20 0
plot(x, col = (k$cluster + 1), main = "K-Means Clustering Results with K=3", xlab = "", ylab = "", pch = 20, cex = 2)
With K = 2, the K-means procedure still keeps the true classes intact within clusters, but it merges two of the classes into a single cluster while the remaining class keeps its own, as the table below shows.
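The call that produced this table is not shown; it presumably mirrors the K = 3 run above with two centers (k2 is an illustrative name, and without the original seed the row ordering may differ):
k2 <- kmeans(x, centers = 2)   # K = 2; seed used in the original run not shown
table(k2$cluster, true_labels)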
## true_labels
## 1 2 3
## 1 0 20 0
## 2 20 0 20
With K = 4, the k-means algorithm split one of the true classes into two clusters, while the other two classes were left untouched.
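Again the call itself is omitted; a run with four centers along these lines (k4 is an illustrative name, seed not shown) yields a table of this form:
k4 <- kmeans(x, centers = 4)   # K = 4
table(k4$cluster, true_labels)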
## true_labels
## 1 2 3
## 1 0 0 20
## 2 11 0 0
## 3 9 0 0
## 4 0 20 0
K-means on the first two principal component score vectors did not perform as expected. Although PCA reduces dimensionality, its usefulness for clustering depends on the structure of the data and the preprocessing used: here the first two components capture only about 14% of the variance (see the summary above), so much of the signal separating the classes is discarded. In theory PCA can enhance the results, but as the table below shows, many factors play a role in practice.
set.seed(1)
PCAs = kmeans(PC_scores, centers = 3)
true_labels = c(rep(1, 20), rep(2, 20), rep(3, 20))
table(PCAs$cluster, true_labels)
## true_labels
## 1 2 3
## 1 9 8 9
## 2 8 2 8
## 3 3 10 3
Scaling the features also affected the results. One of the classes still appears reasonably well separated, but the other classes suffer from considerable misclassification. This could be because scaling places every feature on a similar scale, which can dilute the contribution of the features that actually distinguish the classes.
set.seed(10)
sf = kmeans(scaled_features, centers = 3)
true_labels = c(rep(1, 20), rep(2, 20), rep(3, 20))
table(sf$cluster, true_labels)
## true_labels
## 1 2 3
## 1 0 16 0
## 2 17 1 15
## 3 3 3 5
Yes, the samples separate into two groups. We can also observe that the average linkage procedure separates the samples into three groups. Thus, our results do depend on the type of linkage used.
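The Ch12Ex13 object used below is not loaded in this excerpt. Assuming the gene-expression CSV that accompanies the exercise (1,000 genes in rows, 40 tissue samples in columns, no header) sits in the working directory, it could be read in with:
# Assumed loading step; the file name and location are illustrative
Ch12Ex13 <- read.csv("Ch12Ex13.csv", header = FALSE)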
cor_distance <- as.dist(1 - cor(Ch12Ex13))
par(mfrow=c(1,2))
plot(hclust(cor_distance, method="complete"), main="Complete Linkage with Correlation-Based Distance", xlab="", sub="")
plot(hclust(cor_distance, method="average"), main="Complete Linkage with Correlation-Based Distance", xlab="", sub="")If we are interested in measuring variability between genes, we would benefit the most from performing principal component analysis on the data set. We will review the load for each element within our principal components columns. The loadings indicate the contribution of each original feature to each principal component.
# Total loading of each variable summed across all principal components
cor.load <- apply(pr.out$rotation, 1, sum)
# Rank variables by the absolute size of their total loading
index <- order(abs(cor.load), decreasing = TRUE)
# The ten variables with the largest total loadings
index[1:10]
## [1] 18 35 28 29 31 11 12 34 32 6