Suppose that we have four observations, for which we compute a dissimilarity matrix, given by
\[\begin{bmatrix} & 0.3 & 0.4 & 0.7 \\ 0.3 & & 0.5 & 0.8 \\ 0.4 & 0.5 & & 0.45 \\ 0.7 & 0.8 & 0.45 & \\ \end{bmatrix}\]For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.
- On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.
# Dissimilarity matrix from the question (diagonal entries are 0)
m <- matrix(c(0, 0.3, 0.4, 0.7, 0.3, 0, 0.5, 0.8, 0.4, 0.5, 0, 0.45, 0.7, 0.8, 0.45, 0), ncol = 4)
c1 <- hclust(as.dist(m), method = "complete")
plot(c1)
Observation: We see that observations 1 and 2 merge first (at height 0.3), then observations 3 and 4 merge (at height 0.45), and finally both clusters merge together at height 0.8.
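As a quick check (an extra verification step, not part of the original output), the fusion heights can be read directly from the fitted object; for this matrix they should be 0.30, 0.45 and 0.80.
# Heights at which the three complete-linkage fusions occur
c1$height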
- Repeat (a), this time using single linkage clustering.
c2 <- hclust(as.dist(m), method = "single")
plot(c2)
Observation: In the single linkage dendrogram, observations 1 and 2 merge first (at height 0.3), then cluster {1, 2} merges with observation 3 (at 0.4), and finally observation 4 joins at 0.45. This reflects how single linkage chains observations together based on the smallest inter-cluster distances.
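The same check for the single-linkage tree (again just a verification sketch, assuming c2 from above); the heights should be 0.30, 0.40 and 0.45.
# Heights at which the three single-linkage fusions occur
c2$height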
- Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster?
table(1:4, cutree(c1, 2))
##
## 1 2
## 1 1 0
## 2 1 0
## 3 0 1
## 4 0 1
Observation: We get two clusters: {1, 2} in one group and {3, 4} in the other.
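Equivalently (a small sketch, not part of the original answer), we can cut at an explicit height between the last two fusions, for example 0.6, and obtain the same two clusters.
# Cutting the complete-linkage tree between 0.45 and 0.8 gives clusters {1, 2} and {3, 4}
cutree(c1, h = 0.6)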
- Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?
table(1:4, cutree(c2, 2))
##
## 1 2
## 1 1 0
## 2 1 0
## 3 1 0
## 4 0 1
Observation: This time, we get {1, 2, 3} in one cluster and {4} in the other. Single linkage tends to chain observations together one at a time, so the most distant point is the last to join.
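As with the complete-linkage tree, cutting at an explicit height between the last two fusions (0.40 and 0.45) gives the same grouping (a sketch, not part of the original answer).
# Cutting the single-linkage tree at, say, 0.42 gives clusters {1, 2, 3} and {4}
cutree(c2, h = 0.42)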
- It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the meaning of the dendrogram is the same.
# Swapping the labels of observations 1 and 2 draws the equivalent dendrogram with those two leaves exchanged
plot(c1, labels = c(2, 1, 3, 4))
Observation: Leaves 1 and 2 now appear in the opposite order, but the meaning of the dendrogram is unchanged: the two branches at any fusion can be swapped without altering which observations fuse, or at what height.
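Another equivalent drawing (a minimal sketch using base R) reverses the order of all the leaves; the fusions and their heights are unchanged.
# rev() on a dendrogram flips the left/right order at every fusion
plot(rev(as.dendrogram(c1)))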
In this problem, you will perform \(K\)-means clustering manually, with \(K = 2\), on a small example with \(n = 6\) observations and \(p = 2\) features. The observations are as follows.
| Obs. | \(X_1\) | \(X_2\) |
|------|---------|---------|
| 1 | 1 | 4 |
| 2 | 1 | 3 |
| 3 | 0 | 4 |
| 4 | 5 | 1 |
| 5 | 6 | 2 |
| 6 | 4 | 0 |
- Plot the observations.
library(ggplot2)
d <- data.frame(
x1 = c(1, 1, 0, 5, 6, 4),
x2 = c(4, 3, 4, 1, 2, 0)
)
ggplot(d, aes(x = x1, y = x2)) +
geom_point(size = 3) +
labs(title = "Scatter Plot of Observations")
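Optionally (an extra touch, not required by the question), each point can be labelled with its observation number to make the later cluster assignments easier to follow.
# Same scatter plot with observation numbers added
ggplot(d, aes(x = x1, y = x2)) +
  geom_point(size = 3) +
  geom_text(aes(label = seq_len(nrow(d))), nudge_y = 0.2) +
  labs(title = "Scatter Plot of Observations")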
- Randomly assign a cluster label to each observation. You can use the `sample()` command in R to do this. Report the cluster labels for each observation.
set.seed(42)
d$cluster <- sample(c(1, 2), nrow(d), replace = TRUE)
d
## x1 x2 cluster
## 1 1 4 1
## 2 1 3 1
## 3 0 4 1
## 4 5 1 1
## 5 6 2 2
## 6 4 0 2
Observation: Each observation is randomly assigned to cluster 1 or 2 to start the K-means process.
- Compute the centroid for each cluster.
centroids <- sapply(1:2, function(i) colMeans(d[d$cluster == i, 1:2]))
centroids
## [,1] [,2]
## x1 1.75 5
## x2 3.00 1
Observation: We compute the mean of \(X_1\) and \(X_2\) within each cluster; these means are the cluster centroids.
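An equivalent one-liner (a sketch using base R's aggregate, which returns the same means in a tidier, one-row-per-cluster layout):
# Cluster means of x1 and x2, one row per cluster
aggregate(cbind(x1, x2) ~ cluster, data = d, FUN = mean)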
- Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.
dist <- sapply(1:2, function(i) {
sqrt((d$x1 - centroids[1, i])^2 + (d$x2 - centroids[2, i])^2)
})
d$cluster <- apply(dist, 1, which.min)
d
## x1 x2 cluster
## 1 1 4 1
## 2 1 3 1
## 3 0 4 1
## 4 5 1 2
## 5 6 2 2
## 6 4 0 2
Observation: Each observation is reassigned to the cluster whose centroid is closest in Euclidean distance; observations 1–3 now form one cluster and observations 4–6 the other.
- Repeat (c) and (d) until the answers obtained stop changing.
centroids <- sapply(1:2, function(i) colMeans(d[d$cluster == i, 1:2]))
dist <- sapply(1:2, function(i) {
sqrt((d$x1 - centroids[1, i])^2 + (d$x2 - centroids[2, i])^2)
})
d$cluster <- apply(dist, 1, which.min)
d
## x1 x2 cluster
## 1 1 4 1
## 2 1 3 1
## 3 0 4 1
## 4 5 1 2
## 5 6 2 2
## 6 4 0 2
Observation: Repeating steps (c) and (d) leaves the cluster labels unchanged, so the algorithm has converged.
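A more general way to express "repeat until the answers stop changing" is a small loop (a sketch only, assuming both clusters stay non-empty, which holds for this example):
# Iterate steps (c) and (d) until the assignments stop changing
repeat {
  old <- d$cluster
  centroids <- sapply(1:2, function(i) colMeans(d[d$cluster == i, 1:2]))
  dists <- sapply(1:2, function(i) {
    sqrt((d$x1 - centroids[1, i])^2 + (d$x2 - centroids[2, i])^2)
  })
  d$cluster <- apply(dists, 1, which.min)
  if (all(old == d$cluster)) break
}
d$cluster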
- In your plot from (a), color the observations according to the cluster labels obtained.
ggplot(d, aes(x = x1, y = x2, color = factor(cluster))) +
geom_point(size = 3) +
labs(title = "Final Clusters")
Observation: We now have a clean cluster plot showing how K-means has divided the points into two distinct groups.
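As an optional embellishment, the final centroids can be overlaid on the same plot (this assumes the centroids object from the last iteration is still in the workspace):
# Overlay the final centroids as crosses
cent <- data.frame(x1 = centroids[1, ], x2 = centroids[2, ], cluster = 1:2)
ggplot(d, aes(x = x1, y = x2, color = factor(cluster))) +
  geom_point(size = 3) +
  geom_point(data = cent, shape = 4, size = 5, stroke = 2) +
  labs(title = "Final Clusters", color = "Cluster")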
Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms.
- At a certain point on the single linkage dendrogram, the clusters {1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram, the clusters {1, 2, 3} and {4, 5} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?
The complete linkage fusion will typically occur higher on the tree: single linkage fuses the two clusters at the smallest dissimilarity between them, whereas complete linkage fuses them at the largest, so the complete linkage height is always at least as great. The two heights could coincide (for example, if every dissimilarity between the two clusters were equal), so strictly speaking there is not enough information to tell which fusion is higher.
- At a certain point on the single linkage dendrogram, the clusters {5} and {6} fuse. On the complete linkage dendrogram, the clusters {5} and {6} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?
They will fuse at the same height: when both clusters contain a single observation there is only one pairwise dissimilarity between them, so the minimum and maximum coincide and single and complete linkage give the same fusion height.
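A tiny numerical illustration of both points, using made-up data rather than anything from the exercise:
# For the same two groups, single linkage uses the smallest between-group
# dissimilarity and complete linkage the largest, so the complete-linkage
# fusion can never be lower; for two singletons there is only one pairwise
# distance, so both linkages fuse them at the same height.
set.seed(1)
x <- matrix(rnorm(12), ncol = 2)          # 6 made-up points in 2 dimensions
dmat <- as.matrix(stats::dist(x))
between <- dmat[1:3, 4:5]                 # dissimilarities between {1, 2, 3} and {4, 5}
c(single = min(between), complete = max(between))
dmat[5, 6]                                # the single dissimilarity between {5} and {6}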