Question 2

Suppose that we have four observations, for which we compute a dissimilarity matrix, given by \[ \begin{bmatrix} & 0.3 & 0.4 & 0.7 \\ 0.3 & & 0.5 & 0.8 \\ 0.4 & 0.5 & & 0.45 \\ 0.7 & 0.8 & 0.45 & \\ \end{bmatrix} \] For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.

m <- matrix(c(0, 0.3, 0.4, 0.7,
               0.3, 0, 0.5, 0.8,
               0.4, 0.5, 0, 0.45,
               0.7, 0.8, 0.45, 0), ncol = 4)
colnames(m) <- rownames(m) <- paste("Obs", 1:4)
  a. On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage.
c1 <- hclust(as.dist(m), method = "complete")
plot(c1, main = "Complete Linkage Dendrogram")

Explanation:

Complete linkage measures the dissimilarity between two clusters as the largest pairwise dissimilarity across them. Observations 1 and 2, with the smallest dissimilarity (0.3), merge first. The next smallest inter-cluster dissimilarity is between observations 3 and 4 (0.45), which is lower than the complete-linkage distance from {1, 2} to either of them (max(0.4, 0.5) = 0.5 for observation 3 and max(0.7, 0.8) = 0.8 for observation 4), so {3, 4} merge next. Finally, the clusters {1, 2} and {3, 4} fuse at height max(0.4, 0.5, 0.7, 0.8) = 0.8. Complete linkage tends to produce compact clusters, but it delays merging whenever even a single pair of observations across two clusters is far apart.
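
As a quick check of this sequence, the merge order and fusion heights stored in the fitted hclust object should show fusions at 0.3, 0.45, and 0.8:
c1$merge
c1$height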

  b. Repeat (a), this time using single linkage clustering.
c2 <- hclust(as.dist(m), method = "single")
plot(c2, main = "Single Linkage Dendrogram")

Explanation:

Single linkage clustering considers the minimum distance between any two points in different clusters. It starts by merging the closest pair (Obs 1 & 2 at 0.3), and then forms a chain by successively joining the next nearest observation (Obs 3 at 0.4, then Obs 4 at 0.45). This can lead to elongated clusters, sometimes called the “chaining effect,” which can be seen in this dendrogram.
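
Again, the fusion heights recorded in c2 should match this chain of merges at 0.3, 0.4, and 0.45:
c2$height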

  c. Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster?
table(1:4, cutree(c1, 2))
##    
##     1 2
##   1 1 0
##   2 1 0
##   3 0 1
##   4 0 1

Explanation:

Cutting the complete linkage dendrogram at a height that yields two clusters places Obs 1 and 2 in Cluster 1 and Obs 3 and 4 in Cluster 2. This aligns with the hierarchical logic that Obs 3 and 4 are more similar to each other than to Obs 1 and 2 under complete linkage.
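
Equivalently, the tree can be cut at an explicit height; any height strictly between the last two fusion heights (0.45 and 0.8) yields these same two clusters:
cutree(c1, h = 0.6)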

  d. Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?
table(1:4, cutree(c2, 2))
##    
##     1 2
##   1 1 0
##   2 1 0
##   3 1 0
##   4 0 1

Explanation:

With single linkage, the looser chaining structure results in Obs 1, 2, and 3 being grouped together in Cluster 1, while Obs 4 forms its own cluster. This is due to the minimum distances driving the clustering rather than maximum ones, which can lead to less compact clusters.
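
Here the two-cluster cut corresponds to a height strictly between 0.4 and 0.45, the last two fusion heights of the single linkage tree:
cutree(c2, h = 0.42)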

  e. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned.
# Relabelling swaps the displayed positions of Obs 1 and 2, which merge with each other at 0.3
plot(c1, labels = c(2, 1, 3, 4), main = "Complete Linkage (Reordered Leaves)")

Explanation:

Reordering leaves in a dendrogram does not change the actual clustering or distances. The new dendrogram simply flips the visual order of the leaves while preserving merge heights. This emphasizes that the dendrogram’s shape is fixed by the linkage method and dissimilarities—not the left-to-right position of leaves.
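
An alternative way to reposition leaves, sketched here with base R's dendrogram methods, is simply to mirror the tree; rev() reverses the leaf order while leaving all merge heights unchanged:
plot(rev(as.dendrogram(c1)), main = "Complete Linkage (Mirrored Leaves)")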


Question 3

In this problem, you will perform \(K\)-means clustering manually, with \(K = 2\), on a small example with \(n = 6\) observations and \(p = 2\) features.

library(ggplot2)
d <- data.frame(
  x1 = c(1, 1, 0, 5, 6, 4),
  x2 = c(4, 3, 4, 1, 2, 0)
)
  a. Plot the observations.
ggplot(d, aes(x = x1, y = x2)) +
  geom_point(size = 3) +
  labs(title = "Initial Plot of Observations") +
  theme_minimal()

Explanation:

The initial scatter plot shows two visually distinct groups: three points in the upper left (small x1, large x2) and three in the lower right (large x1, small x2). This clear separation makes K = 2 a natural choice for the number of clusters.

  b. Randomly assign a cluster label to each observation.
set.seed(42)
d$cluster <- sample(c(1, 2), size = nrow(d), replace = TRUE)
d
##   x1 x2 cluster
## 1  1  4       1
## 2  1  3       1
## 3  0  4       1
## 4  5  1       1
## 5  6  2       2
## 6  4  0       2

Explanation:

This step provides the random initial assignment from which the K-means algorithm starts; the iterations that follow improve on it. A different seed could produce a different starting partition, and hence possibly a different final result.

  c. Compute the centroid for each cluster.
centroids <- sapply(c(1, 2), function(i) colMeans(d[d$cluster == i, 1:2]))
centroids
##    [,1] [,2]
## x1 1.75    5
## x2 3.00    1

Explanation:

Centroids are the mean positions of the observations currently assigned to each cluster. With the initial random assignment, Cluster 1 (Obs 1-4) has centroid (1.75, 3), pulled toward the upper-left points, while Cluster 2 (Obs 5 and 6) has centroid (5, 1) in the lower right.
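
The same centroids can be computed with aggregate(), shown here as an equivalent cross-check:
aggregate(cbind(x1, x2) ~ cluster, data = d, FUN = mean)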

  d. Assign each observation to the closest centroid.
# Euclidean distance from each observation to each centroid
# (named dists to avoid masking base R's dist() function)
dists <- sapply(1:2, function(i) {
  sqrt((d$x1 - centroids[1, i])^2 + (d$x2 - centroids[2, i])^2)
})
d$cluster <- apply(dists, 1, which.min)
d
##   x1 x2 cluster
## 1  1  4       1
## 2  1  3       1
## 3  0  4       1
## 4  5  1       2
## 5  6  2       2
## 6  4  0       2

Explanation:

Each observation is reclassified to the cluster whose centroid is closest by Euclidean distance. This realignment better reflects natural groupings in the data.
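
For example, observation 4, located at \((5, 1)\), lies at distance \(\sqrt{(5 - 1.75)^2 + (1 - 3)^2} = \sqrt{14.5625} \approx 3.82\) from the first centroid but at distance \(0\) from the second centroid \((5, 1)\), so it is reassigned to Cluster 2.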

  e. Repeat (c) and (d) until assignments no longer change.
# Recalculate centroids
centroids <- sapply(c(1, 2), function(i) colMeans(d[d$cluster == i, 1:2]))
# Recalculate distances and assign again
dists <- sapply(1:2, function(i) {
  sqrt((d$x1 - centroids[1, i])^2 + (d$x2 - centroids[2, i])^2)
})
d$cluster <- apply(dists, 1, which.min)
d
##   x1 x2 cluster
## 1  1  4       1
## 2  1  3       1
## 3  0  4       1
## 4  5  1       2
## 5  6  2       2
## 6  4  0       2

Explanation:

Recomputing centroids and reassigning clusters shows no change in membership—indicating that K-means has converged. In practice, this iterative loop continues until assignments stabilize.
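
The full iteration can be written as a compact sketch; this assumes both clusters stay non-empty (otherwise colMeans() would return NaN for the empty one):
repeat {
  # Update centroids from the current assignments
  centroids <- sapply(c(1, 2), function(i) colMeans(d[d$cluster == i, 1:2]))
  # Distance from each observation to each centroid
  dists <- sapply(1:2, function(i) {
    sqrt((d$x1 - centroids[1, i])^2 + (d$x2 - centroids[2, i])^2)
  })
  new_cluster <- apply(dists, 1, which.min)
  # Stop once the assignments no longer change
  if (all(new_cluster == d$cluster)) break
  d$cluster <- new_cluster
}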

  f. Color the observations according to the final cluster labels.
ggplot(d, aes(x = x1, y = x2, color = factor(cluster))) +
  geom_point(size = 3) +
  labs(title = "K-Means Cluster Assignments", color = "Cluster") +
  theme_minimal()

Explanation:

The final plot confirms well-separated clusters. The coloring clearly shows Cluster 1 (upper left) and Cluster 2 (lower right), which align with the spatial separation observed initially.
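
As a sanity check (not part of the exercise), the built-in kmeans() function with several random starts should recover the same partition, up to an arbitrary swap of the cluster labels:
km <- kmeans(d[, c("x1", "x2")], centers = 2, nstart = 20)
table(manual = d$cluster, kmeans = km$cluster)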


Question 4

Suppose that for a particular data set, we perform hierarchical clustering using single linkage and complete linkage.

  a. At a certain point on the single linkage dendrogram, the clusters {1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram, these clusters also fuse at some point. Which fusion will occur higher on the tree?

Explanation:

Single linkage defines the dissimilarity between two clusters as the minimum pairwise dissimilarity across them, while complete linkage uses the maximum. The fusion of {1, 2, 3} and {4, 5} therefore occurs at the smallest of the six cross-cluster dissimilarities on the single linkage tree and at the largest of them on the complete linkage tree. Since the maximum is at least as large as the minimum, the complete linkage fusion occurs higher on the tree, except in the special case where all six cross-cluster dissimilarities are equal, in which case the two fusions occur at the same height.
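
A small numeric illustration, with made-up dissimilarities that are not part of the exercise: keep the within-cluster dissimilarities small and let the six cross-cluster dissimilarities range from 0.5 to 0.9. Single linkage then fuses {1, 2, 3} and {4, 5} at 0.5, while complete linkage fuses them at 0.9:
dd <- matrix(c(0,    0.10, 0.15, 0.50, 0.60,
               0.10, 0,    0.20, 0.70, 0.80,
               0.15, 0.20, 0,    0.85, 0.90,
               0.50, 0.70, 0.85, 0,    0.10,
               0.60, 0.80, 0.90, 0.10, 0), ncol = 5)
max(hclust(as.dist(dd), method = "single")$height)    # fusion of {1,2,3} and {4,5} at 0.5
max(hclust(as.dist(dd), method = "complete")$height)  # same fusion at 0.9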

  b. At a certain point on the single linkage dendrogram, the clusters {5} and {6} fuse. On the complete linkage dendrogram, these clusters also fuse at some point. Which fusion will occur higher on the tree?

Explanation:

When both clusters contain a single observation, as {5} and {6} do, there is only one pairwise dissimilarity between them, so the minimum and maximum coincide. The fusion of these singleton clusters therefore occurs at the same height on both the single and complete linkage dendrograms.