2. Suppose that we have four observations, for which we compute a dissimilarity matrix, given by (matrix) For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.

a). On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.

diss_matrix <- matrix(c(
  0,   0.3, 0.4, 0.7,
  0.3, 0,   0.5, 0.8,
  0.4, 0.5, 0,   0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)

rownames(diss_matrix) <- colnames(diss_matrix) <- paste0("Obs", 1:4)

hc_complete <- hclust(as.dist(diss_matrix), method = "complete")
plot(hc_complete, main = "Complete Linkage Dendrogram", ylab = "Height")

b). Repeat (a), this time using single linkage clustering.

hc_single <- hclust(as.dist(diss_matrix), method = "single")
plot(hc_single, main = "Single Linkage Dendrogram", ylab = "Height")

c). Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster?

cutree(hc_complete, k = 2)
## Obs1 Obs2 Obs3 Obs4 
##    1    1    2    2

d). Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?

cutree(hc_single, k = 2)
## Obs1 Obs2 Obs3 Obs4 
##    1    1    1    2

e). It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the meaning of the dendrogram is the same.

plot(hc_complete, hang = -1, main = "Alternate Complete Linkage Dendrogram")

3. In this problem, you will perform K-means clustering manually, with K =2, on a small example with n =6observations and p =2 features. The observations are as follows.

data <- data.frame(
  Obs = 1:6,
  X1 = c(1, 1, 0, 5, 6, 4),
  X2 = c(4, 3, 4, 1, 2, 0)
)

a). Plot the observations.

plot(data$X1, data$X2, xlab = "X1", ylab = "X2", pch = 19, main = "Observations")

b). Randomly assign a cluster label to each observation. You can use the sample() command in R to do this. Report the cluster labels for each observation.

set.seed(123)
data$Cluster <- sample(1:2, size = nrow(data), replace = TRUE)

cat("Initial Cluster Assignments:\n")
## Initial Cluster Assignments:
print(data)
##   Obs X1 X2 Cluster
## 1   1  1  4       1
## 2   2  1  3       1
## 3   3  0  4       1
## 4   4  5  1       2
## 5   5  6  2       1
## 6   6  4  0       2

c). Compute the centroid for each cluster.

c1 <- data[data$Cluster == 1, ]
c2 <- data[data$Cluster == 2, ]

# Centroids
c1_x <- mean(c1$X1)
c1_y <- mean(c1$X2)
c2_x <- mean(c2$X1)
c2_y <- mean(c2$X2)

cat("Centroid of Cluster 1:", c1_x, c1_y, "\n")
## Centroid of Cluster 1: 2 3.25
cat("Centroid of Cluster 2:", c2_x, c2_y, "\n")
## Centroid of Cluster 2: 4.5 0.5

d). Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.

data$Cluster <- apply(data[, c("X1", "X2")], 1, function(row) {
  d1 <- sqrt((row[1] - c1_x)^2 + (row[2] - c1_y)^2)
  d2 <- sqrt((row[1] - c2_x)^2 + (row[2] - c2_y)^2)
  ifelse(d1 < d2, 1, 2)
})

print(data)
##   Obs X1 X2 Cluster
## 1   1  1  4       1
## 2   2  1  3       1
## 3   3  0  4       1
## 4   4  5  1       2
## 5   5  6  2       2
## 6   6  4  0       2

e). Repeat (c) and (d) until the answers obtained stop changing.

repeat {
  old_cluster <- data$Cluster
  
  c1 <- data[data$Cluster == 1, ]
  c2 <- data[data$Cluster == 2, ]
  c1_x <- mean(c1$X1)
  c1_y <- mean(c1$X2)
  c2_x <- mean(c2$X1)
  c2_y <- mean(c2$X2)
  
  data$Cluster <- apply(data[, c("X1", "X2")], 1, function(row) {
    d1 <- sqrt((row[1] - c1_x)^2 + (row[2] - c1_y)^2)
    d2 <- sqrt((row[1] - c2_x)^2 + (row[2] - c2_y)^2)
    ifelse(d1 < d2, 1, 2)
  })
  
  if (all(data$Cluster == old_cluster)) break
}

cat("Final cluster assignments after convergence:\n")
## Final cluster assignments after convergence:
print(data)
##   Obs X1 X2 Cluster
## 1   1  1  4       1
## 2   2  1  3       1
## 3   3  0  4       1
## 4   4  5  1       2
## 5   5  6  2       2
## 6   6  4  0       2

f). In your plot from (a), color the observations according to the cluster labels obtained.

plot(data$X1, data$X2, col = data$Cluster, pch = 19,
     main = "Final Clusters (Manual K-Means)", xlab = "X1", ylab = "X2")
text(data$X1 + 0.2, data$X2, labels = data$Obs)

4. Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms.

a). At a certain point on the single linkage dendrogram, the clusters {1,2,3} and {4,5} fuse. On the complete linkage dendrogram, the clusters {1,2,3} and {4,5} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

points <- matrix(c(
  1, 4,
  1, 3,
  0, 4,
  5, 1,
  6, 2,
  4, 0    
), ncol = 2, byrow = TRUE)

rownames(points) <- paste0("Obs", 1:6)

points
##      [,1] [,2]
## Obs1    1    4
## Obs2    1    3
## Obs3    0    4
## Obs4    5    1
## Obs5    6    2
## Obs6    4    0
hc_single <- hclust(dist(points), method = "single")
hc_complete <- hclust(dist(points), method = "complete")

par(mfrow = c(1, 2))
plot(hc_single, main = "Single Linkage", xlab = "", sub = "", cex = 0.8)
plot(hc_complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.8)

Since single linkage merges based on the shortest distance between clusters and complete linkage merges based on the furthest distance, the {1,2,3} and {4,5} merge will happen at a higher height in complete linkage.

b). At a certain point on the single linkage dendrogram, the clusters {5} and {6} fuse. On the complete linkage dendrogram, the clusters {5} and {6} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

dists <- as.matrix(dist(points))
dists["Obs5", "Obs6"]
## [1] 2.828427

They will fuse at the same height (2.83) in both single and complete linkage because Obs5 and Obs6 are individual points, and the only pairwise distance considered is 2.83.