We are given a dissimilarity matrix for four observations. The matrix is:
     1     2     3     4
1    0.0   0.3   0.4   0.7
2    0.3   0.0   0.5   0.8
3    0.4   0.5   0.0   0.45
4    0.7   0.8   0.45  0.0
Complete linkage uses the maximum dissimilarity between elements of two clusters.
Steps based on the matrix:
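Cluster (1,2): the minimum dissimilarity in the matrix is 0.3.
Cluster (3,4): after the first merge, the complete-linkage dissimilarities are d((1,2),3) = max(0.4, 0.5) = 0.5, d((1,2),4) = max(0.7, 0.8) = 0.8, and d(3,4) = 0.45, so the next smallest dissimilarity is 0.45.
Merge (1,2) and (3,4) at height = max(0.4, 0.7, 0.5, 0.8) = 0.8, the largest of the four between-cluster dissimilarities d(1,3), d(1,4), d(2,3), d(2,4).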
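The between-cluster dissimilarities used in these steps can be checked by hand. The following is a minimal sketch; `linkage_dissim()` is a hypothetical helper introduced here for illustration, not part of the original solution.

```r
# Dissimilarity matrix from the problem statement
d <- matrix(c(
  0,   0.3, 0.4,  0.7,
  0.3, 0,   0.5,  0.8,
  0.4, 0.5, 0,    0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)

# Hypothetical helper: apply `fun` (max for complete linkage, min for single)
# to every pairwise dissimilarity between clusters a and b
linkage_dissim <- function(d, a, b, fun = max) {
  fun(d[a, b])
}

d12_3  <- linkage_dissim(d, c(1, 2), 3)        # max(0.4, 0.5)
d12_4  <- linkage_dissim(d, c(1, 2), 4)        # max(0.7, 0.8)
d12_34 <- linkage_dissim(d, c(1, 2), c(3, 4))  # max(0.4, 0.7, 0.5, 0.8)
c(d12_3, d12_4, d12_34)
```

Passing `fun = min` to the same helper gives the single-linkage dissimilarities.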
d_matrix <- matrix(c(
0, 0.3, 0.4, 0.7,
0.3, 0, 0.5, 0.8,
0.4, 0.5, 0, 0.45,
0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)
colnames(d_matrix) <- rownames(d_matrix) <- c("1", "2", "3", "4")
dist_mat <- as.dist(d_matrix)
hc_complete <- hclust(dist_mat, method = "complete")
plot(hc_complete, main = "Dendrogram: Complete Linkage", ylab = "Height")
Explanation: Observations 1 and 2 merge at height 0.3, observations 3 and 4 at 0.45, and the two resulting clusters merge at 0.8.
Single linkage uses the minimum dissimilarity between clusters.
Steps:
Cluster (1,2) at 0.3.
Then merge 3 into (1,2) at 0.4 (min of 0.4 and 0.5).
Then merge 4 into cluster (1,2,3) at 0.45 (min of 0.45, 0.7, 0.8).
hc_single <- hclust(dist_mat, method = "single")
plot(hc_single, main = "Dendrogram: Single Linkage", ylab = "Height")
Explanation: The merge order differs from complete linkage. Here, observation 3 joins cluster (1,2) at height 0.4, and then observation 4 joins at 0.45.
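To compare the two methods side by side, the merge heights stored in each `hclust` object can be printed together. This is a small sketch that rebuilds the matrix so it runs on its own; the object name `heights` is chosen here for illustration.

```r
d <- matrix(c(
  0,   0.3, 0.4,  0.7,
  0.3, 0,   0.5,  0.8,
  0.4, 0.5, 0,    0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)
dist_mat <- as.dist(d)

# One row per linkage method, one column per merge step
heights <- rbind(
  complete = hclust(dist_mat, method = "complete")$height,
  single   = hclust(dist_mat, method = "single")$height
)
heights
```

The first merge has the same height in both rows; the later merges are lower under single linkage.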
cut_complete <- cutree(hc_complete, k = 2)
cut_complete
## 1 2 3 4
## 1 1 2 2
Observations 1 and 2 are in Cluster 1.
Observations 3 and 4 are in Cluster 2.
Explanation:
From the complete linkage dendrogram:
Observations 1 and 2 were merged first (at height 0.3).
Observations 3 and 4 were merged next (at height 0.45).
The final merge between the two clusters happens at height 0.8.
So when you cut the tree just below 0.8, it separates the dendrogram into:
One branch with (1, 2)
One branch with (3, 4)
Result:
Cluster 1: {1, 2}
Cluster 2: {3, 4}
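Cutting "just below 0.8" can also be expressed directly with the `h` argument of `cutree()`. The sketch below assumes a cut height of 0.6, which is an arbitrary choice; any height between 0.45 and 0.8 should give the same two clusters.

```r
d <- matrix(c(
  0,   0.3, 0.4,  0.7,
  0.3, 0,   0.5,  0.8,
  0.4, 0.5, 0,    0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)
hc_complete <- hclust(as.dist(d), method = "complete")

by_k <- cutree(hc_complete, k = 2)    # ask for two clusters
by_h <- cutree(hc_complete, h = 0.6)  # cut at a height between 0.45 and 0.8
all(by_k == by_h)
```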
cut_single <- cutree(hc_single, k = 2)
cut_single
## 1 2 3 4
## 1 1 1 2
Observations 1, 2, and 3 are in Cluster 1
Observation 4 is in Cluster 2
Single linkage measures cluster distance using the minimum pairwise dissimilarity between any members of the clusters.
Walking through the process:
(1, 2) merge first at 0.3.
Then 3 joins (1,2) at height 0.4 (minimum of 0.4 and 0.5).
Then 4 joins the cluster (1,2,3) at height 0.45.
So the last merge happens at 0.45. When you cut at k = 2 (i.e., just below height 0.45), you break the tree before 4 merges with the rest.
So:
One cluster contains {1, 2, 3}
The other cluster is {4}
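The whole merge sequence can be read off in one call by passing a vector of cluster counts to `cutree()`. A small sketch; the name `memberships` is introduced here for illustration.

```r
d <- matrix(c(
  0,   0.3, 0.4,  0.7,
  0.3, 0,   0.5,  0.8,
  0.4, 0.5, 0,    0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)
hc_single <- hclust(as.dist(d), method = "single")

# One column per value of k: rows are observations, entries are cluster labels
memberships <- cutree(hc_single, k = 1:4)
memberships
```

Reading the columns from k = 4 down to k = 1 shows (1,2) forming first, then 3 joining, then 4.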
# rev() flips the leaf order; passing labels = to plot() would only relabel the observations
plot(rev(as.dendrogram(hc_complete)),
main = "Equivalent Complete Linkage Dendrogram (Reordered)", ylab = "Height")
This plot shows a dendrogram equivalent to the one created using complete linkage in part (a).
The key idea here is:
The branching structure (merge order and height) stays the same.
But the order of the leaf labels has been rearranged.
The leaves within each pair are flipped (1 and 2, 3 and 4), and the two branches swap sides, so the leaves read 4, 3, 2, 1 from left to right.
The merge heights (0.3 for {1,2}, 0.45 for {3,4}, 0.8 for the full merge) are still the same.
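One way to confirm that reordering the leaves does not change the tree is to compare cophenetic distances (the height at which each pair of observations is first joined) before and after the reordering. A minimal sketch, assuming the leaf order is reversed with `rev()` on the dendrogram:

```r
d <- matrix(c(
  0,   0.3, 0.4,  0.7,
  0.3, 0,   0.5,  0.8,
  0.4, 0.5, 0,    0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)
lab <- as.character(1:4)
dimnames(d) <- list(lab, lab)

dend     <- as.dendrogram(hclust(as.dist(d), method = "complete"))
dend_rev <- rev(dend)  # same tree, leaves drawn in the opposite order

labels(dend)
labels(dend_rev)

# Put both cophenetic matrices in the same row/column order before comparing
coph_orig <- as.matrix(cophenetic(dend))[lab, lab]
coph_rev  <- as.matrix(cophenetic(dend_rev))[lab, lab]
same_tree <- isTRUE(all.equal(coph_orig, coph_rev))
same_tree
```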
x1 <- c(1, 1, 0, 5, 6, 4)
x2 <- c(4, 3, 4, 1, 2, 0)
plot(x1, x2, xlab = "X1", ylab = "X2", main = "Scatterplot: Observations for K-means Clustering", pch = 19)
Using sample() to randomly assign initial cluster labels:
set.seed(1234)
cl <- sample(1:2, 6, replace = TRUE)
cl
## [1] 2 2 2 2 1 2
df <- data.frame(obs = 1:6, x1 = x1, x2 = x2, cl = cl)
df
## obs x1 x2 cl
## 1 1 1 4 2
## 2 2 1 3 2
## 3 3 0 4 2
## 4 4 5 1 2
## 5 5 6 2 1
## 6 6 4 0 2
Explanation: Each observation is randomly assigned to one of two clusters (since K = 2) using the sample() function.
Setting a seed (set.seed(1234)) makes the assignment reproducible: anyone using that seed with the same R version and sampling defaults should get the same labels.
The cluster labels (cl) in the output mean:
Cluster 1: Observation 5
Cluster 2: Observations 1, 2, 3, 4, 6
library(dplyr)
centroids <- df %>%
  group_by(cl) %>%
  summarise(mx1 = mean(x1), mx2 = mean(x2))
centroids
## # A tibble: 2 × 3
## cl mx1 mx2
## <int> <dbl> <dbl>
## 1 1 6 2
## 2 2 2.2 2.4
This computes the mean position (centroid) of each cluster from the current cluster assignments in part (b).
Cluster 1 includes only observation 5: x1 = 6, x2 = 2, so the centroid is (6.0, 2.0).
Cluster 2 includes observations 1, 2, 3, 4, 6:
x1 = [1, 1, 0, 5, 4] → mean = 2.2
x2 = [4, 3, 4, 1, 0] → mean = 2.4
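The same centroids can be reproduced without dplyr. The sketch below uses base R's `aggregate()` and hard-codes the initial labels reported in the output above, so it does not depend on the random number generator; `centroids_base` is a name introduced here for illustration.

```r
df <- data.frame(
  obs = 1:6,
  x1  = c(1, 1, 0, 5, 6, 4),
  x2  = c(4, 3, 4, 1, 2, 0),
  cl  = c(2, 2, 2, 2, 1, 2)  # initial labels copied from the output above
)

# Mean of x1 and x2 within each cluster label
centroids_base <- aggregate(cbind(x1, x2) ~ cl, data = df, FUN = mean)
centroids_base
```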
Next, compute the Euclidean distance from each point to each centroid and assign the point to the closest one.
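# Assign each observation to its nearest centroid (Euclidean distance)
updateCluster <- function(df, centroids) {
  df$newCl <- apply(df, 1, function(row) {
    # Distance from this observation to every centroid
    distances <- apply(centroids, 1, function(centroid) {
      sqrt((row["x1"] - centroid["mx1"])^2 + (row["x2"] - centroid["mx2"])^2)
    })
    # Return the label of the closest centroid, not just its row position
    centroids$cl[which.min(distances)]
  })
  df
}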
df <- updateCluster(df, centroids)
df
## obs x1 x2 cl newCl
## 1 1 1 4 2 2
## 2 2 1 3 2 2
## 3 3 0 4 2 2
## 4 4 5 1 2 1
## 5 5 6 2 1 1
## 6 6 4 0 2 1
The function above:
- Computes the Euclidean distance from each observation to both cluster centroids
- Assigns the observation to the nearest centroid
- Saves this as newCl
From the previous centroid step, I had:
Cluster 1 centroid: (6.0, 2.0)
Cluster 2 centroid: (2.2, 2.4)
Each observation is now assigned a new cluster label (newCl) based on its proximity to these centroids.
Observation 1: (1, 4)
Distance to Cluster 1: sqrt((1-6)^2 + (4-2)^2) = sqrt(29) ≈ 5.39
Distance to Cluster 2: sqrt((1-2.2)^2 + (4-2.4)^2) = sqrt(4) = 2
Closer to Cluster 2 → newCl = 2
Observation 4: (5, 1)
Distance to Cluster 1: sqrt((5-6)^2 + (1-2)^2) = sqrt(2) ≈ 1.41
Distance to Cluster 2: sqrt((5-2.2)^2 + (1-2.4)^2) = sqrt(9.8) ≈ 3.13
Closer to Cluster 1 → newCl = 1
This logic is applied to all six observations.
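The hand calculations above can be extended to all six observations at once. The following is a vectorised sketch under the assumption that the centroids are (6.0, 2.0) and (2.2, 2.4) as computed earlier; `dist_to_centers` and `new_cl` are names introduced here.

```r
X <- cbind(x1 = c(1, 1, 0, 5, 6, 4),
           x2 = c(4, 3, 4, 1, 2, 0))
centers <- rbind(c(6.0, 2.0),   # Cluster 1 centroid
                 c(2.2, 2.4))   # Cluster 2 centroid

# 6 x 2 matrix: row i, column k holds the distance from observation i to centroid k
dist_to_centers <- sapply(1:2, function(k) {
  sqrt((X[, "x1"] - centers[k, 1])^2 + (X[, "x2"] - centers[k, 2])^2)
})

# Index of the nearest centroid for each observation
new_cl <- apply(dist_to_centers, 1, which.min)

round(dist_to_centers, 2)
new_cl
```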
Repeat steps (c) and (d) until cl == newCl for all observations:
while (!all(df$cl == df$newCl)) {
  df$cl <- df$newCl
  centroids <- df %>%
    group_by(cl) %>%
    summarise(mx1 = mean(x1), mx2 = mean(x2))
  df <- updateCluster(df, centroids)
}
df
## obs x1 x2 cl newCl
## 1 1 1 4 2 2
## 2 2 1 3 2 2
## 3 3 0 4 2 2
## 4 4 5 1 1 1
## 5 5 6 2 1 1
## 6 6 4 0 1 1
Cluster 1: Observations 4, 5, 6
Cluster 2: Observations 1, 2, 3
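As a cross-check, the built-in `kmeans()` function can be started from the same initial centroids. This is a sketch, not the method used above: it assumes the starting centres (6.0, 2.0) and (2.2, 2.4) from the first centroid step and uses `algorithm = "Lloyd"`, which alternates the same assign and update steps as the manual loop.

```r
X <- cbind(x1 = c(1, 1, 0, 5, 6, 4),
           x2 = c(4, 3, 4, 1, 2, 0))

# Row k of init_centers is the starting centroid for cluster k
init_centers <- rbind(c(6.0, 2.0),
                      c(2.2, 2.4))

km <- kmeans(X, centers = init_centers, algorithm = "Lloyd")
km$cluster   # final labels
km$centers   # final centroids
```

Because the starting centres are supplied explicitly, the cluster numbering lines up with the manual result.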
plot(df$x1, df$x2, col = df$cl, pch = 19,
xlab = "X1", ylab = "X2",
main = "Final Clusters after K-means (K = 2)")
legend("topright", legend = c("Cluster 1", "Cluster 2"), col = c(1, 2), pch = 19)
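The fusion will occur at least as high on the complete linkage dendrogram, and strictly higher unless the between-cluster dissimilarities are all equal.
Explanation: Single linkage merges clusters using the smallest pairwise dissimilarity between them. Complete linkage merges clusters using the largest pairwise dissimilarity. The largest can never be below the smallest, so the height at which the clusters merge under complete linkage is at least the height under single linkage. The two heights coincide only when every pairwise dissimilarity between the two clusters is equal (which is rare).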
d_matrix <- matrix(c(
0, 1, 1, 4, 5,
1, 0, 1, 4, 5,
1, 1, 0, 4, 5,
4, 4, 4, 0, 1,
5, 5, 5, 1, 0
), nrow = 5, byrow = TRUE)
colnames(d_matrix) <- rownames(d_matrix) <- c("1", "2", "3", "4", "5")
d <- as.dist(d_matrix)
hc_single <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
# Plot both
par(mfrow = c(1, 2))
plot(hc_single, main = "Single Linkage Dendrogram")
plot(hc_complete, main = "Complete Linkage Dendrogram")
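The fusion heights can also be read numerically rather than from the plots. A short sketch that rebuilds the illustrative five-observation matrix; the names `fusion_single` and `fusion_complete` are introduced here.

```r
d5 <- matrix(c(
  0, 1, 1, 4, 5,
  1, 0, 1, 4, 5,
  1, 1, 0, 4, 5,
  4, 4, 4, 0, 1,
  5, 5, 5, 1, 0
), nrow = 5, byrow = TRUE)
d <- as.dist(d5)

# The last (largest) merge height is where {1,2,3} and {4,5} fuse
fusion_single   <- max(hclust(d, method = "single")$height)
fusion_complete <- max(hclust(d, method = "complete")$height)
c(single = fusion_single, complete = fusion_complete)
```

In this example the between-cluster dissimilarities are 4 and 5, so single linkage fuses at the smaller value and complete linkage at the larger one.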
The fusion will occur at the same height in both dendrograms.
Explanation:
When two individual observations are fused, there is only one distance between them. Single linkage and complete linkage both use that same distance for merging. So, the merge height is identical in both methods.
data_three <- matrix(c(
1, 1, # obs 5
4, 4, # obs 6
10, 10 # extra obs 7 (to allow plotting)
), ncol = 2, byrow = TRUE)
rownames(data_three) <- c("5", "6", "7")
d3 <- dist(data_three)
hc_s3 <- hclust(d3, method = "single")
hc_c3 <- hclust(d3, method = "complete")
par(mfrow = c(1, 2))
plot(hc_s3, main = "Single Linkage (5,6,7)")
plot(hc_c3, main = "Complete Linkage (5,6,7)")
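The equality of the first merge height can be checked directly from the `hclust` objects. A minimal sketch using the same three illustrative points; `h_single` and `h_complete` are names introduced here.

```r
pts <- matrix(c(
  1, 1,    # obs 5
  4, 4,    # obs 6
  10, 10   # extra obs 7
), ncol = 2, byrow = TRUE, dimnames = list(c("5", "6", "7"), NULL))
d3 <- dist(pts)

h_single   <- hclust(d3, method = "single")$height
h_complete <- hclust(d3, method = "complete")$height

# First merge: observations 5 and 6, a single pairwise distance in both methods
c(single = h_single[1], complete = h_complete[1])
# Second merge: observation 7 joins {5, 6}, where the two linkages differ
c(single = h_single[2], complete = h_complete[2])
```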