2 Problem Statement

We are given a dissimilarity matrix for four observations:

      1     2     3     4
1   0.0   0.3   0.4   0.7
2   0.3   0.0   0.5   0.8
3   0.4   0.5   0.0   0.45
4   0.7   0.8   0.45  0.0

(a) Dendrogram Using Complete Linkage

Complete linkage uses the maximum dissimilarity between elements of two clusters.

Steps based on the matrix:

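Merge 1 and 2 first: 0.3 is the smallest dissimilarity in the matrix.

Merge 3 and 4 next: after the first merge, the complete linkage dissimilarities are max(0.4, 0.5) = 0.5 between (1,2) and 3, max(0.7, 0.8) = 0.8 between (1,2) and 4, and 0.45 between 3 and 4, so 0.45 is the smallest.

Merge (1,2) and (3,4) at height = max(0.4, 0.7, 0.5, 0.8) = 0.8, the largest dissimilarity between a member of (1,2) and a member of (3,4).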

d_matrix <- matrix(c(
  0,   0.3, 0.4, 0.7,
  0.3, 0,   0.5, 0.8,
  0.4, 0.5, 0,   0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)

colnames(d_matrix) <- rownames(d_matrix) <- c("1", "2", "3", "4")
dist_mat <- as.dist(d_matrix)

hc_complete <- hclust(dist_mat, method = "complete")
plot(hc_complete, main = "Dendrogram: Complete Linkage", ylab = "Height")

Explanation: Observations 1 and 2 merge at height 0.3, observations 3 and 4 at 0.45, and the two resulting clusters merge at 0.8.
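The merge heights can also be checked by hand from the matrix. The sketch below is only an illustration: `D` and the helper `linkage_dist()` are names introduced here, not part of the original solution, and the helper simply applies `max` (complete linkage) or `min` (single linkage) to the block of dissimilarities between two groups.

```r
# Dissimilarity matrix from the problem statement (D is an illustrative name).
D <- matrix(c(
  0,   0.3, 0.4, 0.7,
  0.3, 0,   0.5, 0.8,
  0.4, 0.5, 0,   0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)

# Hypothetical helper: dissimilarity between groups a and b under a linkage
# rule, where `rule` is max (complete linkage) or min (single linkage).
linkage_dist <- function(D, a, b, rule = max) {
  rule(D[a, b, drop = FALSE])
}

# Candidates for the second merge, once 1 and 2 have been joined.
step2 <- c(
  "12_vs_3" = linkage_dist(D, c(1, 2), 3),  # max(0.4, 0.5)
  "12_vs_4" = linkage_dist(D, c(1, 2), 4),  # max(0.7, 0.8)
  "3_vs_4"  = linkage_dist(D, 3, 4)         # single pair: 0.45
)
print(step2)

# Height of the final merge between (1,2) and (3,4).
final_height <- linkage_dist(D, c(1, 2), c(3, 4))
print(final_height)
```

Passing `rule = min` to the same helper gives the single linkage values used in part (b).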

(b) Dendrogram Using Single Linkage

Single linkage uses the minimum dissimilarity between clusters.

Steps:

Cluster (1,2) at 0.3.

Then merge 3 into (1,2) at 0.4 (min of 0.4 and 0.5).

Then merge 4 into cluster (1,2,3) at 0.45 (min of 0.45, 0.7, 0.8).

hc_single <- hclust(dist_mat, method = "single")
plot(hc_single, main = "Dendrogram: Single Linkage", ylab = "Height")

Explanation: The merge order differs from part (a). Here, observation 3 joins cluster (1,2) at height 0.4, and then observation 4 joins at 0.45.
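The merge order can be read directly from the fitted object rather than from the plot. In the `merge` component returned by `hclust()`, a negative entry is an original observation and a positive entry refers to the cluster formed at that earlier row. The sketch below rebuilds the tree under the illustrative names `D` and `hc_s` so that it runs on its own.

```r
# Same dissimilarity matrix as in the problem statement.
D <- matrix(c(
  0,   0.3, 0.4, 0.7,
  0.3, 0,   0.5, 0.8,
  0.4, 0.5, 0,   0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)
dimnames(D) <- list(as.character(1:4), as.character(1:4))

hc_s <- hclust(as.dist(D), method = "single")

# One row per merge: negative = single observation, positive = earlier merge.
print(hc_s$merge)

# Height of each merge, in the same row order.
print(hc_s$height)
```

Row 2 pairs observation 3 with the cluster from row 1, and row 3 pairs observation 4 with the cluster from row 2, which is the chaining pattern described above.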

(c) Cut Complete Linkage Dendrogram into Two Clusters

cut_complete <- cutree(hc_complete, k = 2) 
cut_complete

## 1 2 3 4 
## 1 1 2 2

Observations 1 and 2 are in Cluster 1.

Observations 3 and 4 are in Cluster 2.

Explanation:

From the complete linkage dendrogram:

Observations 1 and 2 were merged first (at height 0.3).

Observations 3 and 4 were merged next (at height 0.45).

The final merge between the two clusters happens at height 0.8.

Cutting the tree just below 0.8 therefore separates the dendrogram into one branch with (1, 2) and one branch with (3, 4).

Result: Cluster 1: {1, 2}, Cluster 2: {3, 4}
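Cutting by height gives the same answer as cutting by number of clusters, which makes the "just below 0.8" reasoning concrete. This is a small self-contained sketch; `D`, `hc_c`, and the cut height 0.6 are choices made here for illustration (any height between 0.45 and 0.8 would do).

```r
# Same dissimilarity matrix as in the problem statement.
D <- matrix(c(
  0,   0.3, 0.4, 0.7,
  0.3, 0,   0.5, 0.8,
  0.4, 0.5, 0,   0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)
dimnames(D) <- list(as.character(1:4), as.character(1:4))

hc_c <- hclust(as.dist(D), method = "complete")

by_k <- cutree(hc_c, k = 2)    # ask for two clusters
by_h <- cutree(hc_c, h = 0.6)  # cut between the 0.45 and 0.8 merges

# Both cuts should produce the same partition.
print(rbind(by_k, by_h))
```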

(d) Cut Single Linkage Dendrogram into Two Clusters

cut_single <- cutree(hc_single, k = 2) 
cut_single

## 1 2 3 4 
## 1 1 1 2

Observations 1, 2, and 3 are in Cluster 1

Observation 4 is in Cluster 2

Single linkage measures cluster distance using the minimum pairwise dissimilarity between any members of the clusters.

Walking through the merges:

(1, 2) merge first at 0.3.

Then 3 joins (1,2) at height 0.4 (minimum of 0.4 and 0.5).

Then 4 joins the cluster (1,2,3) at height 0.45.

The last merge happens at 0.45, so cutting at k = 2 (i.e., just below height 0.45) breaks the tree before 4 merges with the rest.

One cluster contains {1, 2, 3}; the other cluster is {4}.

(e) Equivalent Dendrogram to (a) With Reordered Leaves

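# Reverse the left-to-right leaf order without changing the tree itself.
# (Passing labels = c("4", "3", "2", "1") to plot() would relabel the
# observations rather than reorder them.)
dend_reordered <- rev(as.dendrogram(hc_complete, hang = -1))
plot(dend_reordered,
     main = "Equivalent Complete Linkage Dendrogram (Reordered)", ylab = "Height")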

The plot shows a dendrogram equivalent to the one created using complete linkage in part (a).

The key idea here is:

The branching structure (merge order and height) stays the same.

Only the left-to-right order of the leaf labels changes.

Swapping the two branches at any merge point does not change the clustering, so leaves 1 and 2 can trade places, as can leaves 3 and 4.

The merge order and heights (0.3 for {1,2}, 0.45 for {3,4}, 0.8 for the full merge) are still the same.
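One way to confirm that a reordered tree really is equivalent is to compare cophenetic distances (the height at which each pair of observations is first joined). The sketch below is an illustration under its own assumptions: it reorders the leaves with `rev()` on a dendrogram object, and the names `D`, `hc_c`, `dend`, and `dend_rev` are introduced here.

```r
# Same dissimilarity matrix as in the problem statement.
D <- matrix(c(
  0,   0.3, 0.4, 0.7,
  0.3, 0,   0.5, 0.8,
  0.4, 0.5, 0,   0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)
dimnames(D) <- list(as.character(1:4), as.character(1:4))

hc_c <- hclust(as.dist(D), method = "complete")
dend <- as.dendrogram(hc_c)
dend_rev <- rev(dend)  # mirror the tree: same merges, reversed leaf order

# Leaf order before and after the flip.
print(labels(dend))
print(labels(dend_rev))

# Cophenetic distances, aligned by observation label before comparing.
ids <- as.character(1:4)
coph <- as.matrix(cophenetic(dend))[ids, ids]
coph_rev <- as.matrix(cophenetic(dend_rev))[ids, ids]
same_tree <- isTRUE(all.equal(coph, coph_rev))
print(same_tree)
```

If the two matrices agree, every pair of observations is joined at the same height in both drawings, so only the layout differs.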

3 (a) Plot the Observations

x1 <- c(1, 1, 0, 5, 6, 4) 
x2 <- c(4, 3, 4, 1, 2, 0) 
plot(x1, x2, xlab = "X1", ylab = "X2", main = "Scatterplot: Observations for K-means Clustering", pch = 19)

(b) Random Cluster Initialization

Using sample() to randomly assign initial cluster labels:

set.seed(1234)   
cl <- sample(1:2, 6, replace = TRUE) 
cl

## [1] 2 2 2 2 1 2

df <- data.frame(obs = 1:6, x1 = x1, x2 = x2, cl = cl) 
df

##   obs x1 x2 cl
## 1   1  1  4  2
## 2   2  1  3  2
## 3   3  0  4  2
## 4   4  5  1  2
## 5   5  6  2  1
## 6   6  4  0  2

Explanation: Each observation is randomly assigned to one of two clusters (since K = 2) using the sample() function.

Setting a seed (set.seed(1234)) makes the assignment reproducible: running the same code with that seed gives the same labels.

The cluster labels (cl) in the output mean:

Cluster 1: Observation 5

Cluster 2: Observations 1, 2, 3, 4, 6

(c) Compute Centroids

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

centroids <- df %>%
  group_by(cl) %>%
  summarise(mx1 = mean(x1), mx2 = mean(x2))

centroids

## # A tibble: 2 × 3
##      cl   mx1   mx2
##   <int> <dbl> <dbl>
## 1     1   6     2  
## 2     2   2.2   2.4

This computes the centroid (mean position) of each cluster from the current cluster assignments in part (b).

Cluster 1 includes only observation 5: x1 = 6, x2 = 2 → centroid = (6.0, 2.0)

Cluster 2 includes observations 1, 2, 3, 4, 6:

x1 = [1, 1, 0, 5, 4] → mean = 2.2

x2 = [4, 3, 4, 1, 0] → mean = 2.4
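The same centroids can be reproduced without dplyr as a cross-check. This sketch uses base R's `aggregate()`; the initial labels are typed in from the output of part (b) rather than regenerated, and `df_check` and `centroids_base` are names introduced here.

```r
# Observations and the initial labels reported in part (b).
df_check <- data.frame(
  x1 = c(1, 1, 0, 5, 6, 4),
  x2 = c(4, 3, 4, 1, 2, 0),
  cl = c(2, 2, 2, 2, 1, 2)
)

# Mean of x1 and x2 within each cluster label.
centroids_base <- aggregate(cbind(x1, x2) ~ cl, data = df_check, FUN = mean)
print(centroids_base)
```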

(d) Assign Observations to Closest Centroid

Compute the Euclidean distance from each point to each centroid and assign the point to the closest one.

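# For each observation, find the nearest centroid (Euclidean distance) and
# store that centroid's row index as newCl. This relies on the centroid rows
# being ordered by cluster label (1, 2), which group_by() provides.
updateCluster <- function(df, centroids) {
  df$newCl <- apply(df, 1, function(row) {
    # Distance from this observation to every centroid
    distances <- apply(centroids, 1, function(centroid) {
      sqrt((row["x1"] - centroid["mx1"])^2 + (row["x2"] - centroid["mx2"])^2)
    })
    return(which.min(distances))
  })
  return(df)
}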

df <- updateCluster(df, centroids)
df

##   obs x1 x2 cl newCl
## 1   1  1  4  2     2
## 2   2  1  3  2     2
## 3   3  0  4  2     2
## 4   4  5  1  2     1
## 5   5  6  2  1     1
## 6   6  4  0  2     1

The function above does three things:

- Computes the Euclidean distance from each observation to both cluster centroids
- Assigns the observation to the nearest centroid
- Saves the result as newCl

From the previous centroid step:

Cluster 1 centroid: (6.0, 2.0)

Cluster 2 centroid: (2.2, 2.4)

Each observation is now assigned a new cluster label (newCl) based on its proximity to these centroids.

Observation 1: (1, 4)

Distance to Cluster 1: sqrt((1-6)^2 + (4-2)^2) = sqrt(29) ≈ 5.39

Distance to Cluster 2: sqrt((1-2.2)^2 + (4-2.4)^2) = sqrt(4) = 2

Closer to Cluster 2 → newCl = 2

Observation 4: (5, 1)

Distance to Cluster 1: sqrt((5-6)^2 + (1-2)^2) = sqrt(2) ≈ 1.41

Distance to Cluster 2: sqrt((5-2.2)^2 + (1-2.4)^2) = sqrt(9.8) ≈ 3.13

Closer to Cluster 1 → newCl = 1

This logic is applied to all six observations.
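To see the distances for all six observations at once, the calculation can be written in vectorized form. This is a sketch for checking the arithmetic, not a replacement for the function used in the solution; `pts`, `cen`, `dist_to`, and `nearest` are names introduced here, and the centroids are typed in from part (c).

```r
# Observations as a 6 x 2 matrix.
pts <- cbind(
  x1 = c(1, 1, 0, 5, 6, 4),
  x2 = c(4, 3, 4, 1, 2, 0)
)

# Centroids from part (c), one row per cluster label.
cen <- rbind("1" = c(6, 2), "2" = c(2.2, 2.4))

# Column k holds the Euclidean distance from every observation to centroid k.
dist_to <- sapply(rownames(cen), function(k) {
  sqrt(rowSums(sweep(pts, 2, cen[k, ])^2))
})

# Index of the closest centroid for each observation.
nearest <- apply(dist_to, 1, which.min)
print(round(cbind(dist_to, nearest), 2))
```

Observation 6 is the closest call: it lies about 2.83 from centroid 1 and exactly 3 from centroid 2, so it moves to Cluster 1.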

(e) Repeat Until Convergence

Repeat steps (c) and (d) until cl == newCl for all observations:

while (!all(df$cl == df$newCl)) {
  df$cl <- df$newCl
  centroids <- df %>%
    group_by(cl) %>%
    summarise(mx1 = mean(x1), mx2 = mean(x2))
  df <- updateCluster(df, centroids)
}
df

##   obs x1 x2 cl newCl
## 1   1  1  4  2     2
## 2   2  1  3  2     2
## 3   3  0  4  2     2
## 4   4  5  1  1     1
## 5   5  6  2  1     1
## 6   6  4  0  1     1

Cluster 1: Observations 4, 5, 6
Cluster 2: Observations 1, 2, 3
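As a sanity check, the hand-rolled loop can be compared with R's built-in `kmeans()`. The sketch below is an assumption-laden illustration: the seed, `nstart = 20`, and the names `X`, `km`, `manual`, and `agreement` are choices made here. Because `kmeans()` may number the clusters differently, the comparison uses a cross-tabulation rather than the raw labels.

```r
# Observations as a matrix.
X <- cbind(
  x1 = c(1, 1, 0, 5, 6, 4),
  x2 = c(4, 3, 4, 1, 2, 0)
)

set.seed(1)  # arbitrary seed, only for reproducibility of the random starts
km <- kmeans(X, centers = 2, nstart = 20)

print(km$cluster)
print(km$centers)

# Final labels from the manual procedure above.
manual <- c(2, 2, 2, 1, 1, 1)

# Each manual cluster should map onto exactly one kmeans cluster.
agreement <- table(manual, kmeans = km$cluster)
print(agreement)
```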

(f) Plot with Final Cluster Labels

plot(df$x1, df$x2, col = df$cl, pch = 19,
     xlab = "X1", ylab = "X2",
     main = "Final Clusters after K-means (K = 2)")
legend("topright", legend = c("Cluster 1", "Cluster 2"), col = c(1, 2), pch = 19)

4 (a) Fusion of Clusters {1,2,3} and {4,5}

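The fusion will occur at least as high on the complete linkage dendrogram, and strictly higher unless all dissimilarities between the two clusters are equal.

Explanation: Single linkage merges clusters using the smallest pairwise dissimilarity between them, while complete linkage uses the largest. The largest value can never be below the smallest, so the complete linkage height is never lower. The two heights coincide only when every dissimilarity between a member of {1,2,3} and a member of {4,5} is the same, which is rare. The example below uses a made-up matrix to illustrate the typical case.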

d_matrix <- matrix(c(
  0, 1, 1, 4, 5,
  1, 0, 1, 4, 5,
  1, 1, 0, 4, 5,
  4, 4, 4, 0, 1,
  5, 5, 5, 1, 0
), nrow = 5, byrow = TRUE)

colnames(d_matrix) <- rownames(d_matrix) <- c("1", "2", "3", "4", "5")

d <- as.dist(d_matrix)


hc_single <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Plot both
par(mfrow = c(1, 2))
plot(hc_single, main = "Single Linkage Dendrogram")
plot(hc_complete, main = "Complete Linkage Dendrogram")
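The fusion heights can also be printed instead of read off the plots. This sketch reuses the same made-up 5 x 5 matrix under the illustrative name `D5`; the final merge in each tree is the fusion of {1,2,3} with {4,5}.

```r
# Same illustrative dissimilarity matrix as above.
D5 <- matrix(c(
  0, 1, 1, 4, 5,
  1, 0, 1, 4, 5,
  1, 1, 0, 4, 5,
  4, 4, 4, 0, 1,
  5, 5, 5, 1, 0
), nrow = 5, byrow = TRUE)
d5 <- as.dist(D5)

# Height of the last (top) merge under each linkage.
h_single <- max(hclust(d5, method = "single")$height)
h_complete <- max(hclust(d5, method = "complete")$height)
print(c(single = h_single, complete = h_complete))

# The same two numbers are the smallest and largest between-cluster entries.
print(range(D5[1:3, 4:5]))
```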

(b) Fusion of Singleton Clusters {5} and {6}

The fusion will occur at the same height in both dendrograms.

Explanation:

When two individual observations are fused, there is only one distance between them. Single linkage and complete linkage both use that same distance for merging. So, the merge height is identical in both methods.

data_three <- matrix(c(
  1, 1,   # obs 5
  4, 4,   # obs 6
  10, 10  # extra obs 7 (to allow plotting)
), ncol = 2, byrow = TRUE)

rownames(data_three) <- c("5", "6", "7")


d3 <- dist(data_three)

hc_s3 <- hclust(d3, method = "single")
hc_c3 <- hclust(d3, method = "complete")


par(mfrow = c(1, 2))
plot(hc_s3, main = "Single Linkage (5,6,7)")
plot(hc_c3, main = "Complete Linkage (5,6,7)")
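Printing the merge heights makes the comparison explicit. The sketch below uses the same three made-up points (the coordinates and the extra observation 7 are illustrative, as above); `pts3`, `h_s`, and `h_c` are names introduced here.

```r
# Same three illustrative points as above.
pts3 <- rbind("5" = c(1, 1), "6" = c(4, 4), "7" = c(10, 10))
d <- dist(pts3)

h_s <- hclust(d, method = "single")$height
h_c <- hclust(d, method = "complete")$height

# Column 1: fusion of {5} and {6}. Column 2: fusion with observation 7.
print(rbind(single = h_s, complete = h_c))
```

The first merge height matches across methods because only one distance is involved; the second differs because it compares a two-member cluster with observation 7.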

ex-8

2025-04-26
