---
title: "R Notebook"
output: html_notebook
---


# 2. Hierarchical Clustering Analysis Using Dissimilarity Matrix

We have 4 observations: A, B, C, D (label them as 1, 2, 3, 4).

Matrix:

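|       | 1   | 2   | 3    | 4    |
|-------|-----|-----|------|------|
| **1** | 0   | 0.3 | 0.4  | 0.7  |
| **2** | 0.3 | 0   | 0.5  | 0.8  |
| **3** | 0.4 | 0.5 | 0    | 0.45 |
| **4** | 0.7 | 0.8 | 0.45 | 0    |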


# (a) Complete Linkage Dendrogram

Complete linkage: distance between two clusters = maximum pairwise distance between elements in the clusters.

Step-by-step:

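1. Closest pair: (1,2) at 0.3 → merge (1,2).
2. Compute distances from (1,2) to the other observations:
   - (1,2)–3: max(0.4, 0.5) = 0.5
   - (1,2)–4: max(0.7, 0.8) = 0.8
3. The smallest remaining distance is (3,4) at 0.45 → merge (3,4) at 0.45.
4. Remaining clusters: (1,2) and (3,4). The distance between them is the largest of the four between-cluster distances d(1,3), d(1,4), d(2,3), d(2,4): max(0.4, 0.7, 0.5, 0.8) = 0.8.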

Final merges:

- Merge (1,2) at height 0.3
- Merge (3,4) at height 0.45
- Merge ((1,2), (3,4)) at height 0.8
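
The between-cluster distances used in the hand calculation can be reproduced with a small helper. This is a minimal sketch: `linkage_dist` and `d_mat` are illustrative names introduced here (not part of base R), and the matrix is repeated so the chunk runs on its own.

```{r}
# Dissimilarity matrix from the problem statement
d_mat <- matrix(c(
  0,   0.3, 0.4,  0.7,
  0.3, 0,   0.5,  0.8,
  0.4, 0.5, 0,    0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)

# Hypothetical helper: distance between clusters `a` and `b` (vectors of
# observation indices), aggregating all between-cluster pairs with `fun`
linkage_dist <- function(m, a, b, fun = max) {
  fun(m[a, b])
}

linkage_dist(d_mat, c(1, 2), 3)        # (1,2)-3 under complete linkage
linkage_dist(d_mat, c(1, 2), 4)        # (1,2)-4 under complete linkage
linkage_dist(d_mat, c(1, 2), c(3, 4))  # (1,2)-(3,4) under complete linkage
```

Passing `fun = min` instead gives the corresponding single-linkage distances.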
 
# Dendrogram sketch:

```{r}
# Define the dissimilarity matrix
d <- matrix(c(
  0,   0.3, 0.4, 0.7,
  0.3, 0,   0.5, 0.8,
  0.4, 0.5, 0,   0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)

# Assign names to the observations
rownames(d) <- colnames(d) <- c("1", "2", "3", "4")

# Convert to a dist object
dist_matrix <- as.dist(d)

# Perform hierarchical clustering using complete linkage
hc_complete <- hclust(dist_matrix, method = "complete")

# Plot the dendrogram
plot(hc_complete, main = "Complete Linkage Dendrogram", xlab = "", sub = "", ylab = "Height")

# Optional: reorder leaves to match a visual layout (3-4 on the left, 1-2 on the right).
# reorder() sorts the branches at every node by weight (observation i gets weight wts[i]),
# so only the left/right positions change; the merges and heights stay the same.
dend_complete <- reorder(as.dendrogram(hc_complete), wts = c(3, 4, 1, 2), agglo.FUN = mean)
plot(dend_complete, main = "Reordered Dendrogram", ylab = "Height")
```


# (b) Single Linkage Dendrogram
Single linkage: distance between two clusters = minimum pairwise distance between elements in the clusters.

Step-by-step:

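1. Closest pair: (1,2) at 0.3 → merge (1,2).
2. Compute distances among the remaining clusters:
   - (1,2)–3: min(0.4, 0.5) = 0.4
   - (1,2)–4: min(0.7, 0.8) = 0.7
   - 3–4: 0.45
3. The smallest is (1,2)–3 at 0.4 (below 3–4 at 0.45) → merge (1,2) and 3 at 0.4.
4. (1,2,3)–4: min(0.7, 0.8, 0.45) = 0.45 → merge all.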

Final merges:

- Merge (1,2) at height 0.3
- Merge (1,2,3) at height 0.4
- Merge (1,2,3,4) at height 0.45

```{r}
# Define the dissimilarity matrix
d <- matrix(c(
  0,   0.3, 0.4, 0.7,
  0.3, 0,   0.5, 0.8,
  0.4, 0.5, 0,   0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)

# Assign names to the observations
rownames(d) <- colnames(d) <- c("1", "2", "3", "4")

# Convert to dist object
dist_matrix <- as.dist(d)

# Perform hierarchical clustering using single linkage
hc_single <- hclust(dist_matrix, method = "single")

# Plot the dendrogram
plot(hc_single, main = "Single Linkage Dendrogram", xlab = "", sub = "", ylab = "Height")

# Optional: reorder leaves to 1, 2, 3, 4 to match a hand drawing.
# reorder() sorts the branches at every node by weight; the merges are unchanged.
dend_single <- reorder(as.dendrogram(hc_single), wts = c(1, 2, 3, 4), agglo.FUN = mean)
plot(dend_single, main = "Reordered Single Linkage", ylab = "Height")
```
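
The merge heights from both hand calculations can be checked against the `height` and `merge` components of the `hclust` result. A minimal sketch, with the matrix repeated so the chunk is self-contained; `d_check`, `hc_s`, and `hc_c` are illustrative names.

```{r}
d_check <- as.dist(matrix(c(
  0,   0.3, 0.4,  0.7,
  0.3, 0,   0.5,  0.8,
  0.4, 0.5, 0,    0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE))

hc_s <- hclust(d_check, method = "single")
hc_c <- hclust(d_check, method = "complete")

hc_s$height  # single-linkage merge heights, in merge order
hc_c$height  # complete-linkage merge heights, in merge order

# Each row of `merge` is one step: negative entries are single observations,
# positive entries refer to the cluster formed at that earlier step
hc_s$merge
```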


# (c) Two clusters from (a) [Complete linkage]
Cut at any height between 0.45 and 0.8 (for example, just below 0.8):

- Cluster 1: (1,2)
- Cluster 2: (3,4)

# (d) Two clusters from (b) [Single linkage]
Cut at any height between 0.4 and 0.45 (for example, just below 0.45):

- Cluster 1: (1,2,3)
- Cluster 2: (4)
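
The two-cluster cuts in (c) and (d) can be reproduced with `cutree()`. A short sketch, assuming the same dissimilarity matrix; `d_cut`, `cut_complete`, and `cut_single` are illustrative names.

```{r}
d_cut <- as.dist(matrix(c(
  0,   0.3, 0.4,  0.7,
  0.3, 0,   0.5,  0.8,
  0.4, 0.5, 0,    0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE))

# k = 2 asks for two clusters; the result gives a cluster label per observation
cut_complete <- cutree(hclust(d_cut, method = "complete"), k = 2)
cut_single   <- cutree(hclust(d_cut, method = "single"),   k = 2)

cut_complete
cut_single
```

Cutting by height instead (`h = 0.6` for complete linkage, `h = 0.42` for single linkage) should give the same partitions, since those heights fall between the last two merges of each tree.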

# (e) Equivalent dendrogram to (a) with different leaf positions
We can swap subtrees or reorder leaves, e.g., place (2) before (1), or (4) before (3):

```{r}
# Define the dissimilarity matrix
d <- matrix(c(
  0,   0.3, 0.4, 0.7,
  0.3, 0,   0.5, 0.8,
  0.4, 0.5, 0,   0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE)

# Assign names to the observations
rownames(d) <- colnames(d) <- c("1", "2", "3", "4")

# Convert to dist object
dist_matrix <- as.dist(d)

# Perform complete linkage hierarchical clustering
hc_complete <- hclust(dist_matrix, method = "complete")

# Reorder leaves to 4, 3, 2, 1. reorder() sorts the branches at every node by
# weight (observation i gets weight wts[i]), so only left/right positions
# change; the merges and their heights stay the same.
dend_reordered <- reorder(as.dendrogram(hc_complete), wts = c(4, 3, 2, 1), agglo.FUN = mean)
plot(dend_reordered, main = "Complete Linkage (4, 3, 2, 1)", ylab = "Height")
```

This code ensures:

- Observations 4 and 3 are on the left side of the dendrogram.
- Observations 2 and 1 are on the right.
- Merge heights match the complete linkage steps.
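
One way to confirm that a reordered dendrogram is equivalent is to inspect its leaf order and merge heights directly. The sketch below uses base R's `reorder()` on a `dendrogram` object; `m_eq`, `dend`, and `dend_rev` are illustrative names, and the matrix is repeated so the chunk runs on its own.

```{r}
m_eq <- matrix(c(
  0,   0.3, 0.4,  0.7,
  0.3, 0,   0.5,  0.8,
  0.4, 0.5, 0,    0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4, byrow = TRUE,
dimnames = list(as.character(1:4), as.character(1:4)))

dend <- as.dendrogram(hclust(as.dist(m_eq), method = "complete"))

# Observation i gets weight wts[i]; branches are sorted by increasing weight
dend_rev <- reorder(dend, wts = c(4, 3, 2, 1), agglo.FUN = mean)

labels(dend_rev)                 # leaf order, left to right
attr(dend_rev, "height")         # height of the final merge
attr(dend_rev[[1]], "height")    # height of the left subtree's merge
attr(dend_rev[[2]], "height")    # height of the right subtree's merge
```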

# 3. Manual K-Means Clustering on a Small 2D Dataset (K = 2, n = 6)

Given Data Table

| Obs | X1 | X2 |
|-----|----|----|
| 1   | 1  | 4  |
| 2   | 2  | 3  |
| 3   | 3  | 0  |
| 4   | 4  | 5  |
| 5   | 6  | 2  |
| 6   | 6  | 4  |

# (a) Plot the observations

```{r}
data <- data.frame(
  Obs = 1:6,
  X1 = c(1, 2, 3, 4, 6, 6),
  X2 = c(4, 3, 0, 5, 2, 4)
)

plot(data$X1, data$X2, pch = 19, xlab = "X1", ylab = "X2", main = "Observations")
text(data$X1, data$X2, labels = data$Obs, pos = 3)
```

We plotted the 6 observations on a 2D scatterplot using X1 and X2 as coordinates. Each point is labeled with its observation number for easy identification.

# (b) Randomly assign a cluster label

```{r}
set.seed(123)  # for reproducibility
data$Cluster <- sample(c(1, 2), size = 6, replace = TRUE)
data[, c("Obs", "Cluster")]
```

# (c) Compute centroid for each cluster
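The hand calculation below uses the initial assignment Cluster 1 = {1, 4} and Cluster 2 = {2, 3, 5, 6}. The labels drawn by `sample()` depend on the seed and the R version, so the output of the chunk in (b) may differ from this assignment.

- Cluster 1: Obs 1 and 4 → points (1,4) and (4,5); centroid = ((1+4)/2, (4+5)/2) = (2.5, 4.5)
- Cluster 2: Obs 2, 3, 5, 6 → points (2,3), (3,0), (6,2), (6,4); centroid = ((2+3+6+6)/4, (3+0+2+4)/4) = (4.25, 2.25)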
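
The centroid arithmetic can be checked in R. A minimal sketch, assuming the initial assignment {1, 4} / {2, 3, 5, 6} from the hand calculation rather than a random draw; `pts_c`, `init_labels`, and `centroids` are illustrative names.

```{r}
pts_c <- data.frame(
  X1 = c(1, 2, 3, 4, 6, 6),
  X2 = c(4, 3, 0, 5, 2, 4)
)

# Assumed initial labels from the hand calculation (not drawn by sample())
init_labels <- c(1, 2, 2, 1, 2, 2)

# Mean of X1 and X2 within each cluster
centroids <- aggregate(pts_c, by = list(Cluster = init_labels), FUN = mean)
centroids
```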

# (d) Reassign Each Observation

Distances are Euclidean, measured to C1 = (2.5, 4.5) and C2 = (4.25, 2.25). Each observation is assigned to the nearer centroid.

| Obs | Coordinates (X1, X2) | Distance to C1 | Distance to C2 | Assigned Cluster |
|-----|----------------------|----------------|----------------|------------------|
| 1   | (1, 4)               | 1.58           | 3.69           | 1                |
| 2   | (2, 3)               | 1.58           | 2.37           | 1                |
| 3   | (3, 0)               | 4.53           | 2.57           | 2                |
| 4   | (4, 5)               | 1.58           | 2.76           | 1                |
| 5   | (6, 2)               | 4.30           | 1.77           | 2                |
| 6   | (6, 4)               | 3.54           | 2.47           | 2                |
 

Reassigned clusters:

- Cluster 1: Obs 1, 2, 4
- Cluster 2: Obs 3, 5, 6
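
The distance table and the reassignment can be reproduced as follows. This is a sketch that takes the centroids (2.5, 4.5) and (4.25, 2.25) from part (c) as given; `pts_d`, `cen_d`, `dists`, and `reassigned` are illustrative names.

```{r}
pts_d <- cbind(
  X1 = c(1, 2, 3, 4, 6, 6),
  X2 = c(4, 3, 0, 5, 2, 4)
)
cen_d <- rbind(C1 = c(2.5, 4.5), C2 = c(4.25, 2.25))

# Euclidean distance from every observation (rows) to each centroid (columns):
# sweep() subtracts the centroid from every row before squaring and summing
dists <- sapply(1:2, function(k) sqrt(rowSums(sweep(pts_d, 2, cen_d[k, ])^2)))
colnames(dists) <- c("to_C1", "to_C2")
round(dists, 2)

# Assign each observation to the nearer centroid
reassigned <- apply(dists, 1, which.min)
reassigned
```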

# (e) Repeat (c) and (d) until convergence
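Recompute the centroids and reassign the observations until the cluster labels stop changing.

With Cluster 1 = {1, 2, 4} and Cluster 2 = {3, 5, 6}, the new centroids are C1 = (7/3, 4) ≈ (2.33, 4) and C2 = (5, 2). Every observation is still closer to its own centroid than to the other one, so the labels do not change and the algorithm has converged.

Final cluster assignment (after convergence):

| Obs | Cluster |
|-----|---------|
| 1   | 1       |
| 2   | 1       |
| 3   | 2       |
| 4   | 1       |
| 5   | 2       |
| 6   | 2       |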
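
The full iteration can be sketched as a short loop. It starts from the assumed initial assignment {1, 4} / {2, 3, 5, 6} used in the hand calculation; `pts_km`, `labels_km`, and `cen_km` are illustrative names, and the loop does not guard against a cluster becoming empty, which does not happen for this data.

```{r}
pts_km <- cbind(
  X1 = c(1, 2, 3, 4, 6, 6),
  X2 = c(4, 3, 0, 5, 2, 4)
)
labels_km <- c(1, 2, 2, 1, 2, 2)  # assumed initial labels

repeat {
  # Step (c): centroid of each cluster
  cen_km <- rbind(
    colMeans(pts_km[labels_km == 1, , drop = FALSE]),
    colMeans(pts_km[labels_km == 2, , drop = FALSE])
  )
  # Step (d): Euclidean distance to each centroid, then nearest centroid
  d_km <- sapply(1:2, function(k) sqrt(rowSums(sweep(pts_km, 2, cen_km[k, ])^2)))
  new_labels <- apply(d_km, 1, which.min)

  # Stop once no label changes
  if (all(new_labels == labels_km)) break
  labels_km <- new_labels
}

labels_km  # converged assignment
cen_km     # converged centroids
```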

# (f) Plot with final cluster coloring

```{r}
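# Use the converged assignment ({1, 2, 4} and {3, 5, 6}) rather than the
# random initial labels from (b)
data$Cluster <- c(1, 1, 2, 1, 2, 2)

plot(data$X1, data$X2, col = data$Cluster, pch = 19,
     xlab = "X1", ylab = "X2", main = "Final Clusters")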
text(data$X1, data$X2, labels = data$Obs, pos = 3)
legend("topright", legend = c("Cluster 1", "Cluster 2"), col = 1:2, pch = 19)
```

# Hierarchical Clustering Comparisons

# (a) Fusion of {1,2,3} and {4,5}
In single linkage, the distance between two clusters is defined as the minimum distance between any pair of points (one from each cluster).
In complete linkage, it's the maximum distance between such pairs.

So, for the same two clusters {1,2,3} and {4,5}:

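- Single linkage fuses them at the smallest distance between any point in {1,2,3} and any point in {4,5}.
- Complete linkage fuses them at the largest such distance.

Since the maximum of a set of distances is never smaller than its minimum, the complete-linkage fusion occurs at a height greater than or equal to the single-linkage fusion. The two heights coincide only if all six between-cluster distances are equal.

Answer: The fusion occurs at least as high in the complete linkage dendrogram, and strictly higher unless all between-cluster distances are equal. Without the actual distances, the tie cannot be ruled out.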
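
A small numeric illustration of the comparison. The five 1-D positions below are made up for this example (they are not from the exercise); `x_toy`, `m_toy`, and `between` are illustrative names.

```{r}
# Hypothetical positions: points 1-3 form one cluster, points 4-5 the other
x_toy <- c(0, 1, 2, 5, 6)
m_toy <- as.matrix(dist(x_toy))

# The six between-cluster distances: rows = {1,2,3}, columns = {4,5}
between <- m_toy[1:3, 4:5]

single_height   <- min(between)  # single-linkage fusion height
complete_height <- max(between)  # complete-linkage fusion height
c(single = single_height, complete = complete_height)
```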

# (b) Fusion of {5} and {6}

Since these are single-element clusters, the distance between them is simply the Euclidean distance (or whatever metric is used) between point 5 and point 6.

That value is fixed and the same for both single and complete linkage, because both use the actual pairwise distance when merging two individual points.

Answer: They will fuse at the same height in both dendrograms.
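
The same point can be seen with `hclust()`. In the made-up 1-D data below (not from the exercise), points 5 and 6 are the closest pair, so they fuse first as singletons under both methods; `x6`, `hc6_single`, and `hc6_complete` are illustrative names.

```{r}
# Hypothetical positions; points 5 and 6 are 0.5 apart, the smallest distance
x6 <- c(0, 1, 2, 5, 6, 6.5)

hc6_single   <- hclust(dist(x6), method = "single")
hc6_complete <- hclust(dist(x6), method = "complete")

# First merge in each tree: which observations fuse, and at what height
hc6_single$merge[1, ];   hc6_single$height[1]
hc6_complete$merge[1, ]; hc6_complete$height[1]
```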

