1. Suppose that we have four observations, for which we compute a dissimilarity matrix, given by

         Obs. 1   Obs. 2   Obs. 3   Obs. 4
Obs. 1     —       0.3      0.4      0.7
Obs. 2    0.3       —       0.5      0.8
Obs. 3    0.4      0.5       —       0.45
Obs. 4    0.7      0.8      0.45      —

For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.
  1. On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.

A:
Step 1: The smallest dissimilarity is 0.3, between observations 1 and 2, so merge {1, 2} at height 0.3.
Step 2: Under complete linkage, d({1, 2}, 3) = max(0.4, 0.5) = 0.5 and d({1, 2}, 4) = max(0.7, 0.8) = 0.8, while d(3, 4) = 0.45. The smallest of these is 0.45, so merge {3, 4} at height 0.45.
Step 3: Finally, merge {1, 2} and {3, 4} at the largest dissimilarity between the two clusters: max(0.4, 0.7, 0.5, 0.8) = 0.8.
Fusions: {1, 2} at 0.3; {3, 4} at 0.45; final merge at 0.8. (A quick check with hclust() is sketched after the dendrogram below.)
[Figure: complete linkage dendrogram with leaves 1, 2, 3, 4 and fusions at heights 0.3, 0.45, and 0.8]
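As a quick sanity check (not part of the hand-worked answer), the same fusions can be reproduced in R by passing the dissimilarity matrix to hclust():

# Dissimilarity matrix from the table above
d <- as.dist(matrix(c(0,    0.3,  0.4,  0.7,
                      0.3,  0,    0.5,  0.8,
                      0.4,  0.5,  0,    0.45,
                      0.7,  0.8,  0.45, 0),
                    nrow = 4))
hc_complete <- hclust(d, method = "complete")
hc_complete$height  # 0.30 0.45 0.80 -- fusion heights
plot(hc_complete)   # dendrogram for part (a)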
  1. Repeat (a), this time using single linkage clustering.

A:
Step 1: Merge {1, 2} at height 0.3 (smallest dissimilarity).
Step 2: Under single linkage, d({1, 2}, 3) = min(0.4, 0.5) = 0.4, d({1, 2}, 4) = min(0.7, 0.8) = 0.7, and d(3, 4) = 0.45. The smallest of these is 0.4, so observation 3 joins {1, 2} at height 0.4.
Step 3: Finally, {1, 2, 3} and {4} merge at min(0.7, 0.8, 0.45) = 0.45.
Fusions: {1, 2} at 0.3; {1, 2, 3} at 0.4; final merge at 0.45. (The corresponding hclust() check follows the figure below.)
[Figure: single linkage dendrogram with leaves 1, 2, 3, 4 and fusions at heights 0.3, 0.4, and 0.45]
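The same check for single linkage, reusing the dissimilarity object d defined above:

hc_single <- hclust(d, method = "single")
hc_single$height  # 0.30 0.40 0.45 -- fusion heights
plot(hc_single)   # dendrogram for part (b)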
  1. Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster? A: Cutting the dendrogram from (a) anywhere between heights 0.45 and 0.8, i.e. just below the final merge, yields two clusters: Cluster 1: {1, 2}; Cluster 2: {3, 4}.

  2. Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster? A: Cutting the dendrogram from (b) anywhere between heights 0.4 and 0.45 yields two clusters: Cluster 1: {1, 2, 3}; Cluster 2: {4}.

  3. It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the meaning of the dendrogram is the same. A: Swapping the two branches at any internal node changes the left-to-right order of the leaves but not the fusion heights, so the dendrogram from (a) can be redrawn with the {3, 4} branch on the left and the {1, 2} branch on the right (or with leaves 1 and 2 swapped). A small R illustration follows the figure below.

[Figure: dendrogram equivalent to (a), with the positions of the {1, 2} and {3, 4} branches swapped]
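One way to draw such an equivalent tree in R is to mirror the dendrogram: rev() applied to a dendrogram object reverses the leaf order without changing any fusion heights (this reuses hc_complete from the sketch above).

dend <- as.dendrogram(hc_complete)
par(mfrow = c(1, 2))
plot(dend,      main = "Dendrogram from (a)")
plot(rev(dend), main = "Leaves repositioned, same meaning")
par(mfrow = c(1, 1))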
  1. In this problem, you will perform K-means clustering manually, with K = 2, on a small example with n = 6 observations and p = 2 features. The observations are as follows.

Obs.  X1  X2
 1     1   4
 2     1   3
 3     0   4
 4     5   1
 5     6   2
 6     4   0
  1. Plot the observations.
# The six observations from the table above
x1 <- c(1, 1, 0, 5, 6, 4)
x2 <- c(4, 3, 4, 1, 2, 0)
df <- data.frame(Obs = 1:6, x1 = x1, x2 = x2)
library(ggplot2)
ggplot(df, aes(x = x1, y = x2)) + geom_point()

  1. Randomly assign a cluster label to each observation. You can use the sample() command in R to do this. Report the cluster labels for each observation.
set.seed(1234)  # for reproducibility of the random assignment
df$cl <- sample(1:2, nrow(df), replace = TRUE)

df[, c("Obs", "x1", "x2", "cl")]
  1. Compute the centroid for each cluster. Now let's compute the centroids of the current clusters, i.e. the mean of x1 and x2 within each cluster.
library(dplyr)
# Centroid of each cluster = mean of x1 and x2 over its current members
centroids <- df %>%
  group_by(cl) %>%
  summarise(cx = mean(x1), cy = mean(x2), .groups = "drop")

centroids
  1. Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.

Each point is now reassigned to the cluster whose centroid is closest, using Euclidean distance.

assign_clusters <- function(data, centers) {
  apply(data, 1, function(row) {
    # Euclidean distance from this observation to each centroid
    dists <- apply(centers[, c("cx", "cy")], 1, function(cen)
      sqrt((row["x1"] - cen["cx"])^2 + (row["x2"] - cen["cy"])^2))
    # Return the label of the nearest centroid
    centers$cl[which.min(dists)]
  })
}

df$newCl <- assign_clusters(df[, c("x1", "x2")], centroids)

df[, c("Obs", "x1", "x2", "cl", "newCl")]
  1. Repeat (c) and (d) until the answers obtained stop changing.
while (!all(df$cl == df$newCl)) {
  df$cl <- df$newCl
  
  # Recalculate centroids
  centroids <- df %>%
    group_by(cl) %>%
    summarise(cx = mean(x1), cy = mean(x2), .groups = "drop")
  
  # Reassign clusters
  df$newCl <- assign_clusters(df[, c("x1", "x2")], centroids)
}

# The loop has converged; record the final cluster labels
df$final_cl <- df$cl
df
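As a cross-check (not required by the exercise), the built-in kmeans() function should give the same partition as the manual loop above, up to a possible swap of the cluster labels:

km <- kmeans(df[, c("x1", "x2")], centers = 2, nstart = 20)
km$cluster   # compare with df$final_cl (labels 1/2 may be swapped)
km$centers   # compare with the centroids computed above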
  1. In your plot from (a), color the observations according to the cluster labels obtained.
ggplot(df, aes(x = x1, y = x2, color = factor(final_cl))) +
  geom_point(size = 4) +
  labs(title = "Final K-means Clustering", x = "X1", y = "X2", color = "Cluster") +
  theme_minimal()
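As an optional embellishment, the converged centroids can be recomputed from the final labels and overlaid on the same plot:

final_centroids <- df %>%
  group_by(final_cl) %>%
  summarise(cx = mean(x1), cy = mean(x2), .groups = "drop")

ggplot(df, aes(x = x1, y = x2, color = factor(final_cl))) +
  geom_point(size = 4) +
  geom_point(data = final_centroids,
             aes(x = cx, y = cy, color = factor(final_cl)),
             shape = 4, size = 5, stroke = 2) +
  labs(title = "Final K-means Clustering with Centroids",
       x = "X1", y = "X2", color = "Cluster") +
  theme_minimal()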

  1. Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms.
  1. At a certain point on the single linkage dendrogram, the clusters {1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram, the clusters {1, 2, 3} and {4, 5} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

A: Complete linkage measures the dissimilarity between two clusters by the maximum pairwise dissimilarity between their members, while single linkage uses the minimum. The maximum over the pairs (i, j) with i in {1, 2, 3} and j in {4, 5} is at least as large as the minimum over those same pairs, so the fusion of {1, 2, 3} and {4, 5} occurs at least as high on the complete linkage dendrogram; it is strictly higher unless all of the cross-cluster dissimilarities happen to be equal, in which case the two fusions occur at the same height.
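A small illustration with made-up dissimilarities (hypothetical data, not from the exercise): for the same pair of clusters, the complete linkage fusion height is the maximum of the cross-cluster dissimilarities and the single linkage height is the minimum, so the former is never smaller.

set.seed(42)
X <- matrix(rnorm(10), ncol = 2)   # 5 hypothetical observations
D <- as.matrix(dist(X))
cross <- D[1:3, 4:5]               # dissimilarities between {1, 2, 3} and {4, 5}
max(cross)  # height at which the two clusters would fuse under complete linkage
min(cross)  # height under single linkage; never larger than the max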

  1. At a certain point on the single linkage dendrogram, the clusters {5} and {6} fuse. On the complete linkage dendrogram, the clusters {5} and {6} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

A: When the singleton clusters {5} and {6} fuse, there is only one pairwise dissimilarity to consider, d(5, 6), so the minimum and the maximum coincide. The fusion therefore occurs at the same height, d(5, 6), on both dendrograms.