2. Suppose that we have four observations, for which we compute a dissimilarity matrix, given by:

        0     0.3   0.4   0.7
        0.3   0     0.5   0.8
        0.4   0.5   0     0.45
        0.7   0.8   0.45  0

For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.

(a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.

# Create the dissimilarity matrix
diss_matrix <- matrix(c(
  0, 0.3, 0.4, 0.7,
  0.3, 0, 0.5, 0.8,
  0.4, 0.5, 0, 0.45,
  0.7, 0.8, 0.45, 0
), nrow = 4)

# Convert to dist object
dist_obj <- as.dist(diss_matrix)

# Perform hierarchical clustering with complete linkage
hc_complete <- hclust(dist_obj, method = "complete")

# Plot the dendrogram
plot(hc_complete, main = "Hierarchical Clustering with Complete Linkage",
     xlab = "", sub = "", cex = 0.9)

Explanation: This code builds the dissimilarity matrix, converts it to a dist object, and clusters with complete linkage. Reading the resulting dendrogram: observations 1 and 2 fuse at height 0.3, observations 3 and 4 fuse at height 0.45, and the two pairs fuse at height 0.8, the largest dissimilarity between the groups (d(2, 4) = 0.8).
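
As a quick sanity check, the fitted hclust object stores the fusion heights and merge order directly; for this matrix the heights should come out as 0.30, 0.45, and 0.80:

# Fusion heights and merge order from the fitted object
hc_complete$height  # expected: 0.30 0.45 0.80
hc_complete$merge   # row i gives the two clusters fused at height[i]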

(b) Repeat (a), this time using single linkage clustering.

# Perform hierarchical clustering with single linkage
hc_single <- hclust(dist_obj, method = "single")

# Plot the dendrogram
plot(hc_single, main = "Hierarchical Clustering with Single Linkage",
     xlab = "", sub = "", cex = 0.9)

Explanation: This code performs hierarchical clustering with single linkage on the same distance object. Here observations 1 and 2 again fuse at height 0.3, observation 3 joins them at height 0.4 (the smaller of d(1, 3) = 0.4 and d(2, 3) = 0.5), and observation 4 joins last at height 0.45 (via d(3, 4) = 0.45).
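
The same check applies here; under single linkage the heights should be 0.30, 0.40, and 0.45:

# Fusion heights for single linkage
hc_single$height  # expected: 0.30 0.40 0.45
hc_single$merge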

(c) Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster?

# Cut the complete linkage dendrogram to get 2 clusters
clusters_complete <- cutree(hc_complete, k = 2)
print("Observations in each cluster (Complete Linkage):")
## [1] "Observations in each cluster (Complete Linkage):"
print(clusters_complete)
## [1] 1 1 2 2

Explanation: Cutting the complete linkage dendrogram into two clusters yields {1, 2} and {3, 4}: the cut falls just below the final fusion at height 0.8.
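
To list the members of each cluster explicitly, split() groups the observation indices by label (a small convenience sketch):

# Group observation indices by cluster label
split(1:4, clusters_complete)  # cluster 1: obs 1, 2; cluster 2: obs 3, 4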

(d) Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?

# Cut the single linkage dendrogram to get 2 clusters
clusters_single <- cutree(hc_single, k = 2)
print("Observations in each cluster (Single Linkage):")
## [1] "Observations in each cluster (Single Linkage):"
print(clusters_single)
## [1] 1 1 1 2

Explanation: Cutting the single linkage dendrogram into two clusters yields {1, 2, 3} and {4}: single linkage chains observations 1, 2, and 3 together, leaving observation 4 on its own.
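
The same split() idiom shows the unbalanced partition produced by single linkage:

# Group observation indices by cluster label
split(1:4, clusters_single)  # cluster 1: obs 1, 2, 3; cluster 2: obs 4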

(e) It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the meaning of the dendrogram is the same.

# Create a dendrogram object
dend_complete <- as.dendrogram(hc_complete)

# Swapping the two branches at any fusion leaves the meaning unchanged;
# rev() mirrors the tree, swapping the branches at every internal node
dend_swapped <- rev(dend_complete)

# Plot the repositioned dendrogram
plot(dend_swapped, main = "Equivalent Dendrogram with Repositioned Leaves")

Explanation: rev() reverses the left-to-right order of the branches at every fusion, so the leaves appear in a different order while every fusion still occurs at the same height between the same clusters; the dendrogram's meaning is unchanged.
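
One way to confirm the two trees are equivalent (a sketch using standard dendrogram accessors): the leaf order changes, but the heights at which fusions occur do not.

# Leaf order differs between the two plots...
order.dendrogram(dend_complete)
order.dendrogram(dend_swapped)   # same indices, reversed

# ...but the root (and every internal node) sits at the same height
attr(dend_complete, "height") == attr(dend_swapped, "height")  # TRUE: 0.8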

3. In this problem, you will perform K-means clustering manually, with K = 2, on a small example with n = 6 observations and p = 2 features. The observations are as follows.

   Obs.  X1  X2
   1     1   4
   2     1   3
   3     0   4
   4     5   1
   5     6   2
   6     4   0

(a) Plot the observations.

# Create dataset
data <- data.frame(
  X1 = c(1, 1, 0, 5, 6, 4),
  X2 = c(4, 3, 4, 1, 2, 0)
)

# Plot observations
plot(data$X1, data$X2, xlim = c(0, 7), ylim = c(0, 5), 
     main = "Scatter Plot of Observations",
     xlab = "X1", ylab = "X2", pch = 19, col = "blue")
text(data$X1, data$X2, labels = 1:6, pos = 3)

Explanation: This code creates the dataset and plots the observations with their indices.

(b) Randomly assign a cluster label to each observation. You can use the sample() command in R to do this. Report the cluster labels for each observation.

# Set seed for reproducibility
set.seed(123)

# Randomly assign cluster labels
initial_clusters <- sample(1:2, 6, replace = TRUE)
print("Initial random cluster assignments:")
## [1] "Initial random cluster assignments:"
print(data.frame(Observation = 1:6, Cluster = initial_clusters))
##   Observation Cluster
## 1           1       1
## 2           2       1
## 3           3       1
## 4           4       2
## 5           5       1
## 6           6       2

Explanation: This code randomly assigns each observation to one of two clusters.

(c) Compute the centroid for each cluster.

# Function to compute centroids
compute_centroids <- function(data, clusters) {
  centroids <- matrix(0, nrow = 2, ncol = 2)
  for (i in 1:2) {
    if (sum(clusters == i) > 0) {
      centroids[i,] <- colMeans(data[clusters == i, , drop = FALSE])
    }
  }
  return(centroids)
}

# Compute initial centroids
centroids <- compute_centroids(data, initial_clusters)
print("Initial centroids:")
## [1] "Initial centroids:"
print(centroids)
##      [,1] [,2]
## [1,]  2.0 3.25
## [2,]  4.5 0.50

Explanation: This code computes the centroid for each initial cluster.
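
As a hand check against the printed values: given the random labels above, cluster 1 contains observations 1, 2, 3, and 5, so its centroid is just the mean of those four rows:

# Centroid of cluster 1 by hand (obs 1, 2, 3, 5)
c(mean(c(1, 1, 0, 6)), mean(c(4, 3, 4, 2)))  # (2.00, 3.25), matching row 1 above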

(d) Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.

# Function to assign observations to closest centroid
assign_clusters <- function(data, centroids) {
  n <- nrow(data)
  clusters <- numeric(n)
  
  for (i in 1:n) {
    # Calculate distances to each centroid
    dist1 <- sqrt(sum((data[i,] - centroids[1,])^2))
    dist2 <- sqrt(sum((data[i,] - centroids[2,])^2))
    
    # Assign to closest centroid
    clusters[i] <- ifelse(dist1 < dist2, 1, 2)
  }
  return(clusters)
}

# Assign observations to closest centroid
new_clusters <- assign_clusters(data, centroids)
print("Updated cluster assignments:")
## [1] "Updated cluster assignments:"
print(data.frame(Observation = 1:6, Cluster = new_clusters))
##   Observation Cluster
## 1           1       1
## 2           2       1
## 3           3       1
## 4           4       2
## 5           5       2
## 6           6       2

Explanation: This code assigns each observation to the closest centroid based on Euclidean distance.
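
For instance, observation 5 = (6, 2) switches from cluster 1 to cluster 2 because it lies closer to the second centroid:

# Distances from observation 5 to the two initial centroids
sqrt(sum((c(6, 2) - c(2.0, 3.25))^2))  # ~4.19 to centroid 1
sqrt(sum((c(6, 2) - c(4.5, 0.50))^2))  # ~2.12 to centroid 2, so cluster 2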

(e) Repeat (c) and (d) until the answers obtained stop changing.

# Iterate until convergence
old_clusters <- initial_clusters
new_clusters <- assign_clusters(data, compute_centroids(data, old_clusters))

iteration <- 1
while (!all(old_clusters == new_clusters)) {
  print(paste("Iteration", iteration))
  print("Clusters:")
  print(new_clusters)
  
  # Compute new centroids
  centroids <- compute_centroids(data, new_clusters)
  print("Centroids:")
  print(centroids)
  
  # Update clusters
  old_clusters <- new_clusters
  new_clusters <- assign_clusters(data, centroids)
  
  iteration <- iteration + 1
}
## [1] "Iteration 1"
## [1] "Clusters:"
## [1] 1 1 1 2 2 2
## [1] "Centroids:"
##           [,1]     [,2]
## [1,] 0.6666667 3.666667
## [2,] 5.0000000 1.000000
print("Final cluster assignments:")
## [1] "Final cluster assignments:"
print(data.frame(Observation = 1:6, Cluster = new_clusters))
##   Observation Cluster
## 1           1       1
## 2           2       1
## 3           3       1
## 4           4       2
## 5           5       2
## 6           6       2

Explanation: This code iterates the process of computing centroids and assigning observations until convergence.
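
As an optional cross-check (not part of the manual exercise), R's built-in kmeans() should recover the same partition on this well-separated data, though its cluster numbering may be swapped:

# Cross-check with the built-in implementation
set.seed(123)
km <- kmeans(data, centers = 2, nstart = 20)
km$cluster  # same partition: {1, 2, 3} vs {4, 5, 6}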

(f) In your plot from (a), color the observations according to the cluster labels obtained.

# Plot with final cluster colors
plot(data$X1, data$X2, xlim = c(0, 7), ylim = c(0, 5), 
     main = "Final K-means Clustering",
     xlab = "X1", ylab = "X2", pch = 19, 
     col = c("red", "blue")[new_clusters])
text(data$X1, data$X2, labels = 1:6, pos = 3)

# Add centroids to the plot
final_centroids <- compute_centroids(data, new_clusters)
points(final_centroids[,1], final_centroids[,2], pch = 8, 
       col = c("red", "blue"), cex = 2)
legend("topright", legend = c("Cluster 1", "Cluster 2"), 
       col = c("red", "blue"), pch = 19)

Explanation: This code plots the observations colored by their final cluster assignments and shows the final centroids.

4. Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms.

(a) At a certain point on the single linkage dendrogram, the clusters {1,2,3} and {4,5} fuse. On the complete linkage dendrogram, the clusters {1,2,3} and {4,5} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

# A small illustrative example on random data, to compare the two
# linkages side by side
set.seed(123)
example_dist <- dist(matrix(rnorm(5 * 2), ncol = 2))

# Perform hierarchical clustering
hc_single <- hclust(example_dist, method = "single")
hc_complete <- hclust(example_dist, method = "complete")

# Plot both dendrograms side by side
par(mfrow=c(1,2))
plot(hc_single, main="Single Linkage")
plot(hc_complete, main="Complete Linkage")

Explanation: The single linkage fusion occurs at the minimum dissimilarity between the two groups, while the complete linkage fusion occurs at the maximum. The complete linkage fusion therefore occurs at least as high on the tree, and strictly higher unless all inter-cluster dissimilarities happen to be equal. Since the individual dissimilarities are not given, we cannot rule out a tie, so strictly speaking there is not enough information to say whether the complete linkage fusion is higher or at the same height.
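
To make this concrete, here is a sketch with six hypothetical inter-cluster dissimilarities between {1,2,3} and {4,5} (the values are made up): single linkage fuses at the minimum, complete linkage at the maximum, so they coincide only if all six values are equal.

# Hypothetical pairwise dissimilarities between {1,2,3} and {4,5}
d_between <- c(2.1, 2.5, 3.0, 2.2, 2.8, 3.4)
min(d_between)  # single-linkage fusion height: 2.1
max(d_between)  # complete-linkage fusion height: 3.4 (higher here)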

(b) At a certain point on the single linkage dendrogram, the clusters {5} and {6} fuse. On the complete linkage dendrogram, the clusters {5} and {6} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

# For the singleton clusters {5} and {6}, there is only one inter-cluster
# distance, d(5, 6), so the minimum (single linkage) and the maximum
# (complete linkage) coincide: both dendrograms fuse {5} and {6} at
# exactly d(5, 6) from the original distance matrix.

Explanation: The fusion will occur at the same height in both dendrograms because, with two singleton clusters, there is only one pairwise distance, so the minimum and maximum inter-cluster dissimilarities coincide.
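
This can also be seen on any fitted pair of trees: the very first fusion in agglomerative clustering always joins two singletons (the globally closest pair), so its height is identical under single and complete linkage. Using the random example from part (a):

# The first fusion joins two singletons, so both linkages agree on its height
hc_single$height[1] == hc_complete$height[1]  # TRUE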