In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In practice, agglomerative algorithms compute a distance matrix between all existing clusters and, at each step, merge the pair of clusters that is closest according to the chosen linkage criterion.
The linkage criterion determines the distance between two sets of observations as a function of the pairwise distances between their members. Some commonly used linkage criteria between two sets of observations A and B are:
Single linkage: the minimum pairwise distance, min{ d(a, b) : a in A, b in B }.
Complete linkage: the maximum pairwise distance, max{ d(a, b) : a in A, b in B }.
Average linkage: the mean of all pairwise distances between members of A and B.
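As a concrete illustration (not part of the original analysis), the short helper below computes these three linkage distances between two clusters from a precomputed distance matrix; the function name linkage_dist and the toy data are made up for this sketch.
# Sketch: linkage distance between two clusters A and B, given a full
# pairwise distance matrix D whose rows/columns index the observations.
linkage_dist <- function(D, A, B, method = c("single", "complete", "average")) {
  method <- match.arg(method)
  d_AB <- D[A, B, drop = FALSE]      # all pairwise distances between A and B
  switch(method,
         single   = min(d_AB),       # closest pair
         complete = max(d_AB),       # farthest pair
         average  = mean(d_AB))      # mean over all pairs
}
# Toy example:
set.seed(1)
X <- matrix(rnorm(12), ncol = 2)
D <- as.matrix(dist(X))
linkage_dist(D, A = 1:3, B = 4:6, method = "average")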
Data:
The data file ramusbonelength.txt records the ramus bone length of 20 boys at ages 8, 8.5, 9, and 9.5 years.
Objective:
Carry out different hierarchical cluster analysis techniques on the ramus bone length data set, cluster the boys based on their measured growth over the four ages, and compare the results.
Research Questions
Perform hierarchical clustering using the single, average, and complete methods to obtain k = 4 groups. Compare the results. For each unique clustering, make a plot of the first two principal components of each individual and indicate the cluster to which each point belongs.
One way in which hierarchical methods are compared is their robustness to error. Add noise to the bone data by adding independent normal random variables with mean 0 and standard deviation 0.25 to each measurement. Repeat the analyses from part (1). Which of the methods, if any, had their clustering change due to the added noise?
Another way in which hierarchical methods are compared is their robustness to outliers. Add the observation x21 = (47.7, 48.8, 45.7, 45.3)^T to the data set and rerun the analyses from part (1). Which of the methods, if any, had their clustering change due to the outlier?
Run a k-means analysis for k = 4 groups on the ramus bone data. Perform the analysis three times with different random seeds to begin the procedure. Do the three k-means analyses return the same grouping? How do the groupings from the k-means procedure compare to those from the hierarchical methods? (See the sketch after this list for one way to set up the noise, outlier, and k-means variations.)
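Only part (1) is worked through in the remainder of this section. As a rough sketch, not run here, of how parts (2) through (4) could be set up, assuming the raw data frame ramus loaded in the Data Preparation section below and using arbitrary seeds:
# Part (2): add independent normal noise (mean 0, sd 0.25) to each raw measurement.
set.seed(123)                                   # arbitrary seed
noisy <- ramus
noisy[, 2:5] <- as.matrix(noisy[, 2:5]) +
  matrix(rnorm(nrow(ramus) * 4, mean = 0, sd = 0.25), nrow = nrow(ramus))
# Part (3): append the outlier x21 = (47.7, 48.8, 45.7, 45.3)^T to the raw data.
outlier <- data.frame(Individual = 21, X8yr = 47.7, X8.5yr = 48.8,
                      X9yr = 45.7, X9.5yr = 45.3)
ramus_out <- rbind(ramus, outlier)
# Part (4): k-means with k = 4, run three times from different random seeds
# (scaling the measurements as in the hierarchical analyses).
for (s in 1:3) {                                # arbitrary seeds
  set.seed(s)
  km <- kmeans(scale(ramus[, 2:5]), centers = 4)
  print(km$cluster)
}
# Each modified data set would then be re-scaled and passed through the same
# hierarchical clustering steps as in part (1).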
Data Preparation
library(tidyverse) # Data manipulation.
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(factoextra) # Clustering Visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Clear the workspace:
rm(list = ls())
# Load data
ramus <- read.table("ramusbonelength.txt", header = TRUE)
head(ramus)
## Individual X8yr X8.5yr X9yr X9.5yr
## 1 1 47.8 48.8 49.0 49.7
## 2 2 46.4 47.3 47.7 48.4
## 3 3 46.3 46.8 47.8 48.5
## 4 4 45.1 45.3 46.1 47.2
## 5 5 47.6 48.5 48.9 49.3
## 6 6 52.5 53.2 53.3 53.7
# Standardize each measurement column (z-scores):
scaled <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
ramusdf <- ramus %>% mutate_at(c("X8yr", "X8.5yr", "X9yr", "X9.5yr"), scaled, na.rm = TRUE)
head(ramusdf)
## Individual X8yr X8.5yr X9yr X9.5yr
## 1 1 -0.3398327 -0.3248600 -0.5969107 -0.6400435
## 2 2 -0.8962839 -0.9155145 -1.0911680 -1.1168668
## 3 3 -0.9360304 -1.1123994 -1.0531482 -1.0801881
## 4 4 -1.4129886 -1.7030539 -1.6994846 -1.5570114
## 5 5 -0.4193257 -0.4429909 -0.6349305 -0.7867584
## 6 6 1.5282535 1.4077267 1.0379403 0.8271050
df<-ramusdf[,2:5] # Dropping "Individual" column
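As a quick sanity check (not part of the original analysis), base R's scale(), which also centers each column at its mean and divides by its standard deviation, should reproduce the same standardized values:
# Maximum absolute difference between the two standardizations; should be ~0.
max(abs(scale(ramus[, 2:5]) - as.matrix(df)))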
# Helper: draw a dendrogram cut into k = 4 colored, boxed clusters.
dend_func <- function(x) {
  fviz_dend(x,
            k = 4,
            cex = 0.5,
            rect = TRUE,
            rect_fill = TRUE,
            horiz = FALSE,
            palette = "jco",
            rect_border = "jco",
            color_labels_by_k = TRUE)
}
# Theme for the dendrogram plots:
theme1<- theme_gray() +
theme(plot.margin = unit(rep(0.7, 4), "cm"))
# Helper: scatter plot of the observations colored by cluster membership.
clust_func <- function(x) {
  fviz_cluster(list(data = df, cluster = paste0("Group", x)),
               alpha = 1,
               colors = x,
               labelsize = 9,
               ellipse.type = "norm")
}
# Theme for the cluster plots:
theme2<- theme(legend.position = c(0.1, 0.8)) +
theme(plot.margin = unit(rep(0.5, 4), "cm"))
Distance Matrix Computation and Visualization
# Compute distances:
dd <- dist(df, method = "euclidean")
# Visualize the dissimilarity:
fviz_dist(dd, lab_size = 6)
In the plot above, similar objects appear close to one another. Red corresponds to a small distance and blue to a large distance between observations. For example, observations 18 and 4 are the farthest apart. We will check this against the principal component and cluster plots later on.
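A quick numerical check of this claim, using the distance object dd computed above, is to look up the pair of observations at the maximum pairwise distance:
# Find the pair of observations with the largest Euclidean distance:
dmat <- as.matrix(dd)
which(dmat == max(dmat), arr.ind = TRUE)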
PC Analysis
Eigenvalues
S <- cor(df)
eig <- eigen(S)
cumsum(eig$values)/sum(eig$values)
## [1] 0.9238118 0.9876482 0.9957886 1.0000000
# PC scores: project the scaled data onto the first two eigenvectors
pc1 <- as.matrix(df) %*% eig$vectors[, 1]
pc2 <- as.matrix(df) %*% eig$vectors[, 2]
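Per the cumulative proportions above, the first two components account for roughly 99% of the total variance, so the two-dimensional plots below lose little information. As a cross-check (not part of the original analysis), prcomp() on the already-scaled data should return the same scores up to a sign flip of each component:
# prcomp() centers the (already centered) columns and performs the same
# eigen-decomposition, so its first two score columns equal pc1 and pc2 up to sign.
pca <- prcomp(df)
head(pca$x[, 1:2])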
Dendrogram Plot
singleOut <- hclust(dd, method = "single")
dend_func(singleOut) -> basic_plot
basic_plot + theme1 +
labs(title = "Hierarchical Clustering with Single Linkage Method")
In the dendrogram displayed above, each leaf corresponds to one observation. As we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused at a higher height.
The height of each fusion, shown on the vertical axis, indicates the (dis)similarity between two observations: the higher the fusion, the less similar the observations are. Note that conclusions about the proximity of two observations can be drawn only from the height at which the branches containing them are first fused. We cannot use the proximity of two observations along the horizontal axis as a criterion of their similarity.
PCA Plot
sgroup<- cutree(singleOut, k = 4)
plot(pc1, pc2, ylim = c(-2,1))
text(x = pc1, y = pc2+.1, labels = rownames(ramus))
points(pc1, pc2, col = sgroup, pch = 19)
Cluster Plot
clust_func(sgroup)-> cluster_plot
cluster_plot + theme2 +
labs(title = "Cluster based on Hierarchical Clustering")
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
These plots confirm that observations 18 and 4 are the farthest apart. The "Too few points to calculate an ellipse" messages indicate that single linkage produced clusters with too few observations to fit a confidence ellipse.
Dendrogram Plot
completeOut <- hclust(dd, method = "complete")
dend_func(completeOut) -> basic_plot
basic_plot + theme1 +
labs(title = "Hierarchical Clustering with Complete Linkage Method")
PCA Plot
cgroup<- cutree(completeOut, k = 4)
plot(pc1, pc2, ylim = c(-2,1))
text(x = pc1, y = pc2+.1, labels = rownames(ramus))
points(pc1, pc2, col = cgroup, pch = 19)
Cluster Plot
clust_func(cgroup)-> cluster_plot
cluster_plot + theme2 +
labs(title = "Cluster based on Hierarchical Clustering")
Dendrogram Plot
averageOut <- hclust(dd, method = "average")
dend_func(averageOut) -> basic_plot
basic_plot + theme1 +
labs(title = "Hierarchical Clustering with Average Linkage Method")
PCA Plot
agroup<- cutree(averageOut, k = 4)
plot(pc1, pc2, ylim = c(-2,1))
text(x = pc1, y = pc2+.1, labels = rownames(ramus))
points(pc1, pc2, col = agroup, pch = 19)
Cluster Plot
# Plot the clusters from the average-linkage grouping:
clust_func(agroup) -> cluster_plot
cluster_plot + theme2 +
labs(title = "Cluster based on Hierarchical Clustering")
## Too few points to calculate an ellipse
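Research question (1) asks for a comparison of the three clusterings. One simple way to compare the group assignments, using the sgroup, cgroup, and agroup vectors computed above, is to cross-tabulate them (keeping in mind that the numeric cluster labels are arbitrary and may be permuted between methods):
# Cross-tabulate cluster labels from the three linkage methods; two methods agree
# (up to relabeling) when each row and column of the table has a single nonzero cell.
table(single = sgroup, complete = cgroup)
table(single = sgroup, average = agroup)
table(complete = cgroup, average = agroup)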