Hierarchical Clustering

Hierarchical clustering, as the name suggests, organizes your observations into a hierarchy, or tree, of clusters, built here in an agglomerative (bottom-up) fashion.

The algorithm is recursive and goes as follows:

1. Find the two closest things (points or clusters)
2. Put them together
3. Find the next closest pair, and repeat until everything has been merged

This requires a defined distance, so we know what "close" means, and a merging strategy, so we know how to measure the distance to a group of points that has already been merged.
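To make the recursion concrete, here is a minimal sketch of the agglomerative loop. The naive_agglomerate() helper is hypothetical and purely illustrative; it assumes Euclidean distance and single linkage, and in practice you would simply call hclust() as in the example below.

# Illustrative sketch only: a naive agglomerative clustering loop
naive_agglomerate <- function(points) {
    d <- as.matrix(dist(points))                # pairwise Euclidean distances
    clusters <- as.list(seq_len(nrow(points)))  # every point starts as its own cluster
    merges <- list()
    while (length(clusters) > 1) {
        best_i <- best_j <- NA
        best_d <- Inf
        # Find the two clusters whose closest members are nearest (single linkage)
        for (i in seq_along(clusters)) {
            for (j in seq_along(clusters)) {
                if (i < j) {
                    dij <- min(d[clusters[[i]], clusters[[j]]])
                    if (dij < best_d) {
                        best_d <- dij
                        best_i <- i
                        best_j <- j
                    }
                }
            }
        }
        # Merge the two closest clusters and record the merge
        merges[[length(merges) + 1]] <- list(members = c(clusters[[best_i]], clusters[[best_j]]),
                                             height = best_d)
        clusters[[best_i]] <- c(clusters[[best_i]], clusters[[best_j]])
        clusters[[best_j]] <- NULL
    }
    merges
}

# Tiny usage example with five random 2-D points
set.seed(1)
naive_agglomerate(matrix(rnorm(10), ncol = 2))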

How do we define close?

There are a number of commonly used metrics for characterizing distance or its inverse, similarity:

- Euclidean distance: straight-line, "as the crow flies" distance
- Manhattan (city-block) distance: the sum of the absolute differences along each coordinate
- Correlation similarity: how correlated two observations are across their measurements, often turned into a distance as 1 - correlation

The important thing is to always pick a distance or similarity metric that makes sense for your problem.
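As a quick illustration (not part of the original example, and using made-up data), here is how a few common choices look in base R: dist() computes Euclidean and Manhattan distances directly, and a correlation-based distance can be built from cor().

# Illustrative sketch only: a few common distance choices in base R
m <- matrix(rnorm(20), nrow = 4)   # 4 observations, 5 variables each

dist(m)                            # Euclidean distance (the default)
dist(m, method = "manhattan")      # Manhattan (city-block) distance
as.dist(1 - cor(t(m)))             # correlation-based distance: 1 - correlation between rows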

Example: Hierarchical clustering

# Set Seed
set.seed(1234)

# Create X & Y values (clusters in the plot)
x <- rnorm(12, rep(1:3, each = 4), 0.2)
y <- rnorm(12, rep(c(1, 2, 1), each = 4), 0.2)

# Create the Plot
plot(x, y, col = "blue", pch = 19, cex = 2)

# Add text labels next to each point
text(x + 0.05, y + 0.05, labels = as.character(1:12))

# Create the data frame of the plot values
dataFrame <- data.frame(x=x, y=y)

# The default distance metric used by the dist() function is Euclidean distance.
dist(dataFrame)
##             1          2          3          4          5          6
## 2  0.34120511                                                       
## 3  0.57493739 0.24102750                                            
## 4  0.26381786 0.52578819 0.71861759                                 
## 5  1.69424700 1.35818182 1.11952883 1.80666768                      
## 6  1.65812902 1.31960442 1.08338841 1.78081321 0.08150268           
## 7  1.49823399 1.16620981 0.92568723 1.60131659 0.21110433 0.21666557
## 8  1.99149025 1.69093111 1.45648906 2.02849490 0.61704200 0.69791931
## 9  2.13629539 1.83167669 1.67835968 2.35675598 1.18349654 1.11500116
## 10 2.06419586 1.76999236 1.63109790 2.29239480 1.23847877 1.16550201
## 11 2.14702468 1.85183204 1.71074417 2.37461984 1.28153948 1.21077373
## 12 2.05664233 1.74662555 1.58658782 2.27232243 1.07700974 1.00777231
##             7          8          9         10         11
## 2                                                        
## 3                                                        
## 4                                                        
## 5                                                        
## 6                                                        
## 7                                                        
## 8  0.65062566                                            
## 9  1.28582631 1.76460709                                 
## 10 1.32063059 1.83517785 0.14090406                      
## 11 1.37369662 1.86999431 0.11624471 0.08317570           
## 12 1.17740375 1.66223814 0.10848966 0.19128645 0.20802789
# Convert the distances into a full matrix
rdistxy <- as.matrix(dist(dataFrame))

# Remove the diagonal from consideration by making the self-distances huge,
# so the zeros on the diagonal are never picked as the minimum
diag(rdistxy) <- diag(rdistxy) + 100000

# Find the index of the points with minimum distance
ind <- which(rdistxy == min(rdistxy), arr.ind = TRUE)

ind
##   row col
## 6   6   5
## 5   5   6
# Create the plot
plot(x, y, col = "blue", pch = 19, cex = 2)

# Create the text labels on the plot itself
text(x + 0.05, y + 0.05, labels = as.character(1:12))

# Color the closest two points orange
points(x[ind[1, ]], y[ind[1, ]], col = "orange", pch = 19, cex = 2)

# Set up a 1 x 2 plotting layout: the data on the left, the start of the tree on the right
par(mfrow = c(1, 2))

# Create the plot
plot(x, y, col = "blue", pch = 19, cex = 2, main = "Data")

# Create the text on the plot
text(x + 0.05, y + 0.05, labels = as.character(1:12))

# Color the points in question orange
points(x[ind[1, ]], y[ind[1, ]], col = "orange", pch = 19, cex = 2)


# Build the full cluster tree, then cut it just above the height of the first merge
# and plot the subtree containing the two closest points (5 & 6)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
hcluster <- dist(dataFrame) %>% hclust
dendro <- as.dendrogram(hcluster)
# Cut the tree just above the height of the first merge; here the lower branch
# containing the merged pair (5 & 6) is the 11th subtree in dendrogram order
cutDendro <- cut(dendro, h = (hcluster$height[1] + 0.00001))
plot(cutDendro$lower[[11]], yaxt = "n", main = "Begin building tree")

(Figure: hierarchial-clustering-combine.png)

# The two smallest entries of the symmetric distance matrix are the 5-6 distance
# counted twice, so the next-smallest distance is the third entry in sorted order
nextmin <- rdistxy[order(rdistxy)][3]
ind <- which(rdistxy == nextmin, arr.ind = TRUE)
ind
##    row col
## 11  11  10
## 10  10  11
# Run the full hierarchical clustering and plot the dendrogram
hClustering <- data.frame(x = x, y = y) %>% dist %>% hclust
plot(hClustering)

Prettier dendrograms

It’s possible to make slightly prettier dendrograms with some modification to the usual plotting method for the output of hclust(). Here’s a function that takes the output of hclust() and color codes each of the cluster members by their cluster membership.

# Create the function
myplclust <- function(hclust, lab = hclust$labels,
                      lab.col = rep(1, length(hclust$labels)),
                      hang = 0.1, ...) {
    ## Modification of plclust for plotting hclust objects *in colour*!
    ## Copyright Eva KF Chan 2009
    ## Arguments:
    ##   hclust:  hclust object
    ##   lab:     a character vector of labels of the leaves of the tree
    ##   lab.col: colour for the labels; NA = default device foreground colour
    ##   hang:    as in hclust & plclust
    ## Side effect: a display of the hierarchical cluster with coloured leaf labels.
    y <- rep(hclust$height, 2)
    x <- as.numeric(hclust$merge)
    y <- y[which(x < 0)]
    x <- x[which(x < 0)]
    x <- abs(x)
    y <- y[order(x)]
    x <- x[order(x)]
    plot(hclust, labels = FALSE, hang = hang, ...)
    text(x = x, y = y[hclust$order] - (max(hclust$height) * hang),
         labels = lab[hclust$order], col = lab.col[hclust$order],
         srt = 90, adj = c(1, 0.5), xpd = NA, ...)
}


# Running the Function

# First, perform the hierarchical clustering with the hclust() function on the data frame
hClustering <- data.frame(x = x, y = y) %>% dist %>% hclust

# Then run the function to color the end nodes
myplclust(hClustering, lab = rep(1:3, each = 4), lab.col = rep(1:3, each = 4))

Merging points: Complete

With complete linkage, the distance between two clusters is the greatest distance between any point in one cluster and any point in the other.

(Figure: hierarchial-clustering-merge.png)

Merging points: Average

With average linkage, the distance between two clusters is the distance between their average points, that is, the "center of gravity" of each cluster. A short code sketch comparing the two strategies follows.

(Figure: hierarchial-clustering-averaging.png)
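To see how the merging strategy affects the tree for this data set, you can set hclust()'s method argument (a small sketch, reusing the simulated x and y values from above; "complete" is the default):

# Compare complete linkage (the hclust() default) with average linkage
dataFrame <- data.frame(x = x, y = y)
par(mfrow = c(1, 2))
plot(hclust(dist(dataFrame), method = "complete"), main = "Complete linkage")
plot(hclust(dist(dataFrame), method = "average"), main = "Average linkage")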

Using the heatmap() function

The heatmap() function is handy for visualizing matrix data: it runs a hierarchical clustering on both the rows and the columns of the matrix and displays the reordered values as an image, with the corresponding dendrograms drawn in the margins.
# Convert the data frame to a numeric matrix and draw a clustered heatmap
dataMatrix <- data.frame(x = x, y = y) %>% data.matrix
heatmap(dataMatrix)
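The 12-row, 2-column matrix above is quite small; as an illustrative aside (not part of the original example), the row and column dendrograms are easier to see on a larger random matrix:

# Illustrative only: heatmap() on a larger random matrix
set.seed(678)
bigMatrix <- matrix(rnorm(400), nrow = 40, ncol = 10)
heatmap(bigMatrix)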

Notes and further resources

Hierarchical clustering is a really useful tool because it quickly gives you an idea of the relationships between variables/observations.

But caution should be used with clustering, as the picture you produce can often be unstable. In particular, it may be sensitive to:

- changing a few points in the data
- the choice of distance metric
- the choice of merging strategy
- the scaling of the variables (changing the units or scale of one variable can change the picture)

Another issue is that choosing where to “cut” the tree to determine the number of clusters isn’t always obvious. In light of some of these limitations, hierarchical clustering should be primarily used for exploration of data.
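If you do want a concrete set of clusters out of the tree, cutree() will cut it for you; here is a minimal sketch using the hClustering object fit earlier in this section (the cut height of 1.5 is an arbitrary illustration):

# Cut the dendrogram into a chosen number of clusters...
cutree(hClustering, k = 3)

# ...or cut it at a chosen height
cutree(hClustering, h = 1.5)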
