Video Link 3 - https://www.youtube.com/watch?v=lmSMEZAjE-4&feature=youtu.be
Clustering, or cluster analysis, is a bread-and-butter technique for visualizing high-dimensional (multidimensional) data. It’s very simple to use, the ideas are fairly intuitive, and it can serve as a really quick way to get a sense of what’s going on in a very high-dimensional data set.
And it’s a widely applied method in many different areas of science, business, and other applications. So it’s useful to know how these techniques work.
The point of clustering is to organize things or observations that are close together and separate them into groups. Of course, this simple definition raises some immediate questions:
How do we define when things are close together and when they are far apart?
How do we group things together?
How do we visualize the grouping?
How do we interpret the grouping?
All clustering techniques confront a basic issue, which is how do we define when things are close together and when things are far apart? Essentially, the wide variety of clustering techniques out there that you can apply to data differ in the ways that they answer these questions.
Hierarchical clustering, as is denoted by the name, involves organizing your data into a kind of hierarchy. The common approach is what’s called an agglomerative approach. This is a kind of bottom up approach, where you start by thinking of the data as individual data points. Then you start lumping them together into clusters little by little until eventually your entire data set is just one big cluster.
Imagine there’s all these little particles floating around (your data points), and you start kind of grouping them together into little balls. And then the balls get grouped up into bigger balls, and the bigger balls get grouped together into one big massive cluster. That’s the agglomerative approach to clustering, and that’s what we’re going to talk about here.
The algorithm is recursive and goes as follows:
1. Find the two things (points) in your data set that are closest together.
2. Merge them together and call them a single “point”.
3. Repeat the process with this new, smaller data set until everything has been merged into one big cluster.
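To see this recursion concretely, R’s hclust() function (used throughout the rest of this section) performs exactly this agglomeration. The sketch below uses a small made-up data set, separate from the one simulated later; the $merge component of the result records which points or clusters were joined at each step, and $height records the distance at which each merge happened.
# Small hypothetical data set, just to illustrate the merge trace
set.seed(10)
toy <- data.frame(x = rnorm(6), y = rnorm(6))
# Agglomerative clustering: Euclidean distance and "complete" linkage by default
hc <- hclust(dist(toy))
# Negative entries refer to original points; positive entries refer to earlier merges
hc$merge
# The distance at which each merge occurred
hc$height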
There are a number of commonly used metrics for characterizing distance or its inverse, similarity:
Euclidean distance: A continuous metric which can be thought of in geometric terms as the “straight-line” distance between two points.
Correlation similarity: similar in spirit to Euclidean distance, but based on how correlated two observations are; a common convention is to use one minus the correlation as the distance, so that highly correlated observations count as “close”.
Manhattan distance: on a grid or lattice, how many “city blocks” would you have to travel to get from point A to point B?
The important thing is to always pick a distance or similarity metric that makes sense for your problem.
Euclidean distance:
In two dimensions, the Euclidean distance between points (x1, y1) and (x2, y2) is sqrt((x1 - x2)^2 + (y1 - y2)^2). If instead of two dimensions you have 100 dimensions, you do exactly the same thing: take the difference in each of the 100 dimensions, square them, sum them together, and then take the square root. So the Euclidean distance metric extends very naturally to very high-dimensional problems.
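As a quick sanity check, here is a small sketch (the points p and q are just made up for illustration) comparing a hand-computed Euclidean distance with what dist() returns:
# Two hypothetical points in 5 dimensions
p <- c(1, 2, 3, 4, 5)
q <- c(2, 1, 4, 3, 6)
# Hand-computed Euclidean distance: square root of the sum of squared differences
sqrt(sum((p - q)^2))
# The same value from dist(), whose default metric is Euclidean
dist(rbind(p, q))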
Manhattan distance:
In a city, if you want to go from point A to point B, you usually cannot take the direct route because there will be buildings in the way. Instead, you have to follow the streets, the grid layout of the city, to navigate around. That’s the idea behind Manhattan distance.
The Manhattan distance between two points is simply the sum of the absolute differences in each coordinate: all of the left-right moves plus all of the up-down moves you would make on the grid.
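Here is a matching sketch (again with made-up points) comparing a hand-computed Manhattan distance with dist() using the "manhattan" method:
# Two hypothetical points on a grid
p <- c(1, 2)
q <- c(4, 6)
# Hand-computed Manhattan distance: sum of absolute coordinate differences
sum(abs(p - q))
# The same value from dist() with method = "manhattan"
dist(rbind(p, q), method = "manhattan")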
# Set Seed
set.seed(1234)
# Create X & Y values (clusters in the plot)
x <- rnorm(12, rep(1:3, each = 4), 0.2)
y <- rnorm(12, rep(c(1, 2, 1), each = 4), 0.2)
# Create the Plot
plot(x, y, col = "blue", pch = 19, cex = 2)
# Set Text labels on the plot itself
text(x + 0.05, y + 0.05, labels = as.character(1:12))
# Create a data frame of the plotted values
dataFrame <- data.frame(x=x, y=y)
# The default distance metric used by the dist() function is Euclidean distance.
dist(dataFrame)
## 1 2 3 4 5 6
## 2 0.34120511
## 3 0.57493739 0.24102750
## 4 0.26381786 0.52578819 0.71861759
## 5 1.69424700 1.35818182 1.11952883 1.80666768
## 6 1.65812902 1.31960442 1.08338841 1.78081321 0.08150268
## 7 1.49823399 1.16620981 0.92568723 1.60131659 0.21110433 0.21666557
## 8 1.99149025 1.69093111 1.45648906 2.02849490 0.61704200 0.69791931
## 9 2.13629539 1.83167669 1.67835968 2.35675598 1.18349654 1.11500116
## 10 2.06419586 1.76999236 1.63109790 2.29239480 1.23847877 1.16550201
## 11 2.14702468 1.85183204 1.71074417 2.37461984 1.28153948 1.21077373
## 12 2.05664233 1.74662555 1.58658782 2.27232243 1.07700974 1.00777231
## 7 8 9 10 11
## 2
## 3
## 4
## 5
## 6
## 7
## 8 0.65062566
## 9 1.28582631 1.76460709
## 10 1.32063059 1.83517785 0.14090406
## 11 1.37369662 1.86999431 0.11624471 0.08317570
## 12 1.17740375 1.66223814 0.10848966 0.19128645 0.20802789
The default distance metric used by the dist() function is Euclidean distance.
Note that usually you will not have to explicitly compute the distance matrix (unless you are inventing your own clustering method). Here I just print it out to show what’s going on internally.
An agglomerative clustering approach first finds the two points that are closest together. In other words, we want to find the smallest non-zero entry in the distance matrix.
# Create the distance matrix
rdistxy <- as.matrix(dist(dataFrame))
# Remove the diagonal (the zero self-distances) from consideration by inflating it
diag(rdistxy) <- diag(rdistxy) + 100000
# Find the index of the points with minimum distance
ind <- which(rdistxy == min(rdistxy), arr.ind = TRUE)
ind
## row col
## 6 6 5
## 5 5 6
# Create the plot
plot(x, y, col = "blue", pch = 19, cex = 2)
# Create the text labels on the plot itself
text(x + 0.05, y + 0.05, labels = as.character(1:12))
# Highlight the closest two points in orange
points(x[ind[1, ]], y[ind[1, ]], col = "orange", pch = 19, cex = 2)
# Set some graph parameters
par(mfrow = c(1, 2))
# Create the plot
plot(x, y, col = "blue", pch = 19, cex = 2, main = "Data")
# Create the text on the plot
text(x + 0.05, y + 0.05, labels = as.character(1:12))
# Color the points in question orange
points(x[ind[1, ]], y[ind[1, ]], col = "orange", pch = 19, cex = 2)
# Run hierarchical clustering and cut the dendrogram just above the first merge height,
# then plot the piece of the tree containing the two closest points (5 & 6)
library(dplyr)
hcluster <- dist(dataFrame) %>% hclust
dendro <- as.dendrogram(hcluster)
cutDendro <- cut(dendro, h = (hcluster$height[1] + 0.00001))
plot(cutDendro$lower[[11]], yaxt = "n", main = "Begin building tree")
[Figure: hierarchial-clustering-combine.png]
The process then repeats: we look for the next-closest pair of points.
# Find the next-smallest distance; the minimum appears twice in the symmetric matrix,
# so the third-smallest entry corresponds to the next-closest pair
nextmin <- rdistxy[order(rdistxy)][3]
# Find the index of the points with that distance
ind <- which(rdistxy == nextmin, arr.ind = TRUE)
ind
## row col
## 11 11 10
## 10 10 11
Continuing this merging process for every remaining pair eventually produces the full tree, which hclust() computes for us.
# Run hierarchical clustering on the full data set
hClustering <- data.frame(x = x, y = y) %>% dist %>% hclust
# Plot the resulting dendrogram
plot(hClustering)
It’s possible to make slightly prettier dendrograms with some modification to the usual plotting method for the output of hclust(). Here’s a function that takes the output of hclust() and color codes each of the cluster members by their cluster membership.
# Create the Function
myplclust <- function(hclust, lab = hclust$labels,
                      lab.col = rep(1, length(hclust$labels)), hang = 0.1, ...) {
    ## Modification of plclust for plotting hclust objects *in colour*!
    ## Copyright Eva KF Chan 2009
    ## Arguments:
    ##   hclust:  an hclust object
    ##   lab:     a character vector of labels for the leaves of the tree
    ##   lab.col: colour for the labels; NA = default device foreground colour
    ##   hang:    as in hclust & plclust
    ## Side effect: a display of the hierarchical cluster with coloured leaf labels.
y <- rep(hclust$height, 2)
x <- as.numeric(hclust$merge)
y <- y[which(x < 0)]
x <- x[which(x < 0)]
x <- abs(x)
y <- y[order(x)]
x <- x[order(x)]
plot(hclust, labels = FALSE, hang = hang, ...)
text(x = x, y = y[hclust$order] - (max(hclust$height) * hang), labels = lab[hclust$order],
col = lab.col[hclust$order], srt = 90, adj = c(1, 0.5), xpd = NA, ...)
}
# Running the Function
# First, perform the hierarchical clustering with hclust() on the data frame
hClustering <- data.frame(x = x, y = y) %>% dist %>% hclust
# Then run the function to color the end nodes
myplclust(hClustering, lab = rep(1:3, each = 4), lab.col = rep(1:3, each = 4))
One issue that we haven’t discussed yet is how exactly the merging of clusters works. Recall that once we find the two points that are closest together, we “merge” them and then consider the merged pair as a single “point”. When we compare this merged “point” with other points, how should we measure the distance from one point to this merged cluster of points?
One method, called “complete” linkage, is to measure the distance between two groups of points by the maximum distance between the two groups. That is, take all points in group 1 and all points in group 2 and find the two points that are furthest apart; that’s the distance between the groups.
Here’s what that would look like with our simulated data.
[Figure: hierarchial-clustering-merge.png]
[Figure: hierarchial-clustering-averaging.png]
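The second figure refers to another common merging rule, “average” linkage, in which the distance between two groups is the average of all pairwise distances between their members. In R, the merging rule is controlled by the method argument to hclust(); the default is "complete". Here is a short sketch (the variable names are arbitrary) comparing the two on the dataFrame simulated above:
# Complete linkage (the default): group distance = maximum pairwise distance
hComplete <- hclust(dist(dataFrame), method = "complete")
# Average linkage: group distance = average pairwise distance
hAverage <- hclust(dist(dataFrame), method = "average")
# Plot the two dendrograms side by side to compare the merge heights
par(mfrow = c(1, 2))
plot(hComplete, main = "Complete linkage")
plot(hAverage, main = "Average linkage")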
The heatmap() function is a handy way to visualize matrix data. The basic idea is that heatmap() sorts the rows and columns of a matrix according to the clustering determined by a call to hclust().
Conceptually, heatmap() first treats the rows of a matrix as observations and calls hclust() on them, then it treats the columns of a matrix as observations and calls hclust() on those values.
The end result is that you get a dendrogram associated with both the rows and columns of a matrix, which can help you to spot obvious patterns in the data.
# Convert the data frame to a numeric matrix, since heatmap() expects a matrix
dataMatrix <- data.frame(x = x, y = y) %>% data.matrix
# Draw the heatmap with row and column dendrograms
heatmap(dataMatrix)
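Because our simulated data set has only two columns, the column dendrogram above is not very interesting. A sketch with a slightly larger made-up matrix (the dimensions and shift here are arbitrary) shows the row and column reordering more clearly:
# A hypothetical 40 x 10 matrix with the first 20 rows shifted upward,
# so the row dendrogram picks up two clear groups
set.seed(678910)
bigMatrix <- matrix(rnorm(400), nrow = 40, ncol = 10)
bigMatrix[1:20, ] <- bigMatrix[1:20, ] + 3
heatmap(bigMatrix)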
Hierarchical clustering is a really useful tool because it quickly gives you an idea of the relationships between variables/observations.
But some caution should be used with clustering, because the picture you produce can often be unstable. In particular, it may be sensitive to:
Changing a few data points, or the presence of missing values
The choice of distance metric
The choice of merging strategy (linkage)
The scale of measurement of the individual variables
Another issue is that choosing where to “cut” the tree to determine the number of clusters isn’t always obvious. In light of some of these limitations, hierarchical clustering should be primarily used for exploration of data.
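One way to turn a dendrogram into an explicit set of cluster labels is to cut the tree at a chosen height, or into a chosen number of groups, with cutree(); the choice of 3 groups below is just for illustration, matching the three simulated clusters, and it reuses the hClustering object from above.
# Cut the dendrogram so that exactly 3 clusters are produced
cutree(hClustering, k = 3)
# Alternatively, cut at a specific height on the dendrogram
cutree(hClustering, h = 1.5)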