Video Link 3 - https://www.youtube.com/watch?v=lmSMEZAjE-4&feature=youtu.be
Clustering, or cluster analysis, is a bread-and-butter technique for visualizing high-dimensional (multidimensional) data. It’s very simple to use, the ideas are fairly intuitive, and it can serve as a really quick way to get a sense of what’s going on in a very high-dimensional data set.
And it’s a widely applied method in many different areas of science, business, and other applications. So it’s useful to know how these techniques work.
The point of clustering is to organize things or observations that are close together and separate them into groups. Of course, this simple definition raises some immediate questions:
How do we define when things are close together and when they are far apart?
How do we group things together?
How do we visualize the grouping?
How do we interpret the grouping?
All clustering techniques confront a basic issue, which is how do we define when things are close together and when things are far apart? Essentially, the wide variety of clustering techniques out there that you can apply to data differ in the ways that they answer these questions.
Hierarchical clustering, as is denoted by the name, involves organizing your data into a kind of hierarchy. The common approach is what’s called an agglomerative approach. This is a kind of bottom up approach, where you start by thinking of the data as individual data points. Then you start lumping them together into clusters little by little until eventually your entire data set is just one big cluster.
Imagine there’s all these little particles floating around (your data points), and you start kind of grouping them together into little balls. And then the balls get grouped up into bigger balls, and the bigger balls get grouped together into one big massive cluster. That’s the agglomerative approach to clustering, and that’s what we’re going to talk about here.
The algorithm is recursive and goes as follows:
1. Find the two things (points) in your data set that are closest together.
2. Merge them together and call them a single “point”.
3. Repeat the process with this new, smaller data set until everything has been merged into one big cluster.
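To see this recursion concretely, R’s hclust() function (used throughout the rest of this section) performs exactly this agglomeration. The sketch below uses a small made-up data set, separate from the one simulated later; the $merge component of the result records which points or clusters were joined at each step, and $height records the distance at which each merge happened.
# Small hypothetical data set, just to illustrate the merge trace
set.seed(10)
toy <- data.frame(x = rnorm(6), y = rnorm(6))
# Agglomerative clustering: Euclidean distance and "complete" linkage by default
hc <- hclust(dist(toy))
# Negative entries refer to original points; positive entries refer to earlier merges
hc$merge
# The distance at which each merge occurred
hc$height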
There are a number of commonly used metrics for characterizing distance or its inverse, similarity:
Euclidean distance: A continuous metric which can be thought of in geometric terms as the “straight-line” distance between two points.
Correlation similarity: similar in spirit to Euclidean distance, but based on how correlated two observations are; a common convention is to use one minus the correlation as the distance, so that highly correlated observations count as “close”.
Manhattan distance: on a grid or lattice, how many “city blocks” would you have to travel to get from point A to point B?
The important thing is to always pick a distance or similarity metric that makes sense for your problem.
Euclidean distance:
In two dimensions, the Euclidean distance between points (x1, y1) and (x2, y2) is sqrt((x1 - x2)^2 + (y1 - y2)^2). If instead of two dimensions you have 100 dimensions, you do exactly the same thing: take the difference in each of the 100 dimensions, square them, sum them together, and then take the square root. So the Euclidean distance metric extends very naturally to very high-dimensional problems.
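As a quick sanity check, here is a small sketch (the points p and q are just made up for illustration) comparing a hand-computed Euclidean distance with what dist() returns:
# Two hypothetical points in 5 dimensions
p <- c(1, 2, 3, 4, 5)
q <- c(2, 1, 4, 3, 6)
# Hand-computed Euclidean distance: square root of the sum of squared differences
sqrt(sum((p - q)^2))
# The same value from dist(), whose default metric is Euclidean
dist(rbind(p, q))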
Manhattan distance:
In a city, if you want to go from point A to point B, you usually cannot take the direct route because there will be buildings in the way. Instead, you have to follow the streets, the grid layout of the city, to navigate around. That’s the idea behind Manhattan distance.
The Manhattan distance between two points is simply the sum of the absolute differences in each coordinate: all of the left-right moves plus all of the up-down moves you would make on the grid.
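Here is a matching sketch (again with made-up points) comparing a hand-computed Manhattan distance with dist() using the "manhattan" method:
# Two hypothetical points on a grid
p <- c(1, 2)
q <- c(4, 6)
# Hand-computed Manhattan distance: sum of absolute coordinate differences
sum(abs(p - q))
# The same value from dist() with method = "manhattan"
dist(rbind(p, q), method = "manhattan")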
# Set Seed
set.seed(1234)
# Create X & Y values (clusters in the plot)
x <- rnorm(12, rep(1:3, each = 4), 0.2)
y <- rnorm(12, rep(c(1, 2, 1), each = 4), 0.2)
# Create the Plot
plot(x, y, col = "blue", pch = 19, cex = 2)
# Set Text labels on the plot itself
text(x + 0.05, y + 0.05, labels = as.character(1:12))
# Create a data frame of the plotted values
dataFrame <- data.frame(x=x, y=y)
# The default distance metric used by the dist() function is Euclidean distance.
dist(dataFrame)
## 1 2 3 4 5 6
## 2 0.34120511
## 3 0.57493739 0.24102750
## 4 0.26381786 0.52578819 0.71861759
## 5 1.69424700 1.35818182 1.11952883 1.80666768
## 6 1.65812902 1.31960442 1.08338841 1.78081321 0.08150268
## 7 1.49823399 1.16620981 0.92568723 1.60131659 0.21110433 0.21666557
## 8 1.99149025 1.69093111 1.45648906 2.02849490 0.61704200 0.69791931
## 9 2.13629539 1.83167669 1.67835968 2.35675598 1.18349654 1.11500116
## 10 2.06419586 1.76999236 1.63109790 2.29239480 1.23847877 1.16550201
## 11 2.14702468 1.85183204 1.71074417 2.37461984 1.28153948 1.21077373
## 12 2.05664233 1.74662555 1.58658782 2.27232243 1.07700974 1.00777231
## 7 8 9 10 11
## 2
## 3
## 4
## 5
## 6
## 7
## 8 0.65062566
## 9 1.28582631 1.76460709
## 10 1.32063059 1.83517785 0.14090406
## 11 1.37369662 1.86999431 0.11624471 0.08317570
## 12 1.17740375 1.66223814 0.10848966 0.19128645 0.20802789
The default distance metric used by the dist() function is Euclidean distance.
Note that usually you will not have to explicitly compute the distance matrix (unless you are inventing your own clustering method). Here I just print it out to show what’s going on internally.
An agglomerative clustering approach first finds the two points that are closest together. In other words, we want to find the smallest non-zero entry in the distance matrix.
# Create the distance matrix
rdistxy <- as.matrix(dist(dataFrame))
# Remove the diagonal (the zero self-distances) from consideration by inflating it
diag(rdistxy) <- diag(rdistxy) + 100000
# Find the index of the points with minimum distance
ind <- which(rdistxy == min(rdistxy), arr.ind = TRUE)
ind
## row col
## 6 6 5
## 5 5 6
# Create the plot
plot(x, y, col = "blue", pch = 19, cex = 2)
# Create the text labels on the plot itself
text(x + 0.05, y + 0.05, labels = as.character(1:12))
# Highlight the closest two points in orange
points(x[ind[1, ]], y[ind[1, ]], col = "orange", pch = 19, cex = 2)
# Set some graph parameters
par(mfrow = c(1, 2))
# Create the plot
plot(x, y, col = "blue", pch = 19, cex = 2, main = "Data")
# Create the text on the plot
text(x + 0.05, y + 0.05, labels = as.character(1:12))
# Color the points in question orange
points(x[ind[1, ]], y[ind[1, ]], col = "orange", pch = 19, cex = 2)
# Run hierarchical clustering and cut the dendrogram just above the first merge height,
# then plot the piece of the tree containing the two closest points (5 & 6)
library(dplyr)
hcluster <- dist(dataFrame) %>% hclust
dendro <- as.dendrogram(hcluster)
cutDendro <- cut(dendro, h = (hcluster$height[1] + 0.00001))
plot(cutDendro$lower[[11]], yaxt = "n", main = "Begin building tree")
[Figure: hierarchial-clustering-combine.png]
The process then repeats: we look for the next-closest pair of points.
# Find the next-smallest distance; the minimum appears twice in the symmetric matrix,
# so the third-smallest entry corresponds to the next-closest pair
nextmin <- rdistxy[order(rdistxy)][3]
# Find the index of the points with that distance
ind <- which(rdistxy == nextmin, arr.ind = TRUE)
ind
## row col
## 11 11 10
## 10 10 11
Continuing this merging process for every remaining pair eventually produces the full tree, which hclust() computes for us.
# Run hierarchical clustering on the full data set
hClustering <- data.frame(x = x, y = y) %>% dist %>% hclust
# Plot the resulting dendrogram
plot(hClustering)
It’s possible to make slightly prettier dendrograms with some modification to the usual plotting method for the output of hclust(). Here’s a function that takes the output of hclust() and color codes each of the cluster members by their cluster membership.
# Create the Function
myplclust <- function(hclust, lab = hclust$labels,
                      lab.col = rep(1, length(hclust$labels)), hang = 0.1, ...) {
    ## Modification of plclust for plotting hclust objects *in colour*!
    ## Copyright Eva KF Chan 2009
    ## Arguments:
    ##   hclust:  an hclust object
    ##   lab:     a character vector of labels for the leaves of the tree
    ##   lab.col: colour for the labels; NA = default device foreground colour
    ##   hang:    as in hclust & plclust
    ## Side effect: a display of the hierarchical cluster with coloured leaf labels.
y <- rep(hclust$height, 2)
x <- as.numeric(hclust$merge)
y <- y[which(x < 0)]
x <- x[which(x < 0)]
x <- abs(x)
y <- y[order(x)]
x <- x[order(x)]
plot(hclust, labels = FALSE, hang = hang, ...)
text(x = x, y = y[hclust$order] - (max(hclust$height) * hang), labels = lab[hclust$order],
col = lab.col[hclust$order], srt = 90, adj = c(1, 0.5), xpd = NA, ...)
}
# Running the Function
# First, perform the hierarchical clustering with hclust() on the data frame
hClustering <- data.frame(x = x, y = y) %>% dist %>% hclust
# Then run the function to color the end nodes
myplclust(hClustering, lab = rep(1:3, each = 4), lab.col = rep(1:3, each = 4))
One issue that we haven’t discussed yet is how exactly the merging of clusters works. Recall that once we find the two points that are closest together, we “merge” them and then consider the merged pair as a single “point”. When we compare this merged “point” with other points, how should we measure the distance from one point to this merged cluster of points?
One method, called “complete” linkage, is to measure the distance between two groups of points by the maximum distance between the two groups. That is, take all points in group 1 and all points in group 2 and find the two points that are furthest apart; that’s the distance between the groups.
Here’s what that would look like with our simulated data.
[Figure: hierarchial-clustering-merge.png]
[Figure: hierarchial-clustering-averaging.png]
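The second figure refers to another common merging rule, “average” linkage, in which the distance between two groups is the average of all pairwise distances between their members. In R, the merging rule is controlled by the method argument to hclust(); the default is "complete". Here is a short sketch (the variable names are arbitrary) comparing the two on the dataFrame simulated above:
# Complete linkage (the default): group distance = maximum pairwise distance
hComplete <- hclust(dist(dataFrame), method = "complete")
# Average linkage: group distance = average pairwise distance
hAverage <- hclust(dist(dataFrame), method = "average")
# Plot the two dendrograms side by side to compare the merge heights
par(mfrow = c(1, 2))
plot(hComplete, main = "Complete linkage")
plot(hAverage, main = "Average linkage")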
The heatmap() function is a handy way to visualize matrix data. The basic idea is that heatmap() sorts the rows and columns of a matrix according to the clustering determined by a call to hclust().
Conceptually, heatmap() first treats the rows of a matrix as observations and calls hclust() on them, then it treats the columns of a matrix as observations and calls hclust() on those values.
The end result is that you get a dendrogram associated with both the rows and columns of a matrix, which can help you to spot obvious patterns in the data.
# Convert the data frame to a numeric matrix, since heatmap() expects a matrix
dataMatrix <- data.frame(x = x, y = y) %>% data.matrix
# Draw the heatmap with row and column dendrograms
heatmap(dataMatrix)
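Because our simulated data set has only two columns, the column dendrogram above is not very interesting. A sketch with a slightly larger made-up matrix (the dimensions and shift here are arbitrary) shows the row and column reordering more clearly:
# A hypothetical 40 x 10 matrix with the first 20 rows shifted upward,
# so the row dendrogram picks up two clear groups
set.seed(678910)
bigMatrix <- matrix(rnorm(400), nrow = 40, ncol = 10)
bigMatrix[1:20, ] <- bigMatrix[1:20, ] + 3
heatmap(bigMatrix)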
Hierarchical clustering is a really useful tool because it quickly gives you an idea of the relationships between variables/observations.
But some caution should be used with clustering, because the picture you produce can often be unstable. In particular, it may be sensitive to:
Changing a few data points, or the presence of missing values
The choice of distance metric
The choice of merging strategy (linkage)
The scale of measurement of the individual variables
Another issue is that choosing where to “cut” the tree to determine the number of clusters isn’t always obvious. In light of some of these limitations, hierarchical clustering should be primarily used for exploration of data.
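One way to turn a dendrogram into an explicit set of cluster labels is to cut the tree at a chosen height, or into a chosen number of groups, with cutree(); the choice of 3 groups below is just for illustration, matching the three simulated clusters, and it reuses the hClustering object from above.
# Cut the dendrogram so that exactly 3 clusters are produced
cutree(hClustering, k = 3)
# Alternatively, cut at a specific height on the dendrogram
cutree(hClustering, h = 1.5)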