Welcome to this comprehensive tutorial on Hierarchical Clustering!
Introduction to Hierarchical Clustering: Understand the fundamental concepts behind this clustering method, how it works, and where it’s best applied.
Diverse Dataset Explorations:
Iris Dataset, USArrests Dataset, Mtcars Dataset
Visual Insights with Dendrograms: Uncover the power of dendrograms in visually representing data hierarchies and relationships.
Concluding Remarks: Reflect on our observations, understand the strengths and potential pitfalls of hierarchical clustering, explore further possibilities, and find links to the datasets in the links section.
By the end of this tutorial, you’ll be well-equipped to implement hierarchical clustering on diverse datasets and draw meaningful insights. Let’s dive in!
Brief Overview:
Hierarchical clustering is a method used to build a hierarchy of clusters. Starting with each data point as its own cluster, it repeatedly merges the two closest clusters into a larger one until a single cluster contains all data points. It is commonly used in many fields, such as biology for gene expression analysis and marketing for customer segmentation.
Clustering, more broadly, is an unsupervised machine learning technique used to identify and group similar data points together without prior labels for the groups. Hierarchical clustering builds these groups by merging clusters according to a linkage criterion:
Single: uses the shortest distance between any point in one cluster and any point in the other.
Complete: uses the longest distance between any point in one cluster and any point in the other.
Average: uses the average of the pairwise distances between members of the two clusters.
Ward’s: merges the pair of clusters that gives the smallest increase in total within-cluster variance.
Distances between clusters can be measured using different metrics, like Euclidean or Manhattan, depending on the nature of data and preference.
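As a quick, minimal sketch (the toy matrix m and the variable names below are illustrative, not part of the tutorial's datasets), the distance metric is chosen in dist() and the linkage method in hclust():
# Toy example: the same points clustered with two metrics and two linkages
m <- matrix(rnorm(20), ncol = 2)
d_euclidean <- dist(m, method = "euclidean")   # default metric
d_manhattan <- dist(m, method = "manhattan")
hc_complete_euclid <- hclust(d_euclidean, method = "complete")
hc_average_manhat <- hclust(d_manhattan, method = "average")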
A dendrogram is a tree-like diagram that shows the hierarchical relationships between clusters. Its y-axis gives the distance (or dissimilarity) at which clusters merge, while the objects are arranged along the x-axis; each merge is drawn as a link joining two branches.
For this example, we’ll generate a set of points in a 2D plane and then apply hierarchical clustering to observe how it performs on such data.
Objective: The goal is to visually understand how hierarchical clustering forms clusters in a stepwise fashion and how different linkage methods might impact the resulting dendrogram and clusters.
Data Creation: Let’s create a small set of 2D data points that have some discernible clusters.
# Load necessary libraries
library(ggplot2)
library(cluster)
# Generate data points
set.seed(123)
X1 <- matrix(rnorm(50 * 2), ncol=2)
X2 <- matrix(rnorm(30 * 2, mean=3), ncol=2)
data <- rbind(X1, X2)
# Plot the data points
ggplot(as.data.frame(data), aes(x = V1, y = V2)) +
geom_point(color = "blue", size = 3) +
theme_minimal() +
labs(title = "2D Data Points", x = "X-axis", y = "Y-axis")
Now, we’ll perform hierarchical clustering on this data using the complete linkage method and then plot the dendrogram.
# Hierarchical clustering using complete linkage
hc_complete <- hclust(dist(data), method="complete")
# Plotting the dendrogram
plot(hc_complete, main="Dendrogram with Complete Linkage",
xlab="Data Points", ylab="Distance", hang=-1,
las=2, cex=0.6)
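To see the clusters themselves (and how a different linkage changes them), we can cut the tree and colour the original points; here is a small sketch reusing the objects above (the k = 2 cut, the data frame name df, and the single-linkage comparison are illustrative choices, not part of the original code):
# Cut the complete-linkage tree into two clusters and colour the points
df <- as.data.frame(data)
df$cluster <- factor(cutree(hc_complete, k = 2))
ggplot(df, aes(x = V1, y = V2, color = cluster)) +
geom_point(size = 3) +
theme_minimal() +
labs(title = "Clusters from Complete Linkage (k = 2)", color = "Cluster")
# Single linkage on the same data, for comparison
hc_single <- hclust(dist(data), method = "single")
table(df$cluster, cutree(hc_single, k = 2))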
The iris dataset is a classic dataset in pattern recognition literature. It contains measurements for 150 iris flowers from three different species.
# Load the iris dataset
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Since the iris dataset contains a categorical column (Species), we will drop it and only use the numeric measurements for clustering.
iris_numeric <- iris[, 1:4]
# Hierarchical clustering using complete linkage
hc_iris_complete <- hclust(dist(iris_numeric), method="complete")
# Plotting the dendrogram
plot(hc_iris_complete, main="Dendrogram for iris dataset (Complete Linkage)",
xlab="Flowers", ylab="Distance", hang=-1,
las=2, cex=0.6)
Cutting the Dendrogram:
You can cut the dendrogram to create k clusters. For instance, let’s cut it to get three clusters (which corresponds to the actual number of species in the dataset).
# Cut tree into groups and color them
suppressPackageStartupMessages(library(dendextend))
dend_iris <- as.dendrogram(hc_iris_complete)
dend_colored <- color_branches(dend_iris, h=6)
plot(dend_colored, main="Colored Dendrogram for iris dataset (Complete Linkage)")
cut_clusters <- cutree(hc_iris_complete, k=3)
table(iris$Species, cut_clusters)
## cut_clusters
## 1 2 3
## setosa 50 0 0
## versicolor 0 23 27
## virginica 0 49 1
The dendrogram shows the step-by-step process of how individual data points are merged into clusters based on their similarity (or distance).
There is a prominent high-level split near a distance of 6, which hints at the natural existence of two or three main clusters in the data.
Colored Dendrogram (Complete Linkage):
The colored dendrogram helps visualize the height (distance) at which merges occur. At a height of 6, there is a clear distinction between the red and green clusters. The cyan branch becomes a sub-cluster of the green cluster. If we were to cut the dendrogram at this height, we would get three clusters, which corresponds with the actual number of species in the dataset.
Setosa is perfectly clustered, with all 50 samples being grouped into cluster 1. This implies that setosa flowers have distinct characteristics that set them apart from versicolor and virginica.
Versicolor and virginica overlap. 23 of the 50 versicolor samples fall into cluster 2, while the remaining 27 form cluster 3, so a sizeable portion of versicolor is distinct but a comparable portion sits close to virginica.
Almost all virginica samples (49 out of 50) are in cluster 2, with just one in cluster 3, which confirms that cluster 2 mixes virginica with the versicolor samples that resemble it.
This dataset contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. It also provides the percentage of the population living in urban areas.
Step 1: Load the dataset and inspect the first few rows.
data(USArrests)
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
Step 2: Scale the data since the variables are measured in different units.
scaled_data <- scale(USArrests)
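A quick check of the per-column spread shows why this matters: on the raw scale, Assault has a much larger standard deviation than the other variables and would dominate the Euclidean distances.
# Compare the spread of each column before and after scaling
apply(USArrests, 2, sd)   # raw standard deviations differ widely
apply(scaled_data, 2, sd) # after scaling, every column has standard deviation 1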
Step 3: Compute the hierarchical clustering using the average linkage method.
hc_average <- hclust(dist(scaled_data), method="average")
Step 4: Plot the dendrogram.
plot(hc_average, main="Dendrogram with Average Linkage",
xlab="States", ylab="Distance", hang=-1, cex=0.6)
Step 5: For comparison, repeat the clustering with Ward’s linkage and plot its dendrogram.
hc_ward <- hclust(dist(scaled_data), method="ward.D2")
plot(hc_ward, main="Dendrogram with Ward's Linkage",
xlab="States", ylab="Distance", hang=-1, cex=0.6)
### Comparison:
Ward’s method tends to create more balanced clusters, where each cluster has roughly the same number of observations, because it minimizes the within-cluster variance at each merge.
The average linkage method merges clusters based on the average distance between their members, which can sometimes lead to unbalanced clusters.
The dendrograms show that the hierarchical structure of the states differs depending on the linkage method used. Which method works best depends on the objective of the analysis (e.g., balanced clusters versus a more detailed classification); for this comparison, Ward’s method gives the more balanced and interpretable grouping.
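One way to make the comparison concrete is to cut both trees into the same number of clusters and compare the resulting cluster sizes; a brief sketch (the choice of k = 4 is arbitrary and only for illustration):
# Compare cluster sizes from the two linkage methods at the same cut
clusters_average <- cutree(hc_average, k = 4)
clusters_ward <- cutree(hc_ward, k = 4)
table(clusters_average)                # cluster sizes under average linkage
table(clusters_ward)                   # cluster sizes under Ward's linkage
table(clusters_average, clusters_ward) # how the two assignments overlap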
Introducing noise to a dataset means adding random values or disturbances that do not follow the main structure or pattern of the original data.
# Assuming USArrests dataset is loaded
set.seed(123) # for reproducibility
noise_level <- 0.05
# Add noise to the USArrests dataset
noisy_data <- USArrests + sapply(USArrests, function(column) {
column * (runif(length(column)) * 2 * noise_level - noise_level) # perturb each value by up to ±5%
})
# Hierarchical clustering on the noisy data (scaled, to match the earlier USArrests analysis)
dist_noisy <- dist(scale(noisy_data))
hclust_noisy <- hclust(dist_noisy, method = "ward.D2")
# Plot dendrogram
plot(hclust_noisy, main = "Dendrogram with Noise")
#### What we notice after adding noise
Stability: Noise tends to make clustering results less stable. Running the clustering process multiple times with different noise could produce different dendrograms.
Interpretation: The introduction of noise can make clusters less interpretable. States that were closely clustered based on certain crime rates are now more dispersed or associated differently.
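To gauge this instability directly, we can compare cluster memberships before and after the noise is added; here is a short sketch that reuses hc_ward (fitted on the clean, scaled data) and cuts both trees into four clusters (the choice of k = 4 is arbitrary):
# Cross-tabulate assignments from the clean and noisy clusterings
clusters_clean <- cutree(hc_ward, k = 4)
clusters_noisy <- cutree(hclust_noisy, k = 4)
table(clusters_clean, clusters_noisy) # states that switch groups spread across several columns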
Introduction to the mtcars Dataset
The mtcars dataset, available in the R datasets package, provides details about various car models from the 1973-1974 model years. Originally sourced from the 1974 Motor Trend magazine, it has since become a staple for learning and teaching data analysis techniques in R.
# Load necessary libraries
library(datasets)
library(ggplot2)
# Data normalization: It's a good practice to normalize data before clustering
normalized_mtcars <- as.data.frame(lapply(mtcars, scale))
# Ensure row names are preserved during normalization
rownames(normalized_mtcars) <- rownames(mtcars)
# Compute the distance matrix
dist_matrix <- dist(normalized_mtcars)
# Perform hierarchical clustering
hc_mtcars <- hclust(dist_matrix, method = "ward.D2")
# Plot the dendrogram with car names
plot(hc_mtcars, main = "Dendrogram for mtcars Dataset",
xlab = "Car Models", ylab = "Distance", cex = 0.7)
# Optionally, you can cut the dendrogram to form k clusters and color them
# For this example, let's assume k=3
cluster_assignments <- cutree(hc_mtcars, k = 3)
rect.hclust(hc_mtcars, k = 3, border = 2:4)
A few observations we can see from the dendrogram:
Hierarchical Structure: The dendrogram showcases how the cars can be grouped hierarchically. Starting from the bottom, each leaf represents a car model, and as you move up the tree, cars (or groups of cars) are merged together based on their similarity.
Cutting the Dendrogram: The colored rectangles drawn by rect.hclust mark the three clusters obtained by cutting the tree at k = 3; each rectangle encloses the car models assigned to one cluster. Cutting higher up the tree would yield fewer, larger clusters, while cutting lower would yield more, finer-grained ones.
Car Clusters: By looking at where the branches merge, we can see which cars are more similar to each other. For instance:
The “Porsche 914-2”, “Lotus Europa”, and “Datsun 710” appear quite similar as they merge at a lower distance.
On the rightmost side, cars like the “Mazda RX4”, “Mazda RX4 Wag”, and “Maserati Bora” are clustered together.
Distance Metric: The “Distance” y-axis represents how different cars or clusters of cars are. The higher up two clusters merge, the more dissimilar they are.
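To read the groupings off directly rather than from the plot, we can list the car models in each of the three clusters using the cluster_assignments computed above:
# List the car models belonging to each of the three clusters
split(rownames(normalized_mtcars), cluster_assignments)
# Cluster sizes
table(cluster_assignments)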
In this tutorial, we delved deep into hierarchical clustering using three distinct datasets: iris, USArrests, and mtcars.
Observations (iris): The dendrogram provided clear distinctions among the three species. It demonstrated the power of hierarchical clustering in separating clusters based on feature similarities.
Observations (USArrests): We saw how the U.S. states clustered based on their crime rates and urban population percentages. The dendrograms showed groupings that might suggest regional similarities or shared socio-economic factors influencing crime rates.
Observations (mtcars): The hierarchical clustering revealed patterns in car models based on their performance and design features. For instance, cars with similar horsepower or fuel efficiency tended to cluster together.
Hierarchical clustering is a powerful unsupervised learning technique that finds structure in data by grouping similar items into clusters. Through dendrograms, we can visually interpret these clusters and decide how best to group the data points. The iris, USArrests, and mtcars examples all illustrated the versatility and interpretative power of hierarchical clustering. Whether analyzing flowers, states, or cars, it provides valuable insight into the inherent groupings within a dataset. As with all analyses, careful data preprocessing and method selection are essential to obtaining meaningful results.
Source: Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
Link to Dataset: The mtcars dataset comes built-in with R’s datasets package. You can access it directly in R using the command data(mtcars). For a direct link https://gist.github.com/seankross/a412dfbd88b3db70b74b
Source: Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.
Link to Dataset: The iris dataset is also a built-in dataset in R. You can access it using the command data(iris). For a direct link https://archive.ics.uci.edu/dataset/53/iris
Link to Dataset: The USArrests dataset is also built into R’s datasets package; access it with data(USArrests). Provided link: https://rstudio-pubs-static.s3.amazonaws.com/542881_05378d169a2649039cba0cc8db072183.html