Clustering, in data analysis and machine learning, is a technique that involves grouping similar data points together based on their characteristics or patterns. By categorizing data points into clusters, analysts can better understand the underlying patterns and trends within their datasets. This understanding can lead to more informed decision-making, enhanced data visualization, improved anomaly detection, and targeted marketing strategies, among other benefits. This makes clustering an essential tool in various domains, including marketing, biology, social network analysis, and image recognition, to name a few.
There are several clustering methods available, each with its own approach and suitability for different types of data. Some of the most commonly used are K-means Clustering, which partitions data into cohesive clusters; Density-Based Clustering, which identifies dense regions amid noisy data; Spectral Clustering, which combines graph theory and linear algebra to uncover intricate data structures; Fuzzy Clustering, which lets data points with overlapping features belong to more than one cluster; and the focal point of this tutorial, Hierarchical Clustering, which builds an insightful tree of clusters.
In this tutorial, we’ll explore the different types of hierarchical clustering using easy examples. We’ll also dive into hierarchical clustering linkage methods and get to know dendrograms: what they show and how to interpret them.
R stands out as a great programming language for statistical computing, offering a wide range of user-friendly libraries and packages that simplify clustering. Libraries like “cluster,” “stats,” and “fpc” provide easy access to clustering methods such as K-means, hierarchical clustering, and DBSCAN. Moreover, R offers visualization tools like scatter plots and dendrograms, which aid in interpreting the clustering results and gaining insights into the underlying patterns and structures present in the data.
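For reference, here is a quick sketch of where those functions live; stats ships with R, while the two add-on packages can be installed from CRAN (the functions listed in the comments are examples, not an exhaustive inventory):
# stats (attached by default) provides kmeans(), hclust(), dist(), and cutree()
# install.packages(c("cluster", "fpc"))  # uncomment if the packages are missing
library(cluster)  # agnes(), diana(), pam() for agglomerative, divisive, and medoid clustering
library(fpc)      # dbscan() for density-based clustering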
Hierarchical clustering stands out from other clustering methods because it organizes data points into a dendrogram based on their similarities. This tree-like structure uncovers subclusters and allows for an in-depth exploration of patterns within the data. Imagine looking at a family tree; it doesn’t just show immediate family members but also extended relatives, providing a comprehensive view of relationships. Similarly, hierarchical clustering uses dendrograms to paint a detailed picture of how data points relate to each other, revealing intricate clusters within larger groupings.
To further illustrate, in a dendrogram the leaves represent individual data points, similar to family members in a family tree. The root, on the other hand, signifies the entire dataset or common origin point, much like the oldest generation in a family tree. When two leaves join to form a branch, it indicates that those data points are similar and belong to the same cluster. The height at which branches merge reflects the dissimilarity between the clusters being joined: the taller the merge, the less similar the clusters. This height plays a crucial role in determining the number of clusters; you choose a height cutoff to decide how many clusters you want, much like deciding how many generations to include in a family tree.
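As a minimal sketch of this height-cutoff idea, consider a toy example with ten made-up two-dimensional points (the data and the cutoff value below are arbitrary, chosen only for illustration):
# Toy illustration of cutting a dendrogram at a chosen height
set.seed(1)
toy_points <- matrix(rnorm(20), ncol = 2)  # ten random 2-D points
toy_hc <- hclust(dist(toy_points))         # build the hierarchical tree
plot(toy_hc)                               # inspect the merge heights
cutree(toy_hc, h = 1.5)                    # cut at height 1.5: one cluster label per point
cutree(toy_hc, k = 3)                      # or ask directly for three clusters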
Consider a practical scenario in a retail store aiming to understand its customers better. By gathering information on customers’ purchasing habits, like what they buy (electronics, clothing, groceries), how often they visit the store, and their average spending, hierarchical clustering can group similar customers together. For example, customers who frequently invest in electronics and spend generously might fall into one cluster, while those who mainly buy groceries and visit less often could form another. This segmentation helps the store tailor its marketing strategies. For instance, they could offer exclusive electronics discounts to the first cluster and loyalty rewards to the second, enhancing customer satisfaction and optimizing their marketing efforts based on distinct customer behaviors.
Hierarchical clustering comes in two types:
Agglomerative Clustering: This method starts from individual data points and gradually groups them together, building clusters in a step-by-step manner. It’s like putting together puzzle pieces, starting with the small parts and joining them to form larger, cohesive clusters.
In the scenario mentioned earlier, let’s imagine creating a fictional dataset for customers.
# Set random seed for reproducibility
set.seed(42)
# Number of customers
n_customers <- 100
# Create a synthetic customer dataset
customer_data <- data.frame(
CustomerID = 1:n_customers,
ProductCategory = sample(c("electronics", "clothing", "groceries"), n_customers, replace = TRUE),
VisitsPerMonth = round(runif(n_customers, min = 1, max = 10)),
AverageSpending = rnorm(n_customers, mean = 50, sd = 10)
)
# Filter customers who frequently invest in electronics and spend generously
electronics_customers <- subset(customer_data, ProductCategory == "electronics" & AverageSpending > 50)
# Filter customers who mainly buy groceries and visit less often
groceries_customers <- subset(customer_data, ProductCategory == "groceries" & VisitsPerMonth < 5)
# Combine the filtered data
filtered_data <- rbind(electronics_customers, groceries_customers)
# Perform Agglomerative Hierarchical Clustering
hc <- hclust(dist(filtered_data[, c("VisitsPerMonth", "AverageSpending")]))
# Plot the dendrogram
plot(hc)
In the resulting dendrogram, each leaf represents a customer. The height at which branches merge shows how different or similar the customer groups are: customers with similar habits in terms of how often they visit the store and how much they spend end up in nearby branches of the tree, forming distinct clusters.
In this dendrogram, we can see two main groups. The first includes customers who frequently buy electronics and spend a lot, matching the filter applied in the code; these customers tend to visit often and spend more than the others. The second consists of customers who mainly purchase groceries and visit the store infrequently, with fewer visits and lower spending than the electronics customers. These clear distinctions in purchasing behavior highlight the different customer segments.
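To turn the dendrogram into explicit segment labels, we can cut the tree with cutree(). The sketch below is one way to do it; the Segment column is our own addition, and k = 2 mirrors the two groups described above (the cut may not split the customers exactly along the electronics/groceries line, since the clustering only sees visits and spending):
# Cut the tree into two clusters and attach the labels to the customers
filtered_data$Segment <- cutree(hc, k = 2)
# Compare average behavior per segment
aggregate(cbind(VisitsPerMonth, AverageSpending) ~ Segment,
          data = filtered_data, FUN = mean)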
Divisive Clustering: In contrast, divisive clustering begins with all the data points grouped as one big cluster. Then, it divides this large cluster into smaller ones, following a top-down approach. It’s akin to breaking a big task into smaller, manageable chunks for easier understanding and analysis.
# Load the cluster package, which provides the diana() function
library(cluster)
# Perform Divisive Hierarchical Clustering (DIANA)
divisive_result <- diana(filtered_data[, c("VisitsPerMonth", "AverageSpending")])
# Plot the divisive hierarchical clustering result
plot(as.hclust(divisive_result))
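As a side note, diana() also reports a divisive coefficient (a value between 0 and 1, with values closer to 1 suggesting a stronger clustering structure), and the divisive tree can be cut just like the agglomerative one. A brief sketch:
# Divisive coefficient reported by diana()
divisive_result$dc
# Cut the divisive tree into two groups, mirroring the agglomerative example
cutree(as.hclust(divisive_result), k = 2)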
The algorithm behind agglomerative hierarchical clustering doesn’t require specifying the number of clusters beforehand. Instead, it starts by considering each data point as a separate cluster and then repeatedly merges the closest clusters until all the data points are grouped into one cluster. How “closest” is measured depends on the linkage used to compute the distance between clusters. Here are four commonly used linkage methods:
Single Linkage: In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between any two points in each cluster. It focuses on the closest neighboring points of the clusters being compared. It tends to create long, elongated clusters.
Complete Linkage: Complete linkage, on the other hand, calculates the distance between two clusters as the longest distance between any two points in each cluster. It emphasizes the farthest points of the clusters being compared, often resulting in compact, spherical clusters.
Average Linkage: Average linkage calculates the distance between two clusters as the average distance between all pairs of points in the two clusters. It provides a balanced approach, considering the overall internal distances within the clusters. It often results in more balanced, round-shaped clusters compared to single linkage.
Centroid Linkage: Centroid linkage computes the distance between two clusters as the distance between their centroids, which are the mean positions of all points in each cluster. It’s computationally efficient and tends to create well-balanced, evenly sized clusters.
In simpler terms, single linkage focuses on the closest points, complete linkage looks at the farthest points, average linkage balances overall distances, and centroid linkage compares the average positions of clusters. Each method has its own strengths and weaknesses, making them suitable for different types of data and clustering goals.
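To make these definitions concrete, here is a small sketch that computes the four linkage distances by hand for two made-up clusters of two-dimensional points (all names and values below are ours, for illustration only):
# Two small, made-up clusters of 2-D points
cluster_a <- matrix(c(1, 1, 1.5, 2), ncol = 2, byrow = TRUE)          # 2 points
cluster_b <- matrix(c(5, 4, 6, 5, 5.5, 4.5), ncol = 2, byrow = TRUE)  # 3 points
# All pairwise distances between points of cluster A and points of cluster B
pairwise <- as.matrix(dist(rbind(cluster_a, cluster_b)))[1:2, 3:5]
min(pairwise)   # single linkage: the shortest between-cluster distance
max(pairwise)   # complete linkage: the longest between-cluster distance
mean(pairwise)  # average linkage: the mean of all between-cluster distances
# Centroid linkage: the distance between the two cluster means
dist(rbind(colMeans(cluster_a), colMeans(cluster_b)))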
Note that implementing any of these linkages in R takes three easy steps:
Calculate Distances: First, compute the distance matrix between your data points using the dist() function.
Perform Clustering: Pass the distance matrix to the hclust() function and specify the method ("single", "complete", "average", or "centroid") to perform the corresponding linkage clustering.
Visualize the Dendrogram: Plot the result with the plot() function to visualize the hierarchical clustering structure and see how the data points are linked into clusters.
Now, let’s work with a real dataset, “mtcars,” to understand hierarchical clustering methods step by step. This dataset provides details about different car models, such as miles per gallon, horsepower, and weight, allowing us to apply clustering methods effectively. Let’s take a closer look at the dataset by viewing the first few rows and summarizing its key statistics.
#view first six rows of mtcars dataset
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
#summarize mtcars dataset
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
#display column names
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
We focus on specific attributes like miles per gallon (mpg), horsepower (hp), and weight (wt). These attributes will be the basis for our clustering analysis.
Next, we perform hierarchical clustering using different methods. To start, we calculate the distances between data points based on these attributes. For each method - single linkage, complete linkage, average linkage, and centroid linkage - we create a dendrogram. In these diagrams, each car model is represented as a leaf, and the height at which branches merge indicates the similarity between models. Lower branches show close similarities, while higher ones indicate greater dissimilarities.
# Extract the features (attributes) from the mtcars dataset
mtcars_features <- mtcars[, c("mpg", "hp", "wt")]
# Perform Single Linkage Hierarchical Clustering
single_linkage_result <- hclust(dist(mtcars_features), method = "single")
# Perform Complete Linkage Hierarchical Clustering
complete_linkage_result <- hclust(dist(mtcars_features), method = "complete")
# Perform Average Linkage Hierarchical Clustering
average_linkage_result <- hclust(dist(mtcars_features), method = "average")
# Perform Centroid Linkage Hierarchical Clustering
centroid_linkage_result <- hclust(dist(mtcars_features), method = "centroid")
# Plot dendrograms for different linkage methods
par(mfrow = c(2, 2))
plot(single_linkage_result, main = "Single Linkage")
plot(complete_linkage_result, main = "Complete Linkage")
plot(average_linkage_result, main = "Average Linkage")
plot(centroid_linkage_result, main = "Centroid Linkage")
Furthermore, dendrograms won’t tell you how many clusters there are; you have to decide the number of clusters yourself. Here’s where the rect.hclust() function comes into play. It draws rectangles around hierarchical clusters, providing a visual representation of the clusters within your data. In the example below, the k argument sets the number of clusters, and border = 2:4 assigns a distinct color to each cluster’s rectangle, which helps in visually distinguishing the clusters within your data.
# Create dendrogram plots with clusters visualized
par(mfrow = c(2, 2))
plot(single_linkage_result, main = "Single Linkage", sub = "")
rect.hclust(single_linkage_result, k = 3, border = 2:4)
plot(complete_linkage_result, main = "Complete Linkage", sub = "")
rect.hclust(complete_linkage_result, k = 3, border = 2:4)
plot(average_linkage_result, main = "Average Linkage", sub = "")
rect.hclust(average_linkage_result, k = 3, border = 2:4)
plot(centroid_linkage_result, main = "Centroid Linkage", sub = "")
rect.hclust(centroid_linkage_result, k = 3, border = 2:4)
In the generated dendrograms, it’s worth noting that the Maserati Bora forms its own cluster under every linkage method. The reason lies in the attributes we’re considering: miles per gallon (mpg), horsepower (hp), and weight (wt). Compared to the other car models, the Maserati Bora has far higher horsepower, which makes it dissimilar from the rest of the cars in the dataset. To illustrate this visually, we can draw a dissimilarity heatmap: a color-coded matrix of the pairwise dissimilarities between data points, where darker cells indicate higher dissimilarity. In such a heatmap, the Maserati Bora once again appears as the darkest shade, confirming its unusual attributes.
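One way to produce such a heatmap is with base R’s heatmap() function applied to the same distance matrix used for clustering; the sketch below uses a gray palette (our choice) so that darker cells correspond to larger dissimilarities.
# Dissimilarity heatmap: darker cells mark pairs of cars that are farther apart
dissimilarity <- as.matrix(dist(mtcars_features))
heatmap(dissimilarity, symm = TRUE, scale = "none",
        col = gray.colors(256, start = 0.95, end = 0.05),
        main = "Dissimilarity heatmap")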
To wrap up, remember, the way you link your data in hierarchical clustering really matters; it can change your results a lot. So, pick the method that suits your data and goals best. With this knowledge, you can now effectively apply hierarchical clustering in various real-world scenarios.