Clustering is a technique for grouping objects that are similar (close in terms of distance) into the same group, or cluster. Unlike supervised learning methods such as classification and regression, cluster analysis does not use any label information; it relies only on the similarity between data features to group observations into clusters.
Clustering is widely used in business analysis. For example, a marketing department can use clustering to segment customers by their personal attributes, and then design different marketing campaigns that target each type of customer.
Hierarchical clustering builds a hierarchy of clusters using either an agglomerative or a divisive approach. In both cases, a distance (dissimilarity) measure is used to decide which clusters to combine or split. The recursive process continues until all observations end up in a single cluster (agglomerative) or until no cluster can be split further (divisive). The resulting hierarchy can be represented by a dendrogram.
Agglomerative clustering
Agglomerative clustering is a bottom-up technique: it starts by treating each data point as its own cluster and then repeatedly merges the closest clusters into larger groups until a single cluster containing all the data remains.
Divisive clustering
Divisive clustering is the opposite: it starts with all the data in one cluster, which is then split in two according to the similarities (or distances) in the data. These new clusters are split in turn, and so on until each observation forms its own cluster.
Agglomerative clustering linkage algorithms
Agglomerative algorithms use a linkage criterion to merge data points (at the first step) or clusters (at subsequent steps). Common linkage functions include the following (a brief sketch showing how the choice is passed to hclust() follows the list):
Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.
Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.
Centroid linkage clustering: It computes the dissimilarity between the centroid of cluster 1 (a mean vector of length p, where p is the number of variables) and the centroid of cluster 2.
Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step, the pair of clusters whose merger produces the smallest increase in total within-cluster variance is merged.
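As an illustration of how the linkage choice enters the code, the sketch below keeps the same Euclidean distance matrix and only varies the method argument of hclust(); the built-in USArrests data set is used here purely as a stand-in and is not part of the customer analysis that follows.
#Illustrative sketch: same distances, different linkage criteria (USArrests is a built-in data set)
d <- dist(scale(USArrests), method = "euclidean")
hc_complete <- hclust(d, method = "complete")   #maximum (complete) linkage
hc_single   <- hclust(d, method = "single")     #minimum (single) linkage
hc_average  <- hclust(d, method = "average")    #mean (average) linkage
hc_centroid <- hclust(d, method = "centroid")   #centroid linkage
hc_ward     <- hclust(d, method = "ward.D2")    #Ward's minimum variance
plot(hc_complete, main = "Complete linkage", cex = 0.6)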
Data Preparation
To perform a cluster analysis in R, the data should generally be prepared as follows (a brief sketch follows the list):
1. Rows are observations (individuals) and columns are variables
2. Any missing value in the data must be removed or estimated.
3. The data must be standardized (i.e., scaled) to make variables comparable. Standardization consists of transforming the variables such that they have mean zero and standard deviation one.
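A minimal sketch of steps 2 and 3 on a toy data frame (the data frame and its values are made up purely for illustration):
#Toy data frame: rows are observations, columns are variables, one value is missing
df <- data.frame(x = c(1, 2, NA, 4), y = c(10, 20, 30, 40))
df <- na.omit(df)        #remove rows with missing values (or impute them instead)
df_scaled <- scale(df)   #standardize: each column now has mean 0 and standard deviation 1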
Data used
We will perform hierarchical clustering on a customer data set in order to segment customers into different groups. The data can be downloaded from this GitHub page: https://github.com/ywchiu/ml_R_cookbook/tree/master/CH9.
Install Packages
#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("kassambara/factoextra")
#install.packages("cluster")
#install.packages("dendextend")
#install.packages("corrplot")
#install.packages("tidyverse")
#install.packages("NbClust")
library(factoextra) #clustering visualization
library(cluster) #clustering algorithms
library(dendextend) #for comparing two dendrograms
library(corrplot) #correlation between dendrograms
library(tidyverse) #data manipulation
library(NbClust) #determine optimal no. of clusters
Load the data
library(readr)
customer_data <- read_csv("D:/Analytics/RPubs/Datasets/customer.csv")
## Parsed with column specification:
## cols(
## ID = col_integer(),
## Visit.Time = col_integer(),
## Average.Expense = col_double(),
## Sex = col_integer(),
## Age = col_integer()
## )
head(customer_data)
## # A tibble: 6 x 5
## ID Visit.Time Average.Expense Sex Age
## <int> <int> <dbl> <int> <int>
## 1 1 3 5.70 0 10
## 2 2 5 14.5 0 27
## 3 3 16 33.5 0 32
## 4 4 5 15.9 0 30
## 5 5 16 24.9 0 23
## 6 6 3 12.0 0 15
str(customer_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 60 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Visit.Time : int 3 5 16 5 16 3 12 14 6 3 ...
## $ Average.Expense: num 5.7 14.5 33.5 15.9 24.9 12 28.5 18.8 23.8 5.3 ...
## $ Sex : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Age : int 10 27 32 30 23 15 33 27 16 11 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 5
## .. ..$ ID : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Visit.Time : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Average.Expense: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Sex : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Age : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
NA_Check <- is.na(customer_data) #TRUE wherever a value is missing
summary(NA_Check)
## ID Visit.Time Average.Expense Sex
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:60 FALSE:60 FALSE:60 FALSE:60
## Age
## Mode :logical
## FALSE:60
Normalize the customer data so that all variables are on the same scale
customer <- scale(customer_data[,-1]) #drop the ID column and standardize the remaining variables
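As a quick sanity check (not part of the original workflow), each scaled column should now have a mean of approximately 0 and a standard deviation of 1:
round(colMeans(customer), 6)  #column means should all be (approximately) 0
apply(customer, 2, sd)        #column standard deviations should all be 1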
Use agglomerative hierarchical clustering to cluster the customer data
hc <- hclust(dist(customer, method = "euclidean"), method = "ward.D2")
hc
##
## Call:
## hclust(d = dist(customer, method = "euclidean"), method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 60
Plot the dendrogram
plot(hc,hang = -0.01, cex = 0.7)
Cutting the tree into clusters
In the dendrogram we can see the hierarchy of clusters, but the data has not yet been grouped into clusters. We can, however, determine how many clusters are present in the dendrogram and cut it at a certain tree height to separate the data into groups. We will use the cutree function to split the data into a given number of clusters.
From the dendrogram we can judge the number of clusters; here there appear to be four clusters within the tree, so we specify k = 4 in the cutree function. Besides specifying the number of clusters, we can also cut the tree at a given height (a sketch follows the cluster counts below). Next, we output the cluster labels of the data and use the table function to count the number of observations in each cluster; the table shows that most of the data falls in cluster 4. Lastly, we use the rect.hclust function to draw red rectangles around the clusters and show how the data is divided into the four groups.
fit <- cutree(hc, k=4)
fit #cluster labels for the data
## [1] 1 1 2 1 2 1 2 2 1 1 1 2 2 1 1 1 2 1 2 3 4 3 4 3 3 4 4 3 4 4 4 3 3 3 4
## [36] 4 3 4 4 4 4 4 4 4 3 3 4 4 4 3 4 3 3 4 4 4 3 4 4 3
table(fit) #count of data within each cluster
## fit
## 1 2 3 4
## 11 8 16 25
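As mentioned above, cutree() can also cut the tree at a given height rather than into a given number of clusters. A minimal sketch, where the cut height of 5 is purely illustrative and should be read off the dendrogram:
fit_h <- cutree(hc, h = 5)  #cut the dendrogram at height 5 (illustrative value)
table(fit_h)                #cluster sizes produced by this cut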
Visualize the clustered data with red rectangle borders
plot(hc)
rect.hclust(hc,k=4, border="red")
Highlighting a particular cluster with a red rectangle border
plot(hc)
rect.hclust(hc,k = 4, which = 2, border = "red")
Distance matrix and cluster height
#dist(customer, method = "euclidean") #full pairwise distance matrix (not printed here because of its length)
#hc$height #heights at which successive merges occurred
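One way these merge heights can be used (a sketch, not part of the original analysis) is to look for a large jump between successive heights, which suggests a natural place to cut the tree:
merge_heights <- sort(hc$height, decreasing = TRUE)  #heights at which clusters were merged
head(merge_heights, 10)                              #the ten largest merge heights
barplot(head(merge_heights, 10), main = "Largest merge heights", ylab = "Height")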
Profiling the clusters
customer_data$Clusters <- cutree(hc, k = 4) #attach the cluster label to each customer
#cluster-wise means of the numeric variables (ID, Sex and the Clusters column excluded)
aggr <- aggregate(customer_data[,-c(1,4,6)], list(customer_data$Clusters), mean)
clus.profile <- data.frame (Cluster = aggr[,1],
Freq = as.vector(table(customer_data$Clusters)),
aggr[,-1]
)
round(clus.profile,2)
## Cluster Freq Visit.Time Average.Expense Age
## 1 1 11 4.91 12.71 17.00
## 2 2 8 14.38 25.59 26.62
## 3 3 16 12.25 25.36 31.19
## 4 4 25 5.56 10.93 15.48
Optional Visualization
If you are interested in drawing a horizontal dendrogram, you can use the dendextend package.
dend <- customer %>% dist %>% hclust %>% as.dendrogram
dend %>%plot(horiz=TRUE, main="Horizontal Dendrogram")
#Color the branch according to the cluster it belongs to:
dend %>% color_branches(k=4) %>% plot(horiz=TRUE, main ="Horizontal Dendrogram")
#Add a red rectangle around the clusters:
dend %>% rect.dendrogram(k=4,horiz=TRUE)
#Add a line to show the tree cutting location:
abline(v = heights_per_k.dendrogram(dend)["4"] + .1, lwd = 2,lty = 2, col = "blue")
Using the function fviz_cluster() [in factoextra], we can also visualize the result in a scatter plot. Observations are represented by points plotted on the first two principal components, and a frame is drawn around each cluster.
fviz_cluster(list(data = customer, cluster = fit))
Comparing two dendrograms
We’ll use the package dendextend which contains many functions for comparing two dendrograms.
Compute two hierarchical clusterings
hc1 <- hclust(dist(customer, method = "euclidean"), method = "average")
hc2 <- hclust(dist(customer, method = "euclidean"), method = "ward.D2")
Compute two dendrograms
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
Create a list of dendrograms
dend_list <- dendlist(dend1, dend2)
The function tanglegram() plots two dendrograms side by side, with their labels connected by lines. It can be used to visually compare two hierarchical clustering methods as follows:
tanglegram(dend1, dend2)
Note that “unique” nodes, with a combination of labels/items not present in the other tree, are highlighted with dashed lines.
The quality of the alignment of the two trees can be measured using the function entanglement(). The output of tanglegram() can be customized using many other options as follows:
tanglegram(dend1, dend2,
highlight_distinct_edges = FALSE, # Turn-off dashed lines
common_subtrees_color_lines = FALSE, # Turn-off line colors
common_subtrees_color_branches = TRUE, # Color common branches
main = paste("entanglement =", round(entanglement(dend_list), 2))
)
Entanglement is a measure between 0 (no entanglement) and 1 (full entanglement). A lower entanglement coefficient corresponds to a better alignment.
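If the alignment is poor, dendextend also provides untangle(), which searches for rotations of the two trees that reduce the entanglement before plotting. A sketch, assuming the installed version of dendextend supports the "step2side" search method:
#Rotate the two dendrograms to reduce entanglement, then plot again
dend_list_untangled <- untangle(dend_list, method = "step2side")
tanglegram(dend_list_untangled,
           main = paste("entanglement =", round(entanglement(dend_list_untangled), 2)))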
Correlation matrix between a list of dendrograms
We can also compare several dendrograms simultaneously. The chaining operator %>% (available in dendextend) is used to run several functions in sequence, which helps keep the code compact:
# Create multiple dendrograms by chaining
dend1 <- customer %>% dist %>% hclust("complete") %>% as.dendrogram
dend2 <- customer %>% dist %>% hclust("single") %>% as.dendrogram
dend3 <- customer %>% dist %>% hclust("average") %>% as.dendrogram
dend4 <- customer %>% dist %>% hclust("centroid") %>% as.dendrogram
# Compute correlation matrix
dend_list <- dendlist("Complete" = dend1, "Single" = dend2,
"Average" = dend3, "Centroid" = dend4)
cors <- cor.dendlist(dend_list)
# Print correlation matrix
round(cors, 2)
## Complete Single Average Centroid
## Complete 1.00 0.40 0.96 0.89
## Single 0.40 1.00 0.54 0.62
## Average 0.96 0.54 1.00 0.95
## Centroid 0.89 0.62 0.95 1.00
# Visualize the correlation matrix using corrplot package
corrplot(cors, "pie", "lower")
Exporting the dataset with cluster information
#Exporting the dataset with cluster information
FinalCluster <- customer_data
FinalCluster$Cluster<- fit
colnames(FinalCluster)<- c("ID","VisitTime","AverageExpense","Sex","Age","Cluster")
head(FinalCluster[,1:6])
## # A tibble: 6 x 6
## ID VisitTime AverageExpense Sex Age Cluster
## <int> <int> <dbl> <int> <int> <int>
## 1 1 3 5.70 0 10 1
## 2 2 5 14.5 0 27 1
## 3 3 16 33.5 0 32 2
## 4 4 5 15.9 0 30 1
## 5 5 16 24.9 0 23 2
## 6 6 3 12.0 0 15 1
#library(xlsx)
#write.xlsx(FinalCluster[,1:6],"D:/Analytics/RPubs/Datasets/FinalCluster.xlsx")
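If the xlsx package (which depends on Java) is not available, the same table can be written with base R instead; the file path below is illustrative and is kept commented out like the Excel export above:
#write.csv(FinalCluster[,1:6], "D:/Analytics/RPubs/Datasets/FinalCluster.csv", row.names = FALSE)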
Alternatively, we can use the agnes function from the cluster package. It behaves very similarly to hclust; however, agnes also returns the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest a strong clustering structure).
# Compute with agnes
hc_a <- agnes(customer, method = "complete")
# Agglomerative coefficient
hc_a$ac
## [1] 0.9164282
This coefficient lets us compare which hierarchical clustering methods identify stronger clustering structures. Here we see that Ward’s method identifies the strongest clustering structure of the four methods assessed.
#method to assess
m <- c("average", "single","complete","ward")
names(m) <- c("average", "single","complete","ward")
#function to compute the agglomerative coefficient for a given linkage method
ac <- function(x){agnes(customer, method = x)$ac}
map_dbl(m, ac) #purrr::map_dbl applies ac() to each method
## average single complete ward
## 0.8582740 0.8082953 0.9164282 0.9669135
The R function diana, provided by the cluster package, allows us to perform divisive hierarchical clustering. diana works similarly to agnes; however, there is no method argument to provide.
# compute divisive hierarchical clustering
hc_d <- diana(customer)
# Divisive coefficient; amount of clustering structure found
hc_d$dc
## [1] 0.9117911
# plot dendrogram
pltree(hc_d, cex = 0.6, hang = -1, main = "Dendrogram of diana")
Cutree with agnes and diana
#Cut agnes() tree into 4 groups
cutree(as.hclust(hc_a), k=4)
## [1] 1 1 2 1 2 1 2 2 1 1 1 2 2 1 1 1 2 1 2 3 4 2 4 2 3 4 4 3 4 4 4 3 3 3 4
## [36] 4 3 4 4 4 4 4 4 4 3 3 4 4 4 2 4 3 2 4 4 4 3 4 4 3
#Cut diana() tree into 4 groups
cutree(as.hclust(hc_d), k=4)
## [1] 1 1 2 1 2 1 2 2 1 1 1 2 2 1 1 1 2 1 2 3 4 3 4 3 3 4 4 3 4 4 4 3 3 3 4
## [36] 4 3 4 4 4 4 4 4 4 3 3 4 4 4 3 4 3 3 4 4 4 3 4 4 3
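To see how closely the agglomerative (agnes) and divisive (diana) partitions agree, the two label vectors can be cross-tabulated; this quick check is not part of the original write-up:
#Cross-tabulate the two partitions: counts concentrated in a few cells indicate agreement
table(agnes = cutree(as.hclust(hc_a), k = 4),
      diana = cutree(as.hclust(hc_d), k = 4))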
Clustering validation and evaluation strategies consist of measuring the goodness of the clustering results. Before applying any clustering algorithm to a data set, the first thing to do is to assess the clustering tendency, that is, whether the data contains any inherent grouping structure.
If it does, the next question is how many clusters there are. You can then perform hierarchical clustering or partitioning clustering (with a pre-specified number of clusters) and, finally, evaluate the goodness of the clustering results.
To assess the clustering tendency, the Hopkins statistic and a visual approach can be used. Both are provided by the function get_clust_tendency() [factoextra package], which also creates an ordered dissimilarity image (ODI).
Hopkins statistic: If the value of Hopkins statistic is close to 1 (far above 0.5), then we can conclude that the dataset is significantly clusterable.
Visual approach: The visual approach detects the clustering tendency by counting the number of square shaped dark (or colored) blocks along the diagonal in the ordered dissimilarity image.
gradient.color <- list(low = "steelblue", high = "white")
customer_data[, -1] %>% # Remove column 1 (ID)
scale() %>% # Scale variables
get_clust_tendency(n = 50, gradient = gradient.color)
## $hopkins_stat
## [1] 0.7222321
##
## $plot
There are different methods for determining the optimal number of clusters.
The NbClust R package provides 30 indices for determining the best number of clusters. Its main arguments are:
1. data: a matrix.
2. diss: the dissimilarity matrix to be used. By default diss = NULL, but if a dissimilarity matrix is supplied, distance should be “NULL”.
3. distance: the distance measure to be used to compute the dissimilarity matrix. Possible values include “euclidean”, “manhattan” or “NULL”.
4. min.nc, max.nc: minimal and maximal number of clusters, respectively.
5. method: The cluster analysis method to be used including “ward.D”, “ward.D2”, “single”, “complete”, “average”, “kmeans” and more.
To compute NbClust() for kmeans, use method = “kmeans”.
To compute NbClust() for hierarchical clustering, method should be one of c(“ward.D”, “ward.D2”, “single”, “complete”, “average”)
nb <- NbClust(customer, distance = "euclidean", min.nc = 2, max.nc = 10, method = "ward.D2")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 2 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 8 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
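The vote counts above can also be summarized graphically; recent versions of factoextra allow fviz_nbclust() to take an NbClust result and draw a bar chart of how many indices favor each number of clusters (shown here on that assumption):
#Bar chart of how many indices voted for each number of clusters (requires a recent factoextra)
fviz_nbclust(nb)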
The term clustering validation refers to the procedure of evaluating the results of a clustering algorithm.
The silhouette plot is one of many measures for inspecting and validating clustering results. The silhouette measures how similar an object is to the other objects in its own cluster compared with those in the neighboring cluster. Its values range from -1 to 1:
> A value close to 1 indicates that the object is well clustered; in other words, the object is similar to the other objects in its group.
> A value close to -1 indicates that the object is poorly clustered and that assigning it to some other cluster would probably improve the overall results.
# Enhanced hierarchical clustering, cut into 4 groups
res.hc <- customer_data[, -1] %>%
scale() %>%
eclust("hclust", k = 4, graph = FALSE)
# Visualize with factoextra
fviz_dend(res.hc, palette = "jco",
rect = TRUE, show_labels = FALSE)
fviz_silhouette(res.hc)
## cluster size ave.sil.width
## 1 1 11 0.52
## 2 2 8 0.57
## 3 3 16 0.41
## 4 4 25 0.61
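Observations with a negative silhouette width are candidates for reassignment. Assuming the eclust result stores the per-observation widths in its silinfo component (as factoextra documents), they can be inspected as follows:
#Per-observation silhouette widths from the eclust result
sil_widths <- res.hc$silinfo$widths
#Observations with a negative silhouette width (possibly assigned to the wrong cluster)
sil_widths[sil_widths[, "sil_width"] < 0, , drop = FALSE]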