Clustering is an important unsupervised machine learning technique in which data points are grouped into clusters so that points within the same cluster share similar features, while points in different clusters are dissimilar. Applications include customer segmentation, image analysis, anomaly detection, gene or protein classification, and data compression. Popular algorithms include K-means, which works well for simple, roughly spherical clusters; hierarchical clustering, which builds dendrograms using agglomerative or divisive methods; DBSCAN, which handles irregularly shaped clusters and outliers; Gaussian mixture models, which suit overlapping clusters; and spectral clustering, which is appropriate for data with complicated structures. Each algorithm offers distinct advantages depending on the data and the objectives.
In this analysis, I will attempt to determine which method is best suited to the customer purchasing behavior dataset so that customers can be segmented effectively.
The dataset used for this analysis is Customer Purchasing Behaviors.csv, provided on Kaggle (https://www.kaggle.com/datasets/hanaksoy/customer-purchasing-behaviors). It contains information about customer behavior, including budget, purchase frequency, demographics, income, and loyalty.
# Packages & Libraries
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(corrplot,clustertend,factoextra,NbClust,ggplot2,cluster,gridExtra,proxy,ClustGeo)
# Import dataset
data<-read.csv("Customer Purchasing Behaviors.csv", sep=",", dec=".", header=TRUE)
data <- as.matrix(data[,c(2:5,7)]) # keep only the numeric variables (columns 2:5 and 7)
head(data)
## age annual_income purchase_amount loyalty_score purchase_frequency
## [1,] 25 45000 200 4.5 12
## [2,] 34 55000 350 7.0 18
## [3,] 45 65000 500 8.0 22
## [4,] 22 30000 150 3.0 10
## [5,] 29 47000 220 4.8 13
## [6,] 41 61000 480 7.8 21
# Checking data
dim(data)
## [1] 238 5
summary(data)
## age annual_income purchase_amount loyalty_score
## Min. :22.00 Min. :30000 Min. :150.0 Min. :3.000
## 1st Qu.:31.00 1st Qu.:50000 1st Qu.:320.0 1st Qu.:5.500
## Median :39.00 Median :59000 Median :440.0 Median :7.000
## Mean :38.68 Mean :57408 Mean :425.6 Mean :6.794
## 3rd Qu.:46.75 3rd Qu.:66750 3rd Qu.:527.5 3rd Qu.:8.275
## Max. :55.00 Max. :75000 Max. :640.0 Max. :9.500
## purchase_frequency
## Min. :10.0
## 1st Qu.:17.0
## Median :20.0
## Mean :19.8
## 3rd Qu.:23.0
## Max. :28.0
# Variables are on very different scales, so standardize them with scale()
data <- scale(data, center = TRUE, scale = TRUE)
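# Optional sanity check (a quick sketch): after scale(), each column should
# have mean ~ 0 and standard deviation 1
round(colMeans(data), 3)
round(apply(data, 2, sd), 3)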
# Check the relationship among variables
corrplot(cor(data), method = "number", number.cex = 0.7, order="hclust")
The correlation plot shows very high correlations (close to 1) between the variables: most pairwise correlation coefficients are at least 0.97, indicating strong relationships among them. No variable is isolated from the others; all of them are highly correlated, so the variables are not independent.
I will first perform a preliminary diagnosis to check whether the data can be clustered at all and to choose the optimal number of clusters. To assess the clusterability of the data, I will compute the Hopkins statistic; its null hypothesis is that the data set is uniformly distributed and contains no meaningful clusters.
get_clust_tendency(data, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))
## $hopkins_stat
## [1] 0.9358179
##
## $plot
With a Hopkins statistic of 0.9358179, the results strongly indicate that the data contains meaningful clusters. Next, I will determine the optimal number of clusters using four methods: K-means, PAM, CLARA, and hierarchical clustering, based on the silhouette score.
c1 <- fviz_nbclust(data, kmeans, method = "silhouette") + ggtitle("Kmeans")
c2 <- fviz_nbclust(data, cluster::pam, method = "silhouette") + ggtitle("PAM")
c3 <- fviz_nbclust(data, cluster::clara, method = "silhouette") + ggtitle("Clara")
c4 <- fviz_nbclust(data, hcut, method = "silhouette") + ggtitle("Hierarchical")
grid.arrange(c1, c2, c3, c4, ncol=2, top = "Optimal number of clusters")
- K-means: the optimal number of clusters is about 9, with a high average silhouette width. K-means clusters the data well, but the silhouette changes little after k = 5.
- PAM: the optimal number of clusters is 6, with the highest silhouette width. PAM performs well with fewer clusters than K-means, reflecting its flexibility with heterogeneous data.
- CLARA: the optimal number of clusters is 5, where the silhouette reaches its highest value. The method works well with large amounts of data and maintains a stable silhouette from k = 5 onward.
- Hierarchical clustering: the optimal number of clusters is 5, where the silhouette peaks. Hierarchical clustering suits data with complex structures and remains stable as the number of clusters increases.
Overall, PAM, CLARA, and hierarchical clustering all suggest an optimal number of 5 or 6 clusters. K-means points to 9 clusters, but its silhouette is not significantly better there than at lower cluster counts.
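As a cross-check (a sketch only: NbClust is already loaded above, and the index and k range below are my own choices), the NbClust package can also suggest a number of clusters:
# Hedged sketch: cross-check the number of clusters with NbClust
# (silhouette index, k from 2 to 10, Euclidean distance, k-means partitions)
nb <- NbClust(data, distance = "euclidean", min.nc = 2, max.nc = 10,
              method = "kmeans", index = "silhouette")
nb$Best.nc  # suggested number of clusters and the corresponding index value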
The k-means algorithm is a popular clustering technique used to group data points into a fixed number (k) of clusters based on their similarities. It is an unsupervised machine learning algorithm and works iteratively to partition a dataset into clusters, aiming to minimize the variance (or distance) within each cluster.
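The objective that k-means minimizes, the total within-cluster sum of squares, is reported directly by kmeans(); a brief sketch (the seed and nstart values are assumptions):
# Hedged sketch: k-means minimizes the total within-cluster sum of squares
set.seed(123)                                    # assumed seed for reproducibility
km_tmp <- kmeans(data, centers = 9, nstart = 25) # 25 random restarts
km_tmp$tot.withinss                              # objective value being minimized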
# clustering of dataset – by individual obs. (in rows)
km1<-eclust(data, "kmeans", hc_metric="euclidean",k=9)
fviz_cluster(km1, main="kmeans / Euclidean")
sil<-silhouette(km1$cluster, dist(data))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 11 0.66
## 2 2 19 0.44
## 3 3 38 0.27
## 4 4 32 0.53
## 5 5 29 0.63
## 6 6 16 0.94
## 7 7 15 0.82
## 8 8 34 0.60
## 9 9 44 0.55
# Calculate Pearson distance matrix (1 - correlation)
pearson_dist <- as.dist(1 - cor(t(data), method = "pearson"))
# K-means clustering based on Pearson distance
km2 <- kmeans(pearson_dist, centers = 9)
# Visualization
fviz_cluster(list(data = data, cluster = km2$cluster), main = "Kmeans / Pearson")
sil2 <- silhouette(km2$cluster, pearson_dist)
fviz_silhouette(sil2)
## cluster size ave.sil.width
## 1 1 9 0.86
## 2 2 37 0.43
## 3 3 26 0.50
## 4 4 29 0.82
## 5 5 21 0.87
## 6 6 38 0.61
## 7 7 17 0.34
## 8 8 23 0.64
## 9 9 38 0.71
With Euclidean distance, the focus is on the absolute distances between data points, so the clusters reflect differences in position in feature space. With Pearson's correlation, the focus is on the linear relationships between the observations' profiles, so highly correlated observations are grouped in the same cluster. Pearson's correlation gives a higher average silhouette width (0.63), but the clusters overlap visibly; Euclidean distance gives a smaller average silhouette width (0.56), but the clusters are more clearly separated from each other.
PAM (Partitioning Around Medoids) algorithm clusters data around medoids, which are actual data points from the dataset, unlike the k-means algorithm, which clusters data around centroids that may not correspond to actual points in the dataset.
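A minimal sketch using cluster::pam directly illustrates this distinction (the object name is an assumption; k = 6 follows the choice above):
# Hedged sketch: PAM medoids are actual observations, unlike k-means centroids
pam_fit <- cluster::pam(data, k = 6, metric = "euclidean")
pam_fit$id.med             # row indices of the medoid observations
pam_fit$medoids            # their (scaled) feature values
pam_fit$silinfo$avg.width  # average silhouette width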
pam1<-eclust(data, "pam", k=6, hc_metric="euclidean")
fviz_silhouette(pam1)
## cluster size ave.sil.width
## 1 1 43 0.61
## 2 2 42 0.56
## 3 3 46 0.53
## 4 4 16 0.95
## 5 5 59 0.65
## 6 6 32 0.71
pam2<-eclust(data, "pam", k=6, hc_metric="manhattan")
fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 43 0.61
## 2 2 42 0.56
## 3 3 46 0.53
## 4 4 16 0.95
## 5 5 59 0.65
## 6 6 32 0.71
pam3<-eclust(data, "pam", k=6, hc_metric="pearson")
fviz_silhouette(pam3)
## cluster size ave.sil.width
## 1 1 43 0.61
## 2 2 42 0.56
## 3 3 46 0.53
## 4 4 16 0.95
## 5 5 59 0.65
## 6 6 32 0.71
The clustering results are identical for Euclidean distance, Manhattan distance, and Pearson's correlation (eclust's hc_metric argument appears to apply only to hierarchical clustering methods, which would explain why the three runs coincide exactly). The clusters are mostly distinct and well separated; cluster 4 in particular stands apart from the others, suggesting strong separation. Conclusion: PAM provides better separation than K-means, especially around overlapping clusters.
CLARA (Clustering Large Applications) is an extension of the PAM (Partitioning Around Medoids) algorithm designed to handle large datasets efficiently. While PAM works well for smaller datasets, it becomes computationally expensive for larger datasets because it calculates all pairwise distances.
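To make the sampling idea concrete, here is a minimal sketch using cluster::clara directly; the samples and sampsize values are assumptions, and pamLike = TRUE makes the swap step match PAM more closely:
# Hedged sketch: CLARA repeatedly draws subsamples, runs PAM on each,
# and keeps the set of medoids that fits the full dataset best
cl_sketch <- cluster::clara(data, k = 5, metric = "euclidean",
                            samples = 50, sampsize = 60, pamLike = TRUE)
cl_sketch$medoids            # medoids chosen from the best subsample
cl_sketch$silinfo$avg.width  # average silhouette width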
cl<-eclust(data, "clara", k=5)
fviz_silhouette(cl)
## cluster size ave.sil.width
## 1 1 35 0.75
## 2 2 50 0.55
## 3 3 78 0.51
## 4 4 16 0.94
## 5 5 59 0.72
Conclusion: Although CLARA merges two clusters (as observed visually), the improvement in silhouette score suggests that this merging leads to a more cohesive representation of the dataset. In contrast, PAM's additional cluster (cluster 6) may introduce slight redundancy or noise, reducing the overall silhouette width.
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. It is an unsupervised machine learning technique used to group similar data points into clusters based on their similarity or distance from one another. The result of hierarchical clustering is often represented as a dendrogram, a tree-like diagram that visualizes the merging or splitting of clusters.
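Different linkage (agglomeration) methods can produce quite different hierarchies. As a side check (a sketch, not part of the main analysis), cluster::agnes reports an agglomerative coefficient that can be compared across linkage methods:
# Hedged sketch: compare linkage methods via the agglomerative coefficient
# (values closer to 1 indicate a stronger clustering structure)
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) cluster::agnes(data, method = m)$ac)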
dm<-dist(data)
hc<-hclust(dm, method="complete")
plot(hc, hang=-1)
plot(density(dm))
plot(hc,hang=-1)
x<-rect.hclust(hc, k=5, border=5:7)
# cutting by number of clusters
clust<-cutree(hc, k=5) # division into 5 clusters
library(ClustGeo)
dm<-dist(data) # distances between observations
hc<-hclust(dm, method="complete") # simple dendrogram
# cutting by number of clusters
clust.vec.5<-cutree(hc, k=5)
diss.mat<-dm
inertion<-matrix(0, nrow=4, ncol=1)
colnames(inertion)<-"division to 5 clust."
rownames(inertion)<-c("intra-clust", "total", "percentage", "Q")
inertion[1,1]<-withindiss(diss.mat, part=clust.vec.5)# intra-cluster
inertion[2,1]<-inertdiss(diss.mat) # overall
inertion[3,1]<-inertion[1,1]/ inertion[2,1] # ratio
inertion[4,1]<-1-inertion[3,1] # Q, inter-cluster
options("scipen"=100, "digits"=4)
inertion
## division to 5 clust.
## intra-clust 0.26029
## total 4.97899
## percentage 0.05228
## Q 0.94772
The above results show that the division of the data into 5 clusters is of very high quality: points within the same cluster are very homogeneous (low within-cluster inertia), the clusters are clearly separated from each other (Q is close to 1), and the within-cluster inertia accounts for only about 5% of the total inertia (low percentage), indicating that this clustering is suitable and reliable.
fviz_cluster(list(data=data, cluster=clust.vec.5))
plot(silhouette(clust.vec.5,dm))
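To make the comparison explicit, the average silhouette widths of the fitted models can be collected in one place (a sketch: eclust objects store this value in silinfo, and the hierarchical value is computed from the silhouette object above):
# Hedged sketch: gather average silhouette widths for a side-by-side comparison
data.frame(
  method  = c("k-means (k=9)", "PAM (k=6)", "CLARA (k=5)", "Hierarchical (k=5)"),
  avg_sil = c(km1$silinfo$avg.width,
              pam1$silinfo$avg.width,
              cl$silinfo$avg.width,
              mean(silhouette(clust.vec.5, dm)[, "sil_width"])))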
Silhouette score: CLARA achieves a higher silhouette score (0.64 vs. 0.61 for hierarchical clustering), indicating that CLARA produces more clearly separated clusters.
In terms of performance, CLARA and PAM are the best algorithms, with the highest silhouette scores (0.64 and 0.63). However, hierarchical clustering, with a lower silhouette score of 0.61, provides detailed hierarchical information that helps to explore the relationships between groups of observations. The choice of algorithm ultimately depends on the needs and goals of the organization.