Cluster Analysis in R: k-Means Clustering, Hierarchical Clustering, and Gaussian Mixture Models (GMM), illustrated with real data and visualizations.

1. Introduction

Cluster analysis is an unsupervised learning technique that groups observations into homogeneous clusters based on similarity. Given a dataset \(X = \{ x_1, x_2, ..., x_n \}\), the objective is to partition \(X\) into \(k\) clusters such that:

\[ C_1, C_2, ..., C_k \quad \text{where} \quad \bigcup_{i=1}^{k} C_i = X \quad \text{and} \quad C_i \cap C_j = \emptyset \quad \forall i \neq j \]

The primary clustering methods covered in this document are k-Means clustering, hierarchical clustering, and Gaussian Mixture Models (GMM).

2. Loading Required Packages

The following packages are installed (if missing) and loaded for the analysis:

packages <- c("ggplot2", "dplyr", "cluster", "factoextra", "mclust", "gridExtra")

for (p in packages) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p, dependencies = TRUE)
  }
  library(p, character.only = TRUE)
}

3. Dataset: Customer Segmentation using `mtcars`

We will cluster car models from the mtcars dataset based on their mileage (mpg), horsepower (hp), and weight (wt).

# Load dataset
data("mtcars")

# Select features for clustering
df <- mtcars[, c("mpg", "hp", "wt")]

# Standardize the data (important for clustering)
df_scaled <- scale(df)

# Display summary
summary(df_scaled)
##       mpg                hp                wt         
##  Min.   :-1.6079   Min.   :-1.3810   Min.   :-1.7418  
##  1st Qu.:-0.7741   1st Qu.:-0.7320   1st Qu.:-0.6500  
##  Median :-0.1478   Median :-0.3455   Median : 0.1101  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4495   3rd Qu.: 0.4859   3rd Qu.: 0.4014  
##  Max.   : 2.2913   Max.   : 2.7466   Max.   : 2.2553
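
As a quick sanity check on why the scaling step matters (an illustrative aside): the raw features sit on very different scales, so without standardization horsepower would dominate the Euclidean distances used by all three methods.

# Raw ranges of the selected features: hp spans hundreds of units, wt only a few
sapply(mtcars[, c("mpg", "hp", "wt")], range)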

4. k-Means Clustering

k-Means Clustering is a popular partitioning method that minimizes the intra-cluster variance:

\[ J = \sum_{i=1}^{k} \sum_{x \in C_i} || x - \mu_i ||^2 \]

where:

  • \(\mu_i\) is the centroid of cluster \(C_i\).
  • \(|| x - \mu_i ||^2\) is the squared Euclidean distance.

Algorithm:

  1. Initialize \(k\) centroids randomly.
  2. Assign each data point to the nearest centroid.
  3. Update centroids as the mean of assigned points.
  4. Repeat steps 2 and 3 until convergence.
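
To make steps 2 and 3 concrete, here is a minimal base-R sketch of a single assignment-and-update iteration on the scaled data (illustrative only: it does not handle empty clusters or iterate to convergence; the kmeans() call in Step 2 below is the real implementation).

# One illustrative k-means iteration (assumes df_scaled from Section 3)
set.seed(123)
k <- 3
centroids <- df_scaled[sample(nrow(df_scaled), k), , drop = FALSE]

# Assignment step: index of the nearest centroid for every observation
sq_dist <- sapply(seq_len(k), function(i)
  rowSums((df_scaled - matrix(centroids[i, ], nrow(df_scaled), ncol(df_scaled), byrow = TRUE))^2))
labels <- max.col(-sq_dist)

# Update step: each centroid becomes the mean of its assigned points
centroids <- t(sapply(seq_len(k), function(i)
  colMeans(df_scaled[labels == i, , drop = FALSE])))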

The Elbow Method is used to determine the optimal number of clusters by plotting the within-cluster sum of squares (WSS):

\[ \text{WSS} = \sum_{i=1}^{k} \sum_{x \in C_i} || x - \mu_i ||^2 \]

Step 1: Finding the Optimal Number of Clusters

Using the Elbow Method to determine the best k.

fviz_nbclust(df_scaled, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal k")
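
The same curve can also be computed by hand from the tot.withinss component returned by kmeans(), which makes the numbers behind the elbow plot explicit (an optional check):

# Total within-cluster sum of squares for k = 1..8
set.seed(123)
wss <- sapply(1:8, function(k) kmeans(df_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")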

Step 2: Applying k-Means with Optimal k

set.seed(123)
kmeans_result <- kmeans(df_scaled, centers = 3, nstart = 25)

# Add cluster labels to dataset
df$Cluster <- as.factor(kmeans_result$cluster)

# Print cluster centers
print(kmeans_result$centers)
##           mpg         hp          wt
## 1 -0.96410736  1.1844968  0.97873444
## 2 -0.01814766 -0.3509553 -0.09651672
## 3  1.65523937 -1.0382807 -1.37384616

# Visualizing k-Means clusters
fviz_cluster(kmeans_result, data = df_scaled,
             ellipse.type = "convex",
             geom = "point",
             palette = "jco") +
  labs(title = "k-Means Clustering on Car Data")

Step 3: Cluster Summary

df_summary <- df %>%
  group_by(Cluster) %>%
  summarise(Avg_MPG = mean(mpg),
            Avg_HP = mean(hp),
            Avg_Weight = mean(wt))

print(df_summary)
## # A tibble: 3 × 4
##   Cluster Avg_MPG Avg_HP Avg_Weight
##   <fct>     <dbl>  <dbl>      <dbl>
## 1 1          14.3  228.        4.17
## 2 2          20.0  123.        3.12
## 3 3          30.1   75.5       1.87

5. Hierarchical Clustering

Hierarchical clustering creates a dendrogram, a tree-like structure representing nested clusters. The method works by:

  1. Computing Distance Matrix: \[ d(x_i, x_j) = || x_i - x_j || \]

  2. Agglomerative Clustering (Bottom-Up Approach):

    • Start with each data point as a single cluster.
    • Merge the closest clusters iteratively.
    • Stop when all data points are in one cluster.

Linkage Methods:

  • Single Linkage: \[ d(A, B) = \min_{i \in A,\, j \in B} d(x_i, x_j) \]
  • Complete Linkage: \[ d(A, B) = \max_{i \in A,\, j \in B} d(x_i, x_j) \]
  • Average Linkage: \[ d(A, B) = \frac{1}{|A||B|} \sum_{i \in A, j \in B} d(x_i, x_j) \]

The number of clusters can be determined by cutting the dendrogram at a chosen height.
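
Before committing to complete linkage, a quick (optional) way to compare linkage methods is the cophenetic correlation: the correlation between the original pairwise distances and the distances implied by each dendrogram, with higher values indicating a more faithful tree. A minimal sketch, assuming the scaled features from Section 3:

# Cophenetic correlation for several linkage methods
d <- dist(df_scaled, method = "euclidean")
linkages <- c("single", "complete", "average", "ward.D2")
sapply(linkages, function(m) cor(d, cophenetic(hclust(d, method = m))))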

Step 1: Compute Distance Matrix and Apply Clustering

# Compute Euclidean distance
dist_matrix <- dist(df_scaled, method = "euclidean")

# Apply Hierarchical Clustering
hc_complete <- hclust(dist_matrix, method = "complete")

# Plot Dendrogram
plot(hc_complete, main = "Dendrogram - Complete Linkage", sub = "", xlab = "")
rect.hclust(hc_complete, k = 3, border = "red")

Step 2: Cut Tree into Clusters

df$HCluster <- as.factor(cutree(hc_complete, k = 3))

# Display cluster summary
df %>%
  group_by(HCluster) %>%
  summarise(Avg_MPG = mean(mpg),
            Avg_HP = mean(hp),
            Avg_Weight = mean(wt))
## # A tibble: 3 × 4
##   HCluster Avg_MPG Avg_HP Avg_Weight
##   <fct>      <dbl>  <dbl>      <dbl>
## 1 1           19.4  132.        3.24
## 2 2           13.4  248.        4.31
## 3 3           30.1   75.5       1.87
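
Since df now carries both the k-Means labels (Cluster) and the hierarchical labels (HCluster), a quick cross-tabulation shows how closely the two partitions agree (an optional comparison):

# Agreement between k-Means and hierarchical cluster assignments
table(kMeans = df$Cluster, Hierarchical = df$HCluster)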

6. Gaussian Mixture Model (GMM) Clustering

Gaussian Mixture Models (GMM) assume that data is generated from a mixture of Gaussian distributions:

\[ p(x) = \sum_{i=1}^{k} \pi_i \mathcal{N}(x | \mu_i, \Sigma_i) \]

where:

  • \(\pi_i\) are the mixing coefficients (they sum to 1).
  • \(\mathcal{N}(x | \mu_i, \Sigma_i)\) is the Gaussian distribution with mean \(\mu_i\) and covariance \(\Sigma_i\).

The Expectation-Maximization (EM) algorithm is used for parameter estimation:

  1. E-Step: Compute the responsibility of each cluster for each data point.
  2. M-Step: Update the parameters \((\pi_i, \mu_i, \Sigma_i)\) based on the responsibilities.
  3. Repeat until convergence.
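
For intuition, here is a small univariate sketch of one E-step and the corresponding mean update for a two-component mixture on the scaled mpg values, using hand-picked (hypothetical) starting parameters; Mclust() below runs the full multivariate EM automatically.

# Illustrative EM pieces for a two-component mixture on one feature
x <- df_scaled[, "mpg"]
pi_k    <- c(0.5, 0.5)   # mixing weights (assumed starting values)
mu_k    <- c(-1, 1)      # component means (assumed starting values)
sigma_k <- c(1, 1)       # component standard deviations (assumed starting values)

# E-step: responsibilities proportional to pi_k * N(x | mu_k, sigma_k)
dens <- sapply(1:2, function(j) pi_k[j] * dnorm(x, mu_k[j], sigma_k[j]))
resp <- dens / rowSums(dens)

# M-step (means only, for brevity): responsibility-weighted averages of x
colSums(resp * x) / colSums(resp)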

Step 1: Apply GMM

gmm_model <- Mclust(df_scaled, G = 1:5)  # G = possible number of clusters

# Summary of GMM model
summary(gmm_model)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EII (spherical, equal volume) model with 5 components: 
## 
##  log-likelihood  n df       BIC       ICL
##       -71.27269 32 20 -211.8601 -213.1701
## 
## Clustering table:
##  1  2  3  4  5 
##  8 11  4  3  6
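
Mclust() chose this model by BIC over G = 1 to 5 components and the available covariance structures; the full BIC surface can be inspected with the package's plot method (optional):

# BIC for each covariance model and number of components
plot(gmm_model, what = "BIC")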

Step 2: Visualizing GMM Clusters

fviz_mclust(gmm_model, what = "classification") +
  labs(title = "GMM Clustering - Customer Segmentation")
## Too few points to calculate an ellipse

7. Cluster Validation

To evaluate clustering quality, we use:

  1. Silhouette Score: \[ S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \] where:
    • \(a(i)\) is the average distance from observation \(i\) to the other points in its own cluster.
    • \(b(i)\) is the smallest average distance from \(i\) to the points of any other cluster.
  2. Davies-Bouldin Index: \[ DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(\mu_i, \mu_j)} \right) \] where:
    • \(\sigma_i\) is the average dispersion of cluster \(i\) around its centroid.
    • \(d(\mu_i, \mu_j)\) is the distance between the centroids of clusters \(i\) and \(j\).
  3. Dunn Index: \[ DI = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{m} \operatorname{diam}(C_m)} \] where:
    • \(d(C_i, C_j)\) is the minimum distance between points in clusters \(C_i\) and \(C_j\).
    • \(\operatorname{diam}(C_m)\) is the maximum distance between points within cluster \(C_m\).

In summary:

  • Higher Silhouette Score → better clustering.
  • Lower Davies-Bouldin Index → better-separated clusters.
  • Higher Dunn Index → better compactness and separation.
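
The Davies-Bouldin and Dunn indices are not computed elsewhere in this document; as an optional illustration, the Dunn index can be obtained from the distance matrix in a few lines of base R. A minimal sketch, applied to the k-Means labels from Section 4:

# Dunn index: smallest between-cluster distance / largest within-cluster diameter
dunn_index <- function(dist_obj, labels) {
  dmat <- as.matrix(dist_obj)
  cl <- unique(labels)
  min_sep <- min(sapply(cl, function(i)
    sapply(cl[cl != i], function(j) min(dmat[labels == i, labels == j]))))
  max_diam <- max(sapply(cl, function(i) max(dmat[labels == i, labels == i])))
  min_sep / max_diam
}

dunn_index(dist(df_scaled), kmeans_result$cluster)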

Step 1: Silhouette Score

The Silhouette score evaluates how well observations fit into their assigned clusters.

# k-Means Silhouette Score
sil_kmeans <- silhouette(kmeans_result$cluster, dist(df_scaled))
fviz_silhouette(sil_kmeans) +
  labs(title = "Silhouette Plot for k-Means")
##   cluster size ave.sil.width
## 1       1   10          0.30
## 2       2   16          0.42
## 3       3    6          0.59

# GMM Silhouette Score
sil_gmm <- silhouette(gmm_model$classification, dist(df_scaled))
fviz_silhouette(sil_gmm) +
  labs(title = "Silhouette Plot for GMM")
##   cluster size ave.sil.width
## 1       1    8          0.56
## 2       2   11          0.39
## 3       3    4          0.49
## 4       4    3          0.72
## 5       5    6          0.45
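
For a single-number comparison, the average silhouette width can be read directly from each silhouette object; the hierarchical solution is included here for completeness:

# Average silhouette width for each clustering
sil_hc <- silhouette(as.integer(df$HCluster), dist(df_scaled))

mean(sil_kmeans[, "sil_width"])
mean(sil_gmm[, "sil_width"])
mean(sil_hc[, "sil_width"])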

8. Comparison of Clustering Methods

| Method | Strengths | Weaknesses |
|---|---|---|
| k-Means | Fast and simple | Assumes spherical clusters |
| Hierarchical Clustering | No need to pre-define k | Computationally expensive for large datasets |
| GMM (Gaussian Mixture Models) | Handles different shapes of clusters | Can be sensitive to initialization |

9. Conclusion

We clustered the 32 car models in mtcars on standardized mpg, hp, and wt using three methods. k-Means and complete-linkage hierarchical clustering both produced three interpretable groups (efficient light cars, mid-range cars, and heavy high-horsepower cars), while the BIC-selected GMM preferred a finer five-component solution. Silhouette widths are moderate for the k-Means and GMM solutions, so the final choice of method and number of clusters should be guided by the validation measures in Section 7 and by how the resulting segments will be used.