Cluster Analysis in R: k-Means Clustering, Hierarchical Clustering, and Gaussian Mixture Models (GMM), illustrated with real data and visualizations.

1. Introduction

Cluster analysis is an unsupervised learning technique that groups observations into homogeneous clusters based on similarity. Given a dataset \(X = \{ x_1, x_2, ..., x_n \}\), the objective is to partition \(X\) into \(k\) clusters such that:

\[ C_1, C_2, ..., C_k \quad \text{where} \quad \bigcup_{i=1}^{k} C_i = X \quad \text{and} \quad C_i \cap C_j = \emptyset \quad \forall i \neq j \]

The primary clustering methods covered in this document are k-Means clustering, hierarchical clustering, and Gaussian Mixture Models (GMM).

2. Loading Required Packages

The following packages are installed (if missing) and loaded for the analysis:

packages <- c("ggplot2", "dplyr", "cluster", "factoextra", "mclust", "gridExtra")

for (p in packages) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p, dependencies = TRUE)
  }
  library(p, character.only = TRUE)
}

3. Dataset: Customer Segmentation using `mtcars`

We will cluster car models from the mtcars dataset based on their mileage (mpg), horsepower (hp), and weight (wt).

# Load dataset
data("mtcars")

# Select features for clustering
df <- mtcars[, c("mpg", "hp", "wt")]

# Standardize the data (important for clustering)
df_scaled <- scale(df)

# Display summary
summary(df_scaled)
##       mpg                hp                wt         
##  Min.   :-1.6079   Min.   :-1.3810   Min.   :-1.7418  
##  1st Qu.:-0.7741   1st Qu.:-0.7320   1st Qu.:-0.6500  
##  Median :-0.1478   Median :-0.3455   Median : 0.1101  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4495   3rd Qu.: 0.4859   3rd Qu.: 0.4014  
##  Max.   : 2.2913   Max.   : 2.7466   Max.   : 2.2553
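
As a quick sanity check on why the scaling step matters (an illustrative aside): the raw features sit on very different scales, so without standardization horsepower would dominate the Euclidean distances used by all three methods.

# Raw ranges of the selected features: hp spans hundreds of units, wt only a few
sapply(mtcars[, c("mpg", "hp", "wt")], range)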

4. k-Means Clustering

k-Means Clustering is a popular partitioning method that minimizes the intra-cluster variance:

\[ J = \sum_{i=1}^{k} \sum_{x \in C_i} || x - \mu_i ||^2 \]

where:

  • \(\mu_i\) is the centroid of cluster \(C_i\).
  • \(|| x - \mu_i ||^2\) is the squared Euclidean distance.

Algorithm:

  1. Initialize \(k\) centroids randomly.
  2. Assign each data point to the nearest centroid.
  3. Update centroids as the mean of assigned points.
  4. Repeat steps 2 and 3 until convergence.
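
To make steps 2 and 3 concrete, here is a minimal base-R sketch of a single assignment-and-update iteration on the scaled data (illustrative only: it does not handle empty clusters or iterate to convergence; the kmeans() call in Step 2 below is the real implementation).

# One illustrative k-means iteration (assumes df_scaled from Section 3)
set.seed(123)
k <- 3
centroids <- df_scaled[sample(nrow(df_scaled), k), , drop = FALSE]

# Assignment step: index of the nearest centroid for every observation
sq_dist <- sapply(seq_len(k), function(i)
  rowSums((df_scaled - matrix(centroids[i, ], nrow(df_scaled), ncol(df_scaled), byrow = TRUE))^2))
labels <- max.col(-sq_dist)

# Update step: each centroid becomes the mean of its assigned points
centroids <- t(sapply(seq_len(k), function(i)
  colMeans(df_scaled[labels == i, , drop = FALSE])))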

The Elbow Method is used to determine the optimal number of clusters by plotting the within-cluster sum of squares (WSS):

\[ \text{WSS} = \sum_{i=1}^{k} \sum_{x \in C_i} || x - \mu_i ||^2 \]

Step 1: Finding the Optimal Number of Clusters

Using the Elbow Method to determine the best k.

fviz_nbclust(df_scaled, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal k")
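
The same curve can also be computed by hand from the tot.withinss component returned by kmeans(), which makes the numbers behind the elbow plot explicit (an optional check):

# Total within-cluster sum of squares for k = 1..8
set.seed(123)
wss <- sapply(1:8, function(k) kmeans(df_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")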

Step 2: Applying k-Means with Optimal k

set.seed(123)
kmeans_result <- kmeans(df_scaled, centers = 3, nstart = 25)

# Add cluster labels to dataset
df$Cluster <- as.factor(kmeans_result$cluster)

# Print cluster centers
print(kmeans_result$centers)
##           mpg         hp          wt
## 1 -0.96410736  1.1844968  0.97873444
## 2 -0.01814766 -0.3509553 -0.09651672
## 3  1.65523937 -1.0382807 -1.37384616

# Visualizing k-Means clusters
fviz_cluster(kmeans_result, data = df_scaled,
             ellipse.type = "convex",
             geom = "point",
             palette = "jco") +
  labs(title = "k-Means Clustering on Car Data")

Step 3: Cluster Summary

df_summary <- df %>%
  group_by(Cluster) %>%
  summarise(Avg_MPG = mean(mpg),
            Avg_HP = mean(hp),
            Avg_Weight = mean(wt))

print(df_summary)
## # A tibble: 3 × 4
##   Cluster Avg_MPG Avg_HP Avg_Weight
##   <fct>     <dbl>  <dbl>      <dbl>
## 1 1          14.3  228.        4.17
## 2 2          20.0  123.        3.12
## 3 3          30.1   75.5       1.87

5. Hierarchical Clustering

Hierarchical clustering creates a dendrogram, a tree-like structure representing nested clusters. The method works by:

  1. Computing Distance Matrix: \[ d(x_i, x_j) = || x_i - x_j || \]

  2. Agglomerative Clustering (Bottom-Up Approach):

    • Start with each data point as a single cluster.
    • Merge the closest clusters iteratively.
    • Stop when all data points are in one cluster.

Linkage Methods:

  • Single Linkage: \[ d(A, B) = \min_{i \in A,\, j \in B} d(x_i, x_j) \]
  • Complete Linkage: \[ d(A, B) = \max_{i \in A,\, j \in B} d(x_i, x_j) \]
  • Average Linkage: \[ d(A, B) = \frac{1}{|A||B|} \sum_{i \in A, j \in B} d(x_i, x_j) \]

The number of clusters can be determined by cutting the dendrogram at a chosen height.
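
Before committing to complete linkage, a quick (optional) way to compare linkage methods is the cophenetic correlation: the correlation between the original pairwise distances and the distances implied by each dendrogram, with higher values indicating a more faithful tree. A minimal sketch, assuming the scaled features from Section 3:

# Cophenetic correlation for several linkage methods
d <- dist(df_scaled, method = "euclidean")
linkages <- c("single", "complete", "average", "ward.D2")
sapply(linkages, function(m) cor(d, cophenetic(hclust(d, method = m))))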

Step 1: Compute Distance Matrix and Apply Clustering

# Compute Euclidean distance
dist_matrix <- dist(df_scaled, method = "euclidean")

# Apply Hierarchical Clustering
hc_complete <- hclust(dist_matrix, method = "complete")

# Plot Dendrogram
plot(hc_complete, main = "Dendrogram - Complete Linkage", sub = "", xlab = "")
rect.hclust(hc_complete, k = 3, border = "red")

Step 2: Cut Tree into Clusters

df$HCluster <- as.factor(cutree(hc_complete, k = 3))

# Display cluster summary
df %>%
  group_by(HCluster) %>%
  summarise(Avg_MPG = mean(mpg),
            Avg_HP = mean(hp),
            Avg_Weight = mean(wt))
## # A tibble: 3 × 4
##   HCluster Avg_MPG Avg_HP Avg_Weight
##   <fct>      <dbl>  <dbl>      <dbl>
## 1 1           19.4  132.        3.24
## 2 2           13.4  248.        4.31
## 3 3           30.1   75.5       1.87
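
Since df now carries both the k-Means labels (Cluster) and the hierarchical labels (HCluster), a quick cross-tabulation shows how closely the two partitions agree (an optional comparison):

# Agreement between k-Means and hierarchical cluster assignments
table(kMeans = df$Cluster, Hierarchical = df$HCluster)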

6. Gaussian Mixture Model (GMM) Clustering

Gaussian Mixture Models (GMM) assume that data is generated from a mixture of Gaussian distributions:

\[ p(x) = \sum_{i=1}^{k} \pi_i \mathcal{N}(x | \mu_i, \Sigma_i) \]

where:

  • \(\pi_i\) are the mixing coefficients (they sum to 1).
  • \(\mathcal{N}(x | \mu_i, \Sigma_i)\) is the Gaussian distribution with mean \(\mu_i\) and covariance \(\Sigma_i\).

The Expectation-Maximization (EM) algorithm is used for parameter estimation:

  1. E-Step: Compute the responsibility of each cluster for each data point.
  2. M-Step: Update the parameters \((\pi_i, \mu_i, \Sigma_i)\) based on the responsibilities.
  3. Repeat until convergence.
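
For intuition, here is a small univariate sketch of one E-step and the corresponding mean update for a two-component mixture on the scaled mpg values, using hand-picked (hypothetical) starting parameters; Mclust() below runs the full multivariate EM automatically.

# Illustrative EM pieces for a two-component mixture on one feature
x <- df_scaled[, "mpg"]
pi_k    <- c(0.5, 0.5)   # mixing weights (assumed starting values)
mu_k    <- c(-1, 1)      # component means (assumed starting values)
sigma_k <- c(1, 1)       # component standard deviations (assumed starting values)

# E-step: responsibilities proportional to pi_k * N(x | mu_k, sigma_k)
dens <- sapply(1:2, function(j) pi_k[j] * dnorm(x, mu_k[j], sigma_k[j]))
resp <- dens / rowSums(dens)

# M-step (means only, for brevity): responsibility-weighted averages of x
colSums(resp * x) / colSums(resp)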

Step 1: Apply GMM

gmm_model <- Mclust(df_scaled, G = 1:5)  # G = possible number of clusters

# Summary of GMM model
summary(gmm_model)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EII (spherical, equal volume) model with 5 components: 
## 
##  log-likelihood  n df       BIC       ICL
##       -71.27269 32 20 -211.8601 -213.1701
## 
## Clustering table:
##  1  2  3  4  5 
##  8 11  4  3  6
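
Mclust() chose this model by BIC over G = 1 to 5 components and the available covariance structures; the full BIC surface can be inspected with the package's plot method (optional):

# BIC for each covariance model and number of components
plot(gmm_model, what = "BIC")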

Step 2: Visualizing GMM Clusters

fviz_mclust(gmm_model, what = "classification") +
  labs(title = "GMM Clustering - Customer Segmentation")
## Too few points to calculate an ellipse

7. Cluster Validation

To evaluate clustering quality, we use:

  1. Silhouette Score: \[ S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \] where:
    • \(a(i)\) is the average distance from observation \(i\) to the other points in its own cluster.
    • \(b(i)\) is the smallest average distance from \(i\) to the points of any other cluster.
  2. Davies-Bouldin Index: \[ DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(\mu_i, \mu_j)} \right) \] where:
    • \(\sigma_i\) is the average dispersion of cluster \(i\) around its centroid.
    • \(d(\mu_i, \mu_j)\) is the distance between the centroids of clusters \(i\) and \(j\).
  3. Dunn Index: \[ DI = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{m} \operatorname{diam}(C_m)} \] where:
    • \(d(C_i, C_j)\) is the minimum distance between points in clusters \(C_i\) and \(C_j\).
    • \(\operatorname{diam}(C_m)\) is the maximum distance between points within cluster \(C_m\).

In summary:

  • Higher Silhouette Score → better clustering.
  • Lower Davies-Bouldin Index → better-separated clusters.
  • Higher Dunn Index → better compactness and separation.
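
The Davies-Bouldin and Dunn indices are not computed elsewhere in this document; as an optional illustration, the Dunn index can be obtained from the distance matrix in a few lines of base R. A minimal sketch, applied to the k-Means labels from Section 4:

# Dunn index: smallest between-cluster distance / largest within-cluster diameter
dunn_index <- function(dist_obj, labels) {
  dmat <- as.matrix(dist_obj)
  cl <- unique(labels)
  min_sep <- min(sapply(cl, function(i)
    sapply(cl[cl != i], function(j) min(dmat[labels == i, labels == j]))))
  max_diam <- max(sapply(cl, function(i) max(dmat[labels == i, labels == i])))
  min_sep / max_diam
}

dunn_index(dist(df_scaled), kmeans_result$cluster)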

Step 1: Silhouette Score

The Silhouette score evaluates how well observations fit into their assigned clusters.

# k-Means Silhouette Score
sil_kmeans <- silhouette(kmeans_result$cluster, dist(df_scaled))
fviz_silhouette(sil_kmeans) +
  labs(title = "Silhouette Plot for k-Means")
##   cluster size ave.sil.width
## 1       1   10          0.30
## 2       2   16          0.42
## 3       3    6          0.59

# GMM Silhouette Score
sil_gmm <- silhouette(gmm_model$classification, dist(df_scaled))
fviz_silhouette(sil_gmm) +
  labs(title = "Silhouette Plot for GMM")
##   cluster size ave.sil.width
## 1       1    8          0.56
## 2       2   11          0.39
## 3       3    4          0.49
## 4       4    3          0.72
## 5       5    6          0.45
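
For a single-number comparison, the average silhouette width can be read directly from each silhouette object; the hierarchical solution is included here for completeness:

# Average silhouette width for each clustering
sil_hc <- silhouette(as.integer(df$HCluster), dist(df_scaled))

mean(sil_kmeans[, "sil_width"])
mean(sil_gmm[, "sil_width"])
mean(sil_hc[, "sil_width"])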

8. Comparison of Clustering Methods

| Method | Strengths | Weaknesses |
|---|---|---|
| k-Means | Fast and simple | Assumes spherical clusters |
| Hierarchical Clustering | No need to pre-define k | Computationally expensive for large datasets |
| GMM (Gaussian Mixture Models) | Handles different shapes of clusters | Can be sensitive to initialization |

9. Conclusion

We clustered the 32 car models in mtcars on standardized mpg, hp, and wt using three methods. k-Means and complete-linkage hierarchical clustering both produced three interpretable groups (efficient light cars, mid-range cars, and heavy high-horsepower cars), while the BIC-selected GMM preferred a finer five-component solution. Silhouette widths are moderate for the k-Means and GMM solutions, so the final choice of method and number of clusters should be guided by the validation measures in Section 7 and by how the resulting segments will be used.