Cluster analysis is an unsupervised learning technique that groups observations into homogeneous clusters based on similarity. Given a dataset \(X = \{ x_1, x_2, ..., x_n \}\), the objective is to partition \(X\) into \(k\) clusters such that:
\[ C_1, C_2, ..., C_k \quad \text{where} \quad C_i \cap C_j = \emptyset \quad \forall i \neq j \quad \text{and} \quad \bigcup_{i=1}^{k} C_i = X \]
The primary clustering methods covered in this tutorial are k-Means, Hierarchical Clustering, and Gaussian Mixture Models (GMM). We first install (if necessary) and load the required packages:
# Install (if missing) and load the required packages
packages <- c("ggplot2", "dplyr", "cluster", "factoextra", "mclust", "gridExtra")
for (p in packages) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p, dependencies = TRUE)
  }
  library(p, character.only = TRUE)
}
## 3. Dataset: Customer Segmentation using `mtcars`
We will cluster car models from the `mtcars` dataset based on their mileage (`mpg`), horsepower (`hp`), and weight (`wt`).
# Load dataset
data("mtcars")
# Select features for clustering
df <- mtcars[, c("mpg", "hp", "wt")]
# Standardize the data (important for clustering)
df_scaled <- scale(df)
# Display summary
summary(df_scaled)
## mpg hp wt
## Min. :-1.6079 Min. :-1.3810 Min. :-1.7418
## 1st Qu.:-0.7741 1st Qu.:-0.7320 1st Qu.:-0.6500
## Median :-0.1478 Median :-0.3455 Median : 0.1101
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4495 3rd Qu.: 0.4859 3rd Qu.: 0.4014
## Max. : 2.2913 Max. : 2.7466 Max. : 2.2553
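Since `scale()` centers each column to mean 0 and rescales it to unit standard deviation, a quick check confirms that all three features are now on a comparable scale (a small sketch using the `df_scaled` matrix created above):

# Verify the standardization: column means should be ~0 and standard deviations 1
colMeans(df_scaled)
apply(df_scaled, 2, sd)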
## 4. k-Means Clustering

k-Means clustering is a popular partitioning method that minimizes the total within-cluster sum of squared distances (the intra-cluster variance):
\[ J = \sum_{i=1}^{k} \sum_{x \in C_i} || x - \mu_i ||^2 \]
where:

- \(\mu_i\) is the centroid of cluster \(C_i\).
- \(|| x - \mu_i ||^2\) is the squared Euclidean distance.

Algorithm:

1. Initialize \(k\) centroids randomly.
2. Assign each data point to the nearest centroid.
3. Update each centroid as the mean of its assigned points.
4. Repeat steps 2 and 3 until convergence.
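To make the assignment and update steps concrete, here is a minimal base-R sketch of Lloyd's algorithm on `df_scaled`. It uses a fixed number of iterations and does not handle empty clusters, so it is only an illustration of the steps above, not a substitute for `kmeans()`:

set.seed(123)
k <- 3
# Step 1: initialize centroids by sampling k observations
centroids <- df_scaled[sample(nrow(df_scaled), k), ]
for (iter in 1:10) {
  # Step 2: assign each point to the nearest centroid (squared Euclidean distance)
  d2 <- sapply(1:k, function(j) rowSums(sweep(df_scaled, 2, centroids[j, ])^2))
  assignment <- apply(d2, 1, which.min)
  # Step 3: update each centroid as the mean of its assigned points
  centroids <- t(sapply(1:k, function(j) colMeans(df_scaled[assignment == j, , drop = FALSE])))
}
table(assignment)  # cluster sizes after 10 iterations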
The Elbow Method is used to determine the optimal number of clusters by plotting the within-cluster sum of squares (WSS):
\[ \text{WSS} = \sum_{i=1}^{k} \sum_{x \in C_i} || x - \mu_i ||^2 \]
### Step 1: Choosing the Optimal k with the Elbow Method
fviz_nbclust(df_scaled, kmeans, method = "wss") +
labs(title = "Elbow Method for Optimal k")
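The quantity plotted by `fviz_nbclust()` can also be computed directly, which makes the WSS behind the elbow plot explicit. A short sketch, re-running `kmeans()` for k = 1, ..., 10 on `df_scaled`:

# Total within-cluster sum of squares for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(df_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")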
### Step 2: Applying k-Means with Optimal k
set.seed(123)
kmeans_result <- kmeans(df_scaled, centers = 3, nstart = 25)
# Add cluster labels to dataset
df$Cluster <- as.factor(kmeans_result$cluster)
# Print cluster centers
print(kmeans_result$centers)
## mpg hp wt
## 1 -0.96410736 1.1844968 0.97873444
## 2 -0.01814766 -0.3509553 -0.09651672
## 3 1.65523937 -1.0382807 -1.37384616
# Visualizing k-Means clusters
fviz_cluster(kmeans_result, data = df_scaled,
             ellipse.type = "convex",
             geom = "point",
             palette = "jco") +
  labs(title = "k-Means Clustering on Car Data")
# Summarize the original (unscaled) variables by k-means cluster
df_summary <- df %>%
  group_by(Cluster) %>%
  summarise(Avg_MPG = mean(mpg),
            Avg_HP = mean(hp),
            Avg_Weight = mean(wt))
print(df_summary)
## # A tibble: 3 × 4
## Cluster Avg_MPG Avg_HP Avg_Weight
## <fct> <dbl> <dbl> <dbl>
## 1 1 14.3 228. 4.17
## 2 2 20.0 123. 3.12
## 3 3 30.1 75.5 1.87
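To see which car models ended up in each cluster, the row names of `mtcars` can be split by the k-means assignment (a quick inspection step added here for context):

# List the car models belonging to each k-means cluster
split(rownames(mtcars), kmeans_result$cluster)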
## 5. Hierarchical Clustering

Hierarchical clustering creates a dendrogram, a tree-like structure representing nested clusters. The agglomerative (bottom-up) approach works by:

1. Computing the distance matrix: \[ d(x_i, x_j) = || x_i - x_j || \]
2. Repeatedly merging the two closest clusters, where the distance between clusters \(A\) and \(B\) depends on the linkage method:
   - Single Linkage: \[ d(A, B) = \min \{ d(x_i, x_j) \} \]
   - Complete Linkage: \[ d(A, B) = \max \{ d(x_i, x_j) \} \]
   - Average Linkage: \[ d(A, B) = \frac{1}{|A||B|} \sum_{i \in A, j \in B} d(x_i, x_j) \]
The number of clusters can be determined by cutting the dendrogram at a chosen height.
# Compute Euclidean distance
dist_matrix <- dist(df_scaled, method = "euclidean")
# Apply Hierarchical Clustering
hc_complete <- hclust(dist_matrix, method = "complete")
# Plot Dendrogram
plot(hc_complete, main = "Dendrogram - Complete Linkage", sub = "", xlab = "")
rect.hclust(hc_complete, k = 3, border = "red")
# Cut the dendrogram into 3 clusters and store the labels
df$HCluster <- as.factor(cutree(hc_complete, k = 3))
# Display cluster summary
df %>%
  group_by(HCluster) %>%
  summarise(Avg_MPG = mean(mpg),
            Avg_HP = mean(hp),
            Avg_Weight = mean(wt))
## # A tibble: 3 × 4
## HCluster Avg_MPG Avg_HP Avg_Weight
## <fct> <dbl> <dbl> <dbl>
## 1 1 19.4 132. 3.24
## 2 2 13.4 248. 4.31
## 3 3 30.1 75.5 1.87
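The k-means and hierarchical partitions can be compared directly with a contingency table of the two label vectors created above (note that the cluster numbers themselves are arbitrary, so only the pattern of agreement matters):

# Cross-tabulate k-means vs. hierarchical cluster assignments
table(kMeans = df$Cluster, Hierarchical = df$HCluster)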
## 6. Gaussian Mixture Models (GMM)

Gaussian Mixture Models (GMM) assume that the data are generated from a mixture of \(k\) Gaussian distributions:
\[ p(x) = \sum_{i=1}^{k} \pi_i \mathcal{N}(x | \mu_i, \Sigma_i) \]
where:

- \(\pi_i\) are the mixing coefficients, which sum to 1.
- \(\mathcal{N}(x | \mu_i, \Sigma_i)\) is the Gaussian density with mean \(\mu_i\) and covariance \(\Sigma_i\).

The Expectation-Maximization (EM) algorithm is used for parameter estimation:

1. E-Step: Compute the responsibility of each cluster for each data point.
2. M-Step: Update the parameters \((\pi_i, \mu_i, \Sigma_i)\) based on the responsibilities.
3. Repeat until convergence.
gmm_model <- Mclust(df_scaled, G = 1:5) # G = possible number of clusters
# Summary of GMM model
summary(gmm_model)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EII (spherical, equal volume) model with 5 components:
##
## log-likelihood n df BIC ICL
## -71.27269 32 20 -211.8601 -213.1701
##
## Clustering table:
## 1 2 3 4 5
## 8 11 4 3 6
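The E-step responsibilities described above are stored in the fitted object as posterior class probabilities. A quick look at the first few observations (using the `z` component of the `Mclust` fit):

# Posterior probabilities (responsibilities) of each component for the first cars
round(head(gmm_model$z), 3)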
fviz_mclust(gmm_model, what = "classification") +
labs(title = "GMM Clustering - Customer Segmentation")
## Too few points to calculate an ellipse
## 7. Evaluating Clustering Quality

To evaluate clustering quality, we use the Silhouette score, the Davies-Bouldin index, and the Dunn index. As a rule of thumb:

- Higher Silhouette score → better-defined clusters.
- Lower Davies-Bouldin index → better-separated clusters.
- Higher Dunn index → better compactness and separation.
The Silhouette score evaluates how well observations fit into their assigned clusters.
# k-Means Silhouette Score
sil_kmeans <- silhouette(kmeans_result$cluster, dist(df_scaled))
fviz_silhouette(sil_kmeans) +
labs(title = "Silhouette Plot for k-Means")
## cluster size ave.sil.width
## 1 1 10 0.30
## 2 2 16 0.42
## 3 3 6 0.59
# GMM Silhouette Score
sil_gmm <- silhouette(gmm_model$classification, dist(df_scaled))
fviz_silhouette(sil_gmm) +
labs(title = "Silhouette Plot for GMM")
## cluster size ave.sil.width
## 1 1 8 0.56
## 2 2 11 0.39
## 3 3 4 0.49
## 4 4 3 0.72
## 5 5 6 0.45
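The silhouette plots above cover the first of these metrics. For a numeric Dunn index (and average silhouette width), one option is `cluster.stats()` from the fpc package. This is a sketch under the assumption that fpc is installed; it is not among the packages loaded earlier:

# Assumes the fpc package is available: install.packages("fpc")
library(fpc)
stats_kmeans <- cluster.stats(d = dist(df_scaled), clustering = kmeans_result$cluster)
stats_kmeans$dunn          # Dunn index (higher is better)
stats_kmeans$avg.silwidth  # average silhouette width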
## 8. Comparison of Clustering Methods

| Method | Strengths | Weaknesses |
|---|---|---|
| k-Means | Fast and simple | Assumes spherical clusters |
| Hierarchical Clustering | No need to pre-define k | Computationally expensive for large datasets |
| GMM (Gaussian Mixture Models) | Handles different cluster shapes | Can be sensitive to initialization |