Clustering is an unsupervised machine learning technique used to identify natural groups in data without using a predefined outcome variable. In healthcare data, clustering can help reveal hidden health profiles, risk patterns, and subgroups of individuals who share similar demographic, physiological, and metabolic characteristics.
This project applies clustering techniques to the National Health and Nutrition Examination Survey (NHANES) 2017–2018 dataset. The selected variables include age, sex, income-to-poverty ratio, body mass index (BMI), blood pressure, and glycohemoglobin (HbA1c). These variables capture important demographic and health-related characteristics. The main goal of the project is to identify meaningful health-related clusters, interpret their characteristics, and compare different clustering methods.
The implementation follows five major steps. First, NHANES
demographic, body measurement, blood pressure, glycohemoglobin, and
questionnaire files are downloaded and merged using the participant
identifier SEQN. Second, the selected variables are cleaned
and prepared by checking missing values, applying median imputation to
numeric variables, examining correlations, removing redundant variables,
and scaling the final continuous clustering variables. Third, three
clustering methods are applied: K-means, DBSCAN, and Gaussian Mixture
Models (GMM). Fourth, model parameters are selected using validation
metrics and diagnostic plots. Finally, the clustering results are
evaluated and visualized using internal validation metrics, PCA plots,
UMAP plots, silhouette plots, cluster profile tables, boxplots, a
heatmap, and a parallel coordinate plot.
# Install packages if needed
packages <- c(
"nhanesA", "tidyverse", "e1071", "knitr", "corrplot", "umap",
"cluster", "clusterCrit", "clusterSim", "dbscan", "fpc", "mclust",
"factoextra", "GGally", "gridExtra", "viridis", "RColorBrewer", "kableExtra"
)
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if(length(new_packages) > 0) install.packages(new_packages)
library(nhanesA)
library(tidyverse)
library(e1071)
library(knitr)
library(corrplot)
library(umap)
library(cluster)
library(clusterCrit)
library(clusterSim)
library(dbscan)
library(fpc)
library(mclust)
library(factoextra)
library(GGally)
library(gridExtra)
library(viridis)
library(RColorBrewer)
library(tidyr)
library(dplyr)
library(kableExtra)
# Load NHANES 2017-2018 data files
demo <- nhanes("DEMO_J")
bmx <- nhanes("BMX_J")
bpx <- nhanes("BPX_J")
ghb <- nhanes("GHB_J")
bpq <- nhanes("BPQ_J")
# Merge data files using participant ID
dataset <- demo %>%
inner_join(bmx, by = "SEQN") %>%
inner_join(bpx, by = "SEQN") %>%
inner_join(ghb, by = "SEQN") %>%
inner_join(bpq, by = "SEQN")
# Select variables for analysis
datafeat <- dataset %>%
dplyr::select(
age = RIDAGEYR,
sex = RIAGENDR,
income_pir = INDFMPIR,
bmi = BMXBMI,
waist = BMXWAIST,
sys_bp = BPXSY1,
dia_bp = BPXDI1,
hba1c = LBXGH
)
dim(datafeat)
## [1] 5828 8
Exploratory data analysis was conducted to understand the structure and quality of the selected NHANES variables before clustering. Summary statistics were calculated to examine the central tendency, spread, and skewness of the variables. Histograms and boxplots were used to examine distributions and potential outliers. Missing values were identified and treated using median imputation for numeric variables. Median imputation was selected because it is less sensitive to extreme values than mean imputation.
A correlation matrix was also created to examine relationships among variables. BMI and waist circumference were highly correlated, so waist circumference was removed from the final clustering variables to avoid redundancy. Sex was also excluded from the clustering model because it is categorical and not appropriate for direct use in distance-based clustering without additional encoding. However, sex was retained later for cluster profiling and interpretation. The final clustering variables were standardized so that variables measured on larger scales did not dominate the clustering results.
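As a small illustration of why the median resists extreme values (toy numbers, not NHANES data), a single outlier shifts the mean far more than the median:
# Toy example: one extreme value pulls the mean upward but barely moves the median
toy_values <- c(20, 22, 25, 80)
c(mean = mean(toy_values), median = median(toy_values))
# mean = 36.75, median = 23.5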
summary_stats <- datafeat %>%
summarise(across(where(is.numeric),
list(
mean = ~mean(.x, na.rm = TRUE),
median = ~median(.x, na.rm = TRUE),
variance = ~var(.x, na.rm = TRUE),
skewness = ~skewness(.x, na.rm = TRUE)
),
.names = "{.col}_{.fn}"))
summary_stats %>%
pivot_longer(
cols = everything(),
names_to = c("Variable", "Statistic"),
names_pattern = "(.+)_(mean|median|variance|skewness)"
) %>%
pivot_wider(names_from = Statistic, values_from = value) %>%
mutate(across(where(is.numeric), ~round(.x, 3))) %>%
kable(caption = "Summary Statistics Before Imputation")
| Variable | mean | median | variance | skewness |
|---|---|---|---|---|
| age | 48.130 | 49.00 | 383.592 | -0.040 |
| income_pir | 2.512 | 2.07 | 2.584 | 0.353 |
| bmi | 29.462 | 28.30 | 55.752 | 1.245 |
| waist | 99.336 | 98.10 | 306.631 | 0.469 |
| sys_bp | 125.134 | 122.00 | 387.482 | 1.034 |
| dia_bp | 71.198 | 72.00 | 185.042 | -0.898 |
| hba1c | 5.811 | 5.50 | 1.148 | 3.239 |
datafeat %>%
pivot_longer(cols = where(is.numeric)) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
facet_wrap(~name, scales = "free") +
theme_minimal() +
labs(title = "Histograms of Selected Numeric Variables", x = "Value", y = "Count")
datafeat %>%
pivot_longer(cols = where(is.numeric)) %>%
ggplot(aes(x = name, y = value)) +
geom_boxplot(fill = "orange") +
coord_flip() +
theme_minimal() +
labs(title = "Boxplots of Selected Numeric Variables", x = NULL, y = "Value")
missing_summary <- datafeat %>%
summarise(across(everything(), ~sum(is.na(.x)))) %>%
pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
mutate(pct_missing = round(n_missing / nrow(datafeat) * 100, 2))
kable(missing_summary, caption = "Missing Values by Variable")
| variable | n_missing | pct_missing |
|---|---|---|
| age | 0 | 0.00 |
| sex | 0 | 0.00 |
| income_pir | 765 | 13.13 |
| bmi | 101 | 1.73 |
| waist | 357 | 6.13 |
| sys_bp | 659 | 11.31 |
| dia_bp | 659 | 11.31 |
| hba1c | 288 | 4.94 |
ggplot(missing_summary, aes(x = reorder(variable, pct_missing), y = pct_missing)) +
geom_col(fill = "tomato") +
coord_flip() +
labs(title = "Percentage Missing per Feature", x = NULL, y = "% Missing") +
theme_minimal()
data_imputed <- datafeat %>%
mutate(across(where(is.numeric), ~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
summary_stats_imputed <- data_imputed %>%
summarise(across(where(is.numeric),
list(
mean = ~mean(.x, na.rm = TRUE),
median = ~median(.x, na.rm = TRUE),
min = ~min(.x, na.rm = TRUE),
max = ~max(.x, na.rm = TRUE)
),
.names = "{.col}_{.fn}")) %>%
pivot_longer(
cols = everything(),
names_to = c("Variable", "Statistic"),
names_pattern = "(.+)_(mean|median|min|max)",
values_to = "value"
) %>%
pivot_wider(names_from = Statistic, values_from = value) %>%
mutate(across(where(is.numeric), ~round(.x, 2)))
kable(summary_stats_imputed, caption = "Summary Statistics After Median Imputation")
| Variable | mean | median | min | max |
|---|---|---|---|---|
| age | 48.13 | 49.00 | 16.0 | 80.0 |
| income_pir | 2.45 | 2.07 | 0.0 | 5.0 |
| bmi | 29.44 | 28.30 | 13.2 | 86.2 |
| waist | 99.26 | 98.10 | 56.4 | 169.5 |
| sys_bp | 124.78 | 122.00 | 72.0 | 228.0 |
| dia_bp | 71.29 | 72.00 | 0.0 | 136.0 |
| hba1c | 5.80 | 5.50 | 3.8 | 16.2 |
cor_matrix <- data_imputed %>%
dplyr::select(where(is.numeric)) %>%
cor()
corrplot(cor_matrix, method = "color", tl.cex = 0.8, addCoef.col = "black", number.cex = 0.6)
high_corr <- which(abs(cor_matrix) > 0.8 & abs(cor_matrix) < 1, arr.ind = TRUE)
high_corr
## row col
## waist 4 3
## bmi 3 4
The correlation analysis was used to identify redundancy among the selected variables. Because BMI and waist circumference measure closely related body-size characteristics and were highly correlated, waist circumference was removed from the final clustering model. This helps reduce duplicated information in the clustering solution.
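For reference, the specific BMI-waist correlation can be read directly from the matrix computed above:
# Correlation between BMI and waist circumference (the pair flagged as highly correlated)
round(cor_matrix["bmi", "waist"], 3)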
# Remove sex because it is categorical.
# Remove waist because it is highly correlated with BMI.
data_cluster <- data_imputed %>%
dplyr::select(-sex, -waist)
# Check variances before scaling
apply(data_cluster, 2, var)
## age income_pir bmi sys_bp dia_bp hba1c
## 383.592104 2.267213 54.808526 344.645463 164.178919 1.095336
# Scale numeric data for clustering
data_scaled <- scale(data_cluster)
# Confirm no missing values after scaling
colSums(is.na(data_scaled))
## age income_pir bmi sys_bp dia_bp hba1c
## 0 0 0 0 0 0
Scaling is necessary because the selected variables are measured on different scales. For example, age, blood pressure, BMI, and HbA1c have different units and ranges. Standardization gives each variable a mean of 0 and a standard deviation of 1, allowing each variable to contribute more equally to distance-based clustering.
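As a quick sanity check, the standardized matrix should now have column means near 0 and standard deviations of 1:
# Verify standardization: column means near 0 and standard deviations equal to 1
round(colMeans(data_scaled), 3)
round(apply(data_scaled, 2, sd), 3)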
Dimensionality reduction was used to visually examine possible cluster structure before applying clustering algorithms. PCA was used to project the standardized data into two dimensions while preserving as much total variation as possible. UMAP was also used to create a nonlinear two-dimensional visualization that can better capture local relationships among observations.
# PCA visualization before clustering
pca_result <- prcomp(data_scaled)
pca_data <- as.data.frame(pca_result$x)
ggplot(pca_data, aes(x = PC1, y = PC2)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "PCA Projection Before Clustering", x = "PC1", y = "PC2")
# UMAP visualization before clustering
set.seed(123)
umap_result <- umap(data_scaled)
umap_data <- as.data.frame(umap_result$layout)
colnames(umap_data) <- c("UMAP1", "UMAP2")
ggplot(umap_data, aes(x = UMAP1, y = UMAP2)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "UMAP Projection Before Clustering", x = "UMAP1", y = "UMAP2")
Three clustering methods were implemented and compared: K-means, DBSCAN, and Gaussian Mixture Models. K-means was selected because it is simple, widely used, easy to interpret, and suitable for continuous numeric variables. DBSCAN was selected because it can identify dense clusters and label unusual observations as noise. GMM was selected because it allows probability-based cluster assignment and is useful when observations may share characteristics with multiple groups.
The main tuning parameter for K-means is the number of clusters, K.
Several K values were evaluated using the Elbow Method, Silhouette
Score, Davies-Bouldin Index, and Calinski-Harabasz Index. For DBSCAN,
the main parameters are eps and minPts; these
were tuned using a k-distance plot and a grid search. For GMM, the
number of mixture components and covariance structure were selected
using BIC and internal validation metrics.
The clustering models were evaluated using Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, cluster size distribution, and practical interpretability. Visualizations included PCA plots, UMAP plots, validation metric plots, silhouette plots, heatmaps, boxplots, and parallel coordinate plots.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was chosen as one of the clustering methods because it does not require the number of clusters to be specified in advance. It can also label outliers as noise points. This is valuable for healthcare data, where extreme values are common and are often precisely the cases worth flagging for further study.
# The k-distance plot is used to guide the choice of epsilon. It sorts all observations by the
# distance to their k-th nearest neighbor; the elbow (bend) in the curve marks the natural density
# threshold and a starting point for epsilon. To check the sensitivity of the elbow to the choice
# of k, the plot is generated at two values of k.
par(mfrow = c(1, 2))
kNNdistplot(data_scaled, k = 4)
abline(h = seq(1.0, 2.0, by = 0.25), col = "red", lty = 2)
title("K-Distance Plot (k = 4)")
kNNdistplot(data_scaled, k = 5)
abline(h = seq(1.0, 2.0, by = 0.25), col = "red", lty = 2)
title("K-Distance Plot (k = 5)")
par(mfrow = c(1, 1))
The K-Distance plots were used to guide the selection of epsilon. Both plots show an elbow at an approximate value of 1.5, which is consistent across the two k values tested. This consistency indicates the elbow is not an artifact of the choice of k, and suggests the natural density threshold lies at around 1.5.
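As an optional numeric complement to the visual elbow, the upper quantiles of the k-nearest-neighbor distances can be examined (the quantile cutoffs below are illustrative):
# Upper quantiles of the 4-nearest-neighbor distances; the elbow region should fall near these values
knn_dist <- dbscan::kNNdist(data_scaled, k = 4)
round(quantile(knn_dist, probs = c(0.90, 0.95, 0.99)), 2)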
# To identify which combination of epsilon and minPts best suits the data, a grid search is
# conducted over epsilon values from 1.2 to 1.7 and minPts values from 3 to 7.
eps.vals <- seq(1.2, 1.7, by = 0.1)
minpts.vals <- c(3, 4, 5, 6, 7)
grid.results <- expand.grid(eps = eps.vals, minPts = minpts.vals)
grid.results$n_clusters <- NA
grid.results$n_noise <- NA
for (i in seq_len(nrow(grid.results))) {
fit <- dbscan::dbscan(data_scaled,
eps = grid.results$eps[i],
minPts = grid.results$minPts[i])
grid.results$n_clusters[i] <- max(fit$cluster)
grid.results$n_noise[i] <- sum(fit$cluster == 0)
}
print(grid.results)
## eps minPts n_clusters n_noise
## 1 1.2 3 10 172
## 2 1.3 3 7 108
## 3 1.4 3 5 79
## 4 1.5 3 4 51
## 5 1.6 3 4 39
## 6 1.7 3 4 30
## 7 1.2 4 7 206
## 8 1.3 4 7 137
## 9 1.4 4 7 91
## 10 1.5 4 4 57
## 11 1.6 4 4 43
## 12 1.7 4 4 30
## 13 1.2 5 2 245
## 14 1.3 5 4 168
## 15 1.4 5 3 114
## 16 1.5 5 3 69
## 17 1.6 5 3 51
## 18 1.7 5 4 31
## 19 1.2 6 2 258
## 20 1.3 6 3 183
## 21 1.4 6 2 134
## 22 1.5 6 2 87
## 23 1.6 6 2 59
## 24 1.7 6 2 43
## 25 1.2 7 1 285
## 26 1.3 7 2 199
## 27 1.4 7 2 148
## 28 1.5 7 2 100
## 29 1.6 7 2 64
## 30 1.7 7 2 47
ggplot(grid.results, aes(x = eps, y = n_noise,
colour = as.factor(minPts),
group = as.factor(minPts))) +
geom_line() +
geom_point(size = 3) +
theme_minimal() +
labs(title = "DBSCAN Grid Search: Noise Points vs Epsilon",
x = "Epsilon",
y = "Noise Points",
colour = "minPts")
ggplot(grid.results, aes(x = eps, y = n_clusters,
colour = as.factor(minPts),
group = as.factor(minPts))) +
geom_line() +
geom_point(size = 3) +
theme_minimal() +
labs(title = "DBSCAN Grid Search: Number of Clusters vs Epsilon",
x = "Epsilon",
y = "Number of Clusters",
colour = "minPts")
# Heatmap: number of clusters at each parameter combination
ggplot(grid.results, aes(x = eps, y = factor(minPts), fill = n_clusters)) +
geom_tile(color = "white") +
scale_fill_viridis_c(option = "plasma", name = "# Clusters") +
labs(title = "DBSCAN Grid Search: Number of Clusters",
x = "eps (search radius)",
y = "minPts") +
theme_minimal(base_size = 13)
# Heatmap: number of noise points at each parameter combination
ggplot(grid.results, aes(x = eps, y = factor(minPts), fill = n_noise)) +
geom_tile(color = "white") +
scale_fill_viridis_c(option = "magma", name = "# Noise") +
labs(title = "DBSCAN Grid Search: Number of Noise Points",
subtitle = "High noise = eps too small or minPts too high",
x = "eps (search radius)",
y = "minPts") +
theme_minimal(base_size = 13)
A grid search was conducted across epsilon values from 1.2 to 1.7 and minPts values from 3 to 7. Although this departs from the common rule of thumb that minPts should be at least the number of dimensions plus one, a minPts of 5 was chosen because the expected value of minPts = 7 consistently produced only one or two clusters. At eps = 1.5 and minPts = 5, DBSCAN produced a stable three-cluster solution with 69 noise points (1.18%). An eps of 1.6 was rejected because its neighbor at 1.7 produced four clusters, suggesting instability, whereas eps = 1.5 yields three clusters at both adjacent values.
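This reasoning can be checked directly by filtering the grid results to the neighborhood of the chosen setting:
# Grid search rows around the selected parameters (eps = 1.5, minPts = 5)
subset(grid.results, minPts == 5 & round(eps, 1) %in% c(1.4, 1.5, 1.6))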
# Compare candidate DBSCAN fits
# Three candidate parameter combinations (tight, medium, strict) are fit to confirm the
# sensitivity of the chosen solution before committing to a final model.
set.seed(123)
db.tight <- dbscan::dbscan(data_scaled, eps = 1.3, minPts = 5)
db.mid <- dbscan::dbscan(data_scaled, eps = 1.5, minPts = 5)
db.strict <- dbscan::dbscan(data_scaled, eps = 1.5, minPts = 7)
cat("\n--- DBSCAN Tight (eps = 1.3, minPts = 5) ---\n")
##
## --- DBSCAN Tight (eps = 1.3, minPts = 5) ---
print(table(db.tight$cluster))
##
## 0 1 2 3 4
## 168 5640 10 5 5
cat("Noise points:", sum(db.tight$cluster == 0), "\n\n")
## Noise points: 168
cat("--- DBSCAN Medium (eps = 1.5, minPts = 5) ---\n")
## --- DBSCAN Medium (eps = 1.5, minPts = 5) ---
print(table(db.mid$cluster))
##
## 0 1 2 3
## 69 5728 26 5
cat("Noise points:", sum(db.mid$cluster == 0), "\n\n")
## Noise points: 69
cat("--- DBSCAN Strict (eps = 1.5, minPts = 7) ---\n")
## --- DBSCAN Strict (eps = 1.5, minPts = 7) ---
print(table(db.strict$cluster))
##
## 0 1 2
## 100 5707 21
cat("Noise points:", sum(db.strict$cluster == 0), "\n")
## Noise points: 100
# Side-by-side visualization of candidate fits in PCA space
plot_dbscan <- function(clusters, title_str, pca_df = pca_data) {
df <- pca_df
df$clus <- factor(ifelse(clusters == 0,
"Noise",
paste("Cluster", clusters)))
ggplot(df, aes(PC1, PC2, color = clus)) +
geom_point(size = 0.8, alpha = 0.6) +
scale_color_brewer(palette = "Set1") +
labs(title = title_str, color = NULL) +
theme_minimal(base_size = 11) +
theme(legend.position = "bottom")
}
p1 <- plot_dbscan(db.tight$cluster, "Tight (eps = 1.3, minPts = 5)")
p2 <- plot_dbscan(db.mid$cluster, "Medium (eps = 1.5, minPts = 5)")
p3 <- plot_dbscan(db.strict$cluster, "Strict (eps = 1.5, minPts = 7)")
grid.arrange(p1, p2, p3, ncol = 2,
top = "DBSCAN Cluster Comparison Across Parameter Settings")
The tight setting (eps = 1.3) labels noticeably more observations as
noise, the strict setting (minPts = 7) collapses the smaller dense
regions into the dominant cluster, and the medium setting preserves the
three-cluster structure observed in the grid search. The medium
configuration is therefore retained as the final model.
# Fit DBSCAN Model
# eps = 1.5, minPts = 5, clusters = 3
set.seed(123)
dbscan.model <- dbscan::dbscan(data_scaled, eps = 1.5, minPts = 5)
# Define non-noise observations for DBSCAN evaluation
no.noise <- dbscan.model$cluster != 0
print(dbscan.model)
## DBSCAN clustering for 5828 objects.
## Parameters: eps = 1.5, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 3 cluster(s) and 69 noise points.
##
## 0 1 2 3
## 69 5728 26 5
##
## Available fields: cluster, eps, minPts, metric, borderPoints
cat("Cluster sizes:\n")
## Cluster sizes:
print(table(dbscan.model$cluster))
##
## 0 1 2 3
## 69 5728 26 5
cat("Noise points:", sum(dbscan.model$cluster == 0), "\n")
## Noise points: 69
cat("Noise percentage:",
round(sum(dbscan.model$cluster == 0) / nrow(data_scaled) * 100, 2), "%\n")
## Noise percentage: 1.18 %
# Add cluster labels to data
data_cluster$dbscan_cluster <- as.factor(dbscan.model$cluster)
DBSCAN identified three clusters and one noise group. Cluster 1 contains 5,728 observations, roughly 98% of the dataset, while Clusters 2 and 3 are much smaller at 26 and 5 observations. The remaining 69 observations (1.18%) are labeled as noise in Cluster 0. These are not necessarily erroneous records; they are more likely extreme clinical cases that fall outside the minimum density threshold of the main clusters.
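One way to sanity-check the claim that the noise points represent extreme clinical cases rather than erroneous records is to compare their averages with those of the clustered observations (an illustrative summary):
# Compare noise points (cluster 0) with clustered observations on key clinical variables
data_cluster %>%
group_by(is_noise = dbscan_cluster == 0) %>%
summarise(across(c(bmi, sys_bp, hba1c), ~round(mean(.x), 1)), n = n())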
# Visualization of DBSCAN in Reduced Dimensional Space
# DBSCAN PCA Visual
pca_data$dbscan_cluster <- as.factor(dbscan.model$cluster)
ggplot(pca_data, aes(x = PC1, y = PC2, color = dbscan_cluster)) +
geom_point(alpha = 0.6, size = 0.8) +
theme_minimal() +
labs(title = "DBSCAN Clusters Visualized Using PCA",
x = "PC1", y = "PC2",
color = "Cluster") +
scale_color_brewer(palette = "Set1")
# DBSCAN UMAP Visual
umap_data$dbscan_cluster <- as.factor(dbscan.model$cluster)
ggplot(umap_data, aes(x = UMAP1, y = UMAP2, color = dbscan_cluster)) +
geom_point(alpha = 0.6, size = 0.8) +
theme_minimal() +
labs(title = "DBSCAN Clusters Visualized Using UMAP",
x = "UMAP1", y = "UMAP2",
color = "Cluster") +
scale_color_brewer(palette = "Set1")
Both visuals confirm the dominant cluster output. Cluster 1 contains the
central mass of the data while the smaller clusters and noise
observations surround it on the periphery.
# Evaluation of DBSCAN
# DBSCAN Silhouette Analysis
# Compute silhouette, excluding noise (Cluster 0)
sil.dbscan <- silhouette(
dbscan.model$cluster[no.noise],
dist(data_scaled[no.noise, ])
)
# Convert to data frame
sil.df <- as.data.frame(unclass(sil.dbscan))
names(sil.df) <- c("cluster", "neighbor", "sil_width")
# Overall mean silhouette score
cat("DBSCAN Mean Silhouette Score:",
round(mean(sil.df$sil_width), 4), "\n\n")
## DBSCAN Mean Silhouette Score: 0.4811
# Per-cluster silhouette distribution
sil.per.cluster <- sil.df %>%
group_by(cluster) %>%
summarise(
n = n(),
mean_sil = round(mean(sil_width), 4),
median_sil = round(median(sil_width), 4),
min_sil = round(min(sil_width), 4),
max_sil = round(max(sil_width), 4),
pct_negative = round(mean(sil_width < 0) * 100, 2),
.groups = "drop"
)
sil.per.cluster %>%
mutate(cluster = paste("Cluster", cluster)) %>%
rename(
Cluster = cluster,
N = n,
Mean = mean_sil,
Median = median_sil,
Min = min_sil,
Max = max_sil,
`% Negative` = pct_negative
) %>%
kable(
caption = "Per-Cluster Silhouette Distribution",
align = c("l", "r", "r", "r", "r", "r", "r")
)
| Cluster | N | Mean | Median | Min | Max | % Negative |
|---|---|---|---|---|---|---|
| Cluster 1 | 5728 | 0.4815 | 0.5109 | -0.3074 | 0.6514 | 0.51 |
| Cluster 2 | 26 | 0.3624 | 0.4103 | 0.0419 | 0.5259 | 0.00 |
| Cluster 3 | 5 | 0.6315 | 0.6258 | 0.5743 | 0.7097 | 0.00 |
# Make plotting order
sil.df <- sil.df %>%
arrange(cluster, desc(sil_width)) %>%
mutate(id = row_number())
# Reference line for mean
avg.sil <- mean(sil.df$sil_width)
# DBSCAN Silhouette Plot
ggplot(sil.df, aes(x = id, y = sil_width, fill = factor(cluster))) +
geom_col(width = 1) +
geom_hline(yintercept = avg.sil, linetype = "dashed", linewidth = 1) +
labs(
title = "DBSCAN Silhouette Plot",
x = "Observations",
y = "Silhouette Width",
fill = "Cluster"
) +
theme_minimal(base_size = 14) +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "right"
)
A mean silhouette score of 0.4811 reflects moderate separation. Observations are generally well matched to their own cluster, but the boundaries are not sharply defined, which is expected given that the vast majority of observations fall into a single cluster. Examined per cluster, the distribution is uneven.
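For reference, the silhouette width of an observation i compares its mean distance a(i) to members of its own cluster with the smallest mean distance b(i) to any other cluster:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
Values near 1 indicate the observation sits well inside its own cluster, values near 0 indicate a boundary case, and negative values suggest possible misassignment.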
# DBSCAN Davies-Bouldin Index
# The Davies-Bouldin index is computed for further evaluation. Lower values indicate better separation.
db.index <- index.DB(
data_scaled[no.noise, ],
dbscan.model$cluster[no.noise]
)
cat("DBSCAN Davies-Bouldin Index:",
round(db.index$DB, 4), "\n")
## DBSCAN Davies-Bouldin Index: 0.7432
# Cluster.stats output
db.stats <- cluster.stats(
dist(data_scaled[no.noise, ]),
dbscan.model$cluster[no.noise]
)
The Davies-Bouldin Index of 0.7432, where lower values denote more compact and better-separated clusters, falls into a moderate range.
# DBSCAN Calinski-Harabasz Index
cat("DBSCAN Calinski-Harabasz Index:",
round(db.stats$ch, 4), "\n")
## DBSCAN Calinski-Harabasz Index: 96.6371
dbscan_metrics <- data.frame(
Method = "DBSCAN",
Silhouette = round(mean(sil.df$sil_width), 4),
Davies_Bouldin = round(db.index$DB, 4),
Calinski_Harabasz = round(db.stats$ch, 4)
)
kable(dbscan_metrics, caption = "Final DBSCAN Evaluation Metrics")
| Method | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|---|---|---|---|
| DBSCAN | 0.4811 | 0.7432 | 96.6371 |
A Calinski-Harabasz value of 96.64 is modest. It reflects that DBSCAN found one overwhelmingly dominant cluster and two very small ones, which limits the between-cluster separation this metric can detect.
DBSCAN was applied across a grid search of 30 parameter combinations and confirmed against three candidate fits (tight, medium, strict). In all cases the algorithm either produced one dominant cluster containing over 95% of observations alongside negligible splinter groups, or collapsed to two clusters with similar imbalance. No parameter combination produced a meaningful multi-cluster solution with adequately sized groups, which indicates that the NHANES health profile data lack the distinct density structure DBSCAN requires. This finding is itself valuable: it suggests that health risk profiles in this population lie on a continuous spectrum rather than forming isolated density regions, which means GMM and K-means are better suited to this data structure.
Gaussian Mixture Models were selected because they are useful when clusters may overlap. GMM assumes that the data are generated from a mixture of Gaussian distributions, where each distribution represents a cluster. Unlike K-means, which assigns each observation directly to one group, GMM estimates the probability that each observation belongs to each cluster.
set.seed(123)
# Automatic GMM model selection using BIC
gmm_model <- Mclust(data_scaled)
summary(gmm_model)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 9 components:
##
## log-likelihood n df BIC ICL
## -39771.76 5828 211 -81372.99 -84650.09
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 491 198 691 506 940 1166 544 775 517
plot(gmm_model, what = "BIC")
# Validation metrics (BIC, AIC, silhouette, Davies-Bouldin, Calinski-Harabasz) for GMM models with different numbers of components
g_range <- 2:9
gmm_validation <- data.frame(
G = g_range,
Model = NA,
BIC = NA,
AIC = NA,
Silhouette = NA,
Davies_Bouldin = NA,
Calinski_Harabasz = NA
)
for (i in seq_along(g_range)) {
g <- g_range[i]
fit <- Mclust(data_scaled, G = g)
clusters_gmm <- as.integer(fit$classification)
sil <- silhouette(clusters_gmm, dist(data_scaled))
gmm_validation$Silhouette[i] <- mean(sil[, 3])
res <- intCriteria(as.matrix(data_scaled), clusters_gmm,
c("Davies_Bouldin", "Calinski_Harabasz"))
gmm_validation$Model[i] <- fit$modelName
gmm_validation$BIC[i] <- fit$bic
gmm_validation$AIC[i] <- -2 * fit$loglik + 2 * fit$df
gmm_validation$Davies_Bouldin[i] <- res$davies_bouldin
gmm_validation$Calinski_Harabasz[i] <- res$calinski_harabasz
}
gmm_validation <- gmm_validation %>%
mutate(across(where(is.numeric), ~round(.x, 4)))
kable(gmm_validation, caption = "GMM Validation Metrics by Number of Components")
| G | Model | BIC | AIC | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|---|---|---|---|---|---|---|
| 2 | VVV | -86790.89 | 86424.01 | 0.2461 | 2.1686 | 953.9958 |
| 3 | VEV | -85008.67 | 84521.72 | 0.1667 | 1.8939 | 908.9528 |
| 4 | VEV | -83512.20 | 82871.84 | 0.1813 | 1.8760 | 998.6605 |
| 5 | VEV | -82872.22 | 82078.44 | 0.1234 | 2.0574 | 825.7462 |
| 6 | VEV | -82433.01 | 81485.81 | 0.0997 | 1.9826 | 721.0337 |
| 7 | VEV | -81891.03 | 80790.41 | 0.0856 | 2.5829 | 631.8080 |
| 8 | VEV | -81791.56 | 80537.52 | 0.0900 | 2.2003 | 607.6264 |
| 9 | VEV | -81248.89 | 79841.42 | 0.0476 | 2.3250 | 521.5428 |
ggplot(gmm_validation, aes(x = G, y = BIC)) +
geom_line() + geom_point(size = 3) +
theme_minimal() +
labs(title = "GMM BIC by Number of Components", x = "Number of Components", y = "BIC")
ggplot(gmm_validation, aes(x = G, y = AIC)) +
geom_line() + geom_point(size = 3) +
theme_minimal() +
labs(title = "GMM AIC by Number of Components", x = "Number of Components", y = "AIC")
ggplot(gmm_validation, aes(x = G, y = Silhouette)) +
geom_line() + geom_point(size = 3) +
theme_minimal() +
labs(title = "GMM Silhouette Score by Number of Components", x = "Number of Components", y = "Silhouette Score")
ggplot(gmm_validation, aes(x = G, y = Davies_Bouldin)) +
geom_line() + geom_point(size = 3) +
theme_minimal() +
labs(title = "GMM Davies-Bouldin Index by Number of Components", x = "Number of Components", y = "Davies-Bouldin Index")
ggplot(gmm_validation, aes(x = G, y = Calinski_Harabasz)) +
geom_line() + geom_point(size = 3) +
theme_minimal() +
labs(title = "GMM Calinski-Harabasz Index by Number of Components", x = "Number of Components", y = "Calinski-Harabasz Index")
Based on BIC, internal validation metrics, and interpretability, a four-cluster GMM solution with VEV covariance structure was selected. The VEV structure allows clusters to have variable volume, equal shape, and variable orientation, which gives the model more flexibility than a simple equal-variance model.
set.seed(123)
gmm_final <- Mclust(data_scaled, G = 4, modelNames = "VEV")
clusters_gmm_final <- as.integer(gmm_final$classification)
gmm_indices <- intCriteria(
traj = as.matrix(data_scaled),
part = clusters_gmm_final,
crit = c("Davies_Bouldin", "Calinski_Harabasz")
)
gmm_sil <- silhouette(clusters_gmm_final, dist(data_scaled))
gmm_silhouette <- mean(gmm_sil[, 3])
gmm_metrics <- data.frame(
Method = "GMM",
Silhouette = round(gmm_silhouette, 4),
Davies_Bouldin = round(gmm_indices$davies_bouldin, 4),
Calinski_Harabasz = round(gmm_indices$calinski_harabasz, 4)
)
kable(gmm_metrics, caption = "Final GMM Evaluation Metrics")
| Method | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|---|---|---|---|
| GMM | 0.1746 | 1.9078 | 991.787 |
summary(gmm_sil)
## Silhouette of 5828 units in 4 clusters from silhouette.default(x = clusters_gmm_final, dist = dist(data_scaled)) :
## Cluster sizes and average silhouette widths:
## 1145 823 1970 1890
## 0.2395428 -0.1011910 0.2515514 0.1752064
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.49521 0.07651 0.21002 0.17462 0.31027 0.47092
fviz_silhouette(gmm_sil) +
labs(title = "Silhouette Plot for GMM") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 1145 0.24
## 2 2 823 -0.10
## 3 3 1970 0.25
## 4 4 1890 0.18
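Because GMM assigns observations probabilistically, the posterior membership probabilities from the final model can also be inspected; rows without a clearly dominant value correspond to observations shared between components (the 0.9 cutoff below is only an illustrative threshold):
# Posterior membership probabilities (soft assignments) from the final GMM
head(round(gmm_final$z, 3))
# Proportion of observations assigned to their component with probability above 0.9
round(mean(apply(gmm_final$z, 1, max) > 0.9), 3)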
data_cluster$gmm_cluster <- as.factor(gmm_final$classification)
pca_data$gmm_cluster <- data_cluster$gmm_cluster
ggplot(pca_data, aes(PC1, PC2, color = gmm_cluster)) +
geom_point(alpha = 0.6) +
theme_minimal() +
labs(title = "GMM Clusters Visualized Using PCA", x = "PC1", y = "PC2", color = "Cluster")
umap_data$gmm_cluster <- data_cluster$gmm_cluster
ggplot(umap_data, aes(x = UMAP1, y = UMAP2, color = gmm_cluster)) +
geom_point(alpha = 0.6, size = 0.8) +
theme_minimal() +
labs(title = "GMM Clusters Visualized Using UMAP", x = "UMAP1", y = "UMAP2", color = "Cluster")
# K-means Clustering
K-means clustering was applied to the standardized dataset. Conceptually, K-means assigns each observation to the nearest cluster center, called a centroid, and then updates the centroids until the assignments become stable. Because K-means requires the number of clusters to be selected in advance, several K values were tested before selecting the final model.
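Formally, K-means chooses the assignments and centroids that minimize the total within-cluster sum of squares, the same quantity (tot.withinss) plotted in the elbow method below:
$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$
where $\mu_k$ is the centroid (mean) of cluster $C_k$.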
set.seed(123)
# Elbow Method
wss <- sapply(1:10, function(k) {
kmeans(data_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within-Cluster Sum of Squares",
main = "Elbow Method for K-means")
# Validation metrics for K = 2 to 10
validation_results <- data.frame(
K = integer(),
Silhouette_Score = numeric(),
Davies_Bouldin_Index = numeric(),
Calinski_Harabasz_Index = numeric()
)
for (k in 2:10) {
set.seed(123)
km <- kmeans(data_scaled, centers = k, nstart = 25)
clusters <- km$cluster
sil <- silhouette(clusters, dist(data_scaled))
silhouette_score <- mean(sil[, 3])
db <- intCriteria(as.matrix(data_scaled), clusters, "Davies_Bouldin")
ch <- intCriteria(as.matrix(data_scaled), clusters, "Calinski_Harabasz")
validation_results <- rbind(
validation_results,
data.frame(
K = k,
Silhouette_Score = silhouette_score,
Davies_Bouldin_Index = db$davies_bouldin,
Calinski_Harabasz_Index = ch$calinski_harabasz
)
)
}
validation_results <- validation_results %>%
mutate(across(where(is.numeric), ~round(.x, 4)))
kable(validation_results, caption = "Validation Metrics for K-means")
| K | Silhouette_Score | Davies_Bouldin_Index | Calinski_Harabasz_Index |
|---|---|---|---|
| 2 | 0.2019 | 1.7575 | 1534.722 |
| 3 | 0.1941 | 1.6355 | 1259.598 |
| 4 | 0.2114 | 1.4975 | 1223.244 |
| 5 | 0.2153 | 1.4129 | 1205.100 |
| 6 | 0.2135 | 1.3643 | 1195.268 |
| 7 | 0.2149 | 1.2617 | 1131.680 |
| 8 | 0.2025 | 1.2730 | 1101.250 |
| 9 | 0.1842 | 1.3089 | 1067.824 |
| 10 | 0.1798 | 1.3285 | 1015.720 |
ggplot(validation_results, aes(x = K, y = Silhouette_Score)) +
geom_line() + geom_point(size = 3) + theme_minimal() +
labs(title = "Silhouette Score for Different K Values", x = "K", y = "Silhouette Score")
ggplot(validation_results, aes(x = K, y = Davies_Bouldin_Index)) +
geom_line() + geom_point(size = 3) + theme_minimal() +
labs(title = "Davies-Bouldin Index for Different K Values", x = "K", y = "Davies-Bouldin Index")
ggplot(validation_results, aes(x = K, y = Calinski_Harabasz_Index)) +
geom_line() + geom_point(size = 3) + theme_minimal() +
labs(title = "Calinski-Harabasz Index for Different K Values", x = "K", y = "Calinski-Harabasz Index")
Based on the elbow plot, validation metrics, and interpretability of
the cluster sizes, K = 5 was selected for the final K-means model. The
argument nstart = 25 was used so that K-means would try
multiple random starting points and reduce the chance of selecting a
poor local solution.
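The benefit of multiple restarts can be illustrated by comparing a single-start fit with a multi-start fit (a rough check; results depend on the random seed, and the two may occasionally coincide):
# Illustrative comparison: one random start vs. 25 restarts (seed chosen only for this check)
set.seed(42)
km_single <- kmeans(data_scaled, centers = 5, nstart = 1)
km_multi <- kmeans(data_scaled, centers = 5, nstart = 25)
c(single_start = km_single$tot.withinss, multi_start = km_multi$tot.withinss)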
set.seed(123)
kmeans_model <- kmeans(data_scaled, centers = 5, nstart = 25)
data_cluster$kmeans_cluster <- as.factor(kmeans_model$cluster)
table(data_cluster$kmeans_cluster)
##
## 1 2 3 4 5
## 1317 314 1709 1598 890
clusters <- kmeans_model$cluster
sil_kmeans <- silhouette(clusters, dist(data_scaled))
kmeans_silhouette <- mean(sil_kmeans[, 3])
kmeans_db <- intCriteria(as.matrix(data_scaled), clusters, "Davies_Bouldin")$davies_bouldin
kmeans_ch <- intCriteria(as.matrix(data_scaled), clusters, "Calinski_Harabasz")$calinski_harabasz
kmeans_metrics <- data.frame(
Method = "K-means",
Silhouette = round(kmeans_silhouette, 4),
Davies_Bouldin = round(kmeans_db, 4),
Calinski_Harabasz = round(kmeans_ch, 4)
)
kable(kmeans_metrics, caption = "Final K-means Evaluation Metrics")
| Method | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|---|---|---|---|
| K-means | 0.2153 | 1.4129 | 1205.1 |
pca_data$kmeans_cluster <- data_cluster$kmeans_cluster
ggplot(pca_data, aes(x = PC1, y = PC2, color = kmeans_cluster)) +
geom_point(alpha = 0.6) +
theme_minimal() +
labs(title = "K-means Clusters Visualized Using PCA", x = "PC1", y = "PC2", color = "Cluster")
umap_data$kmeans_cluster <- data_cluster$kmeans_cluster
ggplot(umap_data, aes(x = UMAP1, y = UMAP2, color = kmeans_cluster)) +
geom_point(alpha = 0.6) +
theme_minimal() +
labs(title = "K-means Clusters Visualized Using UMAP", x = "UMAP1", y = "UMAP2", color = "Cluster")
The three clustering methods were compared using internal validation metrics and practical interpretability. The Silhouette Score measures how well observations fit within their assigned cluster compared with other clusters, with higher values indicating better separation. The Davies-Bouldin Index measures average cluster similarity, with lower values indicating better separation. The Calinski-Harabasz Index compares between-cluster dispersion to within-cluster dispersion, with higher values indicating better-defined clusters.
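For completeness, the latter two indices can be written as:
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\bar{d}_i + \bar{d}_j}{d(c_i, c_j)}, \qquad CH = \frac{\operatorname{tr}(B_k)/(k - 1)}{\operatorname{tr}(W_k)/(n - k)}$$
where $\bar{d}_i$ is the average distance of observations in cluster $i$ to its centroid $c_i$, and $B_k$ and $W_k$ are the between- and within-cluster dispersion matrices for $k$ clusters and $n$ observations.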
method_comparison <- bind_rows(kmeans_metrics, dbscan_metrics, gmm_metrics)
kable(method_comparison, caption = "Comparison of Clustering Methods")
| Method | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|---|---|---|---|
| K-means | 0.2153 | 1.4129 | 1205.0999 |
| DBSCAN | 0.4811 | 0.7432 | 96.6371 |
| GMM | 0.1746 | 1.9078 | 991.7870 |
Because the purpose of this project was to identify meaningful and interpretable health profiles, K-means was selected as the final clustering method. It provided a strong balance between statistical performance, cluster size distribution, and practical interpretation. DBSCAN was useful for identifying noise and unusual observations, while GMM was useful for examining overlapping probability-based group structure.
The final K-means model was profiled because it was selected as the best method for this project. The profiling analysis uses summary statistics, sex distribution, feature comparison plots, boxplots, a heatmap, and a parallel coordinate plot. Waist was not included in the cluster profiling because it was removed from the clustering model due to high correlation with BMI. Sex was included only as a descriptive profiling variable.
kprof <- data_imputed %>%
dplyr::select(age, sex, income_pir, bmi, sys_bp, dia_bp, hba1c)
# Add K-means cluster labels
kprof$kmeans_cluster <- as.factor(kmeans_model$cluster)
# Label sex variable for interpretation
kprof$sex <- as.character(kprof$sex)
kprof$sex <- case_when(
kprof$sex %in% c("1", "Male", "male", "MALE") ~ "Male",
kprof$sex %in% c("2", "Female", "female", "FEMALE") ~ "Female",
TRUE ~ kprof$sex
)
kprof$sex <- as.factor(kprof$sex)
profile_vars <- c("age", "income_pir", "bmi", "sys_bp", "dia_bp", "hba1c")
cluster_size <- kprof %>%
group_by(kmeans_cluster) %>%
summarise(n = n(), .groups = "drop") %>%
mutate(percent = round(n / sum(n) * 100, 2))
kable(cluster_size, caption = "K-means Cluster Size and Percentage")
| kmeans_cluster | n | percent |
|---|---|---|
| 1 | 1317 | 22.60 |
| 2 | 314 | 5.39 |
| 3 | 1709 | 29.32 |
| 4 | 1598 | 27.42 |
| 5 | 890 | 15.27 |
cluster_profile <- kprof %>%
group_by(kmeans_cluster) %>%
summarise(
n = n(),
mean_age = mean(age, na.rm = TRUE),
mean_income_pir = mean(income_pir, na.rm = TRUE),
mean_bmi = mean(bmi, na.rm = TRUE),
mean_systolic_bp = mean(sys_bp, na.rm = TRUE),
mean_diastolic_bp = mean(dia_bp, na.rm = TRUE),
mean_hba1c = mean(hba1c, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(across(where(is.numeric), ~round(.x, 2)))
kable(cluster_profile, caption = "K-means Cluster Profile Summary")
| kmeans_cluster | n | mean_age | mean_income_pir | mean_bmi | mean_systolic_bp | mean_diastolic_bp | mean_hba1c |
|---|---|---|---|---|---|---|---|
| 1 | 1317 | 49.38 | 4.62 | 27.81 | 120.37 | 72.13 | 5.59 |
| 2 | 314 | 60.29 | 2.17 | 32.51 | 129.50 | 69.73 | 9.23 |
| 3 | 1709 | 30.48 | 1.66 | 25.67 | 111.58 | 65.46 | 5.31 |
| 4 | 1598 | 66.01 | 1.89 | 27.80 | 141.16 | 75.24 | 5.85 |
| 5 | 890 | 43.77 | 1.90 | 40.97 | 125.59 | 74.69 | 5.74 |
sex_profile <- kprof %>%
group_by(kmeans_cluster, sex) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(kmeans_cluster) %>%
mutate(percent = round(n / sum(n) * 100, 2)) %>%
ungroup()
kable(sex_profile, caption = "Sex Distribution by K-means Cluster")
| kmeans_cluster | sex | n | percent |
|---|---|---|---|
| 1 | Female | 645 | 48.97 |
| 1 | Male | 672 | 51.03 |
| 2 | Female | 141 | 44.90 |
| 2 | Male | 173 | 55.10 |
| 3 | Female | 922 | 53.95 |
| 3 | Male | 787 | 46.05 |
| 4 | Female | 779 | 48.75 |
| 4 | Male | 819 | 51.25 |
| 5 | Female | 522 | 58.65 |
| 5 | Male | 368 | 41.35 |
ggplot(sex_profile,
aes(x = kmeans_cluster, y = percent, fill = sex)) +
geom_col(position = "dodge") +
theme_minimal() +
labs(
title = "Sex Distribution Across K-means Clusters",
x = "Cluster",
y = "Percent",
fill = "Sex"
)
Sex was not used to create the clusters, but it was used to describe the clusters after the K-means model was created. This helps determine whether some health profiles are more common among male or female participants.
cluster_profile_long <- cluster_profile %>%
dplyr::select(-n) %>%
pivot_longer(cols = -kmeans_cluster, names_to = "feature", values_to = "mean_value")
ggplot(cluster_profile_long, aes(x = kmeans_cluster, y = mean_value, fill = kmeans_cluster)) +
geom_col() +
facet_wrap(~ feature, scales = "free_y") +
theme_minimal() +
labs(title = "K-means Cluster Profiles Based on Mean Values", x = "Cluster", y = "Mean Value", fill = "Cluster")
kprof %>%
dplyr::select(all_of(profile_vars), kmeans_cluster) %>%
pivot_longer(cols = all_of(profile_vars), names_to = "feature", values_to = "value") %>%
ggplot(aes(x = kmeans_cluster, y = value, fill = kmeans_cluster)) +
geom_boxplot(outlier.size = 0.5) +
facet_wrap(~ feature, scales = "free_y") +
theme_minimal() +
labs(title = "Feature Distributions Across K-means Clusters", x = "Cluster", y = "Value", fill = "Cluster")
heatmap_data <- cluster_profile %>%
dplyr::select(-n) %>%
pivot_longer(cols = -kmeans_cluster, names_to = "feature", values_to = "mean_value") %>%
group_by(feature) %>%
mutate(standardized_mean = as.numeric(scale(mean_value))) %>%
ungroup()
ggplot(heatmap_data, aes(x = feature, y = kmeans_cluster, fill = standardized_mean)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "steelblue", mid = "white", high = "tomato", midpoint = 0) +
theme_minimal() +
labs(title = "Heatmap of Standardized K-means Cluster Profiles", x = "Feature", y = "Cluster", fill = "Standardized Mean") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
kprof %>%
dplyr::select(age, income_pir, bmi, sys_bp, dia_bp, hba1c, kmeans_cluster) %>%
ggparcoord(columns = 1:6, groupColumn = 7, scale = "uniminmax", alphaLines = 0.25) +
theme_minimal() +
labs(title = "Parallel Coordinate Plot of K-means Clusters", x = "Features", y = "Scaled Values", colour = "Cluster")
This analysis created several visualizations to support clustering interpretation. Histograms and boxplots were used during exploratory data analysis to examine variable distributions and possible outliers. A missing-value bar chart was used to show the amount of missing data by variable. A correlation matrix was used to guide feature selection. PCA and UMAP plots were used to visualize the overall structure of the data and the clusters produced by K-means, DBSCAN, and GMM. Validation metric plots were used to support parameter choices. Silhouette plots were used to assess cluster separation. Finally, K-means cluster profiles were visualized using bar plots, boxplots, a heatmap, sex distribution plots, and a parallel coordinate plot.
Overall, the analysis successfully applied and compared three clustering methods to NHANES health data. The preprocessing steps included merging NHANES files, selecting relevant health variables, checking missing values, applying median imputation, removing redundant variables, and scaling the final clustering variables. K-means, DBSCAN, and GMM were then compared using internal validation metrics and practical interpretability. K-means was selected as the final method because it produced interpretable health-related clusters that were useful for profiling participants based on age, income, BMI, blood pressure, and HbA1c. Sex was not used in creating the clusters, but it was included in the final profiling stage to describe the demographic composition of each cluster.