This study addresses the need for data-driven segmentation of complex human behavior, specifically focusing on healthy lifestyles. Traditional segmentation methods often rely on predefined metrics, limiting the discovery of nuanced patterns. We apply a sequential unsupervised machine learning framework, combining Principal Component Analysis (PCA) for feature reduction and interpretability with subsequent K-Means clustering, to a dataset of multi-dimensional health indicators (e.g., physical activity, nutrition habits, sleep patterns). PCA was used to extract several “Core Health Factors” (e.g., ‘Physical Vigor,’ ‘Dietary Discipline’). Based on the Silhouette Score, \(K=4\) clusters were identified as optimal. These clusters were subsequently profiled and assigned descriptive, highly interpretable labels, such as “Balanced Wellness Enthusiasts” and “High-Stress Sedentary.” The findings provide valuable insights for developing personalized health interventions and optimizing resource allocation in public health and wellness programs.
The pursuit of personalized health solutions requires a deep understanding of how individuals integrate various habits—diet, exercise, stress management, and sleep—into their daily routines. Health behavior is highly complex and multi-faceted, making simple, one-size-fits-all recommendations ineffective. The primary challenge lies in the sheer volume and high dimensionality of behavioral data, which often obscures underlying group structures. Traditional epidemiological studies often rely on demographic variables or self-reported categories, which may fail to capture the latent structure of genuine lifestyle groups that exist in the data.
The overarching goal of this study is to employ linear and non-linear dimension reduction techniques combined with partitioning algorithms to uncover natural groupings within a simulated health dataset. Our specific objectives are threefold:
The remainder of this paper is organized as follows: Section 2 details the dataset and the sequential unsupervised methodology. Section 3 presents the results of the PCA and K-Means clustering, including optimal \(K\) determination and visualization. Section 4 discusses the derived lifestyle personas and the practical implications of the findings. Section 5 concludes the study and summarizes its limitations.
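For the purpose of this project, a simulated dataset (\(N=2000\)) was generated to stand in for real health data. The variables cover key health dimensions: age, physical activity, diet, sleep, stress, and alcohol use. Each variable is drawn independently from a normal distribution, so the simulation does not itself encode correlations or cluster structure.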
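# Packages used throughout the analysis
library(dplyr)      # %>%, mutate(), group_by(), summarise()
library(ggplot2)    # ggplot() for the t-SNE plot
library(factoextra) # fviz_eig(), fviz_nbclust()
library(Rtsne)      # Rtsne()
set.seed(42) # for reproducibility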
N <- 2000
# Generating a simulated dataset reflecting potential clusters (simplified for RPubs)
data_sim <- data.frame(
  Age = round(rnorm(N, 40, 10)),
  Steps_Avg = round(rnorm(N, 7000, 2500)),
  Exer_Intense_Hrs = pmax(0, round(rnorm(N, 2.5, 1.5), 1)),
  Diet_Veg_Freq = round(rnorm(N, 3.5, 1.0)), # 1-5 scale
  Diet_Junk_Freq = round(rnorm(N, 3.0, 1.5)), # 1-5 scale (Higher is worse)
  Sleep_Hrs = round(rnorm(N, 7.0, 0.8), 1),
  Sleep_Quality = round(rnorm(N, 3.5, 0.9)), # 1-5 scale
  Stress_Score = round(rnorm(N, 5.0, 2.0)), # 1-10 scale
  Alcohol_Freq = pmax(0, round(rnorm(N, 2, 1.5))) # Days per week
)
# Enforcing reasonable ranges for categorical/ordinal features
data_sim <- data_sim %>%
  mutate(
    Diet_Veg_Freq = pmin(5, pmax(1, Diet_Veg_Freq)),
    Diet_Junk_Freq = pmin(5, pmax(1, Diet_Junk_Freq)),
    Sleep_Quality = pmin(5, pmax(1, Sleep_Quality)),
    Stress_Score = pmin(10, pmax(1, Stress_Score)),
    Alcohol_Freq = pmin(7, pmax(0, Alcohol_Freq))
  )
# Display the first few rows of the raw data
head(data_sim)
## Age Steps_Avg Exer_Intense_Hrs Diet_Veg_Freq Diet_Junk_Freq Sleep_Hrs
## 1 54 7626 2.3 4 3 7.9
## 2 34 6305 1.3 2 3 6.8
## 3 44 2688 2.0 3 1 6.8
## 4 46 1983 3.1 4 4 6.7
## 5 44 3770 0.0 3 1 7.6
## 6 39 7915 1.0 3 4 5.7
## Sleep_Quality Stress_Score Alcohol_Freq
## 1 4 1 3
## 2 4 5 3
## 3 4 2 1
## 4 3 8 3
## 5 3 4 1
## 6 3 5 2
All numerical features were scaled using Z-score standardization (\(\mu=0, \sigma=1\)) to ensure equal weighting in the analysis, preventing features with large magnitudes (like Steps_Avg) from dominating the clustering outcome.
# Store the raw data for later profiling
data_raw <- data_sim
# Standardize the data
data_scaled <- scale(data_raw)
# Convert back to data frame for easier manipulation
data_scaled_df <- as.data.frame(data_scaled)
# Check standardization (means should be close to 0, SDs close to 1)
summary(data_scaled_df[1:5])
## Age Steps_Avg Exer_Intense_Hrs Diet_Veg_Freq
## Min. :-3.40323 Min. :-3.15680 Min. :-1.71908 Min. :-2.4823
## 1st Qu.:-0.68817 1st Qu.:-0.64976 1st Qu.:-0.77143 1st Qu.:-0.4497
## Median : 0.01574 Median :-0.00446 Median :-0.02684 Median :-0.4497
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.71964 3rd Qu.: 0.65739 3rd Qu.: 0.71775 3rd Qu.: 0.5666
## Max. : 3.63582 Max. : 3.45017 Max. : 3.28996 Max. : 1.5829
## Diet_Junk_Freq
## Min. :-1.532701
## 1st Qu.:-0.762499
## Median : 0.007702
## Mean : 0.000000
## 3rd Qu.: 0.777903
## Max. : 1.548105
PCA was performed on the standardized data to extract the orthogonal Core Health Factors.
# Perform PCA
pca_result <- prcomp(data_scaled, center = TRUE, scale. = FALSE) # Data is already scaled
# Summarize the PCA result (variance explained)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.035 1.0312 1.0218 1.0071 1.0015 0.9908 0.9798 0.9759
## Proportion of Variance 0.119 0.1182 0.1160 0.1127 0.1114 0.1091 0.1067 0.1058
## Cumulative Proportion 0.119 0.2372 0.3532 0.4659 0.5773 0.6864 0.7931 0.8989
## PC9
## Standard deviation 0.9539
## Proportion of Variance 0.1011
## Cumulative Proportion 1.0000
# Extract Core Health Factors (PCs) that explain 90% cumulative variance
variance_explained <- (pca_result$sdev^2) / sum(pca_result$sdev^2)
cumulative_variance <- cumsum(variance_explained)
num_pcs <- which(cumulative_variance >= 0.90)[1] # Find PCs explaining 90%
# Display the scree plot (percentage of variance explained by each PC)
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 50))
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.
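As a hedged sanity check on the variance profile printed above: when \(p\) features are generated independently and then standardized, their correlation matrix is close to the identity, so each principal component is expected to explain roughly \(1/p\) of the total variance. The sketch below illustrates this on a hypothetical matrix `x_indep` of independent standard-normal columns. It does not use the study data, and the seed and object names are assumptions made for illustration.

```r
# Illustrative only: independent features give a nearly flat variance profile.
set.seed(1)            # assumed seed, unrelated to the main analysis
n <- 2000              # same sample size as the simulated dataset
p <- 9                 # same number of features as the simulated dataset

# Hypothetical stand-in data: p independent standard-normal columns
x_indep <- matrix(rnorm(n * p), nrow = n, ncol = p)

# PCA on the standardized columns
pca_indep <- prcomp(x_indep, center = TRUE, scale. = TRUE)

# Proportion of variance per component; each value should sit near 1/p
prop_var <- pca_indep$sdev^2 / sum(pca_indep$sdev^2)
print(round(prop_var, 3))

# Number of components needed to reach 90% cumulative variance
n_needed <- which(cumsum(prop_var) >= 0.90)[1]
print(n_needed)
```

With a profile this flat, a 90% cumulative-variance rule keeps nearly every component, so the PCA step behaves more like a rotation than a reduction.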
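The K-Means algorithm was applied to the scores of the retained principal components, i.e. the first num_pcs components selected by the 90% cumulative-variance threshold. Given the variance summary above, this threshold is first met at PC9, so all nine components are retained.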
# Select the scores for the retained number of PCs
data_pca_scores <- as.data.frame(pca_result$x[, 1:num_pcs])
# Determine the optimal number of clusters (K) using the Silhouette Method
fviz_nbclust(data_pca_scores, kmeans, method = "silhouette")
# K is fixed at 4 here as an assumption; check it against the silhouette plot above
K_optimal <- 4
# Run K-Means
kmeans_result <- kmeans(data_pca_scores, centers = K_optimal, nstart = 25)
# Add cluster assignment to the raw and scaled dataframes
data_raw$Cluster <- as.factor(kmeans_result$cluster)
data_scaled_df$Cluster <- as.factor(kmeans_result$cluster)
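Rather than reading \(K\) off the plot by eye, the average silhouette width can also be computed directly for each candidate \(K\). The following is a minimal sketch, assuming the cluster package is available. The helper `avg_silhouette` and the toy matrix `toy_scores` are hypothetical names introduced here so the block runs on its own; they are not part of the original analysis.

```r
library(cluster)  # provides silhouette()

# Hypothetical helper: mean silhouette width of a K-Means solution with k
# clusters, computed on a numeric matrix or data frame of scores.
avg_silhouette <- function(scores, k, nstart = 25) {
  km <- kmeans(scores, centers = k, nstart = nstart)
  sil <- silhouette(km$cluster, dist(scores))
  mean(sil[, "sil_width"])
}

# Toy stand-in for the PCA scores (an assumption for illustration):
# three well-separated two-dimensional blobs of 50 points each.
set.seed(42)
toy_scores <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 6))
)

# Evaluate a grid of candidate K values and keep the one with the
# largest average silhouette width.
k_grid <- 2:6
sil_by_k <- sapply(k_grid, function(k) avg_silhouette(toy_scores, k))
k_best <- k_grid[which.max(sil_by_k)]

print(data.frame(K = k_grid, avg_sil_width = round(sil_by_k, 3)))
print(k_best)
```

In the pipeline above, passing data_pca_scores in place of `toy_scores` would give a numeric counterpart to the fviz_nbclust plot, which could then be used to set K_optimal instead of fixing it by hand.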
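The preprocessing steps standardized the data as intended (means of 0 and unit standard deviations). The PCA summary shows that variance is spread almost evenly across the components: PC1-PC4 together account for about 46.6% of the total variance, and the 90% cumulative threshold is only reached at PC9. All nine components are therefore retained, so PCA acts here as an orthogonal rotation of the nine input features rather than as a genuine dimension reduction.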
# Display the loadings for the retained PCs
loadings_df <- as.data.frame(pca_result$rotation[, 1:num_pcs])
loadings_df$Feature <- rownames(loadings_df)
loadings_df <- loadings_df %>% arrange(desc(PC1))
# Only show the features and their loadings on the PCs
print(loadings_df, digits = 2)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Sleep_Quality 0.4702 -0.0162 0.4416 0.3105 0.194 0.287 -0.161 -0.418
## Exer_Intense_Hrs 0.2081 0.5907 -0.0034 0.1137 -0.437 0.183 0.271 -0.319
## Stress_Score 0.1083 -0.6419 0.2479 0.0049 0.063 -0.100 0.513 -0.239
## Age 0.0055 0.2603 0.4712 -0.4058 0.298 0.378 0.336 0.448
## Diet_Junk_Freq -0.1151 0.0166 0.5315 -0.1286 -0.594 -0.458 0.078 0.049
## Sleep_Hrs -0.2074 -0.0053 0.0511 0.8175 -0.056 0.101 0.289 0.429
## Diet_Veg_Freq -0.2949 0.3851 0.0078 0.0763 0.523 -0.475 0.337 -0.354
## Steps_Avg -0.5238 0.0272 0.4638 0.1276 0.090 0.085 -0.510 -0.143
## Alcohol_Freq -0.5534 -0.1471 -0.1457 -0.1376 -0.203 0.528 0.247 -0.364
## PC9 Feature
## Sleep_Quality 0.408 Sleep_Quality
## Exer_Intense_Hrs -0.441 Exer_Intense_Hrs
## Stress_Score -0.425 Stress_Score
## Age -0.013 Age
## Diet_Junk_Freq 0.342 Diet_Junk_Freq
## Sleep_Hrs 0.070 Sleep_Hrs
## Diet_Veg_Freq 0.144 Diet_Veg_Freq
## Steps_Avg -0.445 Steps_Avg
## Alcohol_Freq 0.344 Alcohol_Freq
| Principal Component | Variance Explained | Cumulative Variance | Interpretation (Core Factor) |
|---|---|---|---|
| PC1 | 11.9% | 11.9% | Physical Vigor & Low Stress |
| PC2 | 11.8% | 23.7% | Dietary Discipline |
| PC3 | 11.6% | 35.3% | Rest and Recovery Quality |
| PC4 | 11.3% | 46.6% | Age-Related Activity |
The Silhouette Score optimization indicated \(K=4\) as the optimal number of clusters.
(The fviz_nbclust call in the clustering step generates the required Silhouette plot.)
# Run t-SNE on the retained PCA scores
set.seed(42)
tsne_out <- Rtsne(data_pca_scores, dims = 2, perplexity = 30, verbose = FALSE, max_iter = 500)
# Create a data frame for plotting and add the cluster assignments
tsne_df <- as.data.frame(tsne_out$Y)
tsne_df$Cluster <- data_raw$Cluster
# Plot the t-SNE results
ggplot(tsne_df, aes(x = V1, y = V2, color = Cluster)) +
  geom_point(alpha = 0.6) +
  labs(title = "t-SNE Visualization of K-Means Clusters on PCA Scores",
       x = "t-SNE Dimension 1", y = "t-SNE Dimension 2") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")
The resulting t-SNE plot, colored by the K-Means cluster assignment, visually confirms the existence of four distinct and well-separated groups.
To interpret the clusters, we analyze the mean of the raw, unscaled features for each cluster.
# Calculate the mean of the raw features for each cluster
# where(is.numeric) restricts the summary to numeric columns (Cluster is a factor)
cluster_profiles <- data_raw %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), mean, .names = "Mean_{.col}"))
| Feature | Cluster 1 (N=600) | Cluster 2 (N=550) | Cluster 3 (N=400) | Cluster 4 (N=450) |
|---|---|---|---|---|
| Steps (Avg) | High (9500) | Low (3200) | Moderate (6800) | Very High (12000) |
| Intense Ex. (Hrs) | Moderate (2.5) | Low (0.5) | Moderate (1.8) | High (4.0) |
| Junk Food (Freq) | Low (2.0) | High (4.5) | Moderate (3.5) | Very Low (1.5) |
| Sleep (Hrs) | High (7.8) | Low (6.0) | High (7.5) | Moderate (7.0) |
| Stress Score | Low (3.0) | Very High (8.5) | Moderate (5.0) | Low (3.5) |
Based on the quantitative profiling and the Core Health Factors, four distinct lifestyle personas were identified and named:
The methodological decision to cluster on PCA-derived factors rather than raw features significantly enhanced the outcome.
The derived segmentation offers tangible benefits for targeted health strategies:
This research successfully leveraged an unsupervised learning pipeline, PCA combined with K-Means, to dissect the complexity of health behaviors. We identified and quantified four distinct and highly interpretable lifestyle personas. This analysis validates the power of unsupervised methods in generating actionable, data-driven segments from multi-dimensional behavioral data, providing a robust foundation for personalized wellness strategies.
The primary limitations of this study include the reliance on a simulated dataset, which may not fully replicate the noise, complexity, and dependencies present in real-world clinical data. Furthermore, the K-Means algorithm’s assumption of spherical clusters and its sensitivity to the initialization of cluster centers represent inherent methodological limitations. The findings are thus constrained by the quality and nature of the input features.
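The sensitivity to initialization noted above can be probed empirically. The sketch below compares the total within-cluster sum of squares obtained from a single random start with the best of 25 starts, across several seeds. The data `toy`, the helper `wss_for_seed`, and the choice of 20 seeds are assumptions made for illustration and are not part of the study.

```r
# Toy data (an assumption for illustration, not the study data):
# four overlapping two-dimensional Gaussian blobs of 100 points each.
set.seed(7)
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 3), ncol = 2),
  cbind(rnorm(100, mean = 0), rnorm(100, mean = 3)),
  cbind(rnorm(100, mean = 3), rnorm(100, mean = 0))
)

# Hypothetical helper: total within-cluster sum of squares of a 4-cluster
# K-Means fit for a given seed and number of random starts.
wss_for_seed <- function(seed, nstart) {
  set.seed(seed)
  kmeans(toy, centers = 4, nstart = nstart, iter.max = 50)$tot.withinss
}

seeds <- 1:20
wss_single <- sapply(seeds, wss_for_seed, nstart = 1)   # one random start
wss_multi  <- sapply(seeds, wss_for_seed, nstart = 25)  # best of 25 starts

# A wide range under nstart = 1 signals sensitivity to initialization;
# a narrow range under nstart = 25 suggests the solution is stable.
print(range(wss_single))
print(range(wss_multi))
```

The same comparison could be run on data_pca_scores to judge whether nstart = 25, as used in this study, is enough to stabilize the reported clusters.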