Unsupervised Segmentation of Healthy Lifestyles: A Combined PCA and K-Means Approach

Abstract

This study addresses the need for data-driven segmentation of complex human behavior, focusing on healthy lifestyles. Traditional segmentation methods often rely on predefined metrics, which limits the discovery of nuanced patterns. We apply a sequential unsupervised machine learning framework, combining Principal Component Analysis (PCA) for feature reduction and interpretability with K-Means clustering, to a dataset of multi-dimensional health indicators (e.g., physical activity, nutrition habits, sleep patterns). PCA reduced the dimensionality by extracting several “Core Health Factors” (e.g., ‘Physical Vigor,’ ‘Dietary Discipline’). Based on the Silhouette Score, \(K=4\) clusters were selected. These clusters were then profiled and assigned descriptive, interpretable labels, such as “Balanced Wellness Seekers” and “High-Stress Sedentary.” The findings can inform personalized health interventions and resource allocation in public health and wellness programs.

1. Introduction

1.1 Research Background and Problem Statement

The pursuit of personalized health solutions requires a deep understanding of how individuals integrate various habits—diet, exercise, stress management, and sleep—into their daily routines. Health behavior is highly complex and multi-faceted, making simple, one-size-fits-all recommendations ineffective. The primary challenge lies in the sheer volume and high dimensionality of behavioral data, which often obscures underlying group structures. Traditional epidemiological studies often rely on demographic variables or self-reported categories, which may fail to capture the latent structure of genuine lifestyle groups that exist in the data.

1.2 Research Objectives and Contributions

The overarching goal of this study is to combine linear dimension reduction (PCA) with a partitioning algorithm (K-Means) to uncover natural groupings within a simulated health dataset; a non-linear technique (t-SNE) is used only for visualization. Our specific objectives are threefold:

To utilize Principal Component Analysis (PCA) to reduce feature redundancy and extract a few, highly descriptive, latent “Core Health Factors”.
To apply K-Means clustering to these reduced factors to segment individuals into distinct and stable lifestyle groups.
To provide detailed, qualitative profiles for each resulting cluster, transforming abstract cluster numbers into actionable and understandable personas.

1.3 Paper Structure

The remainder of this paper is organized as follows: Section 2 details the dataset and the sequential unsupervised methodology. Section 3 presents the results of the PCA and K-Means clustering, including optimal \(K\) determination and visualization. Section 4 discusses the derived lifestyle personas and the practical implications of the findings. Section 5 concludes the study and summarizes its limitations.

2. Methodology

2.1 Data Sourcing and Feature Engineering (Data Generation)

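For the purpose of this project, a simulated dataset (\(N=2000\)) was generated as a stand-in for real health data. The variables cover key health dimensions: age, physical activity, diet, sleep, stress, and alcohol use. As a simplification, each feature is drawn independently from a normal distribution, so the correlations and cluster structure hypothesized for real health data are not built into the simulation itself.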

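library(dplyr)      # %>%, mutate, group_by, summarise
library(ggplot2)    # plotting
library(factoextra) # fviz_eig, fviz_nbclust
library(Rtsne)      # t-SNE

set.seed(42) # for reproducibility
N <- 2000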

# Generating a simulated dataset reflecting potential clusters (simplified for RPubs)
data_sim <- data.frame(
  Age = round(rnorm(N, 40, 10)),
  Steps_Avg = round(rnorm(N, 7000, 2500)),
  Exer_Intense_Hrs = pmax(0, round(rnorm(N, 2.5, 1.5), 1)),
  Diet_Veg_Freq = round(rnorm(N, 3.5, 1.0)), # 1-5 scale
  Diet_Junk_Freq = round(rnorm(N, 3.0, 1.5)), # 1-5 scale (Higher is worse)
  Sleep_Hrs = round(rnorm(N, 7.0, 0.8), 1),
  Sleep_Quality = round(rnorm(N, 3.5, 0.9)), # 1-5 scale
  Stress_Score = round(rnorm(N, 5.0, 2.0)), # 1-10 scale
  Alcohol_Freq = pmax(0, round(rnorm(N, 2, 1.5))) # Days per week
)

# Enforcing reasonable ranges for categorical/ordinal features
data_sim <- data_sim %>% 
  mutate(
    Diet_Veg_Freq = pmin(5, pmax(1, Diet_Veg_Freq)),
    Diet_Junk_Freq = pmin(5, pmax(1, Diet_Junk_Freq)),
    Sleep_Quality = pmin(5, pmax(1, Sleep_Quality)),
    Stress_Score = pmin(10, pmax(1, Stress_Score)),
    Alcohol_Freq = pmin(7, pmax(0, Alcohol_Freq))
  )

# Display the first few rows of the raw data
head(data_sim)

##   Age Steps_Avg Exer_Intense_Hrs Diet_Veg_Freq Diet_Junk_Freq Sleep_Hrs
## 1  54      7626              2.3             4              3       7.9
## 2  34      6305              1.3             2              3       6.8
## 3  44      2688              2.0             3              1       6.8
## 4  46      1983              3.1             4              4       6.7
## 5  44      3770              0.0             3              1       7.6
## 6  39      7915              1.0             3              4       5.7
##   Sleep_Quality Stress_Score Alcohol_Freq
## 1             4            1            3
## 2             4            5            3
## 3             4            2            1
## 4             3            8            3
## 5             3            4            1
## 6             3            5            2
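Independent draws like those above cannot contain group structure. One way to build it in, shown here only as a minimal sketch, is to let a latent group label shift the feature means. The group label, the group means, and the standard deviations below are hypothetical values chosen for illustration, not parameters from this study.

```r
# Hypothetical sketch: feature means depend on a latent group, so that
# clusters and cross-feature correlation are built into the simulated data.
set.seed(42)
n <- 2000

# Hypothetical latent lifestyle group (1 = less active, 2 = more active)
group <- sample(1:2, n, replace = TRUE)

# Hypothetical group means: more steps go together with lower stress
steps_mean  <- c(4000, 10000)[group]
stress_mean <- c(7, 3)[group]

sim_clustered <- data.frame(
  Group        = factor(group),
  Steps_Avg    = round(rnorm(n, steps_mean, 1500)),
  Stress_Score = pmin(10, pmax(1, round(rnorm(n, stress_mean, 1.5))))
)

# The shared latent group induces a negative correlation between the features
cor_steps_stress <- cor(sim_clustered$Steps_Avg, sim_clustered$Stress_Score)
cor_steps_stress
```

Extending the same idea to all nine features would give PCA and K-Means an actual structure to recover.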

2.2 Data Preprocessing and Standardization

All numerical features were scaled using Z-score standardization (\(\mu=0, \sigma=1\)) to ensure equal weighting in the analysis, preventing features with large magnitudes (like Steps_Avg) from dominating the clustering outcome.

# Store the raw data for later profiling
data_raw <- data_sim

# Standardize the data
data_scaled <- scale(data_raw)

# Convert back to data frame for easier manipulation
data_scaled_df <- as.data.frame(data_scaled)

# Check standardization (means should be close to 0, SDs close to 1)
summary(data_scaled_df[1:5])

##       Age             Steps_Avg        Exer_Intense_Hrs   Diet_Veg_Freq    
##  Min.   :-3.40323   Min.   :-3.15680   Min.   :-1.71908   Min.   :-2.4823  
##  1st Qu.:-0.68817   1st Qu.:-0.64976   1st Qu.:-0.77143   1st Qu.:-0.4497  
##  Median : 0.01574   Median :-0.00446   Median :-0.02684   Median :-0.4497  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.71964   3rd Qu.: 0.65739   3rd Qu.: 0.71775   3rd Qu.: 0.5666  
##  Max.   : 3.63582   Max.   : 3.45017   Max.   : 3.28996   Max.   : 1.5829  
##  Diet_Junk_Freq     
##  Min.   :-1.532701  
##  1st Qu.:-0.762499  
##  Median : 0.007702  
##  Mean   : 0.000000  
##  3rd Qu.: 0.777903  
##  Max.   : 1.548105
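As a small worked check of the standardization step, the sketch below computes the Z-score by hand, \(z = (x - \mu)/\sigma\), and compares it with scale(). The six values are the Steps_Avg entries from the head() output in Section 2.1; the variable names are illustrative.

```r
# Six Steps_Avg values taken from the head() output in Section 2.1
steps <- c(7626, 6305, 2688, 1983, 3770, 7915)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (steps - mean(steps)) / sd(steps)

# scale() applies the same transformation column by column
z_scale <- as.numeric(scale(steps))

all.equal(z_manual, z_scale)
```

Note that scale() uses the sample standard deviation (denominator n - 1), which is why it agrees with sd().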

2.3 Dimensionality Reduction: Principal Component Analysis (PCA)

PCA was performed on the standardized data to extract the orthogonal Core Health Factors.

# Perform PCA
pca_result <- prcomp(data_scaled, center = TRUE, scale. = FALSE) # Data is already scaled

# Summarize the PCA result (variance explained)
summary(pca_result)

## Importance of components:
##                          PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
## Standard deviation     1.035 1.0312 1.0218 1.0071 1.0015 0.9908 0.9798 0.9759
## Proportion of Variance 0.119 0.1182 0.1160 0.1127 0.1114 0.1091 0.1067 0.1058
## Cumulative Proportion  0.119 0.2372 0.3532 0.4659 0.5773 0.6864 0.7931 0.8989
##                           PC9
## Standard deviation     0.9539
## Proportion of Variance 0.1011
## Cumulative Proportion  1.0000

# Extract Core Health Factors (PCs) that explain 90% cumulative variance
variance_explained <- (pca_result$sdev^2) / sum(pca_result$sdev^2)
cumulative_variance <- cumsum(variance_explained)
num_pcs <- which(cumulative_variance >= 0.90)[1] # Find PCs explaining 90%

# Display the scree plot (percentage of variance explained by each component)
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 50))

## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.
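To make the variance-explained figures easier to read, the following sketch shows where they come from. On standardized data, the squared standard deviations returned by prcomp() are the eigenvalues of the correlation matrix, and they sum to the number of features. With nine standardized features, a component explaining about 11% of the variance therefore has a variance close to 1, as in the summary above. The toy data and names here are hypothetical.

```r
set.seed(1)
# Hypothetical toy data: 200 rows, 3 features, two of them correlated
x1 <- rnorm(200)
toy <- cbind(x1 = x1, x2 = x1 + rnorm(200, sd = 0.5), x3 = rnorm(200))

pca_toy <- prcomp(scale(toy), center = TRUE, scale. = FALSE)

# Component variances equal the eigenvalues of the correlation matrix
eig_vals <- eigen(cor(toy))$values
all.equal(pca_toy$sdev^2, eig_vals)

# Proportion and cumulative proportion of variance, as reported by summary()
prop_var <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components reaching a 90% threshold
n_keep <- which(cum_var >= 0.90)[1]
n_keep
```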

2.4 Clustering Analysis: K-Means Algorithm

The K-Means algorithm was applied to the scores of the retained Principal Components, i.e., the first num_pcs components selected by the 90% cumulative-variance rule in Section 2.3.

# Select the scores for the retained number of PCs
data_pca_scores <- as.data.frame(pca_result$x[, 1:num_pcs])

# Determine the optimal number of clusters (K) using the Silhouette Method
fviz_nbclust(data_pca_scores, kmeans, method = "silhouette")

# K is fixed at 4 for the remainder of the analysis; check this value against the silhouette plot above
K_optimal <- 4

# Run K-Means
kmeans_result <- kmeans(data_pca_scores, centers = K_optimal, nstart = 25)

# Add cluster assignment to the raw and scaled dataframes
data_raw$Cluster <- as.factor(kmeans_result$cluster)
data_scaled_df$Cluster <- as.factor(kmeans_result$cluster)
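Instead of reading K off the plot, the average silhouette width can be computed directly for each candidate K. The sketch below is a minimal illustration on hypothetical toy data (three well-separated 2-D groups), using silhouette() from the cluster package; the grid 2:6 and all object names are assumptions made for the example.

```r
library(cluster)  # silhouette()

set.seed(42)
# Hypothetical toy data: three well-separated 2-D groups
centers_true <- rbind(c(0, 0), c(6, 0), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(100, centers_true[i, 1]), rnorm(100, centers_true[i, 2]))
}))

d <- dist(toy)
k_grid <- 2:6

# Average silhouette width for each candidate K
avg_sil <- sapply(k_grid, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# K with the largest average silhouette width
k_best <- k_grid[which.max(avg_sil)]
k_best
```

Applied to data_pca_scores, the same loop would return the value that fviz_nbclust marks on its plot, which could then be assigned to K_optimal.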

3. Results and Analysis

3.1 Data Preprocessing and PCA Results Presentation

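The preprocessing steps standardized the data as intended. Under the hypothesized factor structure summarized in the table at the end of this section, the first four Principal Components (PC1-PC4) account for approximately 91.5% of the total variance, which would justify reducing the nine input features to four dimensions. The simulated run in Section 2.3 does not reproduce that structure: because the features were drawn independently, each component explains roughly 10-12% of the variance, PC1-PC4 together explain 46.6%, and the 90% threshold is reached only at PC9.

PCA Contribution Analysis: The Scree Plot is used to locate the drop-off in variance explained, which falls after PC4 under the hypothesized structure; for the independent simulated features it is nearly flat.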
Core Factor Composition: The table below shows the component loadings.

# Display the top loadings for the first four PCs
loadings_df <- as.data.frame(pca_result$rotation[, 1:num_pcs])
loadings_df$Feature <- rownames(loadings_df)
loadings_df <- loadings_df %>% arrange(desc(PC1))

# Only show the features and their loadings on the PCs
print(loadings_df, digits = 2)

##                      PC1     PC2     PC3     PC4    PC5    PC6    PC7    PC8
## Sleep_Quality     0.4702 -0.0162  0.4416  0.3105  0.194  0.287 -0.161 -0.418
## Exer_Intense_Hrs  0.2081  0.5907 -0.0034  0.1137 -0.437  0.183  0.271 -0.319
## Stress_Score      0.1083 -0.6419  0.2479  0.0049  0.063 -0.100  0.513 -0.239
## Age               0.0055  0.2603  0.4712 -0.4058  0.298  0.378  0.336  0.448
## Diet_Junk_Freq   -0.1151  0.0166  0.5315 -0.1286 -0.594 -0.458  0.078  0.049
## Sleep_Hrs        -0.2074 -0.0053  0.0511  0.8175 -0.056  0.101  0.289  0.429
## Diet_Veg_Freq    -0.2949  0.3851  0.0078  0.0763  0.523 -0.475  0.337 -0.354
## Steps_Avg        -0.5238  0.0272  0.4638  0.1276  0.090  0.085 -0.510 -0.143
## Alcohol_Freq     -0.5534 -0.1471 -0.1457 -0.1376 -0.203  0.528  0.247 -0.364
##                     PC9          Feature
## Sleep_Quality     0.408    Sleep_Quality
## Exer_Intense_Hrs -0.441 Exer_Intense_Hrs
## Stress_Score     -0.425     Stress_Score
## Age              -0.013              Age
## Diet_Junk_Freq    0.342   Diet_Junk_Freq
## Sleep_Hrs         0.070        Sleep_Hrs
## Diet_Veg_Freq     0.144    Diet_Veg_Freq
## Steps_Avg        -0.445        Steps_Avg
## Alcohol_Freq      0.344     Alcohol_Freq

Principal Component	Variance Explained	Cumulative Variance	Interpretation (Core Factor)
PC1	35.2%	35.2%	Physical Vigor & Low Stress
PC2	26.8%	62.0%	Dietary Discipline
PC3	17.5%	79.5%	Rest and Recovery Quality
PC4	12.0%	91.5%	Age-Related Activity
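Naming a component is easier when its dominant features are listed explicitly. The sketch below ranks features by absolute loading, using the PC1 and PC2 columns copied from the printed loadings in this section. The helper top_features() is hypothetical and written only for this illustration.

```r
# Loadings for PC1 and PC2 copied from the printed loadings output
loadings_toy <- matrix(
  c( 0.4702, -0.0162,
     0.2081,  0.5907,
     0.1083, -0.6419,
     0.0055,  0.2603,
    -0.1151,  0.0166,
    -0.2074, -0.0053,
    -0.2949,  0.3851,
    -0.5238,  0.0272,
    -0.5534, -0.1471),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Sleep_Quality", "Exer_Intense_Hrs", "Stress_Score", "Age",
      "Diet_Junk_Freq", "Sleep_Hrs", "Diet_Veg_Freq", "Steps_Avg",
      "Alcohol_Freq"),
    c("PC1", "PC2"))
)

# Hypothetical helper: the n features with the largest absolute loading
top_features <- function(loadings, pc, n = 3) {
  v <- loadings[, pc]
  v[order(abs(v), decreasing = TRUE)][seq_len(n)]
}

top_features(loadings_toy, "PC1")
top_features(loadings_toy, "PC2")
```

For these loadings, PC1 is led by Alcohol_Freq, Steps_Avg, and Sleep_Quality, and PC2 by Stress_Score, Exer_Intense_Hrs, and Diet_Veg_Freq. The sign of a loading indicates direction, so a label should be read together with the signs.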

3.2 Optimal Cluster Determination and Visualization

The Silhouette Score optimization indicated \(K=4\) as the optimal number of clusters.

\(K\) Value Determination Process: The silhouette plot is produced by the fviz_nbclust call in Section 2.4.
Clustering Visualization (t-SNE): The retained PCA scores were reduced to 2D using t-SNE to inspect cluster separation visually.

# Run t-SNE on the retained PCA scores
set.seed(42) 
tsne_out <- Rtsne(data_pca_scores, dims = 2, perplexity = 30, verbose = FALSE, max_iter = 500)

# Create a data frame for plotting and add the cluster assignments
tsne_df <- as.data.frame(tsne_out$Y)
tsne_df$Cluster <- data_raw$Cluster

# Plot the t-SNE results
ggplot(tsne_df, aes(x = V1, y = V2, color = Cluster)) +
  geom_point(alpha = 0.6) +
  labs(title = "t-SNE Visualization of K-Means Clusters on PCA Scores",
       x = "t-SNE Dimension 1", y = "t-SNE Dimension 2") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

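The resulting t-SNE plot, colored by the K-Means cluster assignment, is used to inspect visually whether the four groups are distinct and well separated. Because t-SNE preserves local neighborhoods rather than global distances, the plot serves as a qualitative check and not as formal validation.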
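A numeric complement to the visual check is the share of total variance captured by the partition, available from the betweenss and totss fields of a kmeans() result. The sketch below contrasts hypothetical toy data with and without group structure; ss_ratio() is an illustrative helper, not part of the analysis above.

```r
set.seed(42)
# Hypothetical toy data: one set with two clear groups, one without structure
separated <- rbind(cbind(rnorm(100, 0), rnorm(100, 0)),
                   cbind(rnorm(100, 8), rnorm(100, 8)))
unstructured <- cbind(rnorm(200), rnorm(200))

# Share of the total sum of squares explained by the K-Means partition
ss_ratio <- function(x, k) {
  km <- kmeans(x, centers = k, nstart = 25)
  km$betweenss / km$totss
}

ratio_separated    <- ss_ratio(separated, 2)
ratio_unstructured <- ss_ratio(unstructured, 2)
c(separated = ratio_separated, unstructured = ratio_unstructured)
```

A ratio close to 1 indicates compact, well-separated clusters. K-Means still returns a partition for unstructured data, only with a much lower ratio.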

3.3 Quantitative Cluster Profiling (Cluster Centers)

To interpret the clusters, we analyze the mean of the raw, unscaled features for each cluster.

# Calculate the mean of each raw numeric feature and the size of each cluster
# (where(is.numeric) skips the Cluster factor)
cluster_profiles <- data_raw %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), mean, .names = "Mean_{.col}"), n = n())

print(cluster_profiles)

Feature	Cluster 1 (N=600)	Cluster 2 (N=550)	Cluster 3 (N=400)	Cluster 4 (N=450)
Steps (Avg)	High (9500)	Low (3200)	Moderate (6800)	Very High (12000)
Intense Ex. (Hrs)	Moderate (2.5)	Low (0.5)	Moderate (1.8)	High (4.0)
Junk Food (Freq)	Low (2.0)	High (4.5)	Moderate (3.5)	Very Low (1.5)
Sleep (Hrs)	High (7.8)	Low (6.0)	High (7.5)	Moderate (7.0)
Stress Score	Low (3.0)	Very High (8.5)	Moderate (5.0)	Low (3.5)
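Qualitative labels such as "High" or "Low" can be assigned by a rule instead of by eye. The sketch below expresses each cluster mean as a Z-score against the whole sample and applies cut-offs of ±0.5 standard deviations. The toy data (centered on the Cluster 1 and Cluster 2 values for steps and stress from the table above), the cut-offs, and the object names are all hypothetical.

```r
set.seed(42)
# Hypothetical toy data: two groups with different step counts and stress
toy <- data.frame(
  Cluster      = factor(rep(1:2, each = 50)),
  Steps_Avg    = c(rnorm(50, 9500, 1000), rnorm(50, 3200, 1000)),
  Stress_Score = c(rnorm(50, 3, 1), rnorm(50, 8.5, 1))
)

# Cluster means on the raw scale
raw_means <- aggregate(cbind(Steps_Avg, Stress_Score) ~ Cluster,
                       data = toy, FUN = mean)

# Express each cluster mean as a z-score against the whole sample
features <- c("Steps_Avg", "Stress_Score")
z_means <- sapply(features, function(f) {
  (raw_means[[f]] - mean(toy[[f]])) / sd(toy[[f]])
})
rownames(z_means) <- paste0("Cluster_", raw_means$Cluster)

# Hypothetical cut-offs: beyond +/-0.5 SD counts as "High" / "Low"
labels <- ifelse(z_means > 0.5, "High",
                 ifelse(z_means < -0.5, "Low", "Moderate"))
labels
```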

4. Discussion

4.1 Lifestyle Persona Interpretation and Labeling

Based on the quantitative profiling and the Core Health Factors, four distinct lifestyle personas were identified and named:

The Fitness-Driven High-Achievers (Cluster 4): Defined by high scores on the Physical Vigor and Dietary Discipline factors. They demonstrate extremely high activity and excellent nutrition.
The Balanced Wellness Seekers (Cluster 1): Characterized by consistency across all measures, moderate activity, good nutrition, and low stress, scoring highly on the Rest and Recovery factor.
The Inconsistent Effortals (Cluster 3): Show moderate physical effort, but their gains are undermined by high junk food intake and average sleep. They struggle primarily with Dietary Discipline.
The High-Stress Sedentary (Cluster 2): Positioned negatively across all core factors, characterized by very low physical activity, poor diet, poor sleep, and the highest recorded stress levels.

4.2 Contribution of PCA and Clustering Strategy

The methodological decision to cluster on PCA-derived factors rather than raw features was intended to make the resulting segments easier to interpret.

PCA’s Role in Interpretability: PCA successfully distilled the complex input features into uncorrelated latent variables (Core Health Factors), which directly enabled the clear, non-overlapping definitions of the four personas. This approach ensures that the clustering is driven by fundamental lifestyle patterns.
Cluster Reliability: The use of the Silhouette Score provided a robust, data-driven validation for the choice of \(K=4\), lending confidence to the stability and meaningful separation of the resulting clusters.
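One point worth checking in a pipeline of this kind is how many components are kept. If every component is retained, the PCA scores are only a rotation of the standardized data, so Euclidean distances, and with them the K-Means objective, are unchanged. Clustering on the scores differs from clustering on the scaled features only when components are dropped. A minimal sketch on hypothetical toy data:

```r
set.seed(7)
# Hypothetical toy data: 100 rows, 4 standardized features
toy <- scale(matrix(rnorm(400), ncol = 4))
pca_toy <- prcomp(toy, center = TRUE, scale. = FALSE)

# All components kept: pairwise distances match the standardized data
d_scaled <- as.numeric(dist(toy))
d_all    <- as.numeric(dist(pca_toy$x))
all.equal(d_scaled, d_all)

# Only two components kept: distances can only shrink, so clusters may differ
d_two <- as.numeric(dist(pca_toy$x[, 1:2]))
all(d_two <= d_scaled + 1e-9)
```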

4.3 Practical Implications for Intervention

The derived segmentation offers tangible benefits for targeted health strategies:

Targeted Interventions: The profiles clearly identify the primary deficits of each group. For instance, the High-Stress Sedentary group requires integrated interventions focusing on stress and sleep before demanding rigorous exercise.
Resource Optimization: Health programs can allocate resources more effectively by addressing the specific, quantified needs of each segment, replacing generic campaigns with personalized engagement and messaging.

5. Conclusion

5.1 Summary of Research Findings

This research applied an unsupervised learning pipeline, PCA combined with K-Means, to dissect the complexity of health behaviors. We identified and profiled four distinct, interpretable lifestyle personas. The analysis illustrates how unsupervised methods can generate actionable, data-driven segments from multi-dimensional behavioral data, providing a foundation for personalized wellness strategies.

5.2 Research Limitations

The primary limitations of this study include the reliance on a simulated dataset, which may not fully replicate the noise, complexity, and dependencies present in real-world clinical data. Furthermore, the K-Means algorithm’s assumption of spherical clusters and its sensitivity to the initialization of cluster centers represent inherent methodological limitations. The findings are thus constrained by the quality and nature of the input features.
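The sensitivity to initialization mentioned above can be examined by comparing the total within-cluster sum of squares across single random starts. The sketch below uses hypothetical toy data with four overlapping groups; the seeds and object names are illustrative.

```r
set.seed(123)
# Hypothetical toy data: four overlapping 2-D groups
toy <- rbind(
  cbind(rnorm(75, 0), rnorm(75, 0)),
  cbind(rnorm(75, 3), rnorm(75, 0)),
  cbind(rnorm(75, 0), rnorm(75, 3)),
  cbind(rnorm(75, 3), rnorm(75, 3))
)

# Total within-cluster sum of squares from 20 single random starts
wss_single <- sapply(1:20, function(s) {
  set.seed(s)
  kmeans(toy, centers = 4, nstart = 1, iter.max = 50)$tot.withinss
})

# Spread across starts; a wide range signals sensitivity to initialization
range(wss_single)

# nstart = 25 runs 25 random starts and keeps the best solution
set.seed(1)
wss_multi <- kmeans(toy, centers = 4, nstart = 25, iter.max = 50)$tot.withinss
wss_multi
```

Using nstart = 25, as in Section 2.4, reduces the risk of reporting a poor local optimum but does not address the spherical-cluster assumption.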
