Abstract

This project explores the relationship between sleep quality and lifestyle factors using unsupervised learning methods. With data from 374 individuals, we applied Principal Component Analysis (PCA) and clustering to identify patterns. PCA revealed two main components, explaining about 81% of the variance, with stress levels, physical activity, heart rate, and daily steps being the most significant factors. Clustering uncovered three lifestyle profiles: Good Sleepers, Bad Sleepers, and Moderate Sleepers, each defined by distinct combinations of stress, activity, and sleep metrics. These results highlight the importance of physical activity and reduced stress in improving sleep. Our study demonstrates how unsupervised learning can provide actionable insights for addressing real-world health challenges.

Introduction

Sleep takes up about one-third of our lives and plays an important role in calming the nervous system, improving memory, and supporting overall health. Sleep quality affects our daily lives, and in turn, our lifestyle and health habits influence how well we sleep. In this study, we explore the key factors that impact sleep quality and use unsupervised learning techniques to uncover patterns in different lifestyle profiles. We go beyond simple correlations, like the link between stress levels and sleep duration, by including a full health profile (e.g., daily steps, BMI, blood pressure), as well as gender, profession, and physical activity. Our goal is to provide insights that could benefit the health and wellness industries.

Dataset Columns:

Before moving forward, we will break down the variables present in our data set to give us a holistic view of all the metrics we are working with:

Person ID: An identifier for each individual.
Gender: The gender of the person (Male/Female).
Age: The age of the person in years.
Occupation: The occupation or profession of the person.
Sleep Duration: The number of hours the person sleeps per day.
Quality of Sleep: A subjective rating of the quality of sleep, ranging from 1 to 10.
Physical Activity Level: The number of minutes the person engages in physical activity daily.
Stress Level: A subjective rating of the stress level experienced by the person, ranging from 1 to 10.
BMI Category: The BMI category of the person (e.g., Normal, Overweight, Obese).
Blood Pressure: The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.
Heart Rate (bpm): The resting heart rate of the person in beats per minute.
Daily Steps: The number of steps the person takes per day.
Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).

Loading & PreProcessing

Let us take a quick look at our data:

rm(list=ls()) 
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

Project_data <- read_csv("Sleep_health_and_lifestyle_dataset_updated.csv")

## Rows: 374 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Occupation, BMI Category, Blood Pressure, Sleep Disorder
## dbl (8): Person ID, Age, Sleep Duration, Quality of Sleep, Physical Activity...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(Project_data)

## Rows: 374
## Columns: 13
## $ `Person ID`               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ Gender                    <chr> "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Age                       <dbl> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, …
## $ Occupation                <chr> "Software Engineer", "Doctor", "Doctor", "Sa…
## $ `Sleep Duration`          <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8,…
## $ `Quality of Sleep`        <dbl> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6,…
## $ `Physical Activity Level` <dbl> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, …
## $ `Stress Level`            <dbl> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8,…
## $ `BMI Category`            <chr> "Overweight", "Normal", "Normal", "Obese", "…
## $ `Blood Pressure`          <chr> "126/83", "125/80", "125/80", "140/90", "140…
## $ `Heart Rate`              <dbl> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, …
## $ `Daily Steps`             <dbl> 4200, 10000, 10000, 3000, 3000, 3000, 3500, …
## $ `Sleep Disorder`          <chr> "None", "None", "None", "Sleep Apnea", "Slee…

Missing Values

missing_values <- data.frame(
  Column = names(Project_data),
  ProportionMissing = colMeans(is.na(Project_data))
)


ggplot(missing_values, aes(x = Column, y = ProportionMissing)) +
  geom_bar(stat = "identity", fill = "seagreen3", color = 'black') +
  labs(
    title = "Proportion of Missing Values by Column",
    x = "Columns",
    y = "Proportion Missing",
    subtitle = "Figure 1"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)  # Rotate x-axis labels
  )

Feature Engineering

As Figure 1 shows, some columns in our data set contain missing values. Moving forward, we extract the numerical variables and apply mean imputation so that no observations have to be dropped.

After scanning the columns for apparent typos or duplicates, we also correct a labeling inconsistency in the BMI Category column, where both “Normal” and “Normal Weight” are used for the same category.

unique(Project_data$`BMI Category`) #showing mistake in BMI Category

## [1] "Overweight"    "Normal"        "Obese"         "Normal Weight"

Project_data$`BMI Category` <- gsub("Normal Weight", "Normal", Project_data$`BMI Category`)



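# Keep the six numeric lifestyle columns, then replace each NA with its column mean
pca_table <- Project_data[, c(5:8, 11, 12)] %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  mutate(across(everything(), as.numeric))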


colSums(is.na(pca_table))

##          Sleep Duration        Quality of Sleep Physical Activity Level 
##                       0                       0                       0 
##            Stress Level              Heart Rate             Daily Steps 
##                       0                       0                       0
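The mean imputation step can be illustrated on a small toy data frame. This is only a sketch in base R: the values are made up, and `impute_mean` is a hypothetical helper name, not part of the analysis above.

```r
# Toy data frame with made-up values (not taken from the sleep dataset)
toy <- data.frame(
  sleep_hours = c(6.1, NA, 7.8, 5.9),
  heart_rate  = c(77, 75, NA, 85)
)

# Hypothetical helper: replace each NA with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy_imputed <- as.data.frame(lapply(toy, impute_mean))

# Count the remaining missing values per column
colSums(is.na(toy_imputed))
```

Mean imputation keeps every row but shrinks the variance of the imputed column, so it is best suited to columns with only a small share of missing values.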

occupation <- Project_data$Occupation
disorder <- Project_data$`Sleep Disorder`
gender <- Project_data$Gender
sleep <- Project_data$`Sleep Duration`
stress <- Project_data$`Stress Level`
BMIcat <- Project_data$`BMI Category`

Visualization

With missing values and other errors within the data set processed, we can move forward with some descriptive analytics for our data.

Dimension 1: A univariate analysis of our extracted variables. Note that it is essential to scale the columns of our dataset so that they are comparable to one another.

pca_table_long <- as.data.frame(scale(pca_table)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(pca_table_long, aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "seagreen3", color = "black") +
  labs(
    title = "BoxPlot of PCA Variables",
    subtitle = "Figure 2",
    x = "PCA Variables",
    y = "Scaled Values"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.text.y = element_text(size = 10)
  )
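To see what scaling does, here is a minimal sketch with made-up values. After `scale()`, every column has mean 0 and standard deviation 1, so a variable measured in thousands (steps) no longer dwarfs one measured on a 1 to 10 scale (stress).

```r
# Made-up values on very different scales
toy <- data.frame(
  daily_steps  = c(3000, 4200, 8000, 10000),
  stress_level = c(8, 6, 5, 3)
)

# Center each column to mean 0 and rescale it to standard deviation 1
toy_scaled <- scale(toy)

round(colMeans(toy_scaled), 10)  # column means after scaling
apply(toy_scaled, 2, sd)         # column standard deviations after scaling
```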

Dimension 2: A bivariate analysis of our extracted variables in the form of a heat map. Column names are abbreviated for readability:

SlpDr: Sleep Duration
QltySlp: Sleep Quality
PhysAct: Physical Activity Level
StrsLvl: Stress Level
HrtRt: Heart Rate
DlyStp: Daily Steps

corr_table <- pca_table
colnames(corr_table) <- c("SlpDr", "QltySlp", "PhysAct", "StrsLvl", "HrtRt", "DlyStp")

GGally::ggcorr(corr_table, label=T, hjust=0.75,
               low="mediumslateblue", high="seagreen3")+
  labs(title = "Heatmap of PCA Variables",
       subtitle = "Figure 3")

These results align with expectations: longer sleep durations are associated with better sleep quality, and higher physical activity levels go along with higher daily step counts. Strong negative correlations were also observed. Higher stress levels are associated with poorer sleep quality, and higher heart rates correlate with lower sleep quality. The mean imputation and scaling steps above help ensure that the subsequent PCA and clustering analyses are not driven by missing values or by differences in measurement scale.
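The numbers behind a correlation heat map can be reproduced with base R's `cor()`. The sketch below uses made-up values and assumes pairwise Pearson correlations, which we believe is the `ggcorr` default.

```r
# Made-up values mimicking the direction of the relationships described above
toy <- data.frame(
  sleep_quality = c(6, 4, 7, 8, 5),
  stress_level  = c(6, 8, 5, 3, 7),
  sleep_hours   = c(6.1, 5.9, 7.8, 8.0, 6.3)
)

# Pairwise Pearson correlation matrix (assumed to match the ggcorr default)
corr_matrix <- cor(toy, method = "pearson", use = "pairwise.complete.obs")

round(corr_matrix, 2)
```

In this toy example the stress and sleep quality entry is negative and the sleep duration and sleep quality entry is positive, the same signs reported for the real data.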

Methods & Analysis

Principal Component Analysis

Using Principal Component Analysis (PCA), we identified the most significant factors influencing sleep quality and lifestyle patterns within our dataset. The analysis revealed that the first two principal components accounted for approximately 81% of the total variance, with Component 1 explaining 51.8% and Component 2 an additional 29.2%, as shown in the PCA variance table. These components provide a compact summary of the relationships among the variables in our dataset.

pca_result <- prcomp(scale(pca_table))
summary(pca_result)

## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5    PC6
## Standard deviation     1.763 1.3230 0.8095 0.48942 0.39053 0.3070
## Proportion of Variance 0.518 0.2917 0.1092 0.03992 0.02542 0.0157
## Cumulative Proportion  0.518 0.8097 0.9190 0.95888 0.98430 1.0000

fviz_eig(pca_result, addlabels = TRUE, main = "Explained Variance by PCA Components")+
  geom_bar(stat = "identity", fill = "seagreen3", color = "black") +
  labs(subtitle = "Figure 4")+
  theme_minimal()
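The proportions in the variance table follow directly from the reported standard deviations: each component's variance is its squared standard deviation, divided by the total. The sketch below re-enters the rounded values printed by `summary(pca_result)`, so small rounding differences are expected.

```r
# Standard deviations copied from the summary(pca_result) output above
sdev <- c(PC1 = 1.763, PC2 = 1.3230, PC3 = 0.8095,
          PC4 = 0.48942, PC5 = 0.39053, PC6 = 0.3070)

variance <- sdev^2                   # variance captured by each component
prop_var <- variance / sum(variance) # share of the total variance
cum_var  <- cumsum(prop_var)         # running total across components

round(rbind(prop_var, cum_var), 4)
```

Because the six input variables were standardized, the variances sum to roughly 6, one unit per variable.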

Component 1 was strongly associated with stress level, heart rate, sleep quality, and duration, indicating that these variables are interrelated and critical determinants of sleep health. Component 2, on the other hand, was dominated by physical activity level and daily steps, reflecting their shared influence on overall health and lifestyle. The inverse relationship between sleep quality and stress levels was evident, with higher stress levels corresponding to poorer sleep outcomes. Additionally, physical activity demonstrated a positive association with sleep quality, underscoring its importance in promoting better sleep health.

p1 <- fviz_contrib(pca_result, choice = "var", axes = 1) +
  ggtitle("Contributions to PC1") +
  geom_bar(stat = "identity", fill = "seagreen3", color = "black") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
  labs(x = "Column Name")

p2 <- fviz_contrib(pca_result, choice = "var", axes = 2) +
  ggtitle("Contributions to PC2") +
  geom_bar(stat = "identity", fill = "seagreen3", color = "black") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
  labs(x = "Column Name")

grid.arrange(p1, p2, ncol = 2)

fviz_pca_var(pca_result, col.var = "contrib")+
  labs(title = 'PCA Variable Bi-Plot',
    subtitle = 'Figure 5')+
  scale_color_gradient(low="mediumslateblue", high="seagreen3") +
  theme_minimal()
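The contribution plots can be read as squared loadings expressed as percentages. The sketch below assumes that `fviz_contrib` follows this usual definition for a single component, and it uses four columns of the built-in `mtcars` data as a stand-in for our dataset.

```r
# Stand-in data: four numeric columns from the built-in mtcars dataset
X <- scale(mtcars[, c("mpg", "hp", "wt", "qsec")])
pca <- prcomp(X)

# Each column of the rotation matrix has unit length, so the squared
# loadings of one component sum to 1; multiply by 100 for percentages
contrib <- pca$rotation^2 * 100

round(contrib[, 1:2], 1)  # contribution of each variable to PC1 and PC2
colSums(contrib)          # each component's contributions sum to 100
```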

To further investigate the patterns identified by PCA, we examined the distribution of respondents by occupation in the PCA space. This visualization revealed distinct trends: occupations such as nurses and doctors cluster closer together, likely due to similar stress levels and irregular sleep patterns. Conversely, professions with lower stress, such as engineers and teachers, were more evenly distributed, suggesting a more balanced lifestyle, as shown in Figure 6.

data.frame(z1=-pca_result$x[,1],z2=pca_result$x[,2]) %>% 
  ggplot(aes(z1,z2,label=occupation, color = stress)) + geom_point(size=0) +
  labs(title="PC Distribution with Occupations",subtitle = 'Figure 6', x="PC1", y="PC2") +
  theme_bw() + 
  scale_color_gradient(low="mediumslateblue", high="seagreen3")+ 
  theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)
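The coordinates plotted in such a figure are the PC scores, that is, the standardized data projected onto the component directions. The sketch below uses the built-in `mtcars` data as a stand-in. It also shows that flipping the sign of a component, as done for PC1 in the plot above, leaves its variance unchanged, since the direction of a principal component is arbitrary.

```r
# Stand-in data: four numeric columns from the built-in mtcars dataset
X <- scale(mtcars[, c("mpg", "hp", "wt", "qsec")])
pca <- prcomp(X)

# Scores are the standardized data multiplied by the rotation matrix
scores <- X %*% pca$rotation
all.equal(scores, pca$x, check.attributes = FALSE)

# Negating a component mirrors the plot but keeps the same spread
flipped_pc1 <- -pca$x[, 1]
c(original = var(pca$x[, 1]), flipped = var(flipped_pc1))
```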

Factor Analysis

With an initial interpretation of what our principal components could represent, we can use factor analysis to check it.

x.f <- factanal(pca_table, 2, scores="Bartlett", rotation="varimax")
x.f$loadings

## 
## Loadings:
##                         Factor1 Factor2
## Sleep Duration           0.865         
## Quality of Sleep         0.977   0.113 
## Physical Activity Level  0.124   0.745 
## Stress Level            -0.877         
## Heart Rate              -0.615         
## Daily Steps                      0.994 
## 
##                Factor1 Factor2
## SS loadings      2.873   1.572
## Proportion Var   0.479   0.262
## Cumulative Var   0.479   0.741

par(mfrow=c(2,1))
barplot(x.f$loadings[,1], names=F, las=2, col="seagreen3", ylim = c(-1, 1))

barplot(x.f$loadings[,2], las=2, col="seagreen3", ylim = c(-1, 1))

Turning to the loading scores of our two factors, Factor 1 shows strong positive loadings for Sleep Duration and Quality of Sleep, a strong negative loading for Stress Level, and a moderate negative loading for Heart Rate. Factor 2 shows strong positive loadings for Physical Activity Level and Daily Steps. These loadings are consistent with our initial interpretation of the principal components, with Factor 1 acting as a Sleep-Stress factor and Factor 2 as a Physical Activity factor.
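The "SS loadings" and "Proportion Var" rows of the factor output can be approximated from the printed loadings. This sketch re-enters the table by hand and assumes the blank entries are 0, because the printout hides loadings below 0.1 in absolute value. The results therefore come out slightly below the reported 2.873 and 1.572.

```r
# Loadings copied from the factanal output above; blanks are assumed to be 0
loadings <- matrix(
  c(0.865, 0.977, 0.124, -0.877, -0.615, 0,      # Factor1
    0,     0.113, 0.745,  0,      0,     0.994), # Factor2
  ncol = 2,
  dimnames = list(
    c("Sleep Duration", "Quality of Sleep", "Physical Activity Level",
      "Stress Level", "Heart Rate", "Daily Steps"),
    c("Factor1", "Factor2")
  )
)

ss_loadings <- colSums(loadings^2)           # sum of squared loadings per factor
prop_var    <- ss_loadings / nrow(loadings)  # share of total variance per factor
communality <- rowSums(loadings^2)           # variance of each variable explained

round(rbind(ss_loadings, prop_var), 3)
round(communality, 3)
```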

Clustering

To identify distinct lifestyle profiles in the dataset, we applied k-means clustering. After experimenting with different numbers of clusters, we selected three clusters based on their interpretability and alignment with the research goals. These clusters represent groups of individuals with varying combinations of lifestyle factors and their influence on sleep quality.

Good Sleepers: The first cluster (green) included individuals with consistently high sleep quality and duration. These individuals had lower stress levels and higher physical activity, which are known to promote better sleep health. This group also exhibited the lowest prevalence of overweight and obese individuals, with 80% maintaining a normal BMI. Sleep disorders were uncommon, although some cases of mild insomnia and sleep apnea were reported.
Bad Sleepers: The second cluster (red) was characterized by the poorest sleep quality and shortest sleep duration. Members of this group exhibited high stress levels and low physical activity. They had the highest proportion of overweight and obese individuals, with 50% falling into these categories. Sleep disorders, particularly insomnia and sleep apnea, were prevalent, highlighting the compounded negative effects of stress and poor physical health on sleep.
Moderate Sleepers: The third cluster (blue) consisted of individuals with average sleep quality and duration. They exhibited moderate levels of physical activity and stress, placing them between the extremes of the other two groups. However, a significant portion of this group was overweight (74%), which may have impacted their sleep quality. Rates of severe sleep disorders were lower compared to “Bad Sleepers,” suggesting that moderate adjustments could improve their sleep health.
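A common way to sanity-check a chosen number of clusters is to compare the total within-cluster sum of squares across candidate values of k (the "elbow" heuristic). This is not the procedure used above, where k was chosen for interpretability. It is only a sketch, with the built-in `iris` measurements standing in for our data.

```r
set.seed(112)

# Stand-in data: the four numeric iris columns, standardized
X <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) {
  kmeans(X, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where adding clusters stops reducing wss substantially
data.frame(k = 1:6, wss = round(wss, 1))
```

For k = 1 the value equals the total sum of squares of the standardized data, (n - 1) times the number of columns.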

To clearly visualize the distribution of our subjects across the clusters, we added a new column to our data set that bins Sleep Duration into categories: G denotes “Good Sleepers” (7 to 10 hours), M denotes “Moderate Sleepers” (6 to under 7 hours), and B denotes “Bad Sleepers” (under 6 hours).
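The binning rule can be sketched on a few made-up durations. Here we use base R's `cut()` as a compact alternative to nested `ifelse()` calls. The thresholds follow the categories described above, and the sample values are hypothetical.

```r
# Made-up sleep durations, including values that sit exactly on a threshold
durations <- c(5.9, 6.0, 6.9, 7.0, 8.5, 10)

# right = FALSE gives the intervals [1, 6), [6, 7), [7, 10];
# include.lowest = TRUE closes the last interval so that 10 is kept
sleep_cat <- cut(
  durations,
  breaks = c(1, 6, 7, 10),
  labels = c("B", "M", "G"),
  right = FALSE,
  include.lowest = TRUE
)

data.frame(durations, sleep_cat)
```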

set.seed(112)

Project_data$SleepCat <- ifelse(pca_table$`Sleep Duration` >= 1 & pca_table$`Sleep Duration` <6 , "B",
                           ifelse(pca_table$`Sleep Duration` >= 6 & pca_table$`Sleep Duration` <7, "M",
                                  ifelse(pca_table$`Sleep Duration` >= 7 & pca_table$`Sleep Duration` <= 10, "G", NA)))


sleepqual <- Project_data$SleepCat


k = 3 
fit = kmeans(scale(pca_table), k, nstart=1000) 


fviz_cluster(fit, data=pca_table, geom=c("point"),
             ellipse.type = "norm") + theme_minimal()  + labs(title = "Cluster plot for Sleep Quality", subtitle = "Figure 7") + geom_text(label = sleepqual)

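# Standardize first, as for k-means, so that Daily Steps does not dominate the distances
distances <- dist(scale(pca_table), method = "euclidean")
hc <- hclust(distances, method = "ward.D2")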

hc$labels = occupation
fviz_dend(x = hc,
          k = 3,
          rect = TRUE,
          rect_fill = TRUE,
          palette = "jco",
          rect_border = "jco",
          show_labels = FALSE)+
  labs(subtitle = "Figure 8")
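A dendrogram shows the merge structure, but the group memberships themselves come from cutting the tree. The sketch below does this with `cutree()`, again using the built-in `iris` measurements as stand-in data.

```r
# Stand-in data: the four numeric iris columns, standardized
X <- scale(iris[, 1:4])

# Ward hierarchical clustering on Euclidean distances
hc <- hclust(dist(X, method = "euclidean"), method = "ward.D2")

# Cut the tree into three groups and count the members of each
groups <- cutree(hc, k = 3)
table(groups)
```

The resulting labels can be cross-tabulated against k-means assignments with `table()` to see how closely the two methods agree.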

The distribution of these three clusters, as visualized in the PCA cluster plot (Figure 7), highlights clear separations between the groups. Good Sleepers (G) are positioned predominantly in the positive regions of the first principal component (Dim1), reflecting their healthier lifestyle patterns with high physical activity and lower stress levels. In contrast, Bad Sleepers (B) are tightly clustered in the negative regions of both principal components, showcasing the compounding effects of high stress, low physical activity, and poor sleep quality. Moderate Sleepers (M) occupy an intermediate position, bridging the characteristics of the other two groups. This transitional nature of Moderate Sleepers emphasizes the potential for interventions, such as targeted physical activity programs or stress management techniques, to shift them toward the healthier profile of Good Sleepers. The plot also reveals overlaps between Moderate Sleepers and the other two clusters, indicating shared traits with both extremes. This reinforces the importance of considering individual variability within each group when designing sleep and lifestyle interventions. The compactness of Bad Sleepers in the plot further underscores the critical need for addressing their lifestyle challenges, including promoting weight management strategies and alleviating stress to improve their sleep outcomes.

By integrating insights from BMI, sleep duration, and clustering patterns, this analysis provides a comprehensive understanding of the interplay between lifestyle factors and sleep health. It offers a foundation for developing targeted, cluster-specific interventions aimed at enhancing overall well-being and promoting healthier sleep behaviors across diverse populations.

Conclusion

This study provides meaningful insights into the multifaceted relationship between lifestyle factors and sleep quality, offering actionable pathways to enhance public health. The clustering results, combined with BMI and sleep disorder data summarized in Table 1, reveal distinct profiles of sleep quality across diverse groups. Good Sleepers, representing the majority of the dataset (227 individuals), sleep for seven or more hours and exhibit a healthier profile, with 80% maintaining a normal BMI. Although a small subset of this group experiences sleep apnea (39 cases) or insomnia (12 cases), their overall lifestyle promotes better sleep health. In contrast, Bad Sleepers, a much smaller group of only six individuals, have severely compromised sleep, with durations under six hours, and half of them being overweight or obese. Notably, this group shows the highest prevalence of sleep disorders relative to its size, indicating a compounded impact of stress and sedentary behavior on sleep quality. Moderate Sleepers, comprising 141 individuals, present an intermediate profile with sleep durations of 6 to 7 hours. However, the high percentage of overweight individuals (74%) in this group suggests that BMI plays a significant role in influencing sleep health, even when other factors like stress and physical activity are moderate.

flattened_summary <- Project_data %>%
  group_by(SleepCat) %>%
  summarise(
    total = n(),
    Occupation_Counts = paste(names(table(Occupation)), table(Occupation),sep = ": ", collapse = "|"),
    Sleep_Disorder_Counts = paste(names(table(`Sleep Disorder`)), table(`Sleep Disorder`), sep = ": ", collapse = "|"),
    BMI_Category_Counts = paste(names(table(`BMI Category`)), table(`BMI Category`),sep = ": ", collapse = "|")
  )


flattened_summary

## # A tibble: 3 × 5
##   SleepCat total Occupation_Counts     Sleep_Disorder_Counts BMI_Category_Counts
##   <chr>    <int> <chr>                 <chr>                 <chr>              
## 1 B            6 Nurse: 1|Sales Repre… Insomnia: 1|Sleep Ap… Obese: 3|Overweigh…
## 2 G          227 Accountant: 31|Docto… Insomnia: 12|None: 1… Normal: 181|Obese:…
## 3 M          141 Accountant: 6|Doctor… Insomnia: 64|None: 4… Normal: 35|Obese: …
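The percentages quoted in the text can be recomputed from the counts in this table. The sketch below re-enters only the values that are fully visible in the printed output, namely the group totals and the Normal BMI counts for G and M. The B row is left out because its Normal count is truncated in the display.

```r
# Counts copied from the visible part of the summary table above
counts <- data.frame(
  SleepCat   = c("G", "M"),
  total      = c(227, 141),
  normal_bmi = c(181, 35)
)

# Share of each group with a normal BMI, in percent
counts$pct_normal <- round(100 * counts$normal_bmi / counts$total, 1)

counts
```

This gives about 79.7% for Good Sleepers, matching the 80% quoted in the text, and about 24.8% for Moderate Sleepers.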

The findings emphasize the importance of addressing multiple lifestyle dimensions to improve sleep health. Stress management strategies, especially for high-stress professions, are critical for mitigating the impact of prolonged stress on sleep. Similarly, promoting physical activity among sedentary individuals can address the dual challenges of improving sleep duration and mitigating the risk of obesity, as seen in the Bad Sleepers and Moderate Sleepers clusters. These interventions, when tailored to specific demographic and occupational groups, can have a profound impact on public health outcomes.

The inclusion of BMI and sleep disorder data in Table 1 also highlights the need for a holistic approach to improving sleep health. While Good Sleepers maintain a healthier balance, the presence of sleep apnea in this group suggests the importance of addressing physiological factors that persist despite healthy habits. Moderate and Bad Sleepers, on the other hand, require integrated strategies that tackle both behavioral and physiological issues, such as promoting healthier diets and sleep hygiene routines.

This research underscores the power of unsupervised learning techniques, such as PCA and clustering, in revealing actionable insights from complex health data. By simplifying and organizing the data into meaningful components, these methods allow researchers to uncover patterns that are not immediately apparent, driving more effective public health interventions. Future research could extend these findings by exploring longitudinal trends, examining how these clusters evolve over time, and integrating additional variables, such as mental health metrics or socioeconomic factors, to develop more comprehensive strategies for sleep improvement. Such advancements would not only enhance the understanding of sleep health but also contribute to broader public health initiatives.

For the Future

Further plans: Building on our current findings, we plan to expand our research by adding more data and exploring new aspects of sleep and lifestyle. One of our next steps is to use longitudinal data to study how sleep quality and lifestyle factors change over time. This will give us a deeper understanding of these relationships. We also aim to include additional variables, such as mental health, dietary habits, and socioeconomic status, to create a more detailed analysis.

While we have already applied hierarchical clustering, we plan to refine and compare its results with other advanced methods to ensure we capture the most meaningful patterns in the data. Additionally, we aim to use machine learning models to predict sleep quality based on lifestyle factors, which could help create personalized recommendations for improving sleep.

Lastly, we want to apply these findings in practical ways by collaborating with public health programs. This could include developing workplace wellness initiatives or community projects to raise awareness and promote better sleep. These steps will help us build on our current work and contribute to a better understanding of sleep health and its impact on overall well-being.

Unsupervised Learning Analysis on Sleep Health and Lifestyle Dataset

Jeffrey Fernandez

2024-11-15