Predicting
Quality of Sleep
Using Health and Lifestyle
Factors
Visual Exploratory Data Analysis (EDA)
Distribution of Quality of Sleep (Histogram)
Sleep Duration vs. Quality of Sleep (Scatterplot)
Stress Level vs. Quality of Sleep (Scatterplot)
Sleep Disorder vs. Quality of Sleep (Percent Stacked Bar Chart)
Model Comparison and Evaluation
Compare Boosted Trees and Random Forest Models Across Folds
The purpose of this project is to build a machine learning model that
predicts Quality of Sleep using various health and
lifestyle factors. We will be using the
Sleep Health and Lifestyle Dataset, and carrying out
multiple regression techniques to obtain the most accurate model for
this predictive regression problem.
From previous research, most people know the fundamental role sleep plays in their physical health, cognitive performance, and overall well-being. This project is motivated by the fact that many people, including myself, struggle with getting good sleep, and want to understand if there are ways they can improve their quality of sleep.
The relationships between lifestyle factors such as stress levels, physical activity, and many physiological factors are complex, and it is often difficult to determine which variables have the strongest influence on sleep outcomes. Lucky for us, machine learning models provide a way to examine these relationships simultaneously and identify patterns that may help explain variation in sleep quality across individuals. If we can better understand this variation and the relationships between variables, then we can make changes in our lives to improve our quality of sleep, and thus improve our quality of life.
This project is guided by the following questions:
Are some individuals predisposed to obtaining higher quality sleep?
Are there lifestyle changes that can be made to induce better sleep quality?
Is Quality of Sleep strongly associated with
Sleep Duration?
To answer these questions, we will analyze the continuous response
variable, Quality of Sleep, as a supervised regression
problem. Finding the best model requires us to do some initial data
cleaning and manipulation, then perform some exploratory data analysis
(EDA), split our training and testing data, set folds for 5-fold
cross-validation, and then model the training data using Linear
Regression, Elastic Net Regression, Decision Tree, Random Forest, and
Boosted Tree models. Finally, we will see which of these machine
learning models performs the strongest so we can use it on our testing
data and analyze the effectiveness of that model.
The data set used in this project is synthetic and consists of 374 observations and the following 13 variables (further description in the attached codebook)
Response Variable:
Quality of Sleep (numeric)Predictor Variables:
Gender(categorical/binary)
Age (numeric)
Occupation (categorical)
Sleep Duration (numeric)
Physical Activity Level (numeric)
Stress Level (numeric)
BMI Category (categorical)
Blood Pressure (numeric)
Heart Rate (numeric)
Daily Steps (numeric)
Sleep Disorder (categorical)
Observation Variable:
Person ID (identifier, numbers our observations)So, we have 11 predictor variables that capture a wide range of
health and lifestyle factors that are plausible predictors of
Quality of Sleep (our response variable) in the
Sleep Health and Lifestyle dataset.
(See cleaned predictor variables in attached codebook and formal citation of the dataset in References)
First, let’s load in all of our packages and the raw
Sleep Health and Lifestyle dataset.
# Download necessary pacakages
library(tidymodels)
library(tidyverse)
library(knitr) # to assist in knitting process
library(corrplot) # for the correlation plot
library(ggplot2) # for most of the visualizations
library(tinytex) # to knit LaTex Functions
library(glmnet) # for elastic net regression model
library(ranger) # for random forest model
library(xgboost) # for boosted tree model
library(vip) # for Variable Importance plot
library(plotly)
# Read data CSV and assign it to a variable
sleep_raw <- read_csv("Sleep_health_and_lifestyle_dataset.csv", show_col_types = FALSE)
Now that we’ve loaded in our raw data, let’s explore.
glimpse(sleep_raw)
## Rows: 374
## Columns: 13
## $ `Person ID` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Age <dbl> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, …
## $ Occupation <chr> "Software Engineer", "Doctor", "Doctor", "Sa…
## $ `Sleep Duration` <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8,…
## $ `Quality of Sleep` <dbl> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6,…
## $ `Physical Activity Level` <dbl> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, …
## $ `Stress Level` <dbl> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8,…
## $ `BMI Category` <chr> "Overweight", "Normal", "Normal", "Obese", "…
## $ `Blood Pressure` <chr> "126/83", "125/80", "125/80", "140/90", "140…
## $ `Heart Rate` <dbl> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, …
## $ `Daily Steps` <dbl> 4200, 10000, 10000, 3000, 3000, 3000, 3500, …
## $ `Sleep Disorder` <chr> "None", "None", "None", "Sleep Apnea", "Slee…
names(sleep_raw)
## [1] "Person ID" "Gender"
## [3] "Age" "Occupation"
## [5] "Sleep Duration" "Quality of Sleep"
## [7] "Physical Activity Level" "Stress Level"
## [9] "BMI Category" "Blood Pressure"
## [11] "Heart Rate" "Daily Steps"
## [13] "Sleep Disorder"
dim(sleep_raw)
## [1] 374 13
str(sleep_raw)
## spc_tbl_ [374 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Person ID : num [1:374] 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr [1:374] "Male" "Male" "Male" "Male" ...
## $ Age : num [1:374] 27 28 28 28 28 28 29 29 29 29 ...
## $ Occupation : chr [1:374] "Software Engineer" "Doctor" "Doctor" "Sales Representative" ...
## $ Sleep Duration : num [1:374] 6.1 6.2 6.2 5.9 5.9 5.9 6.3 7.8 7.8 7.8 ...
## $ Quality of Sleep : num [1:374] 6 6 6 4 4 4 6 7 7 7 ...
## $ Physical Activity Level: num [1:374] 42 60 60 30 30 30 40 75 75 75 ...
## $ Stress Level : num [1:374] 6 8 8 8 8 8 7 6 6 6 ...
## $ BMI Category : chr [1:374] "Overweight" "Normal" "Normal" "Obese" ...
## $ Blood Pressure : chr [1:374] "126/83" "125/80" "125/80" "140/90" ...
## $ Heart Rate : num [1:374] 77 75 75 85 85 85 82 70 70 70 ...
## $ Daily Steps : num [1:374] 4200 10000 10000 3000 3000 3000 3500 8000 8000 8000 ...
## $ Sleep Disorder : chr [1:374] "None" "None" "None" "Sleep Apnea" ...
## - attr(*, "spec")=
## .. cols(
## .. `Person ID` = col_double(),
## .. Gender = col_character(),
## .. Age = col_double(),
## .. Occupation = col_character(),
## .. `Sleep Duration` = col_double(),
## .. `Quality of Sleep` = col_double(),
## .. `Physical Activity Level` = col_double(),
## .. `Stress Level` = col_double(),
## .. `BMI Category` = col_character(),
## .. `Blood Pressure` = col_character(),
## .. `Heart Rate` = col_double(),
## .. `Daily Steps` = col_double(),
## .. `Sleep Disorder` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(sleep_raw$`BMI Category`, 10)
## [1] "Overweight" "Normal" "Normal" "Obese" "Obese"
## [6] "Obese" "Obese" "Normal" "Normal" "Normal"
head(sleep_raw$`Sleep Disorder`, 10)
## [1] "None" "None" "None" "Sleep Apnea" "Sleep Apnea"
## [6] "Insomnia" "Insomnia" "None" "None" "None"
This exploration of the raw dataset allows us to better understand
its structure and variables using functions like glimpse(),
names(), dim(), and str() to
examine the number of observations, variable types, and overall layout
of the data. This helped us confirm that the dataset has 374
observations and 13 variables, including numeric and categorical
features. To get a better understanding of the values of specific
variables before cleaning and modeling, we took a closer at
BMI Category and Sleep Disorder.
From exploring the raw dataset, I have determined that none of our variables should be eliminated, but that some must be cleaned so we can use them in our models later on. So, let’s clean our data and convert any categorical variables to factors!
First, let’s convert the categorical variables, Gender,
Occupation, BMI Category, and
Sleep Disorder into factors because they describe
qualitative groups and should not be treated as numeric predictors.
Converting them to factor variables forces R to recognize them as
categorical predictors so they can be transformed into dummy variables
later in our recipe.
# Convert categorical columns to factors
sleep_data <- sleep_raw %>%
mutate(
Gender = factor(Gender),
Occupation = factor(Occupation),
`BMI Category` = factor(`BMI Category`),
`Sleep Disorder` = factor(`Sleep Disorder`)
)
Now, let’s deal with the Blood Pressure variable. In our
raw data, it is stored as a character string in the following form:
“systolic/diastolic”. But, regression models require numeric predictors,
so let’s split this variable into two separate variables,
Systolic and Diastolic blood pressure. This
preserves this health information and allows each component to be used
as numeric and independent in our models.
# Split blood pressure into numeric columns
sleep_data <- sleep_data %>%
separate(`Blood Pressure`, into = c("systolic", "diastolic"), sep = "/", convert = TRUE) %>%
rename(
Systolic = systolic,
Diastolic = diastolic)
Next, Person ID is an observation variable. that
uniquely identifies each observation in the dataset but does not contain
meaningful information about Quality of Sleep. Because it
is numeric, we need to convert it to an integer and keep it only as an
identifier, not a predictor.
# clean Person ID
sleep_data <- sleep_data %>%
mutate(`Person ID` = as.integer(`Person ID`))
Next, we need to tidy the Occupation variable, so let’s
view and evaluate it.
# View and evaluate occupation variable
count(sleep_raw, Occupation, sort = TRUE)
From this tibble, we can see Occupation contains 11
specific job titles, and some only appear a small number of times in the
dataset. To improve interpretability, we will group these occupations
into broader categories.
# group the occupation variable for the cleaned dataset
sleep_data <- sleep_data %>%
mutate(
`Occupation Group` = case_when(
Occupation %in% c("Doctor", "Nurse") ~ "Healthcare",
Occupation %in% c("Software Engineer", "Engineer") ~ "Technology",
Occupation %in% c("Sales Representative", "Salesperson") ~ "Sales",
Occupation %in% c("Teacher") ~ "Education",
TRUE ~ "Other"
),
`Occupation Group` = factor(`Occupation Group`)
) %>%
select(-Occupation)
# view the grouped occupation variable
count(sleep_data, `Occupation Group`)
Now that we’ve cleaned our data, let’s check for any missing data.
# check for missing data after cleaning
missing_table <- sleep_data %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(),
names_to = "Variable",
values_to = "Missing_Count") %>%
mutate(Percent_Missing = Missing_Count / nrow(sleep_raw) * 100)
missing_table
kable(missing_table, caption = "Missing Values by Variable")
| Variable | Missing_Count | Percent_Missing |
|---|---|---|
| Person ID | 0 | 0 |
| Gender | 0 | 0 |
| Age | 0 | 0 |
| Sleep Duration | 0 | 0 |
| Quality of Sleep | 0 | 0 |
| Physical Activity Level | 0 | 0 |
| Stress Level | 0 | 0 |
| BMI Category | 0 | 0 |
| Systolic | 0 | 0 |
| Diastolic | 0 | 0 |
| Heart Rate | 0 | 0 |
| Daily Steps | 0 | 0 |
| Sleep Disorder | 0 | 0 |
| Occupation Group | 0 | 0 |
colSums(is.na(sleep_raw))
## Person ID Gender Age
## 0 0 0
## Occupation Sleep Duration Quality of Sleep
## 0 0 0
## Physical Activity Level Stress Level BMI Category
## 0 0 0
## Blood Pressure Heart Rate Daily Steps
## 0 0 0
## Sleep Disorder
## 0
We have confirmed there are no missing values in the
Sleep Health and Lifestyle dataset, so we can move forward
with building our models!
Before we move on, let’s check our data!
glimpse(sleep_data)
## Rows: 374
## Columns: 14
## $ `Person ID` <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Ma…
## $ Age <dbl> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, …
## $ `Sleep Duration` <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8,…
## $ `Quality of Sleep` <dbl> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6,…
## $ `Physical Activity Level` <dbl> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, …
## $ `Stress Level` <dbl> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8,…
## $ `BMI Category` <fct> Overweight, Normal, Normal, Obese, Obese, Ob…
## $ Systolic <int> 126, 125, 125, 140, 140, 140, 140, 120, 120,…
## $ Diastolic <int> 83, 80, 80, 90, 90, 90, 90, 80, 80, 80, 80, …
## $ `Heart Rate` <dbl> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, …
## $ `Daily Steps` <dbl> 4200, 10000, 10000, 3000, 3000, 3000, 3500, …
## $ `Sleep Disorder` <fct> None, None, None, Sleep Apnea, Sleep Apnea, …
## $ `Occupation Group` <fct> Technology, Healthcare, Healthcare, Sales, S…
dim(sleep_data)
## [1] 374 14
This output confirms that the cleaned dataset contains 374 observations and 14 variables, and that the variables are correctly formatted with appropriate data types for further analysis and modeling.
Now, we will use EDA models to explore the relationships between predictor variables with the outcome as well as with the other predictor variables.
First, let’s explore the distribution of the outcome,
Quality of Sleep.
# Histogram of Distribution of Quality of Sleep (Outcome Exploration)
hist_qual_sleep <- ggplot(sleep_data, aes(x = `Quality of Sleep`)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
labs(
title = "Distribution of Quality of Sleep",
x = "Quality of Sleep Score",
y = "Count"
) +
theme_minimal()
ggplotly(hist_qual_sleep)
Our outcome, Quality of Sleep, is measured on a scale of
1 to 10. From this histogram, we can clearly see that our actual data
appears to range from 4 to 9 and is slightly left skewed with most
observations between 6 and 8. While the most common score is 8, 6 is
very close, which tells us this distribution is pretty bimodal. This
histogram indicates that there are very few observations at the lowest
quality sleep levels, meaning extremely poor sleep quality is relatively
uncommon among individuals in the dataset.
Next, let’s make a correlation matrix and then a heat map a of the numeric variables to gain a better understanding of their relationship.
# Correlation Matrix (Predictor Relationships)
# Selecting only numeric values
sleep_numeric <- sleep_data %>%
select(
Age,
`Sleep Duration`,
`Quality of Sleep`,
`Physical Activity Level`,
`Stress Level`,
`Heart Rate`,
`Daily Steps`,
Systolic,
Diastolic
)
cor_matrix <- cor(sleep_numeric)
# Make the correlation plot
corrplot(cor_matrix, method = "color",type = "lower", addCoef.col = "black", number.cex = 0.5, tl.cex = 0.6)
One relationship that stands out is the strong positive correlation
between Sleep Duration and Quality of Sleep
(\(\approx\) 0.88), suggesting that
individuals who sleep longer tend to report higher sleep quality.
Another notable pattern is the strong negative correlation between
Stress Level and both Quality of Sleep (\(\approx\) -0.90) and
Sleep Duration (\(\approx\) -0.81), indicating that higher
stress levels are associated with poorer and shorter sleep. Both of
these relationships align with what I originally expected and the
relationships in the second pattern are strengthened by the strong
relationship between Sleep Duration and
Quality of Sleep. These relationships suggest that
Sleep Duration and Stress Level may be
particularly important predictors when modeling
Quality of Sleep later in this analysis.
Further, Heart Rate and Quality of Sleep
also stand out as they show a moderately strong negative relationship
(\(\approx\) -0.66), suggesting that
physiological factors such as the Heart Rate may also
influence sleep quality.
One result that surprised me was the weak relationship between
Physical Activity Level and Quality of Sleep
(\(\approx\) 0.19), as I initially
suspected that they would have a strong positive relationship.
Since we discussed the relationship between
Sleep Duration and Quality of Sleep in the
correlation plot, it is important to take a closer look at the
relationship between these variables.
# Sleep Duration vs. Quality of Sleep Scatterplot
sleep_dur_vs_quality_scat <- ggplot(sleep_data, aes(x = `Sleep Duration`,
y = `Quality of Sleep`)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", color = "darkblue") +
labs(
title = "Sleep Duration vs. Quality of Sleep",
x = "Sleep Duration (hours)",
y = "Quality of Sleep"
) +
geom_jitter(width = 0.4, height = 0.4, size = 1) +
theme_minimal()
ggplotly(sleep_dur_vs_quality_scat)
This scatterplot confirms the positively correlated relationship
between Sleep Duration and Quality of Sleep
indicating that the amount of hours someone sleeps is related to their
quality of sleep.
Since we discussed the relationship between Stress Level
and Quality of Sleep in the correlation plot, it is
important to take a closer look at the relationship between these
variables.
# Stress Level vs. Quality of Sleep
stress_level_vs_sleep_quality_scat <- ggplot(sleep_data,
aes(x = `Stress Level`,
y = `Quality of Sleep`)) +
geom_point(alpha = 0.6) +
geom_jitter(width = 0.4, height = 0.4, size = 1) +
geom_smooth(method = "lm", color = "darkred") +
labs(
title = "Stress Level vs. Quality of Sleep",
x = "Stress Level",
y = "Quality of Sleep"
) +
theme_minimal()
ggplotly(stress_level_vs_sleep_quality_scat)
This scatterplot illustrates the negatively correlated relationship
we found in the correlation plot between Stress Level and
Quality of Sleep. This tells us that higher stress levels
are correlated to a lower quality of sleep.
To better understand how Sleep Disorder status relates
to Quality of Sleep, let’s use a percent stacked bar chart
to visualize the distribution of Quality of Sleep within
each Sleep Disorder category (Sleep Apnea, Insomnia, or
None). This type of chart allows us to compare the proportions of low,
medium, and high sleep quality across groups while controlling for
differences in the number of observations in each category.
# Percent Stacked Bar Chart
sleep_eda <- sleep_data %>%
mutate(sleep_quality_group = case_when(
`Quality of Sleep` <= 4 ~ "Low",
`Quality of Sleep` <= 7 ~ "Medium",
TRUE ~ "High"
)
)
# Plot the percent stacked bar chart
percent_stacked_barchart <- ggplot(sleep_eda,
aes(x = `Sleep Disorder`, fill = sleep_quality_group)) +
geom_bar(position = "fill") +
labs(title = "Sleep Disorder vs. Quality of Sleep",
x = "Sleep Disorder",
y = "Proportion",
fill = "Sleep Quality") +
theme_minimal()
ggplotly(percent_stacked_barchart)
This chart reveals clear differences in the distribution of
Quality of Sleep across Sleep Disorder
categories. Individuals without a sleep disorder appear to have the
highest proportion of high Quality of Sleep outcomes and no
low quality of sleep outcomes, while individuals with Insomnia have a
much larger proportion of medium Quality of Sleep and very
few high Quality of Sleep outcomes. Individuals with Sleep
Apnea have a relatively balanced distribution between medium and high
Quality of Sleep outcomes, and the most low
Quality of Sleep outcomes. Overall, these differences
suggest that Sleep Disorder status may be associated with
variations in Quality of Sleep.
Let’s check and make sure all of our predictors are here and didn’t get changed at all through our EDA.
names(sleep_data)
## [1] "Person ID" "Gender"
## [3] "Age" "Sleep Duration"
## [5] "Quality of Sleep" "Physical Activity Level"
## [7] "Stress Level" "BMI Category"
## [9] "Systolic" "Diastolic"
## [11] "Heart Rate" "Daily Steps"
## [13] "Sleep Disorder" "Occupation Group"
Now, it’s time to split our data so that later we can evaluate how
well our models generalize to new data. We will perform a 70/30 split on
our data and split it into training and testing sets. We’ll also
stratify on our outcome variable, Quality of Sleep, to
ensure that the distribution of the outcome variable remains similar in
both the training and testing datasets.
# Set Seed for Reproducibility
set.seed(123)
# Split the Data (70/30 split, stratify on Quality of Sleep)
sleep_split <- initial_split(sleep_data, prop = 0.7, strata = `Quality of Sleep`)
sleep_train <- training(sleep_split)
sleep_test <- testing(sleep_split)
Before moving on, we need to make sure that data split correctly!
# Verify data split correctly
nrow(sleep_train)/nrow(sleep_data)
## [1] 0.6925134
nrow(sleep_test)/nrow(sleep_data)
## [1] 0.3074866
The training set has about 70% of the data and the testing set has about 30% of the data, so the data was split correctly.
For the recipe, we will use all of the same predictors as we used for
the EDA, except this time we will exclude the predictor,
Person ID since it is an identifier and is not used for
prediction. We’ll also convert categorical predictors to dummy variables
to be compatible with regression models. We’ll remove predictors with
zero variance and normalize numeric predictors to ensure they are on
comparable scales for modeling.
# Creating recipe
sleep_recipe <- recipe(`Quality of Sleep` ~ ., data = sleep_train) %>%
update_role(`Person ID`, new_role = "id") %>% # keep as identifier, not predictor
step_dummy(all_nominal_predictors()) %>% # change categorical factors to dummy variables
step_zv(all_predictors()) %>% # remove zero variance predictors
step_normalize(all_numeric_predictors()) # scale numeric predictors
Check Recipe
sleep_prep <- prep(sleep_recipe)
sleep_baked <- bake(sleep_prep, new_data = NULL)
ncol(sleep_baked)
## [1] 20
names(sleep_baked)
## [1] "Person ID" "Age"
## [3] "Sleep Duration" "Physical Activity Level"
## [5] "Stress Level" "Systolic"
## [7] "Diastolic" "Heart Rate"
## [9] "Daily Steps" "Quality of Sleep"
## [11] "Gender_Male" "BMI Category_Normal.Weight"
## [13] "BMI Category_Obese" "BMI Category_Overweight"
## [15] "Sleep Disorder_None" "Sleep Disorder_Sleep.Apnea"
## [17] "Occupation Group_Healthcare" "Occupation Group_Other"
## [19] "Occupation Group_Sales" "Occupation Group_Technology"
We will create 5 folds to conduct k-fold (5-fold in this case)
stratified cross-validation. We’ll choose 5 folds in this case because
it allows the model to be tested several times while still keeping the
process manageable. In 5-fold cross-validation, R divides the training
dataset into 5 equal subsets. The model is then trained on k-1 folds and
validated on the remaining fold, and this process repeats until each
fold has been used as the validation set. The performance metrics are
then averaged across all folds to provide a more reliable estimate of
how the model is expected to perform on unseen data. We’ll also stratify
our cross-validation on our response variable,
Quality of Sleep to ensure that each fold contains a
similar distribution of the outcome variable and prevent imbalances
across folds.
set.seed(123) # set seed for reproducibility
sleep_folds <- vfold_cv(sleep_train, v = 5, strata = `Quality of Sleep`)
sleep_folds
We will use RMSE as our metric.
sleep_metrics <- metric_set(rmse)
Let’s start off by using the linear regression model as a baseline model because it provides a simple and interpretable benchmark for evaluating how much more complex machine learning models improve predictive performance.
lm_model <- linear_reg() %>%
set_engine("lm")
lm_wf <- workflow() %>%
add_model(lm_model) %>%
add_recipe(sleep_recipe)
# No tuning grid necessary since we didn't tune any parameters.
# No tuning, so not necessary.
# First, let's fit the linear regression model to the folds since it had no tuning.
lm_fit <- fit_resamples(lm_wf, resamples = sleep_folds)
lm_rmse <- collect_metrics(lm_fit) %>%
filter(.metric == "rmse") %>%
slice(1) %>%
pull(mean)
lm_rmse
## [1] 0.266299
# We didn't tune, so no need for an autoplot.
Next, we’ll fit an elastic net regression model because it combines ridge and lasso regularization, so it can manage correlated predictors while preventing overfitting. We’ll tune the hyperparameters, penalty and mixture, to determine the optimal strength and type of regularization for the model.
enet_model <- linear_reg(
penalty = tune(),
mixture = tune()
) %>%
set_engine("glmnet")
enet_wf <- workflow() %>%
add_model(enet_model) %>%
add_recipe(sleep_recipe)
enet_grid <- grid_regular(
penalty(),
mixture(),
levels = 5
)
enet_grid
set.seed(123) # set seed for reproducibility
enet_tune <- tune_grid(
enet_wf,
resamples = sleep_folds,
grid = enet_grid
)
enet_tune
show_best(enet_tune, metric = "rmse", n = 1) # tells us row # for lowest RMSE for slice()
enet_rmse <- collect_metrics(enet_tune) %>%
filter(.metric == "rmse") %>%
arrange(mean) %>%
slice(1) %>%
pull(mean)
enet_rmse
## [1] 0.2550857
autoplot(enet_tune, metric = "rmse")
From the autoplot, we can see that models with smaller values of the penalty parameter generally produce lower RMSE values, indicating that weaker regularization performs better for this dataset. As the penalty increases, RMSE significantly increases, suggesting that predictive accuracy is harmed by too much regularization. The mixture parameter tells us that smaller values tend to yield lower RMSE compared to values closer to pure lasso. This plot helps identify the tuning parameter combination that minimizes RMSE and was selected as the best elastic net model.
We’ll fit a decision tree model because it can capture nonlinear relationships and interactions between predictors. We’ll tune the hyperparameters, cost_complexity, tree_depth, and min_n to control model complexity and prevent overfitting by determining how large and detailed the tree is allowed to grow.
dtree_model <- decision_tree(
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart") %>%
set_mode("regression")
dtree_wf <- workflow() %>%
add_model(dtree_model) %>%
add_recipe(sleep_recipe)
dtree_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 4
)
dtree_grid
set.seed(123) # set seed for reproducibility
dtree_tune <- tune_grid(
dtree_wf,
resamples = sleep_folds,
grid = dtree_grid
)
dtree_tune
dtree_rmse <- collect_metrics(dtree_tune) %>%
filter(.metric == "rmse") %>%
arrange(mean) %>%
slice(1) %>%
pull(mean)
dtree_rmse
## [1] 0.1806591
autoplot(dtree_tune, metric = "rmse")
The autoplot shows us that RMSE tends to be lowest when the cost-complexity parameter is small, which indicates that stronger pruning increases prediction error. Models with moderate tree depth generally perform better than very shallow trees, which appear to underfit the data. Differences across minimal node sizes are relatively small, although some configurations show slightly lower RMSE than others. Overall, this visualization helps identify the combination of tuning parameters that minimizes RMSE and will be selected as the best decision tree model.
We’ll fit a random forest model because it improves predictive accuracy by averaging many decision trees, reducing variance and overfitting. We’ll tune the hyperparameters, mtry, and min_n, to control how many predictors are considered at each split and the minimum number of observations required in each node, which influence the complexity and performance of the model.
rf_model <- rand_forest(
mtry = tune(),
min_n = tune(),
trees = 500
) %>%
set_engine('ranger', importance = "impurity") %>%
set_mode('regression')
rf_wf <- workflow() %>%
add_model(rf_model) %>%
add_recipe(sleep_recipe)
# first, lets get the true number of predictors so we can set the range for mtry.
num_pred <- sleep_baked %>%
select(-`Quality of Sleep`, - `Person ID`) %>%
ncol()
num_pred
## [1] 18
rf_grid <- grid_regular(
mtry(range = c(1, num_pred)),
min_n(),
levels = 4
)
rf_grid
set.seed(123) # set seed for reproducibility
rf_tune <- tune_grid(
rf_wf,
resamples = sleep_folds,
grid = rf_grid,
metrics = metric_set(rmse)
)
rf_tune
show_best(rf_tune, metric = "rmse", n=1) # tells us row # for lowest RMSE for slice()
rf_rmse <- collect_metrics(rf_tune) %>%
filter(.metric == "rmse") %>%
arrange(mean) %>%
slice(1) %>%
pull(mean)
rf_rmse
## [1] 0.1633274
autoplot(rf_tune, metric = "rmse")
The plot shows us that RMSE tends to be lowest when the cost-complexity parameter is small, which indicates that stronger pruning increases prediction error. Models with moderate tree depth generally perform better than very shallow trees, which appear to underfit the data. Differences across minimal node sizes are relatively small, although some configurations show slightly lower RMSE than others. Overall, this visualization helps identify the combination of tuning parameters that minimizes RMSE and will be selected as the best decision tree model.
We’ll fit the boosted tree model because it sequentially builds trees that learn from the errors of previous trees, allowing the model to capture complex patterns in the data. We’ll tune the hyperparameters trees, tree_depth, learn_rate, and min_n to control the number of trees, the complexity of each tree, the speed at which the model learns, and the minimum observations required in each node to optimize predictive performance and prevent overfitting.
boost_model <- boost_tree(
trees = tune(),
tree_depth = tune(),
learn_rate = tune(),
min_n = tune()
) %>%
set_engine("xgboost") %>%
set_mode("regression")
boost_wf <- workflow() %>%
add_model(boost_model) %>%
add_recipe(sleep_recipe)
boost_grid <- grid_regular(
trees(range = c(100,500)),
tree_depth(),
learn_rate(),
min_n(),
levels = 3
)
boost_grid
set.seed(123) # set seed for reproducibility
boost_tune <- tune_grid(
boost_wf,
resamples = sleep_folds,
grid = boost_grid,
metrics = metric_set(rmse)
)
boost_tune
show_best(boost_tune, metric = "rmse", n = 1) # tells us row # for lowest RMSE for slice()
boost_rmse <- collect_metrics(boost_tune) %>%
filter(.metric == "rmse") %>%
arrange(mean) %>%
slice(1) %>%
pull(mean)
boost_rmse
## [1] 0.1554284
autoplot(boost_tune, metric = "rmse")
The plot shows us that models with extremely small learning rates perform poorly, producing very large RMSE values regardless of the number of trees or tree depth. Models with a higher learning rate show much lower RMSE, indicating better predictive performance. Increasing the number of trees slightly improves performance but does not dramatically change RMSE once a sufficient number of trees is used. Differences across tree depth and minimal node size appear relatively small. Overall, this plot helps identify the combination of tuning parameters that minimizes RMSE of the best boosted tree model.
Let’s see how our models performed!
In order to summarize the best RMSE values from our 5 models, we will create a tibble to display the performance of our models as measure by the RMSE. The lower the RMSE, the better the model has performed.
Create a tibble of all models and their RMSE.. The best model is the ones with the lowest RMSE. So let’s organize them by best to worst so we can use the best model to predict our testing set.
# create comparison tibble
comparison_tibble <- tibble(
Model = c("Linear Regression", "Elastic Net Regression", "Decision Tree", "Random Forest", "Boosted Trees" ),
RMSE = c(lm_rmse, enet_rmse, dtree_rmse, rf_rmse, boost_rmse)
)
comparison_tibble %>%
arrange(RMSE)
The comparison tibble shows us that the model with the lowest RMSE is the Boosted Tree model (\(\approx\) 0.155), indicating this model has the best predictive performance. So, we will use the Boosted Tree model for final evaluation on the the testing set.
The comparison tibble shows us that the model with the second lowest RMSE is the Random Forest model (\(\approx\) 0.163), indicating this model has the second best predictive performance.
Let’s compare our models and make sure Boosted Trees is better than Random Forest since their lowest RMSE values are very close. We’ll use a boxplot to compare across folds.
boost_fold_rmse <- collect_metrics(boost_tune, summarize = FALSE) %>%
filter(.metric == "rmse") %>%
select(id, .estimate) %>%
mutate(Model = "Boosted Tree")
rf_fold_rmse <- collect_metrics(rf_tune, summarize = FALSE) %>%
filter(.metric == "rmse") %>%
select(id, .estimate) %>%
mutate(Model = "Random Forest")
fold_rmse_compare <- bind_rows(boost_fold_rmse, rf_fold_rmse)
boxplot_compare <- ggplot(fold_rmse_compare, aes(x = Model, y = .estimate, fill = Model)) +
geom_boxplot(alpha = 0.8) +
labs(title = "RMSE Fold Comparison Boxplot",
x = "Model",
y = "RMSE") +
theme_minimal()
ggplotly(boxplot_compare)
The boosted tree model shows a wider spread of RMSE values across
folds, indicating that its predictive performance varies more depending
on the specific training split. This occurs because boosted trees build
models sequentially and can be more sensitive to small changes in the
training data. The random forest model, on the other hand, averages
predictions across many independent trees, which reduces variance and
results in more stable performance across folds. This plot ultimately
shows that the boosted tree model is slightly more accurate but more
variable than the random forest model. We’ll train and test both of
these models to see which is a better predictive model for
Quality of Sleep.
Since the Boosted Tree model has the lowest RMSE and the Random Forest model’s RMSE is very close and has less error, we will use both of these models..
Boosted Tree
# collecting metrics on the tuned boosted tree model and slicing on the lowest RMSE
boost_results <- boost_tune %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
arrange(mean) %>%
slice(1)
boost_results
which.min(boost_results$mean)
## [1] 1
So, Boosted Tree #1 with trees = 300, min_n = 2, tree_depth = 8, and learn_rate = 0.1, performed the best with a cross-validated RMSE of \(\approx\) 0.155.
Random Forest
# collecting metrics on the tuned random forest and slicing on the lowest RMSE
rf_results <- rf_tune %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
arrange(mean) %>%
slice(1)
rf_results
which.min(rf_results$mean)
## [1] 1
So, Random Forest #1 with mtry = 18 predictors considered at each split and a minimum node size (min_n) of 2, performed the best with a cross-validated RMSE of \(\approx\) 0.1633.
Boosted Tree
Now, we will take the best model from the tuned Boosted Tree (#1) and fit it to the training data to train the Boosted Tree model one more time on the entire training dataset. Once we fit and train the boosted tree on the training data, it will be ready for testing.
set.seed(123)
# Finalize Best Boosted Trees Model
best_boost_train <- select_best(boost_tune, metric = "rmse")
boost_final_wf_train <- finalize_workflow(boost_wf, best_boost_train)
# Fit model to the training data
boost_final_fit_train <- fit(boost_final_wf_train, data = sleep_train)
Random Forest
Now, we will take the best model from the tuned Random Forest (#1) and fit it to the training data to train the Random Forest model one more time on the entire training dataset. Once we fit and train the random forest on the training data, it will be ready for testing.
set.seed(123)
# Finalize best Random Forest Model
best_rf_train <- select_best(rf_tune, metric = "rmse")
rf_final_wf_train <- finalize_workflow(rf_wf, best_rf_train)
# Fit model to the training data
rf_final_fit_train <- fit(rf_final_wf_train, data = sleep_train)
Now, we will take the best boosted tree and random forest model and fit them to the training data. Once we have fit and trained these models, they will be ready for testing!
Boosted Tree
sleep_tibble_boost_train <- predict(
boost_final_fit_train, new_data = sleep_train %>%
select(-`Quality of Sleep`)) %>%
bind_cols(sleep_train %>% select(`Quality of Sleep`))
sleep_tibble_boost_train
boost_train_rmse <- rmse(
sleep_tibble_boost_train,
truth = `Quality of Sleep`,
estimate = .pred
) %>%
pull(.estimate)
boost_train_rmse
## [1] 0.001213556
Random Forest
sleep_tibble_rf_train <- predict(
rf_final_fit_train, new_data = sleep_train %>%
select(-`Quality of Sleep`)) %>%
bind_cols(sleep_train %>% select(`Quality of Sleep`))
sleep_tibble_rf_train
rf_train_rmse <- rmse(
sleep_tibble_rf_train,
truth = `Quality of Sleep`,
estimate = .pred
) %>%
pull(.estimate)
rf_train_rmse
## [1] 0.06745269
Now, it’s time to test the models to see how they perform on the testing data that has not been trained at all.
Let’s take the best boosted tree and random forest model, test them on the testing data, and extract the testing RMSE.
Boosted Tree
sleep_tibble_boost_test <- predict(
boost_final_fit_train, new_data = sleep_test %>%
select(-`Quality of Sleep`)) %>%
bind_cols(sleep_test %>% select(`Quality of Sleep`))
sleep_tibble_boost_test
boost_test_rmse <- rmse(
sleep_tibble_boost_test,
truth = `Quality of Sleep`,
estimate = .pred
) %>%
pull(.estimate)
boost_test_rmse
## [1] 0.3971915
Random Forest
sleep_tibble_rf_test <- predict(
rf_final_fit_train, new_data = sleep_test %>%
select(-`Quality of Sleep`)) %>%
bind_cols(sleep_test %>% select(`Quality of Sleep`))
sleep_tibble_rf_test
rf_test_rmse <- rmse(
sleep_tibble_rf_test,
truth = `Quality of Sleep`,
estimate = .pred
) %>%
pull(.estimate)
rf_test_rmse
## [1] 0.3274892
Let’s compare the testing error of our best models! A lower testing RMSE indicates that a model’s predictions are closer to the actual values and that the model generalizes better to new observations.
testing_error <- tibble(
Model = c("Boosted Tree", "Random Foret"),
test_rmse = c(boost_test_rmse, rf_test_rmse)
)
testing_error
The random forest model achieves a lower testing RMSE (0.327) compared to the boosted tree model (0.397), indicating better predictive performance on unseen data.
Now that we’ve tested both models, let’s compare how they perform compared to the actual data and eachother overall by looking at the generalization gap between the training and testing RMSE for each of our models.
final_model_comparison <- tibble(
Model = c("Boosted Tree", "Random Forest"),
`Cross-Validation RMSE` = c(boost_rmse, rf_rmse),
`Training RMSE` = c(boost_train_rmse, rf_train_rmse),
`Testing RMSE` = c(boost_test_rmse, rf_test_rmse)) %>%
mutate(`Generalization Gap` = `Testing RMSE` - `Training RMSE`)
final_model_comparison
Although the boosted tree model achieved the lowest cross-validation
(0.155) and training RMSE (0.0012), its testing RMSE (0.397) is much
higher indicating overfitting. The random forest model has a slightly
higher cross-validation RMSE (0.163) but a lower testing RMSE (0.327)
and a smaller increase from training to testing error (0.260 versus
0.396), suggesting it generalizes better to unseen data. Overall, this
suggests that the random forest model performs the best at predicting
Quality of Sleep based on the lifestyle and physiological
factors included in the dataset, and thus, we will select the
random forest model as the final model.
Let’s look at a plot of the predicted values vs. actual values:
sleep_tibble_rf_test %>%
ggplot(aes(x = `Quality of Sleep`, y = .pred )) +
geom_point(color = "steelblue", alpha = 0.6, position = position_jitter(width = 0.3, height = 0.1)) +
geom_abline(lty = 2) +
labs(
title = "Predicted vs Actual Quality of Sleep",
x = "Actual Quality of Sleep",
y = "Predicted Quality of Sleep"
) +
theme_minimal()
If the model were perfect, every point would lie on the dashed line. This plot shows us that the majority of data lies either on or very close to the line, telling us that the model’s predictions are very close to the actual values and thus the model performs well overall.
Random forest models can also tell us which variables are the most important in predicting the outcome. We can see this in a variable importance plot (VIP), so let’s make the plot!
rf_final_fit_train %>%
extract_fit_engine() %>%
vip(aesthetics = list(fill = "red", color = "blue"))
From the VIP, we can see that the variable that was most important in
predicting the Quality of Sleep was
Stress Level. The second most important variable was
Sleep Duration, which is what I had initially expected
would be the most important predictor. The rest of the predictors were
significantly less important than Stress Level which makes
sense because I feel like I to get lower quality sleep when I am more
stressed and higher quality sleep when I am less stressed.
Figure: National Fund for Workforce Solutions, “Recognizing and Reducing Stress in the Workplace” (2022)
In this project, I built and compared multiple models to predict
Quality of Sleep using health and lifestyle variables.
After tuning and evaluating Linear Regression, Elastic Net, Decision
Tree, Random Forest, and Boosted Trees, the Random Forest and Boosted
Tree models stood out as the strongest performers based on their
cross-validation RMSE. Although the Boosted Tree model achieved the
lowest cross-validated RMSE, further evaluation revealed that it
significantly overfit the training data, as shown by its extremely low
training RMSE and much higher testing RMSE. In contrast, the Random
Forest achieved a lower testing RMSE, indicating better performance on
unseen data. The predicted vs. actual plot further supported this
result, as the Random Forest predictions closely followed the 45-degree
reference line, showing a relatively strong alignment between predicted
and actual values. While there is some spread in the mid-range values,
the model performs well overall. The variable importance plot also
showed us that Stress Level is the best predictor of
Quality of Sleep in our dataset.
Overall, the Random Forest model was selected as the final model
because it balances accuracy and generalizability, making it the most
reliable for predicting Quality of Sleep. While the current
models performed well, there are several opportunities for future
improvement. We could incorporate additional features such as screen
time and more detailed health metrics which could improve predictive
performance. We also could explore more advanced models or tune
hyperparameters further which may lead to even better results. In the
future, we could also include testing the model on different datasets
(perhaps real data instead of synthetic) and retraining it over time to
evaluate its validity and generalizability across populations.
This project was a great opportunity to gain hands-on experience with
machine learning models and helped me improve my skills. Additionally,
now that I’ve determined that Stress Level is the most
important predictor of Quality of Sleep, I will try to make
changes in my life to lower my stress level and get better sleep. I
learned a lot throughout the process of cleaning the data, conducting
EDA, building and testing models and I’m proud of my final results. I
look forward to continuing to build and improve my modeling skills in
the future!
“Sleep Health and Lifestyle Dataset”, Kaggle, Laksika Tharmalingam (2023), https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset