For this assignment, I will use the Sleep Health and Lifestyle dataset.
The assignment spans two weeks. The first week covers cleaning and preparing the data set, then building and optimizing a logistic regression model. In the second week, additional machine learning models will be built and all model results compared to select the best one.
Recall the important steps outlined above for creating a model.
First, I will load all the R packages needed for the analysis. Each package provides tools for data cleaning, visualization, model building, or evaluation: the tidyverse (which includes dplyr) handles data manipulation, ggplot2 handles plotting, and caret and randomForest supply the machine learning tools. Loading them now avoids repeatedly calling library() later.
# Load libraries
#install.packages("caret")
#install.packages("rcompanion")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(cowplot)
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
library(dplyr)
library(tidyr)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(rcompanion)
library(corrplot)
## corrplot 0.95 loaded
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(e1071)
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:ggplot2':
##
## element
This week: 1. Perform any and all steps needed to clean and prepare the sleep data set. Note that glm() can only predict two outcomes, but there are three: Normal, Insomnia, and Sleep Apnea. Merge the abnormal sleep outcomes into a single outcome.
This week, I will focus on cleaning and preparing the sleep dataset. Since logistic regression predicts only two outcomes, I will combine “Insomnia” and “Sleep Apnea” into a single “Abnormal” category, leaving “Normal” as the other category.
I will import the Sleep Health and Lifestyle dataset and make it analysis-ready. First, I will load the data using read.csv() and clean column names with make.names(). Next, I will split the “Blood.Pressure” column into two numeric columns, “Systolic” and “Diastolic.” Then, I will merge the sleep disorder categories into “Normal” and “Abnormal.” BMI will also be simplified into “Healthy” and “Unhealthy,” and the Person.ID column will be removed because it doesn’t provide predictive information. Finally, I will ensure that categorical and numeric variables have the correct type and inspect the data to confirm that cleaning worked as expected.
# 1 Clean the data, tidy data, remove samples with NA values and extremes or rare values
# Load and inspect data
sleep <- read.csv("Sleep_health_and_lifestyle_dataset.csv")
# Clean column names
colnames(sleep) <- make.names(colnames(sleep))
# Split Blood Pressure into Systolic and Diastolic
sleep_clean <- sleep %>%
mutate(Blood.Pressure = str_trim(Blood.Pressure)) %>%
separate(Blood.Pressure, into = c("Systolic", "Diastolic"), sep = "/") %>%
mutate(across(c(Systolic, Diastolic), as.numeric))
# Merge Sleep Disorder categories (Normal vs Abnormal)
sleep_clean <- sleep_clean %>%
mutate(Sleep.Disorder = case_when(
Sleep.Disorder %in% c("Insomnia", "Sleep Apnea") ~ "Abnormal",
TRUE ~ "Normal"
)) %>%
mutate(Sleep.Disorder = factor(Sleep.Disorder, levels = c("Normal", "Abnormal")))
# Merge BMI categories into Healthy vs Unhealthy
sleep_clean <- sleep_clean %>%
mutate(BMI.Category = case_when(
BMI.Category %in% c("Normal", "Normal Weight") ~ "Healthy",
BMI.Category %in% c("Overweight", "Obese") ~ "Unhealthy"
)) %>%
mutate(BMI.Category = factor(BMI.Category, levels = c("Healthy", "Unhealthy")))
# Remove Person.ID because it is simply a unique identifier and contains no predictive value
sleep_clean <- sleep_clean %>%
select(-Person.ID)
# Convert relevant variables to numeric or factor
sleep_clean <- sleep_clean %>%
mutate(
Gender = as.factor(Gender),
Occupation = as.factor(Occupation),
Age = as.numeric(Age),
Sleep.Duration = as.numeric(Sleep.Duration),
Quality.of.Sleep = as.numeric(Quality.of.Sleep),
Physical.Activity.Level = as.numeric(Physical.Activity.Level),
Stress.Level = as.numeric(Stress.Level),
Heart.Rate = as.numeric(Heart.Rate),
Daily.Steps = as.numeric(Daily.Steps)
)
# Verify structure
str(sleep_clean)
## 'data.frame': 374 obs. of 13 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : num 27 28 28 28 28 28 29 29 29 29 ...
## $ Occupation : Factor w/ 11 levels "Accountant","Doctor",..: 10 2 2 7 7 10 11 2 2 2 ...
## $ Sleep.Duration : num 6.1 6.2 6.2 5.9 5.9 5.9 6.3 7.8 7.8 7.8 ...
## $ Quality.of.Sleep : num 6 6 6 4 4 4 6 7 7 7 ...
## $ Physical.Activity.Level: num 42 60 60 30 30 30 40 75 75 75 ...
## $ Stress.Level : num 6 8 8 8 8 8 7 6 6 6 ...
## $ BMI.Category : Factor w/ 2 levels "Healthy","Unhealthy": 2 1 1 2 2 2 2 1 1 1 ...
## $ Systolic : num 126 125 125 140 140 140 140 120 120 120 ...
## $ Diastolic : num 83 80 80 90 90 90 90 80 80 80 ...
## $ Heart.Rate : num 77 75 75 85 85 85 82 70 70 70 ...
## $ Daily.Steps : num 4200 10000 10000 3000 3000 3000 3500 8000 8000 8000 ...
## $ Sleep.Disorder : Factor w/ 2 levels "Normal","Abnormal": 1 1 1 2 2 2 2 1 1 1 ...
summary(sleep_clean)
## Gender Age Occupation Sleep.Duration Quality.of.Sleep
## Female:185 Min. :27.00 Nurse :73 Min. :5.800 Min. :4.000
## Male :189 1st Qu.:35.25 Doctor :71 1st Qu.:6.400 1st Qu.:6.000
## Median :43.00 Engineer :63 Median :7.200 Median :7.000
## Mean :42.18 Lawyer :47 Mean :7.132 Mean :7.313
## 3rd Qu.:50.00 Teacher :40 3rd Qu.:7.800 3rd Qu.:8.000
## Max. :59.00 Accountant:37 Max. :8.500 Max. :9.000
## (Other) :43
## Physical.Activity.Level Stress.Level BMI.Category Systolic
## Min. :30.00 Min. :3.000 Healthy :216 Min. :115.0
## 1st Qu.:45.00 1st Qu.:4.000 Unhealthy:158 1st Qu.:125.0
## Median :60.00 Median :5.000 Median :130.0
## Mean :59.17 Mean :5.385 Mean :128.6
## 3rd Qu.:75.00 3rd Qu.:7.000 3rd Qu.:135.0
## Max. :90.00 Max. :8.000 Max. :142.0
##
## Diastolic Heart.Rate Daily.Steps Sleep.Disorder
## Min. :75.00 Min. :65.00 Min. : 3000 Normal :219
## 1st Qu.:80.00 1st Qu.:68.00 1st Qu.: 5600 Abnormal:155
## Median :85.00 Median :70.00 Median : 7000
## Mean :84.65 Mean :70.17 Mean : 6817
## 3rd Qu.:90.00 3rd Qu.:72.00 3rd Qu.: 8000
## Max. :95.00 Max. :86.00 Max. :10000
##
After running this code, the dataset is clean and analysis-ready. The make.names() function converted column names like Sleep Duration into Sleep.Duration, ensuring R-compliant variable names. The Blood.Pressure column now appears as two numeric variables, Systolic and Diastolic, each containing one clear value per participant. The categorical merging simplified Sleep.Disorder into “Normal” and “Abnormal” and BMI.Category into “Healthy” and “Unhealthy,” which supports binary modeling. The str() output shows that variables such as Gender, Occupation, and BMI.Category are factors, while continuous measures like Sleep.Duration, Age, and Heart.Rate are numeric. The summary() confirms that no variables contain unexpected NAs or non-numeric entries. At this point, the data are properly structured, tidy, and ready for correlation testing and scaling.
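Before moving on, the absence of missing values can be confirmed directly rather than read off the summary() output. This is a minimal sanity check, assuming the sleep_clean object created above:
# Confirm there are no NA values and that every column has the intended type
colSums(is.na(sleep_clean))
sapply(sleep_clean, class)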
In this step, I will examine relationships among both numeric and categorical variables to identify redundancy or multicollinearity that could distort model estimates. For numeric variables, I will calculate a correlation matrix using cor() and then apply findCorrelation() from the caret package to automatically flag variables that are highly correlated (above 0.9). Any such variables will be removed to reduce redundancy. For categorical variables, I will create two custom functions: one that computes chi-square p-values (catcor) and another that calculates Cramér’s V (catcramV) to measure the strength of association between pairs of categorical predictors. Finally, I will visualize these categorical correlations using the corrplot() function. This step will ensure that no predictors are overly correlated and that any strong dependencies among categorical variables are identified before modeling.
# 2 Test for correlations
# NUMERICAL VARIABLES
# Check correlations
num_vars <- sleep_clean %>%
select(where(is.numeric)) %>%
cor(use = "complete.obs")
print(round(num_vars, 2))
## Age Sleep.Duration Quality.of.Sleep
## Age 1.00 0.34 0.47
## Sleep.Duration 0.34 1.00 0.88
## Quality.of.Sleep 0.47 0.88 1.00
## Physical.Activity.Level 0.18 0.21 0.19
## Stress.Level -0.42 -0.81 -0.90
## Systolic 0.61 -0.18 -0.12
## Diastolic 0.59 -0.17 -0.11
## Heart.Rate -0.23 -0.52 -0.66
## Daily.Steps 0.06 -0.04 0.02
## Physical.Activity.Level Stress.Level Systolic Diastolic
## Age 0.18 -0.42 0.61 0.59
## Sleep.Duration 0.21 -0.81 -0.18 -0.17
## Quality.of.Sleep 0.19 -0.90 -0.12 -0.11
## Physical.Activity.Level 1.00 -0.03 0.27 0.38
## Stress.Level -0.03 1.00 0.10 0.09
## Systolic 0.27 0.10 1.00 0.97
## Diastolic 0.38 0.09 0.97 1.00
## Heart.Rate 0.14 0.67 0.29 0.27
## Daily.Steps 0.77 0.19 0.10 0.24
## Heart.Rate Daily.Steps
## Age -0.23 0.06
## Sleep.Duration -0.52 -0.04
## Quality.of.Sleep -0.66 0.02
## Physical.Activity.Level 0.14 0.77
## Stress.Level 0.67 0.19
## Systolic 0.29 0.10
## Diastolic 0.27 0.24
## Heart.Rate 1.00 -0.03
## Daily.Steps -0.03 1.00
# Identify highly correlated variables
high_corr <- findCorrelation(cor(sleep_clean %>% select(where(is.numeric))), cutoff = 0.9, names = TRUE)
high_corr
## [1] "Diastolic"
# Remove those variables
sleep_clean <- sleep_clean %>%
select(-all_of(high_corr))
# Check resulting numeric correlations again
cor(sleep_clean %>% select(where(is.numeric)), use = "complete.obs")
## Age Sleep.Duration Quality.of.Sleep
## Age 1.0000000 0.34470936 0.47373388
## Sleep.Duration 0.3447094 1.00000000 0.88321300
## Quality.of.Sleep 0.4737339 0.88321300 1.00000000
## Physical.Activity.Level 0.1789927 0.21236031 0.19289645
## Stress.Level -0.4223445 -0.81102303 -0.89875203
## Systolic 0.6058784 -0.18040628 -0.12163200
## Heart.Rate -0.2256062 -0.51645489 -0.65986473
## Daily.Steps 0.0579734 -0.03953254 0.01679141
## Physical.Activity.Level Stress.Level Systolic
## Age 0.17899272 -0.42234448 0.6058784
## Sleep.Duration 0.21236031 -0.81102303 -0.1804063
## Quality.of.Sleep 0.19289645 -0.89875203 -0.1216320
## Physical.Activity.Level 1.00000000 -0.03413446 0.2654160
## Stress.Level -0.03413446 1.00000000 0.1028182
## Systolic 0.26541597 0.10281816 1.0000000
## Heart.Rate 0.13697098 0.67002646 0.2941429
## Daily.Steps 0.77272305 0.18682895 0.1033422
## Heart.Rate Daily.Steps
## Age -0.22560619 0.05797340
## Sleep.Duration -0.51645489 -0.03953254
## Quality.of.Sleep -0.65986473 0.01679141
## Physical.Activity.Level 0.13697098 0.77272305
## Stress.Level 0.67002646 0.18682895
## Systolic 0.29414292 0.10334222
## Heart.Rate 1.00000000 -0.03030858
## Daily.Steps -0.03030858 1.00000000
# CATEGORICAL VARIABLES
# Define custom functions
# catcor(): pairwise chi-square p-values for a set of categorical variables
catcor <- function(vars, dat){
corr <- sapply(vars, function(y)
sapply(vars, function(x)
chisq.test(table(dat[,x], dat[,y]))$p.value))
return(as.data.frame(round(corr, 2)))
}
# catcramV(): pairwise bias-corrected Cramér's V for the same variables
catcramV <- function(vars, dat){
corr <- sapply(vars, function(y)
sapply(vars, function(x)
round(cramerV(table(dat[,x], dat[,y]), bias.correct = TRUE), 2)))
return(as.data.frame(corr))
}
# Select categorical variables
cat_vars <- sleep_clean %>% select(where(is.factor))
cat_names <- colnames(cat_vars)
# Run Chi-squared tests and Cramér's V
pval_cat <- catcor(vars = cat_names, dat = cat_vars)
## Warning in chisq.test(table(dat[, x], dat[, y])): Chi-squared approximation may
## be incorrect
## Warning in chisq.test(table(dat[, x], dat[, y])): Chi-squared approximation may
## be incorrect
## Warning in chisq.test(table(dat[, x], dat[, y])): Chi-squared approximation may
## be incorrect
## Warning in chisq.test(table(dat[, x], dat[, y])): Chi-squared approximation may
## be incorrect
## Warning in chisq.test(table(dat[, x], dat[, y])): Chi-squared approximation may
## be incorrect
## Warning in chisq.test(table(dat[, x], dat[, y])): Chi-squared approximation may
## be incorrect
## Warning in chisq.test(table(dat[, x], dat[, y])): Chi-squared approximation may
## be incorrect
cramV_cat <- catcramV(vars = cat_names, dat = cat_vars)
# Ensure name alignment
pval_mat <- as.matrix(pval_cat)
cramV_mat <- as.matrix(cramV_cat)
# Make sure both matrices share identical names
rownames(pval_mat) <- rownames(cramV_mat)
colnames(pval_mat) <- colnames(cramV_mat)
# Plot Cramér’s V correlations
corrplot::corrplot(
cramV_mat,
is.corr = TRUE,
type = "lower",
diag = FALSE,
p.mat = pval_mat,
sig.level = 0.05 # pairs with p >= 0.05 are crossed out as not significant
)
After running this code, the printed numeric correlation matrix clearly showed the relationships among the continuous variables. The strongest associations appeared between Sleep.Duration and Quality.of.Sleep (r = 0.88) and between Quality.of.Sleep and Stress.Level (r = -0.90), suggesting that longer sleep and better sleep quality are strongly associated, while both are inversely related to stress. Systolic blood pressure showed a moderate positive correlation with Age (r = 0.61), which aligns with biological expectations that blood pressure tends to rise with age, and Physical.Activity.Level and Daily.Steps were highly correlated (r = 0.77), which makes sense since both are indicators of movement. One pair did exceed the 0.9 threshold: Systolic and Diastolic blood pressure were nearly collinear (r = 0.97), so findCorrelation() flagged Diastolic and it was removed from the dataset. After that removal, no remaining pairs exceeded the cutoff, meaning multicollinearity among the surviving numeric predictors is not severe.
The Cramér’s V correlation plot visually summarized the strength of relationships between categorical variables, where large, dark blue circles indicated strong associations and smaller, lighter blue circles indicated weaker ones. Several notable patterns emerged from the visualization. Occupation and Gender showed a large, dark blue circle, suggesting that certain occupations in this dataset appear to be dominated by one gender. Similarly, BMI.Category and Occupation exhibited a strong relationship, indicating that body weight categories may vary systematically by job type, potentially reflecting lifestyle or activity differences tied to occupation. Sleep.Disorder and Occupation also demonstrated a strong association, implying that occupation could meaningfully influence sleep health outcomes. Likewise, BMI.Category and Sleep.Disorder were highly correlated, which is expected since body weight often affects sleep conditions such as sleep apnea. In contrast, BMI.Category and Gender, as well as Sleep.Disorder and Gender, displayed smaller, lighter blue circles, suggesting only weak positive associations.
Together, these findings indicate that while numeric variables do not exhibit problematic multicollinearity, several categorical predictors are interrelated. However, these relationships reflect meaningful behavioral and physiological patterns rather than redundancy. For instance, the observed connections between occupation, BMI, and sleep health suggest underlying lifestyle or occupational effects that merit further exploration rather than removal from the model. This diagnostic step confirms that the data are well suited for modeling while highlighting potentially important social and biological interactions worth considering in the interpretation of results.
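For reference, Cramér's V is computed directly from the chi-square statistic as V = sqrt(chi-squared / (n * (k - 1))), where n is the total count and k is the smaller of the table's row and column counts. The sketch below illustrates this for one pair, assuming sleep_clean from above; note that cramerV() with bias.correct = TRUE returns a slightly smaller, bias-corrected value than this raw formula.
# Illustrative only: uncorrected Cramér's V for Gender vs. Occupation by hand
tab_go <- table(sleep_clean$Gender, sleep_clean$Occupation)
chi2 <- suppressWarnings(chisq.test(tab_go)$statistic)
k <- min(nrow(tab_go), ncol(tab_go))
sqrt(as.numeric(chi2) / (sum(tab_go) * (k - 1)))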
In this step, I will standardize all numeric variables in the dataset so that they are on comparable scales before building machine learning models. Many modeling algorithms, such as logistic regression and Naive Bayes, are sensitive to differences in variable magnitude. For example, Daily.Steps ranges into the thousands while Sleep.Duration stays below ten; without scaling, variables with larger numerical ranges can dominate model behavior. To address this, I will create two scaled versions of the dataset. The first version will rescale each numeric column by dividing it by its maximum value, so that every value becomes a fraction of the column maximum (at most 1). The second version will center the data by subtracting the mean from each numeric variable, producing values centered on zero. I will ultimately use the max-scaled version (sleep_scaledFrac) for models like logistic regression and Naive Bayes, as this scaling simplifies interpretation of coefficients and improves numerical stability.
# 3 Center and Scale the Data
# Scale numeric data as a fraction of the column maximum (maximum becomes 1)
sleep_scaledFrac <- sleep_clean %>%
mutate(across(where(is.numeric), ~ . / max(.)))
# Center numeric data around the mean (each column mean becomes 0)
sleep_scaledMean <- sleep_clean %>%
mutate(across(where(is.numeric), ~ . - mean(.)))
# Use the scaled (0–1) version for GLM and Naive Bayes
sleep_scaled <- sleep_scaledFrac
After running this code, all numeric variables were successfully rescaled. In the sleep_scaledFrac dataset, each numeric column now has a maximum of exactly 1, with every other value expressed as a fraction of that maximum. For example, the first few rows show values such as an Age of approximately 0.46 and a Sleep.Duration around 0.72, reflecting mid-range values within their respective distributions. The resulting dataset, sleep_scaled, has 374 observations and 12 variables, including both numeric and factor variables such as Gender and Occupation. The centered version (sleep_scaledMean) has each numeric column's mean shifted to exactly zero, although this version is not used further. Scaling ensures that all continuous predictors contribute proportionally to the modeling process, preventing variables with larger raw ranges from biasing the model fitting. This step standardizes the dataset, creating a clean and consistent numerical foundation for the upcoming model training phase.
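A quick verification, assuming sleep_scaledFrac and sleep_scaledMean as built above, confirms that both transformations behaved as described: every fraction-scaled column tops out at exactly 1, and every centered column has a mean of (numerically) zero.
# Check: maxima of the fraction-scaled columns should all be 1
sapply(sleep_scaledFrac %>% select(where(is.numeric)), max)
# Check: means of the centered columns should all be ~0
round(sapply(sleep_scaledMean %>% select(where(is.numeric)), mean), 10)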
In this step, I will partition the cleaned dataset into training and testing subsets to ensure an unbiased evaluation of model performance and to prevent overfitting. I will begin by setting a fixed random seed using set.seed(123) so that the random selection of observations can be replicated exactly in future runs. Then, I will use the sample() function to randomly select approximately one-third of all observations for the testing set, storing their indices in the variable test_index. Based on these indices, I will create two versions of the split: one scaled dataset (sleep_train_scaled and sleep_test_scaled) to be used with algorithms that require normalized predictors such as Generalized Linear Models (GLM) and Naive Bayes, and one unscaled dataset (sleep_train_raw and sleep_test_raw) to be used with tree-based models like Random Forests that are invariant to scaling.
Next, I will use the dim() function to verify the sizes of the resulting subsets, ensuring that roughly two-thirds of the data are allocated to training and one-third to testing. To make later comparisons easier, I will also add a new categorical indicator variable called test using the mutate() function from the dplyr package. This column will label each observation as belonging to either the “train” or “test” subset. To confirm that the random split produced balanced and representative samples, I will then compare the outcome distribution of Sleep.Disorder across subsets using prop.table() on the contingency tables. I will visually examine the distribution of numeric variables (e.g., Age, Sleep.Duration, Stress.Level, Heart.Rate) using boxplots created with ggplot2, and perform formal statistical checks by running independent two-sample t-tests on each numeric predictor. Finally, I will use bar plots to visualize the proportion of factor levels (such as Gender, Occupation, and BMI.Category) across subsets and apply chi-square tests to verify that categorical distributions do not differ significantly between the training and testing groups. Together, these steps will confirm that the data were split fairly and that both subsets are comparable in structure and content.
# 4 Split training and testing data
# 123 is an arbitrarily set seed for reproducibility
set.seed(123)
# Create a random sample of approximately 1/3 of the data for testing
test_index <- sample(1:nrow(sleep_scaled), size = round(nrow(sleep_scaled)/3))
# Create testing and training sets for GLM and Naive Bayes (scaled)
sleep_train_scaled <- sleep_scaled[-test_index, ]
sleep_test_scaled <- sleep_scaled[test_index, ]
# Create testing and training sets for Random Forest (raw)
sleep_train_raw <- sleep_clean[-test_index, ]
sleep_test_raw <- sleep_clean[test_index, ]
# Check dimensions
dim(sleep_train_scaled)
## [1] 249 12
dim(sleep_test_scaled)
## [1] 125 12
dim(sleep_test_raw)
## [1] 125 12
dim(sleep_train_raw)
## [1] 249 12
# Add test indicator for checking distribution balance
sleep_scaled <- sleep_scaledFrac %>%
mutate(test = ifelse(row_number() %in% test_index, "test", "train")) %>%
mutate(test = factor(test, levels = c("train", "test")))
# 5 Compare the proportion of Sleep Disorder outcomes
# Confirm similar outcome distribution
prop.table(table(sleep_train_scaled$Sleep.Disorder))
##
## Normal Abnormal
## 0.5783133 0.4216867
prop.table(table(sleep_test_scaled$Sleep.Disorder))
##
## Normal Abnormal
## 0.6 0.4
# Visual comparison of numeric (continuous) variables
numeric_vars <- sleep_scaled %>% select(where(is.numeric), test)
numeric_vars %>%
pivot_longer(cols = -test, names_to = "measure", values_to = "values") %>%
ggplot(aes(x = test, y = values)) +
facet_wrap(~measure, scales = "free") +
geom_boxplot(fill = "lightblue") +
labs(title = "Comparison of Numeric Variable Distributions",
x = "Data Subset", y = "Value") +
theme_minimal()
# Statistical check using t-tests
numeric_pvalues <- numeric_vars %>%
select(-test) %>%
summarise(across(everything(), ~ t.test(.x ~ sleep_scaled$test)$p.value))
numeric_pvalues
## Age Sleep.Duration Quality.of.Sleep Physical.Activity.Level
## 1 0.4263073 0.6507052 0.99251 0.2564439
## Stress.Level Systolic Heart.Rate Daily.Steps
## 1 0.659651 0.7122594 0.7029921 0.08799246
# Visual comparison of categorical variables
categorical_vars <- sleep_scaled %>% select(where(is.factor), test)
categorical_vars %>%
pivot_longer(cols = -test, names_to = "measure", values_to = "values") %>%
ggplot(aes(x = test, fill = values)) +
facet_wrap(~measure, scales = "free") +
geom_bar(position = "fill") +
labs(title = "Proportion of Factor Levels in Train vs Test",
x = "Data Subset", y = "Proportion") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Chi-square test for categorical balance
chi_square_p <- sleep_scaled %>%
count(Sleep.Disorder, test) %>%
pivot_wider(names_from = test, values_from = n, values_fill = 0) %>%
column_to_rownames("Sleep.Disorder") %>%
chisq.test()
chi_square_p$p.value
## [1] 0.771553
After executing this code, the dataset was successfully split into two balanced subsets that accurately reflect the overall population. The training dataset (sleep_train_scaled) contained 249 observations, while the testing dataset (sleep_test_scaled) contained 125 observations, confirming the intended two-thirds to one-third ratio. Identical dimensions were observed in the unscaled datasets used for Random Forest modeling, ensuring consistency across both scaled and raw data versions, and the dim() checks above confirm that the variable structure and factor assignments were preserved throughout the split.
The proportional comparison of Sleep.Disorder outcomes confirmed that the two subsets were nearly identical in class distribution, with approximately 57.8% Normal and 42.2% Abnormal cases in the training data and 60% Normal and 40% Abnormal cases in the test data. This small difference is well within acceptable variation for a random split. The chi-square test produced a p-value of 0.77, indicating no statistically significant difference between the outcome distributions in the two subsets. Visual inspection of the numeric variable boxplots revealed overlapping medians and interquartile ranges for variables such as Age, Sleep.Duration, and Stress.Level, suggesting that both groups share similar distributions. The t-test results further confirmed this balance, with all p-values exceeding 0.05, implying no significant differences between the subsets for any continuous variable. Likewise, the categorical bar plots displayed comparable proportions of factor levels across “train” and “test” subsets for predictors such as Gender, BMI.Category, and Occupation.
Together, these diagnostics confirm that the random sampling procedure produced statistically equivalent training and testing sets. The balanced structure ensures that any differences in future model performance will be due to model behavior rather than data imbalance, making this a valid and representative split for the subsequent machine learning analyses.
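Had the random split produced a noticeably unbalanced outcome distribution, a stratified split would be the standard remedy. The sketch below shows one way to do this with caret's createDataPartition(), which samples within each Sleep.Disorder class; it is an alternative offered for reference, not the split used in this analysis.
# Alternative (not used above): stratified split that preserves class proportions
set.seed(123)
strat_index <- createDataPartition(sleep_scaled$Sleep.Disorder, p = 2/3, list = FALSE)
strat_train <- sleep_scaled[strat_index, ]
strat_test <- sleep_scaled[-strat_index, ]
prop.table(table(strat_train$Sleep.Disorder)) # mirrors the full-data proportions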
In this step, I will construct and refine a logistic regression model to predict the likelihood of a participant having an abnormal sleep disorder based on various demographic, behavioral, and physiological predictors. Logistic regression is suitable here because the response variable Sleep.Disorder is binary (“Normal” vs. “Abnormal”), and the model will estimate the probability of belonging to the abnormal category. I will begin by fitting a baseline model using the glm() function with the binomial family, specifying all relevant predictors from the scaled training dataset. The initial predictors will include Age, Gender, Sleep.Duration, Quality.of.Sleep, Physical.Activity.Level, Stress.Level, BMI.Category, Heart.Rate, and Daily.Steps.
Before proceeding, I should note that the Occupation variable will not be included in the model because it contains multiple factor levels, which can lead to convergence issues and unstable coefficient estimates in logistic regression. If included, R typically issues a warning such as “fitted probabilities numerically 0 or 1 occurred” or fails to calculate some coefficients due to perfect separation — both of which reduce model reliability. Once the initial model is built, I will examine the coefficient summary output to identify statistically significant predictors and assess their direction and magnitude of effect.
Next, I will use the step() function to perform stepwise model optimization in both forward and backward directions based on the Akaike Information Criterion, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L the maximized likelihood. This procedure iteratively removes or adds predictors to find the combination that yields the lowest AIC, which represents the best trade-off between model complexity and goodness of fit: each retained parameter must "pay for itself" with a deviance reduction of at least 2. After optimization, I will summarize the refined model and compare its deviance, AIC, and predictor significance against the initial model to determine which version performs better. This process ensures that the final GLM is both parsimonious and statistically robust.
# 6. Build logistic regression models and perform AIC optimization
# Initial logistic regression model (excluding Occupation for stability)
sleep.glm <- glm(
Sleep.Disorder ~ Age + Gender + Sleep.Duration + Quality.of.Sleep +
Physical.Activity.Level + Stress.Level + BMI.Category +
Heart.Rate + Daily.Steps,
data = sleep_train_scaled,
family = binomial
)
# View the model summary
summary(sleep.glm)
##
## Call:
## glm(formula = Sleep.Disorder ~ Age + Gender + Sleep.Duration +
## Quality.of.Sleep + Physical.Activity.Level + Stress.Level +
## BMI.Category + Heart.Rate + Daily.Steps, family = binomial,
## data = sleep_train_scaled)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.3797 10.8747 0.863 0.38840
## Age 6.2142 3.6403 1.707 0.08781 .
## GenderMale -0.2141 0.7436 -0.288 0.77342
## Sleep.Duration -6.6453 7.3810 -0.900 0.36795
## Quality.of.Sleep -13.3401 6.2853 -2.122 0.03380 *
## Physical.Activity.Level 2.1826 2.6771 0.815 0.41491
## Stress.Level -5.9337 4.0800 -1.454 0.14585
## BMI.CategoryUnhealthy 2.4705 0.8540 2.893 0.00382 **
## Heart.Rate 6.0544 11.9048 0.509 0.61105
## Daily.Steps -1.7712 3.7132 -0.477 0.63336
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 339.05 on 248 degrees of freedom
## Residual deviance: 142.73 on 239 degrees of freedom
## AIC: 162.73
##
## Number of Fisher Scoring iterations: 5
# Stepwise model optimization using AIC (both directions: forward and backward)
sleep.glmAIC <- step(sleep.glm, direction = "both")
## Start: AIC=162.73
## Sleep.Disorder ~ Age + Gender + Sleep.Duration + Quality.of.Sleep +
## Physical.Activity.Level + Stress.Level + BMI.Category + Heart.Rate +
## Daily.Steps
##
## Df Deviance AIC
## - Gender 1 142.82 160.82
## - Daily.Steps 1 142.97 160.97
## - Heart.Rate 1 142.99 160.99
## - Physical.Activity.Level 1 143.44 161.44
## - Sleep.Duration 1 143.60 161.60
## <none> 142.73 162.73
## - Stress.Level 1 144.78 162.78
## - Age 1 145.66 163.66
## - Quality.of.Sleep 1 147.23 165.23
## - BMI.Category 1 151.20 169.20
##
## Step: AIC=160.82
## Sleep.Disorder ~ Age + Sleep.Duration + Quality.of.Sleep + Physical.Activity.Level +
## Stress.Level + BMI.Category + Heart.Rate + Daily.Steps
##
## Df Deviance AIC
## - Daily.Steps 1 143.01 159.01
## - Heart.Rate 1 143.36 159.36
## - Physical.Activity.Level 1 143.45 159.45
## - Sleep.Duration 1 144.11 160.11
## <none> 142.82 160.82
## + Gender 1 142.73 162.73
## - Age 1 147.35 163.35
## - Quality.of.Sleep 1 147.69 163.69
## - Stress.Level 1 147.72 163.72
## - BMI.Category 1 151.90 167.90
##
## Step: AIC=159.01
## Sleep.Disorder ~ Age + Sleep.Duration + Quality.of.Sleep + Physical.Activity.Level +
## Stress.Level + BMI.Category + Heart.Rate
##
## Df Deviance AIC
## - Physical.Activity.Level 1 143.72 157.72
## - Sleep.Duration 1 144.14 158.14
## - Heart.Rate 1 144.41 158.41
## <none> 143.01 159.01
## + Daily.Steps 1 142.82 160.82
## + Gender 1 142.97 160.97
## - Age 1 147.47 161.47
## - Quality.of.Sleep 1 148.09 162.09
## - Stress.Level 1 149.35 163.35
## - BMI.Category 1 153.28 167.28
##
## Step: AIC=157.72
## Sleep.Disorder ~ Age + Sleep.Duration + Quality.of.Sleep + Stress.Level +
## BMI.Category + Heart.Rate
##
## Df Deviance AIC
## - Sleep.Duration 1 144.63 156.63
## <none> 143.72 157.72
## - Heart.Rate 1 145.82 157.82
## + Physical.Activity.Level 1 143.01 159.01
## + Daily.Steps 1 143.45 159.45
## + Gender 1 143.70 159.70
## - Age 1 147.78 159.78
## - Quality.of.Sleep 1 148.10 160.10
## - Stress.Level 1 149.41 161.41
## - BMI.Category 1 155.60 167.60
##
## Step: AIC=156.63
## Sleep.Disorder ~ Age + Quality.of.Sleep + Stress.Level + BMI.Category +
## Heart.Rate
##
## Df Deviance AIC
## - Heart.Rate 1 146.05 156.05
## <none> 144.63 156.63
## + Sleep.Duration 1 143.72 157.72
## + Physical.Activity.Level 1 144.14 158.14
## + Gender 1 144.32 158.32
## + Daily.Steps 1 144.34 158.34
## - Age 1 148.44 158.44
## - Stress.Level 1 149.43 159.43
## - Quality.of.Sleep 1 153.07 163.07
## - BMI.Category 1 162.61 172.61
##
## Step: AIC=156.05
## Sleep.Disorder ~ Age + Quality.of.Sleep + Stress.Level + BMI.Category
##
## Df Deviance AIC
## <none> 146.05 156.05
## + Heart.Rate 1 144.63 156.63
## + Physical.Activity.Level 1 144.99 156.99
## + Gender 1 145.40 157.40
## - Stress.Level 1 149.69 157.69
## + Sleep.Duration 1 145.82 157.82
## + Daily.Steps 1 145.89 157.89
## - Age 1 150.24 158.24
## - Quality.of.Sleep 1 155.18 163.18
## - BMI.Category 1 168.04 176.04
# View the optimized model summary
summary(sleep.glmAIC)
##
## Call:
## glm(formula = Sleep.Disorder ~ Age + Quality.of.Sleep + Stress.Level +
## BMI.Category, family = binomial, data = sleep_train_scaled)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.858 5.115 1.927 0.05392 .
## Age 5.815 2.763 2.105 0.03533 *
## Quality.of.Sleep -15.441 5.151 -2.998 0.00272 **
## Stress.Level -4.741 2.556 -1.855 0.06364 .
## BMI.CategoryUnhealthy 2.970 0.692 4.292 1.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 339.05 on 248 degrees of freedom
## Residual deviance: 146.05 on 244 degrees of freedom
## AIC: 156.05
##
## Number of Fisher Scoring iterations: 5
The initial logistic regression model successfully converged and identified several meaningful predictors of sleep disorders. In this baseline model, the predictors Quality.of.Sleep (p = 0.034) and BMI.Category (p = 0.004) emerged as statistically significant, while Age (p = 0.088) was borderline significant. The positive coefficient for BMI.CategoryUnhealthy (2.47) indicates that individuals with an unhealthy BMI had substantially higher odds of experiencing a sleep disorder compared to those in the healthy range. Conversely, Quality.of.Sleep (-13.34) showed a strong negative association, meaning that better self-reported sleep quality was associated with a much lower probability of abnormal sleep outcomes. The overall model achieved an AIC of 162.73 and a residual deviance of 142.73, suggesting a relatively good model fit, although several predictors (e.g., Heart.Rate, Daily.Steps, Physical.Activity.Level, and Gender) contributed little explanatory power and were candidates for removal during model optimization.
The stepwise AIC optimization process systematically evaluated different combinations of predictors, adding and removing variables to minimize AIC and improve model parsimony. Through iterative testing, non-contributory predictors were eliminated, ultimately yielding a refined model containing only Age, Quality.of.Sleep, Stress.Level, and BMI.Category. The final optimized model achieved a lower AIC of 156.05, indicating improved efficiency and a better balance between simplicity and explanatory strength. Within this optimized model, Quality.of.Sleep (p = 0.0027) and BMI.CategoryUnhealthy (p < 0.001) remained highly significant predictors, reinforcing that better sleep quality substantially decreases, and unhealthy BMI markedly increases, the likelihood of having a sleep disorder. Age (p = 0.035) also became a significant predictor, suggesting that older individuals face greater risk. Interestingly, Stress.Level (-4.74, p = 0.064) had a marginally significant negative coefficient—an unexpected finding implying that, after accounting for other variables such as sleep quality, higher stress appeared slightly associated with lower odds of being classified as having a sleep disorder. This counterintuitive relationship may reflect overlapping variance between stress and quality of sleep, where much of stress’s direct effect on sleep is already captured by the quality metric.
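Because logistic regression coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are easier to read. The snippet below is an interpretation aid, assuming the sleep.glmAIC object fitted above; recall that the numeric predictors were scaled by their maxima, so each odds ratio corresponds to moving across a predictor's full observed range rather than one raw unit.
# Odds ratios for the AIC-optimized model
round(exp(coef(sleep.glmAIC)), 2)
# e.g., exp(2.970) is roughly 19.5: an unhealthy BMI multiplies the odds of an
# abnormal sleep outcome about twentyfold, holding the other predictors constant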
Overall, the stepwise optimization improved the model’s statistical parsimony without sacrificing explanatory accuracy. The final AIC-optimized GLM indicates that age, sleep quality, stress level, and BMI category are the most important predictors of sleep disorder risk in this dataset, aligning well with known physiological and behavioral relationships observed in sleep research.
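The AIC values reported by step() can also be verified by hand. For a binomial GLM fit to binary outcomes, the residual deviance equals -2 times the maximized log-likelihood, so AIC = residual deviance + 2k, where k is the number of estimated coefficients. A quick check, assuming the fitted objects above:
# AIC = residual deviance + 2 * (number of coefficients)
deviance(sleep.glm) + 2 * length(coef(sleep.glm)) # 142.73 + 2*10 = 162.73
deviance(sleep.glmAIC) + 2 * length(coef(sleep.glmAIC)) # 146.05 + 2*5 = 156.05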
In this step, I will quantitatively evaluate the performance of both logistic regression models, the original GLM and the stepwise AIC-optimized GLM, and summarize their results in a single comparison table. To begin, I will generate predicted probabilities for the testing dataset using the predict() function with the argument type = “response”, which returns the estimated probability that each observation belongs to the “Abnormal” sleep disorder category. These predicted probabilities will then be converted into binary class predictions using a threshold of 0.5 with the ifelse() function, such that probabilities greater than 0.5 are labeled as “Abnormal,” and those equal to or below 0.5 are labeled as “Normal.”
To evaluate model performance, I will define a custom function called class.summary(), which calculates several important classification metrics. This function first constructs a confusion matrix (using table(y, pred)) to compare actual versus predicted labels. From this, it computes accuracy (acc), which represents the proportion of correctly classified cases, and the misclassification rate (mis), which reflects the proportion of incorrect predictions. It also calculates deviance (dev), a measure of model fit derived from the log-likelihood where lower values indicate a better-fitting model. In addition, the function computes sensitivity (sens), or the true positive rate, representing the proportion of “Abnormal” cases correctly identified, and specificity (speci), or the true negative rate, representing the proportion of “Normal” cases correctly identified. For completeness, it also reports the false positive and false negative rates. To prevent computational issues when taking logarithms, the function slightly adjusts any extreme predicted probabilities of 0 or 1.
Using this custom function, I will compute all these performance metrics for both the original and AIC-optimized GLM models. I will then calculate the Area Under the ROC Curve (AUC) using the auc() function to evaluate each model’s ability to distinguish between normal and abnormal sleep outcomes across all possible probability thresholds. Finally, all results, including accuracy, misclassification rate, deviance, sensitivity, specificity, and AUC, will be organized into a clear comparison table (glm_table) where rows represent metrics and columns represent each model. This summary table will make it easy to assess whether the stepwise AIC optimization improved model performance or simply reduced model complexity without compromising predictive accuracy.
# 7 Table to summarize model
# Original GLM
pred_glm <- predict(sleep.glm, newdata = sleep_test_scaled, type = "response")
pred_glm_class <- factor(ifelse(pred_glm > 0.5, "Abnormal", "Normal"),
levels = c("Normal", "Abnormal"))
# class.summary() function
class.summary <- function(y, pred, pred.prob){
tab <- table(y=y, pred=pred) # Confusion matrix
acc <- (sum(diag(tab))) / sum(tab) # Accuracy rate
mis <- (sum(tab) - sum(diag(tab)))/sum(tab) # Misclassification Rate
# Deviance
if(any(pred.prob == 0))
pred.prob[pred.prob == 0] <- 0.001
if(any(pred.prob == 1))
pred.prob[pred.prob == 1] <- 1 - 0.001
dev <- -2*sum((y=="Abnormal")*log(pred.prob) + (1-(y=="Abnormal"))*log(1-pred.prob))
#sensitivity and specificity
conf <- tab # extract confusion matrix
sens <- conf[2,2]/sum(conf[2,]) # sensitivity
speci <- conf[1,1]/sum(conf[1,]) # specificity
fp <- 1 - speci # false positive rate
fn <- 1 - sens # false negative rate
return(list(conf.tab=tab, acc=acc, mis=mis, dev=dev, sens=sens, speci=speci, fp=fp, fn=fn))
}
# Use class.summary() function
sumry_glm <- class.summary(y = sleep_test_scaled$Sleep.Disorder,
pred = pred_glm_class,
pred.prob = pred_glm)
# Calculate AUC
auc_glm <- auc(sleep_test_scaled$Sleep.Disorder, pred_glm)
## Setting levels: control = Normal, case = Abnormal
## Setting direction: controls < cases
# Stepwise AIC GLM
pred_glmAIC <- predict(sleep.glmAIC, newdata = sleep_test_scaled, type = "response")
pred_glmAIC_class <- factor(ifelse(pred_glmAIC > 0.5, "Abnormal", "Normal"),
levels = c("Normal", "Abnormal"))
sumry_glmAIC <- class.summary(y = sleep_test_scaled$Sleep.Disorder,
pred = pred_glmAIC_class,
pred.prob = pred_glmAIC)
auc_glmAIC <- auc(sleep_test_scaled$Sleep.Disorder, pred_glmAIC)
## Setting levels: control = Normal, case = Abnormal
## Setting direction: controls < cases
# Combine into a table (metrics as rows, models as columns)
glm_table <- data.frame(
GLM = round(c(sumry_glm$acc,
sumry_glm$mis,
sumry_glm$dev,
sumry_glm$sens,
sumry_glm$speci,
auc_glm), 3),
GLM_AIC = round(c(sumry_glmAIC$acc,
sumry_glmAIC$mis,
sumry_glmAIC$dev,
sumry_glmAIC$sens,
sumry_glmAIC$speci,
auc_glmAIC), 3)
)
# Add row names
row.names(glm_table) <- c("acc", "mis", "dev", "sens", "speci", "AUC")
# View the table
glm_table
## GLM GLM_AIC
## acc 0.920 0.920
## mis 0.080 0.080
## dev 70.221 70.996
## sens 0.960 0.960
## speci 0.893 0.893
## AUC 0.927 0.931
After running this code, both logistic regression models displayed excellent predictive performance on the test dataset, with nearly identical results across all metrics. The accuracy (acc) of both the base and AIC-optimized GLMs was 0.920, meaning each correctly classified 92% of cases, while the misclassification rate (mis) was correspondingly low at 0.080. Both models produced similar deviance values—approximately 70.22 for the base GLM and 70.99 for the AIC-optimized version—indicating comparable model fit and confirming that simplifying the model did not substantially degrade predictive quality.
The sensitivity (sens) of 0.960 demonstrates that 96% of individuals with an abnormal sleep disorder were correctly identified, while the specificity (speci) of 0.893 indicates that roughly 89% of normal cases were correctly classified as normal. These balanced sensitivity and specificity values suggest that the model performs well for both positive and negative cases without bias toward either class. The AUC values further reinforce this conclusion: 0.927 for the original GLM and 0.931 for the AIC-optimized model, both of which fall well within the “excellent” range (AUC > 0.9).
Together, these results confirm that both versions of the logistic regression model are highly accurate and stable predictors of sleep disorder status. The stepwise AIC optimization marginally improved the model’s efficiency—achieving the same predictive power with fewer predictors—while maintaining identical accuracy and sensitivity. This balance between parsimony and performance makes the optimized GLM a strong and interpretable choice for subsequent model comparisons.
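As a cross-check on the hand-rolled class.summary() metrics, caret's confusionMatrix() computes the same accuracy, sensitivity, and specificity from the predicted classes. A minimal sketch, assuming the prediction objects above; note that caret needs the positive class named explicitly:
# Cross-check the custom metrics against caret
confusionMatrix(data = pred_glmAIC_class,
reference = sleep_test_scaled$Sleep.Disorder,
positive = "Abnormal")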
Week 2 of the assignment.
In this step, I will create and compare several Random Forest models to find the best settings for predicting whether a person has a sleep disorder. Random Forest builds many decision trees on random samples of the data and combines their predictions to improve accuracy and avoid overfitting. I will adjust two main settings: mtry, the number of predictor variables considered at each split, and nodesize, the minimum number of cases allowed in a terminal node. To find the best combination, I will fit four models using mtry values of 2 or 4 and nodesize values of 5 or 10. The randomForest() call will also record each variable's importance (importance = TRUE) and, via proximity = TRUE, how often pairs of cases land in the same terminal node. After training, I will look at the confusion matrix for each model to check accuracy and use varImpPlot() to see which variables are most important.
The varImpPlot() function shows a ranked chart of each variable's contribution to the model using two measures. The first, Mean Decrease in Accuracy, shows how much out-of-bag accuracy drops when the variable's values are randomly permuted, with higher numbers meaning the variable carries more predictive information. The second, Mean Decrease in Gini, shows how much the variable improves the purity of splits in the trees, with higher numbers meaning it is more useful. Variables that score well on both measures are considered important predictors. I expect most variables to add useful information to the model.
Once the models are trained, I will evaluate them using accuracy, sensitivity, specificity, and AUC with the class.summary() function. From previous tests, the model with mtry of 2 and nodesize of 5 is expected to perform best. I will also create a version of this model without the Gender variable, which showed little importance before, to see if removing it makes the model simpler without reducing performance. Finally, I will combine the results for all models, including GLM, stepwise GLM, the best Random Forest, and the Random Forest without Gender, into a single table for comparison.
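As an aside, the randomForest package also provides tuneRF(), which automates the mtry search using out-of-bag error. The sketch below shows the idea under the same training data; it is offered for reference, while the manual grid that follows is what this analysis actually uses.
# Alternative (not used below): automated mtry search driven by OOB error
set.seed(682)
tuneRF(x = sleep_train_raw %>% select(-Sleep.Disorder),
y = sleep_train_raw$Sleep.Disorder,
mtryStart = 2, ntreeTry = 200, stepFactor = 2, improve = 0.01)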
# 8 Random Forest Optimization
# We will manually try several Random Forest models by changing the mtry (number of variables randomly sampled per split) and the nodesize (minimum samples per leaf).
set.seed(682)
dim(sleep_train_raw)
## [1] 249 12
# Model 1: mtry = 2, nodesize = 5
sleep.rf1 <- randomForest(Sleep.Disorder ~ .,
data = sleep_train_raw,
mtry = 2,
nodesize = 5,
importance = TRUE,
proximity = TRUE,
na.action = na.omit)
print(sleep.rf1$confusion)
## Normal Abnormal class.error
## Normal 136 8 0.05555556
## Abnormal 10 95 0.09523810
varImpPlot(sleep.rf1, main = "Random Forest: mtry=2, nodesize=5")
# Model 2: mtry = 2, nodesize = 10
set.seed(682)
sleep.rf2 <- randomForest(Sleep.Disorder ~ .,
data = sleep_train_raw,
mtry = 2,
nodesize = 10,
importance = TRUE,
proximity = TRUE,
na.action = na.omit)
print(sleep.rf2$confusion)
## Normal Abnormal class.error
## Normal 135 9 0.0625000
## Abnormal 10 95 0.0952381
varImpPlot(sleep.rf2, main = "Random Forest: mtry=2, nodesize=10")
# Model 3: mtry = 4, nodesize = 5
set.seed(682)
sleep.rf3 <- randomForest(Sleep.Disorder ~ .,
data = sleep_train_raw,
mtry = 4,
nodesize = 5,
importance = TRUE,
proximity = TRUE,
na.action = na.omit)
print(sleep.rf3$confusion)
## Normal Abnormal class.error
## Normal 135 9 0.0625000
## Abnormal 10 95 0.0952381
varImpPlot(sleep.rf3, main = "Random Forest: mtry=4, nodesize=5")
# Model 4: mtry = 4, nodesize = 10
set.seed(682)
sleep.rf4 <- randomForest(Sleep.Disorder ~ .,
data = sleep_train_raw,
mtry = 4,
nodesize = 10,
importance = TRUE,
proximity = TRUE,
na.action = na.omit)
print(sleep.rf4$confusion)
## Normal Abnormal class.error
## Normal 135 9 0.0625000
## Abnormal 10 95 0.0952381
varImpPlot(sleep.rf4, main = "Random Forest: mtry=4, nodesize=10")
# Best Random Forest Model (mtry=2, nodesize=5)
sleep.rf.best <- sleep.rf1
print(sleep.rf.best$confusion)
## Normal Abnormal class.error
## Normal 136 8 0.05555556
## Abnormal 10 95 0.09523810
varImpPlot(sleep.rf.best, main = "Variable Importance - Best RF (mtry=2, nodesize=5)")
# Evaluate best Random Forest
rf_pred_best <- predict(sleep.rf.best, newdata = sleep_test_raw, type = "response")
rf_prob_best <- predict(sleep.rf.best, newdata = sleep_test_raw, type = "prob")[,2]
sumry_rf_best <- class.summary(y = sleep_test_raw$Sleep.Disorder,
pred = rf_pred_best,
pred.prob = rf_prob_best)
auc_rf_best <- auc(sleep_test_raw$Sleep.Disorder, rf_prob_best)
## Setting levels: control = Normal, case = Abnormal
## Setting direction: controls < cases
# Store metrics for table
Rnd_forest_best <- c(unlist(sumry_rf_best[2:6]), AUC = auc_rf_best)
Rnd_forest_best
## acc mis dev sens speci AUC
## 0.9680000 0.0320000 67.1571706 0.9600000 0.9733333 0.9448000
# Refine the model manually by removing Gender
sleep.rf.opt <- randomForest(Sleep.Disorder ~ Age + Sleep.Duration + Quality.of.Sleep +
Physical.Activity.Level + Stress.Level + BMI.Category + Heart.Rate + Daily.Steps,
data = sleep_train_raw,
mtry = 2,
nodesize = 5,
importance = TRUE,
proximity = TRUE,
na.action = na.omit)
print(sleep.rf.opt$confusion)
## Normal Abnormal class.error
## Normal 134 10 0.06944444
## Abnormal 10 95 0.09523810
varImpPlot(sleep.rf.opt, main = "Variable Importance - RF Without Gender")
# Evaluate optimized RF (no Gender)
rf_pred_opt <- predict(sleep.rf.opt, newdata = sleep_test_raw, type = "response")
rf_prob_opt <- predict(sleep.rf.opt, newdata = sleep_test_raw, type = "prob")[,2]
sumry_rf_opt <- class.summary(y = sleep_test_raw$Sleep.Disorder,
pred = rf_pred_opt,
pred.prob = rf_prob_opt)
auc_rf_opt <- auc(sleep_test_raw$Sleep.Disorder, rf_prob_opt)
## Setting levels: control = Normal, case = Abnormal
## Setting direction: controls < cases
# Store optimized Random Forest metrics for table
Rnd_forest_opt <- c(unlist(sumry_rf_opt[2:6]), AUC = auc_rf_opt)
Rnd_forest_opt
## acc mis dev sens speci AUC
## 0.9440000 0.0560000 72.9045911 0.9600000 0.9333333 0.9421333
# Make sure all model metrics are numeric vectors
glm_metrics <- round(c(sumry_glm$acc, sumry_glm$mis, sumry_glm$dev, sumry_glm$sens, sumry_glm$speci, auc_glm), 3)
glmAIC_metrics <- round(c(sumry_glmAIC$acc, sumry_glmAIC$mis, sumry_glmAIC$dev, sumry_glmAIC$sens, sumry_glmAIC$speci, auc_glmAIC), 3)
rf_best_metrics <- round(c(unlist(sumry_rf_best[2:6]), auc_rf_best), 3)
rf_opt_metrics <- round(c(unlist(sumry_rf_opt[2:6]), auc_rf_opt), 3)
# Combine into a table
glm_table <- data.frame(
GLM = glm_metrics,
GLM_AIC = glmAIC_metrics,
RF_Best = rf_best_metrics,
RF_NoGender = rf_opt_metrics
)
# Add row names
row.names(glm_table) <- c("acc", "mis", "dev", "sens", "speci", "AUC")
# View final comparison table
glm_table
## GLM GLM_AIC RF_Best RF_NoGender
## acc 0.920 0.920 0.968 0.944
## mis 0.080 0.080 0.032 0.056
## dev 70.221 70.996 67.157 72.905
## sens 0.960 0.960 0.960 0.960
## speci 0.893 0.893 0.973 0.933
## AUC 0.927 0.931 0.945 0.942
The Random Forest models performed well across all configurations, but the model with mtry = 2 and nodesize = 5 showed the best overall performance. It achieved an accuracy of 0.968, a misclassification rate of 0.032, and a deviance of 67.16 on the test set. Sensitivity and specificity were balanced and high, both around 0.96 to 0.97, showing that the model could effectively distinguish between normal and abnormal sleep patterns. The AUC value of 0.945 indicated excellent ability to discriminate between the two classes.
The variable importance plot produced by varImpPlot() clearly showed how much each predictor contributed to the model's performance. It reports two measures: Mean Decrease in Accuracy, which shows how much out-of-bag accuracy drops when a variable's values are randomly permuted, and Mean Decrease in Gini, which shows how strongly a variable improves the purity of splits across the trees. All predictors had positive importance scores on both measures, indicating that each added useful information. Among them, Quality of Sleep, Stress Level, and BMI Category had the highest scores, suggesting they were the most important predictors of sleep disorder risk.
After refining the model by removing Gender, which was consistently among the least important predictors, the optimized Random Forest (RF_NoGender) still performed well. It achieved an accuracy of 0.944, specificity of 0.933, sensitivity of 0.96, and AUC of 0.942. The slight drop in performance was minimal, showing that excluding Gender simplified the model without reducing predictive reliability.
When comparing all models, both Random Forest versions outperformed the standard GLM and stepwise AIC GLM, which each achieved about 0.92 accuracy and an AUC around 0.93. This indicates that Random Forest can capture nonlinear relationships and interactions more effectively. The variable importance measures also provided insight into which predictors most influenced the outcomes. Overall, the Random Forest with mtry = 2 and nodesize = 5 provided the best balance of accuracy, sensitivity, and simplicity, making it the preferred model.
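The rankings shown in the importance plots can also be read numerically via importance(), which returns the raw Mean Decrease in Accuracy and Mean Decrease in Gini values. A small sketch, assuming sleep.rf.best from above:
# Numeric view of variable importance, sorted by accuracy contribution
imp <- importance(sleep.rf.best)
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]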
Next, I will build and evaluate a Naive Bayes classifier to predict whether an individual has a sleep disorder, and then compare its performance to the GLM, AIC-optimized GLM, and Random Forest models. Naive Bayes is based on Bayes’ theorem and assumes that all predictors are independent once we know the target class. In practice, this assumption is rarely perfect, but the model often works well because it captures the main patterns in the data and handles mild correlations without much issue.
Using the naiveBayes() function from the e1071 package, I will predict Sleep.Disorder using variables such as Age, Gender, Sleep Duration, Quality of Sleep, Physical Activity Level, Stress Level, BMI Category, Heart Rate, and Daily Steps. The argument laplace = 1 applies Laplace smoothing, which helps prevent problems when a categorical predictor includes levels not seen in the training data and also makes the probability estimates more stable.
After training the model on the scaled training dataset, I will generate predictions for both the class labels (“Normal” vs “Abnormal”) and the probabilities of being in the “Abnormal” group. The model’s performance will be summarized using the class.summary() function, which reports metrics such as accuracy, misclassification rate, deviance, sensitivity, and specificity. I will also calculate the AUC using the auc() function to measure how well the model can distinguish between individuals with and without a sleep disorder.
Finally, I will combine the Naive Bayes results with those of the GLM and Random Forest models in a single comparison table. I will also visualize the predictive performance of each model using ROC curves. Each ROC curve plots the false-positive rate (1 – specificity) on the x-axis and the true-positive rate (sensitivity) on the y-axis, providing a clear visual comparison of how well each model balances correct and incorrect classifications.
# 9 Build Naive Bayes model
nb_model <- naiveBayes(Sleep.Disorder ~ Age + Gender + Sleep.Duration + Quality.of.Sleep +
Physical.Activity.Level + Stress.Level + BMI.Category +
Heart.Rate + Daily.Steps,
data = sleep_train_scaled,
laplace = 1)
# Predict class labels
nb_pred_class <- predict(nb_model, newdata = sleep_test_scaled)
# Predict probabilities
nb_pred_prob <- predict(nb_model, newdata = sleep_test_scaled, type = "raw")[,2]
# Compute summary metrics
sumry_nb <- class.summary(y = sleep_test_scaled$Sleep.Disorder,
pred = nb_pred_class,
pred.prob = nb_pred_prob)
# Compute AUC
auc_nb <- auc(sleep_test_scaled$Sleep.Disorder, nb_pred_prob)
## Setting levels: control = Normal, case = Abnormal
## Setting direction: controls < cases
# Store in vector for final table
nb_metrics <- round(c(unlist(sumry_nb[2:6]), auc_nb), 3)
# 10 Compare and contrast models
# Combine all models into a single table
# After all model evaluations
final_table <- data.frame(
GLM = round(c(sumry_glm$acc, sumry_glm$mis, sumry_glm$dev, sumry_glm$sens, sumry_glm$speci, auc_glm), 3),
GLM_AIC = round(c(sumry_glmAIC$acc, sumry_glmAIC$mis, sumry_glmAIC$dev, sumry_glmAIC$sens, sumry_glmAIC$speci, auc_glmAIC), 3),
Random_Forest_Best = round(c(unlist(sumry_rf_best[2:6]), auc_rf_best), 3),
Random_Forest_NoGender = round(c(unlist(sumry_rf_opt[2:6]), auc_rf_opt), 3),
Naive_Bayes = round(c(unlist(sumry_nb[2:6]), auc_nb), 3)
)
# Assign row names to match metrics
row.names(final_table) <- c("acc", "mis", "dev", "sens", "speci", "AUC")
# Round for display
final_table
## GLM GLM_AIC Random_Forest_Best Random_Forest_NoGender Naive_Bayes
## acc 0.920 0.920 0.968 0.944 0.928
## mis 0.080 0.080 0.032 0.056 0.072
## dev 70.221 70.996 67.157 72.905 95.279
## sens 0.960 0.960 0.960 0.960 0.960
## speci 0.893 0.893 0.973 0.933 0.907
## AUC 0.927 0.931 0.945 0.942 0.938
# 11 ROC Curve Comparison
# Generate ROC objects for each model
roc_glm <- roc(sleep_test_scaled$Sleep.Disorder, pred_glm)
## Setting levels: control = Normal, case = Abnormal
## Setting direction: controls < cases
roc_glmAIC <- roc(sleep_test_scaled$Sleep.Disorder, pred_glmAIC)
## Setting levels: control = Normal, case = Abnormal
## Setting direction: controls < cases
roc_rf <- roc(sleep_test_raw$Sleep.Disorder, rf_prob_best)
## Setting levels: control = Normal, case = Abnormal
## Setting direction: controls < cases
roc_rf_noGender <- roc(sleep_test_raw$Sleep.Disorder, rf_prob_opt)
## Setting levels: control = Normal, case = Abnormal
## Setting direction: controls < cases
roc_nb <- roc(sleep_test_scaled$Sleep.Disorder, nb_pred_prob)
## Setting levels: control = Normal, case = Abnormal
## Setting direction: controls < cases
# Plot all ROC curves together
plot(roc_glm,
col = "black",
lwd = 2, # line width (thicker lines)
main = "ROC Curve Comparison: GLM vs GLM_AIC vs RF vs Naive Bayes",
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
lines(roc_glmAIC, col = "blue", lwd = 2)
lines(roc_rf, col = "darkgreen", lwd = 2)
lines(roc_rf_noGender, col = "purple", lwd = 2)
lines(roc_nb, col = "red", lwd = 2)
# Add diagonal reference line (chance level)
abline(a = 0, b = 1, lty = 2, col = "gray")
#Add legend with AUC values
legend("bottomright",
legend = c(
paste0("GLM (AUC = ", round(auc_glm, 3), ")"),
paste0("GLM AIC (AUC = ", round(auc_glmAIC, 3), ")"),
paste0("Random Forest (AUC = ", round(auc_rf_best, 3), ")"),
paste0("RF No Gender (AUC = ", round(auc_rf_opt, 3), ")"),
paste0("Naive Bayes (AUC = ", round(auc_nb, 3), ")")
),
col = c("black", "blue", "darkgreen", "purple", "red"),
lwd = 2,
cex = 0.8)
The Naive Bayes classifier produced strong and consistent results, achieving an accuracy of 0.928, a misclassification rate of 0.072, and an AUC of 0.938. This indicates that the model was able to reliably distinguish between normal and abnormal sleep classes. Both sensitivity (0.96) and specificity (0.907) were high and well balanced, showing that the model effectively identified both true sleep-disorder cases and true normal sleepers. Because Naive Bayes calculates probabilities based on the contribution of each predictor to the overall likelihood of the class, these results suggest that variables such as Quality of Sleep, Stress Level, and BMI Category were the most influential, which aligns with what the GLM and Random Forest analyses also indicated.
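That reading can be checked against the fitted object itself: naiveBayes() stores its class-conditional distributions in nb_model$tables, with a per-class mean and standard deviation for each numeric predictor and a conditional probability table for each factor. A brief inspection, assuming nb_model from above:
# Class-conditional mean (column 1) and sd (column 2) for numeric predictors
nb_model$tables$Quality.of.Sleep
nb_model$tables$Stress.Level
# Conditional probabilities for a factor predictor
nb_model$tables$BMI.Category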
The final comparison table provides a clear summary of all models’ performance metrics. The GLM and stepwise AIC GLM both reached an accuracy of 0.920, with AUCs around 0.927–0.931. The Random Forest Best model achieved the highest accuracy (0.968) and largest AUC (0.945), followed by the Random Forest model without Gender (accuracy 0.944, AUC 0.942). The Naive Bayes model placed behind both Random Forest variants but ahead of the GLMs (accuracy 0.928, AUC 0.938), demonstrating that even a relatively simple probabilistic approach can perform very well when features are informative and appropriately scaled. Across all models, the Random Forest with mtry = 2 and nodesize = 5 offered the best combination of accuracy and discrimination, while the GLM models performed solidly but were less able to capture nonlinear relationships compared with the Random Forest.
The ROC-curve comparison plot reinforces these findings. All models produced curves that rose well above the diagonal line representing random guessing, with the x-axis showing the false-positive rate (1 – specificity) and the y-axis showing the true-positive rate (sensitivity). The Random Forest Best model’s curve stayed closest to the top-left corner across most thresholds, reflecting the best balance between sensitivity and specificity and the strongest overall performance. The Naive Bayes and GLM_AIC curves followed closely, showing good discrimination, though slightly below the Random Forest, indicating a modest reduction in predictive power.
In summary, all models performed far above random chance, but the Random Forest Best model provided the most accurate and reliable predictions. The Naive Bayes classifier offered a simpler and interpretable alternative with only a small decrease in performance. These results highlight that ensemble methods can effectively capture complex relationships, while probabilistic models like Naive Bayes remain valuable for their transparency and computational efficiency.