R Markdown

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(naniar)
library(readr)
library(ggplot2)
library(broom)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyr)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(mctest)
library(gmodels)
library(skimr)
## 
## Attaching package: 'skimr'
## 
## The following object is masked from 'package:naniar':
## 
##     n_complete
library(corrplot)
## corrplot 0.95 loaded

Please place the Sleep_health_and_lifestyle_dataset.csv file in the current working directory in order to load the dataset.

sleep_data <- read.csv("./Sleep_health_and_lifestyle_dataset.csv") #load from the cwd
head(sleep_data)
##   Person.ID Gender Age           Occupation Sleep.Duration Quality.of.Sleep
## 1         1   Male  27    Software Engineer            6.1                6
## 2         2   Male  28               Doctor            6.2                6
## 3         3   Male  28               Doctor            6.2                6
## 4         4   Male  28 Sales Representative            5.9                4
## 5         5   Male  28 Sales Representative            5.9                4
## 6         6   Male  28    Software Engineer            5.9                4
##   Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1                      42            6   Overweight         126/83         77
## 2                      60            8       Normal         125/80         75
## 3                      60            8       Normal         125/80         75
## 4                      30            8        Obese         140/90         85
## 5                      30            8        Obese         140/90         85
## 6                      30            8        Obese         140/90         85
##   Daily.Steps Sleep.Disorder
## 1        4200           None
## 2       10000           None
## 3       10000           None
## 4        3000    Sleep Apnea
## 5        3000    Sleep Apnea
## 6        3000       Insomnia
View(sleep_data)

This dataset is taken from Kaggle.com. It is stated by the author Laksika Tharmalingam that this dataset is synthetic and is generated by her. It is not sourced from any original individual or entity.

Dataset description:

  1. Person ID: An identifier for each individual.
  2. Gender: The gender of the person (Male/Female).
  3. Age: The age of the person in years.
  4. Occupation: The occupation or profession of the person.
  5. Sleep Duration (hours): The number of hours the person sleeps per day.
  6. Quality of Sleep (scale: 1-10): A subjective rating of the quality of sleep, ranging from 1 to 10.
  7. Physical Activity Level (minutes/day): The number of minutes the person engages in physical activity daily.
  8. Stress Level (scale: 1-10): A subjective rating of the stress level experienced by the person, ranging from 1 to 10.
  9. BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight).
  10. Blood Pressure (systolic/diastolic): The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.
  11. Heart Rate (bpm): The resting heart rate of the person in beats per minute.
  12. Daily Steps: The number of steps the person takes per day.
  13. Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).
pct_complete(sleep_data) 
## [1] 100

This dataset had no missing values.

dim(sleep_data)
## [1] 374  13
summary(sleep_data)
##    Person.ID         Gender               Age         Occupation       
##  Min.   :  1.00   Length:374         Min.   :27.00   Length:374        
##  1st Qu.: 94.25   Class :character   1st Qu.:35.25   Class :character  
##  Median :187.50   Mode  :character   Median :43.00   Mode  :character  
##  Mean   :187.50                      Mean   :42.18                     
##  3rd Qu.:280.75                      3rd Qu.:50.00                     
##  Max.   :374.00                      Max.   :59.00                     
##  Sleep.Duration  Quality.of.Sleep Physical.Activity.Level  Stress.Level  
##  Min.   :5.800   Min.   :4.000    Min.   :30.00           Min.   :3.000  
##  1st Qu.:6.400   1st Qu.:6.000    1st Qu.:45.00           1st Qu.:4.000  
##  Median :7.200   Median :7.000    Median :60.00           Median :5.000  
##  Mean   :7.132   Mean   :7.313    Mean   :59.17           Mean   :5.385  
##  3rd Qu.:7.800   3rd Qu.:8.000    3rd Qu.:75.00           3rd Qu.:7.000  
##  Max.   :8.500   Max.   :9.000    Max.   :90.00           Max.   :8.000  
##  BMI.Category       Blood.Pressure       Heart.Rate     Daily.Steps   
##  Length:374         Length:374         Min.   :65.00   Min.   : 3000  
##  Class :character   Class :character   1st Qu.:68.00   1st Qu.: 5600  
##  Mode  :character   Mode  :character   Median :70.00   Median : 7000  
##                                        Mean   :70.17   Mean   : 6817  
##                                        3rd Qu.:72.00   3rd Qu.: 8000  
##                                        Max.   :86.00   Max.   :10000  
##  Sleep.Disorder    
##  Length:374        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

There are 374 observations and 13 variables in this dataset.

First, we will analyze the dataset by exploring continuous variables using histograms and descriptive statistics. Next, we will examine categorical variables through frequency tables and pie charts.

ggplot use

ggplot(sleep_data, aes(x = Age)) +
  geom_histogram(binwidth = 3, fill = "lightgreen", color = "black", alpha = 0.9) +  
   scale_x_continuous(  
    limits = c(20, 60)  
  ) +
  labs(title = "Distribution of Age", x = "Age", y = "Frequency") + 
  theme_minimal() +  
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),  
        axis.title = element_text(size = 12)) 
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

summary(sleep_data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.00   35.25   43.00   42.18   50.00   59.00

The histogram indicates that the age distribution in this dataset appears to be approximately normal, with a mean age of 42 and a median age of 43. The youngest patient is 27 years old, while the oldest is 59. The close alignment of the mean and median suggests an absence of significant outliers in the data.

hist(sleep_data$Sleep.Duration, breaks = 10, main= "Sleep Duration")

summary(sleep_data$Sleep.Duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.800   6.400   7.200   7.132   7.800   8.500

Sleep duration results seem to be normally distributed with a slight skew to the left. The average sleep duration was 7, with the maximum of 8.5 and minimum of 5.8.

hist(sleep_data$Stress.Level, breaks=5, main= "Stress level")

summary(sleep_data$Stress.Level)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   5.000   5.385   7.000   8.000

The histogram reveals that the stress level distribution is not normally distributed. It displays a skewed pattern with a higher concentration of individuals reporting low stress levels. The mean stress level was 5, while maximum and minimum were 8 and 3 correspondingly.

hist(sleep_data$Quality.of.Sleep,breaks=5, main= "Quality of Sleep")

summary(sleep_data$Quality.of.Sleep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   6.000   7.000   7.313   8.000   9.000

Quality of sleep is not normally distributed, with higher proportion of people having higher values. The mean was 7, with a maximum and minimum values of 9 and 3 correspondingly.

hist(sleep_data$Physical.Activity.Level, breaks=4, main= "Physical activity level")

summary(sleep_data$Physical.Activity.Level)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   45.00   60.00   59.17   75.00   90.00

The data is right-skewed, and most values lie in the 40–60 range, with a gradual decrease toward higher physical activity levels.

hist(sleep_data$Heart.Rate, breaks= 10, main= "Heart Rate")

summary(sleep_data$Heart.Rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   65.00   68.00   70.00   70.17   72.00   86.00

The heart rate distribution is negatively skewed, with a higher concentration of patients having lower heart rate values. The average heart rate is 70, with a minimum of 65 and a maximum of 86.

ggplot use

hist(sleep_data$Daily.Steps, breaks=5, main= "Daily Steps")

summary(sleep_data$Daily.Steps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3000    5600    7000    6817    8000   10000
ggplot(sleep_data, aes(x = Daily.Steps)) +
  geom_density(fill = "purple", alpha = 0.3) +
  labs(title = "Density Plot of Daily Steps", x = "Daily Steps", y = "Density") +
  theme_minimal()

The distribution of the number of daily steps appears to be approximately normal, with a mean value of 6817 daily steps. The minimum number of steps was 3000, while the maximum was 10000.

table(sleep_data$Gender)
## 
## Female   Male 
##    185    189
# Pie chart
pie(table(sleep_data$Gender),
    main = "Gender Distribution",
    col = c("pink", "lightblue"))

The distribution of males and females appears to be fairly similar across the dataset.

ggplot use

occ <- table(sleep_data$Occupation)
occ
## 
##           Accountant               Doctor             Engineer 
##                   37                   71                   63 
##               Lawyer              Manager                Nurse 
##                   47                    1                   73 
## Sales Representative          Salesperson            Scientist 
##                    2                   32                    4 
##    Software Engineer              Teacher 
##                    4                   40
occ <- data.frame(occ)  #creating a data frame for Occupations

occ <- occ %>% rename(Occupation=Var1) #renaming Var1 column

occ$Occupation <- as.character(occ$Occupation)
occ$Occupation[occ$Occupation == "Software Engineer"] <- "SE" #renaming occupations for better representativeness 
occ$Occupation[occ$Occupation == "Sales Representative"] <- "SR"
occ$Occupation[occ$Occupation == "Accountant"] <- "Acc"

ggplot(occ, aes(x = Occupation, y = Freq, fill = Occupation)) +  # Adjust 'Count' with your actual frequency column name if different
  geom_bar(stat = "identity", fill = "lightpink", width=0.9) +
  labs(title = "Frequency of Occupations", x = "Occupation", y = "Frequency") +
  theme_minimal()

The majority of individuals in the dataset are doctors and nurses, followed by engineers, lawyers, and teachers.

table(sleep_data$BMI.Category)
## 
##        Normal Normal Weight         Obese    Overweight 
##           195            21            10           148

I suspect that some entries for normal weight were mistakenly saved under a different name. To address this, I will create a new column named BMI where I will combine these two categories. data manipulation

sleep_data$BMI[sleep_data$BMI.Category == "Normal Weight"] <- "Normal"
 
table(sleep_data$BMI)
## 
##     Normal      Obese Overweight 
##        216         10        148
# Pie chart
pie(table(sleep_data$BMI),
    main = "Weight Distribution",
    col = c("green", "red", "orange"))

The majority of patients had a normal weight, while less than half were overweight, and only a small percentage were classified as obese.

table(sleep_data$Sleep.Disorder)
## 
##    Insomnia        None Sleep Apnea 
##          77         219          78
sleep_data$Sleep.Disorder <- as.factor(sleep_data$Sleep.Disorder)
# Pie chart
pie(table(sleep_data$Sleep.Disorder),
    main = "Sleep Disorder Distribution",
    col = c("blue", "green", "orange"))

The majority of individuals did not have a sleep disorder, while the distributions of insomnia and sleep apnea were comparable.

Since the blood pressure data is saved in a format that cannot be directly analyzed, I will create a new column that includes only the systolic pressure for further analysis. data manipulation

#Creating a new column for Systolic Pressure
sleep_data <- sleep_data %>%
  mutate(Systolic.Pressure = str_split_fixed(Blood.Pressure, "/", 2)[, 1])

sleep_data$Systolic.Pressure <- as.integer(sleep_data$Systolic.Pressure)
head(sleep_data)
##   Person.ID Gender Age           Occupation Sleep.Duration Quality.of.Sleep
## 1         1   Male  27    Software Engineer            6.1                6
## 2         2   Male  28               Doctor            6.2                6
## 3         3   Male  28               Doctor            6.2                6
## 4         4   Male  28 Sales Representative            5.9                4
## 5         5   Male  28 Sales Representative            5.9                4
## 6         6   Male  28    Software Engineer            5.9                4
##   Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1                      42            6   Overweight         126/83         77
## 2                      60            8       Normal         125/80         75
## 3                      60            8       Normal         125/80         75
## 4                      30            8        Obese         140/90         85
## 5                      30            8        Obese         140/90         85
## 6                      30            8        Obese         140/90         85
##   Daily.Steps Sleep.Disorder        BMI Systolic.Pressure
## 1        4200           None Overweight               126
## 2       10000           None     Normal               125
## 3       10000           None     Normal               125
## 4        3000    Sleep Apnea      Obese               140
## 5        3000    Sleep Apnea      Obese               140
## 6        3000       Insomnia      Obese               140
hist(sleep_data$Systolic.Pressure, breaks=6, main= "Systolic Pressure")

summary(sleep_data$Systolic.Pressure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   115.0   125.0   130.0   128.6   135.0   142.0

The systolic pressure appears to be approximately normally distributed, showing a slight right skew with a mean value of 128.

Based on the WHO’s recommendation for adults to engage in more than 1 hour of physical activity, I will divide participants into two groups. The Activity column will be used to represent those who engage in 60 or more minutes of physical activity as 1, and those who engage in less than 60 minutes as 0. data manipulation

sleep_data$Activity <- ifelse(sleep_data$Physical.Activity.Level >= 60, 1, 0)
sleep_data$Activity <- as.factor(sleep_data$Activity)

table(sleep_data$Activity)
## 
##   0   1 
## 161 213

The table shows that 161 participants engaged in less than 60 minutes of physical activity, while 213 participants met the WHO recommendation of 60 or more minutes daily.

Now, let’s compare Systolic Pressure among two genders and see if they differ. We will need to perform a t-test. And the assumptions for a valid t-test are: - The sample data have been randomly sampled from a population. - There is homogeneity of variance (i.e., the variability of the data in each group is similar). - The distribution is approximately normal.

I will perform Shapiro test to check the normality of the data.

## 
##  Shapiro-Wilk normality test
## 
## data:  male_systolic
## W = 0.83703, p-value = 2.837e-13
## 
##  Shapiro-Wilk normality test
## 
## data:  female_systolic
## W = 0.83308, p-value = 2.744e-13

As both low p-values from the Shapiro-Wilk test and unstructured Q-Q plots indicate that the assumption of normality in the samples is violated, the t-test may not be appropriate for comparing the means of these groups.

That is why we will perform the Wilcoxon rank-sum test, which is the non-parametric alternative to the t-test for comparing the medians of two independent groups, particularly when the data do not meet the assumptions of normality.

systolic_result <- wilcox.test(Systolic.Pressure ~ Gender, data = sleep_data)

tidy(systolic_result)
## # A tibble: 1 × 4
##   statistic   p.value method                                         alternative
##       <dbl>     <dbl> <chr>                                          <chr>      
## 1     21728 0.0000356 Wilcoxon rank sum test with continuity correc… two.sided
ggplot(sleep_data, aes(x = factor(Gender), y = Systolic.Pressure)) +
  geom_boxplot(fill = "lightpink") +
  labs(title = "Distribution of Physical Activity Level by BMI Category",
       x = "BMI Category",
       y = "Physical Activity Level") +
  theme_minimal()

The p-value indicates that we must reject the null hypothesis, suggesting that there is a statistically significant difference in the means of Systolic Pressure across different genders.

Next we will analyze if Physical Activity level affects BMI. We would need to perform ANNOVA test, so let’s first check for the normality of physical activity level across different BMI categories with Shapiro test.

shapiro_test_bmi <- sleep_data %>%
  group_by(BMI) %>%
  summarise(
    ShapiroTest = list(shapiro.test(Physical.Activity.Level)),
      p_value = sapply(ShapiroTest, `[[`, "p.value")
  )

shapiro_test_bmi
## # A tibble: 3 × 3
##   BMI        ShapiroTest  p_value
##   <chr>      <list>         <dbl>
## 1 Normal     <htest>     4.64e-13
## 2 Obese      <htest>     1.11e- 1
## 3 Overweight <htest>     4.46e-13

Based on the low p-values from the Shapiro-Wilk tests, we would reject the null hypothesis that the data is normally distributed for each BMI category. Instead of performing an ANOVA test, we will use its non-parametric alternative, the Kruskal-Wallis test, to compare Phyisical Activity Level among different BMI categories.

kruskal_BMI <- kruskal.test(Physical.Activity.Level ~ BMI, data = sleep_data)
kruskal_BMI
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Physical.Activity.Level by BMI
## Kruskal-Wallis chi-squared = 2.6252, df = 2, p-value = 0.2691
ggplot(sleep_data, aes(x = factor(BMI), y = Physical.Activity.Level)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Distribution of Physical Activity Level by BMI Category",
       x = "BMI Category",
       y = "Physical Activity Level") +
  theme_minimal()

The p-value from the Kruskal-Wallis test is greater than 0.05, it means that there is no statistically significant difference in Physical Activity Levels across the different BMI categories. This suggests that the variability in Physical Activity Levels is similar across all BMI groups, and therefore, we fail to reject the null hypothesis of no difference among BMI categories in Physical Activity Levels.

Let’s now create a linear regression model to predict Sleep Duration using Stress Level, Quality of Sleep, and Physical Activity Level. First, we will test if there is a significant linear relationship between each predictor and the outcome variable. We will use the Spearman’s correlation test because it does not require the data to be normally distributed, making it suitable for our analysis.

stress.sleep <- cor.test(sleep_data$Sleep.Duration, sleep_data$Stress.Level, method = "spearman")
## Warning in cor.test.default(sleep_data$Sleep.Duration, sleep_data$Stress.Level,
## : Cannot compute exact p-value with ties
stress.sleep
## 
##  Spearman's rank correlation rho
## 
## data:  sleep_data$Sleep.Duration and sleep_data$Stress.Level
## S = 15786679, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.8106326
ggplot(sleep_data, aes(x = Sleep.Duration, y = Stress.Level)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # Adding a linear regression line
  annotate("text", x = max(sleep_data$Sleep.Duration) * 0.8, y = max(sleep_data$Stress.Level) * 0.9,
           label = paste("r =", round(cor(sleep_data$Sleep.Duration, sleep_data$Stress.Level), 2)), size = 5) +
  labs(title = "Scatter Plot of Sleep Duration vs Stress Level",
       x = "Sleep Duration",
       y = "Stress Level")+
  theme_light()
## `geom_smooth()` using formula = 'y ~ x'

The Spearman’s correlation coefficient is -0.81 with a p-value lower than 0.05, indicating a strong negative linear relationship between Sleep Duration and Stress Level.

quality.sleep <- cor.test(sleep_data$Sleep.Duration, sleep_data$Quality.of.Sleep, method= "spearman")
## Warning in cor.test.default(sleep_data$Sleep.Duration,
## sleep_data$Quality.of.Sleep, : Cannot compute exact p-value with ties
quality.sleep
## 
##  Spearman's rank correlation rho
## 
## data:  sleep_data$Sleep.Duration and sleep_data$Quality.of.Sleep
## S = 981760, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8873984
ggplot(sleep_data, aes(x = Sleep.Duration, y = Quality.of.Sleep)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "green") +  
  annotate("text", x = max(sleep_data$Sleep.Duration) * 0.8, y = max(sleep_data$Quality.of.Sleep) * 0.9,
           label = paste("r =", round(cor(sleep_data$Sleep.Duration, sleep_data$Quality.of.Sleep), 2)), size = 5) +
  labs(title = "Scatter Plot of Sleep Duration vs Quality of Sleep",
       x = "Sleep Duration",
       y = "Quality of Sleep")+
  theme_light()
## `geom_smooth()` using formula = 'y ~ x'

The Spearman’s correlation coefficient is 0.89 with a p-value lower than 0.05, indicating a strong positive linear relationship between Sleep Duration and Sleep Quality.

physical.sleep <- cor.test(sleep_data$Sleep.Duration, sleep_data$Physical.Activity.Level, method = "spearman")
## Warning in cor.test.default(sleep_data$Sleep.Duration,
## sleep_data$Physical.Activity.Level, : Cannot compute exact p-value with ties
physical.sleep
## 
##  Spearman's rank correlation rho
## 
## data:  sleep_data$Sleep.Duration and sleep_data$Physical.Activity.Level
## S = 6894581, p-value = 4.541e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## 0.209235
ggplot(sleep_data, aes(x = Sleep.Duration, y = Physical.Activity.Level)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +  
  annotate("text", x = max(sleep_data$Sleep.Duration) * 0.8, y = max(sleep_data$Physical.Activity.Level) * 0.9,
           label = paste("r =", round(cor(sleep_data$Sleep.Duration, sleep_data$Physical.Activity.Level), 2)), size = 5) +
  labs(title = "Scatter Plot of Sleep Duration vs Physical Activity Level",
       x = "Sleep Duration",
       y = "Physical Activity Level")+
  theme_light()
## `geom_smooth()` using formula = 'y ~ x'

A Spearman’s correlation coefficient of 0.2, with a p-value lower than 0.05 suggests a weak positive relationship between the Sleep Duration and Physical Activity Level.

sleep_data$Sleep.Disorder <- relevel(sleep_data$Sleep.Disorder, ref=3) #making the None category of sleep disorder the reference group

sleep_lm <- lm(Sleep.Duration ~ Gender + Sleep.Disorder + Quality.of.Sleep +
                 Physical.Activity.Level + Stress.Level , data=sleep_data)
summary(sleep_lm)
## 
## Call:
## lm(formula = Sleep.Duration ~ Gender + Sleep.Disorder + Quality.of.Sleep + 
##     Physical.Activity.Level + Stress.Level, data = sleep_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59937 -0.28393 -0.04949  0.16388  0.87315 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.843915   0.421089  11.503  < 2e-16 ***
## GenderMale               0.315443   0.044644   7.066 8.08e-12 ***
## Sleep.DisorderInsomnia  -0.213760   0.066422  -3.218   0.0014 ** 
## Sleep.DisorderNone      -0.045619   0.056079  -0.813   0.4165    
## Quality.of.Sleep         0.397611   0.040690   9.772  < 2e-16 ***
## Physical.Activity.Level  0.001918   0.001009   1.900   0.0582 .  
## Stress.Level            -0.152583   0.026809  -5.691 2.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3412 on 367 degrees of freedom
## Multiple R-squared:  0.8191, Adjusted R-squared:  0.8161 
## F-statistic: 276.9 on 6 and 367 DF,  p-value: < 2.2e-16
vif(sleep_lm) #to check multicollinearity
##                             GVIF Df GVIF^(1/(2*Df))
## Gender                  1.600432  1        1.265082
## Sleep.Disorder          2.104334  2        1.204422
## Quality.of.Sleep        7.599758  1        2.756766
## Physical.Activity.Level 1.416055  1        1.189981
## Stress.Level            7.251058  1        2.692779

This model uses gender, physical activity level, quality of sleep, stress level, and sleep disorders to predict an individual’s sleep duration. However, we observed moderate collinearity between Quality of Sleep and Stress levels (VIF = 7). To address this, I will remove the predictor Quality of Sleep from the model.

sleep_data$Sleep.Disorder <- relevel(sleep_data$Sleep.Disorder, ref=3)
sleep_lm1 <- lm(Sleep.Duration ~ Gender + Sleep.Disorder + 
                 Physical.Activity.Level + Stress.Level , data=sleep_data)
summary(sleep_lm1)
## 
## Call:
## lm(formula = Sleep.Duration ~ Gender + Sleep.Disorder + Physical.Activity.Level + 
##     Stress.Level, data = sleep_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96933 -0.24517 -0.01905  0.19433  0.86545 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.842461   0.090904  97.272  < 2e-16 ***
## GenderMale                 0.353683   0.049855   7.094 6.72e-12 ***
## Sleep.DisorderSleep Apnea -0.022179   0.062384  -0.356    0.722    
## Sleep.DisorderInsomnia    -0.388202   0.053151  -7.304 1.75e-12 ***
## Physical.Activity.Level    0.004855   0.001080   4.495 9.33e-06 ***
## Stress.Level              -0.388454   0.013077 -29.704  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3825 on 368 degrees of freedom
## Multiple R-squared:  0.772,  Adjusted R-squared:  0.7689 
## F-statistic: 249.2 on 5 and 368 DF,  p-value: < 2.2e-16
confint(sleep_lm1)
##                                  2.5 %       97.5 %
## (Intercept)                8.663703969  9.021218802
## GenderMale                 0.255646474  0.451720189
## Sleep.DisorderSleep Apnea -0.144852962  0.100495506
## Sleep.DisorderInsomnia    -0.492718976 -0.283685303
## Physical.Activity.Level    0.002731114  0.006978954
## Stress.Level              -0.414170349 -0.362738396
vif(sleep_lm1)
##                             GVIF Df GVIF^(1/(2*Df))
## Gender                  1.588134  1        1.260212
## Sleep.Disorder          1.716423  2        1.144606
## Physical.Activity.Level 1.290457  1        1.135983
## Stress.Level            1.372860  1        1.171691

Now that we have confirmed there is no issue with multicollinearity, as all predictors have low VIF values. Also , we observed that all predictor coefficients were statistically significant, except Sleep disorder Apnea. Adjusted R-squared equals 0.768 (p value<0.05) meaning that approximately 77% of the variability in the dataset is explained by our model.

Based on our model, we can conclude the following:

Finally, let’s check if the assumptions of linear model are met. For that we will plot the residuals.

plot(sleep_lm1, 1)

plot(sleep_lm1, 2)

The first plot is a scatterplot of the residuals from our model plotted against the fitted values. The plot displays a random scatter pattern with a slight clustering around the smaller values. The small clustering around smaller values may suggest minor deviations, but overall, the model appears to be a good fit with no significant issues with homoscedasticity.

The Q-Q plot of residuals shows deviations from the diagonal reference line, particularly in the upper tail, indicating non-normality of residuals. The presence of outliers (points 262, 263) and a rightward skew suggest heavy tails and potential model issues such as outliers or model misspecification. This non-normality could affect the reliability of the model’s assumptions and statistical inferences.

The second model will aim to predict stress level using sleep duration, heart rate, gender, and BMI as predictors. Before proceeding, I will first assess the linear relationship between the outcome variable and the continuous predictors.

hr.stress <- cor.test(sleep_data$Stress.Level, sleep_data$Heart.Rate, method = "spearman")
## Warning in cor.test.default(sleep_data$Stress.Level, sleep_data$Heart.Rate, :
## Cannot compute exact p-value with ties
hr.stress
## 
##  Spearman's rank correlation rho
## 
## data:  sleep_data$Stress.Level and sleep_data$Heart.Rate
## S = 1566811, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8202966
ggplot(sleep_data, aes(x = Stress.Level, y = Heart.Rate)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +  
  annotate("text", x = max(sleep_data$Stress.Level) * 0.8, y = max(sleep_data$Heart.Rate) * 0.9,
           label = paste("r =", round(cor(sleep_data$Stress.Level, sleep_data$Heart.Rate, method="spearman"), 2)), size = 5) +
  labs(title = "Scatter Plot of Stress Level vs Heart Rate",
       x = "Stress Level",
       y = "Heart Rate")+
  theme_light()
## `geom_smooth()` using formula = 'y ~ x'

The Spearman’s correlation coefficient is 0.82 with a p-value lower than 0.05, indicating a strong positive linear relationship between Stress Level and Heart Rate.

sd.stress <- cor.test(sleep_data$Stress.Level, sleep_data$Sleep.Duration, method = "spearman")
## Warning in cor.test.default(sleep_data$Stress.Level, sleep_data$Sleep.Duration,
## : Cannot compute exact p-value with ties
sd.stress
## 
##  Spearman's rank correlation rho
## 
## data:  sleep_data$Stress.Level and sleep_data$Sleep.Duration
## S = 15786679, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.8106326
ggplot(sleep_data, aes(x = Stress.Level, y = Sleep.Duration)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +  
  annotate("text", x = max(sleep_data$Stress.Level) * 0.8, y = max(sleep_data$Sleep.Duration) * 0.9,
           label = paste("r =", round(cor(sleep_data$Stress.Level, sleep_data$Sleep.Duration, method="spearman"), 2)), size = 5) +
  labs(title = "Scatter Plot of Stress Level vs Sleep Duration",
       x = "Stress Level",
       y = "Sleep Duration")+
  theme_light()
## `geom_smooth()` using formula = 'y ~ x'

The Spearman’s correlation coefficient is -0.81 with a p-value lower than 0.05, indicating a strong negative linear relationship between Stress Level and Sleep Duration.

As we saw the linear realtionship between Stress level and continous predictors we can proceed to fit the glm.

stress.lm <- lm(Stress.Level~ Sleep.Duration + Gender+ Heart.Rate + BMI, data=sleep_data)
summary(stress.lm)
## 
## Call:
## lm(formula = Stress.Level ~ Sleep.Duration + Gender + Heart.Rate + 
##     BMI, data = sleep_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5333 -0.3193 -0.1143  0.5979  1.9255 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.52328    1.10800  -2.277   0.0233 *  
## Sleep.Duration -1.22395    0.05430 -22.541  < 2e-16 ***
## GenderMale      0.78391    0.07474  10.488  < 2e-16 ***
## Heart.Rate      0.23485    0.01230  19.099  < 2e-16 ***
## BMIObese       -3.76140    0.26908 -13.979  < 2e-16 ***
## BMIOverweight  -0.34463    0.08147  -4.230 2.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6333 on 368 degrees of freedom
## Multiple R-squared:  0.8743, Adjusted R-squared:  0.8726 
## F-statistic: 512.1 on 5 and 368 DF,  p-value: < 2.2e-16
vif(stress.lm)
##                    GVIF Df GVIF^(1/(2*Df))
## Sleep.Duration 1.735970  1        1.317562
## Gender         1.302141  1        1.141114
## Heart.Rate     2.405211  1        1.550874
## BMI            2.471573  2        1.253844

All predictor coefficients in this model were statistically significant. Additionally, all predictors had low VIF values, indicating the absence of multicollinearity. R-squared equals 0.87, meaning that 87% of the variability in the dataset is explained by our linear model.

Based on our model, we observed the following relationships:

Finally, let’s check if the assumptions of linear model are met. For that we will plot the residuals.

plot(stress.lm, 1)

plot(stress.lm, 2)

If we are not very strict, the Residuals vs Fitted plot can be considered close to random scatter, as most residuals are centered around zero without a strong systematic pattern. The slight curvature in the line and minor heteroscedasticity are subtle and unlikely to significantly impact the model’s conclusions.

The Q-Q plot is used to assess the normality of residuals. The points generally follow the diagonal reference line, with some deviation at the tails, particularly at the tails. This suggests that the residuals are approximately normally distributed, though there might be some slight departure from normality at the extremes.

To sum up, these findings reveal several notable relationships in this dataset:

Stress Levels: - Sleep Duration and Stress: Longer sleep duration is associated with lower stress levels, suggesting that adequate sleep might play a protective role against stress. - Gender and Stress: Females tend to report higher stress levels than males, implying potential gender differences in stress experiences or coping mechanisms. - Heart Rate and Stress: Higher heart rates are linked to increased stress, reflecting a physiological response where stress impacts cardiovascular function. - BMI and Stress: Obese individuals experience significantly lower stress levels than those with a normal BMI. Overweight individuals also report slightly lower stress levels, though the effect is smaller than in the obese group. Lower stress levels in obese and overweight individuals may stem from stress-coping behaviors, hormonal differences, or varied social and psychological factors.

Sleep Duration: - Gender and Sleep: Females tend to sleep slightly longer than males when other factors are held constant. - Sleep Disorders and Sleep: Insomnia is associated with shorter sleep duration compared to those without sleeping disorders, highlighting the adverse effects of insomnia on sleep patterns. - Physical Activity and Sleep: Higher levels of physical activity correspond to marginally longer sleep durations, suggesting a positive association between physical activity and sleep. - Stress and Sleep: Increased stress levels are linked to shorter sleep durations, indicating a potential cyclical relationship where stress and sleep disturbances exacerbate each other.

Implications: - Stress and Sleep Connection: There is a clear interplay between stress and sleep, where poor sleep contributes to higher stress levels, and stress negatively impacts sleep duration. Addressing both factors may be crucial for improving overall well-being. - Role of Gender: Gender differences influence both stress levels and sleep patterns, warranting further investigation into societal, psychological, or biological factors driving these disparities. - Physical Health Factors: The relationships between BMI, heart rate, and stress suggest complex interactions between mental and physical health that could benefit from a holistic approach in interventions.

These findings collectively underscore the importance of considering various demographic, lifestyle, and physiological factors when studying stress and sleep dynamics.

sessionInfo() 
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.0.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Nicosia
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] corrplot_0.95   skimr_2.1.5     gmodels_2.19.1  mctest_1.3.1   
##  [5] Hmisc_5.2-1     car_3.1-3       carData_3.0-5   GGally_2.2.1   
##  [9] broom_1.0.7     naniar_1.1.0    lubridate_1.9.3 forcats_1.0.0  
## [13] stringr_1.5.1   purrr_1.0.2     readr_2.1.5     tidyr_1.3.1    
## [17] tibble_3.2.1    ggplot2_3.5.1   tidyverse_2.0.0 dplyr_1.1.4    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1   farver_2.1.2       fastmap_1.2.0      digest_0.6.37     
##  [5] rpart_4.1.23       timechange_0.3.0   lifecycle_1.0.4    cluster_2.1.6     
##  [9] gdata_3.0.1        magrittr_2.0.3     compiler_4.4.1     rlang_1.1.4       
## [13] sass_0.4.9         tools_4.4.1        utf8_1.2.4         yaml_2.3.10       
## [17] data.table_1.16.2  knitr_1.48         labeling_0.4.3     htmlwidgets_1.6.4 
## [21] plyr_1.8.9         repr_1.1.7         RColorBrewer_1.1-3 abind_1.4-8       
## [25] withr_3.0.1        foreign_0.8-86     nnet_7.3-19        grid_4.4.1        
## [29] fansi_1.0.6        colorspace_2.1-1   scales_1.3.0       gtools_3.9.5      
## [33] MASS_7.3-61        cli_3.6.3          rmarkdown_2.28     generics_0.1.3    
## [37] rstudioapi_0.16.0  tzdb_0.4.0         cachem_1.1.0       splines_4.4.1     
## [41] base64enc_0.1-3    vctrs_0.6.5        Matrix_1.7-0       jsonlite_1.8.8    
## [45] hms_1.1.3          visdat_0.6.0       Formula_1.2-5      htmlTable_2.4.3   
## [49] jquerylib_0.1.4    glue_1.7.0         ggstats_0.7.0      stringi_1.8.4     
## [53] gtable_0.3.5       munsell_0.5.1      pillar_1.9.0       htmltools_0.5.8.1 
## [57] R6_2.5.1           evaluate_1.0.0     lattice_0.22-6     highr_0.11        
## [61] backports_1.5.0    bslib_0.8.0        Rcpp_1.0.13        gridExtra_2.3     
## [65] nlme_3.1-164       checkmate_2.3.2    mgcv_1.9-1         xfun_0.47         
## [69] pkgconfig_2.0.3