library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(naniar)
library(readr)
library(ggplot2)
library(broom)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:purrr':
##
## some
##
## The following object is masked from 'package:dplyr':
##
## recode
library(Hmisc)
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:dplyr':
##
## src, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, units
library(mctest)
library(gmodels)
library(skimr)
##
## Attaching package: 'skimr'
##
## The following object is masked from 'package:naniar':
##
## n_complete
library(corrplot)
## corrplot 0.95 loaded
Please place the Sleep_health_and_lifestyle_dataset.csv file in the current working directory in order to load the dataset.
sleep_data <- read.csv("./Sleep_health_and_lifestyle_dataset.csv") #load from the cwd
head(sleep_data)
## Person.ID Gender Age Occupation Sleep.Duration Quality.of.Sleep
## 1 1 Male 27 Software Engineer 6.1 6
## 2 2 Male 28 Doctor 6.2 6
## 3 3 Male 28 Doctor 6.2 6
## 4 4 Male 28 Sales Representative 5.9 4
## 5 5 Male 28 Sales Representative 5.9 4
## 6 6 Male 28 Software Engineer 5.9 4
## Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1 42 6 Overweight 126/83 77
## 2 60 8 Normal 125/80 75
## 3 60 8 Normal 125/80 75
## 4 30 8 Obese 140/90 85
## 5 30 8 Obese 140/90 85
## 6 30 8 Obese 140/90 85
## Daily.Steps Sleep.Disorder
## 1 4200 None
## 2 10000 None
## 3 10000 None
## 4 3000 Sleep Apnea
## 5 3000 Sleep Apnea
## 6 3000 Insomnia
View(sleep_data)
This dataset is taken from Kaggle.com. It is stated by the author Laksika Tharmalingam that this dataset is synthetic and is generated by her. It is not sourced from any original individual or entity.
pct_complete(sleep_data)
## [1] 100
This dataset had no missing values.
dim(sleep_data)
## [1] 374 13
summary(sleep_data)
## Person.ID Gender Age Occupation
## Min. : 1.00 Length:374 Min. :27.00 Length:374
## 1st Qu.: 94.25 Class :character 1st Qu.:35.25 Class :character
## Median :187.50 Mode :character Median :43.00 Mode :character
## Mean :187.50 Mean :42.18
## 3rd Qu.:280.75 3rd Qu.:50.00
## Max. :374.00 Max. :59.00
## Sleep.Duration Quality.of.Sleep Physical.Activity.Level Stress.Level
## Min. :5.800 Min. :4.000 Min. :30.00 Min. :3.000
## 1st Qu.:6.400 1st Qu.:6.000 1st Qu.:45.00 1st Qu.:4.000
## Median :7.200 Median :7.000 Median :60.00 Median :5.000
## Mean :7.132 Mean :7.313 Mean :59.17 Mean :5.385
## 3rd Qu.:7.800 3rd Qu.:8.000 3rd Qu.:75.00 3rd Qu.:7.000
## Max. :8.500 Max. :9.000 Max. :90.00 Max. :8.000
## BMI.Category Blood.Pressure Heart.Rate Daily.Steps
## Length:374 Length:374 Min. :65.00 Min. : 3000
## Class :character Class :character 1st Qu.:68.00 1st Qu.: 5600
## Mode :character Mode :character Median :70.00 Median : 7000
## Mean :70.17 Mean : 6817
## 3rd Qu.:72.00 3rd Qu.: 8000
## Max. :86.00 Max. :10000
## Sleep.Disorder
## Length:374
## Class :character
## Mode :character
##
##
##
There are 374 observations and 13 variables in this dataset.
First, we will analyze the dataset by exploring continuous variables using histograms and descriptive statistics. Next, we will examine categorical variables through frequency tables and pie charts.
ggplot use
ggplot(sleep_data, aes(x = Age)) +
geom_histogram(binwidth = 3, fill = "lightgreen", color = "black", alpha = 0.9) +
scale_x_continuous(
limits = c(20, 60)
) +
labs(title = "Distribution of Age", x = "Age", y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(size = 12))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
summary(sleep_data$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.00 35.25 43.00 42.18 50.00 59.00
The histogram indicates that the age distribution in this dataset appears to be approximately normal, with a mean age of 42 and a median age of 43. The youngest patient is 27 years old, while the oldest is 59. The close alignment of the mean and median suggests an absence of significant outliers in the data.
hist(sleep_data$Sleep.Duration, breaks = 10, main= "Sleep Duration")
summary(sleep_data$Sleep.Duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.800 6.400 7.200 7.132 7.800 8.500
Sleep duration results seem to be normally distributed with a slight skew to the left. The average sleep duration was 7, with the maximum of 8.5 and minimum of 5.8.
hist(sleep_data$Stress.Level, breaks=5, main= "Stress level")
summary(sleep_data$Stress.Level)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 5.000 5.385 7.000 8.000
The histogram reveals that the stress level distribution is not normally distributed. It displays a skewed pattern with a higher concentration of individuals reporting low stress levels. The mean stress level was 5, while maximum and minimum were 8 and 3 correspondingly.
hist(sleep_data$Quality.of.Sleep,breaks=5, main= "Quality of Sleep")
summary(sleep_data$Quality.of.Sleep)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 6.000 7.000 7.313 8.000 9.000
Quality of sleep is not normally distributed, with higher proportion of people having higher values. The mean was 7, with a maximum and minimum values of 9 and 3 correspondingly.
hist(sleep_data$Physical.Activity.Level, breaks=4, main= "Physical activity level")
summary(sleep_data$Physical.Activity.Level)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 45.00 60.00 59.17 75.00 90.00
The data is right-skewed, and most values lie in the 40–60 range, with a gradual decrease toward higher physical activity levels.
hist(sleep_data$Heart.Rate, breaks= 10, main= "Heart Rate")
summary(sleep_data$Heart.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65.00 68.00 70.00 70.17 72.00 86.00
The heart rate distribution is negatively skewed, with a higher concentration of patients having lower heart rate values. The average heart rate is 70, with a minimum of 65 and a maximum of 86.
ggplot use
hist(sleep_data$Daily.Steps, breaks=5, main= "Daily Steps")
summary(sleep_data$Daily.Steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3000 5600 7000 6817 8000 10000
ggplot(sleep_data, aes(x = Daily.Steps)) +
geom_density(fill = "purple", alpha = 0.3) +
labs(title = "Density Plot of Daily Steps", x = "Daily Steps", y = "Density") +
theme_minimal()
The distribution of the number of daily steps appears to be approximately normal, with a mean value of 6817 daily steps. The minimum number of steps was 3000, while the maximum was 10000.
table(sleep_data$Gender)
##
## Female Male
## 185 189
# Pie chart
pie(table(sleep_data$Gender),
main = "Gender Distribution",
col = c("pink", "lightblue"))
The distribution of males and females appears to be fairly similar across the dataset.
ggplot use
occ <- table(sleep_data$Occupation)
occ
##
## Accountant Doctor Engineer
## 37 71 63
## Lawyer Manager Nurse
## 47 1 73
## Sales Representative Salesperson Scientist
## 2 32 4
## Software Engineer Teacher
## 4 40
occ <- data.frame(occ) #creating a data frame for Occupations
occ <- occ %>% rename(Occupation=Var1) #renaming Var1 column
occ$Occupation <- as.character(occ$Occupation)
occ$Occupation[occ$Occupation == "Software Engineer"] <- "SE" #renaming occupations for better representativeness
occ$Occupation[occ$Occupation == "Sales Representative"] <- "SR"
occ$Occupation[occ$Occupation == "Accountant"] <- "Acc"
ggplot(occ, aes(x = Occupation, y = Freq, fill = Occupation)) + # Adjust 'Count' with your actual frequency column name if different
geom_bar(stat = "identity", fill = "lightpink", width=0.9) +
labs(title = "Frequency of Occupations", x = "Occupation", y = "Frequency") +
theme_minimal()
The majority of individuals in the dataset are doctors and nurses, followed by engineers, lawyers, and teachers.
table(sleep_data$BMI.Category)
##
## Normal Normal Weight Obese Overweight
## 195 21 10 148
I suspect that some entries for normal weight were mistakenly saved under a different name. To address this, I will create a new column named BMI where I will combine these two categories. data manipulation
sleep_data$BMI[sleep_data$BMI.Category == "Normal Weight"] <- "Normal"
table(sleep_data$BMI)
##
## Normal Obese Overweight
## 216 10 148
# Pie chart
pie(table(sleep_data$BMI),
main = "Weight Distribution",
col = c("green", "red", "orange"))
The majority of patients had a normal weight, while less than half were
overweight, and only a small percentage were classified as obese.
table(sleep_data$Sleep.Disorder)
##
## Insomnia None Sleep Apnea
## 77 219 78
sleep_data$Sleep.Disorder <- as.factor(sleep_data$Sleep.Disorder)
# Pie chart
pie(table(sleep_data$Sleep.Disorder),
main = "Sleep Disorder Distribution",
col = c("blue", "green", "orange"))
The majority of individuals did not have a sleep disorder, while the distributions of insomnia and sleep apnea were comparable.
Since the blood pressure data is saved in a format that cannot be directly analyzed, I will create a new column that includes only the systolic pressure for further analysis. data manipulation
#Creating a new column for Systolic Pressure
sleep_data <- sleep_data %>%
mutate(Systolic.Pressure = str_split_fixed(Blood.Pressure, "/", 2)[, 1])
sleep_data$Systolic.Pressure <- as.integer(sleep_data$Systolic.Pressure)
head(sleep_data)
## Person.ID Gender Age Occupation Sleep.Duration Quality.of.Sleep
## 1 1 Male 27 Software Engineer 6.1 6
## 2 2 Male 28 Doctor 6.2 6
## 3 3 Male 28 Doctor 6.2 6
## 4 4 Male 28 Sales Representative 5.9 4
## 5 5 Male 28 Sales Representative 5.9 4
## 6 6 Male 28 Software Engineer 5.9 4
## Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1 42 6 Overweight 126/83 77
## 2 60 8 Normal 125/80 75
## 3 60 8 Normal 125/80 75
## 4 30 8 Obese 140/90 85
## 5 30 8 Obese 140/90 85
## 6 30 8 Obese 140/90 85
## Daily.Steps Sleep.Disorder BMI Systolic.Pressure
## 1 4200 None Overweight 126
## 2 10000 None Normal 125
## 3 10000 None Normal 125
## 4 3000 Sleep Apnea Obese 140
## 5 3000 Sleep Apnea Obese 140
## 6 3000 Insomnia Obese 140
hist(sleep_data$Systolic.Pressure, breaks=6, main= "Systolic Pressure")
summary(sleep_data$Systolic.Pressure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 115.0 125.0 130.0 128.6 135.0 142.0
The systolic pressure appears to be approximately normally distributed, showing a slight right skew with a mean value of 128.
Based on the WHO’s recommendation for adults to engage in more than 1 hour of physical activity, I will divide participants into two groups. The Activity column will be used to represent those who engage in 60 or more minutes of physical activity as 1, and those who engage in less than 60 minutes as 0. data manipulation
sleep_data$Activity <- ifelse(sleep_data$Physical.Activity.Level >= 60, 1, 0)
sleep_data$Activity <- as.factor(sleep_data$Activity)
table(sleep_data$Activity)
##
## 0 1
## 161 213
The table shows that 161 participants engaged in less than 60 minutes of physical activity, while 213 participants met the WHO recommendation of 60 or more minutes daily.
Now, let’s compare Systolic Pressure among two genders and see if they differ. We will need to perform a t-test. And the assumptions for a valid t-test are: - The sample data have been randomly sampled from a population. - There is homogeneity of variance (i.e., the variability of the data in each group is similar). - The distribution is approximately normal.
I will perform Shapiro test to check the normality of the data.
##
## Shapiro-Wilk normality test
##
## data: male_systolic
## W = 0.83703, p-value = 2.837e-13
##
## Shapiro-Wilk normality test
##
## data: female_systolic
## W = 0.83308, p-value = 2.744e-13
As both low p-values from the Shapiro-Wilk test and unstructured Q-Q plots indicate that the assumption of normality in the samples is violated, the t-test may not be appropriate for comparing the means of these groups.
That is why we will perform the Wilcoxon rank-sum test, which is the non-parametric alternative to the t-test for comparing the medians of two independent groups, particularly when the data do not meet the assumptions of normality.
systolic_result <- wilcox.test(Systolic.Pressure ~ Gender, data = sleep_data)
tidy(systolic_result)
## # A tibble: 1 × 4
## statistic p.value method alternative
## <dbl> <dbl> <chr> <chr>
## 1 21728 0.0000356 Wilcoxon rank sum test with continuity correc… two.sided
ggplot(sleep_data, aes(x = factor(Gender), y = Systolic.Pressure)) +
geom_boxplot(fill = "lightpink") +
labs(title = "Distribution of Physical Activity Level by BMI Category",
x = "BMI Category",
y = "Physical Activity Level") +
theme_minimal()
The p-value indicates that we must reject the null hypothesis,
suggesting that there is a statistically significant difference in the
means of Systolic Pressure across different genders.
Next we will analyze if Physical Activity level affects BMI. We would need to perform ANNOVA test, so let’s first check for the normality of physical activity level across different BMI categories with Shapiro test.
shapiro_test_bmi <- sleep_data %>%
group_by(BMI) %>%
summarise(
ShapiroTest = list(shapiro.test(Physical.Activity.Level)),
p_value = sapply(ShapiroTest, `[[`, "p.value")
)
shapiro_test_bmi
## # A tibble: 3 × 3
## BMI ShapiroTest p_value
## <chr> <list> <dbl>
## 1 Normal <htest> 4.64e-13
## 2 Obese <htest> 1.11e- 1
## 3 Overweight <htest> 4.46e-13
Based on the low p-values from the Shapiro-Wilk tests, we would reject the null hypothesis that the data is normally distributed for each BMI category. Instead of performing an ANOVA test, we will use its non-parametric alternative, the Kruskal-Wallis test, to compare Phyisical Activity Level among different BMI categories.
kruskal_BMI <- kruskal.test(Physical.Activity.Level ~ BMI, data = sleep_data)
kruskal_BMI
##
## Kruskal-Wallis rank sum test
##
## data: Physical.Activity.Level by BMI
## Kruskal-Wallis chi-squared = 2.6252, df = 2, p-value = 0.2691
ggplot(sleep_data, aes(x = factor(BMI), y = Physical.Activity.Level)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Distribution of Physical Activity Level by BMI Category",
x = "BMI Category",
y = "Physical Activity Level") +
theme_minimal()
The p-value from the Kruskal-Wallis test is greater than 0.05, it means
that there is no statistically significant difference in Physical
Activity Levels across the different BMI categories. This suggests that
the variability in Physical Activity Levels is similar across all BMI
groups, and therefore, we fail to reject the null hypothesis of no
difference among BMI categories in Physical Activity Levels.
Let’s now create a linear regression model to predict Sleep Duration using Stress Level, Quality of Sleep, and Physical Activity Level. First, we will test if there is a significant linear relationship between each predictor and the outcome variable. We will use the Spearman’s correlation test because it does not require the data to be normally distributed, making it suitable for our analysis.
stress.sleep <- cor.test(sleep_data$Sleep.Duration, sleep_data$Stress.Level, method = "spearman")
## Warning in cor.test.default(sleep_data$Sleep.Duration, sleep_data$Stress.Level,
## : Cannot compute exact p-value with ties
stress.sleep
##
## Spearman's rank correlation rho
##
## data: sleep_data$Sleep.Duration and sleep_data$Stress.Level
## S = 15786679, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.8106326
ggplot(sleep_data, aes(x = Sleep.Duration, y = Stress.Level)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") + # Adding a linear regression line
annotate("text", x = max(sleep_data$Sleep.Duration) * 0.8, y = max(sleep_data$Stress.Level) * 0.9,
label = paste("r =", round(cor(sleep_data$Sleep.Duration, sleep_data$Stress.Level), 2)), size = 5) +
labs(title = "Scatter Plot of Sleep Duration vs Stress Level",
x = "Sleep Duration",
y = "Stress Level")+
theme_light()
## `geom_smooth()` using formula = 'y ~ x'
The Spearman’s correlation coefficient is -0.81 with a p-value lower
than 0.05, indicating a strong negative linear relationship between
Sleep Duration and Stress Level.
quality.sleep <- cor.test(sleep_data$Sleep.Duration, sleep_data$Quality.of.Sleep, method= "spearman")
## Warning in cor.test.default(sleep_data$Sleep.Duration,
## sleep_data$Quality.of.Sleep, : Cannot compute exact p-value with ties
quality.sleep
##
## Spearman's rank correlation rho
##
## data: sleep_data$Sleep.Duration and sleep_data$Quality.of.Sleep
## S = 981760, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8873984
ggplot(sleep_data, aes(x = Sleep.Duration, y = Quality.of.Sleep)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "green") +
annotate("text", x = max(sleep_data$Sleep.Duration) * 0.8, y = max(sleep_data$Quality.of.Sleep) * 0.9,
label = paste("r =", round(cor(sleep_data$Sleep.Duration, sleep_data$Quality.of.Sleep), 2)), size = 5) +
labs(title = "Scatter Plot of Sleep Duration vs Quality of Sleep",
x = "Sleep Duration",
y = "Quality of Sleep")+
theme_light()
## `geom_smooth()` using formula = 'y ~ x'
The Spearman’s correlation coefficient is 0.89 with a p-value lower than
0.05, indicating a strong positive linear relationship between Sleep
Duration and Sleep Quality.
physical.sleep <- cor.test(sleep_data$Sleep.Duration, sleep_data$Physical.Activity.Level, method = "spearman")
## Warning in cor.test.default(sleep_data$Sleep.Duration,
## sleep_data$Physical.Activity.Level, : Cannot compute exact p-value with ties
physical.sleep
##
## Spearman's rank correlation rho
##
## data: sleep_data$Sleep.Duration and sleep_data$Physical.Activity.Level
## S = 6894581, p-value = 4.541e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.209235
ggplot(sleep_data, aes(x = Sleep.Duration, y = Physical.Activity.Level)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "purple") +
annotate("text", x = max(sleep_data$Sleep.Duration) * 0.8, y = max(sleep_data$Physical.Activity.Level) * 0.9,
label = paste("r =", round(cor(sleep_data$Sleep.Duration, sleep_data$Physical.Activity.Level), 2)), size = 5) +
labs(title = "Scatter Plot of Sleep Duration vs Physical Activity Level",
x = "Sleep Duration",
y = "Physical Activity Level")+
theme_light()
## `geom_smooth()` using formula = 'y ~ x'
A Spearman’s correlation coefficient of 0.2, with a p-value lower than
0.05 suggests a weak positive relationship between the Sleep Duration
and Physical Activity Level.
sleep_data$Sleep.Disorder <- relevel(sleep_data$Sleep.Disorder, ref=3) #making the None category of sleep disorder the reference group
sleep_lm <- lm(Sleep.Duration ~ Gender + Sleep.Disorder + Quality.of.Sleep +
Physical.Activity.Level + Stress.Level , data=sleep_data)
summary(sleep_lm)
##
## Call:
## lm(formula = Sleep.Duration ~ Gender + Sleep.Disorder + Quality.of.Sleep +
## Physical.Activity.Level + Stress.Level, data = sleep_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59937 -0.28393 -0.04949 0.16388 0.87315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.843915 0.421089 11.503 < 2e-16 ***
## GenderMale 0.315443 0.044644 7.066 8.08e-12 ***
## Sleep.DisorderInsomnia -0.213760 0.066422 -3.218 0.0014 **
## Sleep.DisorderNone -0.045619 0.056079 -0.813 0.4165
## Quality.of.Sleep 0.397611 0.040690 9.772 < 2e-16 ***
## Physical.Activity.Level 0.001918 0.001009 1.900 0.0582 .
## Stress.Level -0.152583 0.026809 -5.691 2.58e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3412 on 367 degrees of freedom
## Multiple R-squared: 0.8191, Adjusted R-squared: 0.8161
## F-statistic: 276.9 on 6 and 367 DF, p-value: < 2.2e-16
vif(sleep_lm) #to check multicollinearity
## GVIF Df GVIF^(1/(2*Df))
## Gender 1.600432 1 1.265082
## Sleep.Disorder 2.104334 2 1.204422
## Quality.of.Sleep 7.599758 1 2.756766
## Physical.Activity.Level 1.416055 1 1.189981
## Stress.Level 7.251058 1 2.692779
This model uses gender, physical activity level, quality of sleep, stress level, and sleep disorders to predict an individual’s sleep duration. However, we observed moderate collinearity between Quality of Sleep and Stress levels (VIF = 7). To address this, I will remove the predictor Quality of Sleep from the model.
sleep_data$Sleep.Disorder <- relevel(sleep_data$Sleep.Disorder, ref=3)
sleep_lm1 <- lm(Sleep.Duration ~ Gender + Sleep.Disorder +
Physical.Activity.Level + Stress.Level , data=sleep_data)
summary(sleep_lm1)
##
## Call:
## lm(formula = Sleep.Duration ~ Gender + Sleep.Disorder + Physical.Activity.Level +
## Stress.Level, data = sleep_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96933 -0.24517 -0.01905 0.19433 0.86545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.842461 0.090904 97.272 < 2e-16 ***
## GenderMale 0.353683 0.049855 7.094 6.72e-12 ***
## Sleep.DisorderSleep Apnea -0.022179 0.062384 -0.356 0.722
## Sleep.DisorderInsomnia -0.388202 0.053151 -7.304 1.75e-12 ***
## Physical.Activity.Level 0.004855 0.001080 4.495 9.33e-06 ***
## Stress.Level -0.388454 0.013077 -29.704 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3825 on 368 degrees of freedom
## Multiple R-squared: 0.772, Adjusted R-squared: 0.7689
## F-statistic: 249.2 on 5 and 368 DF, p-value: < 2.2e-16
confint(sleep_lm1)
## 2.5 % 97.5 %
## (Intercept) 8.663703969 9.021218802
## GenderMale 0.255646474 0.451720189
## Sleep.DisorderSleep Apnea -0.144852962 0.100495506
## Sleep.DisorderInsomnia -0.492718976 -0.283685303
## Physical.Activity.Level 0.002731114 0.006978954
## Stress.Level -0.414170349 -0.362738396
vif(sleep_lm1)
## GVIF Df GVIF^(1/(2*Df))
## Gender 1.588134 1 1.260212
## Sleep.Disorder 1.716423 2 1.144606
## Physical.Activity.Level 1.290457 1 1.135983
## Stress.Level 1.372860 1 1.171691
Now that we have confirmed there is no issue with multicollinearity, as all predictors have low VIF values. Also , we observed that all predictor coefficients were statistically significant, except Sleep disorder Apnea. Adjusted R-squared equals 0.768 (p value<0.05) meaning that approximately 77% of the variability in the dataset is explained by our model.
Based on our model, we can conclude the following:
Finally, let’s check if the assumptions of linear model are met. For that we will plot the residuals.
plot(sleep_lm1, 1)
plot(sleep_lm1, 2)
The first plot is a scatterplot of the residuals from our model plotted against the fitted values. The plot displays a random scatter pattern with a slight clustering around the smaller values. The small clustering around smaller values may suggest minor deviations, but overall, the model appears to be a good fit with no significant issues with homoscedasticity.
The Q-Q plot of residuals shows deviations from the diagonal reference line, particularly in the upper tail, indicating non-normality of residuals. The presence of outliers (points 262, 263) and a rightward skew suggest heavy tails and potential model issues such as outliers or model misspecification. This non-normality could affect the reliability of the model’s assumptions and statistical inferences.
The second model will aim to predict stress level using sleep duration, heart rate, gender, and BMI as predictors. Before proceeding, I will first assess the linear relationship between the outcome variable and the continuous predictors.
hr.stress <- cor.test(sleep_data$Stress.Level, sleep_data$Heart.Rate, method = "spearman")
## Warning in cor.test.default(sleep_data$Stress.Level, sleep_data$Heart.Rate, :
## Cannot compute exact p-value with ties
hr.stress
##
## Spearman's rank correlation rho
##
## data: sleep_data$Stress.Level and sleep_data$Heart.Rate
## S = 1566811, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8202966
ggplot(sleep_data, aes(x = Stress.Level, y = Heart.Rate)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "purple") +
annotate("text", x = max(sleep_data$Stress.Level) * 0.8, y = max(sleep_data$Heart.Rate) * 0.9,
label = paste("r =", round(cor(sleep_data$Stress.Level, sleep_data$Heart.Rate, method="spearman"), 2)), size = 5) +
labs(title = "Scatter Plot of Stress Level vs Heart Rate",
x = "Stress Level",
y = "Heart Rate")+
theme_light()
## `geom_smooth()` using formula = 'y ~ x'
The Spearman’s correlation coefficient is 0.82 with a p-value lower than
0.05, indicating a strong positive linear relationship between Stress
Level and Heart Rate.
sd.stress <- cor.test(sleep_data$Stress.Level, sleep_data$Sleep.Duration, method = "spearman")
## Warning in cor.test.default(sleep_data$Stress.Level, sleep_data$Sleep.Duration,
## : Cannot compute exact p-value with ties
sd.stress
##
## Spearman's rank correlation rho
##
## data: sleep_data$Stress.Level and sleep_data$Sleep.Duration
## S = 15786679, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.8106326
ggplot(sleep_data, aes(x = Stress.Level, y = Sleep.Duration)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "purple") +
annotate("text", x = max(sleep_data$Stress.Level) * 0.8, y = max(sleep_data$Sleep.Duration) * 0.9,
label = paste("r =", round(cor(sleep_data$Stress.Level, sleep_data$Sleep.Duration, method="spearman"), 2)), size = 5) +
labs(title = "Scatter Plot of Stress Level vs Sleep Duration",
x = "Stress Level",
y = "Sleep Duration")+
theme_light()
## `geom_smooth()` using formula = 'y ~ x'
The Spearman’s correlation coefficient is -0.81 with a p-value lower than 0.05, indicating a strong negative linear relationship between Stress Level and Sleep Duration.
As we saw the linear realtionship between Stress level and continous predictors we can proceed to fit the glm.
stress.lm <- lm(Stress.Level~ Sleep.Duration + Gender+ Heart.Rate + BMI, data=sleep_data)
summary(stress.lm)
##
## Call:
## lm(formula = Stress.Level ~ Sleep.Duration + Gender + Heart.Rate +
## BMI, data = sleep_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5333 -0.3193 -0.1143 0.5979 1.9255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.52328 1.10800 -2.277 0.0233 *
## Sleep.Duration -1.22395 0.05430 -22.541 < 2e-16 ***
## GenderMale 0.78391 0.07474 10.488 < 2e-16 ***
## Heart.Rate 0.23485 0.01230 19.099 < 2e-16 ***
## BMIObese -3.76140 0.26908 -13.979 < 2e-16 ***
## BMIOverweight -0.34463 0.08147 -4.230 2.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6333 on 368 degrees of freedom
## Multiple R-squared: 0.8743, Adjusted R-squared: 0.8726
## F-statistic: 512.1 on 5 and 368 DF, p-value: < 2.2e-16
vif(stress.lm)
## GVIF Df GVIF^(1/(2*Df))
## Sleep.Duration 1.735970 1 1.317562
## Gender 1.302141 1 1.141114
## Heart.Rate 2.405211 1 1.550874
## BMI 2.471573 2 1.253844
All predictor coefficients in this model were statistically significant. Additionally, all predictors had low VIF values, indicating the absence of multicollinearity. R-squared equals 0.87, meaning that 87% of the variability in the dataset is explained by our linear model.
Based on our model, we observed the following relationships:
Finally, let’s check if the assumptions of linear model are met. For that we will plot the residuals.
plot(stress.lm, 1)
plot(stress.lm, 2)
If we are not very strict, the Residuals vs Fitted plot can be
considered close to random scatter, as most residuals are centered
around zero without a strong systematic pattern. The slight curvature in
the line and minor heteroscedasticity are subtle and unlikely to
significantly impact the model’s conclusions.
The Q-Q plot is used to assess the normality of residuals. The points generally follow the diagonal reference line, with some deviation at the tails, particularly at the tails. This suggests that the residuals are approximately normally distributed, though there might be some slight departure from normality at the extremes.
To sum up, these findings reveal several notable relationships in this dataset:
Stress Levels: - Sleep Duration and Stress: Longer sleep duration is associated with lower stress levels, suggesting that adequate sleep might play a protective role against stress. - Gender and Stress: Females tend to report higher stress levels than males, implying potential gender differences in stress experiences or coping mechanisms. - Heart Rate and Stress: Higher heart rates are linked to increased stress, reflecting a physiological response where stress impacts cardiovascular function. - BMI and Stress: Obese individuals experience significantly lower stress levels than those with a normal BMI. Overweight individuals also report slightly lower stress levels, though the effect is smaller than in the obese group. Lower stress levels in obese and overweight individuals may stem from stress-coping behaviors, hormonal differences, or varied social and psychological factors.
Sleep Duration: - Gender and Sleep: Females tend to sleep slightly longer than males when other factors are held constant. - Sleep Disorders and Sleep: Insomnia is associated with shorter sleep duration compared to those without sleeping disorders, highlighting the adverse effects of insomnia on sleep patterns. - Physical Activity and Sleep: Higher levels of physical activity correspond to marginally longer sleep durations, suggesting a positive association between physical activity and sleep. - Stress and Sleep: Increased stress levels are linked to shorter sleep durations, indicating a potential cyclical relationship where stress and sleep disturbances exacerbate each other.
Implications: - Stress and Sleep Connection: There is a clear interplay between stress and sleep, where poor sleep contributes to higher stress levels, and stress negatively impacts sleep duration. Addressing both factors may be crucial for improving overall well-being. - Role of Gender: Gender differences influence both stress levels and sleep patterns, warranting further investigation into societal, psychological, or biological factors driving these disparities. - Physical Health Factors: The relationships between BMI, heart rate, and stress suggest complex interactions between mental and physical health that could benefit from a holistic approach in interventions.
These findings collectively underscore the importance of considering various demographic, lifestyle, and physiological factors when studying stress and sleep dynamics.
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.0.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Nicosia
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] corrplot_0.95 skimr_2.1.5 gmodels_2.19.1 mctest_1.3.1
## [5] Hmisc_5.2-1 car_3.1-3 carData_3.0-5 GGally_2.2.1
## [9] broom_1.0.7 naniar_1.1.0 lubridate_1.9.3 forcats_1.0.0
## [13] stringr_1.5.1 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
## [17] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 fastmap_1.2.0 digest_0.6.37
## [5] rpart_4.1.23 timechange_0.3.0 lifecycle_1.0.4 cluster_2.1.6
## [9] gdata_3.0.1 magrittr_2.0.3 compiler_4.4.1 rlang_1.1.4
## [13] sass_0.4.9 tools_4.4.1 utf8_1.2.4 yaml_2.3.10
## [17] data.table_1.16.2 knitr_1.48 labeling_0.4.3 htmlwidgets_1.6.4
## [21] plyr_1.8.9 repr_1.1.7 RColorBrewer_1.1-3 abind_1.4-8
## [25] withr_3.0.1 foreign_0.8-86 nnet_7.3-19 grid_4.4.1
## [29] fansi_1.0.6 colorspace_2.1-1 scales_1.3.0 gtools_3.9.5
## [33] MASS_7.3-61 cli_3.6.3 rmarkdown_2.28 generics_0.1.3
## [37] rstudioapi_0.16.0 tzdb_0.4.0 cachem_1.1.0 splines_4.4.1
## [41] base64enc_0.1-3 vctrs_0.6.5 Matrix_1.7-0 jsonlite_1.8.8
## [45] hms_1.1.3 visdat_0.6.0 Formula_1.2-5 htmlTable_2.4.3
## [49] jquerylib_0.1.4 glue_1.7.0 ggstats_0.7.0 stringi_1.8.4
## [53] gtable_0.3.5 munsell_0.5.1 pillar_1.9.0 htmltools_0.5.8.1
## [57] R6_2.5.1 evaluate_1.0.0 lattice_0.22-6 highr_0.11
## [61] backports_1.5.0 bslib_0.8.0 Rcpp_1.0.13 gridExtra_2.3
## [65] nlme_3.1-164 checkmate_2.3.2 mgcv_1.9-1 xfun_0.47
## [69] pkgconfig_2.0.3