https://studentaffairs.stanford.edu/sleep-corner-mental-health-and-sleep
For my DATA 110 final project, I am investigating whether occupation and sleep duration affect stress level, using the Sleep Health and Lifestyle Dataset, collected by Laksika Tharmalingam from sleepdata.org. This dataset contains information about individuals’ occupations, lifestyle habits, stress levels, and sleep patterns. The goal of this project is to determine whether certain occupations are associated with higher stress, and whether sleep duration plays a role in that association. For this analysis I used three key variables: Occupation, a categorical variable giving each person's profession; Sleep.Duration, a numerical variable measuring the average hours of sleep per night; and Stress.Level, a numerical variable representing stress on a scale from 1 to 10.
The dataset was cleaned by first checking for missing values using colSums(is.na(sleep_data)), which showed that there were no NA values. Next, I simplified the dataset by keeping only the variables I use (Occupation, Sleep.Duration, and Stress.Level) with the select() function. This step keeps the analysis focused on the key variables.
I chose this topic because, as a student who also has a part-time job, I am deeply interested in how lifestyle factors like sleep and work environment affect stress. Stress is a critical factor in physical and mental health, and it can affect academic performance.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
# Load dataset
sleep_data <- read.csv("Sleep_health_and_lifestyle_dataset.csv")
str(sleep_data)
## 'data.frame': 374 obs. of 13 variables:
## $ Person.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Age : int 27 28 28 28 28 28 29 29 29 29 ...
## $ Occupation : chr "Software Engineer" "Doctor" "Doctor" "Sales Representative" ...
## $ Sleep.Duration : num 6.1 6.2 6.2 5.9 5.9 5.9 6.3 7.8 7.8 7.8 ...
## $ Quality.of.Sleep : int 6 6 6 4 4 4 6 7 7 7 ...
## $ Physical.Activity.Level: int 42 60 60 30 30 30 40 75 75 75 ...
## $ Stress.Level : int 6 8 8 8 8 8 7 6 6 6 ...
## $ BMI.Category : chr "Overweight" "Normal" "Normal" "Obese" ...
## $ Blood.Pressure : chr "126/83" "125/80" "125/80" "140/90" ...
## $ Heart.Rate : int 77 75 75 85 85 85 82 70 70 70 ...
## $ Daily.Steps : int 4200 10000 10000 3000 3000 3000 3500 8000 8000 8000 ...
## $ Sleep.Disorder : chr "None" "None" "None" "Sleep Apnea" ...
#Cleaning the dataset
colSums(is.na(sleep_data))
## Person.ID Gender Age
## 0 0 0
## Occupation Sleep.Duration Quality.of.Sleep
## 0 0 0
## Physical.Activity.Level Stress.Level BMI.Category
## 0 0 0
## Blood.Pressure Heart.Rate Daily.Steps
## 0 0 0
## Sleep.Disorder
## 0
unique(sleep_data$Occupation)
## [1] "Software Engineer" "Doctor" "Sales Representative"
## [4] "Teacher" "Nurse" "Engineer"
## [7] "Accountant" "Scientist" "Lawyer"
## [10] "Salesperson" "Manager"
#Keep only the variables we need
sleep_data<-sleep_data|>
select(Occupation,Sleep.Duration,Stress.Level)
avg_sleep_data <- sleep_data |>
group_by(Occupation) |>
summarize(mean_stress = mean(Stress.Level)) |>
arrange(desc(mean_stress))
# Custom color palette
custom_colors <- c("#1b9e77", "#d95f02", "#7570b3", "#e7298a",
"#66a61e", "#e6ab02", "#a6761d", "#666666","#00FFCE","#FF0086","#000000")
# Bar graph of mean stress level by occupation
ggplot(avg_sleep_data, aes(x = reorder(Occupation, -mean_stress), y = mean_stress, fill = Occupation)) +
geom_col() +
scale_fill_manual(values = custom_colors) +
theme_minimal() +
labs(title = "Average Stress Level by Occupation",
x = "Occupation",
y = "Mean Stress Level") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
# Reuse the custom_colors palette defined above
# Boxplot of sleep duration by occupation
ggplot(sleep_data, aes(x = Occupation, y = Sleep.Duration, fill = Occupation)) +
geom_boxplot() +
scale_fill_manual(values = custom_colors) +
theme_bw() +
labs(title = "Sleep Duration by Occupation",
x = "Occupation",
y = "Sleep Duration (hours)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
#lm model
sleep_model<-lm(Stress.Level~Sleep.Duration+Occupation, data=sleep_data)
summary(sleep_model)
##
## Call:
## lm(formula = Stress.Level ~ Sleep.Duration + Occupation, data = sleep_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82578 -0.55422 -0.09543 0.49250 2.68649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.271125 0.424130 43.079 < 2e-16 ***
## Sleep.Duration -1.922613 0.057361 -33.518 < 2e-16 ***
## OccupationDoctor 1.862691 0.142951 13.030 < 2e-16 ***
## OccupationEngineer 0.974250 0.154162 6.320 7.71e-10 ***
## OccupationLawyer 1.040491 0.155632 6.686 8.73e-11 ***
## OccupationManager -0.005098 0.713418 -0.007 0.9943
## OccupationNurse 0.856259 0.142074 6.027 4.12e-09 ***
## OccupationSales Representative 1.072289 0.515701 2.079 0.0383 *
## OccupationSalesperson 1.039603 0.174735 5.950 6.34e-09 ***
## OccupationScientist 0.264550 0.375933 0.704 0.4821
## OccupationSoftware Engineer 0.706510 0.371054 1.904 0.0577 .
## OccupationTeacher -0.883847 0.162375 -5.443 9.67e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7039 on 362 degrees of freedom
## Multiple R-squared: 0.8473, Adjusted R-squared: 0.8427
## F-statistic: 182.6 on 11 and 362 DF, p-value: < 2.2e-16
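Stress.Level = 18.27 - 1.92(Sleep.Duration) + 1.86(Doctor) + 0.97(Engineer) + 1.04(Lawyer) - 0.01(Manager) + 0.86(Nurse) + 1.07(Sales Representative) + 1.04(Salesperson) + 0.26(Scientist) + 0.71(Software Engineer) - 0.88(Teacher)
Each occupation term is an indicator (1 if the person has that occupation, 0 otherwise). Accountant is the baseline occupation, so its effect is absorbed into the intercept.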
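As a worked example of reading the fitted equation, the sketch below plugs in the coefficients printed by summary(sleep_model) for a Doctor who sleeps 6.2 hours. The helper predict_stress() is a hypothetical name introduced only for this illustration; in the real analysis, predict(sleep_model, newdata = ...) would do the same job.

```r
# Coefficients copied from the summary(sleep_model) output above
intercept <- 18.271125
b_sleep   <- -1.922613
b_doctor  <- 1.862691

# Hypothetical helper (not part of the original analysis):
# occupation_effect is 0 for the baseline occupation (Accountant)
predict_stress <- function(sleep_hours, occupation_effect = 0) {
  intercept + b_sleep * sleep_hours + occupation_effect
}

doctor_pred     <- predict_stress(6.2, b_doctor)  # Doctor, 6.2 hours of sleep
accountant_pred <- predict_stress(6.2)            # Accountant, same sleep

doctor_pred                    # about 8.21
doctor_pred - accountant_pred  # equals the Doctor coefficient, 1.86
```

The predicted value of about 8.21 is close to the stress level of 8 recorded for the Doctor with 6.2 hours of sleep in the str() output.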
The p-value for the overall F-test is less than 2.2e-16, which is well below the 0.05 significance level, so we reject the null hypothesis that none of the predictors is related to stress level. This means there is significant evidence that sleep duration and occupation are associated with stress level.
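The overall p-value can be checked from the F-statistic and its degrees of freedom reported in the summary. This is a minimal sketch using base R's pf(); the three numbers are copied from the output above.

```r
# Values copied from the last line of summary(sleep_model)
f_stat <- 182.6
df1    <- 11    # number of predictors (sleep + 10 occupation indicators)
df2    <- 362   # residual degrees of freedom

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

# R prints "< 2.2e-16" when a p-value falls below machine epsilon
p_value < .Machine$double.eps
```

The "< 2.2e-16" in the summary is therefore a display floor, not the exact p-value.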
The adjusted R² is 0.8427, meaning the model explains about 84% of the variation in stress levels after adjusting for the number of predictors. This is a strong fit. It is not perfect, which is expected given that stress is influenced by many factors outside this dataset.
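The adjusted R² can be reproduced from the multiple R², the sample size, and the number of predictors. The sketch below uses the standard adjustment formula with the values reported in the summary output.

```r
r_squared <- 0.8473  # Multiple R-squared from the summary
n         <- 374     # observations in sleep_data
k         <- 11      # predictors: Sleep.Duration + 10 occupation indicators

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
round(adj_r_squared, 4)  # 0.8427
```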
#plot diagnostic model
par(mfrow=c(2,2))
plot(sleep_model)
## Warning: not plotting observations with leverage one:
## 264
The diagnostic plots suggest that the assumptions of linear regression are reasonably met, although some points deviate slightly from normality or have moderate leverage. R also warned that observation 264 was not plotted because it has a leverage of one.
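One common way an observation ends up with leverage one is being the only member of a factor level, since its indicator coefficient then fits that point exactly. Whether that is what happened with observation 264 is an assumption, not something the output confirms. The sketch below shows the effect on a small made-up data frame (toy_lev is invented for illustration).

```r
# Toy data, invented for illustration: job "C" has a single observation
toy_lev <- data.frame(
  stress = c(6, 7, 5, 8, 7, 6, 4),
  sleep  = c(6.5, 6.0, 7.2, 5.9, 6.1, 6.4, 7.5),
  job    = c("A", "A", "A", "B", "B", "B", "C")
)

toy_model <- lm(stress ~ sleep + job, data = toy_lev)

# Leverage (hat value) of each observation
lev <- hatvalues(toy_model)
round(lev, 3)

# Rows whose leverage is numerically equal to one
high_lev <- unname(which(lev > 1 - 1e-8))
high_lev  # row 7, the only "C" observation
```

For the real data, table(sleep_data$Occupation) would show whether any occupation has only one person.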
#Summarize average stress per sleep & occupation pair
heatmap_data<-sleep_data|>
group_by(Occupation,Sleep.Duration)|>
summarize(avg_stress = mean(Stress.Level), .groups = "drop")
#interactive heatmap
heatmap<-ggplot(heatmap_data,aes(x=Sleep.Duration,y=Occupation,fill=avg_stress))+
geom_tile(color="white")+
scale_fill_viridis_c(option = "C") +
theme_bw()+
labs(title = "Heatmap of Average Stress by Sleep Duration and Occupation",
x = "Sleep Duration (hours)",
y = "Occupation",
fill = "Avg Stress")
ggplotly(heatmap)
#Histogram
ggplot(sleep_data,aes(x=Stress.Level,fill=Occupation))+
geom_histogram(binwidth = 1, position = "dodge", color = "black") +
scale_fill_manual(values=custom_colors) +
theme_minimal() +
labs(title = "Histogram of Stress Levels by Occupation",
x = "Stress Level",
y = "Count",
fill = "Occupation")
Bar Graph of Average Stress by Occupation: This plot shows that Sales Representatives have the highest average stress levels, followed by Salespersons and Scientists. Surprisingly, Engineers show the lowest stress levels among the professions analyzed. This goes against common assumptions about how stressful engineering is.
Boxplot of Sleep Duration by Occupation: Doctors and Nurses have the longest sleep durations, which may be related to lower stress levels. Teachers and Scientists have fairly short sleep durations, which may go along with higher stress.
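Linear Regression Model: The model (Stress.Level ~ Sleep.Duration + Occupation) indicates that sleep duration and occupation are both significantly associated with stress (overall p-value < 0.05). The negative coefficient for Sleep.Duration (-1.92) means that each additional hour of sleep is associated with a stress level about 1.92 points lower, holding occupation constant. The occupation coefficients are measured relative to the baseline occupation (Accountant): Doctors, Engineers, Lawyers, Nurses, Sales Representatives, and Salespersons have significantly higher stress at the same sleep duration, Teachers have significantly lower stress, and the Manager, Scientist, and Software Engineer coefficients are not significant at the 0.05 level.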
Heatmap of Stress by Occupation and Sleep: This visualization highlights how stress levels vary across sleep durations for each occupation. For example, Sales Representatives show high stress even with moderate sleep, while Engineers exhibit lower stress with longer sleep.
Histogram of Stress Levels by Occupation: The histogram reveals clustering, with Doctors, Nurses, and Lawyers skewed toward higher stress levels, while Teachers and Managers are more evenly distributed.
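The occupation coefficients in the regression are differences from a baseline occupation, so changing the baseline changes the numbers but not the fitted values. The sketch below illustrates this on a small made-up data frame (toy_jobs is invented for illustration) using base R's relevel().

```r
# Toy data, invented for illustration
toy_jobs <- data.frame(
  stress = c(5, 6, 8, 9, 3, 4),
  job    = factor(c("Accountant", "Accountant", "Doctor",
                    "Doctor", "Teacher", "Teacher"))
)

# Default baseline is the first level alphabetically ("Accountant")
fit_default <- lm(stress ~ job, data = toy_jobs)
coef(fit_default)    # intercept = Accountant mean; others = differences from it

# Same model with "Teacher" as the baseline
toy_jobs$job_t <- relevel(toy_jobs$job, ref = "Teacher")
fit_releveled <- lm(stress ~ job_t, data = toy_jobs)
coef(fit_releveled)  # intercept = Teacher mean; others = differences from it

# Fitted values are identical under either baseline
all.equal(unname(fitted(fit_default)), unname(fitted(fit_releveled)))
```

In the project's model, the same idea means every occupation coefficient is a comparison with Accountants at the same sleep duration.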