Data101 Final

A- Introduction (1-2 paragraphs):

 My research question:

Do mean hours of sleep differ among transportation worker professions?

 My Data-set:

The data-set I will be using is called “sleep_deprivation”. It contains 1087 observations and 2 variables. I plan to use both variables, “sleep” and “profession”, to compare mean hours of sleep across professions of transportation workers. The variable “profession” contains bus / taxi / limo drivers, control, pilots, train operators, and truck drivers. I want to look at this because I think it is important for several social, health, and economic reasons, especially for jobs that affect public safety, like transportation. Sleep-deprived workers have slower reaction times, reduced attention, and impaired decision-making. In professions such as truck driving, piloting, train operation, or bus driving, lack of sleep can increase the risk of accidents that endanger both workers and the public (I know that many crashes occur within public transportation). Understanding sleep deprivation helps identify which occupations are at higher risk and need stronger safety regulations.

 Data-set link: https://www.openintro.org/data/index.php?data=sleep_deprivation

B- Data Analysis (1 paragraph and 3-5 chunks of code): In your paragraph, describe the type of data analysis you will perform and the types of plots you will generate to address your research question.

For my data analysis, I will first filter the data-set to the four professions I will be using: bus / taxi / limo drivers, pilots, train operators, and truck drivers. Then I will select the variables, even though there are only two. Next, I will use mutate to convert the sleep categories (<6, 6-8, and >8 hours) into approximate numeric values of 5, 7, and 9 hours so that a mean can be calculated. Lastly, I will group_by profession and summarize the mean and max sleep hours for each one.

library(tidyverse)
library(ggplot2)
library(dplyr)

#Setting Working directory
setwd("C:/Users/Joanne G/OneDrive/Data101(Fall 2025)/Datasets")

#read sleep_deprivation.csv in here
sleep_deprivation_df <- read.csv("sleep_deprivation.csv")

 Clean the data-set and conduct exploratory data analysis (EDA) to better understand the data (2 functions minimum)

# EDA Data-set Chunk

#dimensions
dim(sleep_deprivation_df )

## [1] 1087    2

#head
head(sleep_deprivation_df )

##   sleep profession
## 1    <6    control
## 2    <6    control
## 3    <6    control
## 4    <6    control
## 5    <6    control
## 6    <6    control

summary(sleep_deprivation_df )

##     sleep            profession       
##  Length:1087        Length:1087       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

 Use a minimum of three dplyr functions (filter, select, mutate, summary, mean, max, etc.,) to manipulate the data-set and prepare it for analysis.

filtered_sleep_deprivation_df <- sleep_deprivation_df |>
   # keep only the four transportation professions
   filter(profession %in% c("bus / taxi / limo drivers","pilots","train operators","truck drivers")) |>
   select(profession, sleep) |>
   # mutate to change the sleep categories into approximate numeric hours:
   # "<6" becomes 5, "6-8" becomes 7, and anything else (">8") becomes 9
   mutate(
      sleep_hours = case_when(
        sleep == "<6"  ~ 5,
        sleep == "6-8" ~ 7,
        TRUE           ~ 9
      )
    )

sleep_deprivation_summary <- filtered_sleep_deprivation_df |>
  group_by(profession) |>
  summarise(
    avg_sleep = mean(sleep_hours),
    max_sleep = max(sleep_hours)
  )

sleep_deprivation_summary

## # A tibble: 4 × 3
##   profession                avg_sleep max_sleep
##   <chr>                         <dbl>     <dbl>
## 1 bus / taxi / limo drivers      7.35         9
## 2 pilots                         7.32         9
## 3 train operators                7.03         9
## 4 truck drivers                  7.16         9
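
The recoding step that produced sleep_hours can also be written as a named lookup vector, which keeps the label-to-hours mapping in one place. This is a minimal sketch on a made-up four-value vector, not the real data. The names `sleep_midpoints` and `example_sleep` are hypothetical, and 5, 7, and 9 are the same stand-in values the analysis assumes for the three categories.

```r
# Hypothetical lookup table: sleep category label -> assumed representative hours.
sleep_midpoints <- c("<6" = 5, "6-8" = 7, ">8" = 9)

# Made-up example labels (illustration only, not taken from the data-set).
example_sleep <- c("<6", "6-8", ">8", "6-8")

# Index the lookup vector by label, then drop the names to get plain numbers.
example_hours <- unname(sleep_midpoints[example_sleep])

example_hours        # 5 7 9 7
mean(example_hours)  # 7
```

Because every respondent in a category gets the same stand-in value, the means in the summary table are approximations rather than exact averages of hours slept.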

C- Statistical Analysis (1-2 paragraph and 2-5 chunks of code):

For my statistical analysis, I will run an ANOVA test to determine whether bus / taxi / limo drivers, pilots, train operators, and truck drivers have different mean hours of sleep or whether they all average the same. I will also make a box plot to visually represent the comparison.

 State your hypothesis clearly, use the correct notation, and type them properly. Perform the appropriate test (ANOVA Test or Chi-Squared.)

Hypothesis:

\(\mu_1\) = mean hours of sleep for bus / taxi / limo drivers

\(\mu_2\) = mean hours of sleep for pilots

\(\mu_3\) = mean hours of sleep for train operators

\(\mu_4\) = mean hours of sleep for truck drivers

\(H_0\): \(\mu_1 = \mu_2 = \mu_3 = \mu_4\) (Mean hours of sleep are the same across all four transportation professions; there is no difference.)

\(H_a\): Not all \(\mu_i\) are equal (At least one transportation profession has a different mean number of sleep hours.)

I will be doing an ANOVA test between the professions I filtered:

anova_result <- aov(sleep_hours ~ profession, data = filtered_sleep_deprivation_df)

summary(anova_result)

##              Df Sum Sq Mean Sq F value Pr(>F)  
## profession    3   12.6   4.211   2.942 0.0323 *
## Residuals   791 1132.4   1.432                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p-value: 0.0323
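
As a check on the ANOVA table above, each mean square is a sum of squares divided by its degrees of freedom, and the F statistic is the ratio of the two mean squares. The arithmetic below uses the rounded values printed in the table, so it matches the reported numbers only up to rounding:

\[
MS_{\text{profession}} = \frac{SS_{\text{profession}}}{df_{\text{profession}}} = \frac{12.6}{3} \approx 4.2,
\qquad
MS_{\text{residuals}} = \frac{SS_{\text{residuals}}}{df_{\text{residuals}}} = \frac{1132.4}{791} \approx 1.432
\]

\[
F = \frac{MS_{\text{profession}}}{MS_{\text{residuals}}} = \frac{4.211}{1.432} \approx 2.94
\]

The degrees of freedom, 3 and 791, correspond to 4 professions minus 1 and 795 filtered observations minus 4.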

TukeyHSD(anova_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = sleep_hours ~ profession, data = filtered_sleep_deprivation_df)
## 
## $profession
##                                                  diff        lwr          upr
## pilots-bus / taxi / limo drivers          -0.03554927 -0.3391259  0.268027395
## train operators-bus / taxi / limo drivers -0.31904762 -0.6319375 -0.006157743
## truck drivers-bus / taxi / limo drivers   -0.19474548 -0.4979408  0.108449817
## train operators-pilots                    -0.28349835 -0.5992349  0.032238195
## truck drivers-pilots                      -0.15919622 -0.4653283  0.146935917
## truck drivers-train operators              0.12430213 -0.1910678  0.439672022
##                                               p adj
## pilots-bus / taxi / limo drivers          0.9904791
## train operators-bus / taxi / limo drivers 0.0436854
## truck drivers-bus / taxi / limo drivers   0.3492204
## train operators-pilots                    0.0962212
## truck drivers-pilots                      0.5384175
## truck drivers-train operators             0.7408622
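
To read a Tukey table programmatically rather than by eye, the adjusted p-values can be filtered against a cutoff. The sketch below is an illustration and not part of the original analysis. The helper name `significant_pairs` and the 0.05 cutoff are assumptions, and it is demonstrated on R's built-in `PlantGrowth` data so that it runs without the CSV file.

```r
# Hypothetical helper (not from the original analysis): keep only the
# Tukey comparisons whose adjusted p-value is below alpha.
significant_pairs <- function(aov_fit, factor_name, alpha = 0.05) {
  # TukeyHSD() returns a list with one matrix per factor;
  # its columns are "diff", "lwr", "upr", and "p adj".
  tukey_table <- TukeyHSD(aov_fit)[[factor_name]]
  tukey_table[tukey_table[, "p adj"] < alpha, , drop = FALSE]
}

# Demonstration on R's built-in PlantGrowth data (a stand-in for the sleep data).
demo_fit <- aov(weight ~ group, data = PlantGrowth)
demo_pairs <- significant_pairs(demo_fit, "group")
print(demo_pairs)
```

With the objects from this report, the equivalent call would be `significant_pairs(anova_result, "profession")`. Based on the table above, that would keep only the train operators and bus / taxi / limo drivers comparison (p adj = 0.0437).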

 Create visualizations (e.g., histograms, Box plots, etc.) to visualize the data’s distribution and relationships. Use codes we covered in this class or code you learned in previous courses.

library(ggplot2)

ggplot(filtered_sleep_deprivation_df, aes(x = profession, y = sleep_hours)) +
  geom_boxplot() +
  labs(
    title = "Sleep Hours by Transportation Profession",
    x = "Profession",
    y = "Hours of Sleep"
  ) +
  theme_minimal()

 Interpret the results, including the p-value, alpha, and other relevant statistics, and discuss their significance. You need to include statements for the null and the alternative hypothesis.

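Using the conventional significance level of \(\alpha\) = 0.05, the ANOVA gave F = 2.942 (with 3 and 791 degrees of freedom) and a p-value of 0.0323. Because the p-value is less than alpha, I reject the null hypothesis that mean hours of sleep are the same across all four professions, and there is significant evidence for the alternative hypothesis that at least one profession has a different mean. This does not mean that every profession differs from every other. The Tukey test shows that the only pair with a significant difference is train operators and bus / taxi / limo drivers (difference of about -0.32 hours, adjusted p-value = 0.0437); all other pairs have adjusted p-values above 0.05. Although the averages are close to one another, the box plot suggests that bus / taxi / limo drivers have high variability, which may point to irregular schedules and inconsistent sleep, while train operators appear to have more consistent sleep patterns overall, with a few exceptions. These visual differences are consistent with the results of the ANOVA.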

D- Conclusion and Future Directions(1-2 paragraphs): Summarize the key findings of your analysis, discuss the implications of your results and their relevance to the research question, and suggest potential avenues for future research or further analysis

To conclude, the results supported what I suspected: mean hours of sleep differ among transportation workers in the sleep_deprivation data-set. The ANOVA test (p-value = 0.0323), supported by the box plot, indicated that sleep hours vary across transportation professions. Specifically, train operators had the lowest average (7.03 hours) and bus / taxi / limo drivers had the highest (7.35 hours), and this was the only pair that the Tukey test found to be significantly different. As I mentioned in the intro, these results are important for both worker well-being and public safety. Transportation workers who experience greater sleep deprivation may face increased risks of fatigue-related errors and accidents, highlighting the need for workplace policies that prioritize adequate rest and regulated work schedules. For future research, I wish the data-set had included more variables. I think the analysis could have gone deeper by incorporating additional variables such as shift length, time of day worked, or years of experience to better understand the structural factors contributing to sleep deprivation. Other than that, this was a great data-set to work with, and I’m glad I was able to analyze it in the way I planned.

E- References:

Documentation for my research on sleep deprivation (within professions of drivers)

https://www.sciencedaily.com/releases/2012/03/120304141858.htm

https://www.ncbi.nlm.nih.gov/books/NBK384961/?utm

Final Project

JG

2025-12-14