Sleep

Author

Zihao Yu

Approach

Sleep efficiency dataset

I chose this dataset because many people stay up late. The data record how much of each night is spent in each sleep stage, so I want to see how different factors (caffeine, alcohol consumption, smoking status, and exercise) affect sleep quality.

Compare the effects of various factors on sleep states

Sleep dataset: https://raw.githubusercontent.com/XxY-coder/data607-week1/refs/heads/main/Sleep_Efficiency.csv

View the columns and drop the ones that are not needed.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gt)

url <- "https://raw.githubusercontent.com/XxY-coder/data607-week1/refs/heads/main/Sleep_Efficiency.csv"
df <- read_csv(
  file = url
)
Rows: 452 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): Gender, Smoking status
dbl  (11): ID, Age, Sleep duration, Sleep efficiency, REM sleep percentage, ...
dttm  (2): Bedtime, Wakeup time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df)
Rows: 452
Columns: 15
$ ID                       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ Age                      <dbl> 65, 69, 40, 40, 57, 36, 27, 53, 41, 11, 50, 5…
$ Gender                   <chr> "Female", "Male", "Female", "Female", "Male",…
$ Bedtime                  <dttm> 2021-03-06 01:00:00, 2021-12-05 02:00:00, 20…
$ `Wakeup time`            <dttm> 2021-03-06 07:00:00, 2021-12-05 09:00:00, 20…
$ `Sleep duration`         <dbl> 6.0, 7.0, 8.0, 6.0, 8.0, 7.5, 6.0, 10.0, 6.0,…
$ `Sleep efficiency`       <dbl> 0.88, 0.66, 0.89, 0.51, 0.76, 0.90, 0.54, 0.9…
$ `REM sleep percentage`   <dbl> 18, 19, 20, 23, 27, 23, 28, 28, 28, 18, 23, 1…
$ `Deep sleep percentage`  <dbl> 70, 28, 70, 25, 55, 60, 25, 52, 55, 37, 57, 6…
$ `Light sleep percentage` <dbl> 12, 53, 10, 52, 18, 17, 47, 20, 17, 45, 20, 2…
$ Awakenings               <dbl> 0, 3, 1, 3, 3, 0, 2, 0, 3, 4, 1, 0, 0, 4, 2, …
$ `Caffeine consumption`   <dbl> 0, 0, 0, 50, 0, NA, 50, 50, 50, 0, 50, 0, 50,…
$ `Alcohol consumption`    <dbl> 0, 3, 0, 5, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ `Smoking status`         <chr> "Yes", "Yes", "No", "Yes", "No", "No", "Yes",…
$ `Exercise frequency`     <dbl> 3, 3, 3, 1, 3, 1, 1, 3, 1, 0, 3, 3, 1, 3, 0, …

Cleaning the data

Caffeine, alcohol, smoking, and exercise are the most straightforward factors, so I keep those columns and rename them for convenience. I then check for NA values and remove the rows that contain them.

# Keep the columns of interest, rename them to short snake_case names,
# and drop rows with a missing value in any of the four factors
df2 <-
  df |>
  select(
    Age,
    Gender,
    sleep_time      = `Sleep duration`,
    caffeine_intake = `Caffeine consumption`,
    alcohol_intake  = `Alcohol consumption`,
    smoking_status  = `Smoking status`,
    exercise_freq   = `Exercise frequency`
  ) |>
  filter(
    !is.na(caffeine_intake),
    !is.na(alcohol_intake),
    !is.na(smoking_status),
    !is.na(exercise_freq)
  )
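Before removing rows, it can help to see how many values are missing in each column. The block below is a minimal sketch of that check on a small toy table; the values in `toy` are made up for illustration and are not taken from the sleep dataset. It also uses `tidyr::drop_na()`, which should give the same result as the chain of `!is.na()` conditions.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data with made-up values (hypothetical, not from the sleep dataset)
toy <- tibble(
  caffeine_intake = c(0, NA, 50, 25),
  alcohol_intake  = c(0, 3, NA, 1),
  smoking_status  = c("Yes", "No", "No", "Yes"),
  exercise_freq   = c(3, 1, 0, NA)
)

# Count the missing values in every column before dropping anything
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# drop_na() removes every row that has an NA in any of the listed columns
toy_clean <- toy |>
  drop_na(caffeine_intake, alcohol_intake, smoking_status, exercise_freq)

print(na_counts)
print(nrow(toy_clean))
```

The same two steps could be run on `df2` (before the `filter()` call) in place of `toy` to see which factor is responsible for most of the dropped rows.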

# Inspect the cleaned data frame (df2) rather than the original df
glimpse(df2)

Check for relevance

Next, compare how each factor affects sleep time.

ggplot(df2, aes(Age, sleep_time)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  labs(title = "sleep_time vs Age", x = "Age", y = "sleep_time")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(df2, aes(caffeine_intake, sleep_time)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm" ) +
  labs(title = "sleep_time vs Caffeine", x = "caffeine_intake", y = "sleep_time")
`geom_smooth()` using formula = 'y ~ x'

ggplot(df2, aes(alcohol_intake, sleep_time)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm" ) +
  labs(title = "sleep_time vs alcohol_intake", x = "alcohol_intake", y = "sleep_time")
`geom_smooth()` using formula = 'y ~ x'

# Map fill to the same discrete factor used on the x axis
ggplot(df2, aes(factor(exercise_freq), sleep_time, fill = factor(exercise_freq))) +
  geom_boxplot() +
  labs(title = "sleep_time vs exercise_freq", x = "exercise_freq", y = "sleep_time", fill = "exercise_freq")

df2 |>
  count(exercise_freq)
# A tibble: 6 × 2
  exercise_freq     n
          <dbl> <int>
1             0   113
2             1    82
3             2    46
4             3   121
5             4    38
6             5     7
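Smoking status is kept in df2, but none of the plots above use it. One possible way to compare it is a grouped summary of sleep time. This is only a sketch on a toy table with made-up values (the numbers in `toy` are hypothetical and are not results from the sleep dataset); the column names match those in df2.

```r
library(dplyr)
library(tibble)

# Toy data with made-up values (hypothetical, not from the sleep dataset)
toy <- tibble(
  smoking_status = c("Yes", "Yes", "No", "No", "No"),
  sleep_time     = c(6, 8, 7, 7.5, 8)
)

# Group size, mean, and median sleep time for each smoking status
smoking_summary <- toy |>
  group_by(smoking_status) |>
  summarise(
    n            = n(),
    mean_sleep   = mean(sleep_time),
    median_sleep = median(sleep_time)
  )

print(smoking_summary)
```

Replacing `toy` with `df2` would give the corresponding summary for the real data, and the same grouping could be drawn as a box plot with `geom_boxplot()`.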

Conclusions

The four comparisons show that the relationships between sleep duration and age, caffeine intake, alcohol intake, and exercise frequency are relatively weak. The scatter plots show no discernible trend, with only slight variations in how the points are distributed. The box plots for exercise frequency cover similar ranges across groups, with most sleep durations concentrated in the 7–8 hour range.
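The impression of a weak relationship from a scatter plot can be backed up with a number. The block below is a minimal sketch that computes the Pearson correlation between sleep time and each numeric factor using base R's `cor()`. The table `toy` holds made-up values for illustration only (hypothetical, not results from the sleep dataset); the column names match those in df2.

```r
# Toy data with made-up values (hypothetical, not from the sleep dataset)
toy <- data.frame(
  Age             = c(20, 30, 40, 50, 60),
  caffeine_intake = c(0, 50, 25, 75, 0),
  alcohol_intake  = c(0, 1, 0, 3, 2),
  sleep_time      = c(7, 7.5, 8, 7, 7.5)
)

predictors <- c("Age", "caffeine_intake", "alcohol_intake")

# Pearson correlation of each predictor with sleep_time;
# use = "complete.obs" skips any rows that still contain an NA
cors <- sapply(
  predictors,
  function(v) cor(toy[[v]], toy$sleep_time, method = "pearson", use = "complete.obs")
)

print(round(cors, 2))
```

Running the same `sapply()` call on `df2` instead of `toy` would give one coefficient per factor; values close to 0 would be consistent with the flat trends in the plots.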