Course: PSYC 7750G – Reproducible Psychological Research
Instructor: Christian A. Martinez

1. SETUP & DATA IMPORT

Let’s load the packages we need and import the data. I might’ve loaded some extras, just in case (dplyr, ggplot2, and stringr are already attached by tidyverse).

library(tidyverse)
library(readxl)
library(knitr)
library(dplyr)
library(ggplot2)
library(httr)
library(jsonlite)
library(supernova)
library(AICcmodavg)
library(mosaic)
library(stringr)
library(janitor)

participant_info <- read_excel("midterm_sleep_exercise.xlsx",
                               sheet = "participant_info_midterm")

sleep_data <- read_excel("midterm_sleep_exercise.xlsx",
                         sheet = "sleep_data_midterm")
head(participant_info)
## # A tibble: 6 × 4
##   ID    Exercise_Group Sex      Age
##   <chr> <chr>          <chr>  <dbl>
## 1 P001  NONE           Male      35
## 2 P002  Nonee          Malee     57
## 3 P003  None           Female    26
## 4 P004  None           Female    29
## 5 P005  None           Male      33
## 6 P006  None           Female    33
head(sleep_data)
## # A tibble: 6 × 4
##   ID    Pre_Sleep Post_Sleep Sleep_Efficiency
##   <chr> <chr>          <dbl>            <dbl>
## 1 P001  zzz-5.8          4.7             81.6
## 2 P002  Sleep-6.6        7.4             75.7
## 3 P003  <NA>             6.2             82.9
## 4 P004  SLEEP-7.2        7.3             83.6
## 5 P005  score-7.4        7.4             83.5
## 6 P006  Sleep-6.6        7.1             88.5

2. MERGE & BASE CLEANING

Let’s lowercase and standardize the column names so they are easier to work with. We also need to clean the values themselves, since there are many typos in the sex and exercise_group columns. Each participant’s group corresponds neatly with their ID number: P001-P025 are None, P026-P050 are Cardio, P051-P075 are Weights, and the remaining IDs are Cardio_and_Weights. For sex, if there is an F anywhere in the cell, the participant is coded as female; otherwise, they are coded as male.

participant_info_clean <- participant_info %>%
  clean_names()

sleep_data_clean <- sleep_data %>%
  clean_names()

colnames(participant_info_clean)
## [1] "id"             "exercise_group" "sex"            "age"
colnames(sleep_data_clean)
## [1] "id"               "pre_sleep"        "post_sleep"       "sleep_efficiency"
participant_info_clean <- participant_info_clean %>%
  mutate(
    id_num = parse_number(id),
    exercise_group = case_when(
      id_num <= 25 ~ "None",
      id_num >= 26 & id_num <= 50 ~ "Cardio",
      id_num >= 51 & id_num <= 75 ~ "Weights",
      TRUE ~ "Cardio_and_Weights"
    ),
    sex = ifelse(
      str_detect(str_to_upper(sex), "F"),
      "Female",
      "Male"
    )
  )

unique(participant_info_clean$exercise_group)
## [1] "None"               "Cardio"             "Weights"           
## [4] "Cardio_and_Weights"
unique(participant_info_clean$sex)
## [1] "Male"   "Female"
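
As a quick check on the two recoding rules, the sketch below applies them to a handful of made-up values using base R only. The vectors `ids` and `sex_raw` are hypothetical, and `cut()` and `grepl()` are stand-ins for the `case_when()` and `str_detect()` calls used in this report, not a replacement for them.

```r
# Toy inputs (hypothetical values, not rows from the midterm workbook)
ids     <- c("P001", "P025", "P026", "P050", "P051", "P075", "P076", "P100")
sex_raw <- c("Male", "Malee", "female", "F", "M", NA, "FEMALE", "male")

# Pull the numeric part of the ID, then bin it into the four ID ranges
id_num <- as.numeric(sub("^P", "", ids))
group  <- cut(
  id_num,
  breaks = c(0, 25, 50, 75, Inf),   # (0,25], (25,50], (50,75], (75,Inf]
  labels = c("None", "Cardio", "Weights", "Cardio_and_Weights")
)

# "F" anywhere in the cell means Female; anything else is Male; NA stays NA
sex_clean <- ifelse(
  is.na(sex_raw), NA_character_,
  ifelse(grepl("F", toupper(sex_raw), fixed = TRUE), "Female", "Male")
)

data.frame(ids, group, sex_raw, sex_clean)
```

One caveat with the "F anywhere" rule: any unexpected entry that lacks an F is silently coded as Male, so it is worth looking at `table()` of the raw column before recoding.
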
midterm_data <- participant_info_clean %>%
  left_join(sleep_data_clean, by = "id")

head(midterm_data)
## # A tibble: 6 × 8
##   id    exercise_group sex      age id_num pre_sleep post_sleep sleep_efficiency
##   <chr> <chr>          <chr>  <dbl>  <dbl> <chr>          <dbl>            <dbl>
## 1 P001  None           Male      35      1 zzz-5.8          4.7             81.6
## 2 P002  None           Male      57      2 Sleep-6.6        7.4             75.7
## 3 P003  None           Female    26      3 <NA>             6.2             82.9
## 4 P004  None           Female    29      4 SLEEP-7.2        7.3             83.6
## 5 P005  None           Male      33      5 score-7.4        7.4             83.5
## 6 P006  None           Female    33      6 Sleep-6.6        7.1             88.5
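
Before trusting a merge like this one, it can help to confirm that the join key behaves. The sketch below uses two small hypothetical tables (`info` and `sleep`, not the real sheets) and base R to check for duplicated IDs and for IDs that appear in only one table; `merge(..., all.x = TRUE)` is used here as the base R counterpart of `left_join()`.

```r
# Toy tables (hypothetical rows) standing in for the two cleaned sheets
info  <- data.frame(id = c("P001", "P002", "P003"), age = c(35, 57, 26))
sleep <- data.frame(id = c("P001", "P002", "P004"), post_sleep = c(4.7, 7.4, 7.3))

# 1. Each ID should appear once per table, or the join will duplicate rows
dup_info  <- anyDuplicated(info$id)    # 0 means no duplicates
dup_sleep <- anyDuplicated(sleep$id)

# 2. IDs present in one table but missing from the other
only_in_info  <- setdiff(info$id, sleep$id)
only_in_sleep <- setdiff(sleep$id, info$id)

# 3. Keep every row of `info`; unmatched IDs get NA in the sleep columns
merged <- merge(info, sleep, by = "id", all.x = TRUE)
merged
```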

3. CREATE DERIVED VARIABLES

The two datasets are now merged, but there are still issues with the data. The pre_sleep column is cluttered with non-numeric prefixes (for example, zzz-5.8 and Sleep-6.6). Let’s strip those characters and make sure the resulting numbers make sense. Once that is done, we can create the sleep_difference variable (post_sleep minus pre_sleep). Some participants were missing pre- or post-intervention sleep data, so we removed those 14 rows from the final cleaned dataset. We were also asked to divide participants into two age groups: under 40 and 40 or older (labeled “>40” in the code).

midterm_data_clean <- midterm_data %>% 
  mutate(pre_sleep = parse_number(pre_sleep))

midterm_data_clean <- midterm_data_clean %>%
  mutate(pre_sleep = abs(pre_sleep))
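
The `abs()` step is presumably needed because the hyphen in a value like zzz-5.8 gets read as a minus sign when the number is extracted. As an illustration only, here is a base R alternative that strips every leading non-digit character (label and hyphen together) so no sign is ever picked up. The vector `pre_raw` is a toy copy of the pattern shown in `head(sleep_data)`, not the full column.

```r
# Toy strings following the pattern in head(sleep_data), plus a missing value
pre_raw <- c("zzz-5.8", "Sleep-6.6", NA, "SLEEP-7.2", "score-7.4")

# Drop all leading non-digit characters, then convert what is left to numeric
pre_num <- as.numeric(sub("^[^0-9]*", "", pre_raw))

pre_num
```

This approach would also discard a genuine negative sign, which seems acceptable here because hours of sleep cannot be negative.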

midterm_data_clean <- midterm_data_clean %>%
  mutate(
    sleep_difference = post_sleep - pre_sleep
  )

midterm_data_clean <- midterm_data_clean %>% 
  mutate(agegroup2 = case_when(
    age < 40 ~ "<40",
    age >= 40 ~ ">40"
  ))

sum(is.na(midterm_data_clean$sleep_difference))
## [1] 14
midterm_data_clean <- midterm_data_clean %>%
  filter(!is.na(sleep_difference))

glimpse(midterm_data_clean)
## Rows: 86
## Columns: 10
## $ id               <chr> "P001", "P002", "P004", "P005", "P006", "P007", "P008…
## $ exercise_group   <chr> "None", "None", "None", "None", "None", "None", "None…
## $ sex              <chr> "Male", "Male", "Female", "Male", "Female", "Male", "…
## $ age              <dbl> 35, 57, 29, 33, 33, 32, 30, 37, 28, 30, 20, 42, 33, 2…
## $ id_num           <dbl> 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 1…
## $ pre_sleep        <dbl> 5.8, 6.6, 7.2, 7.4, 6.6, 6.0, 8.1, 5.5, 5.7, 7.0, 5.5…
## $ post_sleep       <dbl> 4.7, 7.4, 7.3, 7.4, 7.1, 6.7, 9.0, 5.1, 6.3, 6.2, 4.6…
## $ sleep_efficiency <dbl> 81.6, 75.7, 83.6, 83.5, 88.5, 83.6, 73.4, 88.2, 80.4,…
## $ sleep_difference <dbl> -1.1, 0.8, 0.1, 0.0, 0.5, 0.7, 0.9, -0.4, 0.6, -0.8, …
## $ agegroup2        <chr> "<40", ">40", "<40", "<40", "<40", "<40", "<40", "<40…
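
Because the age split uses `age >= 40` for the upper group, a participant who is exactly 40 lands in the older group. The sketch below shows one way to make that boundary explicit with `cut()`; the `ages` vector and the "40+" label are hypothetical choices for illustration, not what the report’s code uses.

```r
ages <- c(20, 39, 40, 41, 57)   # hypothetical ages around the cutoff

# right = FALSE makes intervals closed on the left: [0, 40) and [40, Inf)
age_group <- cut(ages, breaks = c(0, 40, Inf), right = FALSE,
                 labels = c("<40", "40+"))

table(age_group)
```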

4. DESCRIPTIVE STATISTICS

Let’s look at the descriptive statistics.

favstats(midterm_data_clean$sleep_difference)
##   min  Q1 median  Q3 max      mean        sd  n missing
##  -1.1 0.3   0.75 1.1 2.1 0.6825581 0.6610494 86       0
favstats(midterm_data_clean$sleep_efficiency)
##   min     Q1 median     Q3   max     mean       sd  n missing
##  71.7 79.975   83.3 88.425 101.5 83.77558 5.973804 86       0
favstats(sleep_difference ~ exercise_group, data = midterm_data_clean) %>%
  kable(digits = 2, caption = "Descriptive Statistics for Sleep Difference by Exercise Group")
Descriptive Statistics for Sleep Difference by Exercise Group

exercise_group       min     Q1  median   Q3  max  mean    sd   n  missing
Cardio               0.3   0.70     1.2  1.4  2.1  1.14  0.49  21        0
Cardio_and_Weights  -0.1   0.65     0.9  1.1  1.5  0.86  0.38  23        0
None                -1.1  -0.40     0.1  0.6  0.9  0.05  0.64  21        0
Weights             -0.7   0.30     0.5  1.1  1.8  0.67  0.61  21        0

favstats(sleep_efficiency ~ exercise_group, data = midterm_data_clean) %>%
  kable(digits = 2, caption = "Descriptive Statistics for Sleep Efficiency by Exercise Group")
Descriptive Statistics for Sleep Efficiency by Exercise Group

exercise_group       min    Q1  median    Q3    max   mean    sd   n  missing
Cardio              75.9  81.3    85.5  88.0  101.5  85.45  5.99  21        0
Cardio_and_Weights  74.5  83.5    88.7  90.5   96.3  86.83  5.98  23        0
None                71.7  76.6    81.5  83.6   90.4  81.07  5.55  21        0
Weights             74.8  77.9    80.8  83.6   89.5  81.46  4.31  21        0
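
A small consistency check ties the grouped table back to the overall statistics: the group sizes should add up to the 86 retained rows, and the n-weighted average of the group means should land close to the overall mean sleep difference of about 0.68. The numbers below are copied from the tables above; the group means are rounded to two decimals, so a small discrepancy is expected.

```r
# Group sizes and rounded group means from the sleep-difference table
n     <- c(Cardio = 21, Cardio_and_Weights = 23, None = 21, Weights = 21)
means <- c(Cardio = 1.14, Cardio_and_Weights = 0.86, None = 0.05, Weights = 0.67)

total_n    <- sum(n)                     # should equal the 86 rows kept
grand_mean <- sum(n * means) / total_n   # n-weighted mean of the group means

round(grand_mean, 2)
```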

5. VISUALIZATIONS

These plots should give us a clear idea of which groups are showing the best sleep outcomes.

ggplot(midterm_data_clean, aes(x = exercise_group, y = sleep_difference, fill = exercise_group)) +
  geom_boxplot() +
  labs(
    title = "Sleep Difference by Exercise Group", 
    x = "Exercise Group", 
    y = "Change in Sleep (hours)"
  )  + theme_minimal()

ggplot(midterm_data_clean, aes(x = exercise_group, y = sleep_efficiency, fill = exercise_group)) +
  geom_boxplot() +
  labs(
    title = "Sleep Efficiency by Exercise Group", 
    x = "Exercise Group", 
    y = "Sleep Efficiency (%)"
  )  + theme_minimal()

ggplot(midterm_data_clean, aes (x = sleep_difference, y = sleep_efficiency)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Sleep Difference (hours)",
    y = "Sleep Efficiency (%)"
  )  + theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

6. T-TESTS

Let’s see if there are any significant differences based on sex or age group.

t_test_sex <- t.test(sleep_difference ~ sex, data = midterm_data_clean)
t_test_sex
## 
##  Welch Two Sample t-test
## 
## data:  sleep_difference by sex
## t = 1.5801, df = 77.647, p-value = 0.1182
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -0.05865017  0.50972574
## sample estimates:
## mean in group Female   mean in group Male 
##            0.7795918            0.5540541

Female mean sleep difference = .78 hours. Male mean sleep difference = .55 hours. t(77.65) = 1.58, p = .12; difference between means is NOT significant.

t_test_agegroup2 <- t.test(sleep_difference ~ agegroup2, data = midterm_data_clean)
t_test_agegroup2
## 
##  Welch Two Sample t-test
## 
## data:  sleep_difference by agegroup2
## t = -1.3746, df = 36.662, p-value = 0.1776
## alternative hypothesis: true difference in means between group <40 and group >40 is not equal to 0
## 95 percent confidence interval:
##  -0.50676303  0.09717936
## sample estimates:
## mean in group <40 mean in group >40 
##         0.6373134         0.8421053

Participants under 40: mean sleep difference = .64 hours. Participants 40 and over: mean sleep difference = .84 hours. t(36.66) = -1.37, p = .18; difference between means is NOT significant.
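
Rather than retyping values from the console, the reported statistics can be pulled straight from the object that `t.test()` returns. The sketch below runs on simulated data (the `toy` data frame is made up and is not the midterm dataset) and builds the report string from the `statistic`, `parameter`, and `p.value` elements.

```r
set.seed(1)  # simulated data only; these are not the midterm values
toy <- data.frame(
  change = c(rnorm(20, mean = 0.8, sd = 0.6), rnorm(20, mean = 0.5, sd = 0.6)),
  sex    = rep(c("Female", "Male"), each = 20)
)

res <- t.test(change ~ sex, data = toy)   # Welch test by default

# Build a report string from the pieces of the htest object
report <- sprintf(
  "t(%.2f) = %.2f, p = %.3f",
  res$parameter,   # Welch-adjusted degrees of freedom
  res$statistic,   # t value
  res$p.value
)
report
```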

7. ANOVAS + POST-HOCS

# TASK:
# - ANOVA A: Sleep_Difference ~ Exercise_Group
# - ANOVA B: Sleep_Efficiency ~ Exercise_Group
# - For each, report F, df, p, and an effect-size comment, and PRE via supernova
# - HINT: Run supernova() after each ANOVA to obtain PRE (effect size).
# - Run TukeyHSD() on each ANOVA and interpret pairwise differences
#   (Which groups are significantly different? Who “wins”?)

anova_a <- aov(sleep_difference ~ exercise_group, data = midterm_data_clean)
summary(anova_a)
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## exercise_group  3  13.56   4.520   15.72 3.67e-08 ***
## Residuals      82  23.58   0.288                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_a)
##  Analysis of Variance Table (Type III SS)
##  Model: sleep_difference ~ exercise_group
## 
##                              SS df    MS      F   PRE     p
##  ----- --------------- | ------ -- ----- ------ ----- -----
##  Model (error reduced) | 13.560  3 4.520 15.717 .3651 .0000
##  Error (from model)    | 23.583 82 0.288                   
##  ----- --------------- | ------ -- ----- ------ ----- -----
##  Total (empty model)   | 37.144 85 0.437
TukeyHSD(anova_a)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = sleep_difference ~ exercise_group, data = midterm_data_clean)
## 
## $exercise_group
##                                  diff        lwr         upr     p adj
## Cardio_and_Weights-Cardio  -0.2772257 -0.7017134  0.14726203 0.3237562
## None-Cardio                -1.0904762 -1.5245041 -0.65644825 0.0000000
## Weights-Cardio             -0.4714286 -0.9054565 -0.03740063 0.0278779
## None-Cardio_and_Weights    -0.8132505 -1.2377382 -0.38876282 0.0000171
## Weights-Cardio_and_Weights -0.1942029 -0.6186906  0.23028480 0.6287294
## Weights-None                0.6190476  0.1850197  1.05307556 0.0018927

  1. All three exercise groups improved sleep duration significantly more than the None group (all Tukey p < .01).
  2. Relative to no exercise, the gains were largest for Cardio (1.09 hours), followed by Cardio + Weights (0.81 hours) and Weights (0.62 hours).
  3. Among the exercise groups, Cardio improved sleep duration more than Weights (difference = 0.47 hours, p = .028).
  4. Cardio + Weights did not differ significantly from either Cardio (p = .32) or Weights (p = .63).
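
To see where PRE and F come from, the sketch below recomputes them by hand from the sums of squares in the `supernova(anova_a)` table. PRE is the model sum of squares divided by the total sum of squares, and F is the ratio of the two mean squares. The inputs are the rounded values printed above, so the results should match the table only to about three decimals.

```r
# Sums of squares and df copied from the supernova(anova_a) table
ss_model <- 13.560; df_model <- 3
ss_error <- 23.583; df_error <- 82
ss_total <- ss_model + ss_error            # about 37.14, the empty-model SS

pre    <- ss_model / ss_total              # proportional reduction in error
f_stat <- (ss_model / df_model) / (ss_error / df_error)
p_val  <- pf(f_stat, df_model, df_error, lower.tail = FALSE)

round(c(PRE = pre, F = f_stat), 3)
```
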
anova_b <- aov(sleep_efficiency ~ exercise_group, data = midterm_data_clean)
summary(anova_b)
##                Df Sum Sq Mean Sq F value  Pr(>F)   
## exercise_group  3  540.4   180.1   5.925 0.00104 **
## Residuals      82 2492.9    30.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_b)
##  Analysis of Variance Table (Type III SS)
##  Model: sleep_efficiency ~ exercise_group
## 
##                                SS df      MS     F   PRE     p
##  ----- --------------- | -------- -- ------- ----- ----- -----
##  Model (error reduced) |  540.400  3 180.133 5.925 .1782 .0010
##  Error (from model)    | 2492.939 82  30.402                  
##  ----- --------------- | -------- -- ------- ----- ----- -----
##  Total (empty model)   | 3033.339 85  35.686
TukeyHSD(anova_b)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = sleep_efficiency ~ exercise_group, data = midterm_data_clean)
## 
## $exercise_group
##                                  diff        lwr         upr     p adj
## Cardio_and_Weights-Cardio   1.3871636  -2.977172  5.75149915 0.8383629
## None-Cardio                -4.3761905  -8.838613  0.08623232 0.0566544
## Weights-Cardio             -3.9904762  -8.452899  0.47194661 0.0962888
## None-Cardio_and_Weights    -5.7633540 -10.127690 -1.39901844 0.0046379
## Weights-Cardio_and_Weights -5.3776398  -9.741975 -1.01330416 0.0094267
## Weights-None                0.3857143  -4.076709  4.84813708 0.9958617

  1. Overall, Cardio + Weights yielded the best sleep efficiency outcomes.
  2. The Cardio + Weights group had significantly better sleep efficiency than the None group (p = .005) and the Weights group (p = .009). It did not differ significantly from the Cardio group (p = .84).
  3. Surprisingly, neither the Weights nor the Cardio group differed significantly from the None group in sleep efficiency. The Cardio vs. None comparison came close but did not reach significance (p = .057).
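
When there are many pairwise comparisons, it can be easier to filter the Tukey table than to read it by eye. The sketch below does this on `PlantGrowth`, a dataset that ships with R and is used here only as a stand-in for the midterm data. `TukeyHSD()` returns one matrix per factor with the columns diff, lwr, upr, and p adj.

```r
# PlantGrowth ships with R; it stands in for the midterm data here
fit   <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(fit)$group               # matrix: diff, lwr, upr, p adj

# Keep only the pairs whose adjusted p-value is below .05
sig <- tukey[tukey[, "p adj"] < 0.05, , drop = FALSE]
sig
```

The same two lines of filtering should work on `TukeyHSD(anova_a)$exercise_group` or `TukeyHSD(anova_b)$exercise_group`.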

8. SYNTHESIS & RECOMMENDATION

The ANOVA for exercise group on change in sleep duration was significant, F(3, 82) = 15.72, p < .001, and accounted for about 37% of the variance (PRE = .37). Similarly, the ANOVA for exercise group on sleep efficiency was significant, F(3, 82) = 5.93, p = .001, and accounted for about 18% of the variance in sleep efficiency (PRE = .18). These results show that the exercise groups differed significantly on both sleep outcomes. Considering both outcomes, the cardio + weights regimen was the most beneficial overall: it was the only group that outperformed no exercise on both sleep duration and sleep efficiency. For sleep duration, every exercise group improved more than the no-exercise group; for sleep efficiency, neither cardio alone nor weights alone differed significantly from no exercise. If you are someone who can only take on one form of exercise (cardio OR weights), cardio may be slightly preferable to weights, particularly for increasing sleep duration (Tukey p = .028).
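
Very small p-values are conventionally reported as p < .001 rather than as an exact figure. A minimal sketch of a formatter for that convention follows; `format_p` is a hypothetical helper written for this report, not a function from any package, and the three inputs are p-values taken from the ANOVA and Tukey output above.

```r
# Hypothetical helper (not part of any package) for reporting p-values
format_p <- function(p) {
  if (p < 0.001) {
    "p < .001"
  } else {
    # Three decimals, with the leading zero dropped
    paste0("p = ", sub("^0", "", sprintf("%.3f", p)))
  }
}

format_p(3.67e-08)    # ANOVA A
format_p(0.00104)     # ANOVA B
format_p(0.0278779)   # Tukey, Weights vs. Cardio
```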

9. REFLECTION

The most challenging part was preparing the data for analysis. This included cleaning the unnecessary characters in the pre_sleep, sex, and exercise_group columns. It felt like there were many methods I could have gone with, but I felt that my solutions ended up being efficient and relatively simple. I felt most confident about the actual statistical tests. These are easy to run and interpret, and they require little-to-no creative problem-solving. In the future, I’d like to improve my familiarity with reporting my processes in R Markdown, since there is plenty of room to make this report cleaner and more aesthetic.