Course: PSYC 7750G – Reproducible Psychological Research
Instructor: Christian A. Martinez
1. Setup & Data Import
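Let’s load the packages we need and import the data. Several of these are not strictly required: dplyr, ggplot2, and stringr are already attached by tidyverse, and httr, jsonlite, and AICcmodavg are not used in this analysis.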
library(tidyverse)
library(readxl)
library(knitr)
library(dplyr)
library(ggplot2)
library(httr)
library(jsonlite)
library(supernova)
library(AICcmodavg)
library(mosaic)
library(stringr)
library(janitor)
participant_info <- read_excel("midterm_sleep_exercise.xlsx",
                               sheet = "participant_info_midterm")
sleep_data <- read_excel("midterm_sleep_exercise.xlsx",
                         sheet = "sleep_data_midterm")
head(participant_info)
## # A tibble: 6 × 4
## ID Exercise_Group Sex Age
## <chr> <chr> <chr> <dbl>
## 1 P001 NONE Male 35
## 2 P002 Nonee Malee 57
## 3 P003 None Female 26
## 4 P004 None Female 29
## 5 P005 None Male 33
## 6 P006 None Female 33
head(sleep_data)
## # A tibble: 6 × 4
## ID Pre_Sleep Post_Sleep Sleep_Efficiency
## <chr> <chr> <dbl> <dbl>
## 1 P001 zzz-5.8 4.7 81.6
## 2 P002 Sleep-6.6 7.4 75.7
## 3 P003 <NA> 6.2 82.9
## 4 P004 SLEEP-7.2 7.3 83.6
## 5 P005 score-7.4 7.4 83.5
## 6 P006 Sleep-6.6 7.1 88.5
2. MERGE & BASE CLEANING
Let’s standardize the column names to lowercase snake_case so they are easier to work with. We also need to clean the values themselves, since the sex and exercise_group columns contain many typos (for example, "Nonee" and "Malee"). Group membership follows the ID numbers in blocks of 25: P001-P025 are None, P026-P050 are Cardio, P051-P075 are Weights, and every ID above P075 is Cardio_and_Weights. For sex, any cell containing an "F" (in either case) is coded as Female; every other non-missing cell is coded as Male.
participant_info_clean <- participant_info %>%
  clean_names()
sleep_data_clean <- sleep_data %>%
  clean_names()
colnames(participant_info_clean)
## [1] "id" "exercise_group" "sex" "age"
colnames(sleep_data_clean)
## [1] "id" "pre_sleep" "post_sleep" "sleep_efficiency"
participant_info_clean <- participant_info_clean %>%
  mutate(
    id_num = parse_number(id),
    exercise_group = case_when(
      id_num <= 25 ~ "None",
      id_num >= 26 & id_num <= 50 ~ "Cardio",
      id_num >= 51 & id_num <= 75 ~ "Weights",
      TRUE ~ "Cardio_and_Weights"
    ),
    sex = ifelse(
      str_detect(str_to_upper(sex), "F"),
      "Female",
      "Male"
    )
  )
unique(participant_info_clean$exercise_group)
## [1] "None" "Cardio" "Weights"
## [4] "Cardio_and_Weights"
unique(participant_info_clean$sex)
## [1] "Male" "Female"
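As a quick check on the recoding rules described above, here is a minimal base R sketch run on a few made-up values (the `raw` data frame and its contents are hypothetical, not taken from the study file). It uses `cut()` for the 25-wide ID blocks instead of `case_when()`, and it handles missing sex values explicitly, because base `grepl()` returns `FALSE` for `NA` where `str_detect()` returns `NA`.

```r
# Toy values (hypothetical, not from the study file)
raw <- data.frame(
  id  = c("P001", "P026", "P051", "P076"),
  sex = c("Malee", "FEMALE", " f", NA),
  stringsAsFactors = FALSE
)

# Numeric part of the ID, e.g. "P026" -> 26
raw$id_num <- as.numeric(sub("^P", "", raw$id))

# cut() with the default right = TRUE builds (0,25], (25,50], (50,75], (75,Inf]
raw$exercise_group <- as.character(cut(
  raw$id_num,
  breaks = c(0, 25, 50, 75, Inf),
  labels = c("None", "Cardio", "Weights", "Cardio_and_Weights")
))

# Any "F" (case-insensitive) -> Female, otherwise Male; missing stays missing
raw$sex_clean <- ifelse(
  is.na(raw$sex),
  NA_character_,
  ifelse(grepl("F", toupper(raw$sex)), "Female", "Male")
)

raw[, c("id", "exercise_group", "sex", "sex_clean")]
```

Keeping `NA` as `NA` matters here: without the explicit `is.na()` branch, a participant with no recorded sex would silently be counted as Male.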
midterm_data <- participant_info_clean %>%
  left_join(sleep_data_clean, by = "id")
head(midterm_data)
## # A tibble: 6 × 8
## id exercise_group sex age id_num pre_sleep post_sleep sleep_efficiency
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 P001 None Male 35 1 zzz-5.8 4.7 81.6
## 2 P002 None Male 57 2 Sleep-6.6 7.4 75.7
## 3 P003 None Female 26 3 <NA> 6.2 82.9
## 4 P004 None Female 29 4 SLEEP-7.2 7.3 83.6
## 5 P005 None Male 33 5 score-7.4 7.4 83.5
## 6 P006 None Female 33 6 Sleep-6.6 7.1 88.5
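Before trusting a join like the one above, it can help to confirm that the key is unique and that every participant has a match. The sketch below shows one way to do that in base R, using two small hypothetical tables (`participants` and `sleep`) as stand-ins for participant_info_clean and sleep_data_clean; `merge(..., all.x = TRUE)` plays the role of `left_join()`.

```r
# Hypothetical stand-ins for participant_info_clean and sleep_data_clean
participants <- data.frame(
  id  = c("P001", "P002", "P003"),
  age = c(35, 57, 26),
  stringsAsFactors = FALSE
)
sleep <- data.frame(
  id         = c("P001", "P002", "P002"),
  post_sleep = c(4.7, 7.4, 7.5),
  stringsAsFactors = FALSE
)

# Repeated keys in the right-hand table duplicate rows after the join
dup_ids <- unique(sleep$id[duplicated(sleep$id)])

# Participants with no sleep record end up with NA in the sleep columns
unmatched <- setdiff(participants$id, sleep$id)

# Base R equivalent of a left join on id
merged <- merge(participants, sleep, by = "id", all.x = TRUE)

dup_ids
unmatched
nrow(merged)
```

If `dup_ids` and `unmatched` are both empty, the merged table should have exactly as many rows as the participant table.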
3. CREATE DERIVED VARIABLES
The two datasets are now merged, but pre_sleep still needs work: each value carries a text prefix such as "zzz-" or "Sleep-", so the column was read in as character. parse_number() strips the text, but it treats the hyphen immediately before the number as a minus sign, which is why abs() follows to recover the positive hours. With pre_sleep numeric, we can create sleep_difference (post_sleep minus pre_sleep). Some participants were missing pre- or post-intervention sleep data, so we drop those rows from the final cleaned dataset. We were also asked to split participants into two age groups: under 40, and 40 or older (labeled ">40" in the code, which includes participants who are exactly 40).
midterm_data_clean <- midterm_data %>%
  mutate(
    pre_sleep = abs(parse_number(pre_sleep)),
    sleep_difference = post_sleep - pre_sleep,
    agegroup2 = case_when(
      age < 40 ~ "<40",
      age >= 40 ~ ">40"
    )
  )
sum(is.na(midterm_data_clean$sleep_difference))
## [1] 14
midterm_data_clean <- midterm_data_clean %>%
  filter(!is.na(sleep_difference))
glimpse(midterm_data_clean)
## Rows: 86
## Columns: 10
## $ id <chr> "P001", "P002", "P004", "P005", "P006", "P007", "P008…
## $ exercise_group <chr> "None", "None", "None", "None", "None", "None", "None…
## $ sex <chr> "Male", "Male", "Female", "Male", "Female", "Male", "…
## $ age <dbl> 35, 57, 29, 33, 33, 32, 30, 37, 28, 30, 20, 42, 33, 2…
## $ id_num <dbl> 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 1…
## $ pre_sleep <dbl> 5.8, 6.6, 7.2, 7.4, 6.6, 6.0, 8.1, 5.5, 5.7, 7.0, 5.5…
## $ post_sleep <dbl> 4.7, 7.4, 7.3, 7.4, 7.1, 6.7, 9.0, 5.1, 6.3, 6.2, 4.6…
## $ sleep_efficiency <dbl> 81.6, 75.7, 83.6, 83.5, 88.5, 83.6, 73.4, 88.2, 80.4,…
## $ sleep_difference <dbl> -1.1, 0.8, 0.1, 0.0, 0.5, 0.7, 0.9, -0.4, 0.6, -0.8, …
## $ agegroup2 <chr> "<40", ">40", "<40", "<40", "<40", "<40", "<40", "<40…
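The derived variables above can also be built in base R. The sketch below is an alternative, not the method used in this report: it strips everything before the first digit, so the hyphen never becomes a minus sign and no `abs()` is needed, and it uses `cut()` with `right = FALSE` so that the age label states the boundary explicitly. The prefixes match those shown in head(sleep_data); the ages and post_sleep values are hypothetical, and the regex assumes the number is the last thing in each cell.

```r
# Prefix formats as shown in head(sleep_data); NA marks a missing entry
pre_raw <- c("zzz-5.8", "Sleep-6.6", NA, "SLEEP-7.2")

# Drop every leading non-digit character, then convert to numeric
pre_sleep <- as.numeric(sub("^[^0-9]*", "", pre_raw))

# Hypothetical post-intervention values for the same four rows
post_sleep <- c(4.7, 7.4, 6.2, 7.3)
sleep_difference <- post_sleep - pre_sleep

# Hypothetical ages; right = FALSE gives [-Inf, 40) and [40, Inf)
age <- c(35, 57, 40, 39)
age_group <- cut(age, breaks = c(-Inf, 40, Inf),
                 labels = c("<40", ">=40"), right = FALSE)

data.frame(pre_raw, pre_sleep, sleep_difference, age, age_group)
sum(is.na(sleep_difference))  # rows that would be dropped
```

Labeling the upper group ">=40" avoids any ambiguity about where a 40-year-old belongs.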
4. DESCRIPTIVE STATISTICS
Let’s look at the descriptive statistics.
favstats(midterm_data_clean$sleep_difference)
## min Q1 median Q3 max mean sd n missing
## -1.1 0.3 0.75 1.1 2.1 0.6825581 0.6610494 86 0
favstats(midterm_data_clean$sleep_efficiency)
## min Q1 median Q3 max mean sd n missing
## 71.7 79.975 83.3 88.425 101.5 83.77558 5.973804 86 0
favstats(sleep_difference ~ exercise_group, data = midterm_data_clean) %>%
  kable(digits = 2, caption = "Descriptive Statistics for Sleep Difference by Exercise Group")
| exercise_group | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Cardio | 0.3 | 0.70 | 1.2 | 1.4 | 2.1 | 1.14 | 0.49 | 21 | 0 |
| Cardio_and_Weights | -0.1 | 0.65 | 0.9 | 1.1 | 1.5 | 0.86 | 0.38 | 23 | 0 |
| None | -1.1 | -0.40 | 0.1 | 0.6 | 0.9 | 0.05 | 0.64 | 21 | 0 |
| Weights | -0.7 | 0.30 | 0.5 | 1.1 | 1.8 | 0.67 | 0.61 | 21 | 0 |
favstats(sleep_efficiency ~ exercise_group, data = midterm_data_clean) %>%
  kable(digits = 2, caption = "Descriptive Statistics for Sleep Efficiency by Exercise Group")
| exercise_group | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Cardio | 75.9 | 81.3 | 85.5 | 88.0 | 101.5 | 85.45 | 5.99 | 21 | 0 |
| Cardio_and_Weights | 74.5 | 83.5 | 88.7 | 90.5 | 96.3 | 86.83 | 5.98 | 23 | 0 |
| None | 71.7 | 76.6 | 81.5 | 83.6 | 90.4 | 81.07 | 5.55 | 21 | 0 |
| Weights | 74.8 | 77.9 | 80.8 | 83.6 | 89.5 | 81.46 | 4.31 | 21 | 0 |
5. VISUALIZATIONS
The plots below should give us a clear idea of which groups are showing the best sleep outcomes.
ggplot(midterm_data_clean, aes(x = exercise_group, y = sleep_difference, fill = exercise_group)) +
  geom_boxplot() +
  labs(
    title = "Sleep Difference by Exercise Group",
    x = "Exercise Group",
    y = "Change in Sleep (hours)"
  ) + theme_minimal()
ggplot(midterm_data_clean, aes(x = exercise_group, y = sleep_efficiency, fill = exercise_group)) +
  geom_boxplot() +
  labs(
    title = "Sleep Efficiency by Exercise Group",
    x = "Exercise Group",
    y = "Sleep Efficiency (%)"
  ) + theme_minimal()
ggplot(midterm_data_clean, aes(x = sleep_difference, y = sleep_efficiency)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Sleep Difference (hours)",
    y = "Sleep Efficiency (%)"
  ) + theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
6. T-TESTS
Let’s see if there are any significant differences in sleep difference based on sex or age group.
t_test_sex <- t.test(sleep_difference ~ sex, data = midterm_data_clean)
t_test_sex
##
## Welch Two Sample t-test
##
## data: sleep_difference by sex
## t = 1.5801, df = 77.647, p-value = 0.1182
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -0.05865017 0.50972574
## sample estimates:
## mean in group Female mean in group Male
## 0.7795918 0.5540541
Female mean sleep difference = 0.78 hours; male mean sleep difference = 0.55 hours. Welch’s t(77.65) = 1.58, p = .12, 95% CI [-0.06, 0.51]; the difference between means is not significant.
t_test_agegroup2 <- t.test(sleep_difference ~ agegroup2, data = midterm_data_clean)
t_test_agegroup2
##
## Welch Two Sample t-test
##
## data: sleep_difference by agegroup2
## t = -1.3746, df = 36.662, p-value = 0.1776
## alternative hypothesis: true difference in means between group <40 and group >40 is not equal to 0
## 95 percent confidence interval:
## -0.50676303 0.09717936
## sample estimates:
## mean in group <40 mean in group >40
## 0.6373134 0.8421053
Participants under 40 had a mean sleep difference of 0.64 hours; participants 40 and older had a mean of 0.84 hours. Welch’s t(36.66) = -1.37, p = .18, 95% CI [-0.51, 0.10]; the difference between means is not significant.
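The fractional degrees of freedom in both tests above come from the Welch correction, which `t.test()` applies by default. As a rough illustration, the sketch below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `t.test()`. The vectors `x` and `y` are hypothetical values, not the study data.

```r
# Hypothetical sleep-difference values for two groups
x <- c(0.8, 1.1, 0.5, 0.9, 1.4, 0.3)
y <- c(0.2, 0.7, -0.1, 0.6, 0.4)

# Squared standard error of each group mean
v_x <- var(x) / length(x)
v_y <- var(y) / length(y)

# Welch t: difference in means over the unpooled standard error
t_by_hand <- (mean(x) - mean(y)) / sqrt(v_x + v_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_by_hand <- (v_x + v_y)^2 /
  (v_x^2 / (length(x) - 1) + v_y^2 / (length(y) - 1))

# t.test() uses var.equal = FALSE by default, i.e. the Welch test
fit <- t.test(x, y)

c(by_hand = t_by_hand, t_test = unname(fit$statistic))
c(by_hand = df_by_hand, t_test = unname(fit$parameter))
```

Because the variances are not pooled, the degrees of freedom are generally not a whole number, which matches values such as 77.65 and 36.66 in the output above.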
7. ANOVAS + POST-HOCS
# TASK:
# - ANOVA A: Sleep_Difference ~ Exercise_Group
# - ANOVA B: Sleep_Efficiency ~ Exercise_Group
# - For each, report F, df, p, and an effect-size comment, and PRE via supernova
# - HINT: Run supernova() after each ANOVA to obtain PRE (effect size).
# - Run TukeyHSD() on each ANOVA and interpret pairwise differences
#   (Which groups are significantly different? Who “wins”?)
anova_a <- aov(sleep_difference ~ exercise_group, data = midterm_data_clean)
summary(anova_a)
## Df Sum Sq Mean Sq F value Pr(>F)
## exercise_group 3 13.56 4.520 15.72 3.67e-08 ***
## Residuals 82 23.58 0.288
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_a)
## Analysis of Variance Table (Type III SS)
## Model: sleep_difference ~ exercise_group
##
## SS df MS F PRE p
## ----- --------------- | ------ -- ----- ------ ----- -----
## Model (error reduced) | 13.560 3 4.520 15.717 .3651 .0000
## Error (from model) | 23.583 82 0.288
## ----- --------------- | ------ -- ----- ------ ----- -----
## Total (empty model) | 37.144 85 0.437
TukeyHSD(anova_a)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = sleep_difference ~ exercise_group, data = midterm_data_clean)
##
## $exercise_group
## diff lwr upr p adj
## Cardio_and_Weights-Cardio -0.2772257 -0.7017134 0.14726203 0.3237562
## None-Cardio -1.0904762 -1.5245041 -0.65644825 0.0000000
## Weights-Cardio -0.4714286 -0.9054565 -0.03740063 0.0278779
## None-Cardio_and_Weights -0.8132505 -1.2377382 -0.38876282 0.0000171
## Weights-Cardio_and_Weights -0.1942029 -0.6186906 0.23028480 0.6287294
## Weights-None 0.6190476 0.1850197 1.05307556 0.0018927
anova_b <- aov(sleep_efficiency ~ exercise_group, data = midterm_data_clean)
summary(anova_b)
## Df Sum Sq Mean Sq F value Pr(>F)
## exercise_group 3 540.4 180.1 5.925 0.00104 **
## Residuals 82 2492.9 30.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
supernova(anova_b)
## Analysis of Variance Table (Type III SS)
## Model: sleep_efficiency ~ exercise_group
##
## SS df MS F PRE p
## ----- --------------- | -------- -- ------- ----- ----- -----
## Model (error reduced) | 540.400 3 180.133 5.925 .1782 .0010
## Error (from model) | 2492.939 82 30.402
## ----- --------------- | -------- -- ------- ----- ----- -----
## Total (empty model) | 3033.339 85 35.686
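PRE (proportional reduction in error) is the model sum of squares divided by the total sum of squares, which for a one-way ANOVA is the same quantity as R-squared. The short sketch below recomputes it from the sums of squares printed in the two supernova() tables above; the data frame `ss` is just a convenient container introduced for this illustration.

```r
# Sums of squares copied from the two supernova() tables
ss <- data.frame(
  outcome  = c("sleep_difference", "sleep_efficiency"),
  ss_model = c(13.560, 540.400),
  ss_total = c(37.144, 3033.339),
  stringsAsFactors = FALSE
)

# PRE = SS model / SS total
ss$pre <- ss$ss_model / ss$ss_total

round(ss$pre, 4)  # 0.3651 0.1782, matching the PRE column in each table
```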
TukeyHSD(anova_b)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = sleep_efficiency ~ exercise_group, data = midterm_data_clean)
##
## $exercise_group
## diff lwr upr p adj
## Cardio_and_Weights-Cardio 1.3871636 -2.977172 5.75149915 0.8383629
## None-Cardio -4.3761905 -8.838613 0.08623232 0.0566544
## Weights-Cardio -3.9904762 -8.452899 0.47194661 0.0962888
## None-Cardio_and_Weights -5.7633540 -10.127690 -1.39901844 0.0046379
## Weights-Cardio_and_Weights -5.3776398 -9.741975 -1.01330416 0.0094267
## Weights-None 0.3857143 -4.076709 4.84813708 0.9958617
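With six pairwise comparisons per outcome, it can be easier to pull the Tukey results into a data frame and filter on the adjusted p-value than to read them off the printout. The sketch below shows one way to do this. It uses R's built-in `PlantGrowth` dataset as a stand-in, since the study file is not bundled here; the names `fit`, `tk`, `tukey_df`, and `significant` are illustrative.

```r
# Built-in dataset used as a stand-in for midterm_data_clean
fit <- aov(weight ~ group, data = PlantGrowth)

# TukeyHSD() returns one matrix per factor: columns diff, lwr, upr, "p adj"
tk <- TukeyHSD(fit)$group

# Move the comparison labels out of the row names into a regular column
tukey_df <- data.frame(
  pair  = rownames(tk),
  diff  = tk[, "diff"],
  lwr   = tk[, "lwr"],
  upr   = tk[, "upr"],
  p_adj = tk[, "p adj"],
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Keep only the pairs whose adjusted p-value is below .05
significant <- tukey_df[tukey_df$p_adj < 0.05, ]
significant
```

Applied to anova_a or anova_b, the same steps would use `$exercise_group` in place of `$group`.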
8. SYNTHESIS & RECOMMENDATION
The ANOVA for exercise group on sleep difference (post minus pre) was significant, F(3, 82) = 15.72, p < .001, and exercise group accounted for about 37% of the variance in sleep difference (PRE = .37). The ANOVA for exercise group on sleep efficiency was also significant, F(3, 82) = 5.93, p = .001, with exercise group accounting for about 18% of the variance (PRE = .18). Both sleep outcomes therefore differed significantly across exercise groups. For sleep difference, all three exercise groups improved significantly more than the None group (Tukey p < .01 for each comparison). For sleep efficiency, only the cardio + weights group scored significantly higher than the None group (Tukey p = .005), and it also outscored the weights group (p = .009); cardio alone fell just short of significance against None (p = .057). Considering both outcomes, the cardio + weights regimen was the most beneficial overall, since it was the only group that beat no exercise on both measures. If you can only take on one form of exercise (cardio OR weights), cardio may be preferable to weights, particularly for gains in hours of sleep (Tukey p = .028).
9. REFLECTION
The most challenging part was preparing the data for analysis. This included cleaning the unnecessary characters in the pre_sleep, sex, and exercise_group columns. It felt like there were many methods I could have gone with, but I felt that my solutions ended up being efficient and relatively simple. I felt most confident about the actual statistical tests. These are easy to run and interpret, and they require little-to-no creative problem-solving. In the future, I’d like to improve my familiarity with reporting my processes in R Markdown, since there is plenty of room to make this report cleaner and more aesthetic.