Setting up…
library(tidyverse)
d <- read_csv("loewen_et_al_2020_babbel.csv")
This is (most of) the data from Loewen et al. (2020). There are two time points: pre and post.
Let’s set a 0 point first, as our Time variable is text right now.
d <- d %>% mutate(time01 = if_else(Time == "Pre", 0, 1))
We will consider a couple of different ways of analyzing these data. We’ll focus on vocabulary learning.
Let’s start with some visualization.
d %>% ggplot(aes(x = Vocab_score))+
geom_histogram(binwidth = 5)+
facet_wrap(~time01, nrow = 1)
It definitely looks like there is an increase in scores over time.
We could also look at something like a boxplot:
d %>% ggplot(aes(y = Vocab_score, x = time01, group = time01))+
geom_boxplot()+
theme_bw()
But the best way might be to show individual changes in a line plot:
d %>% ggplot(aes(y = Vocab_score, x = time01, group = StudyID))+
geom_line()+
geom_point()+
stat_summary(aes(group = NULL), fun = mean, geom = "point", color = "blue", size = 6)+
stat_summary(aes(group = NULL), fun = mean, geom = "line", color = "blue", size = 3)+
scale_x_continuous(breaks = c(0, 1), limits = c(-.5, 1.5))+
theme_bw()
Here you can see that there was an overall increase in the average score of the group, but you can also see that there is some pretty substantial variation in change over time.
For ‘simpler’ statistical approaches to analyzing the difference in vocabulary scores, we’ll actually need to make the data wide first.
w <- d %>% select(-time01) %>%
pivot_wider(names_from = Time, names_sep = "_", values_from = Rating_num:Grammar_score)
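It’s worth a quick sanity check that the reshape worked as intended; a minimal sketch (exact column names will depend on your data):
# expect one row per participant, with _Pre and _Post columns for each score
nrow(w)    # should equal the number of participants (54)
# names(w) # e.g., Vocab_score_Pre, Vocab_score_Post, ...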
The paired t-test is quite simple to run:
vocab_t <- t.test(w$Vocab_score_Post, w$Vocab_score_Pre, paired = TRUE, alternative = "two.sided")
vocab_t
Paired t-test
data: w$Vocab_score_Post and w$Vocab_score_Pre
t = 7.4383, df = 53, p-value = 8.892e-10
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
4.923095 8.558387
sample estimates:
mean difference
6.740741
We can very easily calculate Cohen’s d for this within-group comparison:
mean(w$Vocab_score_Post - w$Vocab_score_Pre) / sd(w$Vocab_score_Post - w$Vocab_score_Pre)
[1] 1.012226
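As a cross-check: because the paired t statistic is mean(diff) / (sd(diff) / sqrt(n)), the same d can be recovered straight from the t-test object (here n is the number of paired cases, 54):
# d = t / sqrt(n) for a paired t-test; should match the value above (~1.01)
unname(vocab_t$statistic) / sqrt(nrow(w))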
The speaking scores (Rating_num) came from ACTFL OPIc tests. Technically, these scores are ordinal (1 = Novice Low, etc.), and ordinal data do not meet the assumptions of t-tests. We can instead analyze these pre-post data with a Wilcoxon signed-rank test, which can be thought of as a test of medians rather than means.
wilcox.test(x = w$Rating_num_Post,
y = w$Rating_num_Pre,
alternative = "two.sided",
mu = 0,
paired = TRUE,
exact = FALSE,
correct = FALSE)
Wilcoxon signed rank test
data: w$Rating_num_Post and w$Rating_num_Pre
V = 547, p-value = 3.227e-07
alternative hypothesis: true location shift is not equal to 0
It probably makes more sense to do a one-sided test in situations like this, as we certainly don’t expect a decline in proficiency after instruction.
wilcox.test(x = w$Rating_num_Post,
y = w$Rating_num_Pre,
alternative = "greater",
paired = TRUE,
exact = FALSE,
correct = FALSE)
Wilcoxon signed rank test
data: w$Rating_num_Post and w$Rating_num_Pre
V = 547, p-value = 1.614e-07
alternative hypothesis: true location shift is greater than 0
We can also calculate standardized effect sizes for Wilcoxon tests.
qnorm(1 - .0000001614) # z value; use 1 - p for a one-sided test
[1] 5.109624
# if the test is two-sided, divide p by 2 first
Now convert z to r (r = z / sqrt(N)):
5.109624/sqrt(54)
[1] 0.6953318
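The same calculation can also be pulled from the stored test object rather than retyping the printed p value. A small sketch (the object name rating_w is just for illustration):
rating_w <- wilcox.test(w$Rating_num_Post, w$Rating_num_Pre,
                        alternative = "greater", paired = TRUE,
                        exact = FALSE, correct = FALSE)
z <- qnorm(1 - rating_w$p.value) # one-sided p, so no halving needed
z / sqrt(nrow(w))                # r = z / sqrt(N), with N = 54 paired cases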
We’ll now consider a between-subjects comparison (Sex) alongside the within-subjects time effect.
time_sex <- aov(Vocab_score ~ time01*Sex + Error(StudyID/time01), data=d)
summary(time_sex)
Error: StudyID
Df Sum Sq Mean Sq F value Pr(>F)
Sex 1 389 388.9 1.81 0.184
Residuals 52 11172 214.8
Error: StudyID:time01
Df Sum Sq Mean Sq F value Pr(>F)
time01 1 1226.8 1226.8 54.394 1.25e-09 ***
time01:Sex 1 2.4 2.4 0.104 0.748
Residuals 52 1172.8 22.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And the post-hocs:
library(emmeans)
time_sex_emm <- emmeans(time_sex, ~ time01*Sex)
time_sex_emm
time01 Sex emmean SE df lower.CL upper.CL
0 F 14.2 2.07 52.0 10.1 18.4
1 F 20.8 2.35 79.3 16.1 25.4
0 M 18.3 2.07 52.0 14.1 22.5
1 M 25.5 2.64 98.5 20.2 30.7
Warning: EMMs are biased unless design is perfectly balanced
Confidence level used: 0.95
pairs(time_sex_emm)
contrast estimate SE df t.ratio p.value
time010 F - time011 F -6.54 1.10 52.0 -5.924 <.0001
time010 F - time010 M -4.09 3.04 52.0 -1.345 0.5388
time010 F - time011 M -11.26 3.45 79.6 -3.268 0.0085
time011 F - time010 M 2.45 3.23 65.5 0.760 0.8722
time011 F - time011 M -4.72 3.62 89.1 -1.305 0.5623
time010 M - time011 M -7.18 1.63 52.0 -4.406 0.0003
P value adjustment: tukey method for comparing a family of 4 estimates
Now we’ll include a very interesting and important predictor: how much time people actually spent using Babbel. This would, in theory, account for why some people demonstrated larger or smaller gains over the course of the study.
Packages:
library(lme4)
library(lmerTest)
library(MuMIn)
Model:
vocab <- lmer(Vocab_score ~ time01 + time01:hrs_tot + (time01||StudyID),
data = d, control=lmerControl(optimizer="bobyqa"))
summary(vocab)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Vocab_score ~ time01 + time01:hrs_tot + (time01 || StudyID)
Data: d
Control: lmerControl(optimizer = "bobyqa")
REML criterion at convergence: 740.3
Scaled residuals:
Min 1Q Median 3Q Max
-1.40039 -0.34333 -0.05482 0.37773 1.68996
Random effects:
Groups Name Variance Std.Dev.
StudyID (Intercept) 88.045 9.383
StudyID.1 time01 9.026 3.004
Residual 12.810 3.579
Number of obs: 108, groups: StudyID, 54
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 12.8889 1.3666 53.0000 9.431 6.26e-13 ***
time01 1.0204 1.4928 52.9963 0.684 0.497
time01:hrs_tot 0.4928 0.1085 52.0000 4.541 3.35e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) time01
time01 -0.116
tm01:hrs_tt 0.000 -0.844
You will often find that you need to uncorrelate the random effects (intercept and slope) when dealing with simple pre-post data. This is because there are only a limited number of observations per person when scores are aggregated like they were here (i.e., using a total vocabulary score instead of item-level scores).
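If you’re curious, you can try fitting the correlated version yourself; a sketch (whether it actually comes back singular will depend on the data):
# correlated intercept and slope; with only two observations per person this
# often produces a singular fit (correlation estimated at the +/-1 boundary)
vocab_corr <- lmer(Vocab_score ~ time01 + time01:hrs_tot + (time01 | StudyID),
                   data = d, control = lmerControl(optimizer = "bobyqa"))
isSingular(vocab_corr) # TRUE would suggest the random-effects structure is too rich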
Let’s look at r-squared:
MuMIn::r.squaredGLMM(vocab)
R2m R2c
[1,] 0.1447169 0.8960168
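Here R2m (marginal) is the proportion of variance explained by the fixed effects alone, while R2c (conditional) includes the random effects as well.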
And compare it to a null (intercept- and random-effects-only) model:
vocab.null <- lmer(Vocab_score ~ 1 + (time01||StudyID),
                   data = d, control = lmerControl(optimizer = "bobyqa"))
anova(vocab.null, vocab, refit = F)
Data: d
Models:
vocab.null: Vocab_score ~ 1 + (time01 || StudyID)
vocab: Vocab_score ~ time01 + time01:hrs_tot + (time01 || StudyID)
npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
vocab.null 4 802.43 813.16 -397.22 794.43
vocab 6 752.27 768.36 -370.13 740.27 54.165 2 1.731e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now for some fancy plotting. One neat way of showing the interaction is to break the sample into ‘chunks’ of the continuous predictor (study time, hrs_tot).
#add preds to df
d$Vocab.pred <- predict(vocab)
#effects & visuals
vocab.outfit <- ggplot(d, aes(time01, Vocab_score))+
geom_line(aes(group = StudyID), alpha = .4)+
stat_summary(aes(y = Vocab.pred),
fun.data = mean_se, geom = "ribbon", color = NA, fill = "blue", alpha = .2)+
stat_summary(aes(y = Vocab.pred), fun = mean,
geom = "line", size = 1.5, color = "blue") +
scale_y_continuous(limits = c(-1.5, 60), breaks = c(0, 10, 20, 30, 40, 50, 60))+
scale_x_continuous(breaks = c(0, 1), labels = c("Pretest","Posttest"))+
labs(x = "Time", y = "Vocabulary Score")+
theme_bw(base_size = 18) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
facet_grid(~ cut(hrs_tot, breaks = c(0, 5, 10, 15, 40),
labels = c("< 5 hours", "5-10 hours", "10-15 hours", "15+ hours")))
vocab.outfit