This is some analyses playing about with a data set that was re-analysed in a recent preprint by Don van den Bergh et al. The purpose of the preprint was to acknowledge that the JASP software had been set to default to a model that was wrong (at least for this data). What caught my eye was that it was being framed as “you have to redo your Bayesian repeated measures analyses”, and the paper itself was setting it up as a contrast between Bayesian and frequentist analyses. This was even though, as they acknowledge, the difference was in the models.
The data comes from an unpublished experiment on the Stroop effect. In essence, There are 3 treatments: congruent, neutral and incongruent, and the hypothesis is that the congruent should have the smallest respionse. There is also a hypothesis that the previous trial (whether it is neutral or a “proper” trial) interacts with this. Thus, we have a 2x3 design. This is repeated for 19 individuals, so they provide replication. Thus, this is a repeated designs study. The data are recorded as the means for each of the 2x3 levels for each individual (each level is also tested repeatedly, and only the means reported). I should express my thanks to the experimenters (Hershman, Dadon, Kisel & Henik) for making it available, and also the authors of the preprint for putting it online as part of the JASP package.
We can read in the data from GitHub, and then reformat it. I also scale the data, to make things easier later. In particular, default priors in the Bayesian analysis don’t like the unscaled data.
Data <- read.csv("https://raw.githubusercontent.com/jasp-stats/jasp-desktop/stable/Resources/Data%20Sets/Data%20Library/3.%20ANOVA/Stroop.csv")
library(tidyr)
library(knitr) # for nice tables
library(INLA) # for the Bayesian analysis later
library(lme4) # for the REML analysis later
Stroop <- gather(Data, key= "Treatment",
value="ResponseTime", names(Data))
Stroop$Indiv <- factor(1:nrow(Data))
Stroop$PT <- factor(gsub(".*\\.", "", Stroop$Treatment))
Stroop$Congruency <- factor(gsub("\\..*", "", Stroop$Treatment))
Stroop$ResponseTimeUnscaled <- Stroop$ResponseTime
Stroop$ResponseTime <- scale(Stroop$ResponseTime)
First, we can plot the data:
par(mar=c(4,8,1,1))
plot(1,1, ylim=c(1,6), xlim=range(Data), ann=FALSE,
yaxt="n")
apply(Data, 1, function(x) lines(x,1:6))
axis(2, at=1:6, labels=gsub("\\.", " ", names(Data)), las=1)
Eyeballing this suggests that individuals are either slower after a Stroop test or a break, but rarely both. However, there also seems to be some positive skew in the data, which might accentuate the issue. We will be naughty and ignore the skew, but it should probably be looked at (there is an excellent Exegesis by Bill Venables that contains some good advice on this: the best transformation might even be the same one). In the end it may not affect the main conclusions, but one would have to check.
We can start with the classical approach, that I learned 30 years ago, and appears to be the approach used in the preprint. For this data set it works, because the data are balanced and there are no problems with negative variance components.
The main interest is in the Congruency*PT terms. The
individuals are there because variation between individuals might be
important (replication and all that). We can start by looking at the
simple (and wrong!) model:
mod.simple <- lm(ResponseTime ~ Congruency*PT, data=Stroop)
summary(mod.simple)
##
## Call:
## lm(formula = ResponseTime ~ Congruency * PT, data = Stroop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4736 -0.7762 -0.1163 0.4450 2.7693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.024598 0.223595 0.110 0.9126
## CongruencyIncongruent 0.538382 0.316212 1.703 0.0915 .
## CongruencyNeutral 0.095626 0.316212 0.302 0.7629
## PTStroop -0.395219 0.316212 -1.250 0.2141
## CongruencyIncongruent:PTStroop -0.232157 0.447191 -0.519 0.6047
## CongruencyNeutral:PTStroop 0.002211 0.447191 0.005 0.9961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9746 on 108 degrees of freedom
## Multiple R-squared: 0.09213, Adjusted R-squared: 0.0501
## F-statistic: 2.192 on 5 and 108 DF, p-value: 0.06034
We can see that the contract between Congruent and Incongruent is the largest effect, with Incongruent images taking longer on average. The contrast between Neutral and Stroop prior images is the next largest, with neutral images (on average) taking longer. But these averages are modulated by the interaction, and there is also a lot of uncertainty.
Notice that the \(R^2\) is only 9%. The hope is that a lot of the unexplained variation is due to individuals. Although this won’t change the fact that only a small amount of the data is explained by the treatment.
The design is balanced which is nice for several reasons:
So we can look at the ANOVA table for the saturated model:
mod.saturated <- lm(ResponseTime ~ Congruency*PT*Indiv, data=Stroop)
kable(an.sat <- anova(mod.saturated), digits=3)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Congruency | 2 | 3.720 | 1.860 | NaN | NaN |
| PT | 1 | 6.346 | 6.346 | NaN | NaN |
| Indiv | 18 | 45.930 | 2.552 | NaN | NaN |
| Congruency:PT | 2 | 0.345 | 0.172 | NaN | NaN |
| Congruency:Indiv | 36 | 3.022 | 0.084 | NaN | NaN |
| PT:Indiv | 18 | 50.936 | 2.830 | NaN | NaN |
| Congruency:PT:Indiv | 36 | 2.701 | 0.075 | NaN | NaN |
| Residuals | 0 | 0.000 | NaN | NA | NA |
The residual variance is 0, because the
Congruency*PT*Indiv effect is the residual. This makes the
tests nonsense, but that’s OK because (a) we can construct our own
tests, and (b) our main interest here is the mean squares. The three
main effects have a lot of variance, as well as the
PT:Indiv interaction. This reflects the pattern we can see
in the data of individuals being slow in one or another PT level.
The tests used in the original preprint are all also based on the
mean squares, and calculating the correct F ratio from these. The only
random factor is Individual, and this affects the variation in fixed
effects. For example, the Congruency effect can vary between
individuals, so the sum of squares is made up of contributions from
Congruency, Congruency:Individual, and
Congruency:PT:Individual (the last one is the residual, of
course). The Congruency:Individual sums of squares has
Congruency:Individual and
Congruency:PT:Individual contributions. So if there is no
Congruency effect, the mean squares should be the same: we can use an F
test to test this. Similarly, the test for the PT effect is the
PT mean square divided by the PT:Individual
mean square.
The F ratio for the Congruency effect is 1.86/0.084 = 22.158, which is exactly what the preprint gives. The other tests are also the same as in the preprint: working that out is an exercise for the user.
In contract, the Bayesian method used in the preprint only has Individual as a random effect, so the model is
mod.B <- lm(ResponseTime ~ Congruency*PT+Indiv, data=Stroop)
kable(an.B <- anova(mod.B), digits=3)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Congruency | 2 | 3.720 | 1.860 | 2.955 | 0.057 |
| PT | 1 | 6.346 | 6.346 | 10.080 | 0.002 |
| Indiv | 18 | 45.930 | 2.552 | 4.053 | 0.000 |
| Congruency:PT | 2 | 0.345 | 0.172 | 0.274 | 0.761 |
| Residuals | 90 | 56.659 | 0.630 | NA | NA |
Comparing to the saturated model, we see that the Residuals row is the only one that is different, and the degrees of freedom and sums of squares are the sums of the omitted terms. The F tests are now valid, in the sense that they should compare the mean squares to the residual mean square. They suggest there is clear evidence for an effect of PT, as well as Individual. And weak evidence of a Congruency main effect.
But notice that the mean squares are the same (as are the effect sizes, if you want to check). So the change in our conclusions are not due to changes in the size of any effect, it’s due to changes in the variation we would expect to see: it is the denominator of the F ratio that is changing.
So far we have shown that the behaviour seen in the preprint can be replicated using a traditional analysis. But statistics has moved on, and the more modern approach is to explicitly specify and fit a random effects model. There are all sorts of advantage to doing this (e.g. it is easier to work with unbalanced data). For this data the models will actually be the same, but it is easier to see what is going on.
Here I will use a Bayesian model, partly to show that we can replicate the same behaviour. The code for the non-Bayesian modelling is below: the results are almost the same. I will be naughty and use the default priors, but there is code in the appendix to use more informative priors.
The first model is the “Bayesian” model, which only has an Individual random effect. First we create a couple of new variables.
Stroop$ConInd <- factor(paste0(Stroop$Congruency, Stroop$Indiv))
Stroop$PTInd <- factor(paste0(Stroop$PT, Stroop$Indiv))
inla.r1 <-inla(ResponseTime ~ Congruency*PT + f(Indiv, model="iid"),
data=Stroop)
kable(summary(inla.r1)$fixed, digits=3)
| mean | sd | 0.025quant | 0.5quant | 0.975quant | mode | kld | |
|---|---|---|---|---|---|---|---|
| (Intercept) | 0.025 | 0.220 | -0.409 | 0.025 | 0.458 | 0.025 | 0 |
| CongruencyIncongruent | 0.538 | 0.260 | 0.026 | 0.538 | 1.050 | 0.538 | 0 |
| CongruencyNeutral | 0.096 | 0.260 | -0.416 | 0.096 | 0.608 | 0.096 | 0 |
| PTStroop | -0.395 | 0.260 | -0.907 | -0.395 | 0.117 | -0.395 | 0 |
| CongruencyIncongruent:PTStroop | -0.232 | 0.368 | -0.956 | -0.232 | 0.492 | -0.232 | 0 |
| CongruencyNeutral:PTStroop | 0.002 | 0.368 | -0.722 | 0.002 | 0.726 | 0.002 | 0 |
We can see that these are the same as in the simple linear model, which is good news. But because we also have a model, it is easy to look at the random effects:
kable(CalcHyperParStdDev(inla.r1), digits=3)
| mean | sd | 0.025quant | 0.5quant | 0.975quant | mode | |
|---|---|---|---|---|---|---|
| StdDev.for.the.Gaussian.observations | 0.795 | 0.059 | 0.687 | 0.792 | 0.920 | 0.785 |
| StdDev.for.Indiv | 0.518 | 0.121 | 0.316 | 0.506 | 0.789 | 0.483 |
For the moment the point is that we can see where the variation is, and even look at the uncertainty in the variance estimates. These are the posterior standard deviations for the random effects, so they are on the same scale as the fixed effects. The residual standard deviation is a bit bigger than the Individual effect, and is about the same as the largest fixed effect. We can calculate these in the classical approach above, but here they are part of the model, and we can avoid weirdness with negative estimates. One advantage of the Bayesian approach is that we get measures of uncertainty from the analysis without more messing around.
Now we can look at the ‘correct’ model:
inla.r1CT <-inla(ResponseTime ~ Congruency*PT + f(ConInd) + f(Indiv) + f(PTInd),
data=Stroop, control.compute = list(dic = TRUE, waic = TRUE))
kable(summary(inla.r1CT)$fixed, digits=3)
| mean | sd | 0.025quant | 0.5quant | 0.975quant | mode | kld | |
|---|---|---|---|---|---|---|---|
| (Intercept) | 0.025 | 0.222 | -0.413 | 0.025 | 0.462 | 0.025 | 0 |
| CongruencyIncongruent | 0.538 | 0.091 | 0.359 | 0.538 | 0.718 | 0.538 | 0 |
| CongruencyNeutral | 0.096 | 0.091 | -0.084 | 0.096 | 0.275 | 0.096 | 0 |
| PTStroop | -0.395 | 0.314 | -1.013 | -0.395 | 0.223 | -0.395 | 0 |
| CongruencyIncongruent:PTStroop | -0.232 | 0.129 | -0.486 | -0.232 | 0.021 | -0.232 | 0 |
| CongruencyNeutral:PTStroop | 0.002 | 0.129 | -0.251 | 0.002 | 0.256 | 0.002 | 0 |
The point estimates are unchanged, but the posterior standard deviations are appreciably smaller for the effects involving Congruency, but are a bit larger for the PT main effect. We can see the reason for this in the standard deviations for the random effects:
kable(CalcHyperParStdDev(inla.r1CT), digits=4, type='html')
| mean | sd | 0.025quant | 0.5quant | 0.975quant | mode | |
|---|---|---|---|---|---|---|
| StdDev.for.the.Gaussian.observations | 0.2798 | 0.0223 | 0.2394 | 0.2785 | 0.3267 | 0.2759 |
| StdDev.for.ConInd | 0.0059 | 0.0015 | 0.0031 | 0.0058 | 0.0089 | 0.0059 |
| StdDev.for.Indiv | 0.0099 | 0.0040 | 0.0052 | 0.0089 | 0.0205 | 0.0073 |
| StdDev.for.PTInd | 0.9095 | 0.0965 | 0.7332 | 0.9048 | 1.1119 | 0.8965 |
The standard deviation for the PT effect (which was missing in the previous model) is the largest, and is larger that the largest contrast in the fixed effects. The Individual variance is now close to zero, as is the variation in the Congruency effect between individuals. The residual variation is now also much smaller.
The sizes of these variances are what affect the posterior standard
deviations of the fixed effects: the Congruency effects have smaller
standard errors because the Congruency:Individual random
effect is well estimated.
So what does this tell us about the fixed effects? First we can ask whether we need the interactions.
Main effects only:
inla.main <-inla(ResponseTime ~ Congruency+PT + f(ConInd) + f(Indiv) + f(PTInd),
data=Stroop, control.compute = list(dic = TRUE, waic = TRUE))
We can compare the models in several ways. The Bayes factor can be calculated from the marginal log-likelihoods, which are -127.8 for the full model, and -118.1 for the model without the interaction. So the \(\log_e\) Bayes Factor is -9.67, giving the Bayes Factor of 15896 against the interaction.
We can also look at wAIC, which is 80.2 for the full model, and 81.6 for the model without the interaction. This suggests that the more complex model is better, although the difference is not great. So, partly to make life simple and also to make a point, we can look at the model with only main effects:
MainFixed <- summary(inla.main)$fixed
kable(MainFixed, digits=3)
| mean | sd | 0.025quant | 0.5quant | 0.975quant | mode | kld | |
|---|---|---|---|---|---|---|---|
| (Intercept) | 0.063 | 0.217 | -0.364 | 0.063 | 0.490 | 0.063 | 0 |
| CongruencyIncongruent | 0.422 | 0.066 | 0.293 | 0.422 | 0.551 | 0.422 | 0 |
| CongruencyNeutral | 0.097 | 0.066 | -0.032 | 0.097 | 0.226 | 0.097 | 0 |
| PTStroop | -0.472 | 0.302 | -1.067 | -0.472 | 0.123 | -0.472 | 0 |
And we can plot these, if it is easier to see what is going on:
par(mar=c(4.1,9,1,1))
plot(MainFixed[-1,"mean"], 2:4, xlim=range(MainFixed), yaxt="n", ann=FALSE)
segments(MainFixed[-1,"0.025quant"], 2:4,
MainFixed[-1,"0.975quant"], 2:4)
abline(v=0)
axis(2, at=2:4, labels=gsub("y", "y\n", rownames(MainFixed)[2:4]), las=1)
We can see that the contrast between Congruent and Incongruent is clearly positive, but the directions of the other effects could go either way. The typical response would then be to declare that there is an effect of congruency, but not of the previous trial. In some ways, though, this would be a bizarre conclusion, because the largest estimate is the previous trial effect. The problem, of course, is that the large variation between individuals in the PT effect means that the PT effect is more difficult to estimate.
The main effect in the ANOVA is to drastically reduce the PT mean square, but not the others. So the Congruency effect gets a large F value. This agrees with what the authors found.
Looking at the variance components we see, unsurprisingly, that the
PT:Indiveffect is large. This is what sucks up the PT sums
of squares. The other random effects are tiny. Quite what these mean is
for someone else to explain.
summary(mod.r1ct)$varcor
## Groups Name Std.Dev.
## Congruency:Indiv (Intercept) 6.6289e-02
## PT:Indiv (Intercept) 9.3305e-01
## Indiv (Intercept) 8.1794e-05
## Residual 2.7402e-01
Lots of conclusions could be made, and thoughs raised. Including:
PT:Individual effect. Does it make sense that people who
are quick after a Stroop trial are slow after a break trial, and
vice versa? So is this likely to be a real effect, or just
random noise being, well, random?For completeness, these are the frequentist mixed model analyses,
using REML. Without going into detail, they tell the same story: large
PT:Individual random effect, good estimates of the
congruency effects, and in the model without the interaction, the
largest point estimate has a large standard error.
library(lme4)
mod.r1 <-lmer(ResponseTime ~ Congruency*PT + (1|Indiv), data=Stroop)
mod.r1ct <-lmer(ResponseTime ~ Congruency*PT + (1|Indiv) +
(1|Congruency:Indiv) + (1|PT:Indiv), data=Stroop)
mod.main <-lmer(ResponseTime ~ Congruency+PT + (1|Indiv) +
(1|Congruency:Indiv) + (1|PT:Indiv), data=Stroop)
kable(summary(mod.r1ct)$coefficients, digits=3)
kable(summary(mod.r1ct)$varcor, digits=3)
This is code to change the priors to be more informative. The effect on the Bayes factor is to reuce it to “only” 400, so the results are not that different. The difference is mainly due to the more informative priors on the fixed effects.
# PC priors
PC <- list(prec = list(prior = "pc.prec", param = c(0.5, 0.01)))
inla.r1CT.pc <-inla(ResponseTime ~ Congruency*PT +
f(ConInd, hyper=PC) + f(Indiv, hyper=PC) + f(PTInd, hyper=PC),
data=StroopPredict, control.compute = list(dic = TRUE, waic = TRUE),
control.fixed=list(mean=0, mean.intercept=0, prec=1, prec.intercept=1))
inla.main.pc <-inla(ResponseTime ~ Congruency+PT +
f(ConInd, hyper=PC) + f(Indiv, hyper=PC) + f(PTInd, hyper=PC),
data=StroopPredict, control.compute = list(dic = TRUE, waic = TRUE),
control.fixed=list(mean=0, mean.intercept=0, prec=1, prec.intercept=1))
This is just a helper function, hidden away here.
CalcHyperParStdDev <- function(inlaObj) {
require(INLA)
SDs <- lapply(inlaObj$marginals.hyperpar,
function(mar) {
CalcSD <- function(m) {
m1 = inla.emarginal(function(x) x, m)
m2 = inla.emarginal(function(x) x^2L, m)
stdev = sqrt(m2 - m1^2L)
stdev
}
marg <- inla.tmarginal(function(x) 1/sqrt(x),
marginal = mar)
res <- c(mean = inla.emarginal(function(x) x, marg),
sd = CalcSD(marg),
quant = inla.qmarginal(p=c(0.025, 0.5, 0.975), marg),
mode = inla.mmarginal(marg))
names(res) <- c("mean", "sd", "0.025quant", "0.5quant",
"0.975quant", "mode")
res
})
PostSD <- t(as.data.frame(SDs))
rownames(PostSD) <- gsub("Precision", "StdDev", rownames(PostSD))
PostSD
}
PredData <- data.frame(
Treatment = rep(NA, 6),
ResponseTime = rep(NA, 6),
Indiv = rep(NA, 6),
PT = rep(c("Break", "Stroop"), each=3),
Congruency = rep(levels(Stroop$Congruency), times=2),
ResponseTimeUnscaled = rep(NA, 6),
ConInd = rep(NA, 6),
PTInd = rep(NA, 6)
)