Introduction

This is some analyses playing about with a data set that was re-analysed in a recent preprint by Don van den Bergh et al. The purpose of the preprint was to acknowledge that the JASP software had been set to default to a model that was wrong (at least for this data). What caught my eye was that it was being framed as “you have to redo your Bayesian repeated measures analyses”, and the paper itself was setting it up as a contrast between Bayesian and frequentist analyses. This was even though, as they acknowledge, the difference was in the models.

The Data

The data comes from an unpublished experiment on the Stroop effect. In essence, There are 3 treatments: congruent, neutral and incongruent, and the hypothesis is that the congruent should have the smallest respionse. There is also a hypothesis that the previous trial (whether it is neutral or a “proper” trial) interacts with this. Thus, we have a 2x3 design. This is repeated for 19 individuals, so they provide replication. Thus, this is a repeated designs study. The data are recorded as the means for each of the 2x3 levels for each individual (each level is also tested repeatedly, and only the means reported). I should express my thanks to the experimenters (Hershman, Dadon, Kisel & Henik) for making it available, and also the authors of the preprint for putting it online as part of the JASP package.

We can read in the data from GitHub, and then reformat it. I also scale the data, to make things easier later. In particular, default priors in the Bayesian analysis don’t like the unscaled data.

Data <- read.csv("https://raw.githubusercontent.com/jasp-stats/jasp-desktop/stable/Resources/Data%20Sets/Data%20Library/3.%20ANOVA/Stroop.csv")

library(tidyr)
library(knitr) # for nice tables
library(INLA) # for the Bayesian analysis later
library(lme4) # for the REML analysis later
Stroop <- gather(Data, key= "Treatment", 
                 value="ResponseTime", names(Data))
Stroop$Indiv <- factor(1:nrow(Data))
Stroop$PT <- factor(gsub(".*\\.", "", Stroop$Treatment))
Stroop$Congruency <- factor(gsub("\\..*", "", Stroop$Treatment))
Stroop$ResponseTimeUnscaled <- Stroop$ResponseTime
Stroop$ResponseTime <- scale(Stroop$ResponseTime)

First, we can plot the data:

par(mar=c(4,8,1,1))
plot(1,1, ylim=c(1,6), xlim=range(Data), ann=FALSE,
     yaxt="n")
apply(Data, 1, function(x) lines(x,1:6))
axis(2, at=1:6, labels=gsub("\\.", " ", names(Data)), las=1)

Eyeballing this suggests that individuals are either slower after a Stroop test or a break, but rarely both. However, there also seems to be some positive skew in the data, which might accentuate the issue. We will be naughty and ignore the skew, but it should probably be looked at (there is an excellent Exegesis by Bill Venables that contains some good advice on this: the best transformation might even be the same one). In the end it may not affect the main conclusions, but one would have to check.

The Modelling

Classical random effects ANOVA

We can start with the classical approach, that I learned 30 years ago, and appears to be the approach used in the preprint. For this data set it works, because the data are balanced and there are no problems with negative variance components.

The main interest is in the Congruency*PT terms. The individuals are there because variation between individuals might be important (replication and all that). We can start by looking at the simple (and wrong!) model:

mod.simple <- lm(ResponseTime ~ Congruency*PT, data=Stroop)
summary(mod.simple)
## 
## Call:
## lm(formula = ResponseTime ~ Congruency * PT, data = Stroop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4736 -0.7762 -0.1163  0.4450  2.7693 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                     0.024598   0.223595   0.110   0.9126  
## CongruencyIncongruent           0.538382   0.316212   1.703   0.0915 .
## CongruencyNeutral               0.095626   0.316212   0.302   0.7629  
## PTStroop                       -0.395219   0.316212  -1.250   0.2141  
## CongruencyIncongruent:PTStroop -0.232157   0.447191  -0.519   0.6047  
## CongruencyNeutral:PTStroop      0.002211   0.447191   0.005   0.9961  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9746 on 108 degrees of freedom
## Multiple R-squared:  0.09213,    Adjusted R-squared:  0.0501 
## F-statistic: 2.192 on 5 and 108 DF,  p-value: 0.06034

We can see that the contract between Congruent and Incongruent is the largest effect, with Incongruent images taking longer on average. The contrast between Neutral and Stroop prior images is the next largest, with neutral images (on average) taking longer. But these averages are modulated by the interaction, and there is also a lot of uncertainty.

Notice that the \(R^2\) is only 9%. The hope is that a lot of the unexplained variation is due to individuals. Although this won’t change the fact that only a small amount of the data is explained by the treatment.

The design is balanced which is nice for several reasons:

  1. the point estimates will be unchanged if we add random effects, although the standard errors will change.
  2. All of the information about the variation in tje different terms is in the ANOVA table (which is why it’s called Analysis of Variance).
  3. The order we pout the terms into the model doesn’t affect the ANOVA table, and the different sums of squares will be the same (in other words, we can just use Type I sums of squares).

So we can look at the ANOVA table for the saturated model:

mod.saturated <- lm(ResponseTime ~ Congruency*PT*Indiv, data=Stroop)
kable(an.sat <- anova(mod.saturated), digits=3)
Df Sum Sq Mean Sq F value Pr(>F)
Congruency 2 3.720 1.860 NaN NaN
PT 1 6.346 6.346 NaN NaN
Indiv 18 45.930 2.552 NaN NaN
Congruency:PT 2 0.345 0.172 NaN NaN
Congruency:Indiv 36 3.022 0.084 NaN NaN
PT:Indiv 18 50.936 2.830 NaN NaN
Congruency:PT:Indiv 36 2.701 0.075 NaN NaN
Residuals 0 0.000 NaN NA NA

The residual variance is 0, because the Congruency*PT*Indiv effect is the residual. This makes the tests nonsense, but that’s OK because (a) we can construct our own tests, and (b) our main interest here is the mean squares. The three main effects have a lot of variance, as well as the PT:Indiv interaction. This reflects the pattern we can see in the data of individuals being slow in one or another PT level.

The tests used in the original preprint are all also based on the mean squares, and calculating the correct F ratio from these. The only random factor is Individual, and this affects the variation in fixed effects. For example, the Congruency effect can vary between individuals, so the sum of squares is made up of contributions from Congruency, Congruency:Individual, and Congruency:PT:Individual (the last one is the residual, of course). The Congruency:Individual sums of squares has Congruency:Individual and Congruency:PT:Individual contributions. So if there is no Congruency effect, the mean squares should be the same: we can use an F test to test this. Similarly, the test for the PT effect is the PT mean square divided by the PT:Individual mean square.

The F ratio for the Congruency effect is 1.86/0.084 = 22.158, which is exactly what the preprint gives. The other tests are also the same as in the preprint: working that out is an exercise for the user.

In contract, the Bayesian method used in the preprint only has Individual as a random effect, so the model is

mod.B <- lm(ResponseTime ~ Congruency*PT+Indiv, data=Stroop)
kable(an.B <- anova(mod.B), digits=3)
Df Sum Sq Mean Sq F value Pr(>F)
Congruency 2 3.720 1.860 2.955 0.057
PT 1 6.346 6.346 10.080 0.002
Indiv 18 45.930 2.552 4.053 0.000
Congruency:PT 2 0.345 0.172 0.274 0.761
Residuals 90 56.659 0.630 NA NA

Comparing to the saturated model, we see that the Residuals row is the only one that is different, and the degrees of freedom and sums of squares are the sums of the omitted terms. The F tests are now valid, in the sense that they should compare the mean squares to the residual mean square. They suggest there is clear evidence for an effect of PT, as well as Individual. And weak evidence of a Congruency main effect.

But notice that the mean squares are the same (as are the effect sizes, if you want to check). So the change in our conclusions are not due to changes in the size of any effect, it’s due to changes in the variation we would expect to see: it is the denominator of the F ratio that is changing.

Modern random effects ANOVA, Bayesian Version

So far we have shown that the behaviour seen in the preprint can be replicated using a traditional analysis. But statistics has moved on, and the more modern approach is to explicitly specify and fit a random effects model. There are all sorts of advantage to doing this (e.g. it is easier to work with unbalanced data). For this data the models will actually be the same, but it is easier to see what is going on.

Here I will use a Bayesian model, partly to show that we can replicate the same behaviour. The code for the non-Bayesian modelling is below: the results are almost the same. I will be naughty and use the default priors, but there is code in the appendix to use more informative priors.

The first model is the “Bayesian” model, which only has an Individual random effect. First we create a couple of new variables.

Stroop$ConInd <- factor(paste0(Stroop$Congruency, Stroop$Indiv))
Stroop$PTInd <- factor(paste0(Stroop$PT, Stroop$Indiv))

inla.r1 <-inla(ResponseTime ~ Congruency*PT + f(Indiv, model="iid"),
                data=Stroop)
kable(summary(inla.r1)$fixed, digits=3)
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 0.025 0.220 -0.409 0.025 0.458 0.025 0
CongruencyIncongruent 0.538 0.260 0.026 0.538 1.050 0.538 0
CongruencyNeutral 0.096 0.260 -0.416 0.096 0.608 0.096 0
PTStroop -0.395 0.260 -0.907 -0.395 0.117 -0.395 0
CongruencyIncongruent:PTStroop -0.232 0.368 -0.956 -0.232 0.492 -0.232 0
CongruencyNeutral:PTStroop 0.002 0.368 -0.722 0.002 0.726 0.002 0

We can see that these are the same as in the simple linear model, which is good news. But because we also have a model, it is easy to look at the random effects:

kable(CalcHyperParStdDev(inla.r1), digits=3)
mean sd 0.025quant 0.5quant 0.975quant mode
StdDev.for.the.Gaussian.observations 0.795 0.059 0.687 0.792 0.920 0.785
StdDev.for.Indiv 0.518 0.121 0.316 0.506 0.789 0.483

For the moment the point is that we can see where the variation is, and even look at the uncertainty in the variance estimates. These are the posterior standard deviations for the random effects, so they are on the same scale as the fixed effects. The residual standard deviation is a bit bigger than the Individual effect, and is about the same as the largest fixed effect. We can calculate these in the classical approach above, but here they are part of the model, and we can avoid weirdness with negative estimates. One advantage of the Bayesian approach is that we get measures of uncertainty from the analysis without more messing around.

Now we can look at the ‘correct’ model:

inla.r1CT <-inla(ResponseTime ~ Congruency*PT + f(ConInd) + f(Indiv) + f(PTInd),
                 data=Stroop, control.compute = list(dic = TRUE, waic = TRUE))
kable(summary(inla.r1CT)$fixed, digits=3)
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 0.025 0.222 -0.413 0.025 0.462 0.025 0
CongruencyIncongruent 0.538 0.091 0.359 0.538 0.718 0.538 0
CongruencyNeutral 0.096 0.091 -0.084 0.096 0.275 0.096 0
PTStroop -0.395 0.314 -1.013 -0.395 0.223 -0.395 0
CongruencyIncongruent:PTStroop -0.232 0.129 -0.486 -0.232 0.021 -0.232 0
CongruencyNeutral:PTStroop 0.002 0.129 -0.251 0.002 0.256 0.002 0

The point estimates are unchanged, but the posterior standard deviations are appreciably smaller for the effects involving Congruency, but are a bit larger for the PT main effect. We can see the reason for this in the standard deviations for the random effects:

kable(CalcHyperParStdDev(inla.r1CT), digits=4, type='html')
mean sd 0.025quant 0.5quant 0.975quant mode
StdDev.for.the.Gaussian.observations 0.2798 0.0223 0.2394 0.2785 0.3267 0.2759
StdDev.for.ConInd 0.0059 0.0015 0.0031 0.0058 0.0089 0.0059
StdDev.for.Indiv 0.0099 0.0040 0.0052 0.0089 0.0205 0.0073
StdDev.for.PTInd 0.9095 0.0965 0.7332 0.9048 1.1119 0.8965

The standard deviation for the PT effect (which was missing in the previous model) is the largest, and is larger that the largest contrast in the fixed effects. The Individual variance is now close to zero, as is the variation in the Congruency effect between individuals. The residual variation is now also much smaller.

The sizes of these variances are what affect the posterior standard deviations of the fixed effects: the Congruency effects have smaller standard errors because the Congruency:Individual random effect is well estimated.

So what does this tell us about the fixed effects? First we can ask whether we need the interactions.

Main effects only:

inla.main <-inla(ResponseTime ~ Congruency+PT + f(ConInd) + f(Indiv) + f(PTInd),
                 data=Stroop, control.compute = list(dic = TRUE, waic = TRUE))

We can compare the models in several ways. The Bayes factor can be calculated from the marginal log-likelihoods, which are -127.8 for the full model, and -118.1 for the model without the interaction. So the \(\log_e\) Bayes Factor is -9.67, giving the Bayes Factor of 15896 against the interaction.

We can also look at wAIC, which is 80.2 for the full model, and 81.6 for the model without the interaction. This suggests that the more complex model is better, although the difference is not great. So, partly to make life simple and also to make a point, we can look at the model with only main effects:

MainFixed <- summary(inla.main)$fixed
kable(MainFixed, digits=3)
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 0.063 0.217 -0.364 0.063 0.490 0.063 0
CongruencyIncongruent 0.422 0.066 0.293 0.422 0.551 0.422 0
CongruencyNeutral 0.097 0.066 -0.032 0.097 0.226 0.097 0
PTStroop -0.472 0.302 -1.067 -0.472 0.123 -0.472 0

And we can plot these, if it is easier to see what is going on:

par(mar=c(4.1,9,1,1))
plot(MainFixed[-1,"mean"], 2:4, xlim=range(MainFixed), yaxt="n", ann=FALSE)
segments(MainFixed[-1,"0.025quant"], 2:4,
         MainFixed[-1,"0.975quant"], 2:4)
abline(v=0)
axis(2, at=2:4, labels=gsub("y", "y\n", rownames(MainFixed)[2:4]), las=1)

We can see that the contrast between Congruent and Incongruent is clearly positive, but the directions of the other effects could go either way. The typical response would then be to declare that there is an effect of congruency, but not of the previous trial. In some ways, though, this would be a bizarre conclusion, because the largest estimate is the previous trial effect. The problem, of course, is that the large variation between individuals in the PT effect means that the PT effect is more difficult to estimate.

The main effect in the ANOVA is to drastically reduce the PT mean square, but not the others. So the Congruency effect gets a large F value. This agrees with what the authors found.

Looking at the variance components we see, unsurprisingly, that the PT:Indiveffect is large. This is what sucks up the PT sums of squares. The other random effects are tiny. Quite what these mean is for someone else to explain.

summary(mod.r1ct)$varcor
##  Groups           Name        Std.Dev.  
##  Congruency:Indiv (Intercept) 6.6289e-02
##  PT:Indiv         (Intercept) 9.3305e-01
##  Indiv            (Intercept) 8.1794e-05
##  Residual                     2.7402e-01

Conclusions

Lots of conclusions could be made, and thoughs raised. Including:

  1. The issues are nothing to do with a Bayesian-frequentist difference: the results can be replicated in a purely frequentist analysis. To set this up as “the Bayesian analysis is different from the frequentist analysis” is misleading. It is all about how the specific implementations.
  2. This is a case where the choice of random effects makes a big difference, which I guess was the point of the original preprint.
  3. We have some nice modern approaches to this sort of data. They help you focus on the model as a model, which (I think at least) is clearer.
  4. Significance testing is not the only think to use. Here the test results depend on how the standard errors are estimated: the only time the point estimates changed was when the interactions were removed. If you are fiddling with your error terms, it really helps to know if this changes the point estimates or their standard errors.
  5. Bayes Factors are generally a pain. Read the preprint to see how they just make things more complicated. Also note that they are sensitive to the priors: here when I went from the default priors to not-too-informative priors the Bayes Factor changed by about 20 times. Luckily it didn’t make a difference because it was so large anyway, but be warned. And always state your priors (like, um, I haven’t really done).
  6. One aspect I have not discussed is the psychological interpretation. That is because I am not a psychologist. But it seems strange to me (who doesn’t know the subject area) that the biggest effects seems to be the PT:Individual effect. Does it make sense that people who are quick after a Stroop trial are slow after a break trial, and vice versa? So is this likely to be a real effect, or just random noise being, well, random?

Appendices

Modern random effects ANOVA, REML Version

For completeness, these are the frequentist mixed model analyses, using REML. Without going into detail, they tell the same story: large PT:Individual random effect, good estimates of the congruency effects, and in the model without the interaction, the largest point estimate has a large standard error.

library(lme4)
mod.r1 <-lmer(ResponseTime ~ Congruency*PT + (1|Indiv), data=Stroop)
mod.r1ct <-lmer(ResponseTime ~ Congruency*PT + (1|Indiv) +
                  (1|Congruency:Indiv) + (1|PT:Indiv), data=Stroop)
mod.main <-lmer(ResponseTime ~ Congruency+PT + (1|Indiv) +
                  (1|Congruency:Indiv) + (1|PT:Indiv), data=Stroop)

kable(summary(mod.r1ct)$coefficients, digits=3)
kable(summary(mod.r1ct)$varcor, digits=3)

PC priors

This is code to change the priors to be more informative. The effect on the Bayes factor is to reuce it to “only” 400, so the results are not that different. The difference is mainly due to the more informative priors on the fixed effects.

# PC priors
PC <- list(prec = list(prior = "pc.prec", param = c(0.5, 0.01)))
inla.r1CT.pc <-inla(ResponseTime ~ Congruency*PT + 
                      f(ConInd, hyper=PC) + f(Indiv, hyper=PC) + f(PTInd, hyper=PC),
                 data=StroopPredict, control.compute = list(dic = TRUE, waic = TRUE), 
                 control.fixed=list(mean=0, mean.intercept=0, prec=1, prec.intercept=1))
inla.main.pc <-inla(ResponseTime ~ Congruency+PT + 
                      f(ConInd, hyper=PC) + f(Indiv, hyper=PC) + f(PTInd, hyper=PC),
                 data=StroopPredict, control.compute = list(dic = TRUE, waic = TRUE), 
                 control.fixed=list(mean=0, mean.intercept=0, prec=1, prec.intercept=1))

Code to transform posterior precisions to posterior standard deviations

This is just a helper function, hidden away here.

CalcHyperParStdDev <- function(inlaObj) {
  require(INLA)
  SDs <- lapply(inlaObj$marginals.hyperpar, 
                function(mar) {
                  CalcSD <- function(m) {
                    m1 = inla.emarginal(function(x) x, m)
                    m2 = inla.emarginal(function(x) x^2L, m)
                    stdev = sqrt(m2 - m1^2L)
                    stdev
                  }
                  marg <- inla.tmarginal(function(x) 1/sqrt(x), 
                                         marginal = mar)
                  res <- c(mean = inla.emarginal(function(x) x, marg),
                           sd = CalcSD(marg),
                           quant = inla.qmarginal(p=c(0.025, 0.5, 0.975), marg),
                           mode = inla.mmarginal(marg))
                  names(res) <- c("mean", "sd", "0.025quant", "0.5quant",
                                  "0.975quant", "mode")
                  res
                })
  
  PostSD <- t(as.data.frame(SDs))
  rownames(PostSD) <- gsub("Precision", "StdDev", rownames(PostSD))
  PostSD
}

PredData <- data.frame(
  Treatment = rep(NA, 6),
  ResponseTime = rep(NA, 6),
  Indiv = rep(NA, 6),
  PT = rep(c("Break", "Stroop"), each=3),
  Congruency = rep(levels(Stroop$Congruency), times=2), 
  ResponseTimeUnscaled = rep(NA, 6),
  ConInd = rep(NA, 6),
  PTInd = rep(NA, 6)
)