Linear mixed-effects modeling considerations

v.1.0 : October 2015 : Athanassios Protopapas

This document discusses and demonstrates some issues arising when using general linear mixed-effects models to analyze experimental data. It is assumed that data have been collected, visualized, cleaned up, and stored in an appropriately structured data frame.

For the demonstrations below we use a real experimental dataset including response times from 74 participants responding to 288 items (half words, half pseudowords) in three priming conditions: Neutral, Match, and Mismatch.

if (Sys.info()["sysname"]=="Windows") {
  Sys.setlocale("LC_CTYPE","Greek")
  data2 <- read.table("LMMdemoA.txt")   # Windows: read ANSI-encoded text
} else {
  data2 <- read.table("LMMdemo.txt")    # Mac: read UTF-encoded text
}
data2 <- data2[!is.na(data2$logRT),]
str(data2)
## 'data.frame':    19705 obs. of  8 variables:
##  $ subject: Factor w/ 74 levels "aggi","agka",..: 6 69 70 48 29 3 21 9 23 5 ...
##  $ item   : Factor w/ 288 levels "αβίρα","άβιρα",..: 100 100 100 100 100 100 100 100 100 100 ...
##  $ Corder : num  -1.033 -1.587 0.73 -1.712 -0.615 ...
##  $ lex    : Factor w/ 2 levels "Pseudo","Word": 2 2 2 2 2 2 2 2 2 2 ...
##  $ cond   : Factor w/ 3 levels "Match","Mismatch",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ err    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ logRT  : num  6.97 6.38 6.53 6.47 6.99 ...
##  $ CprevRT: num  0.674 -0.774 -0.287 -0.126 1.449 ...

LMEM vs. ANOVA : no magic bullet

In the first section, we will compare LMEM analysis to traditional analyses of variance, including only word data (i.e., excluding pseudowords) and focusing on condition comparisons by participants only. We will first consider a simple one-factor comparison involving only two conditions, namely Match and Neutral, ignoring for the moment the Mismatch condition.

rld <- droplevels(subset(data2, lex=="Word" & cond %in% c("Neutral","Match")))
rld$cond <- relevel(rld$cond, ref="Neutral")

The traditional approaches involve aggregating data to get a single value per participant.

agd <- aggregate(logRT ~ cond+subject,rld,mean,na.rm=T)

The simplest comparison is via paired t-test:

t.test(logRT ~ cond, agd, paired=T)
## 
##  Paired t-test
## 
## data:  logRT by cond
## t = 6.7691, df = 73, p-value = 2.784e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.04051920 0.07433559
## sample estimates:
## mean of the differences 
##               0.0574274

This is equivalent to a within-participants ANOVA:

summary(aov(logRT ~ cond + Error(subject), agd))
## 
## Error: subject
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 73  4.674 0.06403               
## 
## Error: Within
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## cond       1 0.1220 0.12202   45.82 2.78e-09 ***
## Residuals 73 0.1944 0.00266                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The exact same \(p\) value in the two analyses is no coincidence. The analyses are equivalent, and the two statistics are related by \(F=t^2\):

t.test(logRT ~ cond, agd, paired=T)$statistic^2
##       t 
## 45.8202

The same results can be achieved with unaggregated data using packages (ez and afex) designed to facilitate this analysis. (Note that these packages will not tolerate missing values in the dependent variable; those must be explicitly excluded.)

library(ez)
library(afex)
ezANOVA(data=subset(rld,!is.na(logRT)),dv=.(logRT),wid=.(subject),within=.(cond))
## Warning: Collapsing data to cell means. *IF* the requested effects are a
## subset of the full design, you must use the "within_full" argument, else
## results may be inaccurate.
## $ANOVA
##   Effect DFn DFd       F            p p<.05        ges
## 2   cond   1  73 45.8202 2.784232e-09     * 0.02444917
ez.glm("subject","logRT",subset(rld,!is.na(logRT)),within="cond")
## Warning: ez.glm was renamed to aov_ez and is now deprecated.
## Warning in aov_car(formula = as.formula(formula), data = data,
## fun.aggregate = fun.aggregate, : More than one observation per cell,
## aggregating the data using mean (i.e, fun.aggregate = mean)!
##   Effect    df  MSE         F ges p.value
## 1   cond 1, 73 0.00 45.82 *** .02  <.0001

This result can also be produced using the standard aov function, provided we take account of the grouping structure of the data, that is, the fact that there are multiple data points for each level of condition per participant (nested grouping).

a1 <- aov(logRT ~ cond + Error(subject/cond), rld)
sa1 <- summary(a1)
sa1
## 
## Error: subject
##           Df Sum Sq Mean Sq F value Pr(>F)
## cond       1   0.04  0.0407   0.015  0.903
## Residuals 72 195.90  2.7209               
## 
## Error: subject:cond
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## cond       1  5.227   5.227    45.9 2.72e-09 ***
## Residuals 73  8.314   0.114                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##             Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 6216  368.6  0.0593

(The slight numerical difference is due to the missing data affecting averaging.)

The same result can be produced via general linear mixed-effects modeling, retaining the appropriate grouping structure.

library(lmerTest)
l1a <- lmer(logRT ~ cond + (1|subject/cond), rld)
sl1a <- summary(l1a)
sl1a
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: logRT ~ cond + (1 | subject/cond)
##    Data: rld
## 
## REML criterion at convergence: 424.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9449 -0.6819 -0.1634  0.5291  4.4787 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  cond:subject (Intercept) 0.001268 0.03561 
##  subject      (Intercept) 0.030617 0.17498 
##  Residual                 0.059298 0.24351 
## Number of obs: 6364, groups:  cond:subject, 148; subject, 74
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)  6.579853   0.021209 79.050000 310.235  < 2e-16 ***
## condMatch   -0.057400   0.008464 72.700000  -6.781 2.68e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr)
## condMatch -0.201

In this model, the variance among all possible combinations of condition and participant is estimated as a model parameter.

Alternatively, we can fit a general linear mixed-effects model (LMEM) in which grouping will be limited to participants, and condition differences per participant will be modeled as random slopes.

l1b <- lmer(logRT ~ cond + (cond|subject), rld)
sl1b <- summary(l1b)
sl1b
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: logRT ~ cond + (cond | subject)
##    Data: rld
## 
## REML criterion at convergence: 424.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9500 -0.6830 -0.1631  0.5305  4.4736 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 0.032459 0.18016       
##           condMatch   0.002538 0.05037  -0.20
##  Residual             0.059298 0.24351       
## Number of obs: 6364, groups:  subject, 74
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)  6.579857   0.021391 72.790000 307.599  < 2e-16 ***
## condMatch   -0.057408   0.008465 72.700000  -6.781 2.68e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr)
## condMatch -0.241

In this model, grouping of condition within participants is associated with “within-participants” variance attributable to factor “condition”. It is against this background variance that the significance of the fixed effect of condition must be evaluated. The “random structure” of our model accounts for the participant grouping (the “random effect”, since participants are sampled from a population to which we want to generalize) and for the within-participant effects of condition, which may be too variable to support a significant fixed effect of condition.

Because the design is the same, the data are the same, and the variance is structured in the same way, the two LMEMs and ANOVA produce the same results. Note that the \(t\) value from the LMEM (squared) is very close to the \(F\) value from the ANOVA, as they refer to the same effect (and one degree of freedom).

coef(sl1a)["condMatch","t value"]^2 
## [1] 45.98702
sa1[["Error: subject:cond"]][[1]]["F value"]
##           F value
## cond       45.896
## Residuals

Note, also, that when trial-level data are submitted, it is necessary to specify the full grouping structure, otherwise the results will be invalid (equally so in ANOVA and LMEM):

l2 <- lmer(logRT ~ cond + (1|subject), rld)
sl2 <- summary(l2)
sl2
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: logRT ~ cond + (1 | subject)
##    Data: rld
## 
## REML criterion at convergence: 443.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8898 -0.6841 -0.1657  0.5299  4.4945 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 0.03121  0.1767  
##  Residual             0.05993  0.2448  
## Number of obs: 6364, groups:  subject, 74
## 
## Fixed effects:
##               Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)  6.580e+00  2.100e-02  7.600e+01  313.37   <2e-16 ***
## condMatch   -5.735e-02  6.141e-03  6.289e+03   -9.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr)
## condMatch -0.148
a2 <- aov(logRT ~ cond + Error(subject), rld)
sa2 <- summary(a2)
sa2
## 
## Error: subject
##           Df Sum Sq Mean Sq F value Pr(>F)
## cond       1   0.04  0.0407   0.015  0.903
## Residuals 72 195.90  2.7209               
## 
## Error: Within
##             Df Sum Sq Mean Sq F value Pr(>F)    
## cond         1    5.2   5.227   87.22 <2e-16 ***
## Residuals 6289  376.9   0.060                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(sl2)["condMatch","t value"]^2 
## [1] 87.23031
sa2[["Error: Within"]][[1]]["F value"]
##           F value
## cond        87.22
## Residuals

An LMEM with an intercept-only random structure is like an ANOVA with unaggregated data points per participant, causing variance from an unrelated source to affect the comparison of interest.

Factor coding

In the examples above, condition is a “factor”, that is, a nominal variable. Actually, in the calculations a numerical variable is used instead, applying “treatment coding”, with the reference level receiving a value of zero and the comparison level receiving a value of one. We can duplicate this coding directly using a numeric variable:

rld$condN <- ifelse(rld$cond=="Neutral", 0, ifelse(rld$cond=="Match", 1, NA))
l1c <- lmer(logRT ~ condN + (condN|subject), rld)
sl1c <- summary(l1c)
sl1c
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: logRT ~ condN + (condN | subject)
##    Data: rld
## 
## REML criterion at convergence: 424.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9500 -0.6830 -0.1631  0.5305  4.4736 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 0.032459 0.18016       
##           condN       0.002538 0.05037  -0.20
##  Residual             0.059298 0.24351       
## Number of obs: 6364, groups:  subject, 74
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)  6.579857   0.021391 72.790000 307.599  < 2e-16 ***
## condN       -0.057408   0.008465 72.700000  -6.781 2.68e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## condN -0.241

Alternatively, we can code one level as \(-0.5\) and another as \(+0.5\), retaining the difference of one unit between the two conditions but setting the “Intercept” (i.e., the reference value against which effects are computed) to the grand mean instead. This is known as “deviation” coding.

rld$condN <- ifelse(rld$cond=="Neutral", -0.5, ifelse(rld$cond=="Match", +0.5, NA))
l1d <- lmer(logRT ~ condN + (condN|subject), rld)
sl1d <- summary(l1d)
sl1d
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: logRT ~ condN + (condN | subject)
##    Data: rld
## 
## REML criterion at convergence: 424.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9500 -0.6830 -0.1631  0.5305  4.4736 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 0.031258 0.17680       
##           condN       0.002538 0.05037  -0.06
##  Residual             0.059298 0.24351       
## Number of obs: 6364, groups:  subject, 74
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)  6.551153   0.020779 72.920000 315.280  < 2e-16 ***
## condN       -0.057408   0.008465 72.700000  -6.781 2.68e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## condN -0.045

Naturally, coding of the condition levels does not affect whether their difference is significant or not.

Note also that the \(t\) value from the LMEMs is about the same (to the second decimal digit) as that of the lowly paired \(t\) test we tried first. This is comforting: when the same data and the same structure are submitted to different statistical procedures, you want to get the same answer from all of them, otherwise you wouldn’t know what to believe. The power of LMEMs, then, is not that they give you statistically significant results where more traditional approaches don’t. So why the fuss?

Crossed random effects and trial-level covariates

The most important feature of LMEMs for psycholinguistics is the ability to account for participant and item variance simultaneously. Before LMEMs, we would have to conduct two ANOVAs on these data, one by participants (averaging over items) and one by items (averaging over participants), and then decide whether our results are statistically significant on the basis of both analyses. This is suboptimal because the two grouping factors are not considered together and the criteria for significance are not very well founded (and perhaps a bit too conservative). A sketch of this two-ANOVA approach is shown below.
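For reference, here is a minimal sketch of the traditional two-ANOVA approach (the “F1/F2” analyses), mirroring the aggregated by-participants analysis used earlier; it is shown for illustration only and its output is omitted.

f1d <- aggregate(logRT ~ cond+subject, rld, mean, na.rm=T) # average over items
f2d <- aggregate(logRT ~ cond+item, rld, mean, na.rm=T)    # average over participants
summary(aov(logRT ~ cond + Error(subject), f1d))           # by-participants (F1)
summary(aov(logRT ~ cond + Error(item), f2d))              # by-items (F2)

In contrast, today we can easily fit a single LMEM accounting for both random factors: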

l3a <- lmer(logRT ~ cond + (cond|subject) + (cond|item), rld)
sl3a <- summary(l3a)
sl3a
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: logRT ~ cond + (cond | subject) + (cond | item)
##    Data: rld
## 
## REML criterion at convergence: -967.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.3894 -0.6462 -0.1194  0.5120  5.6627 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  item     (Intercept) 0.015467 0.12437       
##           condMatch   0.005101 0.07142  -0.09
##  subject  (Intercept) 0.034404 0.18548       
##           condMatch   0.001258 0.03547  -0.28
##  Residual             0.043859 0.20943       
## Number of obs: 6364, groups:  item, 144; subject, 74
## 
## Fixed effects:
##               Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)   6.594618   0.024223 104.990000 272.248  < 2e-16 ***
## condMatch    -0.059114   0.008978 109.410000  -6.584 1.64e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr)
## condMatch -0.209

A second feature that can be very useful in modeling psycholinguistic data is the ability to account for trends that are defined at the trial level, including temporal dependencies. Baayen and Milin (2010) have pointed out that response times tend to follow long-term trends that may differ among participants (reflecting individuals’ balance between learning and fatigue) as well as strong correlations between successive trials. Explicitly modeling these trends improves the fit of the model, and its validity, to the extent that (a) additional variance is accounted for and (b) the error term more closely approximates the theoretical requirement of independence.

In the simple example below, trial order and previous logRT have been normalized (centered and scaled to \(SD=1.0\)).

l3b <- lmer(logRT ~ cond + Corder + CprevRT + (cond|subject) + (0+Corder|subject) + (cond|item), rld)
sl3b <- summary(l3b)
sl3b
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: 
## logRT ~ cond + Corder + CprevRT + (cond | subject) + (0 + Corder |  
##     subject) + (cond | item)
##    Data: rld
## 
## REML criterion at convergence: -1179.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5446 -0.6357 -0.1083  0.5009  6.0050 
## 
## Random effects:
##  Groups    Name        Variance  Std.Dev. Corr 
##  item      (Intercept) 0.0159384 0.12625       
##            condMatch   0.0050348 0.07096  -0.11
##  subject   Corder      0.0016905 0.04112       
##  subject.1 (Intercept) 0.0290583 0.17046       
##            condMatch   0.0009273 0.03045  -0.30
##  Residual              0.0416189 0.20401       
## Number of obs: 6364, groups:  item, 144; subject, 74
## 
## Fixed effects:
##               Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)  6.592e+00  2.274e-02  1.100e+02 289.816  < 2e-16 ***
## condMatch   -5.915e-02  8.634e-03  1.090e+02  -6.851  4.5e-10 ***
## Corder      -7.499e-03  5.470e-03  7.200e+01  -1.371    0.175    
## CprevRT      2.657e-02  3.419e-03  6.170e+03   7.771  9.1e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) cndMtc Corder
## condMatch -0.211              
## Corder    -0.001 -0.001       
## CprevRT   -0.015  0.002  0.032

Each trial has a unique trial order and a unique previous RT, so the only way to account for these effects is to process data at the individual trial level. However, trial-level covariates need not be individually trial-specific, as in this example. They can also be made up of unique combinations of the two random (grouping) factors. For example, you may want to account for participant age and word frequency: although a participant’s age applies to all the items that participant responds to, and a word’s frequency applies to all participants responding to that word, the age-frequency combination is trial-specific in that it applies to one participant’s response to one word. Being able to model crossed random effects thus also allows you to simultaneously account for covariates related to both random factors.
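As a purely hypothetical sketch (neither variable exists in this dataset, so the call is shown commented out), such covariates would simply enter the fixed part of the model, with their interaction capturing the trial-specific combination:

# Hypothetical: Cage (centered participant age) and Cfreq (centered log word
# frequency) are assumed columns that are not present in this data frame.
# l3x <- lmer(logRT ~ cond + Cage*Cfreq + Corder + CprevRT +
#             (cond|subject) + (cond|item), rld)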

These two things (crossed random effects and trial-level covariates) are simply not possible in the traditional analyses.

Model “criticism”

From the perspective of model fit, there are several steps that can be taken to examine and revise the model. Before discussing the random structure (in a later section), let us consider the model residuals. As in any regression model, the residuals must be normally distributed and there should be no relationship between their dispersion and the predicted dependent variable. Large residuals indicate poor fit for the specific data points from which they arise; that is, large residuals constitute post-hoc outliers. From the point of view of model fit, large residuals can be more problematic than outliers in the dependent variable (at least for low-leverage outliers). So, one strategy to improve model fit is to remove the points with the largest residuals.

plot(l3b)

Cutoff points for \(p<.001\) or \(p<.0001\) (two-tailed) can be derived from the normal distribution:

qnorm(.9995)
## [1] 3.290527
qnorm(.99995)
## [1] 3.890592

Data points with absolute normalized residuals exceeding the cutoff are removed from the dataset, and the model is re-fit. We choose to define a new dependent variable, rather than simply omitting the points from the analysis, to retain the comparability of data frames, so that we can align the two fits and the other variables, if needed.

crit <- abs(scale(resid(l3b)))
cutf <- qnorm(.9995)
sum(crit>cutf) # number of points to be removed
## [1] 40
rld$logRTr <- ifelse(crit>cutf, NA, rld$logRT)[,1]
l3c <- lmer(logRTr ~ cond + Corder + CprevRT + (cond|subject) + (0+Corder|subject) + (cond|item), rld, na.action=na.exclude)
sl3c <- summary(l3c)
sl3c
## Linear mixed model fit by REML t-tests use Satterthwaite approximations
##   to degrees of freedom [lmerMod]
## Formula: 
## logRTr ~ cond + Corder + CprevRT + (cond | subject) + (0 + Corder |  
##     subject) + (cond | item)
##    Data: rld
## 
## REML criterion at convergence: -1719
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0702 -0.6435 -0.1013  0.5273  3.6178 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev. Corr 
##  item      (Intercept) 0.015778 0.12561       
##            condMatch   0.005110 0.07149  -0.11
##  subject   Corder      0.001757 0.04191       
##  subject.1 (Intercept) 0.027951 0.16719       
##            condMatch   0.001093 0.03306  -0.32
##  Residual              0.037929 0.19475       
## Number of obs: 6324, groups:  item, 144; subject, 74
## 
## Fixed effects:
##               Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)  6.586e+00  2.236e-02  1.110e+02 294.501  < 2e-16 ***
## condMatch   -5.891e-02  8.667e-03  1.140e+02  -6.797 5.12e-10 ***
## Corder      -7.727e-03  5.498e-03  7.200e+01  -1.405    0.164    
## CprevRT      3.247e-02  3.413e-03  6.122e+03   9.514  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) cndMtc Corder
## condMatch -0.227              
## Corder    -0.001 -0.001       
## CprevRT   -0.017  0.004  0.030

(Note that, when there are missing data in the dependent variable, the model is effectively fit to a subset of the data frame. This results in residual and predicted vectors that do not align properly with the data frame and are of different length. To ensure that all rows are included in model-based outputs, such as residuals, we add na.action=na.exclude in the model specification.)
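A quick sanity check of this alignment, assuming l3c was fitted with na.action=na.exclude as above:

length(resid(l3c)) == nrow(rld) # TRUE: the residual vector is NA-padded to match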

The outcome of this minor adjustment (40 points out of 6364 is 0.63%) is a huge decrease in BIC and noticeable improvement in the quantile plot, that is, an improved model fit that increases our confidence in its estimates.

BIC(l3b,l3c)
## Warning in BIC.default(l3b, l3c): models are not all fitted to the same
## number of observations
##     df       BIC
## l3b 12 -1074.565
## l3c 12 -1614.001
par(mfrow=c(1,2))
ylim <- max(crit)
qqnorm(scale(resid(l3b)),ylim=c(-ylim,ylim))
qqline(scale(resid(l3b)))
qqnorm(scale(resid(l3c)),ylim=c(-ylim,ylim))
qqline(scale(resid(l3c)))

How many points should/can we remove? It stands to reason that, if you are modeling 1,000 points, you should be able to tolerate residuals with deviations corresponding to \(p \ge .001\); conversely, you are justified in removing data points with residual deviations corresponding to \(p < .001\). Similarly, if you are modeling 10,000 points, you can discard residuals by a \(p < .0001\) criterion. More relaxed criteria are also possible, but it is not clear that the improvement in model fit justifies discarding data, since model fit is not our sole (or even primary) objective.
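This reasoning suggests a rough heuristic, sketched below (a formulation for illustration, not an established rule): choose a two-tailed criterion of roughly \(p = 1/n\), so that about one point is expected beyond the cutoff by chance.

n <- sum(!is.na(rld$logRT)) # number of modeled data points
cutf <- qnorm(1 - (1/n)/2)  # two-tailed cutoff corresponding to p = 1/n
cutf                        # about 3.8 for ~6400 points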

Back to the full model

Now it is time to go back to our desired test involving three levels of condition (Match and Mismatch against Neutral) over two lexicality conditions (Word and Pseudo), including possible interactions. We want to avoid degenerate solutions and nonsignificant terms. We also want to include temporal effects (of trial order and previous RT) and to apply minimal, post-hoc data trimming on the basis of model fit. It is not entirely clear in which order all of these enhancements should be applied.

Moreover, it is not clear to what extent all of these requirements are achievable, or even desirable. For example, from a model-fit point of view, nonsignificant terms should be dropped. This includes both fixed and random effects terms. However, from a hypothesis-testing perspective, terms are included in the model for theoretical reasons, in order to examine their significance, and are retained even when nonsignificant if there are theoretical hypotheses involving their interaction with other terms. From this point of view, fixed effects are not removed because they are what we want to test; and random effects are not removed because they form the background against which fixed effects are evaluated (recall the “maximal random structure” recommendation). It seems that we are simultaneously trying to serve two masters, who are partly at cross-purposes, and it is no wonder that we sometimes end up with conflicting requirements and indefensible models. For the time being this is the cost we pay for giving up good old ANOVA and being able to model crossed random effects and trial-level covariates. Just as long as we have a sense of what we are doing, and why.

Factor coding

Before we specify our full model we have to decide on the desired contrast coding. Recall that factor levels do not work very well in random structures, especially when double-bar syntax is used to exclude correlations (illustrated after the list below). So we need to define and use contrast-coding variables, as we did above with the deviation-coded condN. Briefly, our main contenders are:

  • Treatment coding (contr.treatment), in which one level is the reference (and model Intercept) and other levels are compared to it. In this coding, effects of other factors are evaluated at the reference level and not at the mean (therefore, they are simple effects and not main effects).
  • Deviation coding (contr.sum), in which levels are compared to the mean, which is also the model Intercept. In this coding, effects of other factors are evaluated at the mean and, therefore, correspond to main effects.
  • Successive difference coding (contr.sdif in package MASS), in which levels are ordered and each level is compared to the one next to it. As in deviation coding, effects of other factors are evaluated at the mean, which is also the Intercept, and correspond to main effects.
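As an aside, here is what the double-bar syntax looks like; it is only well-defined with numeric contrast-coding variables such as condN (a sketch, not fitted here, so shown commented out):

# Double-bar syntax estimates the random intercept and slope for condN
# without estimating their correlation:
# lmer(logRT ~ condN + (condN || subject), rld)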

For two-level factors, one variable suffices to encode the contrast. For three-level factors, two variables are needed; and so on. These variables must be explicitly defined and used in the model, as we did for condN above. Here are the values these variables must take for each level in order to encode the contrast according to each scheme, first for two levels and then for three levels:

library(MASS)
contr.treatment(2)
##   2
## 1 0
## 2 1
contr.sum(2)
##   [,1]
## 1    1
## 2   -1
contr.sdif(2)
##    2-1
## 1 -0.5
## 2  0.5
contr.treatment(3)
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1
contr.sum(3)
##   [,1] [,2]
## 1    1    0
## 2    0    1
## 3   -1   -1
contr.sdif(3)
##          2-1        3-2
## 1 -0.6666667 -0.3333333
## 2  0.3333333 -0.3333333
## 3  0.3333333  0.6666667

For our analysis, we have to decide whether we want condition effects to be evaluated at the mean (of words and pseudowords) or at a base level (e.g., Words); and whether we want lexicality to be evaluated at the mean (of conditions) or at one condition (e.g., Neutral). For condition, we should also decide whether we want to examine differences from a base level or from the mean.

Given that our priming effects are defined with respect to the Neutral condition, condition should be encoded either by treatment, with Neutral as the reference level (in which case the effect of lexicality will be evaluated at the Neutral condition), or by successive differences, in the order Mismatch-Neutral-Match, producing two differences from Neutral and evaluating lexicality at the mean of all conditions. Naturally, the interpretation of the interaction coefficients will vary accordingly.

Let us encode lexicality by treatment, with Word being the reference level, and condition by successive differences, with Neutral between Mismatch and Match. In this way lexicality will be evaluated at the mean of all conditions, condition (i.e., the two priming effects) will be evaluated on words, and interactions will correspond to differences in priming between words and pseudowords. We will need one variable for lexicality (with values 0 for Word and 1 for Pseudo, as defined by contr.treatment) and two variables for condition (one for the difference Neutral\(-\)Match and one for the difference Mismatch\(-\)Neutral, with values defined by contr.sdif).

data2$d1.Mi.Ne <- ifelse(data2$cond=="Mismatch",-0.6666667,0.3333333) # encoding inhibition by mismatch
data2$d2.Ne.Ma <- ifelse(data2$cond=="Match",0.6666667,-0.3333333)    # encoding facilitation by match
data2$t.lexP <- ifelse(data2$lex=="Pseudo", 1.0, ifelse(data2$lex=="Word", 0.0, NA)) # words as reference

Fitting, criticising, and trimming the model

The full model can then be fitted, including full random structures for participants and items, and the temporal dependence effects as discussed above. Naturally, there is no random slope of lexicality for items, since items can only be either words or pseudowords.

l5a <- lmer(logRT ~ (d1.Mi.Ne+d2.Ne.Ma)*t.lexP + CprevRT + Corder + 
            (0+Corder|subject) + ((d1.Mi.Ne+d2.Ne.Ma)*t.lexP|subject) + ((d1.Mi.Ne+d2.Ne.Ma)|item), 
            data2, REML=F, na.action=na.exclude)
sl5a <- summary(l5a)
print(sl5a,corr=F)
## Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
##   approximations to degrees of freedom [lmerMod]
## Formula: logRT ~ (d1.Mi.Ne + d2.Ne.Ma) * t.lexP + CprevRT + Corder + (0 +  
##     Corder | subject) + ((d1.Mi.Ne + d2.Ne.Ma) * t.lexP | subject) +  
##     ((d1.Mi.Ne + d2.Ne.Ma) | item)
##    Data: data2
## 
##      AIC      BIC   logLik deviance df.resid 
##  -6787.2  -6495.3   3430.6  -6861.2    19668 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.6424 -0.6335 -0.1050  0.5034  6.5565 
## 
## Random effects:
##  Groups    Name            Variance  Std.Dev. Corr
##  item      (Intercept)     0.0103985 0.10197
##            d1.Mi.Ne        0.0021908 0.04681   0.19
##            d2.Ne.Ma        0.0036530 0.06044  -0.07 -0.56
##  subject   (Intercept)     0.0288948 0.16998
##            d1.Mi.Ne        0.0012957 0.03600  -0.40
##            d2.Ne.Ma        0.0010379 0.03222  -0.16 -0.52
##            t.lexP          0.0074521 0.08633  -0.41  0.42 -0.11
##            d1.Mi.Ne:t.lexP 0.0011401 0.03376   0.41 -0.71  0.17 -0.43
##            d2.Ne.Ma:t.lexP 0.0008381 0.02895  -0.19  0.71 -0.84  0.23 -0.64
##  subject.1 Corder          0.0016553 0.04069
##  Residual                  0.0372458 0.19299
## Number of obs: 19705, groups:  item, 288; subject, 74
## 
## Fixed effects:
##                   Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)      6.589e+00  2.161e-02  1.020e+02 304.928  < 2e-16 ***
## d1.Mi.Ne        -5.087e-02  7.574e-03  1.080e+02  -6.716 8.94e-10 ***
## d2.Ne.Ma        -5.927e-02  7.960e-03  1.420e+02  -7.445 8.60e-12 ***
## t.lexP           7.319e-03  1.590e-02  2.630e+02   0.460   0.6458    
## CprevRT          2.991e-02  1.864e-03  1.919e+04  16.047  < 2e-16 ***
## Corder          -1.241e-02  4.934e-03  7.400e+01  -2.516   0.0141 *  
## d1.Mi.Ne:t.lexP  7.903e-02  9.611e-03  1.150e+02   8.222 3.46e-13 ***
## d2.Ne.Ma:t.lexP  1.069e-02  1.039e-02  1.600e+02   1.029   0.3050    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model converged, giving no cause for concern in the random structure (no zero variances or correlations of \(\pm 1\)). This does not mean that it is a “parsimonious” model.

summary(rePCA(l5a))  # rePCA is provided by the RePsychLing package
## $item
## Importance of components:
##                          [,1]   [,2]    [,3]
## Standard deviation     0.5323 0.3496 0.17478
## Proportion of Variance 0.6497 0.2802 0.07005
## Cumulative Proportion  0.6497 0.9300 1.00000
## 
## $subject
## Importance of components:
##                          [,1]   [,2]    [,3]    [,4]    [,5]    [,6] [,7]
## Standard deviation     0.9124 0.4083 0.25511 0.21082 0.13925 0.08932    0
## Proportion of Variance 0.7327 0.1468 0.05728 0.03912 0.01707 0.00702    0
## Cumulative Proportion  0.7327 0.8795 0.93679 0.97591 0.99298 1.00000    1

It seems that all three variance components in the item structure are required, and that the participant structure is overparameterized, with seven parameters but six dimensions accounting for 100% of the variance (maybe fewer, since five dimensions account for 99.3%). We can try to eliminate the d2 \(\times\) lex interaction, which is both high-order and low-variance, and see if that significantly deteriorates model fit. (Noting, again, that according to the “maximal structure” guideline, we should not remove this variance component if we want to test the significance of the corresponding fixed effect with appropriate Type-I error control.)

l5b <- lmer(logRT ~ (d1.Mi.Ne+d2.Ne.Ma)*t.lexP + CprevRT + Corder + 
            (0+Corder|subject) + (d1.Mi.Ne*t.lexP+d2.Ne.Ma|subject) + ((d1.Mi.Ne+d2.Ne.Ma)|item), 
            data2, REML=F, na.action=na.exclude)
anova(l5a,l5b)
## Data: data2
## Models:
## ..1: logRT ~ (d1.Mi.Ne + d2.Ne.Ma) * t.lexP + CprevRT + Corder + (0 + 
## ..1:     Corder | subject) + (d1.Mi.Ne * t.lexP + d2.Ne.Ma | subject) + 
## ..1:     ((d1.Mi.Ne + d2.Ne.Ma) | item)
## object: logRT ~ (d1.Mi.Ne + d2.Ne.Ma) * t.lexP + CprevRT + Corder + (0 + 
## object:     Corder | subject) + ((d1.Mi.Ne + d2.Ne.Ma) * t.lexP | subject) + 
## object:     ((d1.Mi.Ne + d2.Ne.Ma) | item)
##        Df     AIC     BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## ..1    31 -6793.7 -6549.1 3427.8  -6855.7                        
## object 37 -6787.2 -6495.3 3430.6  -6861.2 5.512      6       0.48

The difference is not statistically significant, so we might want to keep the reduced model. This time the estimated dimensionality warrants the number of parameters:

summary(rePCA(l5b))
## $item
## Importance of components:
##                          [,1]   [,2]    [,3]
## Standard deviation     0.5323 0.3493 0.17421
## Proportion of Variance 0.6504 0.2800 0.06965
## Cumulative Proportion  0.6504 0.9304 1.00000
## 
## $subject
## Importance of components:
##                          [,1]   [,2]   [,3]    [,4]    [,5]    [,6]
## Standard deviation     0.9096 0.4017 0.2110 0.14153 0.13669 0.02177
## Proportion of Variance 0.7715 0.1505 0.0415 0.01868 0.01742 0.00044
## Cumulative Proportion  0.7715 0.9220 0.9635 0.98214 0.99956 1.00000

You may be hesitant to remove a variance component from the random structure because that is supposed to render the test for the corresponding fixed effect anti-conservative. Is that really the case when the random effect is not statistically significant?

sl5b <- lme4:::summary.merMod(l5b)
coef(sl5a)["d2.Ne.Ma:t.lexP","t value"]
## [1] 1.028975
coef(sl5b)["d2.Ne.Ma:t.lexP","t value"]
## [1] 1.083587

The \(t\) value increased somewhat (by 5.3%); you judge whether or not you are prepared to accept that. One argument for removing the nonsignificant variance component is that, if it is so small as to be nonsignificant, how can it be too large to allow the fixed effect to be significant? That is, the purpose of these random variance components is to serve as a backdrop against which fixed effects are evaluated. If the random variance is very small then evidently it cannot “wash out” the fixed effect, so removing it does not render our criterion anticonservative (at least not unreasonably so).

Finally, let us also consider model criticism. We will apply the criticism to l5a (the full model):

crit <- abs(scale(resid(l5a)))
cutf <- qnorm(.99995) # one in ten thousand
sum(crit>cutf) # number of points to be removed
## [1] 30
data2$logRTr <- ifelse(crit>cutf, NA, data2$logRT)[,1]
l5c <- lmer(logRTr ~ (d1.Mi.Ne+d2.Ne.Ma)*t.lexP + CprevRT + Corder + 
            (0+Corder|subject) + ((d1.Mi.Ne+d2.Ne.Ma)*t.lexP|subject) + ((d1.Mi.Ne+d2.Ne.Ma)|item), 
            data2, REML=F, na.action=na.exclude)
#Save time by using the original lme4 summary instead of p-value approximation in lmerTest:
#sl5c <- lme4:::summary.merMod(l5c) 
sl5c <- summary(l5c)
print(sl5c,corr=F)
## Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
##   approximations to degrees of freedom [lmerMod]
## Formula: logRTr ~ (d1.Mi.Ne + d2.Ne.Ma) * t.lexP + CprevRT + Corder +  
##     (0 + Corder | subject) + ((d1.Mi.Ne + d2.Ne.Ma) * t.lexP |  
##     subject) + ((d1.Mi.Ne + d2.Ne.Ma) | item)
##    Data: data2
## 
##      AIC      BIC   logLik deviance df.resid 
##  -7339.3  -7047.5   3706.6  -7413.3    19638 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.6639 -0.6359 -0.0989  0.5113  3.8397 
## 
## Random effects:
##  Groups    Name            Variance Std.Dev. Corr                         
##  item      (Intercept)     0.010406 0.10201                               
##            d1.Mi.Ne        0.002291 0.04786   0.18                        
##            d2.Ne.Ma        0.003713 0.06094  -0.07 -0.56                  
##  subject   (Intercept)     0.028609 0.16914                               
##            d1.Mi.Ne        0.001575 0.03969  -0.38                        
##            d2.Ne.Ma        0.001141 0.03379  -0.13 -0.58                  
##            t.lexP          0.007396 0.08600  -0.41  0.40 -0.08            
##            d1.Mi.Ne:t.lexP 0.001538 0.03922   0.37 -0.73  0.36 -0.38      
##            d2.Ne.Ma:t.lexP 0.001233 0.03511  -0.17  0.67 -0.86  0.16 -0.73
##  subject.1 Corder          0.001663 0.04078                               
##  Residual                  0.036114 0.19004                               
## Number of obs: 19675, groups:  item, 288; subject, 74
## 
## Fixed effects:
##                   Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)      6.587e+00  2.152e-02  1.020e+02 306.133  < 2e-16 ***
## d1.Mi.Ne        -5.221e-02  7.821e-03  1.080e+02  -6.676 1.09e-09 ***
## d2.Ne.Ma        -5.879e-02  8.033e-03  1.430e+02  -7.318 1.66e-11 ***
## t.lexP           8.135e-03  1.588e-02  2.650e+02   0.512   0.6088    
## CprevRT          3.120e-02  1.863e-03  1.915e+04  16.751  < 2e-16 ***
## Corder          -1.281e-02  4.939e-03  7.400e+01  -2.594   0.0114 *  
## d1.Mi.Ne:t.lexP  8.071e-02  9.892e-03  1.160e+02   8.159 4.48e-13 ***
## d2.Ne.Ma:t.lexP  9.800e-03  1.062e-02  1.550e+02   0.923   0.3576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(1,2))
ylim <- max(crit)
qqnorm(scale(resid(l5a)),ylim=c(-ylim,ylim))
qqline(scale(resid(l5a)))
qqnorm(scale(resid(l5c)),ylim=c(-ylim,ylim))
qqline(scale(resid(l5c)))

cor(na.omit(predict(l5a)),na.omit(data2$logRT))^2
## [1] 0.5671786
cor(na.omit(predict(l5c)),na.omit(data2$logRTr))^2
## [1] 0.5754413
BIC(l5a,l5c)
## Warning in BIC.default(l5a, l5c): models are not all fitted to the same
## number of observations
##     df       BIC
## l5a 37 -6495.306
## l5c 37 -7047.456

Removal of only 30 points (\(p < .0001\), two-tailed) has increased \(R^2\) from 0.567 to 0.575, has greatly reduced BIC, and has cleared up the tails of the residual distribution. We might remove more points to further improve fit, balanced by the consideration that it is not a good idea to throw out data just because we don’t like them.

Interpretation

So, what are the results of our study?

For words, we have significant facilitation by matching items relative to neutral (d2.Ne.Ma), and significant inhibition by mismatching items relative to neutral (d1.Mi.Ne). To interpret these properly you need to refer to the definition of the two variables, noting that the negative sign of the coefficients indicates larger values for Mismatch (compared to Neutral) and larger values for Neutral (compared to Match).

There are significant temporal dependencies in the data, as there always are, but we are not interested in interpreting these effects; they were only included to improve model fit.

There is no significant difference between words and pseudowords in facilitation (i.e., in the difference between Match and Neutral), indicated by the interaction term d2.Ne.Ma:t.lexP. This does not imply that there is significant priming in pseudowords; if we care about that we’ll have to test it specifically (in another, appropriately defined model).

There is a significant difference between words and pseudowords in inhibition (i.e., in the difference between Mismatch and Neutral), indicated by the interaction term d1.Mi.Ne:t.lexP. This does not imply anything in particular about priming in pseudowords. Looking at the coefficients, the mismatching priming effect in pseudowords must be d1.Mi.Ne:t.lexP \(+\) d1.Mi.Ne \(=\) 0.0285, that is, positive rather than negative, so there is certainly no inhibition. There may or may not be significant facilitation; if we care about that we’ll have to test it specifically (in another, appropriately defined model).
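The sum can be read off the fitted model directly:

# Mismatch-vs-neutral slope for pseudowords, from the l5c estimates above:
# -0.05221 + 0.08071 = 0.0285
sum(fixef(l5c)[c("d1.Mi.Ne","d1.Mi.Ne:t.lexP")])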

There is no significant difference between words and pseudowords, averaging over (matching, mismatching, and neutral) conditions. This is not very informative insofar as responses to targets are demonstrably affected by the primes, at least in some conditions, differentially for the two types of stimuli. (In actual data analysis, we should probably have encoded condition by treatment, to evaluate the lex contrast on the neutral condition rather than on the mean, but I chose successive differences coding above for demonstration purposes.)

An additional note of caution regarding the interpretation of our effects: the final model (l5c) lists 8 fixed-effects coefficients, with associated standard errors and significance tests. If we are on a fishing trip, looking to see what came up significant, then we are inflating our Type I error probability by as much as this number of tests (\(0.05 \times 8 = 0.4\) if our tests were entirely independent, which they certainly are not, so this is a worst-case scenario).

Hopefully, there was a reason for running the experiment and collecting the data in the first place, and this has led to specific research questions and hypotheses to test. Test results must be adjusted according to the number of tests considered. Conversely, only tests posed by our research hypotheses may be considered and interpreted. Otherwise, you may be interpreting random noise. In any case, it is best to replicate, but that’s a whole different issue.
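For instance, if all the coefficient tests were genuinely of interest, the reported p values could be adjusted; a sketch using Holm’s method:

pv <- coef(sl5c)[-1,"Pr(>|t|)"]       # fixed-effect p values, dropping the intercept
round(p.adjust(pv, method="holm"), 4) # Holm-adjusted for the seven remaining tests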

Suggestion: A piecemeal approach to random structures

Because most of our LMEM trouble originates in the random structure, it is worth spending a bit more time on trying to bring it under control. What we would like to have is some indication of trouble ahead that can be revealed without having to wait for hours while our model fails to converge.

One informal observation that seems to hold generally is that the addition of parameters to the model will tend to reduce, rather than increase, the variance in random-effects components. That seems reasonable insofar as modeling some of the available variance with one component leaves less variance for the others. Fixed effects, especially, will eat random effects away to the extent that they are substantial (and significant). This suggests that it may be fruitful (and certainly much faster) to examine random effects in the absence of fixed effects, to get a sense of the maximal variance that can be squeezed out of them and, therefore, the extent to which they may be expected to cause instabilities further on (if they are too small, or too strongly correlated with one another, or uncorrelated).

A second observation is that grouping factors can be treated as somewhat separable. Although they are not entirely independent, since they are based on the same data and share certain dependencies, we can reasonably expect that effects related to participants will be largely distinct from those related to items. This suggests that it may be instructive to begin with an examination of participant random effects, separately from item random effects.

At any rate, variable coding must precede these explorations, as random effects depend critically on it. So, first of all, we must decide on the appropriate coding for our variables given our experimental design and research hypotheses, so that the encoded contrasts reflect exactly what we need to test. Otherwise we may be wasting our time optimizing a useless random structure.

For example, in our last model, we could have started like this:

l6a <- lmer(logRTr ~ 1 + (0+Corder|subject) + ((d1.Mi.Ne+d2.Ne.Ma)*t.lexP|subject), 
            data2, REML=F, na.action=na.exclude)
VarCorr(l6a)
##  Groups    Name            Std.Dev. Corr                              
##  subject   Corder          0.047490                                   
##  subject.1 (Intercept)     0.199275                                   
##            d1.Mi.Ne        0.077572 -0.546                            
##            d2.Ne.Ma        0.076064 -0.308  0.203                     
##            t.lexP          0.083603 -0.223  0.031 -0.136              
##            d1.Mi.Ne:t.lexP 0.092323  0.583 -0.938 -0.514  0.026       
##            d2.Ne.Ma:t.lexP 0.056353 -0.161  0.493 -0.730  0.046 -0.198
##  Residual                  0.216084
summary(rePCA(l6a))
## $subject
## Importance of components:
##                          [,1]   [,2]   [,3]    [,4]    [,5]    [,6]
## Standard deviation     1.0010 0.4612 0.4114 0.34349 0.21977 0.03875
## Proportion of Variance 0.6457 0.1371 0.1091 0.07603 0.03113 0.00097
## Cumulative Proportion  0.6457 0.7828 0.8919 0.96791 0.99903 1.00000
##                             [,7]
## Standard deviation     3.219e-07
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00
l6b <- lmer(logRTr ~ 1 + ((d1.Mi.Ne+d2.Ne.Ma)|item), 
            data2, REML=F, na.action=na.exclude)
VarCorr(l6b)
##  Groups   Name        Std.Dev. Corr         
##  item     (Intercept) 0.097350              
##           d1.Mi.Ne    0.080112 -0.005       
##           d2.Ne.Ma    0.089719 -0.038 -0.263
##  Residual             0.264434
summary(rePCA(l6b))
## $item
## Importance of components:
##                          [,1]   [,2]   [,3]
## Standard deviation     0.3717 0.3611 0.2718
## Proportion of Variance 0.4036 0.3807 0.2157
## Cumulative Proportion  0.4036 0.7843 1.0000

Compare these random effects to those in the full model:

VarCorr(l5c)
##  Groups    Name            Std.Dev. Corr                              
##  item      (Intercept)     0.102010                                   
##            d1.Mi.Ne        0.047862  0.179                            
##            d2.Ne.Ma        0.060936 -0.066 -0.565                     
##  subject   (Intercept)     0.169141                                   
##            d1.Mi.Ne        0.039689 -0.380                            
##            d2.Ne.Ma        0.033785 -0.131 -0.578                     
##            t.lexP          0.085999 -0.411  0.398 -0.083              
##            d1.Mi.Ne:t.lexP 0.039216  0.372 -0.730  0.359 -0.381       
##            d2.Ne.Ma:t.lexP 0.035115 -0.166  0.669 -0.861  0.164 -0.735
##  subject.1 Corder          0.040783                                   
##  Residual                  0.190037
summary(rePCA(l5c))
## $item
## Importance of components:
##                          [,1]   [,2]    [,3]
## Standard deviation     0.5405 0.3600 0.18077
## Proportion of Variance 0.6429 0.2851 0.07192
## Cumulative Proportion  0.6429 0.9281 1.00000
## 
## $subject
## Importance of components:
##                          [,1]   [,2]    [,3]    [,4]    [,5]    [,6] [,7]
## Standard deviation     0.9232 0.4149 0.30504 0.21460 0.13846 0.11046    0
## Proportion of Variance 0.7133 0.1441 0.07787 0.03854 0.01604 0.01021    0
## Cumulative Proportion  0.7133 0.8573 0.93521 0.97375 0.98979 1.00000    1

We see that variances are usually smaller, and sometimes much smaller, in the full model, and that the participant and item structures, and their dimensionality estimates, are reasonably similar between the separate and the combined analyses. In this example the piecemeal approach does not seem especially useful, but other datasets may require extra taming before producing the sought-after tests of our hypotheses.

Conclusions and some ideas/recommendations

Maximal random structures make the LMEMs directly comparable to ANOVA (when both can be similarly applied). So we should strive for maximal random structures, following Barr et al. (2013), to keep our statistics consistent and our alpha levels under control.

Temporal dependencies in the data violate model assumptions. So we should include modeling of trial order and previous RT, following Baayen and Milin (2010). This improves model fit and may even clarify effects of interest.

Too many parameters in the random structure make no sense and can cause all sorts of trouble, including preventing convergence. We cannot work with unconverged models and don’t want to work with unreliable solutions. So we should strive for parsimonious models, following Bates et al. (2015).

Models with two or more factors test many hypotheses simultaneously, that is, they constitute a “multiple comparison” situation. If you are really interested in multiple comparisons, correct your alpha levels accordingly, following Cramer et al. (2015). Otherwise, examine and report only test results pertaining to the specific questions posed by your theory and design. This will have the added benefit of making your results section readable.

Statistics is not going to tell you what you care about or what to test. Decide in advance what contrasts you are interested in, based on your experimental design and research questions. Encode your conditions on the basis of what you care to find out. Think carefully about reference levels, if they naturally occur in your design. Do not strive to fit an omnibus model unless you are specifically interested in the interactions. In the above example, do we really care whether priming differs between words and pseudowords? Or do we just need to know whether there is facilitation and inhibition in words and pseudowords, separately? In the latter case, there is no reason to fit a single model with both types of items.

Don’t forget to visualize and clean up your data before you even consider fitting an LMEM. Check individual accuracy and RT distributions (qqmath(~logRT|subject,data) is very useful; you will find qqmath in the lattice package).
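For instance, assuming the lattice package is installed:

library(lattice)
qqmath(~logRT|subject, data2) # one quantile plot panel per participant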

If you just throw your data into a full-omnibus-maximal LMEM, you are asking for trouble. Check your random structures to get an idea of what might be supported by the data. Although random effects won’t stay the same when combined with other random and fixed effects, low-variance components are not likely to gain variance in the presence of other factors; instead, the addition of more parameters will likely diminish the related random variance components. Investigate simple models with participants-only and items-only random structures, compute their dimensionality, and try to anticipate problems before you build something that will take hours to report nonconvergence.

We are in a transitional period, in which statistics is evolving and our tools and methods are in flux. Null hypothesis significance testing meets model fitting, and it looks more like a collision than a merger. Deal with it. And keep an eye open for further developments, which are likely imminent, given increased statistical sophistication and computational power.


References

Baayen, R. H., & Milin, P. (2010). Analyzing reaction times. International Journal of Psychological Research, 3(2), 12–28.

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.

Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. http://de.arxiv.org/pdf/1506.04967

Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., … & Wagenmakers, E. J. (2015). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review. doi:10.3758/s13423-015-0913-5


Send corrections, comments, and suggestions to protopap[at]gmail[dot]com

First draft of 19 October 2015