1 Introduction

In this document I demonstrate how to use library simr to run power simulations for already-fitted (G)LMMs, that is, not by constructing variables and data structures from scratch but by using fixed effect sizes and random structures from models fit to existing data. The models shown here are a bit more complex than those shown in the simr tutorials. There are also additional tricks and solutions to issues with nondefault contrasts and use of parallel computation.

I am grateful to Peter Green for help and support with debugging and solving problems.

Note: None of the simulation code shown here is actually evaluated in the context of this Rmarkdown document! Τhe code is copy-pasted from the actual scripts that have been run, but for computational reasons code chunks are set with eval=FALSE (except for showing the data structure and the model summary). The model used for this example takes a few minutes to fit, so you can imagine that 1000 simulated datasets would take a while, even on an 8-core machine. The simulations I used were actually run on a 64-core HPC Linux cluster. You should select a model to work with depending on available computing power. For example, you may want to trim your model to a simpler version that can be fitted faster.

We first load the necessary libraries.

library(MASS)
library(lme4) 
library(simr)
library(future)
library(future.apply)

We use MASS for the successive-differences contrast (contr.sdif) instead of the treatment-contrast default, as we want mean intercepts so that effects are interpretable as main (i.e., average) effects rather than as effects at some reference level.

Library future.apply (based on the future framework) is used to distribute simulations over multiple cores for faster overall computation (even though each individual model fit runs on a single core).

2 Setting up the model

For clarity and completeness we go through all the steps of fitting the model. However, this is not necessary for the power simulation. You can skip this whole section if you already have a fitted model you are happy with.

The study where the data come from concerns an orthographic learning task in Dutch where participants first read a set of sentences which contain certain target items (words). Participants are children in Grades 2 and 5. Each item may appear in either 2 or 6 sentences; or it may not appear at all. Which items are in which sentence context and exposure condition varies with participant. Some items are orally known words and some are novel (pseudowords). After this exposure phase participants perform an orthographic choice task where they have to identify the correctly spelled items among other options.

We are generally interested in the orthographic learning effect, that is, the effect of exposure; broken into early learning (two exposures vs. none) and late learning (six exposures vs. two). More importantly, we are interested in the effects of grade and lexicality on learning. These are interaction effects between the exposure factor (reps, with three levels: 0, 2, 6) and either grade (G2 and G5) or lexicality (words vs. pseudowords). It is these interactions that we will be running power simulations for.

2.1 Data loading and variable coding

The data are loaded from a compressed Rda file (saved using save instead of write.table).

load("som.Rda")
str(som)

## 'data.frame':    4063 obs. of  15 variables:
##  $ sID     : Factor w/ 127 levels "40010118","40030118",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ targetS : Factor w/ 32 levels "accu","beitel",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ spelitem: int  14 15 26 6 25 31 18 1 4 2 ...
##  $ reps    : Factor w/ 3 levels "0","2","6": 2 1 1 2 1 1 1 1 2 1 ...
##  $ spelaccu: int  0 0 1 0 0 0 0 1 0 1 ...
##  $ Nsyl2   : Factor w/ 2 levels "1","2": 2 2 2 1 2 2 1 1 1 1 ...
##  $ lex2    : Factor w/ 2 levels "p","w": 2 2 1 2 1 1 1 2 2 2 ...
##  $ orthitem: int  11 10 31 3 32 26 23 8 5 7 ...
##  $ orthaccu: int  1 1 1 1 0 0 1 1 1 1 ...
##  $ grade   : Factor w/ 2 levels "G2","G5": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lex2C   : num  -0.5 -0.5 0.5 -0.5 0.5 0.5 0.5 -0.5 -0.5 -0.5 ...
##  $ Nsyl2C  : num  0.5 0.5 0.5 -0.5 0.5 0.5 -0.5 -0.5 -0.5 -0.5 ...
##  $ gradeC  : num  -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
##  $ reps2   : Factor w/ 3 levels "2","0","6": 1 2 2 1 2 2 2 2 1 2 ...
##  $ repsC   : num  NA -0.5 -0.5 NA -0.5 -0.5 -0.5 -0.5 NA -0.5 ...

Independent variables that are two-level factors have already been manually difference-coded into numeric variables, so that one of the levels equals −0.5 and the other level equals +0.5. Using numeric variables rather than defining difference contrasts over the factors is necessary for using the double-bar (||) lme4 syntax to force correlations to zero (when this is statistically defensible, as in this case). Otherwise lme4 will still calculate within-factor correlations despite the double bar. The use of numeric variables is also the easiest way to let the random structure match the fixed effects (both amounting to a difference contrast rather than a treatment contrast).

The three-level variable rep requires two contrast variables to code the two differences, and this we can set up via contr.sdif from the MASS library. However, we do not want to retain the default generic names of these two contrasts (“2-1” and “3-2”) because it makes the results more difficult to read, so we replace their names with the labels of our conditions (the two differences: “2_0” and “6_2”). Note that we use underscores instead of dashes because that makes it easier to use these labels as indices when needed.

reps.contr.sdif <- contr.sdif(3)
colnames(reps.contr.sdif) <- c("2_0","6_2")
contrasts( som$reps ) <- reps.contr.sdif

Note also that we assign the contrast to the variable within the data frame, which causes it to be stored as an attribute that gets picked up whenever relevant. In this way we do not need to specify the contrast explicitly when we fit the model.

2.2 Using the successive-differences contrast in the random structure

Although we have defined successive-differences coding for the exposure variable (reps), the random structure will not pick this up and instead will calculate random effects for the factor levels as usual. To force the random structure to use the two successive-differences contrast variables that are defined by contr.sdif we need to use a special trick as shown by Kliegl (2014). To do this we need to recover the relevant vectors from the model matrix; and in order to get those we have to define a model with the factors in question. So we fit a dummy model for this purpose; we do not care about the fit and we use the simplest random structure (intercepts only) to speed up computation.

olfrm0 <- as.formula( "orthaccu ~ 1 + reps*lex2C*gradeC + (1|sID) + (1|targetS)" )
ol3c0 <- glmer( olfrm0, som, family=binomial )

We then recover the two vectors we need from the model matrix.

reps2_0 <- model.matrix(ol3c0)[,"reps2_0"] # 2 vs 0 exposures
reps6_2 <- model.matrix(ol3c0)[,"reps6_2"] # 6 vs 2 exposures

So we can then use these vectors in the random structure of the model we are actually interested in.

2.3 Target model fit

We fit a generalized linear mixed model with accuracy in an orthographic choice task as the dependent variable (orthaccu). The independent variables are number of exposures to the target item (reps; factor with three levels: 0, 2, 6; successive-difference-coded), lexicality (lex2C; numeric variable difference-coding the two levels of lexicality), and grade (gradeC; numeric variable difference-coding the two grades). The random factors include participants (sID) and target items (targetS). The random structure shown in the formula below has resulted after trimming a maximal model until a nonsingular fit can be achieved with no significant difference in model fit compared to the maximal (the trimming procedure is not shown here).

Note that vectors reps2_0 and reps6_2 represent the exposure (reps) contrast from the model matrix, and by using them instead of reps in the random structure we get the random structure to evaluate the random slopes for these variables instead of the treatment-contrast default variables, thereby matching the fixed effects. Otherwise, if we retained the default behavior we would have successive differences in the fixed effects and treatment contrast in the random effects, for the same factor.

olfrm3 <- as.formula( "orthaccu ~ 1 + reps*lex2C*gradeC + (reps6_2||sID) + (reps2_0+reps6_2+gradeC||targetS)" )
glmerctrlist <- glmerControl(optCtrl=list(maxfun=1e5), optimizer = "bobyqa")
ol3c3 <- glmer( olfrm3, som, family=binomial, control=glmerctrlist )
save(ol3c3,file="ol3c3.Rda")

Once evaluated, we save the model in a compressed Rda file for future use.

3 Running one simulation

To run the power simulation, we don’t need to fit the model every time; we can load the fitted model directly if we have saved it as above; everything we need can be found in the model object. (The entire previous section was only included for illustration and clarity.)

load("ol3c3.Rda")
Model <- ol3c3
print(summary(Model),corr=F)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: orthaccu ~ 1 + reps * lex2C * gradeC + (reps2_0 || sID) + (reps6_2 +  
##     reps2_0 + gradeC || targetS)
##    Data: som
## Control: glmerControl(optCtrl = list(maxfun = 1e+05), optimizer = "bobyqa")
## 
##      AIC      BIC   logLik deviance df.resid 
##   4959.1   5072.6  -2461.5   4923.1     4045 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.4548 -0.8205  0.3043  0.7788  2.1308 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  sID       (Intercept) 0.03327  0.1824  
##  sID.1     reps2_0     0.17604  0.4196  
##  targetS   (Intercept) 0.69803  0.8355  
##  targetS.1 reps6_2     0.05251  0.2292  
##  targetS.2 reps2_0     0.14876  0.3857  
##  targetS.3 gradeC      0.75922  0.8713  
## Number of obs: 4063, groups:  sID, 127; targetS, 32
## 
## Fixed effects:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.48616    0.15443   3.148  0.00164 ** 
## reps2_0               0.21955    0.09808   2.238  0.02519 *  
## reps6_2               0.26410    0.13077   2.020  0.04343 *  
## lex2C                -1.02539    0.30708  -3.339  0.00084 ***
## gradeC                0.48736    0.17750   2.746  0.00604 ** 
## reps2_0:lex2C        -0.33710    0.19614  -1.719  0.08568 .  
## reps6_2:lex2C         0.38908    0.25002   1.556  0.11966    
## reps2_0:gradeC        0.11517    0.17768   0.648  0.51688    
## reps6_2:gradeC        0.12052    0.22107   0.545  0.58563    
## lex2C:gradeC         -0.64708    0.34879  -1.855  0.06356 .  
## reps2_0:lex2C:gradeC  0.09637    0.35537   0.271  0.78625    
## reps6_2:lex2C:gradeC  0.02726    0.41569   0.066  0.94771    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.1 Set up parameters

We define the basic parameters of the simulation: How many samples to run (1000) and what the \(\alpha\) level for statistical significance is (0.05).

nsim <- 1000
alpha <- 0.05

We then identify which fixed effect we want to run the the simulation for and what effect size we are interested in (see the model summary above). This “effect size” refers to the unstandardized estimate for the effect (regression coefficient) as would be seen in the model summary.

teff <- "reps2_0:gradeC"
veff <- 0.50

So we are interested in the probability of detecting an effect of 0.50 for the interaction of early learning and grade. To use simr to find this out we need to create a model structure based on the fitted model that incorporates the effect query.

First we retrieve the fixed effects vector from the model and set the effect in question to the desired value.

fef <- fixef(Model)
fef[teff] <- veff

Then we retrieve the random effects structure from the model and retain only the variance-covariance matrix, discarding the standard deviations and the correlation matrix.

vcv <- VarCorr(Model)
for (l in names(vcv)) {
  attr(vcv[[l]],"stddev") <- NULL
  attr(vcv[[l]],"correlation") <- NULL
}

Finally we recover the data frame from the model, retain only the independent variables and random factors from it, and add the two vectors coding the successive-differences contrast. We could have used the original data frame and the vectors as extracted from the dummy model, which are identical, but here we are showing how to get everything from the model object itself.

sdata <- cbind(Model@frame[,c("sID","targetS","reps","lex2C","gradeC")],
               as.data.frame(model.matrix(Model))[,c("reps2_0","reps6_2")])
# you can also probably just get away with:
sdata <- Model@frame[,-1] # just exclude the dependent variable in the 1st column, the rest is already there

Note that there should be no missing values in the data frame! Use na.omit() if passing a data frame with NA values in any of these variables.

We also need to specify any optimizer parameters we want to be used with model fit.

glmerctrlist <- glmerControl(optCtrl=list(maxfun=1e5), optimizer = "bobyqa")

3.2 Build a model object

We use all these components to build the model structure that will be used for the simulation. This also includes the model formula retrieved directly from the model object as well.

tglmer <- makeGlmer(attr(Model@frame,"formula"), 
                    family="binomial", fixef=fef, VarCorr=vcv, data=sdata)

3.3 Run simulation with parallel execution

Before runing the simulation, we set up R to use as many cores as we have available. On a regular windows PC we would just initialize a “multisession” plan as follows:

plan(multisession)

On our HPC Linux cluster it turns out that we get best performance if we limit each PID to use one core only and run as many instances (PIDs) as the available cores:

Sys.setenv(OMP_NUM_THREADS=1)
plan(multisession,workers=64)

The actual simulation is running using the following code. The trick that achieves parallel support for simr via future/future.apply was based on the discussion for simr issue #39 on github. Specifically, instead of running the desired number of simulations using function powerSim directly for Nsim samples, this trick runs the function repeatedly Nsim times, each time with a single sample. In this way each run can be allocated to a different instance on a different core, and the future framework then collects and assembles all output into a list.

pstests <- future_replicate(nsim, powerSim(tglmer, nsim=1, test=fixed(teff,"z"),
                                           fitOpts=list(control=glmerctrlist),
                                           progress = FALSE),
                            future.globals = c("powerSim","tglmer","teff","glmerctrlist"),
                            simplify = FALSE)
plan(sequential)

The future.globals parameter identifies the variables that should be available to the additional instances of R that execute powerSim, that is, all the values needed for the arguments in the call to powerSim.

Note that we revert to non-parallel execution after the simulations are finished.

3.4 Compute power estimate

We can then count the proportion of runs for which the target effect was statistically significant, and this is our power.

pvals <- sapply(pstests,function(x){x$pval})
print(round(sum(pvals<alpha)/length(pvals),2))

(No output is shown here because no simulation was run.)

4 Power curves by sample size

The above procedure lets us estimate power for a specific pre-determined data structure and size. However, it is possible to use simr to estimate power for more extended (or more limited) datasets. This is achieved by first extending the dataset and then running the power simulation as above with the extended dataset.

To extend our dataset we specify which grouping factor to expand (by adding levels) and how many levels it should have. For example, to expand to 200 participants we set sID to 200 as follows:

nsub <- 200
tglx <- extend(tglmer,along="sID",n=nsub)

We can then use function powerCurve to run simulations for specific values of this factor (e.g., for 150, 175, and 200 participants). However, this does not allow parallelization because powerSim is called internally from powerCurve and we have no control over allocation to cores.

Instead, we can run our own simulations with extended datasets as we wish. Here is an example running power simulations for larger groups of items (by expanding factor targetS rather than sID). The original dataset had 32 items, so we can see how much more power we could get for 64, 96, and 128 items (assuming the same fixed effects vector and random effects structure).

for (nitem in c(32*1:3+32)) {
  
  print(nitem)
  
  tglx <- extend(tglmer,along="targetS",n=nitem)
  
  Sys.setenv(OMP_NUM_THREADS=1)
  plan(multisession,workers=64) 
  
  pstests <- future_replicate(nsim, powerSim(tglx, nsim=1, test=fixed(teff,"z"),
                                             fitOpts=list(control=glmerctrlist),
                                             progress = FALSE),
                              future.globals = c("powerSim","tglx","teff","glmerctrlist"),
                              simplify = FALSE)
  plan(sequential)
  
  pvals <- sapply(pstests,function(x){x$pval})
  print(round(sum(pvals<alpha)/length(pvals),2))
}

Send corrections, comments, and suggestions to protopap[at]gmail[dot]com

First draft of 14 June 2021

How to run power simulations for fitted (G)LMMs using library simr

Athanassios Protopapas

14 June 2021