Convenience function for parallel estimation of multiple (lmer) models

Relevance

When fitting models to data sets, I sometimes follow a more data-driven (some might say: chaotic) approach. This often involves fitting multiple models with differing parameterizations so the same data set, or using different variables and/or codings of variables.

As long as the data set is small or the models are simple, this seldomly poses a problem (besides me getting confused). More recently, I started to fit more complex mixed effect models, often involving crossed random factors, to medium to large data sets. Although the great lme4 package handles those kinds of models efficiently, I often (for example right now) find myself sitting in front of my computer waiting for results. This can become somewhat tedious. To speed up this process, I wrote a simple convenience function that uses parallel::mclapply() to estimate multiple modells at the same time. As the function requires parallel::mclapply(), it does not work on Windows [1].

The function

f_lmer_mc = function(data, calls, mc.cores) {
    require(parallel)
    if (is.data.frame(data)) 
        data = replicate(length(calls), data, simplify = F)
    for (i in 1:length(data)) attr(data[[i]], "cll") = calls[i]
    m.list = mclapply(data, function(i) eval(parse(text = attr(i, "cll"))), 
        mc.cores = mc.cores)
    return(m.list)
}

Example

The function is quite simple. I will demonstrate its use with the sleepstudy data from the lme4 package. In principle, the function will work for all kind of models which have a similar call as lmer. Therefore, the package has to be loaded separately.

library(lme4)
## Loading required package: Matrix Loading required package: Rcpp

The function needs three inputs: data, calls, and mc.cores.

data: If you want to fit all models to the same data set, data is just the name of the data.frame. In or example: sleepstudy. The data.frame is replicated as many times as there are model calls. If you want to fit models to different data sets, the function accepts a list of data.frame objects.

calls: The function calls (in this case to lmer()) are stored as a character vector. If you use different data sets, the order of data sets in the data list has to be the same as the order of function calls. Each call has to include “data = i”! This could probably be easily made more convenient, but I am to lazy to figure out exactly how.

models = c("lmer(Reaction ~ 1 + (1 | Subject), data = i)", "lmer(Reaction ~ Days + (Days | Subject), data = i)", 
    "lmer(Reaction ~ Days + (1 | Subject) + (0 + Days | Subject), data = i)")

In this example, we fit different models to the sleepstudy data. The first model is an intercept-only model. The second call fits a latent growth curve model. The third model omits the correlation between intercept and slope in the growth curve model.

mc.cores gives mclapply() the number of cores to use for parallelization. It makes sense to use as many cores as you have models to fit. Consequently, it does not make to fit more models as you have cores. If you do, you have to wait longer before getting the first results.

We call the convenience function and save the results (objects of class lmerMod) in a list.

m.list = f_lmer_mc(sleepstudy, models, 3)
## Loading required package: parallel

Ofcourse only a neglectable amount of time is saved in this simple example.

The results of single models can be viewed like all list elements. For example the second model:

summary(m.list[[2]])
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: i
## 
## REML criterion at convergence: 1744
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.954 -0.463  0.023  0.463  5.179 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.1    24.74        
##           Days         35.1     5.92    0.07
##  Residual             654.9    25.59        
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   251.41       6.82    36.8
## Days           10.47       1.55     6.8
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138

If you want to see the results of all models, use:

lapply(m.list, summary)
## [[1]]
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ 1 + (1 | Subject)
##    Data: i
## 
## REML criterion at convergence: 1904
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.498 -0.550 -0.148  0.512  3.345 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1278     35.8    
##  Residual             1959     44.3    
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   298.51       9.05      33
## 
## [[2]]
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: i
## 
## REML criterion at convergence: 1744
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.954 -0.463  0.023  0.463  5.179 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.1    24.74        
##           Days         35.1     5.92    0.07
##  Residual             654.9    25.59        
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   251.41       6.82    36.8
## Days           10.47       1.55     6.8
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138
## 
## [[3]]
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 | Subject) + (0 + Days | Subject)
##    Data: i
## 
## REML criterion at convergence: 1744
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.963 -0.463  0.020  0.465  5.186 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  Subject   (Intercept) 627.6    25.05   
##  Subject.1 Days         35.9     5.99   
##  Residual              653.6    25.57   
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   251.41       6.89    36.5
## Days           10.47       1.56     6.7
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.184

And to compare the fits (make sure that a model comparison makes sense - the function runs no checks!):

do.call(anova, m.list)
## Warning: failed to find unique model names, assigning generic names
## refitting model(s) with ML (instead of REML)
## Data: i
## Models:
## MODEL1: Reaction ~ 1 + (1 | Subject)
## MODEL3: Reaction ~ Days + (1 | Subject) + (0 + Days | Subject)
## MODEL2: Reaction ~ Days + (Days | Subject)
##        Df  AIC  BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## MODEL1  3 1917 1926   -955     1911                             
## MODEL3  5 1762 1778   -876     1752 158.54      2     <2e-16 ***
## MODEL2  6 1764 1783   -876     1752   0.06      1        0.8    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Limitations

Most importantly, the function mostly makes sense if all estimations require an equal amount of time. You have to wait for the slowest estimation to finish before you get any results.

Notes

[1] That is, it will work on Windows machines, but without parallel model estimation. All models will be estimated one at a time, hence making the convenience function more or less useless. See http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf