estimating group membership models with lme4

Consider a situation where we have multiple items of some sort which have associated effects, but we can’t association observations with single items; instead, each observation is associated with a group of items. (Two examples which have come up are (1) authorship of papers and (2) hockey players on a team.) These are not quite identical to “multi-membership models” (I think), although similar techniques to those shown here could work for multi-membership models.

Load packages (we don’t really need anything beyond lme4; the rest are for convenience/drawing pictures).

library(lme4)
library(broom.mixed)
library(ggplot2); theme_set(theme_bw())
library(Matrix)

Construct a simulated example: first, simulate the design (structure).

nm <- 20
nobs <- 500
set.seed(101)
## choose items for observations
pres <- matrix(rbinom(nobs*nm, prob=0.25, size=1),nrow=nobs,ncol=nm)
dimnames(pres) <- list(NULL,LETTERS[seq(nm)])
pres[1:5,]

##      A B C D E F G H I J K L M N O P Q R S T
## [1,] 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
## [5,] 0 1 0 1 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 0

table(rowSums(pres))

## 
##   1   2   3   4   5   6   7   8   9  10  11 
##  11  40  63  85 106  85  52  37  11   7   3

The first 10 observations:

image(Matrix(pres),ylim=c(1,10),sub="",ylab="Observation",
      xlab="Item",
      ## draw tick labels at top
      scales=list(at=1:20,x=list(labels=colnames(pres),
                                 alternating=2)))

Since we chose which items/individuals to include for each observation randomly and independently (Bernoulli with probability 0.25), we have a highly variable number of individuals present for different observations (1-11). This would be realistic for some examples (authorship), unrealistic for others (hockey) … I don’t think it really makes much difference computationally (statistically, having some observations with a single member must make estimation more powerful …) The 0/1 matrix (indicator variable for whether item \(i\) is included in observation \(j\)) is convenient, and will turn out to be the form we need for inclusion in the model. It should be fairly straightforward to convert other forms (e.g. a list of sets of items associated with each observation) to this form …

Now simulate the response variable.

b <- rnorm(nm)  ## item-level effects
## n.b. get in trouble if we don't add residual error
## (theta is scaled relative to residual error)
## here, use theta=sigma=1
y <- c(pres %*% b) +rnorm(nobs,sd=1)

Fit (see ?modular):

## helpful to specify a factor with the right levels:
## actual values are unimportant since we will specify Zt/Ztlist directly
fake <- rep(LETTERS[seq(nm)],length.out=nobs)
lmod <- lFormula(y~1+(1|fake))
lmod$reTrms$Zt <- lmod$reTrms$Ztlist[[1]] <- Matrix(t(pres))
devfun <- do.call(mkLmerDevfun, lmod)
opt <- optimizeLmer(devfun)
m1 <- mkMerMod(environment(devfun), opt, lmod$reTrms, fr = lmod$fr)

Results look OK (correct item and residual variance estimated):

m1

## Linear mixed model fit by REML ['lmerMod']
## REML criterion at convergence: 1500.701
## Random effects:
##  Groups   Name        Std.Dev.
##  fake     (Intercept) 0.9444  
##  Residual             0.9896  
## Number of obs: 500, groups:  fake, 20
## Fixed Effects:
## (Intercept)  
##      0.1392

The conditional modes by item look good:

dd <- tidy(m1, effects="ran_vals")
dd <- transform(dd, level=reorder(level,estimate))
truth <- data.frame(level=LETTERS[seq(nm)],estimate=b)
ggplot(dd,aes(x=level,y=estimate))+
    geom_pointrange(aes(ymin=estimate-2*std.error,
              ymax=estimate+2*std.error))+coord_flip()+
    geom_point(data=truth,colour="red")

Session info:

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2    ggplot2_3.1.0     broom.mixed_0.2.3 lme4_1.1-19      
## [5] Matrix_1.2-15    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       plyr_1.8.4       glmmTMB_0.2.2.0  compiler_3.5.0  
##  [5] pillar_1.3.0     nloptr_1.2.1     bindr_0.1.1      TMB_1.7.15      
##  [9] tools_3.5.0      digest_0.6.18    gtable_0.2.0     evaluate_0.12   
## [13] tibble_1.4.2     nlme_3.1-137     lattice_0.20-38  pkgconfig_2.0.2 
## [17] rlang_0.3.0.1    yaml_2.2.0       coda_0.19-2      withr_2.1.2     
## [21] dplyr_0.7.8      stringr_1.3.1    knitr_1.20       rprojroot_1.3-2 
## [25] grid_3.5.0       tidyselect_0.2.5 glue_1.3.0       R6_2.3.0        
## [29] rmarkdown_1.10   minqa_1.2.4      reshape2_1.4.3   tidyr_0.8.2     
## [33] purrr_0.2.5      magrittr_1.5     scales_1.0.0     backports_1.1.2 
## [37] htmltools_0.3.6  MASS_7.3-51.1    splines_3.5.0    assertthat_0.2.0
## [41] colorspace_1.3-2 labeling_0.3     stringi_1.2.4    lazyeval_0.2.1  
## [45] munsell_0.5.0    broom_0.5.0      crayon_1.3.4

estimating group membership models with lme4

Ben Bolker

24 November 2018