Simulation is hard to pin down because there are so many possible axes of decision, but there are some tricks that make it easier. (Chapter 5 of Bolker 2008 has some examples, but they’re simultaneously less complicated [simpler experimental designs] and more complicated [nonlinear responses] than the examples here.)
library(reshape2) ## for melt()
library(plyr) ## for raply()
library(ggplot2); theme_set(theme_bw())
library(grid) ## for unit()
zmargin <- theme(panel.margin=unit(0,"lines"))
library(lme4)
Here are some of the properties of the simulation that you have to decide on. It’s easier if you have a specific, real example in mind to guide you.
expand.grid, runif, rnorm, seq, repexpand.grid, sample, rep, gl (maybe rpois for number of samples within levels)Here I’ll do an example somewhat like the Culcita example. Here there are a specified number of random-effects groups with each treatment replicated multiple times within a block; in addition, there’s a continuous covariate (uniformly distributed) that has a unique value for every observation. The data are binomial with a user-specified \(N\) (I don’t know what will happen if \(N\) (N.binom in the function below) is specified as a vector: can simulate.formula handle a vector-valued size argument??)
For convenience in running multiple simulations I’ll wrap everything in functions whose parameters can easily be varied.
The simulation function:
simfun <- function(n.blocks=10,n.ttt=4,n.rep=4,N.binom=1,
x.range=c(-1,1),
## intercept, ttt effects, slope
## default gives homogeneous log-odds of 0 -> prob=0.5
beta=c(0,rep(0,n.ttt-1),0),
theta=0,
seed=NULL) {
if (!is.null(seed)) set.seed(seed)
dd <- expand.grid(block=factor(seq(n.blocks)),
ttt=factor(letters[seq(n.ttt)]),
rep=seq(n.rep))
dd$x <- runif(nrow(dd),min=x.range[1],max=x.range[2])
dd$N <- N.binom
## the messages about "theta/beta parameter vector not named"
## are annoying ...
dd$y <- suppressMessages(simulate(~ttt+x+(1|block),family="binomial",
size=N.binom,
newdata=dd,
newparams=list(beta=beta,theta=theta))[[1]])
return(dd)
}
A function for fitting the data. If it weren’t for x the experimental design would stay fixed and we could use refit, but we can’t (it is theoretically possible to adapt refit to this situation, but it would take a bit of hacking).
fitfun <- function(dd) {
glmer(y/N~ttt+x+(1|block),
weights=N,family="binomial",
data=dd)
}
And a function to generate summary output from a fit:
sumfun <- function(fit) {
c(fixef(fit),getME(fit,"theta"))
}
We can use raply from the plyr package
simres <- raply(100,sumfun(fitfun(simfun())))
true_vals <- data.frame(Var2=c("(Intercept)",
paste0("ttt",letters[2:4]),
"x","block.(Intercept)"),
value=c(0,0,0,0,0,0))
ggplot(melt(simres),aes(x=Var2,y=value))+geom_violin(fill="gray")+
facet_wrap(~Var2,scale="free")+
geom_hline(data=true_vals,aes(yintercept=value),col=2,lty=2)
When I want to run a factorial simulation experiment (varying values for several different parameters), I usually prefer to run a series of nested for loops to fill in an array (although I may use raply, sometimes with .progress="text", for the inner loop that loops over replicates of the same parameter values). I typically set up the array first with carefully assigned dimnames so that it can be melted: my array dimensions are usually {par1,par2,...,parN,rep,vars}, where the last dimension is the vector of variables saved for each run.
For example:
thetavec <- seq(0,1,by=0.2)
slopevec <- seq(-1,1,by=0.5)
res0 <- sumfun(fitfun(simfun()))
nrep <- 20
resarr <- array(dim=c(length(thetavec),length(slopevec),nrep,length(res0)),
dimnames=list(theta=thetavec,slope=slopevec,rep=seq(nrep),
var=names(res0)))
beta.base <- rep(0,5)
names(beta.base) <- names(res0)[1:5]
## prefer to use i,j,k,... for indices
for (i in seq_along(thetavec)) {
for (j in seq_along(slopevec)) {
## if you want to see progress in the output file
## cat(i,j,"\n")
## fancier progress bars available: ?txtProgressBar
beta <- beta.base
beta["x"] <- slopevec[j]
resarr[i,j,,] <- raply(nrep,
sumfun(fitfun(simfun(theta=thetavec[i],
beta=beta))))
save("resarr",file="lme4_sims.RData")
## checkpointing
## decide whether you want this in the innermost or next-
## or next-next ... probably don't need to save more than once every
## 10 minutes or so unless you're really impatient and want
## to look at interim results in an independent R session
}
}
There are various ways of saving long runs.
* For the simplest, shortest kinds of runs, you can use knitr’s caching facility (just add cache=TRUE to the chunk options) * For longer runs, I often do my own version of caching by defining a file name (e.g. fn) and then using code structured as if (file.exists(fn)) load(fn) else { ...; save(fn)}. The simulations won’t be re-run if you change the code, but they will if you remove (or rename) the output file. * Alternatively, you may want to put the batch code in a completely separate .R file (you still need to save the .RData file from the batch run and use load() when you want to analyze the results).
At this point you have a fairly complex object (a 4-dimensional, $6 5 20 6 array), but at least one that is fairly well labeled:
dimnames(resarr)
## $theta
## [1] "0" "0.2" "0.4" "0.6" "0.8" "1"
##
## $slope
## [1] "-1" "-0.5" "0" "0.5" "1"
##
## $rep
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20"
##
## $var
## [1] "(Intercept)" "tttb" "tttc"
## [4] "tttd" "x" "block.(Intercept)"
You can calculate summary statistics across various dimensions, especially across the replication dimension:
resmeans <- apply(resarr,c(1,2,4),mean,na.rm=TRUE)
resci <- apply(resarr,c(1,2,4),quantile,c(0.025,0.975))
Or draw pictures, after melt()ing the data (melting arrays with sensible dimnames automatically produces sensible results):
mres <- melt(resarr)
ggplot(mres,aes(theta,value,colour=slope))+
geom_boxplot(aes(fill=slope,
group=interaction(theta,slope)))+
facet_wrap(~var,scale="free")
or
ggplot(mres,aes(slope,value))+
geom_boxplot(aes(group=slope))+
facet_grid(var~theta,scale="free",
labeller=label_both)+
zmargin
or
ggplot(melt(resmeans),aes(slope,value,colour=theta,
group=theta))+
geom_line()+facet_wrap(~var)+zmargin
Other handy tricks include
data argument with a data frame composed of the true values;geom_violin) to summarise large sets of runssweep()ing the array with a comparison operator and the true values)mean(p<0.05) across the replication dimensionsweep() is handy again here)