Simulating variance

We’re interested in simulating (and eventually fitting) models where the variance varies across levels of a fixed (categorical) treatment factor. This is dealt with to some extent in another document on Rpubs, although here I will try to be a bit more basic.

library("ggplot2"); theme_set(theme_bw())
library("lme4")
library("nlme")

Parameter values/experimental design (I decided to make up parameters etc. as I went along): two treatments (ttt=c("homog","var")), each with multiple individuals (indiv) measured repeatedly at equal time intervals (obs). (The names of the treatment categories aren't meaningful; they're just left over from the problem that inspired this writeup.)

nindiv <- 5
nobs <- 20 
expdat <- expand.grid(indiv=factor(seq(nindiv)),
                      obs=seq(nobs),
                      ttt=c("homog","var"))
## construct dummy variables 
expdat <- transform(expdat,
                    homog=as.numeric(ttt=="homog"),
                    var=as.numeric(ttt=="var"),
                    total_obs=factor(seq(nrow(expdat))))
## construct treatment-specific slope variables (dummy * time)
expdat <- transform(expdat,
                    homogobs = homog*obs,
                    varobs = var*obs)

Simple case: equal variation in intercepts and slopes across individuals in both treatments:

res <- list()
res[[1]] <- data.frame(expdat,model="intslopevar",
                       y=simulate(~ttt+(obs|indiv:ttt),
                                  seed=101,
                                  newdata=expdat,
                                  newparams=list(beta=c(0,4),
                                                 theta=c(1,0,0.5),
                                                 sigma=0.2),
                                  family="gaussian")[[1]])
(g1 <- ggplot(res[[1]],aes(obs,y,colour=indiv:ttt,pch=ttt,lty=ttt))+
    geom_point()+
    geom_smooth(method="lm",se=FALSE))

[Figure: plot of chunk plot1]
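
As a quick check (my addition, not part of the original sequence), we can refit the generating model to the simulated data. Keep in mind that lme4's theta parameters are scaled by the residual standard deviation, so theta=c(1,0,0.5) with sigma=0.2 corresponds to an among-individual intercept SD of 0.2, a slope SD of 0.1, and zero intercept-slope correlation:

fit1 <- lmer(y~ttt+(obs|indiv:ttt), data=res[[1]])
VarCorr(fit1)  ## estimated SDs should be near 0.2 (intercept) and 0.1 (slope)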

If we want the among-individual variation in intercepts to differ between treatments, we can construct numeric dummy variables and use them (with 0+ to suppress the intercept) in two separate random-effects terms.

res[[2]] <- data.frame(expdat,model="ttt_intvar",
                       y=simulate(~ttt+(0+homog|indiv:ttt)+
                                    (0+var|indiv:ttt),
                                  seed=101,
                                  newdata=expdat,
                                  newparams=list(beta=c(0,4),
                                                 theta=c(1,2),
                                                 sigma=0.2),
                                  family="gaussian")[[1]])
g1 %+% res[[2]]

[Figure: plot of chunk model2]

We can check that lme4's dummy() function constructs an equivalent model; the only differences in the random-effects structures should be string (term-name) mismatches.

L1 <- lFormula(y~ttt+(0+homog|indiv:ttt)+
                 (0+var|indiv:ttt), data=res[[2]])
L2 <- lFormula(y~ttt+(0+dummy(ttt,"homog")|indiv:ttt)+
                 (0+dummy(ttt,"var")|indiv:ttt), data=res[[2]])
diffs <- all.equal(L1$reTrms, L2$reTrms)
diffs[!grepl("string mismatch", diffs)]  ## exclude string mismatches
## character(0)

Now we'll allow the variance in both intercepts and slopes to differ between treatments. Because we're including these variances as four separate random-effects terms, the slope and intercept REs for each individual are assumed to be independent of each other.

res[[3]] <- data.frame(expdat,model="ttt_intslopevar",
                       y=simulate(~ttt+
                                    (0+homog|indiv:ttt)+
                                    (0+var|indiv:ttt)+
                                    (0+homogobs|indiv:ttt)+
                                    (0+varobs|indiv:ttt),
                                  seed=101,
                                  newdata=expdat,
                                  newparams=list(beta=c(0,4),
                                                 theta=c(0.1,3,0.5,0.1),
                                                 sigma=0.2),
                                  family="gaussian")[[1]])
g1 %+% res[[3]]

[Figure: plot of chunk model3]

As specified, there is smaller among-intercept variation (\(\theta_1=\sigma_{\textrm{homog},\textrm{int}}=0.1\) vs. \(\theta_2=\sigma_{\textrm{var},\textrm{int}}=3\)) but larger among-slope variation (\(\theta_3=\sigma_{\textrm{homog},\textrm{slope}}=0.5\) vs. \(\theta_4=\sigma_{\textrm{var},\textrm{slope}}=0.1\)) in the homog (mean-0) group than in the var (mean-4) group.

As a final example, here's the case that allows correlation between the intercept and slope random effects for each individual. We don't actually specify any correlation here (the \(\theta\) parameters are given as the lower triangle of the Cholesky factor of the variance-covariance matrix). The definition is

\[ \Sigma = \left( \begin{array}{cc} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{array} \right) = \left( \begin{array}{cc} \theta_1 & 0 \\ \theta_2 & \theta_3 \end{array} \right) \left( \begin{array}{cc} \theta_1 & \theta_2 \\ 0 & \theta_3 \end{array} \right) = \left( \begin{array}{cc} \theta_1^2 & \theta_1 \theta_2 \\ \theta_1 \theta_2 & \theta_2^2 + \theta_3^2 \end{array} \right) \]

In other words, if we set \(\theta_2=0\), then we get back to \(\theta_1=\sigma_1\), \(\theta_3=\sigma_2\).
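
We can check this algebra numerically; a minimal sketch with made-up theta values:

theta <- c(1, 0.5, 2)                ## (theta1, theta2, theta3)
L <- matrix(0, 2, 2)
L[lower.tri(L, diag=TRUE)] <- theta  ## fill the lower triangle column-wise
L %*% t(L)  ## Sigma: diagonal c(1, 0.25+4=4.25), off-diagonal 1*0.5 = 0.5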

res[[4]] <- data.frame(expdat,model="ttt_corintslopevar",
                       y=simulate(~ttt+
                                    (0+homog+homogobs|indiv:ttt)+
                                    (0+var+varobs|indiv:ttt),
                                  seed=101,
                                  newdata=expdat,
                                  newparams=list(beta=c(0,4),
                                                 theta=c(0.1,0,0.5,3,0,0.1),
                                                 sigma=0.2),
                                  family="gaussian")[[1]])
g1 %+% res[[4]]

[Figure: plot of chunk model4]

Can we shorten this at all, e.g. by using appropriate combinations of interactions (:) and dummy()?
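
Here's one candidate, a sketch I haven't verified in detail: build the dummies and the dummy-by-time interactions directly in the formula with dummy() and :, and compare the resulting random-effects structures via lFormula. If the two versions really are equivalent, the only differences should again be string (term-name) mismatches.

L0 <- lFormula(y~ttt+(0+homog+homogobs|indiv:ttt)+
                 (0+var+varobs|indiv:ttt),
               data=data.frame(expdat,y=1))
L0b <- lFormula(y~ttt+
                  (0+dummy(ttt,"homog")+dummy(ttt,"homog"):obs|indiv:ttt)+
                  (0+dummy(ttt,"var")+dummy(ttt,"var"):obs|indiv:ttt),
                data=data.frame(expdat,y=1))
d0 <- all.equal(L0$reTrms, L0b$reTrms)
d0[!grepl("string mismatch", d0)]  ## hope for character(0)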

Can we look at results of mkReTrms to interpret whether we got the right model?

L1 <- lFormula(y~ttt+(0+homog+homogobs|indiv:ttt)+
                 (0+var+varobs|indiv:ttt),
               data=data.frame(expdat,y=1))
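
A few components of L1$reTrms (the output of mkReTrms) are worth inspecting; this is just what I'd look at, not an exhaustive list:

names(L1$reTrms)         ## includes Zt, theta, Lambdat, flist, cnms, ...
L1$reTrms$cnms           ## RE model-matrix columns for each term
length(L1$reTrms$theta)  ## 6: two terms x 3 lower-triangle parameters each
image(L1$reTrms$Lambdat) ## sparsity pattern of the relative Cholesky factor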

Everything together:

g1 %+% do.call(rbind,res)+facet_wrap(~model)

[Figure: plot of chunk unnamed-chunk-3]

Example 2

Now we want to look at another example (for this one we have some test data).

test <- read.table("testdat.txt",header=TRUE,na.strings=".")

The proximal question is how to allow for differential residual variation across the two treatments. Based on the picture below, I don't really think it's there:

w <- 0.4
(g2 <- ggplot(test,aes(Line,CL2,colour=Trtment))+
  geom_boxplot(width=w,position=position_dodge(width=w))+
  stat_sum(aes(size=..n..),alpha=0.5,position=position_dodge(width=w))+
  scale_size_continuous(range=c(2,6),breaks=1:3))

[Figure: plot of chunk testplot1]

Or collapsing to just the standard deviations (and reordering lines by overall standard deviation):

ggplot(test,aes(reorder(Line,CL2,FUN=sd),CL2,colour=Trtment))+
  stat_summary(fun.y=sd,geom="point")

[Figure: plot of chunk sumplot]

First fit a basic model:

(a1 <- lmer(CL2~Trtment + (1|Line/Trtment), data=test))
## Linear mixed model fit by REML ['lmerMod']
## Formula: CL2 ~ Trtment + (1 | Line/Trtment)
##    Data: test
## REML criterion at convergence: 114.4
## Random effects:
##  Groups       Name        Std.Dev.
##  Trtment:Line (Intercept) 0.211   
##  Line         (Intercept) 0.000   
##  Residual                 0.536   
## Number of obs: 66, groups:  Trtment:Line, 10; Line, 5
## Fixed Effects:
## (Intercept)   TrtmentStd  
##      1.0241      -0.0131

This model allows the among-line variance to differ between treatments (i.e., it fits the two variances separately; we could alternatively use (1|Trtment:Line) + (0+dummy(Trtment,"Std")|Trtment:Line) to estimate the difference in variance between treatments, although that parameterization is restricted to var(Std) >= var(HT); see the sketch after the a2 results below):

(a2 <- lmer(CL2~Trtment + (1|Line) + 
              (0+dummy(Trtment,"HT")|Trtment:Line) +
              (0+dummy(Trtment,"Std")|Trtment:Line), data=test))
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## CL2 ~ Trtment + (1 | Line) + (0 + dummy(Trtment, "HT") | Trtment:Line) +  
##     (0 + dummy(Trtment, "Std") | Trtment:Line)
##    Data: test
## REML criterion at convergence: 114.4
## Random effects:
##  Groups         Name                  Std.Dev.
##  Trtment.Line   dummy(Trtment, "Std") 0.200   
##  Trtment.Line.1 dummy(Trtment, "HT")  0.225   
##  Line           (Intercept)           0.000   
##  Residual                             0.536   
## Number of obs: 66, groups:  Trtment:Line, 10; Line, 5
## Fixed Effects:
## (Intercept)   TrtmentStd  
##      1.0259      -0.0157
(cc <- confint(a2,oldNames=FALSE))
##                                         2.5 % 97.5 %
## sd_dummy(Trtment, "Std")|Trtment:Line  0.0000 0.5272
## sd_dummy(Trtment, "HT")|Trtment:Line   0.0000 0.5944
## sd_(Intercept)|Line                    0.0000 0.2819
## sigma                                  0.4510 0.6584
## (Intercept)                            0.7266 1.3501
## TrtmentStd                            -0.3945 0.3512
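
For completeness, here's a sketch (not run here) of the alternative "difference" parameterization mentioned above; since the HT level actually has the larger variance, it makes more sense to give the extra component to HT so that the estimated difference is non-negative:

a2b <- lmer(CL2~Trtment + (1|Line) + (1|Trtment:Line) +
              (0+dummy(Trtment,"HT")|Trtment:Line), data=test)
VarCorr(a2b)  ## var(HT) = base + extra; var(Std) = base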

Now fit a model with heterogeneous residual variance: we can do this by setting up an observation-level factor and turning on the variance in that factor for one treatment only (the residual variance applies to all observations). (I initially used dummy(Trtment,"Std") below, but the HT level actually has the higher variance, and the estimated additional variance is not allowed to be negative ...)

We have to use lmerControl to tell lmer that we really do know what we're doing: the observation-level random effect is not actually confounded with the residual variance term, because it applies to only a subset of the data ...

test2 <- transform(test,obs=factor(seq(nrow(test))))
(a3 <- lmer(CL2~Trtment + (1|Line) + (1|Trtment:Line) +
              (0+dummy(Trtment,"HT")|obs), data=test2,
            control=lmerControl(check.nobs.vs.nlev="ignore",
                                check.nobs.vs.nRE="ignore"),
 na.action=na.exclude))
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## CL2 ~ Trtment + (1 | Line) + (1 | Trtment:Line) + (0 + dummy(Trtment,  
##     "HT") | obs)
##    Data: test2
## REML criterion at convergence: 114.2
## Random effects:
##  Groups       Name                 Std.Dev.
##  obs          dummy(Trtment, "HT") 0.210   
##  Trtment:Line (Intercept)          0.209   
##  Line         (Intercept)          0.000   
##  Residual                          0.519   
## Number of obs: 66, groups:  obs, 66; Trtment:Line, 10; Line, 5
## Fixed Effects:
## (Intercept)   TrtmentStd  
##      1.0227      -0.0114
confint(a3,quiet=TRUE,oldNames=FALSE)
##                               2.5 % 97.5 %
## sd_dummy(Trtment, "HT")|obs  0.0000 0.5642
## sd_(Intercept)|Trtment:Line  0.0000 0.4009
## sd_(Intercept)|Line          0.0000 0.2837
## sigma                        0.4165 0.6557
## (Intercept)                  0.7514 1.2994
## TrtmentStd                  -0.3806 0.3543

We don’t get anything very interesting (all of the confidence intervals on the standard deviations overlap zero), but let’s see what the data would look like if we simulated a data set with a larger difference:

testsim <- test2
testsim$CL2 <- simulate(a3,newparams=list(theta=c(8,0.4,0.5),
                           beta=c(1,-0.011),
                           sigma=0.5),
                        seed=101)[[1]]
g2 %+% testsim

[Figure: plot of chunk simbigvar]

VarCorr(a4 <- refit(a3,testsim$CL2))
##  Groups       Name                 Std.Dev.
##  obs          dummy(Trtment, "HT") 3.268   
##  Trtment:Line (Intercept)          0.302   
##  Line         (Intercept)          0.000   
##  Residual                          0.527
confint(a4,oldNames=FALSE)
##                               2.5 % 97.5 %
## sd_dummy(Trtment, "HT")|obs  2.4996 4.3153
## sd_(Intercept)|Trtment:Line  0.0000 0.6776
## sd_(Intercept)|Line          0.0000 0.5828
## sigma                        0.4217 0.6882
## (Intercept)                 -0.5832 1.9542
## TrtmentStd                  -0.9780 1.6184

We do get a significant variance difference, but at first glance the estimate looks badly biased (theta was set to 8; the estimated standard deviation comes out around 3.3). I believe much of the discrepancy is scaling rather than bias: lme4's theta parameters are expressed in units of the residual standard deviation, so theta=8 with sigma=0.5 corresponds to a standard deviation of 4, which the confidence interval above does cover. It would still be worth simulating some larger examples ...
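
A minimal sketch of such a simulation check (my addition; same generating parameters, refitting each replicate):

## simulate 20 replicates from a3 and collect the estimated SD of the
## observation-level (HT) term; the true value is 8*0.5 = 4
simres <- sapply(seq(20), function(i) {
    ysim <- simulate(a3, newparams=list(theta=c(8,0.4,0.5),
                                        beta=c(1,-0.011), sigma=0.5))[[1]]
    as.data.frame(VarCorr(refit(a3, ysim)))[1, "sdcor"]
})
summary(simres)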

Now try the equivalent with lme. We need to use random=~1|Line/Trtment to nest Trtment within Line.

test <- transform(test,TrtLine=interaction(Trtment,Line))  ## combined factor (used again below)
a5 <- lme(CL2~Trtment, data=test, random=~1|Line/Trtment, 
           weights = varIdent(form=~1|Trtment), na.action=na.exclude)
VarCorr(a5)
##             Variance     StdDev  
## Line =      pdLogChol(1)         
## (Intercept) 5.762e-10    0.000024
## Trtment =   pdLogChol(1)         
## (Intercept) 4.385e-02    0.209405
## Residual    3.138e-01    0.560164
print(VarCorr(a3),comp=c("Variance","Std.Dev."))
##  Groups       Name                 Variance Std.Dev.
##  obs          dummy(Trtment, "HT") 0.0442   0.210   
##  Trtment:Line (Intercept)          0.0439   0.209   
##  Line         (Intercept)          0.0000   0.000   
##  Residual                          0.2696   0.519

The residual variance is slightly larger here because a5 takes HT as its baseline; it says that the Std residual variance is about 7% smaller …

a5$modelStruct$varStruct
## Variance function structure of class varIdent representing
##     HT    Std 
## 1.0000 0.9269

The lme4 fit is reasonably close: the ratio of the Std (residual) variance to the total HT variance (residual plus observation-level) is about 0.86, and its square root (0.927) matches the varIdent ratio above.

av <- as.data.frame(VarCorr(a3))[,"vcov"]  ## variances: obs, Trtment:Line, Line, residual
av[4]/(av[4]+av[1])  ## var(Std)/var(HT); sqrt(0.8592) = 0.9269
## [1] 0.8592

We do still get the intervals() error:

intervals(a5)
## Error: cannot get confidence intervals on var-cov components: Non-positive
## definite approximate variance-covariance

(but we can use the confidence intervals that we got above from lmer with some comfort).

The basic problem is that lme can’t compute confidence intervals when one of the parameters is on the boundary; in this case, the issue is that the among-line variance is effectively zero.
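One possible workaround, sketched but not tested here: drop the boundary (Line) term, using the TrtLine interaction factor constructed earlier in place of the nested specification, and see whether intervals() then succeeds.

a5b <- lme(CL2~Trtment, data=test, random=~1|TrtLine,
           weights=varIdent(form=~1|Trtment), na.action=na.exclude)
intervals(a5b)  ## may now succeed, since the zero-variance Line term is gone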