This document explores the differences between fixed- and random-effects modeling of a batch covariate. First, let's make up some data:
set.seed(123)
numDonors <- 5
donors <- paste("D", 1:numDonors, sep = "")
treatments <- c("Veh", "Drug1")
data <- expand.grid(donor = donors, treatment = treatments)
data$y <- 0 # grand mean
donorOffsets <- rnorm(length(donors), 0, 2) # donor-specific random effects
data$y <- data$y + donorOffsets[data$donor] # additive donor differences
data$y <- data$y + c(0, 2)[data$treatment] # additive treatment effect
data$y <- data$y + rnorm(nrow(data), 0, 1) # random noise
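A quick aside on the indexing trick above: expand.grid() returns donor and treatment as factors, and indexing a vector by a factor uses the factor's underlying integer codes (1..5 for donor, 1..2 for treatment), which is what makes donorOffsets[data$donor] and c(0, 2)[data$treatment] work. A sanity check, not part of the original (output not shown):
str(data)   # donor and treatment are factors; y is numeric
head(data)  # one row per donor/treatment combination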
Visualized, with color and lines connecting “paired” data points; note how data from donor D3 in particular falls much higher than the other data points:
library(ggplot2)
qplot(treatment, y,
      color=donor, group=donor, data=data, size=I(4)) +
  geom_line() + theme_bw()
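In recent ggplot2 releases qplot() is deprecated; an equivalent ggplot() call (a sketch producing the same plot) would be:
ggplot(data, aes(treatment, y, color = donor, group = donor)) +
  geom_point(size = 4) +
  geom_line() +
  theme_bw()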
You can calculate the magnitude of the Drug1-vs-Veh difference (or “delta”) either by ignoring the pairing:
mean(subset(data, treatment == "Drug1")$y) -
  mean(subset(data, treatment == "Veh")$y)
## [1] 2.352
Or, you can include the pairing by calculating the average of deltas computed for each donor:
mean(deltas <- sapply(donors, function(d) {
  subset(data, treatment == "Drug1" & donor == d)$y -
    subset(data, treatment == "Veh" & donor == d)$y
}))
## [1] 2.352
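Equivalently, you could reshape the data to one row per donor and difference the two treatment columns; a sketch using base R's reshape(), not part of the original analysis:
wide <- reshape(data, idvar = "donor", timevar = "treatment", direction = "wide")
mean(wide$y.Drug1 - wide$y.Veh)  # the same average of per-donor deltas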
As you can see, the estimated drug effect is the same in either case, so donor-level pairing has no effect on the magnitude. However, it does affect the estimation of variance, which in turn changes our understanding of the statistical significance of the difference, here using the two-sample Student's t-test:
t1 <- t.test(subset(data, treatment == "Drug1")$y,
             subset(data, treatment == "Veh")$y,
             alternative = "two.sided",
             paired = FALSE,
             var.equal = TRUE)
t1
##
## Two Sample t-test
##
## data: subset(data, treatment == "Drug1")$y and subset(data, treatment == "Veh")$y
## t = 2.846, df = 8, p-value = 0.02162
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4462 4.2583
## sample estimates:
## mean of x mean of y
## 2.6950 0.3428
Compare that result to the same test, but with the donor-level pairing explicitly considered (note that because the data frame happens to be ordered consistently by donor, no re-sorting is necessary here; that may not be the case in a real-world setting, so when using paired = TRUE be careful that the two vectors are correctly matched; a defensive sketch follows the output below):
t.test(subset(data, treatment == "Drug1")$y,
       subset(data, treatment == "Veh")$y,
       alternative = "two.sided",
       paired = TRUE,
       var.equal = TRUE)
##
## Paired t-test
##
## data: subset(data, treatment == "Drug1")$y and subset(data, treatment == "Veh")$y
## t = 6.019, df = 4, p-value = 0.003837
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.267 3.437
## sample estimates:
## mean of the differences
## 2.352
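As cautioned above, paired = TRUE pairs the two vectors purely by position, so in a real-world setting it is safer to sort (or merge) by donor first. A defensive sketch, not part of the original, assuming one row per donor per treatment:
drug <- subset(data, treatment == "Drug1")
veh <- subset(data, treatment == "Veh")
drug <- drug[order(drug$donor), ]
veh <- veh[order(veh$donor), ]
stopifnot(identical(drug$donor, veh$donor))  # confirm the rows line up by donor
t.test(drug$y, veh$y, alternative = "two.sided", paired = TRUE)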
An equivalent way of performing the paired t-test is to test whether the donor-specific deltas are significantly different from zero (a one-sample t-test; there is no need to specify the paired or var.equal arguments, which are meaningless for a one-sample test):
t.test(deltas, alternative = "two.sided")
##
## One Sample t-test
##
## data: deltas
## t = 6.019, df = 4, p-value = 0.003837
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1.267 3.437
## sample estimates:
## mean of x
## 2.352
OK, great. From this you can see why paired t-tests are appropriate when there exists some form of batch-level correlation (in this case, due to common donors):
cor(subset(data, treatment == "Veh")$y,
    subset(data, treatment == "Drug1")$y)
## [1] 0.887
But for more complicated experimental designs, you can use linear (regression) modeling (or ANOVA, see below), testing whether coefficients of interest differ from zero (or whether various linear combinations of coefficients, “contrasts”, differ from zero). So, recasting the above test as a linear model with lm, you have:
summary(lm(y ~ 1 + treatment, data))
##
## Call:
## lm(formula = y ~ 1 + treatment, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.992 -0.745 -0.487 0.103 2.823
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.343 0.584 0.59 0.574
## treatmentDrug1 2.352 0.827 2.85 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.31 on 8 degrees of freedom
## Multiple R-squared: 0.503, Adjusted R-squared: 0.441
## F-statistic: 8.1 on 1 and 8 DF, p-value: 0.0216
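As a quick cross-check, not in the original, you can confirm programmatically that the treatment row of this lm reproduces the unpaired t-test saved above as t1:
coef(summary(lm(y ~ 1 + treatment, data)))["treatmentDrug1", ]
c(t = unname(t1$statistic), p.value = t1$p.value)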
Note that two coefficients are estimated: an intercept and a treatment “delta” due to the influence of Drug1. Both the estimate of the treatment-effect magnitude and its significance (P value) are identical to those of the unpaired two-sample Student's t-test above. If you want to include the donor-level “pairing” in the model, you can add donor as a second covariate:
summary(lm(y ~ 1 + treatment + donor, data))
##
## Call:
## lm(formula = y ~ 1 + treatment + donor, data = data)
##
## Residuals:
## 1 2 3 4 5 6 7 8 9 10
## 0.422 0.227 -0.657 -0.223 0.231 -0.422 -0.227 0.657 0.223 -0.231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.173 0.479 0.36 0.7367
## treatmentDrug1 2.352 0.391 6.02 0.0038 **
## donorD2 -0.399 0.618 -0.65 0.5540
## donorD3 2.337 0.618 3.78 0.0194 *
## donorD4 -0.496 0.618 -0.80 0.4674
## donorD5 -0.591 0.618 -0.96 0.3931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.618 on 4 degrees of freedom
## Multiple R-squared: 0.944, Adjusted R-squared: 0.875
## F-statistic: 13.6 on 5 and 4 DF, p-value: 0.0128
Now an additional four coefficients are estimated, one for each of donors D2 through D5 (D1 is taken as the “baseline” and is subsumed by the intercept; see the design-matrix check below). The magnitude and significance of the drug effect are identical to those from the paired t-test.
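To see how D1 is absorbed into the intercept under R's default treatment contrasts, you can inspect the design matrix; a quick check, not in the original (output not shown):
head(model.matrix(~ 1 + treatment + donor, data), numDonors)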
You can also perform the same statistical testing using the ANOVA F-test framework with two nested models (a null model, m0, and a full or complete model containing the variable of interest, m1):
m0 <- lm(y ~ 1 + donor, data)
m1 <- lm(y ~ 1 + treatment + donor, data)
anova(m0, m1, test="F")
## Analysis of Variance Table
##
## Model 1: y ~ 1 + donor
## Model 2: y ~ 1 + treatment + donor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5 15.36
## 2 4 1.53 1 13.8 36.2 0.0038 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And again, you get the exact same significance for the treatment effect as with the paired t-test.
However, you don't really think these five donors are the only five donors in the world, and you're not really interested in whether any specific donor differs significantly from the others. Rather, the donors used in the experiment can be assumed to be a randomly drawn subset of the population of all possible donors, and their individual offsets to be randomly drawn values from a population-level distribution of donor effects. You can further assume that this distribution is Normal with unknown variance (its mean is absorbed into the model intercept). If you can incorporate this into the model, then instead of estimating four separate donor-effect coefficients, you only need to estimate the distributional parameters of the donor effects, essentially a single variance. This frees up some degrees of freedom (df), and should improve statistical significance beyond that obtained by pairing (in other words, increase your statistical power to detect differences), as long as this distribution can be estimated stably enough to realize the additional dfs.
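In symbols (a standard random-intercept formulation, not spelled out in the original), the model being described is y_ij = beta0 + beta1 * treatment_j + b_i + e_ij, with b_i ~ N(0, sigma_donor^2) and e_ij ~ N(0, sigma^2), where i indexes donors and j indexes treatments; the donor term b_i is summarized by the single variance sigma_donor^2 rather than by four separate fixed-effect coefficients.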
To do so, you need to use so-called “mixed” linear modeling software that understands both fixed and random effects; the lme4 package provides the function lmer for this task, with a specialized formula interface for specifying both fixed and random effects:
library(lme4)
## Loading required package: lattice
## Loading required package: Matrix
##
## Attaching package: 'lme4'
##
## The following object is masked from 'package:ggplot2':
##
## fortify
m0 <- lmer(y ~ 1 + (1 | donor), data, REML=F)
m1 <- lmer(y ~ 1 + treatment + (1 | donor), data, REML=F)
fixef(m1)
## (Intercept) treatmentDrug1
## 0.3428 2.3522
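To see the distributional parameters that replace the per-donor coefficients, you can inspect the estimated variance components; a quick check, not part of the original output (values depend on the fit):
VarCorr(m1)  # estimated donor (intercept) and residual standard deviations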
anova(m0, m1)
## Data: data
## Models:
## m0: y ~ 1 + (1 | donor)
## m1: y ~ 1 + treatment + (1 | donor)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## m0 3 44.5 45.4 -19.2 38.5
## m1 4 34.9 36.1 -13.4 26.9 11.6 1 0.00066 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Wow, so now you have an incredibly significant P value (7e-4) compared to your first naive Student's t-test P value (0.02). That actually seems a little too good to be true. It turns out that when you compare two (nested) models with random effects, the usual likelihood ratio test (LRT) can be anti-conservative: the Chi-squared P values tend to be too small. While there are various theoretical approaches to addressing this, a more trustworthy solution is the parametric bootstrap …
The idea here is to repeatedly simulate new y values from the null model, then refit both the full and null models and calculate the anova LRT (Chi-squared) test value; you'll use a function to do this:
m0 <- lmer(y ~ 1 + (1 | donor), data)
m1 <- lmer(y ~ 1 + treatment + (1 | donor), data)
oneLR <- function() {
  data$newy <- unlist(simulate(m0))
  full <- lmer(newy ~ 1 + treatment + (1 | donor), data)
  null <- update(full, . ~ . - treatment)
  return(anova(null, full)$Chisq[2])
}
x2 <- replicate(10000, oneLR())
sum(anova(m0, m1)$Chisq[2] < x2)/10000
So, after 10,000 simulations under the null model, the simulated Chi-squared value exceeded the one observed for the real data only 56 times, for a parametric bootstrap P value of ~6e-3; had you performed more replicates the exact value would surely change, but the order of magnitude (1e-3) is unlikely to change meaningfully. This isn't very different from the fixed-effect P value you obtained earlier.
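As an aside (not in the original), a common convention adds one to both the numerator and the denominator of the bootstrap P value, so that it can never be exactly zero:
obs <- anova(m0, m1)$Chisq[2]            # observed LRT statistic
(1 + sum(x2 >= obs)) / (1 + length(x2))  # bootstrap P value with the +1 correction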
In the example provided here, the use of random effects modeling did not have a dramatic effect; closer inspection of the random effects model shows that the estimation of the random effects was not very precise. Let's see what happens when we increase the number of replicate data points from each donor:
set.seed(123)
numDonors <- 5
numReplicates <- 5
donors <- paste("D", 1:numDonors, sep = "")
treatments <- c("Veh", "Drug1")
replicates <- 1:numReplicates
data <- expand.grid(donor = donors, treatment = treatments, replicate = replicates)
data$y <- 0 # grand mean
donorOffsets <- rnorm(length(donors), 0, 2) # donor-specific random effects
data$y <- data$y + donorOffsets[data$donor] # additive donor differences
data$y <- data$y + c(0, 2)[data$treatment] # additive treatment effect
data$y <- data$y + rnorm(nrow(data), 0, 1) # random noise
Visualized; note the differences in “baselines” observed between donors:
qplot(treatment, y, data=data, size=I(3)) +
  facet_wrap(~donor) + theme_bw()
Fixed effect results:
m0 <- lm(y ~ 1 + donor, data)
m1 <- lm(y ~ 1 + treatment + donor, data)
anova(m0, m1, test="F")
## Analysis of Variance Table
##
## Model 1: y ~ 1 + donor
## Model 2: y ~ 1 + treatment + donor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 96.0
## 2 44 37.2 1 58.7 69.4 1.4e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Random effect results:
lm0 <- lmer(y ~ 1 + (1 | donor), data)
lm1 <- lmer(y ~ 1 + treatment + (1 | donor), data)
anova(lm0, lm1)
## Data: data
## Models:
## lm0: y ~ 1 + (1 | donor)
## lm1: y ~ 1 + treatment + (1 | donor)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## lm0 3 196 201 -94.7 190
## lm1 4 155 162 -73.4 147 42.6 1 6.7e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, the probabilities are so small that even after 10,000 simulations not a single simulated Chi-squared value would be expected to exceed the observed one (not shown, nor actually run here). Because the random effects are more tightly constrained in both the null and the full model, the usual Chi-squared approximation is probably more trustworthy here. So in this case you can see a small amount of additional power obtained from the random-effect (7e-11) vs. the fixed-effect (1e-10) model. The additional power comes from a slightly more precise estimation of the donor offsets:
df <- cbind(donorOffsets, fixed=coef(m1)[-2], random=ranef(lm1)$donor[,1])
cor(df)
## donorOffsets fixed random
## donorOffsets 1.0000 0.9543 0.9948
## fixed 0.9543 1.0000 0.9732
## random 0.9948 0.9732 1.0000
sqrt(colSums((df[,2:3] - df[,1]) ^ 2)) # root sum of squared deviations from the true offsets
## fixed random
## 2.051 1.115
You can also see what would happen if you had a crystal ball and knew exactly what the donor effects were, which gives a sense of the maximum achievable level of significance:
data$yfixed <- data$y - donorOffsets[data$donor]
summary(lm(yfixed ~ treatment, data))
##
## Call:
## lm(formula = yfixed ~ treatment, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9244 -0.5911 -0.0975 0.7222 2.0436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0423 0.1844 -0.23 0.82
## treatmentDrug1 2.1676 0.2608 8.31 7.5e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.922 on 48 degrees of freedom
## Multiple R-squared: 0.59, Adjusted R-squared: 0.582
## F-statistic: 69.1 on 1 and 48 DF, p-value: 7.45e-11
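For reference (a sketch, not part of the original analysis), you can pull the corresponding fixed-effect estimates and standard errors out of the mixed model fitted above (lm1) to compare against this best-case fit:
coef(summary(lm1))  # fixed-effect estimates, standard errors, and t values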