Small samples

We generate 3 samples of 10 standard normal variates each, and a corresponding factor variable with levels A, B, and C:

set.seed(1234)
(dat <- rnorm(n=30))
##  [1] -1.20707  0.27743  1.08444 -2.34570  0.42912  0.50606 -0.57474
##  [8] -0.54663 -0.56445 -0.89004 -0.47719 -0.99839 -0.77625  0.06446
## [15]  0.95949 -0.11029 -0.51101 -0.91120 -0.83717  2.41584  0.13409
## [22] -0.49069 -0.44055  0.45959 -0.69372 -1.44820  0.57476 -1.02366
## [29] -0.01514 -0.93595
(fac <- rep(LETTERS[1:3], each=10))
##  [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "B"
## [18] "B" "B" "B" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C"

Standard parameterization

We fit a linear model to the data using R’s standard parameterization (treatment contrasts: the intercept estimates the mean of sample A, and the remaining coefficients estimate the differences of the means of samples B and C from the mean of sample A):

(lma <- lm(dat ~ fac))
## 
## Call:
## lm(formula = dat ~ fac)
## 
## Coefficients:
## (Intercept)         facB         facC  
##    -0.38316      0.26499     -0.00479
vcov(lma)
##             (Intercept)     facB     facC
## (Intercept)     0.08581 -0.08581 -0.08581
## facB           -0.08581  0.17162  0.08581
## facC           -0.08581  0.08581  0.17162

We want to investigate the difference between the means of samples B and C using a simple linear contrast:

mat <- rbind(c(0, 1, -1))

The associated variance should be:

mat %*% vcov(lma) %*% t(mat)
##        [,1]
## [1,] 0.1716
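
As a quick sanity check (a small sketch, not part of the original analysis), the same number follows from the pooled residual variance, since the difference of two independent means of 10 observations has variance 2 * sigma^2 / 10:

sigma2 <- summary(lma)$sigma^2   # pooled residual variance estimate
2 * sigma2 / 10                  # should reproduce the 0.1716 above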

Now observe the difference between the variance estimates obtained from glht() directly and via mmm():

library(multcomp)
vcov(glht(lma, mat))
##        1
## 1 0.1716
vcov(glht(mmm(lma), mlf(mat)))
##        lma: 1
## lma: 1 0.1486
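
To see where a discrepancy can arise at all, note that the contrast variance decomposes as Var(facB) + Var(facC) - 2 Cov(facB, facC), so a different estimate of the covariance term alone is enough to change the result. With the model-based quantities:

V <- vcov(lma)
V["facB", "facB"] + V["facC", "facC"] - 2 * V["facB", "facC"]   # = 0.1716, as above

If the two coefficient variances are held fixed, the smaller mmm-based value of 0.1486 corresponds to a larger estimated covariance between facB and facC.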

Alternative parameterization

We can use an alternative parameterization for the linear model, the cell-means parameterization, in which each coefficient is the mean of one of the samples A, B, and C:

(lmb <- lm(dat ~ fac - 1))
## 
## Call:
## lm(formula = dat ~ fac - 1)
## 
## Coefficients:
##   facA    facB    facC  
## -0.383  -0.118  -0.388
vcov(lmb)
##         facA    facB    facC
## facA 0.08581 0.00000 0.00000
## facB 0.00000 0.08581 0.00000
## facC 0.00000 0.00000 0.08581
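
In this parameterization each coefficient is simply the mean of its own sample of 10 observations, which we can verify directly:

tapply(dat, fac, mean)   # group means; should agree with coef(lmb)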

Again we are interested in the difference between the means of samples B and C. We can use the same linear contrast, and the associated variance should be the same as before:

mat %*% vcov(lmb) %*% t(mat)
##        [,1]
## [1,] 0.1716

This time we find that the estimated variances are the same:

vcov(glht(lmb, mat))
##        1
## 1 0.1716
vcov(glht(mmm(lmb), mlf(mat)))
##        lmb: 1
## lmb: 1 0.1716

Large samples

We increase the sample size per group from 10 to 10,000:

DAT <- rnorm(n=30000)
FAC <- rep(LETTERS[1:3], each=10000)

Standard parameterization

We redo the analysis using R’s standard parameterization:

lmA <- lm(DAT ~ FAC)
vcov(lmA)
##             (Intercept)       FACB       FACC
## (Intercept)   9.954e-05 -9.954e-05 -9.954e-05
## FACB         -9.954e-05  1.991e-04  9.954e-05
## FACC         -9.954e-05  9.954e-05  1.991e-04
mat %*% vcov(lmA) %*% t(mat)
##           [,1]
## [1,] 0.0001991
vcov(glht(lmA, mat))
##           1
## 1 0.0001991
vcov(glht(mmm(lmA), mlf(mat)))
##           lmA: 1
## lmA: 1 0.0002022

The relative discrepancy between variance estimates is much smaller than with a sample size of 10.
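
For reference, the relative discrepancies computed from the values above:

(0.1716 - 0.1486) / 0.1716            # small samples: roughly 13%
(0.0002022 - 0.0001991) / 0.0001991   # large samples: roughly 1.6%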

Alternative parameterization

Using the alternative parameterization, the variance estimates are again identical:

lmB <- lm(DAT ~ FAC - 1)
vcov(lmB)
##           FACA      FACB      FACC
## FACA 9.954e-05 0.000e+00 0.000e+00
## FACB 0.000e+00 9.954e-05 0.000e+00
## FACC 0.000e+00 0.000e+00 9.954e-05
mat %*% vcov(lmB) %*% t(mat)
##           [,1]
## [1,] 0.0001991
vcov(glht(lmB, mat))
##           1
## 1 0.0001991
vcov(glht(mmm(lmB), mlf(mat)))
##           lmB: 1
## lmB: 1 0.0001991

So we see that the alternative parameterization is preferable here, since each parameter is estimated from its own, disjoint portion of the data. The discrepancy decreases with sample size because the observed and expected Fisher information, and thus the corresponding variance estimates, are only asymptotically equal.
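
If, as I assume here (the internals of mmm() are not shown above), the mmm-based covariance is built from the models' per-observation score contributions, the difference between the two parameterizations is easy to see in the cross-products of those contributions, sketched with estfun() from the sandwich package:

library(sandwich)       # estfun() returns the per-observation score contributions
crossprod(estfun(lmb))  # cell-means coding: off-diagonal entries are exactly zero
crossprod(estfun(lma))  # default coding: off-diagonal entries are not zero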

Observed vs. expected variance

This plot shows the average percentage deviation (plus/minus standard deviation) of the observed from the expected variance as estimated from 1000 simulated datasets under the above setting (3 balanced samples of standard normal variates):

[Figure: average percentage deviation of the observed from the expected variance, 3 balanced samples]

We repeat the simulation, now with 5 balanced samples of standard normal variates:

[Figure: average percentage deviation of the observed from the expected variance, 5 balanced samples]
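
The simulation code itself is not shown above; the following is a minimal sketch of how it could look. Treating the mmm-based estimate as the "observed" variance and the model-based estimate as the "expected" variance, as well as the grid of per-group sample sizes, are assumptions on my part:

library(multcomp)

sim_dev <- function(n, ngroups = 3, nsim = 1000) {
  grp <- rep(LETTERS[seq_len(ngroups)], each = n)
  cmat <- rbind(c(rep(0, ngroups - 2), 1, -1))   # compares the last two group means
  dev <- replicate(nsim, {
    d <- data.frame(y = rnorm(n * ngroups), grp = grp)
    fit <- lm(y ~ grp, data = d)
    v_expected <- drop(cmat %*% vcov(fit) %*% t(cmat))   # model-based ("expected")
    v_observed <- drop(vcov(glht(mmm(fit), mlf(cmat))))  # mmm-based ("observed")
    100 * (v_observed - v_expected) / v_expected         # percentage deviation
  })
  c(mean = mean(dev), sd = sd(dev))
}

# average deviation (and SD) for a few per-group sample sizes, e.g.:
# sapply(c(10, 20, 50, 100), sim_dev)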

Another example

Finally, we compare the two parameterizations on a set of three simultaneous linear combinations: the matrix A maps the coefficients of the standard parameterization to the three group means, while the identity matrix B does the same for the alternative parameterization:

A <- rbind(c(1, 0, 0), c(1, 1, 0), c(1, 0, 1))
B <- rbind(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1))
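
A quick check that both matrices indeed recover the three group means:

A %*% coef(lma)   # group means recovered from the standard parameterization
coef(lmb)         # the alternative parameterization returns them directly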

A %*% vcov(lma) %*% t(A)
##            [,1]      [,2]       [,3]
## [1,]  8.581e-02 1.388e-17 -1.388e-17
## [2,]  1.388e-17 8.581e-02  1.388e-17
## [3,] -1.388e-17 1.388e-17  8.581e-02
vcov(glht(lma, A))
##            1         2          3
## 1  8.581e-02 1.388e-17 -1.388e-17
## 2  1.388e-17 8.581e-02  1.388e-17
## 3 -1.388e-17 1.388e-17  8.581e-02
vcov(glht(mmm(lma), mlf(A)))
##           lma: 1     lma: 2     lma: 3
## lma: 1  0.085812  0.0030258 -0.0150624
## lma: 2  0.003026  0.0918635 -0.0005311
## lma: 3 -0.015062 -0.0005311  0.0556872
B %*% vcov(lmb) %*% t(B)
##         [,1]    [,2]    [,3]
## [1,] 0.08581 0.00000 0.00000
## [2,] 0.00000 0.08581 0.00000
## [3,] 0.00000 0.00000 0.08581
vcov(glht(lmb, B))
##         1       2       3
## 1 0.08581 0.00000 0.00000
## 2 0.00000 0.08581 0.00000
## 3 0.00000 0.00000 0.08581
vcov(glht(mmm(lmb), mlf(B)))
##         lmb: 1  lmb: 2  lmb: 3
## lmb: 1 0.08581 0.00000 0.00000
## lmb: 2 0.00000 0.08581 0.00000
## lmb: 3 0.00000 0.00000 0.08581