1 Preamble

This document was generated from a file in the Rmd format. Both the source and target files are under the CC BY-SA license. The source file is versioned online.

2 Overview

This tutorial aims to explain contrasts in R, using a one-way ANOVA in a completely balanced setting. It draws heavily on the references listed at the end of the document.

In the running example, yield, the response variable, is regressed on an unordered factor, genotype. A small data set is first simulated; a first version of the statistical model is then described, followed by a second one, which motivates the use of contrasts. Finally, two different codings and their contrasts are described in detail.

Computations are performed with base R functions as well as “by hand”. It is also shown how to obtain the required information (incidence matrix, contrast estimates, standard errors, test statistics and p values) with an external package, emmeans:

suppressPackageStartupMessages(library(emmeans))

3 Simulate some data

3.1 Choose the data dimensions

G <- 3 # number of genotypes
R <- 5 # number of replicates per genotype

3.2 Set up the data structure

(lev.genos <- paste0("var", 1:G))
## [1] "var1" "var2" "var3"
(lev.reps <- LETTERS[1:R])
## [1] "A" "B" "C" "D" "E"
dat <- data.frame(geno=rep(lev.genos, each=R),
                  rep=rep(lev.reps, G),
                  yield=NA,
                  stringsAsFactors=FALSE)
dat
##    geno rep yield
## 1  var1   A    NA
## 2  var1   B    NA
## 3  var1   C    NA
## 4  var1   D    NA
## 5  var1   E    NA
## 6  var2   A    NA
## 7  var2   B    NA
## 8  var2   C    NA
## 9  var2   D    NA
## 10 var2   E    NA
## 11 var3   A    NA
## 12 var3   B    NA
## 13 var3   C    NA
## 14 var3   D    NA
## 15 var3   E    NA

3.3 Generate the data

set.seed(12345) # to make the simulation reproducible
N <- nrow(dat) # total number of observations
stopifnot(G == 3)
(true.geno.means <- setNames(c(2, 4, 6), lev.genos))
## var1 var2 var3 
##    2    4    6
(true.mean <- mean(true.geno.means))
## [1] 4
for(i in 1:N){ # assign to each observation the true mean of its genotype
  true.geno.mean <- true.geno.means[dat$geno[i]]
  dat$yield[i] <- true.geno.mean
}
sd.error <- 0.8
epsilons <- rnorm(n=N, mean=0, sd=sd.error)
dat$yield <- dat$yield + epsilons
dat$geno <- as.factor(dat$geno)
dat$rep <- as.factor(dat$rep)
dat
##    geno rep    yield
## 1  var1   A 2.468423
## 2  var1   B 2.567573
## 3  var1   C 1.912557
## 4  var1   D 1.637202
## 5  var1   E 2.484710
## 6  var2   A 2.545635
## 7  var2   B 4.504079
## 8  var2   C 3.779053
## 9  var2   D 3.772672
## 10 var2   E 3.264542
## 11 var3   A 5.907002
## 12 var3   B 7.453850
## 13 var3   C 6.296502
## 14 var3   D 6.416173
## 15 var3   E 5.399574

4 Explore the data

4.1 Data means

str(dat)
## 'data.frame':    15 obs. of  3 variables:
##  $ geno : Factor w/ 3 levels "var1","var2",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ rep  : Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5 1 2 3 4 5 ...
##  $ yield: num  2.47 2.57 1.91 1.64 2.48 ...
(grand.mean <- mean(dat$yield))
## [1] 4.027303
(geno.means <- tapply(dat$yield, dat$geno, mean))
##     var1     var2     var3 
## 2.214093 3.573196 6.294620

4.2 Plots

hist(dat$yield, col="grey", border="white", las=1)
abline(v=grand.mean, lty=2)
legend("topright", legend="grand mean", lty=2, bty="n")

boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE,
        main="Boxplots of dat$yield")
abline(h=grand.mean, lty=2)
legend("topleft", legend="grand mean", lty=2, bty="n")

5 Build the models

The goal of the statistical models here will be to explain the mean of the response variable using the explanatory factor.

5.1 Version 1

As a first model, one can choose to explain the mean of the responses directly by the genotype means.

5.1.1 Notations

  • \(G\): total number of genotypes

  • \(R\): number of replicates per genotype (here, the same for all genotypes)

  • \(g\): index of the genotype

  • \(r\): index of the replicate

  • \(y_{gr}\): yield for replicate \(r\) of genotype \(g\)

  • \(m_g\): mean yield of genotype \(g\)

  • \(\epsilon_{gr}\): error for replicate \(r\) of genotype \(g\)

  • \(\sigma^2\): variance of the errors

5.1.2 Likelihood

5.1.2.1 Scalar way

\[ \forall g \in \{1,\ldots,G\} \text{ and } r \in \{1,\ldots,R\}, \; y_{gr} = m_g + \epsilon_{gr} \text{ with } \epsilon_{gr} \sim \mathcal{N}(0, \sigma^2) \]

5.1.2.2 Matrix way

A few other notations:

  • \(N\): total number of observations (here, \(N = G \times R\))

  • \(i \in \{1,\ldots,N\}\): index of the observation

  • \(\boldsymbol{y}\): \(N\)-vector of observations

  • \(X_1\): \(N \times G\) design/incidence matrix relating observations to genotypes

  • \(\boldsymbol{m} = \{m_1,\ldots,m_G\}\): \(G\)-vector of parameters explaining the mean of the response variable

  • \(\mathcal{N}_N\): multivariate Normal distribution of dimension \(N\)

  • \(\text{I}_N\): \(N \times N\) identity matrix

\[ \boldsymbol{y} = X_1 \boldsymbol{m} + \boldsymbol{\epsilon} \text{ where } \boldsymbol{\epsilon} \sim \mathcal{N}_N(\boldsymbol{0}, \sigma^2 \text{I}_N) \]

which is equivalent to:

\[ \boldsymbol{y} \, | \, X_1, \boldsymbol{m}, \sigma^2 \; \sim \; \mathcal{N}_N(X_1 \boldsymbol{m}, \sigma^2 \text{I}_N) \]

Note that this model has \(G\) parameters for the expectation of \(\boldsymbol{y}\).

5.1.3 Inference

5.1.3.1 Effect estimation

Under our assumptions, by minimizing the squared errors or by maximizing the likelihood, one ends up with the same estimator of \(\boldsymbol{m}\), denoted \(\hat{\boldsymbol{m}}\) (see the Gauss-Markov theorem):

\[ X_1^T \, X_1 \; \hat{\boldsymbol{m}} \; = \, X_1^T \, \boldsymbol{y} \]

Without going into too many mathematical details, this system of equations (the so-called normal equations) can easily be solved (by inverting \(X_1^T X_1\)) if the columns of \(X_1\) are linearly independent:

\[ \hat{\boldsymbol{m}} \; = \, (X_1^T X_1)^{-1} \, X_1^T \, \boldsymbol{y} \]

Importantly, to obtain the estimates of the parameters for the expectation, one first needs to specify a coding scheme for \(X_1\). At this stage, it helps to write down the contents of the model vectors/matrices. Here it is with \(R=2\):

\[ \begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ \vdots \\ y_{GR} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & ... \\ 1 & 0 & 0 & ... \\ 0 & 1 & 0 & ... \\ 0 & 1 & 0 & ... \\ \vdots & \vdots & \vdots & ... \end{bmatrix} \times \begin{bmatrix} m_1 \\ m_2 \\ \vdots \\ m_G \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{21} \\ \epsilon_{22} \\ \vdots \\ \epsilon_{GR} \end{bmatrix} \]

Using the rule of matrix multiplication, one can easily retrieve the following equations:

  • \(y_{11} = m_1 + \epsilon_{11}\)

  • \(y_{12} = m_1 + \epsilon_{12}\)

  • \(y_{21} = m_2 + \epsilon_{21}\)

Here is an example in R:

## example with G=3 genotypes and R=2 reps
y <- dat$yield[dat$rep %in% c("A","B")]
(X_1 <- matrix(c(1,0,0,
                 1,0,0,
                 0,1,0,
                 0,1,0,
                 0,0,1,
                 0,0,1),
               nrow=6, ncol=3, byrow=TRUE))
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    1    0    0
## [3,]    0    1    0
## [4,]    0    1    0
## [5,]    0    0    1
## [6,]    0    0    1
tXX_1 <- t(X_1) %*% X_1
inv_tXX_1 <- solve(tXX_1)
(m.hat <- inv_tXX_1 %*% t(X_1) %*% y)
##          [,1]
## [1,] 2.517998
## [2,] 3.524857
## [3,] 6.680426
(epsilons.hat <- y - X_1 %*% m.hat)
##             [,1]
## [1,] -0.04957488
## [2,]  0.04957488
## [3,] -0.97922181
## [4,]  0.97922181
## [5,] -0.77342394
## [6,]  0.77342394
mean(epsilons.hat)
## [1] -1.480207e-16

5.1.3.2 Uncertainty quantification

Obtaining an estimate of the parameters is important, but quantifying our uncertainty about it is better.

Here is the variance of the estimator \(\hat{\boldsymbol{m}}\):

\[ \mathbb{V}(\hat{\boldsymbol{m}}) = (X_1^T X_1)^{-1} \sigma^2 \]

To compute it, one hence needs to estimate the error variance. The (unbiased) estimator of \(\sigma^2\) will be denoted \(S^2\):

\[ S^2 = \frac{RSS}{N - r} \]

where \(RSS\) is the residual sum of squares (\(\sum_i (y_i - \hat{y_i})^2\)) and \(r\) is the number of linearly independent columns of \(X_1\) (that is, the rank of \(X_1\); here \(r = G\)).
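
These formulas can be checked on the small \(G=3\), \(R=2\) example from the previous section. Below is a minimal sketch (the object names RSS, r, S2 and var.m.hat are ours); it reuses y, X_1, inv_tXX_1 and epsilons.hat from the code block above:

RSS <- sum(epsilons.hat^2)      # residual sum of squares
r <- qr(X_1)$rank               # rank of X_1 (here, 3 independent columns)
(S2 <- RSS / (length(y) - r))   # estimate of the error variance
(var.m.hat <- inv_tXX_1 * S2)   # V(m.hat) = (X_1^T X_1)^-1 * S2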

5.1.3.3 Hypothesis testing

Say our null hypothesis is “\(m_g = a\)”. One can reject it at level \(l\) if:

\[ \frac{|\hat{m}_g - a|}{\sqrt{\mathbb{V}(\hat{m}_g)}} \ge t_{1-l/2, N-r} \]

where \(t_{1-l/2, N-r}\) is the quantile of order \(1-l/2\) of Student's \(t\) distribution with \(N-r\) degrees of freedom.
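
As an illustration, here is a sketch of this test for the null hypothesis “\(m_1 = 0\)” at level \(l = 0.05\), reusing m.hat and var.m.hat from the sketch above:

l <- 0.05
(test.stat <- abs(m.hat[1] - 0) / sqrt(var.m.hat[1, 1]))
test.stat >= qt(p=1 - l/2, df=length(y) - r)  # TRUE => reject H0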

5.2 Version 2

The “cell-means” model (v1 above) is intuitive, but it can be re-parametrized into another, equivalent model with an intercept.

5.2.1 Likelihood

5.2.1.1 Simple way

A few other notations:

  • \(\mu\): intercept

\[ \forall g \in \{1,\ldots,G\} \text{ and } r \in \{1,\ldots,R\}, \; y_{gr} = \mu + m_g + \epsilon_{gr} \text{ with } \epsilon_{gr} \sim \mathcal{N}(0, \sigma^2) \]

5.2.1.2 Matrix way

A few other notations:

  • \(X_2\): \(N \times (G+1)\) design/incidence matrix relating observations (the \(y_{gr}\)’s) to genotypes

  • \(\boldsymbol{\theta} = \{\mu,m_1,\ldots,m_G\}\): \((G+1)\)-vector of parameters explaining the mean of the response variable

\[ \boldsymbol{y} = X_2 \boldsymbol{\theta} + \boldsymbol{\epsilon} \text{ where } \boldsymbol{\epsilon} \sim \mathcal{N}_N(\boldsymbol{0}, \sigma^2 \text{I}_N) \]

Note that this model has \(G+1\) parameters for the expectation of \(\boldsymbol{y}\).

5.2.2 Inference

5.2.2.1 Effect estimation

Let us do the same exercise as before, that is, write down the contents of the model vectors/matrices:

\[ \begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ \vdots \\ y_{GR} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 & ... \\ 1 & 1 & 0 & 0 & ... \\ 1 & 0 & 1 & 0 & ... \\ 1 & 0 & 1 & 0 & ... \\ \vdots & \vdots & \vdots & \vdots & ... \end{bmatrix} \times \begin{bmatrix} \mu \\ m_1 \\ m_2 \\ \vdots \\ m_G \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{21} \\ \epsilon_{22} \\ \vdots \\ \epsilon_{GR} \end{bmatrix} \]

Again as above, using the rule of matrix multiplication, one can easily retrieve the following equations:

  • \(y_{11} = \mu + m_1 + \epsilon_{11}\)

  • \(y_{12} = \mu + m_1 + \epsilon_{12}\)

  • \(y_{21} = \mu + m_2 + \epsilon_{21}\)

However, note that the first column of \(X_2\) (corresponding to \(\mu\)) is equal to the sum of all its other columns.

Here is an example in R:

## example with G=3 genotypes and R=2 reps
y <- dat$yield[dat$rep %in% c("A","B")]
(X_2 <- matrix(c(1,1,0,0,
                 1,1,0,0,
                 1,0,1,0,
                 1,0,1,0,
                 1,0,0,1,
                 1,0,0,1),
               nrow=6, ncol=3+1, byrow=TRUE))
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    0    0
## [2,]    1    1    0    0
## [3,]    1    0    1    0
## [4,]    1    0    1    0
## [5,]    1    0    0    1
## [6,]    1    0    0    1
kappa(X_2) # the higher, the more singular X is
## [1] 4.160247e+16
tXX_2 <- t(X_2) %*% X_2
eigen(tXX_2)$values # the matrix is singular if at least one eigenvalue is zero
## [1] 8.000000e+00 2.000000e+00 2.000000e+00 5.329071e-15
try(inv_tXX_2 <- solve(tXX_2))
## Error in solve.default(tXX_2) : 
##   system is computationally singular: reciprocal condition number = 1.38778e-17

This lack of independence between the columns of \(X_2\) (such a matrix is said to be singular) means that one needs to add constraints (only one here) to estimate the parameters for the expectation of \(\boldsymbol{y}\).
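
To see this concretely, here is a quick sketch: dropping any single genotype column of \(X_2\) restores full column rank, which is in essence what the dummy coding of section 6 achieves in a principled way:

X_2r <- X_2[, -2]                  # e.g. drop the column of the first genotype
qr(X_2r)$rank                      # 3, i.e. full column rank
round(solve(t(X_2r) %*% X_2r), 2)  # the cross-product is now invertible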

5.3 Contrasts

Still a few other notations:

  • \(\boldsymbol{\alpha}\): \(G\)-vector of regression coefficients

    • the first element corresponds to the intercept: \(\alpha_0 := \mu\)

    • the others correspond to a \((G-1)\)-vector denoted \(\boldsymbol{\alpha}_\star\)

  • \(\{c_0,\ldots,c_{G-1}\}\): set of \(G\) real numbers

As explained above, to deal with the non-independence between the columns of the incidence matrix \(X\), one needs to add constraint(s). Typically, a constraint takes the form of a linear combination of the regression coefficients:

\[\begin{equation} \label{eq:contrast_lin_comb} \sum_{j=0}^{G-1} c_j \, \alpha_j \; = \; c_0 \times \alpha_0 + c_1 \times \alpha_1 + \ldots + c_{G-1} \times \alpha_{G-1} \end{equation}\]

But it has one specific property: the coefficients of this linear combination sum to zero, \(\sum_{j=0}^{G-1} c_j = 0\). Such a linear combination is called a contrast, and this is what will be estimated.
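
For example, with \(G = 3\), the vector \((1, -1/2, -1/2)\) defines a contrast comparing the first genotype with the average of the other two. A quick sketch, using the geno.means computed in section 4:

cg <- c(1, -1/2, -1/2)
sum(cg)                  # 0, hence a valid contrast
drop(cg %*% geno.means)  # its estimate from the data means: m1 - (m2 + m3)/2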

Obviously, one wants to be able to go from \(\boldsymbol{m}\) to \(\boldsymbol{\alpha}\), and vice versa. Such transformations are easily written with matrix algebra:

\[ \boldsymbol{\alpha} = C \, \boldsymbol{m} \; \Leftrightarrow \; \boldsymbol{m} = B \, \boldsymbol{\alpha} \]

where \(B = C^{-1}\) and both are \(G \times G\) matrices.

One typically chooses \(B\) so that its first column is made of 1’s: \(B = [ \boldsymbol{1}_G \; B_\star ]\).

\(\Rightarrow\) The \(B_\star\) matrix is \(G \times (G-1)\) and is called a coding matrix.

This leads to:

\[ \boldsymbol{m} = B \boldsymbol{\alpha} = [ \boldsymbol{1}_G \; B_\star ] [\alpha_0 \ldots \alpha_{G-1}]^T = \boldsymbol{1}_G \mu + B_\star \boldsymbol{\alpha}_\star \]

Similarly, \(C\) will be partitioned by separating the first row: \(C = \begin{bmatrix} \boldsymbol{c}_0^T \\ C_\star \end{bmatrix}\).

\(\Rightarrow\) The \(C_\star\) matrix is \((G-1) \times G\) and is called a contrast matrix.

Its rows are the contrasts we want to estimate: they sum to zero.


CAUTION: in R, the contr.treatment, contr.sum, contr.helmert, contr.poly, contr.diff, etc. functions take factor levels as input and return a coding matrix (\(B_\star\)), not a contrast matrix (\(C_\star\))!

Several possible coding schemes exist, each leading to different contrasts, and the interpretation of the contrasts depends on the coding used to obtain them.
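
Given the caution above, the conversion from a coding matrix to the corresponding contrast matrix can be wrapped in a small helper. This is only a sketch (the name coding_to_contrasts is ours, not from any package); the same recipe is detailed step by step in the next two sections:

coding_to_contrasts <- function(Bstar){
  B <- cbind(intercept=1, Bstar)  # add a column of 1's to obtain B
  C <- solve(B)                   # invert it to obtain C
  C[-1, , drop=FALSE]             # drop the first row to obtain Cstar
}
coding_to_contrasts(contr.treatment(levels(dat$geno)))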

5.3.1 Estimation

As a contrast is a linear combination of parameters, say \(C_g \, \boldsymbol{m}\) (where \(C_g\) is the \(g^\text{th}\) row of \(C\)), one can easily compute its estimate:

\[ C_g \, \hat{\boldsymbol{m}} \]

5.3.2 Uncertainty quantification

The variance of a given contrast will be: \[ C_g \, \mathbb{V}(\hat{\boldsymbol{m}}) \, C_g^T \]

5.3.3 Hypothesis testing

The null hypothesis “\(C_g \, \boldsymbol{m} = a\)” is rejected at level \(l\) if:

\[ \frac{|C_g \, \hat{\boldsymbol{m}} - a|}{\sqrt{C_g \, \mathbb{V}(\hat{\boldsymbol{m}}) \, C_g^T}} \ge t_{1-l/2, N-r} \]
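
These three formulas can be bundled into a small hypothetical helper (the name test_contrast is ours); the sketch below will be reused in section 6 once a model has been fitted:

test_contrast <- function(Cg, m.hat, V.m.hat, df, a=0){
  est <- drop(Cg %*% m.hat)                # contrast estimate: Cg %*% m.hat
  se <- sqrt(drop(Cg %*% V.m.hat %*% Cg))  # sqrt(Cg V(m.hat) Cg^T)
  t.stat <- (est - a) / se                 # test statistic
  p.val <- 2 * pt(abs(t.stat), df=df, lower.tail=FALSE)
  c(estimate=est, se=se, t.stat=t.stat, p.value=p.val)
}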

6 Dummy coding (contr.treatment)

6.1 Look at the contrasts

In R, this coding is obtained with contr.treatment, which also happens to be the default coding for unordered factors.

getOption("contrasts")
##         unordered           ordered 
## "contr.treatment"      "contr.poly"
contrasts(dat$geno)
##      var2 var3
## var1    0    0
## var2    1    0
## var3    0    1
(Bstar.ct <- contr.treatment(levels(dat$geno)))
##      var2 var3
## var1    0    0
## var2    1    0
## var3    0    1

As explained above, the columns of Bstar (\(B_\star\)) are clearly not contrasts as they don’t sum to zero. But obtaining the contrast matrix is easy:

  1. add a column of 1’s to obtain the \(B\) matrix,

  2. then invert it to obtain the \(C\) matrix,

  3. and remove the first row to obtain the \(C_\star\) matrix.

(B.ct <- cbind(intercept=1, Bstar.ct))
##      intercept var2 var3
## var1         1    0    0
## var2         1    1    0
## var3         1    0    1
(C.ct <- solve(B.ct))
##           var1 var2 var3
## intercept    1    0    0
## var2        -1    1    0
## var3        -1    0    1
(Cstar.ct <- C.ct[-1,])
##      var1 var2 var3
## var2   -1    1    0
## var3   -1    0    1

As explained above, the rows of Cstar (\(C_\star\)) are contrasts and they sum to zero:

apply(Cstar.ct, 1, sum)
## var2 var3 
##    0    0

Given that \(\boldsymbol{\alpha} = C \boldsymbol{m}\), these rows show how to interpret the regression coefficients (the \(\alpha\)’s) in terms of the factor level means (the \(m\)’s):

  • \(\alpha_0 = 1 \times m_1 + 0 \times m_2 + 0 \times m_3 = m_1\)

  • \(\alpha_1 = -1 \times m_1 + 1 \times m_2 + 0 \times m_3 = m_2 - m_1\)

  • \(\alpha_2 = -1 \times m_1 + 0 \times m_2 + 1 \times m_3 = m_3 - m_1\)

The intercept is the mean of a factor level chosen arbitrarily as the reference (the first level by default). It usually makes sense to choose this coding when one level is a natural reference for the others, such as a control.
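
If another level (say var3) were the natural reference, the base level can be changed before fitting; a sketch:

contr.treatment(levels(dat$geno), base=3)  # coding matrix with var3 as reference
# equivalently: fit <- lm(yield ~ relevel(geno, ref="var3"), data=dat)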

6.2 Fit the model

fit.ct <- lm(yield ~ geno, data=dat, contrasts=list(geno="contr.treatment"))
summary(fit.ct)
## 
## Call:
## lm(formula = yield ~ geno, data = dat, contrasts = list(geno = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0276 -0.3481  0.1216  0.2625  1.1592 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2141     0.2914   7.597 6.36e-06 ***
## genovar2      1.3591     0.4122   3.297  0.00637 ** 
## genovar3      4.0805     0.4122   9.900 3.99e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6517 on 12 degrees of freedom
## Multiple R-squared:  0.8944, Adjusted R-squared:  0.8768 
## F-statistic: 50.83 on 2 and 12 DF,  p-value: 1.385e-06
(alpha.ct.hat <- coef(fit.ct))
## (Intercept)    genovar2    genovar3 
##    2.214093    1.359103    4.080527

Note that all three estimates are deemed significantly different from 0.

Residual sum of squares and estimate of error variance:

(tmp <- anova(fit.ct))
## Analysis of Variance Table
## 
## Response: yield
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## geno       2 43.173 21.5867  50.828 1.385e-06 ***
## Residuals 12  5.096  0.4247                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(RSS.ct <- tmp["Residuals", "Sum Sq"])
## [1] 5.096397
summary(fit.ct)$sigma
## [1] 0.6516899
summary(fit.ct)$sigma^2
## [1] 0.4246998
(r.ct <- fit.ct$rank)
## [1] 3
(S2.ct <- RSS.ct / (N - r.ct))
## [1] 0.4246998
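
As a check, the genovar2 row of the summary above can be recovered with the hypothetical test_contrast helper sketched in section 5.3, using the cell-means incidence matrix (a sketch; in this balanced design, the least-squares estimates of the cell means are the geno.means):

X1 <- model.matrix(~ 0 + geno, data=dat)  # cell-means incidence matrix
V.m.hat <- S2.ct * solve(t(X1) %*% X1)    # V(m.hat) = (X1^T X1)^-1 * S2
test_contrast(Cg=c(-1, 1, 0), m.hat=geno.means,
              V.m.hat=V.m.hat, df=N - r.ct)
# should match the genovar2 row above: estimate 1.359, SE 0.412, t 3.297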

6.3 Retrieve the contrast estimates from the data means

6.3.1 Reference level

\(\alpha_0 = \mu = m_1\)

alpha.ct.hat[1+0]
## (Intercept) 
##    2.214093
geno.means[1]
##     var1 
## 2.214093
C.ct[1,,drop=FALSE] %*% geno.means
##               [,1]
## intercept 2.214093

6.3.2 Estimate of the first contrast

\(\alpha_1 = m_2 - m_1\)

alpha.ct.hat[1+1]
## genovar2 
## 1.359103
geno.means[2] - geno.means[1]
##     var2 
## 1.359103
C.ct[2,,drop=FALSE] %*% geno.means
##          [,1]
## var2 1.359103

6.3.3 Estimate of the second contrast

\(\alpha_2 = m_3 - m_1\)

alpha.ct.hat[1+2]
## genovar3 
## 4.080527
geno.means[3] - geno.means[1]
##     var3 
## 4.080527
C.ct[3,,drop=FALSE] %*% geno.means
##          [,1]
## var3 4.080527

6.3.4 As a matrix multiplication

alpha.ct.hat
## (Intercept)    genovar2    genovar3 
##    2.214093    1.359103    4.080527
C.ct %*% geno.means
##               [,1]
## intercept 2.214093
## var2      1.359103
## var3      4.080527

Knowing the interpretation of the intercept in terms of data means and looking at the boxplots below, it makes sense that all three regression coefficients are significantly different from zero:

boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE,
        main="Boxplots of dat$yield")
abline(h=alpha.ct.hat[1], lty=2)
legend("topleft", legend="intercept", lty=2, bty="n")

6.4 Retrieve the data means from the contrast estimates

As a matrix multiplication:

geno.means
##     var1     var2     var3 
## 2.214093 3.573196 6.294620
B.ct %*% alpha.ct.hat
##          [,1]
## var1 2.214093
## var2 3.573196
## var3 6.294620

7 Deviation coding (contr.sum)

7.1 Look at the contrasts

In R, this coding is obtained with contr.sum.

options(contrasts=c("contr.sum", "contr.poly"))
getOption("contrasts")
## [1] "contr.sum"  "contr.poly"
contrasts(dat$geno)
##      [,1] [,2]
## var1    1    0
## var2    0    1
## var3   -1   -1
(Bstar.cs <- contr.sum(levels(dat$geno)))
##      [,1] [,2]
## var1    1    0
## var2    0    1
## var3   -1   -1

As explained above, the columns of Bstar (\(B_\star\)) are not contrasts (even though here they sum to zero!). But obtaining the contrast matrix is easy:

  1. add a column of 1’s to obtain the \(B\) matrix,

  2. then invert it to obtain the \(C\) matrix,

  3. and remove the first row to obtain the \(C_\star\) matrix.

(B.cs <- cbind(intercept=1, Bstar.cs))
##      intercept      
## var1         1  1  0
## var2         1  0  1
## var3         1 -1 -1
(C.cs <- solve(B.cs))
##                 var1       var2       var3
## intercept  0.3333333  0.3333333  0.3333333
##            0.6666667 -0.3333333 -0.3333333
##           -0.3333333  0.6666667 -0.3333333
(Cstar.cs <- C.cs[-1,])
##        var1       var2       var3
##   0.6666667 -0.3333333 -0.3333333
##  -0.3333333  0.6666667 -0.3333333

As explained above, the rows of Cstar (\(C_\star\)) are contrasts and they sum to zero:

apply(Cstar.cs, 1, sum)
##                             
## -5.551115e-17  0.000000e+00

Given that \(\boldsymbol{\alpha} = C \boldsymbol{m}\), these rows show how to interpret the regression coefficients (the \(\alpha\)’s) in terms of the factor level means (the \(m\)’s):

  • \(\alpha_0 = \mu = \frac{1}{G} \sum_{g=1}^G m_g\)

  • \(\alpha_1 = m_1 - \mu\)

  • \(\alpha_2 = m_2 - \mu\)

The intercept is the mean of all the factor level means. It usually makes sense to choose this coding when no level is a natural reference for the others.

Even though it is not returned by the lm function below, the contrast of the last factor level is of interest and can be deduced from the others:

  • \(\alpha_3 = m_3 - \mu\)
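
Indeed, the deviations \(m_g - \mu\) sum to zero by construction, so the last one is minus the sum of the others. A quick check with the data means:

devs <- geno.means - mean(geno.means)  # all three deviations
-sum(devs[1:2])                        # equals the deviation of var3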

7.2 Fit the model

fit.cs <- lm(yield ~ geno, data=dat, contrasts=list(geno="contr.sum"))
summary(fit.cs)
## 
## Call:
## lm(formula = yield ~ geno, data = dat, contrasts = list(geno = "contr.sum"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0276 -0.3481  0.1216  0.2625  1.1592 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.0273     0.1683  23.934 1.70e-11 ***
## geno1        -1.8132     0.2380  -7.620 6.17e-06 ***
## geno2        -0.4541     0.2380  -1.908   0.0806 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6517 on 12 degrees of freedom
## Multiple R-squared:  0.8944, Adjusted R-squared:  0.8768 
## F-statistic: 50.83 on 2 and 12 DF,  p-value: 1.385e-06
(alpha.cs.hat <- coef(fit.cs))
## (Intercept)       geno1       geno2 
##   4.0273032  -1.8132101  -0.4541069

Note that only the estimates of the intercept and the first contrast are deemed significantly different from 0.

Residual sum of squares and estimate of error variance:

(tmp <- anova(fit.cs))
## Analysis of Variance Table
## 
## Response: yield
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## geno       2 43.173 21.5867  50.828 1.385e-06 ***
## Residuals 12  5.096  0.4247                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(RSS.cs <- tmp["Residuals", "Sum Sq"])
## [1] 5.096397
summary(fit.cs)$sigma
## [1] 0.6516899
summary(fit.cs)$sigma^2
## [1] 0.4246998
(r.cs <- fit.cs$rank)
## [1] 3
(S2.cs <- RSS.cs / (N - r.cs))
## [1] 0.4246998

7.3 Retrieve the contrast estimates from the data means

7.3.1 Reference level

\(\alpha_0 = \mu = \frac{1}{G} \sum_{g=1}^G m_g\)

alpha.cs.hat[1+0]
## (Intercept) 
##    4.027303
mean(geno.means)
## [1] 4.027303

7.3.2 Estimate of the first contrast

\(\alpha_1 = m_1 - \mu\)

alpha.cs.hat[1+1]
##    geno1 
## -1.81321
geno.means[1] - mean(geno.means)
##     var1 
## -1.81321

7.3.3 Estimate of the second contrast

\(\alpha_2 = m_2 - \mu\)

alpha.cs.hat[1+2]
##      geno2 
## -0.4541069
geno.means[2] - mean(geno.means)
##       var2 
## -0.4541069

7.3.4 Estimate of the last contrast

\(\alpha_3 = m_3 - \mu\)

geno.means[3] - mean(geno.means)
##     var3 
## 2.267317

7.3.5 As a matrix multiplication

alpha.cs.hat
## (Intercept)       geno1       geno2 
##   4.0273032  -1.8132101  -0.4541069
C.cs %*% geno.means
##                 [,1]
## intercept  4.0273032
##           -1.8132101
##           -0.4541069

Knowing the interpretation of the intercept in terms of data means and looking at the boxplots below, it makes sense that the regression coefficients are significantly different from zero, except the one corresponding to var2:

boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE,
        main="Boxplots of dat$yield")
abline(h=alpha.cs.hat[1], lty=2)
legend("topleft", legend="intercept", lty=2, bty="n")

7.4 Retrieve the data means from the contrast estimates

As a matrix multiplication:

geno.means
##     var1     var2     var3 
## 2.214093 3.573196 6.294620
B.cs %*% alpha.cs.hat
##          [,1]
## var1 2.214093
## var2 3.573196
## var3 6.294620

7.5 By hand

7.5.1 Same as with lm

In this section, we show how to obtain all the information we want (incidence matrix, contrast estimates, standard errors, test statistics and p values) using only low-level functions.

The model.matrix function returns the design/incidence matrix:

X.cs <- model.matrix(yield ~ geno, data=dat,
                     contrasts.arg=list(geno="contr.sum"))
print(X.cs)
##    (Intercept) geno1 geno2
## 1            1     1     0
## 2            1     1     0
## 3            1     1     0
## 4            1     1     0
## 5            1     1     0
## 6            1     0     1
## 7            1     0     1
## 8            1     0     1
## 9            1     0     1
## 10           1     0     1
## 11           1    -1    -1
## 12           1    -1    -1
## 13           1    -1    -1
## 14           1    -1    -1
## 15           1    -1    -1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$geno
## [1] "contr.sum"

But we can also compute it as \(X_1 B\):

X.1 <- model.matrix(yield ~ 0 + geno, data=dat,
                    contrasts.arg=list(geno="contr.sum"))
## X.1 <- matrix(data=0, nrow=nrow(dat), ncol=nlevels(dat$geno))
## colnames(X.1) <- levels(dat$geno)
## for(j in 1:ncol(X.1))
##   X.1[,j] <- as.numeric(dat$geno == colnames(X.1)[j])
X.1
##    genovar1 genovar2 genovar3
## 1         1        0        0
## 2         1        0        0
## 3         1        0        0
## 4         1        0        0
## 5         1        0        0
## 6         0        1        0
## 7         0        1        0
## 8         0        1        0
## 9         0        1        0
## 10        0        1        0
## 11        0        0        1
## 12        0        0        1
## 13        0        0        1
## 14        0        0        1
## 15        0        0        1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$geno
## [1] "contr.sum"
B.cs
##      intercept      
## var1         1  1  0
## var2         1  0  1
## var3         1 -1 -1
(X.cs <- X.1 %*% B.cs)
##    intercept      
## 1          1  1  0
## 2          1  1  0
## 3          1  1  0
## 4          1  1  0
## 5          1  1  0
## 6          1  0  1
## 7          1  0  1
## 8          1  0  1
## 9          1  0  1
## 10         1  0  1
## 11         1 -1 -1
## 12         1 -1 -1
## 13         1 -1 -1
## 14         1 -1 -1
## 15         1 -1 -1

Estimates of regression coefficients:

tXX_cs <- t(X.cs) %*% X.cs
inv_tXX_cs <- solve(tXX_cs)
(alpha.cs.hat <- inv_tXX_cs %*% t(X.cs) %*% dat$yield)
##                 [,1]
## intercept  4.0273032
##           -1.8132101
##           -0.4541069

Residual sum of squares and estimate of error variance:

epsilon.cs.hat <- dat$yield - X.cs %*% alpha.cs.hat
(RSS.cs <- t(epsilon.cs.hat) %*% epsilon.cs.hat)
##          [,1]
## [1,] 5.096397
(S2.cs <- RSS.cs / (N - r.cs))
##           [,1]
## [1,] 0.4246998
S2.cs <- S2.cs[1,1]

Variance of estimates of regression coefficients:

(var.alpha.cs.hat <- inv_tXX_cs * S2.cs)
##            intercept                        
## intercept 0.02831332  0.00000000  0.00000000
##           0.00000000  0.05662664 -0.02831332
##           0.00000000 -0.02831332  0.05662664
(se.alpha.cs.hat <- sqrt(diag(var.alpha.cs.hat)))
## intercept                     
## 0.1682656 0.2379635 0.2379635

Test statistics:

(test.stats.cs <- alpha.cs.hat / se.alpha.cs.hat)
##                [,1]
## intercept 23.934201
##           -7.619698
##           -1.908305

p values:

(pvals.cs <- 2 * pt(q=abs(test.stats.cs), df=N-r.cs, lower.tail=FALSE))
##                   [,1]
## intercept 1.698261e-11
##           6.168255e-06
##           8.055764e-02

7.5.2 Custom set of contrasts

Now that we know how to compute the same contrast estimates by hand as with lm, we can decide to compute the estimates for the contrasts corresponding to, say, the second and third factor levels, omitting the first one. This is done by carefully building the \(B_\star\) matrix, denoted here Bstar.cs2 (a second version of the coding matrix corresponding to the sum coding); all the same equations then apply:

Bstar.cs2 <- matrix(data=c(-1, -1,
                           1, 0,
                           0, 1),
                    byrow=TRUE,
                    nrow=G, ncol=G-1)
rownames(Bstar.cs2) <- lev.genos
(B.cs2 <- cbind(intercept=1, Bstar.cs2))
##      intercept      
## var1         1 -1 -1
## var2         1  1  0
## var3         1  0  1
X.cs2 <- X.1 %*% B.cs2

As expected, we obtain the same estimates for the intercept and the contrast corresponding to the second factor level, but we also now obtain the estimate for the last factor level, which we didn’t obtain with the usual call to lm above:

tXX_cs2 <- t(X.cs2) %*% X.cs2
inv_tXX_cs2 <- solve(tXX_cs2)
(alpha.cs2.hat <- inv_tXX_cs2 %*% t(X.cs2) %*% dat$yield)
##                 [,1]
## intercept  4.0273032
##           -0.4541069
##            2.2673170

The rest of the information (standard errors, test statistics and p values) is also obtained with the same equations as above:

epsilon.cs2.hat <- dat$yield - X.cs2 %*% alpha.cs2.hat
(RSS.cs2 <- t(epsilon.cs2.hat) %*% epsilon.cs2.hat)
##          [,1]
## [1,] 5.096397
(S2.cs2 <- RSS.cs2 / (N - r.cs))
##           [,1]
## [1,] 0.4246998
S2.cs2 <- S2.cs2[1,1]
(var.alpha.cs2.hat <- inv_tXX_cs2 * S2.cs2)
##            intercept                        
## intercept 0.02831332  0.00000000  0.00000000
##           0.00000000  0.05662664 -0.02831332
##           0.00000000 -0.02831332  0.05662664
(se.alpha.cs2.hat <- sqrt(diag(var.alpha.cs2.hat)))
## intercept                     
## 0.1682656 0.2379635 0.2379635
(test.stats.cs2 <- alpha.cs2.hat / se.alpha.cs2.hat)
##                [,1]
## intercept 23.934201
##           -1.908305
##            9.528003
(pvals.cs2 <- 2 * pt(q=abs(test.stats.cs2), df=N-r.cs, lower.tail=FALSE))
##                   [,1]
## intercept 1.698261e-11
##           8.055764e-02
##           6.023008e-07

8 Marginal means

The emmeans package aims to simplify all this:

emm <- emmeans(object=fit.cs, specs="geno")
summary(emm, level=0.95)
##  geno emmean    SE df lower.CL upper.CL
##  var1   2.21 0.291 12     1.58     2.85
##  var2   3.57 0.291 12     2.94     4.21
##  var3   6.29 0.291 12     5.66     6.93
## 
## Confidence level used: 0.95
contrast(object=emm, method="dunnett") # same as contr.treatment
##  contrast    estimate    SE df t.ratio p.value
##  var2 - var1     1.36 0.412 12   3.297  0.0121
##  var3 - var1     4.08 0.412 12   9.900  <.0001
## 
## P value adjustment: dunnettx method for 2 tests
contrast(object=emm, method="eff")     # same as contr.sum
##  contrast    estimate    SE df t.ratio p.value
##  var1 effect   -1.813 0.238 12  -7.620  <.0001
##  var2 effect   -0.454 0.238 12  -1.908  0.0806
##  var3 effect    2.267 0.238 12   9.528  <.0001
## 
## P value adjustment: fdr method for 3 tests
test(emm, null=mean(dat$yield)) # testing if each mean is equal to the grand mean
##  geno emmean    SE df null t.ratio p.value
##  var1   2.21 0.291 12 4.03  -6.221  <.0001
##  var2   3.57 0.291 12 4.03  -1.558  0.1452
##  var3   6.29 0.291 12 4.03   7.780  <.0001
test(emm, null=coef(fit.cs)[1]) # testing if each mean is equal to the intercept with contr.sum
##  geno emmean    SE df null t.ratio p.value
##  var1   2.21 0.291 12 4.03  -6.221  <.0001
##  var2   3.57 0.291 12 4.03  -1.558  0.1452
##  var3   6.29 0.291 12 4.03   7.780  <.0001

Comparisons:

pairs(emm, adjust="tukey") # Tukey's HSD test
##  contrast    estimate    SE df t.ratio p.value
##  var1 - var2    -1.36 0.412 12  -3.297  0.0163
##  var1 - var3    -4.08 0.412 12  -9.900  <.0001
##  var2 - var3    -2.72 0.412 12  -6.603  0.0001
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
(tukey <- multcomp::cld(object=emm, adjust="Tukey", alpha=0.05, Letters=letters))
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
##  geno emmean    SE df lower.CL upper.CL .group
##  var1   2.21 0.291 12     1.41     3.02  a    
##  var2   3.57 0.291 12     2.77     4.38   b   
##  var3   6.29 0.291 12     5.49     7.10    c  
## 
## Confidence level used: 0.95 
## Conf-level adjustment: sidak method for 3 estimates 
## P value adjustment: tukey method for comparing a family of 3 estimates 
## significance level used: alpha = 0.05 
## NOTE: Compact letter displays can be misleading
##       because they show NON-findings rather than findings.
##       Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead.
plot(tukey) + ggplot2::theme_bw()

9 Perspectives

  • ANOVA table (aov) and types of sums of squares

  • other contrasts: contr.helmert, etc

  • two-way ANOVA with interactions

10 References

  • vignette of the codingMatrices package from Bill Venables

  • notes of a course on applied linear models devoted to one-way ANOVA from Bo Li (Purdue University)

  • tutorial on coding systems from the Institute for Digital Research and Education at UCLA

  • Piepho H. 2018. Letters in Mean Comparisons: What They Do and Don’t Mean. Agronomy Journal. 110(2):431–434. https://doi.org/10.2134/agronj2017.10.0580

11 Appendix

print(sessionInfo(), locale=FALSE)
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 12 (bookworm)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] emmeans_1.7.4-1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0    zoo_1.8-10          xfun_0.42          
##  [4] bslib_0.4.0         splines_4.2.2       lattice_0.20-45    
##  [7] generics_0.1.2      colorspace_2.0-3    vctrs_0.6.5        
## [10] htmltools_0.5.3     yaml_2.3.5          utf8_1.2.2         
## [13] survival_3.5-3      rlang_1.1.4         jquerylib_0.1.4    
## [16] pillar_1.9.0        withr_2.5.0         glue_1.6.2         
## [19] multcomp_1.4-26     lifecycle_1.0.3     multcompView_0.1-10
## [22] munsell_0.5.0       gtable_0.3.0        mvtnorm_1.1-3      
## [25] codetools_0.2-19    coda_0.19-4         evaluate_0.15      
## [28] labeling_0.4.2      knitr_1.45          fastmap_1.1.0      
## [31] fansi_1.0.3         highr_0.9           TH.data_1.1-1      
## [34] xtable_1.8-4        scales_1.3.0        cachem_1.0.6       
## [37] jsonlite_1.8.0      farver_2.1.0        ggplot2_3.5.0      
## [40] digest_0.6.29       dplyr_1.1.4         grid_4.2.2         
## [43] cli_3.6.2           tools_4.2.2         sandwich_3.0-1     
## [46] magrittr_2.0.3      sass_0.4.2          tibble_3.2.1       
## [49] pkgconfig_2.0.3     MASS_7.3-57         Matrix_1.5-3       
## [52] estimability_1.3    rmarkdown_2.25      rstudioapi_0.13    
## [55] R6_2.5.1            compiler_4.2.2