This document was generated from a file in the Rmd format. Both the source and target files are under the CC BY-SA license. The source file is versioned online.
This tutorial aims at explaining contrasts in R using one-way ANOVA in a completely balanced setting. It draws heavily on the references listed at the end of the document.
In the example, the response variable, yield, is regressed on an unordered factor, genotype. A small data set is first simulated, then a first version of the statistical model is described, followed by a second version, which motivates the use of contrasts. Finally, two different contrasts are described in detail.
Computations are performed using base R functions as well as “by hand”. It is also shown how to obtain the required information (incidence matrix, contrast estimates, standard errors, test statistics and p values) using an external package, emmeans:
suppressPackageStartupMessages(library(emmeans))
G <- 3 # number of genotypes
R <- 5 # number of replicates per genotype
(lev.genos <- paste0("var", 1:G))
## [1] "var1" "var2" "var3"
(lev.reps <- LETTERS[1:R])
## [1] "A" "B" "C" "D" "E"
dat <- data.frame(geno=rep(lev.genos, each=R),
rep=rep(lev.reps, G),
yield=NA,
stringsAsFactors=FALSE)
dat
## geno rep yield
## 1 var1 A NA
## 2 var1 B NA
## 3 var1 C NA
## 4 var1 D NA
## 5 var1 E NA
## 6 var2 A NA
## 7 var2 B NA
## 8 var2 C NA
## 9 var2 D NA
## 10 var2 E NA
## 11 var3 A NA
## 12 var3 B NA
## 13 var3 C NA
## 14 var3 D NA
## 15 var3 E NA
set.seed(12345) # to make the simulation reproducible
N <- nrow(dat) # total number of observations
stopifnot(G == 3)
(true.geno.means <- setNames(c(2, 4, 6), lev.genos))
## var1 var2 var3
## 2 4 6
(true.mean <- mean(true.geno.means))
## [1] 4
for(i in 1:N){
true.geno.mean <- true.geno.means[dat$geno[i]]
dat$yield[i] <- true.geno.mean
}
sd.error <- 0.8
epsilons <- rnorm(n=N, mean=0, sd=sd.error)
dat$yield <- dat$yield + epsilons
dat$geno <- as.factor(dat$geno)
dat$rep <- as.factor(dat$rep)
dat
## geno rep yield
## 1 var1 A 2.468423
## 2 var1 B 2.567573
## 3 var1 C 1.912557
## 4 var1 D 1.637202
## 5 var1 E 2.484710
## 6 var2 A 2.545635
## 7 var2 B 4.504079
## 8 var2 C 3.779053
## 9 var2 D 3.772672
## 10 var2 E 3.264542
## 11 var3 A 5.907002
## 12 var3 B 7.453850
## 13 var3 C 6.296502
## 14 var3 D 6.416173
## 15 var3 E 5.399574
str(dat)
## 'data.frame': 15 obs. of 3 variables:
## $ geno : Factor w/ 3 levels "var1","var2",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ rep : Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5 1 2 3 4 5 ...
## $ yield: num 2.47 2.57 1.91 1.64 2.48 ...
(grand.mean <- mean(dat$yield))
## [1] 4.027303
(geno.means <- tapply(dat$yield, dat$geno, mean))
## var1 var2 var3
## 2.214093 3.573196 6.294620
hist(dat$yield, col="grey", border="white", las=1)
abline(v=grand.mean, lty=2)
legend("topright", legend="grand mean", lty=2, bty="n")
boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE,
main="Boxplots of dat$yield")
abline(h=grand.mean, lty=2)
legend("topleft", legend="grand mean", lty=2, bty="n")
The goal of the statistical models here will be to explain the mean of the response variable using the explanatory factor.
As a first model, one can choose to explain the mean of the responses directly by the genotype means.
\(G\): total number of genotypes
\(R\): total number of replicates (here, the same number for each genotype)
\(g\): index of the genotype
\(r\): index of the replicate
\(y_{gr}\): yield for replicate \(r\) of genotype \(g\)
\(m_g\): mean yield of genotype \(g\)
\(\epsilon_{gr}\): error for replicate \(r\) of genotype \(g\)
\(\sigma^2\): variance of the errors
\[ \forall g \in \{1,\ldots,G\} \text{ and } r \in \{1,\ldots,R\}, \; y_{gr} = m_g + \epsilon_{gr} \text{ with } \epsilon_{gr} \sim \mathcal{N}(0, \sigma^2) \]
A few other notations:
\(N\): total number of observations (here, \(N = G \times R\))
\(i \in \{1,\ldots,N\}\): index of the observation
\(\boldsymbol{y}\): \(N\)-vector of observations
\(X_1\): \(N \times G\) design/incidence matrix relating observations to genotypes
\(\boldsymbol{m} = \{m_1,\ldots,m_G\}\): \(G\)-vector of parameters explaining the mean of the response variable
\(\mathcal{N}_N\): multivariate Normal distribution of dimension \(N\)
\(\text{I}_N\): \(N \times N\) identity matrix
\[ \boldsymbol{y} = X_1 \boldsymbol{m} + \boldsymbol{\epsilon} \text{ where } \boldsymbol{\epsilon} \sim \mathcal{N}_N(\boldsymbol{0}, \sigma^2 \text{I}_N) \]
which is equivalent to:
\[ \boldsymbol{y} \, | \, X_1, \boldsymbol{m}, \sigma^2 \; \sim \; \mathcal{N}_N(X_1 \boldsymbol{m}, \sigma^2 \text{I}_N) \]
Note that this model has \(G\) parameters for the expectation of \(\boldsymbol{y}\).
Under our assumptions, minimizing the squared errors or maximizing the likelihood leads to the same estimator of \(\boldsymbol{m}\), denoted \(\hat{\boldsymbol{m}}\), which satisfies the so-called normal equations (see also the Gauss-Markov theorem):
\[ X_1^T \, X_1 \; \hat{\boldsymbol{m}} \; = \, X_1^T \, \boldsymbol{y} \]
Without going into too many mathematical details, this system of equations can be easily solved (by inverting \(X_1^T X_1\)) if the columns of \(X_1\) are linearly independent from each other:
\[ \hat{\boldsymbol{m}} \; = \, (X_1^T X_1)^{-1} \, X_1^T \, \boldsymbol{y} \]
Importantly, to obtain the estimates of the parameters for the expectation, one first needs to specify a coding scheme for \(X_1\). At this stage, it helps to write out the contents of the model vectors and matrices. Here they are with \(R=2\):
\[ \begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ \vdots \\ y_{GR} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & ... \\ 1 & 0 & 0 & ... \\ 0 & 1 & 0 & ... \\ 0 & 1 & 0 & ... \\ \vdots & \vdots & \vdots & ... \end{bmatrix} \times \begin{bmatrix} m_1 \\ m_2 \\ \vdots \\ m_G \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{21} \\ \epsilon_{22} \\ \vdots \\ \epsilon_{GR} \end{bmatrix} \]
Using the rule of matrix multiplication, one can easily retrieve the following equations:
\(y_{11} = m_1 + \epsilon_{11}\)
\(y_{12} = m_1 + \epsilon_{12}\)
\(y_{21} = m_2 + \epsilon_{21}\)
…
Here is an example in R:
## example with G=3 genotypes and R=2 reps
y <- dat$yield[dat$rep %in% c("A","B")]
(X_1 <- matrix(c(1,0,0,
1,0,0,
0,1,0,
0,1,0,
0,0,1,
0,0,1),
nrow=6, ncol=3, byrow=TRUE))
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 1 0 0
## [3,] 0 1 0
## [4,] 0 1 0
## [5,] 0 0 1
## [6,] 0 0 1
tXX_1 <- t(X_1) %*% X_1
inv_tXX_1 <- solve(tXX_1)
(m.hat <- inv_tXX_1 %*% t(X_1) %*% y)
## [,1]
## [1,] 2.517998
## [2,] 3.524857
## [3,] 6.680426
(epsilons.hat <- y - X_1 %*% m.hat)
## [,1]
## [1,] -0.04957488
## [2,] 0.04957488
## [3,] -0.97922181
## [4,] 0.97922181
## [5,] -0.77342394
## [6,] 0.77342394
mean(epsilons.hat)
## [1] -1.480207e-16
Obtaining an estimate of the parameters is important, but quantifying our uncertainty about it is better.
Here is the variance of the estimator \(\hat{\boldsymbol{m}}\):
\[ \mathbb{V}(\hat{\boldsymbol{m}}) = (X_1^T X_1)^{-1} \sigma^2 \]
To compute it, one therefore needs an estimate of the error variance. The (unbiased) estimator of \(\sigma^2\) will be denoted \(S^2\):
\[ S^2 = \frac{RSS}{N - r} \; , \text{ so that } \mathbb{E}(S^2) = \sigma^2 \]
where \(RSS\) is the residual sum of squares (\(\sum_i (y_i - \hat{y_i})^2\)) and \(r\) is the number of linearly independent columns of \(X_1\) (called the rank of \(X_1\)).
Say our null hypothesis is “\(m_g = a\)”. One can reject it at level \(l\) if:
\[ \frac{|\hat{m}_g - a|}{\sqrt{\mathbb{V}(\hat{m}_g)}} \ge t_{1-l/2, \, N-r} \]
where \(t_{1-l/2, \, N-r}\) is the \(1-l/2\) quantile of Student’s \(t\) distribution with \(N-r\) degrees of freedom.
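As an illustration, here is a sketch of how these quantities could be computed in R, reusing the objects X_1, y, inv_tXX_1, m.hat and epsilons.hat from the chunk above; the null hypothesis “\(m_1 = 0\)” and the object names below are chosen arbitrarily:
## rank of X_1 (here, its G columns are linearly independent, so r = G)
r <- qr(X_1)$rank
## residual sum of squares and estimate of the error variance
RSS <- sum(epsilons.hat^2)
(S2 <- RSS / (length(y) - r))
## variance-covariance matrix of m.hat, and standard errors of its elements
var.m.hat <- inv_tXX_1 * S2
(se.m.hat <- sqrt(diag(var.m.hat)))
## test statistic and p value for the null hypothesis "m_1 = 0"
(test.stat <- m.hat[1] / se.m.hat[1])
2 * pt(abs(test.stat), df=length(y) - r, lower.tail=FALSE)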
The “cell-means” model (v1 above) is intuitive, but it can be re-parametrized into another, equivalent model with an intercept, \(\mu\):
\[ \forall g \in \{1,\ldots,G\} \text{ and } r \in \{1,\ldots,R\}, \; y_{gr} = \mu + m_g + \epsilon_{gr} \text{ with } \epsilon_{gr} \sim \mathcal{N}(0, \sigma^2) \]
A few other notations:
\(X_2\): \(N \times (G+1)\) design/incidence matrix relating observations (the \(y_{gr}\)’s) to genotypes
\(\boldsymbol{\theta} = \{\mu,m_1,\ldots,m_G\}\): \((G+1)\)-vector of parameters explaining the mean of the response variable
\[ \boldsymbol{y} = X_2 \boldsymbol{\theta} + \boldsymbol{\epsilon} \text{ where } \boldsymbol{\epsilon} \sim \mathcal{N}_N(\boldsymbol{0}, \sigma^2 \text{I}_N) \]
Note that this model has \(G+1\) parameters for the expectation of \(\boldsymbol{y}\).
Let us do the same exercise as before, that is, write out the contents of the model vectors and matrices:
\[ \begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ \vdots \\ y_{GR} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 & ... \\ 1 & 1 & 0 & 0 & ... \\ 1 & 0 & 1 & 0 & ... \\ 1 & 0 & 1 & 0 & ... \\ \vdots & \vdots & \vdots & ... \end{bmatrix} \times \begin{bmatrix} \mu \\ m_1 \\ m_2 \\ \vdots \\ m_G \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{21} \\ \epsilon_{22} \\ \vdots \\ \epsilon_{GR} \end{bmatrix} \]
Again as above, using the rule of matrix multiplication, one can easily retrieve the following equations:
\(y_{11} = \mu + m_1 + \epsilon_{11}\)
\(y_{12} = \mu + m_1 + \epsilon_{12}\)
\(y_{21} = \mu + m_2 + \epsilon_{21}\)
…
However, note that the first column of \(X_2\) (corresponding to \(\mu\)) is equal to the sum of all its other columns.
Here is an example in R:
## example with G=3 genotypes and R=2 reps
y <- dat$yield[dat$rep %in% c("A","B")]
(X_2 <- matrix(c(1,1,0,0,
1,1,0,0,
1,0,1,0,
1,0,1,0,
1,0,0,1,
1,0,0,1),
nrow=6, ncol=3+1, byrow=TRUE))
## [,1] [,2] [,3] [,4]
## [1,] 1 1 0 0
## [2,] 1 1 0 0
## [3,] 1 0 1 0
## [4,] 1 0 1 0
## [5,] 1 0 0 1
## [6,] 1 0 0 1
kappa(X_2) # the higher, the more singular X is
## [1] 4.160247e+16
tXX_2 <- t(X_2) %*% X_2
eigen(tXX_2)$values # the matrix is singular if at least one eigenvalue is zero
## [1] 8.000000e+00 2.000000e+00 2.000000e+00 5.329071e-15
try(inv_tXX_2 <- solve(tXX_2))
## Error in solve.default(tXX_2) :
## system is computationally singular: reciprocal condition number = 1.38778e-17
This lack of linear independence between the columns of \(X_2\) (\(X_2^T X_2\) is then singular, i.e., not invertible) means that one needs to add a constraint (only one is needed here) to estimate the parameters for the expectation of \(\boldsymbol{y}\).
Still a few other notations:
\(\boldsymbol{\alpha}\): \(G\)-vector of regression coefficients
the first element corresponds to the intercept: \(\alpha_0 := \mu\)
the others correspond to a \((G-1)\)-vector noted \(\boldsymbol{\alpha}_\star\)
\(\{c_0,\ldots,c_{G-1}\}\): set of \(G\) real numbers
As said above, to deal with the non-independence between the columns of the incidence matrix \(X_2\), one needs to add constraint(s). Typically, a constraint takes the form of a linear combination of the regression coefficients:
\[\begin{equation} \label{eq:contrast_lin_comb} \sum_{j=0}^{G-1} c_j \, \alpha_j \; = \; c_0 \times \alpha_0 + c_1 \times \alpha_1 + \ldots + c_{G-1} \times \alpha_{G-1} \end{equation}\]
But it has a specificity: the sum of the coefficients of this linear combination equals zero, \(\sum_{j=0}^{G-1} c_j = 0\). Such a linear combination is called a contrast, and this is what will be estimated.
Obviously, one wants to be able to go from \(\boldsymbol{m}\) to \(\boldsymbol{\alpha}\), and back. Such transformations are easily written using matrix algebra:
\[ \boldsymbol{\alpha} = C \, \boldsymbol{m} \; \Leftrightarrow \; \boldsymbol{m} = B \, \boldsymbol{\alpha} \]
where \(B = C^{-1}\) and both are \(G \times G\) matrices.
One typically chooses \(B\) so that its first column is made of 1’s: \(B = [ \boldsymbol{1}_G \; B_\star ]\).
\(\Rightarrow\) The \(B_\star\) matrix is \(G \times (G-1)\) and is called a coding matrix.
This leads to:
\[ \boldsymbol{m} = B \boldsymbol{\alpha} = [ \boldsymbol{1}_G \; B_\star ] [\alpha_0 \ldots \alpha_{G-1}]^T = \boldsymbol{1}_G \mu + B_\star \boldsymbol{\alpha}_\star \]
Similarly, \(C\) will be partitioned by separating the first row: \(C = \begin{bmatrix} \boldsymbol{c}_0^T \\ C_\star^T \end{bmatrix}\).
\(\Rightarrow\) The \(C_\star\) matrix is \((G-1) \times G\) and is called a contrast matrix.
Its rows are the contrasts we want to estimate: they sum to zero.
CAUTION: in
R, the contr.treatment, contr.sum,
contr.helmert, contr.poly,
contr.diff, etc functions take factor levels as input and
return a coding matrix (\(B_\star\)),
not a contrast matrix (\(C_\star\))!
Several possible coding schemes exist, each leading to different contrasts, and the interpretation of the contrasts depends on the coding used to obtain them.
As a contrast is a linear combination of parameters, say \(C_g \, \boldsymbol{m}\) (where \(C_g\) is the \(g^\text{th}\) row of \(C\)), one can easily compute its estimate:
\[ C_g \, \hat{\boldsymbol{m}} \]
The variance of a given contrast will be: \[ C_g \, \mathbb{V}(\hat{\boldsymbol{m}}) \, C_g^T \]
The null hypothesis “\(C_g \, \boldsymbol{m} = a\)” is rejected at level \(l\) if:
\[ \frac{|C_g \, \hat{\boldsymbol{m}} - a|}{\sqrt{C_g \, \mathbb{V}(\hat{\boldsymbol{m}}) \, C_g^T}} \ge t_{1-l/2, \, N-r} \]
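As an illustration, here is a sketch of how the estimate, standard error and test of a single contrast, say \(m_2 - m_1\), could be computed directly from these formulas, using the cell-means model (v1) fitted to the full data set; the object names (X1.full, C.g, etc.) are chosen here purely for illustration:
## incidence matrix and estimates of the cell-means model (v1) on the full data
X1.full <- model.matrix(~ 0 + geno, data=dat)
m.hat.full <- solve(t(X1.full) %*% X1.full) %*% t(X1.full) %*% dat$yield
r.full <- qr(X1.full)$rank
S2.full <- sum((dat$yield - X1.full %*% m.hat.full)^2) / (N - r.full)
var.m.hat.full <- solve(t(X1.full) %*% X1.full) * S2.full
## contrast "m_2 - m_1": estimate, standard error, test statistic, p value
C.g <- matrix(c(-1, 1, 0), nrow=1)
(est <- C.g %*% m.hat.full)
(se <- sqrt(C.g %*% var.m.hat.full %*% t(C.g)))
(test.stat <- est / se)
2 * pt(abs(test.stat), df=N - r.full, lower.tail=FALSE)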
Treatment coding (contr.treatment)
In R, this coding is obtained with contr.treatment, which also happens to be the default coding for unordered factors.
getOption("contrasts")
## unordered ordered
## "contr.treatment" "contr.poly"
contrasts(dat$geno)
## var2 var3
## var1 0 0
## var2 1 0
## var3 0 1
(Bstar.ct <- contr.treatment(levels(dat$geno)))
## var2 var3
## var1 0 0
## var2 1 0
## var3 0 1
As explained above, the columns of Bstar (\(B_\star\)) are clearly not contrasts as
they don’t sum to zero. But obtaining the contrast matrix is easy:
add a column of 1’s to obtain the \(B\) matrix,
then invert it to obtain the \(C\) matrix,
and remove the first row to obtain the \(C_\star\) matrix.
(B.ct <- cbind(intercept=1, Bstar.ct))
## intercept var2 var3
## var1 1 0 0
## var2 1 1 0
## var3 1 0 1
(C.ct <- solve(B.ct))
## var1 var2 var3
## intercept 1 0 0
## var2 -1 1 0
## var3 -1 0 1
(Cstar.ct <- C.ct[-1,])
## var1 var2 var3
## var2 -1 1 0
## var3 -1 0 1
As explained above, the rows of Cstar (\(C_\star\)) are contrasts and they sum to
zero:
apply(Cstar.ct, 1, sum)
## var2 var3
## 0 0
Given that \(\boldsymbol{\alpha} = C \boldsymbol{m}\), these rows show how to interpret the regression coefficients (the \(\alpha\)’s) in terms of the factor level means (the \(m\)’s):
\(\alpha_0 = 1 \times m_1 + 0 \times m_2 + 0 \times m_3 = m_1\)
\(\alpha_1 = -1 \times m_1 + 1 \times m_2 + 0 \times m_3 = m_2 - m_1\)
\(\alpha_2 = -1 \times m_1 + 0 \times m_2 + 1 \times m_3 = m_3 - m_1\)
The intercept is the mean of a factor level chosen arbitrarily (the first level by default). It usually makes sense to choose this coding when one level is a natural reference for the others, such as a control.
fit.ct <- lm(yield ~ geno, data=dat, contrasts=list(geno="contr.treatment"))
summary(fit.ct)
##
## Call:
## lm(formula = yield ~ geno, data = dat, contrasts = list(geno = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0276 -0.3481 0.1216 0.2625 1.1592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2141 0.2914 7.597 6.36e-06 ***
## genovar2 1.3591 0.4122 3.297 0.00637 **
## genovar3 4.0805 0.4122 9.900 3.99e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6517 on 12 degrees of freedom
## Multiple R-squared: 0.8944, Adjusted R-squared: 0.8768
## F-statistic: 50.83 on 2 and 12 DF, p-value: 1.385e-06
(alpha.ct.hat <- coef(fit.ct))
## (Intercept) genovar2 genovar3
## 2.214093 1.359103 4.080527
Note that all three estimates are deemed significantly different from 0.
Residual sum of squares and estimate of error variance:
(tmp <- anova(fit.ct))
## Analysis of Variance Table
##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## geno 2 43.173 21.5867 50.828 1.385e-06 ***
## Residuals 12 5.096 0.4247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(RSS.ct <- tmp["Residuals", "Sum Sq"])
## [1] 5.096397
summary(fit.ct)$sigma
## [1] 0.6516899
summary(fit.ct)$sigma^2
## [1] 0.4246998
(r.ct <- fit.ct$rank)
## [1] 3
(S2.ct <- RSS.ct / (N - r.ct))
## [1] 0.4246998
\(\alpha_0 = \mu = m_1\)
alpha.ct.hat[1+0]
## (Intercept)
## 2.214093
geno.means[1]
## var1
## 2.214093
C.ct[1,,drop=FALSE] %*% geno.means
## [,1]
## intercept 2.214093
\(\alpha_1 = m_2 - m_1\)
alpha.ct.hat[1+1]
## genovar2
## 1.359103
geno.means[2] - geno.means[1]
## var2
## 1.359103
C.ct[2,,drop=FALSE] %*% geno.means
## [,1]
## var2 1.359103
\(\alpha_2 = m_3 - m_1\)
alpha.ct.hat[1+2]
## genovar3
## 4.080527
geno.means[3] - geno.means[1]
## var3
## 4.080527
C.ct[3,,drop=FALSE] %*% geno.means
## [,1]
## var3 4.080527
alpha.ct.hat
## (Intercept) genovar2 genovar3
## 2.214093 1.359103 4.080527
C.ct %*% geno.means
## [,1]
## intercept 2.214093
## var2 1.359103
## var3 4.080527
Knowing the interpretation of the intercept in terms of data means and looking at the boxplots below, it makes sense that all three regression coefficients are significantly different from zero:
boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE,
main="Boxplots of dat$yield")
abline(h=alpha.ct.hat[1], lty=2)
legend("topleft", legend="intercept", lty=2, bty="n")
As a matrix multiplication:
geno.means
## var1 var2 var3
## 2.214093 3.573196 6.294620
B.ct %*% alpha.ct.hat
## [,1]
## var1 2.214093
## var2 3.573196
## var3 6.294620
Sum coding (contr.sum)
In R, this coding is obtained with contr.sum.
options(contrasts=c("contr.sum", "contr.poly"))
getOption("contrasts")
## [1] "contr.sum" "contr.poly"
contrasts(dat$geno)
## [,1] [,2]
## var1 1 0
## var2 0 1
## var3 -1 -1
(Bstar.cs <- contr.sum(levels(dat$geno)))
## [,1] [,2]
## var1 1 0
## var2 0 1
## var3 -1 -1
As explained above, the columns of Bstar (\(B_\star\)) are not contrasts (even though
here they sum to zero!). But obtaining the contrast matrix is easy:
add a column of 1’s to obtain the \(B\) matrix,
then invert it to obtain the \(C\) matrix,
and remove the first row to obtain the \(C_\star\) matrix.
(B.cs <- cbind(intercept=1, Bstar.cs))
## intercept
## var1 1 1 0
## var2 1 0 1
## var3 1 -1 -1
(C.cs <- solve(B.cs))
## var1 var2 var3
## intercept 0.3333333 0.3333333 0.3333333
## 0.6666667 -0.3333333 -0.3333333
## -0.3333333 0.6666667 -0.3333333
(Cstar.cs <- C.cs[-1,])
## var1 var2 var3
## 0.6666667 -0.3333333 -0.3333333
## -0.3333333 0.6666667 -0.3333333
As explained above, the rows of Cstar (\(C_\star\)) are contrasts and they sum to
zero:
apply(Cstar.cs, 1, sum)
##
## -5.551115e-17 0.000000e+00
Given that \(\boldsymbol{\alpha} = C \boldsymbol{m}\), these rows show how to interpret the regression coefficients (the \(\alpha\)’s) in terms of the factor level means (the \(m\)’s):
\(\alpha_0 = \mu = \frac{1}{G} \sum_{g=1}^G m_g\)
\(\alpha_1 = m_1 - \mu\)
\(\alpha_2 = m_2 - \mu\)
The intercept is the mean of all the factor level means. It usually makes sense to choose this coding when no level is a natural reference for the others.
Even though it is not returned by the lm function below,
the contrast of the last factor level is of interest and can be deduced
from the others:
fit.cs <- lm(yield ~ geno, data=dat, contrasts=list(geno="contr.sum"))
summary(fit.cs)
##
## Call:
## lm(formula = yield ~ geno, data = dat, contrasts = list(geno = "contr.sum"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0276 -0.3481 0.1216 0.2625 1.1592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0273 0.1683 23.934 1.70e-11 ***
## geno1 -1.8132 0.2380 -7.620 6.17e-06 ***
## geno2 -0.4541 0.2380 -1.908 0.0806 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6517 on 12 degrees of freedom
## Multiple R-squared: 0.8944, Adjusted R-squared: 0.8768
## F-statistic: 50.83 on 2 and 12 DF, p-value: 1.385e-06
(alpha.cs.hat <- coef(fit.cs))
## (Intercept) geno1 geno2
## 4.0273032 -1.8132101 -0.4541069
Note that only the estimates of the intercept and the first contrast are deemed significantly different from 0.
Residual sum of squares and estimate of error variance:
(tmp <- anova(fit.cs))
## Analysis of Variance Table
##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## geno 2 43.173 21.5867 50.828 1.385e-06 ***
## Residuals 12 5.096 0.4247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(RSS.cs <- tmp["Residuals", "Sum Sq"])
## [1] 5.096397
summary(fit.cs)$sigma
## [1] 0.6516899
summary(fit.cs)$sigma^2
## [1] 0.4246998
(r.cs <- fit.cs$rank)
## [1] 3
(S2.cs <- RSS.cs / (N - r.cs))
## [1] 0.4246998
\(\alpha_0 = \mu = \frac{1}{G} \sum_{g=1}^G m_g\)
alpha.cs.hat[1+0]
## (Intercept)
## 4.027303
mean(geno.means)
## [1] 4.027303
\(\alpha_1 = m_1 - \mu\)
alpha.cs.hat[1+1]
## geno1
## -1.81321
geno.means[1] - mean(geno.means)
## var1
## -1.81321
\(\alpha_2 = m_2 - \mu\)
alpha.cs.hat[1+2]
## geno2
## -0.4541069
geno.means[2] - mean(geno.means)
## var2
## -0.4541069
\(\alpha_3 = m_3 - \mu\)
geno.means[3] - mean(geno.means)
## var3
## 2.267317
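Equivalently, since with this coding each \(\alpha_g = m_g - \mu\) and the level effects hence sum to zero (\(\alpha_1 + \alpha_2 + \alpha_3 = 0\)), the last contrast can be deduced directly from the coefficients returned by lm (a minimal check):
## deduce the contrast of the last level from the other two
-sum(alpha.cs.hat[-1])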
alpha.cs.hat
## (Intercept) geno1 geno2
## 4.0273032 -1.8132101 -0.4541069
C.cs %*% geno.means
## [,1]
## intercept 4.0273032
## -1.8132101
## -0.4541069
Knowing the interpretation of the intercept in terms of data means and looking at the boxplots below, it makes sense that all regression coefficients are significantly different from zero, except the one corresponding to var2:
boxplot(yield ~ geno, data=dat, las=1, varwidth=TRUE,
main="Boxplots of dat$yield")
abline(h=alpha.cs.hat[1], lty=2)
legend("topleft", legend="intercept", lty=2, bty="n")
As a matrix multiplication:
geno.means
## var1 var2 var3
## 2.214093 3.573196 6.294620
B.cs %*% alpha.cs.hat
## [,1]
## var1 2.214093
## var2 3.573196
## var3 6.294620
By hand, without lm
In this section, we show how to obtain all the information we want (incidence matrix, contrast estimates, standard errors, test statistics and p values) using only low-level functions.
The model.matrix function returns the design/incidence
matrix:
X.cs <- model.matrix(yield ~ geno, data=dat,
contrasts.arg=list(geno="contr.sum"))
print(X.cs)
## (Intercept) geno1 geno2
## 1 1 1 0
## 2 1 1 0
## 3 1 1 0
## 4 1 1 0
## 5 1 1 0
## 6 1 0 1
## 7 1 0 1
## 8 1 0 1
## 9 1 0 1
## 10 1 0 1
## 11 1 -1 -1
## 12 1 -1 -1
## 13 1 -1 -1
## 14 1 -1 -1
## 15 1 -1 -1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$geno
## [1] "contr.sum"
But we can also compute it as \(X_1 B\):
X.1 <- model.matrix(yield ~ 0 + geno, data=dat,
contrasts.arg=list(geno="contr.sum"))
## X.1 <- matrix(data=0, nrow=nrow(dat), ncol=nlevels(dat$geno))
## colnames(X.1) <- levels(dat$geno)
## for(j in 1:ncol(X.1))
## X.1[,j] <- as.numeric(dat$geno == colnames(X.1)[j])
X.1
## genovar1 genovar2 genovar3
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 0 1 0
## 7 0 1 0
## 8 0 1 0
## 9 0 1 0
## 10 0 1 0
## 11 0 0 1
## 12 0 0 1
## 13 0 0 1
## 14 0 0 1
## 15 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$geno
## [1] "contr.sum"
B.cs
## intercept
## var1 1 1 0
## var2 1 0 1
## var3 1 -1 -1
(X.cs <- X.1 %*% B.cs)
## intercept
## 1 1 1 0
## 2 1 1 0
## 3 1 1 0
## 4 1 1 0
## 5 1 1 0
## 6 1 0 1
## 7 1 0 1
## 8 1 0 1
## 9 1 0 1
## 10 1 0 1
## 11 1 -1 -1
## 12 1 -1 -1
## 13 1 -1 -1
## 14 1 -1 -1
## 15 1 -1 -1
Estimates of regression coefficients:
tXX_cs <- t(X.cs) %*% X.cs
inv_tXX_cs <- solve(tXX_cs)
(alpha.cs.hat <- inv_tXX_cs %*% t(X.cs) %*% dat$yield)
## [,1]
## intercept 4.0273032
## -1.8132101
## -0.4541069
Residual sum of squares and estimate of error variance:
epsilon.cs.hat <- dat$yield - X.cs %*% alpha.cs.hat
(RSS.cs <- t(epsilon.cs.hat) %*% epsilon.cs.hat)
## [,1]
## [1,] 5.096397
(S2.cs <- RSS.cs / (N - r.cs))
## [,1]
## [1,] 0.4246998
S2.cs <- S2.cs[1,1]
Variance of estimates of regression coefficients:
(var.alpha.cs.hat <- inv_tXX_cs * S2.cs)
## intercept
## intercept 0.02831332 0.00000000 0.00000000
## 0.00000000 0.05662664 -0.02831332
## 0.00000000 -0.02831332 0.05662664
(se.alpha.cs.hat <- sqrt(diag(var.alpha.cs.hat)))
## intercept
## 0.1682656 0.2379635 0.2379635
Test statistics:
(test.stats.cs <- alpha.cs.hat / se.alpha.cs.hat)
## [,1]
## intercept 23.934201
## -7.619698
## -1.908305
p values:
l <- 0.05
## caution: the following call wrongly passes a noncentrality parameter (ncp),
## so its p values do NOT match those of summary(fit.cs):
(pvals.cs <- 2 * pt(q=abs(test.stats.cs), df=N-r.cs, ncp=1-l/2, lower.tail=FALSE))
## [,1]
## intercept 4.094860e-10
## 1.140875e-04
## 4.044391e-01
## correct call (no ncp), matching the p values of summary(fit.cs):
(pvals.cs <- 2 * pt(q=abs(test.stats.cs), df=N-r.cs, lower.tail=FALSE))
## [,1]
## intercept 1.698261e-11
## 6.168255e-06
## 8.055764e-02
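As a sanity check, these by-hand estimates, standard errors, test statistics and p values can be compared with the coefficient table returned by summary:
## coefficient table from lm (estimates, std. errors, t values, p values)
summary(fit.cs)$coefficients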
Now that we know how to compute the same contrast estimates by hand as with lm, we can decide to compute the estimates for the contrasts corresponding to, say, the second and third factor levels, omitting the first one. This is done by carefully building the \(B_\star\) matrix, denoted here Bstar.cs2 (a second version of the coding matrix corresponding to the sum coding), and then all the same equations apply:
Bstar.cs2 <- matrix(data=c(-1, -1,
1, 0,
0, 1),
byrow=TRUE,
nrow=G, ncol=G-1)
rownames(Bstar.cs2) <- lev.genos
(B.cs2 <- cbind(intercept=1, Bstar.cs2))
## intercept
## var1 1 -1 -1
## var2 1 1 0
## var3 1 0 1
X.cs2 <- X.1 %*% B.cs2
As expected, we obtain the same estimate for the intercept and the
contrast corresponding to the second factor level, but we also now
obtain the estimate for the last factor level which we didn’t obtain
with the usual call of lm above:
tXX_cs2 <- t(X.cs2) %*% X.cs2
inv_tXX_cs2 <- solve(tXX_cs2)
(alpha.cs2.hat <- inv_tXX_cs2 %*% t(X.cs2) %*% dat$yield)
## [,1]
## intercept 4.0273032
## -0.4541069
## 2.2673170
The rest of the information (standard errors, test statistics and p values) is also obtained with the same equations as above:
epsilon.cs2.hat <- dat$yield - X.cs2 %*% alpha.cs2.hat
(RSS.cs2 <- t(epsilon.cs2.hat) %*% epsilon.cs2.hat)
## [,1]
## [1,] 5.096397
(S2.cs2 <- RSS.cs2 / (N - r.cs))
## [,1]
## [1,] 0.4246998
S2.cs2 <- S2.cs2[1,1]
(var.alpha.cs2.hat <- inv_tXX_cs2 * S2.cs2)
## intercept
## intercept 0.02831332 0.00000000 0.00000000
## 0.00000000 0.05662664 -0.02831332
## 0.00000000 -0.02831332 0.05662664
(se.alpha.cs2.hat <- sqrt(diag(var.alpha.cs2.hat)))
## intercept
## 0.1682656 0.2379635 0.2379635
(test.stats.cs2 <- alpha.cs2.hat / se.alpha.cs2.hat)
## [,1]
## intercept 23.934201
## -1.908305
## 9.528003
l <- 0.05
## caution: as above, this call wrongly passes a noncentrality parameter (ncp):
(pvals.cs2 <- 2 * pt(q=abs(test.stats.cs2), df=N-r.cs, ncp=1-l/2, lower.tail=FALSE))
## [,1]
## intercept 4.094860e-10
## 4.044391e-01
## 1.228199e-05
## correct call (no ncp):
(pvals.cs2 <- 2 * pt(q=abs(test.stats.cs2), df=N-r.cs, lower.tail=FALSE))
## [,1]
## intercept 1.698261e-11
## 8.055764e-02
## 6.023008e-07
The emmeans package aims at simplifying all this:
emm <- emmeans(object=fit.cs, specs="geno")
summary(emm, level=0.95)
## geno emmean SE df lower.CL upper.CL
## var1 2.21 0.291 12 1.58 2.85
## var2 3.57 0.291 12 2.94 4.21
## var3 6.29 0.291 12 5.66 6.93
##
## Confidence level used: 0.95
contrast(object=emm, method="dunnett") # same as contr.treatment
## contrast estimate SE df t.ratio p.value
## var2 - var1 1.36 0.412 12 3.297 0.0121
## var3 - var1 4.08 0.412 12 9.900 <.0001
##
## P value adjustment: dunnettx method for 2 tests
contrast(object=emm, method="eff") # same as contr.sum
## contrast estimate SE df t.ratio p.value
## var1 effect -1.813 0.238 12 -7.620 <.0001
## var2 effect -0.454 0.238 12 -1.908 0.0806
## var3 effect 2.267 0.238 12 9.528 <.0001
##
## P value adjustment: fdr method for 3 tests
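Beyond the predefined methods, contrast also accepts a named list of coefficient vectors (given in the order of the factor levels), so any custom contrast can be estimated; the contrast below, comparing var3 to the average of var1 and var2, is chosen purely for illustration:
## custom contrast: m_3 - (m_1 + m_2)/2
contrast(object=emm, method=list("var3 vs mean(var1,var2)"=c(-0.5, -0.5, 1)))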
test(emm, null=mean(dat$yield)) # testing if each mean is equal to the grand mean
## geno emmean SE df null t.ratio p.value
## var1 2.21 0.291 12 4.03 -6.221 <.0001
## var2 3.57 0.291 12 4.03 -1.558 0.1452
## var3 6.29 0.291 12 4.03 7.780 <.0001
test(emm, null=coef(fit.cs)[1]) # testing if each mean is equal to the intercept with contr.sum
## geno emmean SE df null t.ratio p.value
## var1 2.21 0.291 12 4.03 -6.221 <.0001
## var2 3.57 0.291 12 4.03 -1.558 0.1452
## var3 6.29 0.291 12 4.03 7.780 <.0001
Comparisons:
pairs(emm, adjust="tukey") # Tukey's HSD test
## contrast estimate SE df t.ratio p.value
## var1 - var2 -1.36 0.412 12 -3.297 0.0163
## var1 - var3 -4.08 0.412 12 -9.900 <.0001
## var2 - var3 -2.72 0.412 12 -6.603 0.0001
##
## P value adjustment: tukey method for comparing a family of 3 estimates
(tukey <- multcomp::cld(object=emm, adjust="Tukey", alpha=0.05, Letters=letters))
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## geno emmean SE df lower.CL upper.CL .group
## var1 2.21 0.291 12 1.41 3.02 a
## var2 3.57 0.291 12 2.77 4.38 b
## var3 6.29 0.291 12 5.49 7.10 c
##
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 3 estimates
## P value adjustment: tukey method for comparing a family of 3 estimates
## significance level used: alpha = 0.05
## NOTE: Compact letter displays can be misleading
## because they show NON-findings rather than findings.
## Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead.
plot(tukey) + ggplot2::theme_bw()
ANOVA table (aov) and types of sums of squares
other contrasts: contr.helmert, etc.
two-way ANOVA with interactions
vignette of the codingMatrices package from Bill Venables
notes of a course on applied linear models devoted to one-way ANOVA from Bo Li (Purdue University)
tutorial on coding systems from the Institute for Digital Research and Education at UCLA
Piepho H. 2018. Letters in Mean Comparisons: What They Do and Don’t Mean. Agronomy Journal. 110(2):431–434. https://doi.org/10.2134/agronj2017.10.0580
print(sessionInfo(), locale=FALSE)
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 12 (bookworm)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] emmeans_1.7.4-1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 zoo_1.8-10 xfun_0.42
## [4] bslib_0.4.0 splines_4.2.2 lattice_0.20-45
## [7] generics_0.1.2 colorspace_2.0-3 vctrs_0.6.5
## [10] htmltools_0.5.3 yaml_2.3.5 utf8_1.2.2
## [13] survival_3.5-3 rlang_1.1.4 jquerylib_0.1.4
## [16] pillar_1.9.0 withr_2.5.0 glue_1.6.2
## [19] multcomp_1.4-26 lifecycle_1.0.3 multcompView_0.1-10
## [22] munsell_0.5.0 gtable_0.3.0 mvtnorm_1.1-3
## [25] codetools_0.2-19 coda_0.19-4 evaluate_0.15
## [28] labeling_0.4.2 knitr_1.45 fastmap_1.1.0
## [31] fansi_1.0.3 highr_0.9 TH.data_1.1-1
## [34] xtable_1.8-4 scales_1.3.0 cachem_1.0.6
## [37] jsonlite_1.8.0 farver_2.1.0 ggplot2_3.5.0
## [40] digest_0.6.29 dplyr_1.1.4 grid_4.2.2
## [43] cli_3.6.2 tools_4.2.2 sandwich_3.0-1
## [46] magrittr_2.0.3 sass_0.4.2 tibble_3.2.1
## [49] pkgconfig_2.0.3 MASS_7.3-57 Matrix_1.5-3
## [52] estimability_1.3 rmarkdown_2.25 rstudioapi_0.13
## [55] R6_2.5.1 compiler_4.2.2