Matrix Algebra

Matrix Algebra is introduced in the context of data analysis and using one of the main applications: Linear Models.

We will describe three examples from the life sciences: one from physics, one related to genetics, and one from a mouse experiment. They are very different, yet we end up using the same statistical technique: fitting linear models. Linear models are typically taught and described in the language of matrix algebra.

Motivating Examples

Falling objects

Imagine you are Galileo in the 16th century trying to describe the velocity of a falling object. An assistant climbs the Tower of Pisa and drops a ball, while several other assistants record the position at different times. Let’s simulate some data using the equations we know today and adding some measurement error:

set.seed(1)
g <- 9.8 ##meters per second
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, note: we use tt because t is a base function
d <- 56.67  - 0.5*g*tt^2 + rnorm(n,sd=1) ##meters

The assistants hand the data to Galileo and this is what he sees:

mypar()
plot(tt,d,ylab="Distance in meters",xlab="Time in seconds")
Simulated data for distance travelled versus time of falling object measured with error.

Simulated data for distance travelled versus time of falling object measured with error.

He does not know the exact equation, but by looking at the plot above he deduces that the position should follow a parabola. So he models the data with:

\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, i=1,\dots,n \]

With \(Y_i\) representing location, \(x_i\) representing the time, and \(\varepsilon_i\) accounting for measurement error. This is a linear model because it is a linear combination of known quantities (the \(x\)’s) referred to as predictors or covariates and unknown parameters (the \(\beta\)’s).

Father & son heights

Now imagine you are Francis Galton in the 19th century and you collect paired height data from fathers and sons. You suspect that height is inherited. Your data:

data(father.son,package="UsingR")
x=father.son$fheight
y=father.son$sheight

looks like this:

plot(x,y,xlab="Father's height",ylab="Son's height")
Galton's data. Son heights versus father heights.

Galton’s data. Son heights versus father heights.

The sons’ heights do seem to increase linearly with the fathers’ heights. In this case, a model that describes the data is as follows:

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1,\dots,N \]

This is also a linear model with \(x_i\) and \(Y_i\), the father and son heights respectively, for the \(i\)-th pair and \(\varepsilon_i\) a term to account for the extra variability. Here we think of the fathers’ heights as the predictor and being fixed (not random) so we use lower case. Measurement error alone can’t explain all the variability seen in \(\varepsilon_i\). This makes sense as there are other variables not in the model, for example, mothers’ heights, genetic randomness, and environmental factors.

Random samples from multiple populations

Here we read-in mouse body weight data from mice that were fed two different diets: high fat and control (chow). We have a random sample of 12 mice for each. We are interested in determining if the diet has an effect on weight. Here is the data:

dat <- read.csv("femaleMiceWeights.csv")
mypar(1,1)
stripchart(Bodyweight~Diet,data=dat,vertical=TRUE,method="jitter",pch=1,main="Mice weights")
Mouse weights under two diets.

Mouse weights under two diets.

We want to estimate the difference in average weight between populations. We demonstrated how to do this using t-tests and confidence intervals, based on the difference in sample averages. We can obtain the same exact results using a linear model:

\[ Y_i = \beta_0 + \beta_1 x_{i} + \varepsilon_i\]

with \(\beta_0\) the chow diet average weight, \(\beta_1\) the difference between averages, \(x_i = 1\) when mouse \(i\) gets the high fat (hf) diet, \(x_i = 0\) when it gets the chow diet, and \(\varepsilon_i\) explains the differences between mice of the same population.

Linear models in general

We have seen three very different examples in which linear models can be used. A general model that encompasses all of the above examples is the following:

\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_2 x_{i,p} + \varepsilon_i, i=1,\dots,n \]

\[ Y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{i,j} + \varepsilon_i, i=1,\dots,n \]

Note that we have a general number of predictors \(p\). Matrix algebra provides a compact language and mathematical framework to compute and make derivations with any linear model that fits into the above framework.

Estimating parameters

For the models above to be useful we have to estimate the unknown \(\beta\) s. In the first example, we want to describe a physical process for which we can’t have unknown parameters. In the second example, we better understand inheritance by estimating how much, on average, the father’s height affects the son’s height. In the final example, we want to determine if there is in fact a difference: if \(\beta_1 \neq 0\).

The standard approach in science is to find the values that minimize the distance of the fitted model to the data. The following is called the least squares (LS) equation and we will see it often in this chapter:

\[ \sum_{i=1}^n \left\{ Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j x_{i,j}\right)\right\}^2 \]

Once we find the minimum, we will call the values the least squares estimates (LSE) and denote them with \(\hat{\beta}\). The quantity obtained when evaluating the least squares equation at the estimates is called the residual sum of squares (RSS). Since all these quantities depend on \(Y\), they are random variables. The \(\hat{\beta}\) s are random variables and we will eventually perform inference on them.

Falling object example revisited

We know that the equation for the trajectory of a falling object is:

\[d = h_0 + v_0 t - 0.5 \times 9.8 t^2\]

with \(h_0\) and \(v_0\) the starting height and velocity respectively. The data we simulated above followed this equation and added measurement error to simulate n observations for dropping the ball \((v_0=0)\) from the tower of Pisa \((h_0=56.67)\). This is why we used this code to simulate data:

g <- 9.8 ##meters per second
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
f <- 56.67  - 0.5*g*tt^2
y <-  f + rnorm(n,sd=1)

Here is what the data looks like with the solid line representing the true trajectory:

plot(tt,y,ylab="Distance in meters",xlab="Time in seconds")
lines(tt,f,col=2)
Fitted model for simulated data for distance travelled versus time of falling object measured with error.

Fitted model for simulated data for distance travelled versus time of falling object measured with error.

But we were pretending to be Galileo and so we don’t know the parameters in the model. The data does suggest it is a parabola, so we model it as such:

\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, i=1,\dots,n \]

How do we find the LSE?

The lm function

In R we can fit this model by simply using the lm function. We will describe this function in detail later, but here is a preview:

tt2 <-tt^2
fit <- lm(y~tt+tt2)
summary(fit)$coef
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 57.1047803  0.4996845 114.281666 5.119823e-32
## tt          -0.4460393  0.6806757  -0.655289 5.190757e-01
## tt2         -4.7471698  0.1933701 -24.549662 1.767229e-17

It gives us the LSE, as well as standard errors and p-values.

Part of what we do in this section is to explain the mathematics behind this function.

The least squares estimate (LSE)

Let’s write a function that computes the RSS for any vector \(\beta\):

rss <- function(Beta0,Beta1,Beta2){
  r <- y - (Beta0+Beta1*tt+Beta2*tt^2)
  return(sum(r^2))
}

So for any three dimensional vector we get an RSS. Here is a plot of the RSS as a function of \(\beta_2\) when we keep the other two fixed:

Beta2s<- seq(-10,0,len=100)
plot(Beta2s,sapply(Beta2s,rss,Beta0=55,Beta1=0),
     ylab="RSS",xlab="Beta2",type="l")
##Let's add another curve fixing another pair:
Beta2s<- seq(-10,0,len=100)
lines(Beta2s,sapply(Beta2s,rss,Beta0=65,Beta1=0),col=2)
Residual sum of squares obtained for several values of the parameters.

Residual sum of squares obtained for several values of the parameters.

Trial and error here is not going to work. Instead, we can use calculus: take the partial derivatives, set them to 0 and solve. Of course, if we have many parameters, these equations can get rather complex. Linear algebra provides a compact and general way of solving this problem.

More on Galton (Advanced)

When studying the father-son data, Galton made a fascinating discovery using exploratory analysis.

Galton’s plot.

Galton’s plot.

He noted that if he tabulated the number of father-son height pairs and followed all the x,y values having the same totals in the table, they formed an ellipse. In the plot above, made by Galton, you see the ellipse formed by the pairs having 3 cases. This then led to modeling this data as correlated bivariate normal which we described earlier:

\[ Pr(X<a,Y<b) = \]

\[ \int_{-\infty}^{a} \int_{-\infty}^{b} \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp{ \left\{ \frac{1}{2(1-\rho^2)} \left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)+ \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right\} } \]

We described how we can use math to show that if you keep \(X\) fixed (condition to be \(x\)) the distribution of \(Y\) is normally distributed with mean: \(\mu_x +\sigma_y \rho \left(\frac{x-\mu_x}{\sigma_x}\right)\) and standard deviation \(\sigma_y \sqrt{1-\rho^2}\). Note that \(\rho\) is the correlation between \(Y\) and \(X\), which implies that if we fix \(X=x\), \(Y\) does in fact follow a linear model. The \(\beta_0\) and \(\beta_1\) parameters in our simple linear model can be expressed in terms of \(\mu_x,\mu_y,\sigma_x,\sigma_y\), and \(\rho\).

Matrix Notation

Here we introduce the basics of matrix notation. Initially this may seem over-complicated, but once we discuss examples, you will appreciate the power of using this notation to both explain and derive solutions, as well as implement them as R code.

The language of linear models

Linear algebra notation actually simplifies the mathematical descriptions and manipulations of linear models, as well as coding in R. We will discuss the basics of this notation and then show some examples in R.

The main point of this entire exercise is to show how we can write the models above using matrix notation, and then explain how this is useful for solving the least squares equation. We start by simply defining notation and matrix multiplication, but bear with us since we eventually get back to the practical application.

Solving Systems of Equations

Linear algebra was created by mathematicians to solve systems of linear equations such as this:

\[ \begin{align*} a + b + c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 1 \end{align*} \]

It provides very useful machinery to solve these problems generally. We will learn how we can write and solve this system using matrix algebra notation:

\[ \, \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \implies \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}^{-1} \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \]

This section explains the notation used above. It turns out that we can borrow this notation for linear models in statistics as well.

Vectors, Matrices, and Scalars

In the falling object, father-son heights, and mouse weight examples, the random variables associated with the data were represented by \(Y_1,\dots,Y_n\). We can think of this as a vector. In fact, in R we are already doing this:

data(father.son,package="UsingR")
y=father.son$fheight
head(y)
## [1] 65.04851 63.25094 64.95532 65.75250 61.13723 63.02254

In math we can also use just one symbol. We usually use bold to distinguish it from the individual entries:

\[ \mathbf{Y} = \begin{pmatrix} Y_1\\\ Y_2\\\ \vdots\\\ Y_N \end{pmatrix} \]

For reasons that will soon become clear, default representation of data vectors have dimension \(N\times 1\) as opposed to \(1 \times N\) .

Here we don’t always use bold because normally one can tell what is a matrix from the context.

Similarly, we can use math notation to represent the covariates or predictors. In a case with two predictors we can represent them like this:

\[ \mathbf{X}_1 = \begin{pmatrix} x_{1,1}\\ \vdots\\ x_{N,1} \end{pmatrix} \mbox{ and } \mathbf{X}_2 = \begin{pmatrix} x_{1,2}\\ \vdots\\ x_{N,2} \end{pmatrix} \]

Note that for the falling object example \(x_{1,1}= t_i\) and \(x_{i,1}=t_i^2\) with \(t_i\) the time of the i-th observation. Also, keep in mind that vectors can be thought of as \(N\times 1\) matrices.

For reasons that will soon become apparent, it is convenient to represent these in matrices:

\[ \mathbf{X} = [ \mathbf{X}_1 \mathbf{X}_2 ] = \begin{pmatrix} x_{1,1}&x_{1,2}\\ \vdots\\ x_{N,1}&x_{N,2} \end{pmatrix} \]

This matrix has dimension \(N \times 2\). We can create this matrix in R this way:

n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
X <- cbind(X1=tt,X2=tt^2)
head(X)
##             X1         X2
## [1,] 0.0000000 0.00000000
## [2,] 0.1416667 0.02006944
## [3,] 0.2833333 0.08027778
## [4,] 0.4250000 0.18062500
## [5,] 0.5666667 0.32111111
## [6,] 0.7083333 0.50173611
dim(X)
## [1] 25  2

We can also use this notation to denote an arbitrary number of covariates with the following \(N\times p\) matrix:

\[ \mathbf{X} = \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \]

Just as an example, we show you how to make one in R now using matrix instead of cbind:

N <- 100; p <- 5
X <- matrix(1:(N*p),N,p)
head(X)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1  101  201  301  401
## [2,]    2  102  202  302  402
## [3,]    3  103  203  303  403
## [4,]    4  104  204  304  404
## [5,]    5  105  205  305  405
## [6,]    6  106  206  306  406
dim(X)
## [1] 100   5

By default, the matrices are filled column by column. The byrow=TRUE argument lets us change that to row by row:

N <- 100; p <- 5
X <- matrix(1:(N*p),N,p,byrow=TRUE)
head(X)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20
## [5,]   21   22   23   24   25
## [6,]   26   27   28   29   30

Finally, we define a scalar. A scalar is just a number, which we call a scalar because we want to distinguish it from vectors and matrices. We usually use lower case and don’t bold. In the next section, we will understand why we make this distinction.

Examples

Now we are ready to see how matrix algebra can be useful when analyzing data. We start with some simple examples and eventually arrive at the main one: how to write linear models with matrix algebra notation and solve the least squares problem.

The average

To compute the sample average and variance of our data, we use these formulas \(\bar{Y}=\frac{1}{N} Y_i\) and \(\mbox{var}(Y)=\frac{1}{N} \sum_{i=1}^N (Y_i - \bar{Y})^2\). We can represent these with matrix multiplication. First, define this \(N \times 1\) matrix made just of 1s:

\[ A=\begin{pmatrix} 1\\ 1\\ \vdots\\ 1 \end{pmatrix} \]

This implies that:

\[ \frac{1}{N} \mathbf{A}^\top Y = \frac{1}{N} \begin{pmatrix}1&1&\dots&1\end{pmatrix} \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix}= \frac{1}{N} \sum_{i=1}^N Y_i = \bar{Y} \]

Note that we are multiplying by the scalar \(1/N\). In R, we multiply matrix using %*%:

data(father.son,package="UsingR")
y <- father.son$sheight
print(mean(y))
## [1] 68.68407
N <- length(y)
Y<- matrix(y,N,1)
A <- matrix(1,N,1)
barY=t(A)%*%Y / N

print(barY)
##          [,1]
## [1,] 68.68407

The variance

As we will see later, multiplying the transpose of a matrix with another is very common in statistics. In fact, it is so common that there is a function in R:

barY=crossprod(A,Y) / N
print(barY)
##          [,1]
## [1,] 68.68407

For the variance, we note that if:

\[ \mathbf{r}\equiv \begin{pmatrix} Y_1 - \bar{Y}\\ \vdots\\ Y_N - \bar{Y} \end{pmatrix}, \,\, \frac{1}{N} \mathbf{r}^\top\mathbf{r} = \frac{1}{N}\sum_{i=1}^N (Y_i - \bar{Y})^2 \]

In R, if you only send one matrix into crossprod, it computes: \(r^\top r\) so we can simply type:

r <- y - barY
## Warning in y - barY: Recycling array of length 1 in vector-array arithmetic is deprecated.
##   Use c() or as.vector() instead.
crossprod(r)/N
##          [,1]
## [1,] 7.915196

Which is almost equivalent to:

library(rafalib)
popvar(y) 
## [1] 7.915196

Linear models

Now we are ready to put all this to use. Let’s start with Galton’s example. If we define these matrices:

\[ \mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} , \mathbf{X} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} , \mathbf{\beta} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \mathbf{\varepsilon} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]

Then we can write the model:

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1,\dots,N \]

as:

\[ \, \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]

or simply:

\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]

which is a much simpler way to write it.

The least squares equation becomes simpler as well since it is the following cross-product:

\[ (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^\top (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) \]

So now we are ready to determine which values of \(\beta\) minimize the above, which we can do using calculus to find the minimum.

Advanced: Finding the minimum using calculus

There are a series of rules that permit us to compute partial derivative equations in matrix notation. By equating the derivative to 0 and solving for the \(\beta\), we will have our solution. The only one we need here tells us that the derivative of the above equation is:

\[ 2 \mathbf{X}^\top (\mathbf{Y} - \mathbf{X} \boldsymbol{\hat{\beta}})=0 \]

\[ \mathbf{X}^\top \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^\top \mathbf{Y} \]

\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]

and we have our solution. We usually put a hat on the \(\beta\) that solves this, \(\hat{\beta}\) , as it is an estimate of the “real” \(\beta\) that generated the data.

Remember that the least squares are like a square (multiply something by itself) and that this formula is similar to the derivative of \(f(x)^2\) being \(2f(x)f\prime (x)\).

Finding LSE in R

Let’s see how it works in R:

data(father.son,package="UsingR")
x=father.son$fheight
y=father.son$sheight
X <- cbind(1,x)
betahat <- solve( t(X) %*% X ) %*% t(X) %*% y
###or
betahat <- solve( crossprod(X) ) %*% crossprod( X, y )

Now we can see the results of this by computing the estimated \(\hat{\beta}_0+\hat{\beta}_1 x\) for any value of \(x\):

newx <- seq(min(x),max(x),len=100)
X <- cbind(1,newx)
fitted <- X%*%betahat
plot(x,y,xlab="Father's height",ylab="Son's height")
lines(newx,fitted,col=2)
Galton's data with fitted regression line.

Galton’s data with fitted regression line.

This \(\hat{\boldsymbol{\beta}}=(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}\) is one of the most widely used results in data analysis. One of the advantages of this approach is that we can use it in many different situations. For example, in our falling object problem:

set.seed(1)
g <- 9.8 #meters per second
n <- 25
tt <- seq(0,3.4,len=n) #time in secs, t is a base function
d <- 56.67  - 0.5*g*tt^2 + rnorm(n,sd=1)

Notice that we are using almost the same exact code:

X <- cbind(1,tt,tt^2)
y <- d
betahat <- solve(crossprod(X))%*%crossprod(X,y)
newtt <- seq(min(tt),max(tt),len=100)
X <- cbind(1,newtt,newtt^2)
fitted <- X%*%betahat
plot(tt,y,xlab="Time",ylab="Height")
lines(newtt,fitted,col=2)
Fitted parabola to simulated data for distance travelled versus time of falling object measured with error.

Fitted parabola to simulated data for distance travelled versus time of falling object measured with error.

And the resulting estimates are what we expect:

betahat
##          [,1]
##    56.5317368
## tt  0.5013565
##    -5.0386455

The Tower of Pisa is about 56 meters high. Since we are just dropping the object there is no initial velocity, and half the constant of gravity is 9.8/2=4.9 meters per second squared.

The lm Function

R has a very convenient function that fits these models. We will learn more about this function later, but here is a preview:

X <- cbind(tt,tt^2)
fit=lm(y~X)
summary(fit)
## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5295 -0.4882  0.2537  0.6560  1.5455 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.5317     0.5451 103.701   <2e-16 ***
## Xtt           0.5014     0.7426   0.675    0.507    
## X            -5.0386     0.2110 -23.884   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9822 on 22 degrees of freedom
## Multiple R-squared:  0.9973, Adjusted R-squared:  0.997 
## F-statistic:  4025 on 2 and 22 DF,  p-value: < 2.2e-16

Note that we obtain the same values as above.

Summary

We have shown how to write linear models using linear algebra. We are going to do this for several examples, many of which are related to designed experiments. We also demonstrated how to obtain least squares estimates. Nevertheless, it is important to remember that because \(y\) is a random variable, these estimates are random as well. In a later section, we will learn how to compute standard error for these estimates and use this to perform inference.

The Design Matrix

Here we will show how to use the two R functions, formula and model.matrix, in order to produce design matrices (also known as model matrices) for a variety of linear models. For example, in the mouse diet examples we wrote the model as

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1,\dots,N \]

with \(Y_i\) the weights and \(x_i\) equal to 1 only when mouse \(i\) receives the high fat diet. We use the term experimental unit to \(N\) different entities from which we obtain a measurement. In this case, the mice are the experimental units.

This is the type of variable we will focus on in this chapter. We call them indicator variables since they simply indicate if the experimental unit had a certain characteristic or not. As we described earlier, we can use linear algebra to represent this model:

\[ \mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} , \mathbf{X} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} , \boldsymbol{\beta} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]

as:

\[ \, \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]

or simply:

\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]

The design matrix is the matrix \(\mathbf{X}\).

Once we define a design matrix, we are ready to find the least squares estimates. We refer to this as fitting the model. For fitting linear models in R, we will directly provide a formula to the lm function. In this script, we will use the model.matrix function, which is used internally by the lm function. This will help us to connect the R formula with the matrix \(\mathbf{X}\). It will therefore help us interpret the results from lm.

Choice of design

The choice of design matrix is a critical step in linear modeling since it encodes which coefficients will be fit in the model, as well as the inter-relationship between the samples. A common misunderstanding is that the choice of design follows straightforward from a description of which samples were included in the experiment. This is not the case. The basic information about each sample (whether control or treatment group, experimental batch, etc.) does not imply a single ‘correct’ design matrix. The design matrix additionally encodes various assumptions about how the variables in \(\mathbf{X}\) explain the observed values in \(\mathbf{Y}\), on which the investigator must decide.

For the examples we cover here, we use linear models to make comparisons between different groups. Hence, the design matrices that we ultimately work with will have at least two columns: an intercept column, which consists of a column of 1’s, and a second column, which specifies which samples are in a second group. In this case, two coefficients are fit in the linear model: the intercept, which represents the population average of the first group, and a second coefficient, which represents the difference between the population averages of the second group and the first group. The latter is typically the coefficient we are interested in when we are performing statistical tests: we want to know if there is a difference between the two groups.

We encode this experimental design in R with two pieces. We start with a formula with the tilde symbol ~. This means that we want to model the observations using the variables to the right of the tilde. Then we put the name of a variable, which tells us which samples are in which group.

Let’s try an example. Suppose we have two groups, control and high fat diet, with two samples each. For illustrative purposes, we will code these with 1 and 2 respectively. We should first tell R that these values should not be interpreted numerically, but as different levels of a factor. We can then use the paradigm ~ group to, say, model on the variable group.

group <- factor( c(1,1,2,2) )
model.matrix(~ group)
##   (Intercept) group2
## 1           1      0
## 2           1      0
## 3           1      1
## 4           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

(Don’t worry about the attr lines printed beneath the matrix. We won’t be using this information.)

What about the formula function? We don’t have to include this. By starting an expression with ~, it is equivalent to telling R that the expression is a formula:

model.matrix(formula(~ group))
##   (Intercept) group2
## 1           1      0
## 2           1      0
## 3           1      1
## 4           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

What happens if we don’t tell R that group should be interpreted as a factor?

group <- c(1,1,2,2)
model.matrix(~ group)
##   (Intercept) group
## 1           1     1
## 2           1     1
## 3           1     2
## 4           1     2
## attr(,"assign")
## [1] 0 1

This is not the design matrix we wanted, and the reason is that we provided a numeric variable as opposed to an indicator to the formula and model.matrix functions, without saying that these numbers actually referred to different groups. We want the second column to have only 0 and 1, indicating group membership.

A note about factors: the names of the levels are irrelevant to model.matrix and lm. All that matters is the order. For example:

group <- factor(c("control","control","highfat","highfat"))
model.matrix(~ group)
##   (Intercept) grouphighfat
## 1           1            0
## 2           1            0
## 3           1            1
## 4           1            1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

produces the same design matrix as our first code chunk.

More groups

Using the same formula, we can accommodate modeling more groups. Suppose we have a third diet:

group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group)
##   (Intercept) group2 group3
## 1           1      0      0
## 2           1      0      0
## 3           1      1      0
## 4           1      1      0
## 5           1      0      1
## 6           1      0      1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Now we have a third column which specifies which samples belong to the third group.

An alternate formulation of design matrix is possible by specifying + 0 in the formula:

group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group + 0)
##   group1 group2 group3
## 1      1      0      0
## 2      1      0      0
## 3      0      1      0
## 4      0      1      0
## 5      0      0      1
## 6      0      0      1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

This group now fits a separate coefficient for each group. We will explore this design in more depth later on.

More variables

We have been using a simple case with just one variable (diet) as an example. In the life sciences, it is quite common to perform experiments with more than one variable. For example, we may be interested in the effect of diet and the difference in sexes. In this case, we have four possible groups:

diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
table(diet,sex)
##     sex
## diet f m
##    1 2 2
##    2 2 2

If we assume that the diet effect is the same for males and females (this is an assumption), then our linear model is:

\[ Y_{i}= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i \]

To fit this model in R, we can simply add the additional variable with a + sign in order to build a design matrix which fits based on the information in additional variables:

diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
model.matrix(~ diet + sex)
##   (Intercept) diet2 sexm
## 1           1     0    0
## 2           1     0    0
## 3           1     0    1
## 4           1     0    1
## 5           1     1    0
## 6           1     1    0
## 7           1     1    1
## 8           1     1    1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sex
## [1] "contr.treatment"

The design matrix includes an intercept, a term for diet and a term for sex. We would say that this linear model accounts for differences in both the group and condition variables. However, as mentioned above, the model assumes that the diet effect is the same for both males and females. We say these are an additive effect. For each variable, we add an effect regardless of what the other is. Another model is possible here, which fits an additional term and which encodes the potential interaction of group and condition variables. We will cover interaction terms in depth in a later script.

The interaction model can be written in either of the following two formulas:

model.matrix(~ diet + sex + diet:sex)

or

model.matrix(~ diet*sex)
##   (Intercept) diet2 sexm diet2:sexm
## 1           1     0    0          0
## 2           1     0    0          0
## 3           1     0    1          0
## 4           1     0    1          0
## 5           1     1    0          0
## 6           1     1    0          0
## 7           1     1    1          1
## 8           1     1    1          1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sex
## [1] "contr.treatment"

Releveling

The level which is chosen for the reference level is the level which is contrasted against. By default, this is simply the first level alphabetically. We can specify that we want group 2 to be the reference level by either using the relevel function:

group <- factor(c(1,1,2,2))
group <- relevel(group, "2")
model.matrix(~ group)
##   (Intercept) group1
## 1           1      1
## 2           1      1
## 3           1      0
## 4           1      0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

or by providing the levels explicitly in the factor call:

group <- factor(group, levels=c("1","2"))
model.matrix(~ group)
##   (Intercept) group2
## 1           1      0
## 2           1      0
## 3           1      1
## 4           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Where does model.matrix look for the data?

The model.matrix function will grab the variable from the R global environment, unless the data is explicitly provided as a data frame to the data argument:

group <- 1:4
model.matrix(~ group, data=data.frame(group=5:8))
##   (Intercept) group
## 1           1     5
## 2           1     6
## 3           1     7
## 4           1     8
## attr(,"assign")
## [1] 0 1

Note how the R global environment variable group is ignored.

Continuous variables

In this chapter, we focus on models based on indicator values. In certain designs, however, we will be interested in using numeric variables in the design formula, as opposed to converting them to factors first. For example, in the falling object example, time was a continuous variable in the model and time squared was also included:

tt <- seq(0,3.4,len=4) 
model.matrix(~ tt + I(tt^2))
##   (Intercept)       tt   I(tt^2)
## 1           1 0.000000  0.000000
## 2           1 1.133333  1.284444
## 3           1 2.266667  5.137778
## 4           1 3.400000 11.560000
## attr(,"assign")
## [1] 0 1 2

The I function above is necessary to specify a mathematical transformation of a variable. For more details, see the manual page for the I function by typing ?I.

In the life sciences, we could be interested in testing various dosages of a treatment, where we expect a specific relationship between a measured quantity and the dosage, e.g. 0 mg, 10 mg, 20 mg.

The assumptions imposed by including continuous data as variables are typically hard to defend and motivate than the indicator function variables. Whereas the indicator variables simply assume a different mean between two groups, continuous variables assume a very specific relationship between the outcome and predictor variables.

In cases like the falling object, we have the theory of gravitation supporting the model. In the father-son height example, because the data is bivariate normal, it follows that there is a linear relationship if we condition. However, we find that continuous variables are included in linear models without justification to “adjust” for variables such as age. We highly discourage this practice unless the data support the model being used.

The mouse diet example

We will demonstrate how to analyze the high fat diet data using linear models instead of directly applying a t-test. We will demonstrate how ultimately these two approaches are equivalent.

We start by reading in the data and creating a quick stripchart:

dat <- read.csv("femaleMiceWeights.csv") ##previously downloaded
stripchart(dat$Bodyweight ~ dat$Diet, vertical=TRUE, method="jitter",
           main="Bodyweight over Diet")
Mice bodyweights stratified by diet.

Mice bodyweights stratified by diet.

We can see that the high fat diet group appears to have higher weights on average, although there is overlap between the two samples.

For demonstration purposes, we will build the design matrix \(\mathbf{X}\) using the formula ~ Diet. The group with the 1’s in the second column is determined by the level of Diet which comes second; that is, the non-reference level.

levels(dat$Diet)
## [1] "chow" "hf"
X <- model.matrix(~ Diet, data=dat)
head(X)
##   (Intercept) Diethf
## 1           1      0
## 2           1      0
## 3           1      0
## 4           1      0
## 5           1      0
## 6           1      0

The Mathematics Behind lm()

Before we use our shortcut for running linear models, lm, we want to review what will happen internally. Inside of lm, we will form the design matrix \(\mathbf{X}\) and calculate the \(\boldsymbol{\beta}\), which minimizes the sum of squares using the previously described formula. The formula for this solution is:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]

We can calculate this in R using our matrix multiplication operator %*%, the inverse function solve, and the transpose function t.

Y <- dat$Bodyweight
X <- model.matrix(~ Diet, data=dat)
solve(t(X) %*% X) %*% t(X) %*% Y
##                  [,1]
## (Intercept) 23.813333
## Diethf       3.020833

These coefficients are the average of the control group and the difference of the averages:

s <- split(dat$Bodyweight, dat$Diet)
mean(s[["chow"]])
## [1] 23.81333
mean(s[["hf"]]) - mean(s[["chow"]])
## [1] 3.020833

Finally, we use our shortcut, lm, to run the linear model:

fit <- lm(Bodyweight ~ Diet, data=dat)
summary(fit)
## 
## Call:
## lm(formula = Bodyweight ~ Diet, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1042 -2.4358 -0.4138  2.8335  7.1858 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23.813      1.039  22.912   <2e-16 ***
## Diethf         3.021      1.470   2.055   0.0519 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.6 on 22 degrees of freedom
## Multiple R-squared:  0.1611, Adjusted R-squared:  0.1229 
## F-statistic: 4.224 on 1 and 22 DF,  p-value: 0.05192
(coefs <- coef(fit))
## (Intercept)      Diethf 
##   23.813333    3.020833

Examining the coefficients

The following plot provides a visualization of the meaning of the coefficients with colored arrows (code not shown):

Estimated linear model coefficients for bodyweight data illustrated with arrows.

Estimated linear model coefficients for bodyweight data illustrated with arrows.

To make a connection with material presented earlier, this simple linear model is actually giving us the same result (the t-statistic and p-value) for the difference as a specific kind of t-test. This is the t-test between two groups with the assumption that the population standard deviation is the same for both groups. This was encoded into our linear model when we assumed that the errors \(\boldsymbol{\varepsilon}\) were all equally distributed.

Although in this case the linear model is equivalent to a t-test, we will soon explore more complicated designs, where the linear model is a useful extension. Below we demonstrate that one does in fact get the exact same results:

Our lm estimates were:

summary(fit)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 23.813333   1.039353 22.911684 7.642256e-17
## Diethf       3.020833   1.469867  2.055174 5.192480e-02

And the t-statistic is the same:

ttest <- t.test(s[["hf"]], s[["chow"]], var.equal=TRUE)
summary(fit)$coefficients[2,3]
## [1] 2.055174
ttest$statistic
##        t 
## 2.055174

Standard Errors

We have shown how to find the least squares estimates with matrix algebra. These estimates are random variables since they are linear combinations of the data. For these estimates to be useful, we also need to compute their standard errors. Linear algebra provides a powerful approach for this task. We provide several examples.

Falling object

It is useful to think about where randomness comes from. In our falling object example, randomness was introduced through measurement errors. Each time we rerun the experiment, a new set of measurement errors will be made. This implies that our data will change randomly, which in turn suggests that our estimates will change randomly. For instance, our estimate of the gravitational constant will change every time we perform the experiment. The constant is fixed, but our estimates are not. To see this we can run a Monte Carlo simulation. Specifically, we will generate the data repeatedly and each time compute the estimate for the quadratic term.

set.seed(1)
B <- 10000
h0 <- 56.67
v0 <- 0
g <- 9.8 ##meters per second

n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
X <-cbind(1,tt,tt^2)
##create X'X^-1 X'
A <- solve(crossprod(X)) %*% t(X)
betahat<-replicate(B,{
  y <- h0 + v0*tt  - 0.5*g*tt^2 + rnorm(n,sd=1)
  betahats <- A%*%y
  return(betahats[3])
})
head(betahat)
## [1] -5.038646 -4.894362 -5.143756 -5.220960 -5.063322 -4.777521

As expected, the estimate is different every time. This is because \(\hat{\beta}\) is a random variable. It therefore has a distribution:

library(rafalib)
mypar(1,2)
hist(betahat)
qqnorm(betahat)
qqline(betahat)
Distribution of estimated regression coefficients obtained from Monte Carlo simulated falling object data. The left is a histogram and on the right we have a qq-plot against normal theoretical quantiles.

Distribution of estimated regression coefficients obtained from Monte Carlo simulated falling object data. The left is a histogram and on the right we have a qq-plot against normal theoretical quantiles.

Since \(\hat{\beta}\) is a linear combination of the data which we made normal in our simulation, it is also normal as seen in the qq-plot above. Also, the mean of the distribution is the true parameter \(-0.5g\), as confirmed by the Monte Carlo simulation performed above.

round(mean(betahat),1)
## [1] -4.9

But we will not observe this exact value when we estimate because the standard error of our estimate is approximately:

sd(betahat) 
## [1] 0.2129976

Here we will show how we can compute the standard error without a Monte Carlo simulation. Since in practice we do not know exactly how the errors are generated, we can’t use the Monte Carlo approach.

Father and son heights

In the father and son height examples, we have randomness because we have a random sample of father and son pairs. For the sake of illustration, let’s assume that this is the entire population:

data(father.son,package="UsingR")
x <- father.son$fheight
y <- father.son$sheight
n <- length(y)

Now let’s run a Monte Carlo simulation in which we take a sample size of 50 over and over again.

N <- 50
B <-1000
betahat <- replicate(B,{
  index <- sample(n,N)
  sampledat <- father.son[index,]
  x <- sampledat$fheight
  y <- sampledat$sheight
  lm(y~x)$coef
  })
betahat <- t(betahat) #have estimates in two columns

By making qq-plots, we see that our estimates are approximately normal random variables:

mypar(1,2)
qqnorm(betahat[,1])
qqline(betahat[,1])
qqnorm(betahat[,2])
qqline(betahat[,2])
Distribution of estimated regression coefficients obtained from Monte Carlo simulated father-son height data. The left is a histogram and on the right we have a qq-plot against normal theoretical quantiles.

Distribution of estimated regression coefficients obtained from Monte Carlo simulated father-son height data. The left is a histogram and on the right we have a qq-plot against normal theoretical quantiles.

We also see that the correlation of our estimates is negative:

cor(betahat[,1],betahat[,2])
## [1] -0.9992293

When we compute linear combinations of our estimates, we will need to know this information to correctly calculate the standard error of these linear combinations.

In the next section, we will describe the variance-covariance matrix. The covariance of two random variables is defined as follows:

mean( (betahat[,1]-mean(betahat[,1] ))* (betahat[,2]-mean(betahat[,2])))
## [1] -1.035291

The covariance is the correlation multiplied by the standard deviations of each random variable:

\[\mbox{Corr}(X,Y) = \frac{\mbox{Cov}(X,Y)}{\sigma_X \sigma_Y}\]

Other than that, this quantity does not have a useful interpretation in practice. However, as we will see, it is a very useful quantity for mathematical derivations. In the next sections, we show useful matrix algebra calculations that can be used to estimate standard errors of linear model estimates.

Variance-covariance matrix (Advanced)

As a first step we need to define the variance-covariance matrix, \(\boldsymbol{\Sigma}\). For a vector of random variables, \(\mathbf{Y}\), we define \(\boldsymbol{\Sigma}\) as the matrix with the \(i,j\) entry:

\[ \Sigma_{i,j} \equiv \mbox{Cov}(Y_i, Y_j) \]

The covariance is equal to the variance if \(i = j\) and equal to 0 if the variables are independent. In the kinds of vectors considered up to now, for example, a vector \(\mathbf{Y}\) of individual observations \(Y_i\) sampled from a population, we have assumed independence of each observation and assumed the \(Y_i\) all have the same variance \(\sigma^2\), so the variance-covariance matrix has had only two kinds of elements:

\[ \mbox{Cov}(Y_i, Y_i) = \mbox{var}(Y_i) = \sigma^2\]

\[ \mbox{Cov}(Y_i, Y_j) = 0, \mbox{ for } i \neq j\]

which implies that \(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\) with \(\mathbf{I}\), the identity matrix.

Later, we will see a case, specifically the estimate coefficients of a linear model, \(\hat{\boldsymbol{\beta}}\), that has non-zero entries in the off diagonal elements of \(\boldsymbol{\Sigma}\). Furthermore, the diagonal elements will not be equal to a single value \(\sigma^2\).

Variance of a linear combination

A useful result provided by linear algebra is that the variance covariance-matrix of a linear combination \(\mathbf{AY}\) of \(\mathbf{Y}\) can be computed as follows:

\[ \mbox{var}(\mathbf{AY}) = \mathbf{A}\mbox{var}(\mathbf{Y}) \mathbf{A}^\top \]

For example, if \(Y_1\) and \(Y_2\) are independent both with variance \(\sigma^2\) then:

\[\mbox{var}\{Y_1+Y_2\} = \mbox{var}\left\{ \begin{pmatrix}1&1\end{pmatrix}\begin{pmatrix} Y_1\\Y_2\\ \end{pmatrix}\right\}\]

\[ =\begin{pmatrix}1&1\end{pmatrix} \sigma^2 \mathbf{I}\begin{pmatrix} 1\\1\\ \end{pmatrix}=2\sigma^2\]

as we expect. We use this result to obtain the standard errors of the LSE (least squares estimate).

LSE standard errors (Advanced)

Note that \(\boldsymbol{\hat{\beta}}\) is a linear combination of \(\mathbf{Y}\): \(\mathbf{AY}\) with \(\mathbf{A}=\mathbf{(X^\top X)^{-1}X}^\top\), so we can use the equation above to derive the variance of our estimates:

\[\mbox{var}(\boldsymbol{\hat{\beta}}) = \mbox{var}( \mathbf{(X^\top X)^{-1}X^\top Y} ) = \]

\[\mathbf{(X^\top X)^{-1} X^\top} \mbox{var}(Y) (\mathbf{(X^\top X)^{-1} X^\top})^\top = \]

\[\mathbf{(X^\top X)^{-1} X^\top} \sigma^2 \mathbf{I} (\mathbf{(X^\top X)^{-1} X^\top})^\top = \]

\[\sigma^2 \mathbf{(X^\top X)^{-1} X^\top}\mathbf{X} \mathbf{(X^\top X)^{-1}} = \]

\[\sigma^2\mathbf{(X^\top X)^{-1}}\]

The diagonal of the square root of this matrix contains the standard error of our estimates.

Estimating \(\sigma^2\)

To obtain an actual estimate in practice from the formulas above, we need to estimate \(\sigma^2\). Previously we estimated the standard errors from the sample. However, the sample standard deviation of \(Y\) is not \(\sigma\) because \(Y\) also includes variability introduced by the deterministic part of the model: \(\mathbf{X}\boldsymbol{\beta}\). The approach we take is to use the residuals.

We form the residuals like this:

\[ \mathbf{r}\equiv\boldsymbol{\hat{\varepsilon}} = \mathbf{Y}-\mathbf{X}\boldsymbol{\hat{\beta}}\]

Both \(\mathbf{r}\) and \(\boldsymbol{\hat{\varepsilon}}\) notations are used to denote residuals.

Then we use these to estimate, in a similar way, to what we do in the univariate case:

\[ s^2 \equiv \hat{\sigma}^2 = \frac{1}{N-p}\mathbf{r}^\top\mathbf{r} = \frac{1}{N-p}\sum_{i=1}^N r_i^2\]

Here \(N\) is the sample size and \(p\) is the number of columns in \(\mathbf{X}\) or number of parameters (including the intercept term \(\beta_0\)). The reason we divide by \(N-p\) is because mathematical theory tells us that this will give us a better (unbiased) estimate.

Let’s try this in R and see if we obtain the same values as we did with the Monte Carlo simulation above:

n <- nrow(father.son)
N <- 50
index <- sample(n,N)
sampledat <- father.son[index,]
x <- sampledat$fheight
y <- sampledat$sheight
X <- model.matrix(~x)

N <- nrow(X)
p <- ncol(X)

XtXinv <- solve(crossprod(X))

resid <- y - X %*% XtXinv %*% crossprod(X,y)

s <- sqrt( sum(resid^2)/(N-p))
ses <- sqrt(diag(XtXinv))*s 

Let’s compare to what lm provides:

summary(lm(y~x))$coef[,2]
## (Intercept)           x 
##   8.3899781   0.1240767
ses
## (Intercept)           x 
##   8.3899781   0.1240767

They are identical because they are doing the same thing. Also, note that we approximate the Monte Carlo results:

apply(betahat,2,sd)
## (Intercept)           x 
##   8.3817556   0.1237362

Linear combination of estimates

Frequently, we want to compute the standard deviation of a linear combination of estimates such as \(\hat{\beta}_2 - \hat{\beta}_1\). This is a linear combination of \(\hat{\boldsymbol{\beta}}\):

\[\hat{\beta}_2 - \hat{\beta}_1 = \begin{pmatrix}0&-1&1&0&\dots&0\end{pmatrix} \begin{pmatrix} \hat{\beta}_0\\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots\\ \hat{\beta}_p \end{pmatrix}\]

Using the above, we know how to compute the variance covariance matrix of \(\hat{\boldsymbol{\beta}}\).

CLT and t-distribution

We have shown how we can obtain standard errors for our estimates. However, as we learned in the first chapter, to perform inference we need to know the distribution of these random variables. The reason we went through the effort to compute the standard errors is because the CLT applies in linear models. If \(N\) is large enough, then the LSE will be normally distributed with mean \(\boldsymbol{\beta}\) and standard errors as described. For small samples, if the \(\varepsilon\) are normally distributed, then the \(\hat{\beta}-\beta\) follow a t-distribution. We do not derive this result here, but the results are extremely useful since it is how we construct p-values and confidence intervals in the context of linear models.

Code versus math

The standard approach to writing linear models either assume the values in \(\mathbf{X}\) are fixed or that we are conditioning on them. Thus \(\mathbf{X} \boldsymbol{\beta}\) has no variance as the \(\mathbf{X}\) is considered fixed. This is why we write \(\mbox{var}(Y_i) = \mbox{var}(\varepsilon_i)=\sigma^2\). This can cause confusion in practice because if you, for example, compute the following:

x =  father.son$fheight
beta =  c(34,0.5)
var(beta[1]+beta[2]*x)
## [1] 1.883576

it is nowhere near 0. This is an example in which we have to be careful in distinguishing code from math. The function var is simply computing the variance of the list we feed it, while the mathematical definition of variance is considering only quantities that are random variables. In the R code above, x is not fixed at all: we are letting it vary, but when we write \(\mbox{var}(Y_i) = \sigma^2\) we are imposing, mathematically, x to be fixed. Similarly, if we use R to compute the variance of \(Y\) in our object dropping example, we obtain something very different than \(\sigma^2=1\) (the known variance):

n <- length(tt)
y <- h0 + v0*tt  - 0.5*g*tt^2 + rnorm(n,sd=1)
var(y)
## [1] 329.5136

Again, this is because we are not fixing tt.

Interactions and Contrasts

As a running example to learn about more complex linear models, we will be using a dataset which compares the different frictional coefficients on the different legs of a spider. Specifically, we will be determining whether more friction comes from a pushing or pulling motion of the leg. The original paper from which the data was provided is:

Jonas O. Wolff & Stanislav N. Gorb, Radial arrangement of Janus-like setae permits friction control in spiders, Scientific Reports, 22 January 2013.

The abstract of the paper says,

The hunting spider Cupiennius salei (Arachnida, Ctenidae) possesses hairy attachment pads (claw tufts) at its distal legs, consisting of directional branched setae… Friction of claw tufts on smooth glass was measured to reveal the functional effect of seta arrangement within the pad.

Figure 1 includes some pretty cool electron microscope images of the tufts. We are interested in the comparisons in Figure 4, where the pulling and pushing motions are compared for different leg pairs (for a diagram of pushing and pulling see the top of Figure 3).

We include the data in our dagdata package and can download it from here.

spider <- read.csv("spider_wolff_gorb_2013.csv", skip=1)

Initial visual inspection of the data

Each measurement comes from one of our legs while it is either pushing or pulling. So we have two variables:

table(spider$leg,spider$type)
##     
##      pull push
##   L1   34   34
##   L2   15   15
##   L3   52   52
##   L4   40   40

We can make a boxplot summarizing the measurements for each of the eight pairs. This is similar to Figure 4 of the original paper:

boxplot(spider$friction ~ spider$type * spider$leg, 
        col=c("grey90","grey40"), las=2, 
        main="Comparison of friction coefficients of different leg pairs")
Comparison of friction coefficients of spiders' different leg pairs. The friction coefficient is calculated as the ratio of two forces (see paper Methods) so it is unitless.

Comparison of friction coefficients of spiders’ different leg pairs. The friction coefficient is calculated as the ratio of two forces (see paper Methods) so it is unitless.

What we can immediately see are two trends:

  • The pulling motion has higher friction than the pushing motion.
  • The leg pairs to the back of the spider (L4 being the last) have higher pulling friction.

Another thing to notice is that the groups have different spread around their average, what we call within-group variance. This is somewhat of a problem for the kinds of linear models we will explore below, since we will be assuming that around the population average values, the errors \(\varepsilon_i\) are distributed identically, meaning the same variance within each group. The consequence of ignoring the different variances for the different groups is that comparisons between those groups with small variances will be overly “conservative” (because the overall estimate of variance is larger than an estimate for just these groups), and comparisons between those groups with large variances will be overly confident. If the spread is related to the range of friction, such that groups with large friction values also have larger spread, a possibility is to transform the data with a function such as the log or sqrt. This looks like it could be useful here, since three of the four push groups (L1, L2, L3) have the smallest friction values and also the smallest spread.

Some alternative tests for comparing groups without transforming the values first include: t-tests without the equal variance assumption using a “Welch” or “Satterthwaite approximation”, or the Wilcoxon rank sum test mentioned previously. However here, for simplicity of illustration, we will fit a model that assumes equal variance and shows the different kinds of linear model designs using this dataset, setting aside the issue of different within-group variances.

A linear model with one variable

To remind ourselves how the simple two-group linear model looks, we will subset the data to include only the L1 leg pair, and run lm:

spider.sub <- spider[spider$leg == "L1",]
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)
## 
## Call:
## lm(formula = friction ~ type, data = spider.sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33147 -0.10735 -0.04941 -0.00147  0.76853 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.92147    0.03827  24.078  < 2e-16 ***
## typepush    -0.51412    0.05412  -9.499  5.7e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2232 on 66 degrees of freedom
## Multiple R-squared:  0.5776, Adjusted R-squared:  0.5711 
## F-statistic: 90.23 on 1 and 66 DF,  p-value: 5.698e-14
(coefs <- coef(fit))
## (Intercept)    typepush 
##   0.9214706  -0.5141176

These two estimated coefficients are the mean of the pull observations (the first estimated coefficient) and the difference between the means of the two groups (the second coefficient). We can show this with R code:

s <- split(spider.sub$friction, spider.sub$type)
mean(s[["pull"]])
## [1] 0.9214706
mean(s[["push"]]) - mean(s[["pull"]])
## [1] -0.5141176

We can form the design matrix, which was used inside lm:

X <- model.matrix(~ type, data=spider.sub)
colnames(X)
## [1] "(Intercept)" "typepush"
head(X)
##   (Intercept) typepush
## 1           1        0
## 2           1        0
## 3           1        0
## 4           1        0
## 5           1        0
## 6           1        0
tail(X)
##    (Intercept) typepush
## 63           1        1
## 64           1        1
## 65           1        1
## 66           1        1
## 67           1        1
## 68           1        1

Now we’ll make a plot of the \(\mathbf{X}\) matrix by putting a black block for the 1’s and a white block for the 0’s. This plot will be more interesting for the linear models later on in this script. Along the y-axis is the sample number (the row number of the data) and along the x-axis is the column of the design matrix \(\mathbf{X}\). If you have installed the rafalib library, you can make this plot with the imagemat function:

library(rafalib)
imagemat(X, main="Model matrix for linear model with one variable")
Model matrix for linear model with one variable.

Model matrix for linear model with one variable.

Examining the estimated coefficients

Now we show the coefficient estimates from the linear model in a diagram with arrows (code not shown).

Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the 'pull' samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.

Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the ‘pull’ samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.

A linear model with two variables

Now we’ll continue and examine the full dataset, including the observations from all leg pairs. In order to model both the leg pair differences (L1, L2, L3, L4) and the push vs. pull difference, we need to include both terms in the R formula. Let’s see what kind of design matrix will be formed with two variables in the formula:

X <- model.matrix(~ type + leg, data=spider)
colnames(X)
## [1] "(Intercept)" "typepush"    "legL2"       "legL3"       "legL4"
head(X)
##   (Intercept) typepush legL2 legL3 legL4
## 1           1        0     0     0     0
## 2           1        0     0     0     0
## 3           1        0     0     0     0
## 4           1        0     0     0     0
## 5           1        0     0     0     0
## 6           1        0     0     0     0
imagemat(X, main="Model matrix for linear model with two factors")
Image of the model matrix for a formula with type + leg

Image of the model matrix for a formula with type + leg

The first column is the intercept, and so it has 1’s for all samples. The second column has 1’s for the push samples, and we can see that there are four groups of them. Finally, the third, fourth and fifth columns have 1’s for the L2, L3 and L4 samples. The L1 samples do not have a column, because L1 is the reference level for leg. Similarly, there is no pull column, because pull is the reference level for the type variable.

To estimate coefficients for this model, we use lm with the formula ~ type + leg. We’ll save the linear model to fitTL standing for a fit with Type and Leg.

fitTL <- lm(friction ~ type + leg, data=spider)
summary(fitTL)
## 
## Call:
## lm(formula = friction ~ type + leg, data = spider)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46392 -0.13441 -0.00525  0.10547  0.69509 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.05392    0.02816  37.426  < 2e-16 ***
## typepush    -0.77901    0.02482 -31.380  < 2e-16 ***
## legL2        0.17192    0.04569   3.763 0.000205 ***
## legL3        0.16049    0.03251   4.937 1.37e-06 ***
## legL4        0.28134    0.03438   8.183 1.01e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2084 on 277 degrees of freedom
## Multiple R-squared:  0.7916, Adjusted R-squared:  0.7886 
## F-statistic:   263 on 4 and 277 DF,  p-value: < 2.2e-16
(coefs <- coef(fitTL))
## (Intercept)    typepush       legL2       legL3       legL4 
##   1.0539153  -0.7790071   0.1719216   0.1604921   0.2813382

R uses the name coefficient to denote the component containing the least squares estimates. It is important to remember that the coefficients are parameters that we do not observe, but only estimate.

Mathematical representation

The model we are fitting above can be written as

\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \varepsilon_i, i=1,\dots,N \]

with the \(x\) all indicator variables denoting push or pull and which leg. For example, a push on leg 3 will have \(x_{i,1}\) and \(x_{i,3}\) equal to 1 and the rest would be 0. Throughout this section we will refer to the \(\beta\) s with the effects they represent. For example we call \(\beta_0\) the intercept, \(\beta_1\) the pull effect, \(\beta_2\) the L2 effect, etc. We do not observe the coefficients, e.g. \(\beta_1\), directly, but estimate them with, e.g. \(\hat{\beta}_4\).

We can now form the matrix \(\mathbf{X}\) depicted above and obtain the least square estimates with:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]

Y <- spider$friction
X <- model.matrix(~ type + leg, data=spider)
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y
t(beta.hat)
##      (Intercept)   typepush     legL2     legL3     legL4
## [1,]    1.053915 -0.7790071 0.1719216 0.1604921 0.2813382
coefs
## (Intercept)    typepush       legL2       legL3       legL4 
##   1.0539153  -0.7790071   0.1719216   0.1604921   0.2813382

We can see that these values agree with the output of lm.

Examining the estimated coefficients

We can make the same plot as before, with arrows for each of the estimated coefficients in the model (code not shown).

Diagram of the estimated coefficients in the linear model. As before, the teal-green arrow represents the Intercept, which fits the mean of the reference group (here, the pull samples for leg L1). The purple, pink, and yellow-green arrows represent differences between the three other leg groups and L1. The orange arrow represents the difference between the push and pull samples for all groups.

Diagram of the estimated coefficients in the linear model. As before, the teal-green arrow represents the Intercept, which fits the mean of the reference group (here, the pull samples for leg L1). The purple, pink, and yellow-green arrows represent differences between the three other leg groups and L1. The orange arrow represents the difference between the push and pull samples for all groups.

In this case, the fitted means for each group, derived from the fitted coefficients, do not line up with those we obtain from simply taking the average from each of the eight possible groups. The reason is that our model uses five coefficients, instead of eight. We are assuming that the effects are additive. However, as we demonstrate in more detail below, this particular dataset is better described with a model including interactions.

s <- split(spider$friction, spider$group)
mean(s[["L1pull"]])
## [1] 0.9214706
coefs[1]
## (Intercept) 
##    1.053915
mean(s[["L1push"]])
## [1] 0.4073529
coefs[1] + coefs[2]
## (Intercept) 
##   0.2749082

Here we can demonstrate that the push vs. pull estimated coefficient, coefs[2], is a weighted average of the difference of the means for each group. Furthermore, the weighting is determined by the sample size of each group. The math works out simply here because the sample size is equal for the push and pull subgroups within each leg pair. If the sample sizes were not equal for push and pull within each leg pair, the weighting is more complicated but uniquely determined by a formula involving the sample size of each subgroup, the total sample size, and the number of coefficients. This can be worked out from \((\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\).

means <- sapply(s, mean)
##the sample size of push or pull groups for each leg pair
ns <- sapply(s, length)[c(1,3,5,7)]
(w <- ns/sum(ns))
##    L1pull    L2pull    L3pull    L4pull 
## 0.2411348 0.1063830 0.3687943 0.2836879
sum(w * (means[c(2,4,6,8)] - means[c(1,3,5,7)]))
## [1] -0.7790071
coefs[2]
##   typepush 
## -0.7790071

Contrasting coefficients

Sometimes, the comparison we are interested in is represented directly by a single coefficient in the model, such as the push vs. pull difference, which was coefs[2] above. However, sometimes, we want to make a comparison which is not a single coefficient, but a combination of coefficients, which is called a contrast. To introduce the concept of contrasts, first consider the comparisons which we can read off from the linear model summary:

coefs
## (Intercept)    typepush       legL2       legL3       legL4 
##   1.0539153  -0.7790071   0.1719216   0.1604921   0.2813382

Here we have the intercept estimate, the push vs. pull estimated effect across all leg pairs, and the estimates for the L2 vs. L1 effect, the L3 vs. L1 effect, and the L4 vs. L1 effect. What if we want to compare two groups and one of those groups is not L1? The solution to this question is to use contrasts.

A contrast is a combination of estimated coefficient: \(\mathbf{c^\top} \hat{\boldsymbol{\beta}}\), where \(\mathbf{c}\) is a column vector with as many rows as the number of coefficients in the linear model. If \(\mathbf{c}\) has a 0 for one or more of its rows, then the corresponding estimated coefficients in \(\hat{\boldsymbol{\beta}}\) are not involved in the contrast.

If we want to compare leg pairs L3 and L2, this is equivalent to contrasting two coefficients from the linear model because, in this contrast, the comparison to the reference level L1 cancels out:

\[ (\mbox{L3} - \mbox{L1}) - (\mbox{L2} - \mbox{L1}) = \mbox{L3} - \mbox{L2 }\]

An easy way to make these contrasts of two groups is to use the contrast function from the contrast package. We just need to specify which groups we want to compare. We have to pick one of pull or push types, although the answer will not differ, as we will see below.

library(contrast) #Available from CRAN
L3vsL2 <- contrast(fitTL,list(leg="L3",type="pull"),list(leg="L2",type="pull"))
L3vsL2
## lm model parameter contrast
## 
##     Contrast       S.E.      Lower      Upper     t  df Pr(>|t|)
##  -0.01142949 0.04319685 -0.0964653 0.07360632 -0.26 277   0.7915

The first column Contrast gives the L3 vs. L2 estimate from the model we fit above.

We can show that the least squares estimates of a linear combination of coefficients is the same linear combination of the estimates. Therefore, the effect size estimate is just the difference between two estimated coefficients. The contrast vector used by contrast is stored as a variable called X within the resulting object (not to be confused with our original \(\mathbf{X}\), the design matrix).

coefs[4] - coefs[3]
##       legL3 
## -0.01142949
(cT <- L3vsL2$X)
##   (Intercept) typepush legL2 legL3 legL4
## 1           0        0    -1     1     0
## attr(,"assign")
## [1] 0 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$type
## [1] "contr.treatment"
## 
## attr(,"contrasts")$leg
## [1] "contr.treatment"
cT %*% coefs
##          [,1]
## 1 -0.01142949

What about the standard error and t-statistic? As before, the t-statistic is the estimate divided by the standard error. The standard error of the contrast estimate is formed by multiplying the contrast vector \(\mathbf{c}\) on either side of the estimated covariance matrix, \(\hat{\Sigma}\), our estimate for \(\mathrm{var}(\hat{\boldsymbol{\beta}})\):

\[ \sqrt{\mathbf{c^\top} \hat{\boldsymbol{\Sigma}} \mathbf{c}} \]

where we saw the covariance of the coefficients earlier:

\[ \boldsymbol{\Sigma} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \]

We estimate \(\sigma^2\) with the sample estimate \(\hat{\sigma}^2\) described above and obtain:

Sigma.hat <- sum(fitTL$residuals^2)/(nrow(X) - ncol(X)) * solve(t(X) %*% X)
signif(Sigma.hat, 2)
##             (Intercept) typepush    legL2    legL3    legL4
## (Intercept)     0.00079 -3.1e-04 -0.00064 -0.00064 -0.00064
## typepush       -0.00031  6.2e-04  0.00000  0.00000  0.00000
## legL2          -0.00064 -6.4e-20  0.00210  0.00064  0.00064
## legL3          -0.00064 -6.4e-20  0.00064  0.00110  0.00064
## legL4          -0.00064 -1.2e-19  0.00064  0.00064  0.00120
sqrt(cT %*% Sigma.hat %*% t(cT))
##            1
## 1 0.04319685
L3vsL2$SE
## [1] 0.04319685

We would have obtained the same result for a contrast of L3 and L2 had we picked type="push". The reason it does not change the contrast is because it leads to addition of the typepush effect on both sides of the difference, which cancels out:

L3vsL2.equiv <- contrast(fitTL,list(leg="L3",type="push"),list(leg="L2",type="push"))
L3vsL2.equiv$X
##   (Intercept) typepush legL2 legL3 legL4
## 1           0        0    -1     1     0
## attr(,"assign")
## [1] 0 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$type
## [1] "contr.treatment"
## 
## attr(,"contrasts")$leg
## [1] "contr.treatment"

Linear Model with Interactions

In the previous linear model, we assumed that the push vs. pull effect was the same for all of the leg pairs (the same orange arrow). You can easily see that this does not capture the trends in the data that well. That is, the tips of the arrows did not line up perfectly with the group averages. For the L1 leg pair, the push vs. pull estimated coefficient was too large, and for the L3 leg pair, the push vs. pull coefficient was somewhat too small.

Interaction terms will help us overcome this problem by introducing additional coefficients to compensate for differences in the push vs. pull effect across the 4 groups. As we already have a push vs. pull term in the model, we only need to add three more terms to have the freedom to find leg-pair-specific push vs. pull differences. As we will see, interaction terms are added to the design matrix by multiplying the columns of the design matrix representing existing terms.

We can rebuild our linear model with an interaction between type and leg, by including an extra term in the formula type:leg. The : symbol adds an interaction between the two variables surrounding it. An equivalent way to specify this model is ~ type*leg, which will expand to the formula ~ type + leg + type:leg, with main effects for type, leg and an interaction term type:leg.

X <- model.matrix(~ type + leg + type:leg, data=spider)
colnames(X)
## [1] "(Intercept)"    "typepush"       "legL2"          "legL3"         
## [5] "legL4"          "typepush:legL2" "typepush:legL3" "typepush:legL4"
head(X)
##   (Intercept) typepush legL2 legL3 legL4 typepush:legL2 typepush:legL3
## 1           1        0     0     0     0              0              0
## 2           1        0     0     0     0              0              0
## 3           1        0     0     0     0              0              0
## 4           1        0     0     0     0              0              0
## 5           1        0     0     0     0              0              0
## 6           1        0     0     0     0              0              0
##   typepush:legL4
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0
imagemat(X, main="Model matrix for linear model with interactions")
Image of model matrix with interactions.

Image of model matrix with interactions.

Columns 6-8 (typepush:legL2, typepush:legL3, and typepush:legL4) are the product of the 2nd column (typepush) and columns 3-5 (the three leg columns). Looking at the last column, for example, the typepush:legL4 column is adding an extra coefficient \(\beta_{\textrm{push,L4}}\) to those samples which are both push samples and leg pair L4 samples. This accounts for a possible difference when the mean of samples in the L4-push group are not at the location which would be predicted by adding the estimated intercept, the estimated push coefficient typepush, and the estimated L4 coefficient legL4.

We can run the linear model using the same code as before:

fitX <- lm(friction ~ type + leg + type:leg, data=spider)
summary(fitX)
## 
## Call:
## lm(formula = friction ~ type + leg + type:leg, data = spider)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46385 -0.10735 -0.01111  0.07848  0.76853 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.92147    0.03266  28.215  < 2e-16 ***
## typepush       -0.51412    0.04619 -11.131  < 2e-16 ***
## legL2           0.22386    0.05903   3.792 0.000184 ***
## legL3           0.35238    0.04200   8.390 2.62e-15 ***
## legL4           0.47928    0.04442  10.789  < 2e-16 ***
## typepush:legL2 -0.10388    0.08348  -1.244 0.214409    
## typepush:legL3 -0.38377    0.05940  -6.461 4.73e-10 ***
## typepush:legL4 -0.39588    0.06282  -6.302 1.17e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1904 on 274 degrees of freedom
## Multiple R-squared:  0.8279, Adjusted R-squared:  0.8235 
## F-statistic: 188.3 on 7 and 274 DF,  p-value: < 2.2e-16
coefs <- coef(fitX)

Examining the estimated coefficients

Here is where the plot with arrows really helps us interpret the coefficients. The estimated interaction coefficients (the yellow, brown and silver arrows) allow leg-pair-specific differences in the push vs. pull difference. The orange arrow now represents the estimated push vs. pull difference only for the reference leg pair, which is L1. If an estimated interaction coefficient is large, this means that the push vs. pull difference for that leg pair is very different than the push vs. pull difference in the reference leg pair.

Now, as we have eight terms in the model and eight parameters, you can check that the tips of the arrowheads are exactly equal to the group means (code not shown).

Diagram of the estimated coefficients in the linear model. In the design with interaction terms, the orange arrow now indicates the push vs. pull difference only for the reference group (L1), while three new arrows (yellow, brown and grey) indicate the additional push vs. pull differences in the non-reference groups (L2, L3 and L4) with respect to the reference group.

Diagram of the estimated coefficients in the linear model. In the design with interaction terms, the orange arrow now indicates the push vs. pull difference only for the reference group (L1), while three new arrows (yellow, brown and grey) indicate the additional push vs. pull differences in the non-reference groups (L2, L3 and L4) with respect to the reference group.

Contrasts

Again we will show how to combine estimated coefficients from the model using contrasts. For some simple cases, we can use the contrast package. Suppose we want to know the push vs. pull effect for the L2 leg pair samples. We can see from the arrow plot that this is the orange arrow plus the yellow arrow. We can also specify this comparison with the contrast function:

library(contrast) ##Available from CRAN
L2push.vs.pull <- contrast(fitX,
                   list(leg="L2", type = "push"), 
                   list(leg="L2", type = "pull"))
L2push.vs.pull
## lm model parameter contrast
## 
##  Contrast      S.E.      Lower      Upper     t  df Pr(>|t|)
##    -0.618 0.0695372 -0.7548951 -0.4811049 -8.89 274        0
coefs[2] + coefs[6] ##we know this is also orange + yellow arrow
## typepush 
##   -0.618

Differences of differences

The question of whether the push vs. pull difference is different in L2 compared to L1, is answered by a single term in the model: the typepush:legL2 estimated coefficient corresponding to the yellow arrow in the plot. A p-value for whether this coefficient is actually equal to zero can be read off from the table printed with summary(fitX) above. Similarly, we can read off the p-values for the differences of differences for L3 vs. L1 and for L4 vs. L1.

Suppose we want to know if the push vs. pull difference is different in L3 compared to L2. By examining the arrows in the diagram above, we can see that the push vs. pull effect for a leg pair other than L1 is the typepush arrow plus the interaction term for that group.

If we work out the math for comparing across two non-reference leg pairs, this is:

\[ (\mbox{typepush} + \mbox{typepush:legL3}) - (\mbox{typepush} + \mbox{typepush:legL2}) \]

…which simplifies to:

\[ = \mbox{typepush:legL3} - \mbox{typepush:legL2} \]

We can’t make this contrast using the contrast function shown before, but we can make this comparison using the glht (for “general linear hypothesis test”) function from the multcomp package. We need to form a 1-row matrix which has a -1 for the typepush:legL2 coefficient and a +1 for the typepush:legL3 coefficient. We provide this matrix to the linfct (linear function) argument, and obtain a summary table for this contrast of estimated interaction coefficients.

Note that there are other ways to perform contrasts using base R, and this is just our preferred way.

library(multcomp) ##Available from CRAN
## Warning: package 'multcomp' was built under R version 3.4.2
## Warning: package 'TH.data' was built under R version 3.4.2
C <- matrix(c(0,0,0,0,0,-1,1,0), 1)
L3vsL2interaction <- glht(fitX, linfct=C)
summary(L3vsL2interaction)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = friction ~ type + leg + type:leg, data = spider)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)    
## 1 == 0 -0.27988    0.07893  -3.546  0.00046 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
coefs[7] - coefs[6] ##we know this is also brown - yellow
## typepush:legL3 
##     -0.2798846

Analysis of Variance

Suppose that we want to know if the push vs. pull difference is different across leg pairs in general. We do not want to compare any two leg pairs in particular, but rather we want to know if the three interaction terms which represent differences in the push vs. pull difference across leg pairs are larger than we would expect them to be if the push vs. pull difference was in fact equal across all leg pairs.

Such a question can be answered by an analysis of variance, which is often abbreviated as ANOVA. ANOVA compares the reduction in the sum of squares of the residuals for models of different complexity. The model with eight coefficients is more complex than the model with five coefficients where we assumed the push vs. pull difference was equal across leg pairs. The least complex model would only use a single coefficient, an intercept. Under certain assumptions we can also perform inference that determines the probability of improvements as large as what we observed. Let’s first print the result of an ANOVA in R and then examine the results in detail:

anova(fitX)
## Analysis of Variance Table
## 
## Response: friction
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## type        1 42.783  42.783 1179.713 < 2.2e-16 ***
## leg         3  2.921   0.974   26.847 2.972e-15 ***
## type:leg    3  2.098   0.699   19.282 2.256e-11 ***
## Residuals 274  9.937   0.036                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first line tells us that adding a variable type (push or pull) to the design is very useful (reduces the sum of squared residuals) compared to a model with only an intercept. We can see that it is useful, because this single coefficient reduces the sum of squares by 42.783. The original sum of squares of the model with just an intercept is:

mu0 <- mean(spider$friction)
(initial.ss <- sum((spider$friction - mu0)^2))
## [1] 57.73858

Note that this initial sum of squares is just a scaled version of the sample variance:

N <- nrow(spider)
(N - 1) * var(spider$friction)
## [1] 57.73858

Let’s see exactly how we get this 42.783. We need to calculate the sum of squared residuals for the model with only the type information. We can do this by calculating the residuals, squaring these, summing these within groups and then summing across the groups.

s <- split(spider$friction, spider$type)
after.type.ss <- sum( sapply(s, function(x) {
  residual <- x - mean(x) 
  sum(residual^2)
  }) )

The reduction in sum of squared residuals from introducing the type coefficient is therefore:

(type.ss <- initial.ss - after.type.ss)
## [1] 42.78307

Through simple arithmetic, this reduction can be shown to be equivalent to the sum of squared differences between the fitted values for the models with formula ~type and ~1:

sum(sapply(s, length) * (sapply(s, mean) - mu0)^2)
## [1] 42.78307

Keep in mind that the order of terms in the formula, and therefore rows in the ANOVA table, is important: each row considers the reduction in the sum of squared residuals after adding coefficients compared to the model in the previous row.

The other columns in the ANOVA table show the “degrees of freedom” with each row. As the type variable introduced only one term in the model, the Df column has a 1. Because the leg variable introduced three terms in the model (legL2, legL3 and legL4), the Df column has a 3.

Finally, there is a column which lists the F value. The F value is the mean of squares for the inclusion of the terms of interest (the sum of squares divided by the degrees of freedom) divided by the mean squared residuals (from the bottom row):

\[ r_i = Y_i - \hat{Y}_i \]

\[ \mbox{Mean Sq Residuals} = \frac{1}{N - p} \sum_{i=1}^N r_i^2 \]

where \(p\) is the number of coefficients in the model (here eight, including the intercept term).

Under the null hypothesis (the true value of the additional coefficient(s) is 0), we have a theoretical result for what the distribution of the F value will be for each row. The assumptions needed for this approximation to hold are similar to those of the t-distribution approximation we described in earlier chapters. We either need a large sample size so that CLT applies or we need the population data to follow a normal approximation.

As an example of how one interprets these p-values, let’s take the last row type:leg which specifies the three interaction coefficients. Under the null hypothesis that the true value for these three additional terms is actually 0, e.g. \(\beta_{\textrm{push,L2}} = 0, \beta_{\textrm{push,L3}} = 0, \beta_{\textrm{push,L4}} = 0\), then we can calculate the chance of seeing such a large F-value for this row of the ANOVA table. Remember that we are only concerned with large values here, because we have a ratio of sum of squares, the F-value can only be positive. The p-value in the last column for the type:leg row can be interpreted as: under the null hypothesis that there are no differences in the push vs. pull difference across leg pair, this is the probability of an estimated interaction coefficient explaining so much of the observed variance. If this p-value is small, we would consider rejecting the null hypothesis that the push vs. pull difference is the same across leg pairs.

The F distribution has two parameters: one for the degrees of freedom of the numerator (the terms of interest) and one for the denominator (the residuals). In the case of the interaction coefficients row, this is 3, the number of interaction coefficients divided by 274, the number of samples minus the total number of coefficients.

A different specification of the same model

Now we show an alternate specification of the same model, wherein we assume that each combination of type and leg has its own mean value (and so that the push vs. pull effect is not the same for each leg pair). This specification is in some ways simpler, as we will see, but it does not allow us to build the ANOVA table as above, because it does not split interaction coefficients out in the same way.

We start by constructing a factor variable with a level for each unique combination of type and leg. We include a 0 + in the formula because we do not want to include an intercept in the model matrix.

##earlier, we defined the 'group' column:
spider$group <- factor(paste0(spider$leg, spider$type))
X <- model.matrix(~ 0 + group, data=spider)
colnames(X)
## [1] "groupL1pull" "groupL1push" "groupL2pull" "groupL2push" "groupL3pull"
## [6] "groupL3push" "groupL4pull" "groupL4push"
head(X)
##   groupL1pull groupL1push groupL2pull groupL2push groupL3pull groupL3push
## 1           1           0           0           0           0           0
## 2           1           0           0           0           0           0
## 3           1           0           0           0           0           0
## 4           1           0           0           0           0           0
## 5           1           0           0           0           0           0
## 6           1           0           0           0           0           0
##   groupL4pull groupL4push
## 1           0           0
## 2           0           0
## 3           0           0
## 4           0           0
## 5           0           0
## 6           0           0
imagemat(X, main="Model matrix for linear model with group variable")
Image of model matrix for linear model with group variable. This model, also with eight terms, gives a unique fitted value for each combination of type and leg.

Image of model matrix for linear model with group variable. This model, also with eight terms, gives a unique fitted value for each combination of type and leg.

We can run the linear model with the familiar call:

fitG <- lm(friction ~ 0 + group, data=spider)
summary(fitG)
## 
## Call:
## lm(formula = friction ~ 0 + group, data = spider)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46385 -0.10735 -0.01111  0.07848  0.76853 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## groupL1pull  0.92147    0.03266   28.21   <2e-16 ***
## groupL1push  0.40735    0.03266   12.47   <2e-16 ***
## groupL2pull  1.14533    0.04917   23.29   <2e-16 ***
## groupL2push  0.52733    0.04917   10.72   <2e-16 ***
## groupL3pull  1.27385    0.02641   48.24   <2e-16 ***
## groupL3push  0.37596    0.02641   14.24   <2e-16 ***
## groupL4pull  1.40075    0.03011   46.52   <2e-16 ***
## groupL4push  0.49075    0.03011   16.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1904 on 274 degrees of freedom
## Multiple R-squared:   0.96,  Adjusted R-squared:  0.9588 
## F-statistic:   821 on 8 and 274 DF,  p-value: < 2.2e-16
coefs <- coef(fitG)

Examining the estimated coefficients

Now we have eight arrows, one for each group. The arrow tips align directly with the mean of each group:

Diagram of the estimated coefficients in the linear model, with each term representing the mean of a combination of type and leg.

Diagram of the estimated coefficients in the linear model, with each term representing the mean of a combination of type and leg.

Simple contrasts using the contrast package

While we cannot perform an ANOVA with this formulation, we can easily contrast the estimated coefficients for individual groups using the contrast function:

groupL2push.vs.pull <- contrast(fitG,
                                list(group = "L2push"), 
                                list(group = "L2pull"))
groupL2push.vs.pull
## lm model parameter contrast
## 
##   Contrast      S.E.      Lower      Upper     t  df Pr(>|t|)
## 1   -0.618 0.0695372 -0.7548951 -0.4811049 -8.89 274        0
coefs[4] - coefs[3]
## groupL2push 
##      -0.618

Differences of differences when there is no intercept

We can also make pair-wise comparisons of the estimated push vs. pull difference across leg pair. For example, if we want to compare the push vs. pull difference in leg pair L3 vs. leg pair L2:

\[ (\mbox{L3push} - \mbox{L3pull}) - (\mbox{L2push} - \mbox{L2pull}) \]

\[ = \mbox{L3 push} + \mbox{L2pull} - \mbox{L3pull} - \mbox{L2push} \]

C <- matrix(c(0,0,1,-1,-1,1,0,0), 1)
groupL3vsL2interaction <- glht(fitG, linfct=C)
summary(groupL3vsL2interaction)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = friction ~ 0 + group, data = spider)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)    
## 1 == 0 -0.27988    0.07893  -3.546  0.00046 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
names(coefs)
## [1] "groupL1pull" "groupL1push" "groupL2pull" "groupL2push" "groupL3pull"
## [6] "groupL3push" "groupL4pull" "groupL4push"
(coefs[6] - coefs[5]) - (coefs[4] - coefs[3])
## groupL3push 
##  -0.2798846