In this module we try to minimize mathematical notation as much as possible. Furthermore, we avoid using calculus to motivate statistical concepts. However, Matrix Algebra (also referred to as Linear Algebra) and its mathematical notation greatly facilitate the exposition of the advanced data analysis techniques covered in the remainder of this book. We therefore dedicate a chapter of this book to introducing Matrix Algebra. We do this in the context of data analysis and using one of the main applications: Linear Models.
We will describe three examples from the life sciences: one from physics, one related to genetics, and one from a mouse experiment. They are very different, yet we end up using the same statistical technique: fitting linear models. Linear models are typically taught and described in the language of matrix algebra.
Imagine you are Galileo in the 16th century trying to describe the velocity of a falling object. An assistant climbs the Tower of Pisa and drops a ball, while several other assistants record the position at different times. Let’s simulate some data using the equations we know today and adding some measurement error:
set.seed(1)
g <- 9.8 ##meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, note: we use tt because t is a base function
d <- 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1) ##meters
The assistants hand the data to Galileo and this is what he sees:
mypar()
plot(tt,d,ylab="Distance in meters",xlab="Time in seconds")
Simulated data for distance travelled versus time of falling object measured with error.
He does not know the exact equation, but by looking at the plot above he deduces that the position should follow a parabola. So he models the data with:
\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, i=1,\dots,n \]
With \(Y_i\) representing location, \(x_i\) representing the time, and \(\varepsilon_i\) accounting for measurement error. This is a linear model because it is a linear combination of known quantities (the \(x\)s), referred to as predictors or covariates, and unknown parameters (the \(\beta\)s).
Now imagine you are Francis Galton in the 19th century and you collect paired height data from fathers and sons. You suspect that height is inherited. Your data:
data(father.son,package="UsingR")
x=father.son$fheight
y=father.son$sheight
looks like this:
plot(x,y,xlab="Father's height",ylab="Son's height")
Galton’s data. Son heights versus father heights.
The sons’ heights do seem to increase linearly with the fathers’ heights. In this case, a model that describes the data is as follows:
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1,\dots,N \]
This is also a linear model with \(x_i\) and \(Y_i\), the father and son heights respectively, for the \(i\)-th pair and \(\varepsilon\) a term to account for the extra variability. Here we think of the fathers’ heights as the predictor and being fixed (not random) so we use lower case. Measurement error alone can’t explain all the variability seen in \(\varepsilon\). This makes sense as there are other variables not in the model, for example, mothers’ heights, genetic randomness, and environmental factors.
Here we read-in mouse body weight data from mice that were fed two different diets: high fat and control (chow). We have a random sample of 12 mice for each. We are interested in determining if the diet has an effect on weight. Here is the data:
dat <- read.csv("femaleMiceWeights.csv")
mypar(1,1)
stripchart(Bodyweight~Diet,data=dat,vertical=TRUE,method="jitter",pch=1,main="Mice weights")
Mouse weights under two diets.
We want to estimate the difference in average weight between populations. We demonstrated how to do this using t-tests and confidence intervals, based on the difference in sample averages. We can obtain the same exact results using a linear model:
\[ Y_i = \beta_0 + \beta_1 x_{i} + \varepsilon_i\]
with \(\beta_0\) the chow diet average weight, \(\beta_1\) the difference between averages, \(x_i = 1\) when mouse \(i\) gets the high fat (hf) diet, \(x_i = 0\) when it gets the chow diet, and \(\varepsilon_i\) explains the differences between mice of the same population.
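As a preview, here is a minimal sketch of how the lm function fits this model in R (it assumes the dat object read in above, with columns Bodyweight and Diet); lm builds the 0/1 indicator \(x_i\) from the Diet column automatically:
##sketch: lm() creates the 0/1 dummy variable from the Diet factor
fit <- lm(Bodyweight~Diet, data=dat)
coef(fit) ##first entry: chow average; second entry: hf minus chow difference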
Introduction Exercises #1

If you haven't done so already, install the UsingR package with install.packages("UsingR").
Then once you load it you have access to Galton’s father and son heights:
library(UsingR)
data("father.son",package="UsingR")
head(father.son)
## fheight sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
Q:What is the average height of the sons (don’t round off)?
mean(father.son$sheight)
## [1] 68.68407
A:68.68407
Introduction Exercises #2
One of the defining features of regression is that we stratify one variable based on others. In Statistics we use the verb “condition”. For example, the linear model for son and father heights answers the question how tall do I expect a son to be if I condition on his father being x inches. The regression line answers this question for any x.
Using the father.son dataset described above, we want to know the expected height of sons if we condition on the father being 71 inches. Create a list of son heights for sons that have fathers with heights of 71 inches (round to the nearest inch).
What is the mean of the son heights for fathers that have a height of 71 inches (don’t round off your answer)? (Hint: use the function round() on the fathers’ heights)
fatherht<-round(father.son$fheight,0)
fathersonht<-cbind(fatherht,father.son)
suppressMessages(library(dplyr))
sonht<-filter(fathersonht,fatherht==71) %>% select(sheight) %>% unlist
mean(sonht)
## [1] 70.54082
sonht<-fathersonht %>% filter(fatherht==71) %>% select(sheight) %>% unlist
mean(sonht)
## [1] 70.54082
#or
#filter(father.son,round(fheight)==71) %>% summarize(mean(sheight))
#or
mean(father.son$sheight[round(father.son$fheight)==71])
## [1] 70.54082
A:70.54082
Introduction Exercises #3
We say a statistical model is a linear model when we can write it as a linear combination of parameters and known covariates plus random error terms. In the choices below, Y represents our observations, time t is our only covariate, unknown parameters are represented with letters a,b,c,d and measurement error is represented by the letter e. Note that if t is known, then any transformation of t is also known. So, for example, both Y=a+bt+e and Y=a+b f(t)+e are linear models. Which of the following can't be written as a linear model?
Y = a + bt + e
Y = a + b cos(t) + e
Y = a + b^t + e
Y = a + b t + c t^2 + d t^3 + e
A:Y = a + b^t + e
EXPLANATION:In every other case we can write the model as linear combination of parameters and known covariates. b^t is not a linear combination of b and t.
Introduction Exercises #4
Suppose you model the relationship between weight and height across individuals with a linear model. You assume that the weight of individuals for a fixed height x follows a linear model Y = a + b x + e. Which of the following do you feel best describes what e represents?
A:Between individual variability: people of the same height vary in their weight
EXPLANATION: Remember the model is across individuals and we fix x. People of the same height can vary greatly in other aspects of their physiology: for example different bone density or differing amounts of muscle and fat.
We have seen three very different examples in which linear models can be used. A general model that encompasses all of the above examples is the following:
\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} + \varepsilon_i, i=1,\dots,n \]
\[ Y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{i,j} + \varepsilon_i, i=1,\dots,n \]
Note that we have a general number of predictors \(p\). Matrix algebra provides a compact language and mathematical framework to compute and make derivations with any linear model that fits into the above framework.
For the models above to be useful we have to estimate the unknown \(\beta\) s. In the first example, we want to describe a physical process for which we can’t have unknown parameters. In the second example, we better understand inheritance by estimating how much, on average, the father’s height affects the son’s height. In the final example, we want to determine if there is in fact a difference: if \(\beta_1 \neq 0\).
The standard approach in science is to find the values that minimize the distance of the fitted model to the data. The following is called the least squares (LS) equation and we will see it often in this chapter:
\[ \sum_{i=1}^n \left\{ Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j x_{i,j}\right)\right\}^2 \]
Once we find the minimum, we will call the values the least squares estimates (LSE) and denote them with \(\hat{\beta}\). The quantity obtained when evaluating the least square equation at the estimates is called the residual sum of squares (RSS). Since all these quantities depend on \(Y\), they are random variables. The \(\hat{\beta}\) s are random variables and we will eventually perform inference on them.
Thanks to my high school physics teacher, I know that the equation for the trajectory of a falling object is:
\[d = h_0 + v_0 t - 0.5 \times 9.8 t^2\]
with \(h_0\) and \(v_0\) the starting height and velocity respectively. The data we simulated above followed this equation, adding measurement error to simulate n observations for dropping the ball \((v_0=0)\) from the Tower of Pisa \((h_0=56.67)\). This is why we used this code to simulate data:
g <- 9.8 ##meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
f <- 56.67 - 0.5*g*tt^2
y <- f + rnorm(n,sd=1)
Here is what the data looks like with the solid line representing the true trajectory:
plot(tt,y,ylab="Distance in meters",xlab="Time in seconds")
lines(tt,f,col=2)
Fitted model for simulated data for distance travelled versus time of falling object measured with error.
But we were pretending to be Galileo and so we don’t know the parameters in the model. The data does suggest it is a parabola, so we model it as such:
\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, i=1,\dots,n \]
How do we find the LSE?
The lm function

In R we can fit this model by simply using the lm function. We will describe this function in detail later, but here is a preview:
tt2 <-tt^2
fit <- lm(y~tt+tt2)
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9733469 0.6059749 94.0193248 3.717816e-30
## tt -0.2088678 0.8254655 -0.2530303 8.025942e-01
## tt2 -4.9066385 0.2345028 -20.9235860 5.175646e-16
It gives us the LSE, as well as standard errors and p-values.
Part of what we do in this section is to explain the mathematics behind this function.
Let’s write a function that computes the RSS for any vector \(\beta\):
rss <- function(Beta0,Beta1,Beta2){
r <- y - (Beta0+Beta1*tt+Beta2*tt^2)
return(sum(r^2))
}
So for any three dimensional vector we get an RSS. Here is a plot of the RSS as a function of \(\beta_2\) when we keep the other two fixed:
Beta2s<- seq(-10,0,len=100)
plot(Beta2s,sapply(Beta2s,rss,Beta0=55,Beta1=0),
ylab="RSS",xlab="Beta2",type="l")
##Let's add another curve fixing another pair:
Beta2s<- seq(-10,0,len=100)
lines(Beta2s,sapply(Beta2s,rss,Beta0=65,Beta1=0),col=2)
Residual sum of squares obtained for several values of the parameters.
Trial and error here is not going to work. Instead, we can use calculus: take the partial derivatives, set them to 0 and solve. Of course, if we have many parameters, these equations can get rather complex. Linear algebra provides a compact and general way of solving this problem.
When studying the father-son data, Galton made a fascinating discovery using exploratory analysis.
He noted that if he tabulated the number of father-son height pairs and followed all the x,y values having the same totals in the table, they formed an ellipse. In the plot above, made by Galton, you see the ellipse formed by the pairs having 3 cases. This then led to modeling this data as correlated bivariate normal which we described earlier:
\[ Pr(X<a,Y<b) = \]
\[ \int_{-\infty}^{a} \int_{-\infty}^{b} \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp{ \left\{ -\frac{1}{2(1-\rho^2)} \left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)+ \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right\} } \, dy \, dx \]
We described how we can use math to show that if you keep \(X\) fixed (condition to be \(x\)) the distribution of \(Y\) is normally distributed with mean \(\mu_y +\sigma_y \rho \left(\frac{x-\mu_x}{\sigma_x}\right)\) and standard deviation \(\sigma_y \sqrt{1-\rho^2}\). Note that \(\rho\) is the correlation between \(Y\) and \(X\), which implies that if we fix \(X=x\), \(Y\) does in fact follow a linear model. The \(\beta_0\) and \(\beta_1\) parameters in our simple linear model can be expressed in terms of \(\mu_x,\mu_y,\sigma_x,\sigma_y\), and \(\rho\).
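As a quick numerical sketch, we can plug the sample versions of \(\mu_x,\mu_y,\sigma_x,\sigma_y\), and \(\rho\) from the father.son data into these expressions and check that they give the same slope and intercept that the least squares fit returns:
##sketch: regression coefficients implied by the bivariate normal formulas
data(father.son,package="UsingR")
x <- father.son$fheight
y <- father.son$sheight
rho <- cor(x,y)
beta1 <- rho*sd(y)/sd(x)         ##slope
beta0 <- mean(y) - beta1*mean(x) ##intercept
c(beta0,beta1)
coef(lm(y~x)) ##same values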
Here we introduce the basics of matrix notation. Initially this may seem over-complicated, but once we discuss examples, you will appreciate the power of using this notation to both explain and derive solutions, as well as implement them as R code.
Linear algebra notation actually simplifies the mathematical descriptions and manipulations of linear models, as well as coding in R. We will discuss the basics of this notation and then show some examples in R.
The main point of this entire exercise is to show how we can write the models above using matrix notation, and then explain how this is useful for solving the least squares equation. We start by simply defining notation and matrix multiplication, but bear with us since we eventually get back to the practical application.
Linear algebra was created by mathematicians to solve systems of linear equations such as this:
\[ \begin{align*} a + b + c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 1 \end{align*} \]
It provides very useful machinery to solve these problems generally. We will learn how we can write and solve this system using matrix algebra notation:
\[ \, \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \implies \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}^{-1} \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \]
This section explains the notation used above. It turns out that we can borrow this notation for linear models in statistics as well.
In the falling object, father-son heights, and mouse weight examples, the random variables associated with the data were represented by \(Y_1,\dots,Y_n\). We can think of this as a vector. In fact, in R we are already doing this:
data(father.son,package="UsingR")
y=father.son$fheight
head(y)
## [1] 65.04851 63.25094 64.95532 65.75250 61.13723 63.02254
In math we can also use just one symbol. We usually use bold to distinguish it from the individual entries:
\[ \mathbf{Y} = \begin{pmatrix} Y_1\\\ Y_2\\\ \vdots\\\ Y_N \end{pmatrix} \]
For reasons that will soon become clear, the default representation of data vectors has dimension \(N\times 1\) as opposed to \(1 \times N\).
Here we don’t always use bold because normally one can tell what is a matrix from the context.
Similarly, we can use math notation to represent the covariates or predictors. In a case with two predictors we can represent them like this:
\[ \mathbf{X}_1 = \begin{pmatrix} x_{1,1}\\ \vdots\\ x_{N,1} \end{pmatrix} \mbox{ and } \mathbf{X}_2 = \begin{pmatrix} x_{1,2}\\ \vdots\\ x_{N,2} \end{pmatrix} \]
Note that for the falling object example \(x_{i,1}= t_i\) and \(x_{i,2}=t_i^2\), with \(t_i\) the time of the i-th observation. Also, keep in mind that vectors can be thought of as \(N\times 1\) matrices.
For reasons that will soon become apparent, it is convenient to represent these in matrices:
\[ \mathbf{X} = [ \mathbf{X}_1 \mathbf{X}_2 ] = \begin{pmatrix} x_{1,1}&x_{1,2}\\ \vdots\\ x_{N,1}&x_{N,2} \end{pmatrix} \]
This matrix has dimension \(N \times 2\). We can create this matrix in R this way:
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
X <- cbind(X1=tt,X2=tt^2)
head(X)
## X1 X2
## [1,] 0.0000000 0.00000000
## [2,] 0.1416667 0.02006944
## [3,] 0.2833333 0.08027778
## [4,] 0.4250000 0.18062500
## [5,] 0.5666667 0.32111111
## [6,] 0.7083333 0.50173611
dim(X)
## [1] 25 2
We can also use this notation to denote an arbitrary number of covariates with the following \(N\times p\) matrix:
\[ \mathbf{X} = \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \]
Just as an example, we show you how to make one in R now using matrix instead of cbind:
N <- 100; p <- 5
X <- matrix(1:(N*p),N,p)
head(X)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 101 201 301 401
## [2,] 2 102 202 302 402
## [3,] 3 103 203 303 403
## [4,] 4 104 204 304 404
## [5,] 5 105 205 305 405
## [6,] 6 106 206 306 406
dim(X)
## [1] 100 5
By default, the matrices are filled column by column. The byrow=TRUE
argument lets us change that to row by row:
N <- 100; p <- 5
X <- matrix(1:(N*p),N,p,byrow=TRUE)
head(X)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
## [5,] 21 22 23 24 25
## [6,] 26 27 28 29 30
Finally, we define a scalar. A scalar is just a number, which we call a scalar because we want to distinguish it from vectors and matrices. We usually use lower case and don’t bold. In the next section, we will understand why we make this distinction.
Matrix Notation Exercises #1
In R we have vectors and matrices. You can create your own vectors with the function c.
c(1,5,3,4)
They are also the output of many functions such as
rnorm(10)
You can turn vectors into matrices using functions such as rbind, cbind or matrix.
Create the matrix from the vector 1:1000 like this:
X = matrix(1:1000,100,10)
What is the entry in row 25, column 3 ?
X = matrix(1:1000,100,10)
X[25,3]
## [1] 225
A:225
Matrix Notation Exercises #2
Using the function cbind, create a 10 x 5 matrix with first column x=1:10. Then columns 2x, 3x, 4x and 5x in columns 2 through 5.
What is the sum of the elements of the 7th row?
x=1:10
M<-cbind(x,2*x,3*x,4*x,5*x)
sum(M[7,])
## [1] 105
A:105
Matrix Notation Exercises #3
Which of the following creates a matrix with multiples of 3 in the third column?
matrix(1:60,20,3)
## [,1] [,2] [,3]
## [1,] 1 21 41
## [2,] 2 22 42
## [3,] 3 23 43
## [4,] 4 24 44
## [5,] 5 25 45
## [6,] 6 26 46
## [7,] 7 27 47
## [8,] 8 28 48
## [9,] 9 29 49
## [10,] 10 30 50
## [11,] 11 31 51
## [12,] 12 32 52
## [13,] 13 33 53
## [14,] 14 34 54
## [15,] 15 35 55
## [16,] 16 36 56
## [17,] 17 37 57
## [18,] 18 38 58
## [19,] 19 39 59
## [20,] 20 40 60
matrix(1:60,20,3,byrow=TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
## [5,] 13 14 15
## [6,] 16 17 18
## [7,] 19 20 21
## [8,] 22 23 24
## [9,] 25 26 27
## [10,] 28 29 30
## [11,] 31 32 33
## [12,] 34 35 36
## [13,] 37 38 39
## [14,] 40 41 42
## [15,] 43 44 45
## [16,] 46 47 48
## [17,] 49 50 51
## [18,] 52 53 54
## [19,] 55 56 57
## [20,] 58 59 60
x=11:20;rbind(x,2*x,3*x)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## x 11 12 13 14 15 16 17 18 19 20
## 22 24 26 28 30 32 34 36 38 40
## 33 36 39 42 45 48 51 54 57 60
x=1:40;matrix(3*x,20,2)
## [,1] [,2]
## [1,] 3 63
## [2,] 6 66
## [3,] 9 69
## [4,] 12 72
## [5,] 15 75
## [6,] 18 78
## [7,] 21 81
## [8,] 24 84
## [9,] 27 87
## [10,] 30 90
## [11,] 33 93
## [12,] 36 96
## [13,] 39 99
## [14,] 42 102
## [15,] 45 105
## [16,] 48 108
## [17,] 51 111
## [18,] 54 114
## [19,] 57 117
## [20,] 60 120
A:matrix(1:60,20,3,byrow=TRUE)
EXPLANATION:You can make each of the matrices in R and examine them visually. Or you can check whether the third column has all multiples of 3 with all(X[,3]%%3==0). Note that the fourth choice does not even have a 3rd column.
In a previous section, we motivated the use of matrix algebra with this system of equations:
\[ \begin{align*} a + b + c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 1 \end{align*} \]
We described how this system can be rewritten and solved using matrix algebra:
\[ \, \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \implies \begin{pmatrix} a\\ b\\ c \end{pmatrix}= \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}^{-1} \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \]
Having described matrix notation, we will explain the operations we perform with matrices. For example, above we have matrix multiplication and we also have a symbol representing the inverse of a matrix. The importance of these operations and others will become clear once we present specific examples related to data analysis.
We start with one of the simplest operations: scalar multiplication. If \(a\) is a scalar and \(\mathbf{X}\) is a matrix, then:
\[ \mathbf{X} = \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \implies a \mathbf{X} = \begin{pmatrix} a x_{1,1} & \dots & a x_{1,p}\\ a x_{2,1}&\dots & a x_{2,p} \\ & \vdots & \\ a x_{N,1} & \dots & a x_{N,p} \end{pmatrix} \]
R automatically follows this rule when we multiply a number by a matrix using *:
X <- matrix(1:12,4,3)
print(X)
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
a <- 2
print(a*X)
## [,1] [,2] [,3]
## [1,] 2 10 18
## [2,] 4 12 20
## [3,] 6 14 22
## [4,] 8 16 24
The transpose is an operation that simply changes columns to rows. We use a \(\top\) to denote a transpose. The technical definition is as follows: if X is as we defined it above, here is the transpose which will be \(p\times N\):
\[ \mathbf{X} = \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \implies \mathbf{X}^\top = \begin{pmatrix} x_{1,1}&\dots & x_{p,1} \\ x_{1,2}&\dots & x_{p,2} \\ & \vdots & \\ x_{1,N}&\dots & x_{p,N} \end{pmatrix} \]
In R we simply use t:
X <- matrix(1:12,4,3)
X
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
t(X)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
We start by describing the matrix multiplication shown in the original system of equations example:
\[ \begin{align*} a + b + c &=6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 1 \end{align*} \]
What we are doing is multiplying the rows of the first matrix by the columns of the second. Since the second matrix only has one column, we perform this multiplication by doing the following:
\[ \, \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix}= \begin{pmatrix} a + b + c \\ 3a - 2b + c \\ 2a + b - c \end{pmatrix} \]
Here is a simple example. We can check to see if abc=c(3,2,1) is a solution:
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
abc <- c(3,2,1) #use as an example
rbind( sum(X[1,]*abc), sum(X[2,]*abc), sum(X[3,]*abc))
## [,1]
## [1,] 6
## [2,] 6
## [3,] 7
We can use %*% to perform the matrix multiplication and make this much more compact:
X%*%abc
## [,1]
## [1,] 6
## [2,] 6
## [3,] 7
We can see that c(3,2,1) is not a solution as the answer here is not the required c(6,2,1).
To get the solution, we will need to invert the matrix on the left, a concept we learn about below.
Here is the general definition of matrix multiplication of matrices \(A\) and \(X\):
\[ \mathbf{AX} = \begin{pmatrix} a_{1,1} & a_{1,2} & \dots & a_{1,N}\\ a_{2,1} & a_{2,2} & \dots & a_{2,N}\\ & & \vdots & \\ a_{M,1} & a_{M,2} & \dots & a_{M,N} \end{pmatrix} \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \]
\[ = \begin{pmatrix} \sum_{i=1}^N a_{1,i} x_{i,1} & \dots & \sum_{i=1}^N a_{1,i} x_{i,p}\\ & \vdots & \\ \sum_{i=1}^N a_{M,i} x_{i,1} & \dots & \sum_{i=1}^N a_{M,i} x_{i,p} \end{pmatrix} \]
You can only take the product if the number of columns of the first matrix \(A\) equals the number of rows of the second one \(X\). Also, the final matrix has the same number of rows as the first \(A\) and the same number of columns as the second \(X\). After you study the example below, you may want to come back and re-read the sections above.
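To see the definition in action, here is a small sketch in R that checks one entry of a product against the sum in the formula (the matrices A and X below are arbitrary examples):
A <- matrix(1:6,2,3)  ##a 2 x 3 matrix
X <- matrix(1:12,3,4) ##a 3 x 4 matrix: the columns of A match the rows of X
P <- A%*%X            ##the product is 2 x 4
P[2,3]                ##entry in row 2, column 3 of the product
sum(A[2,]*X[,3])      ##the same number, computed from the definition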
The identity matrix is analogous to the number 1: if you multiply the identity matrix by another matrix, you get the same matrix. For this to happen, we need it to be like this:
\[ \mathbf{I} = \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ \vdots &\vdots & \vdots&\ddots&\vdots&\vdots\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix} \]
By this definition, the identity always has to have the same number of rows as columns or be what we call a square matrix.
If you follow the matrix multiplication rule above, you notice this works out:
\[ \mathbf{XI} = \begin{pmatrix} x_{1,1} & \dots & x_{1,p}\\ & \vdots & \\ x_{N,1} & \dots & x_{N,p} \end{pmatrix} \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ & & &\vdots& &\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix} = \begin{pmatrix} x_{1,1} & \dots & x_{1,p}\\ & \vdots & \\ x_{N,1} & \dots & x_{N,p} \end{pmatrix} \]
In R you can form an identity matrix this way:
n <- 5 #pick dimensions
diag(n)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
The inverse of matrix \(X\), denoted with \(X^{-1}\), has the property that, when multiplied, gives you the identity \(X^{-1}X=I\). Of course, not all matrices have inverses. For example, a \(2\times 2\) matrix with 1s in all its entries does not have an inverse.
As we will see when we get to the section on applications to linear models, being able to compute the inverse of a matrix is quite useful. A very convenient aspect of R is that it includes a predefined function solve to do this. Here is how we would use it to solve the system of equations:
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
y <- matrix(c(6,2,1),3,1)
solve(X)%*%y #equivalent to solve(X,y)
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
Please note that solve is a function that should be used with caution as it is not generally numerically stable. We explain this in much more detail in the QR factorization section.
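As a quick sketch, we can also confirm the two claims made above: multiplying a matrix by its inverse returns the identity, and the all-1s matrix has no inverse:
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
round( solve(X)%*%X )         ##gives back the identity matrix
try( solve( matrix(1,2,2) ) ) ##a 2x2 matrix of all 1s is singular, so solve() errors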
Matrix Operation Exercises #1
Suppose X is a matrix in R. Which of the following is NOT equivalent to X?
X <- matrix(1:12,4,3)
X
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
t( t(X) )
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
X %*% matrix(1,ncol(X) )
## [,1]
## [1,] 15
## [2,] 18
## [3,] 21
## [4,] 24
X*1
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
X%*%diag(ncol(X))
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
A:X %*% matrix(1,ncol(X) )
EXPLANATION: The first transposes the transpose, so we end up with our original X. The third is multiplying each element by 1, and the fourth is multiplying X by the identity. The second is not even guaranteed to have the same dimensions as X.
Matrix Operation Exercises #2
Solve the following system of equations using R:
3a + 4b - 5c + d = 10
2a + 2b + 2c - d = 5
a -b + 5c - 5d = 7
5a + d = 4
What is the solution for c:
X <- matrix(c(3,2,1,5,4,2,-1,0,-5,2,5,0,1,-1,-5,1),4,4)
y <- matrix(c(10,5,7,4),4,1)
solve(X)%*%y #equivalent to solve(X,y)
## [,1]
## [1,] 1.2477876
## [2,] 1.0176991
## [3,] -0.8849558
## [4,] -2.2389381
#or
X = matrix(c(3,2,1,5,4,2,-1,0,-5,2,5,0,1,-1,-5,1),4,4)
y = c(10,5,7,4)
sol = solve(X,y)
#and c is the third entry:
sol[ 3 ]
## [1] -0.8849558
A:-0.8849558
Load the following two matrices into R:
a <- matrix(1:12, nrow=4)
b <- matrix(1:15, nrow=3)
Note the dimension of ‘a’ and the dimension of ‘b’.
In the questions below, we will use the matrix multiplication operator in R, %*%, to multiply these two matrices.
Matrix Operation Exercises #3
What is the value in the 3rd row and the 2nd column of the matrix product of ‘a’ and ‘b’?
a <- matrix(1:12, nrow=4)
b <- matrix(1:15, nrow=3)
c<-a %*% b
c[3,2]
## [1] 113
A:113
Matrix Operation Exercises #4
Multiply the 3rd row of ‘a’ with the 2nd column of ‘b’, using the element-wise vector multiplication with *.
What is the sum of the elements in the resulting vector?
sum(a[3,] * b[,2])
## [1] 113
A:113, which is equivalent to the 3rd row, 2nd column element of the product of the two matrices.
Now we are ready to see how matrix algebra can be useful when analyzing data. We start with some simple examples and eventually arrive at the main one: how to write linear models with matrix algebra notation and solve the least squares problem.
To compute the sample average and variance of our data, we use these formulas \(\bar{Y}=\frac{1}{N} \sum_{i=1}^N Y_i\) and \(\mbox{var}(Y)=\frac{1}{N} \sum_{i=1}^N (Y_i - \bar{Y})^2\). We can represent these with matrix multiplication. First, define this \(N \times 1\) matrix made just of 1s:
\[ A=\begin{pmatrix} 1\\ 1\\ \vdots\\ 1 \end{pmatrix} \]
This implies that:
\[ \frac{1}{N} \mathbf{A}^\top \mathbf{Y} = \frac{1}{N} \begin{pmatrix}1&1&\dots&1\end{pmatrix} \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix}= \frac{1}{N} \sum_{i=1}^N Y_i = \bar{Y} \]
Note that we are multiplying by the scalar \(1/N\). In R, we multiply matrices using %*%:
data(father.son,package="UsingR")
y <- father.son$sheight
print(mean(y))
## [1] 68.68407
N <- length(y)
Y<- matrix(y,N,1)
A <- matrix(1,N,1)
barY=t(A)%*%Y / N
print(barY)
## [,1]
## [1,] 68.68407
As we will see later, multiplying the transpose of a matrix with another is very common in statistics. In fact, it is so common that there is a function in R:
barY=crossprod(A,Y) / N
print(barY)
## [,1]
## [1,] 68.68407
For the variance, we note that if:
\[ \mathbf{r}\equiv \begin{pmatrix} Y_1 - \bar{Y}\\ \vdots\\ Y_N - \bar{Y} \end{pmatrix}, \,\, \frac{1}{N} \mathbf{r}^\top\mathbf{r} = \frac{1}{N}\sum_{i=1}^N (Y_i - \bar{Y})^2 \]
In R, if you only send one matrix into crossprod, it computes \(\mathbf{r}^\top \mathbf{r}\), so we can simply type:
r <- y - barY
crossprod(r)/N
## [,1]
## [1,] 7.915196
Which is almost equivalent to:
library(rafalib)
popvar(y)
## [1] 7.915196
Now we are ready to put all this to use. Let’s start with Galton’s example. If we define these matrices:
\[ \mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} , \mathbf{X} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} , \mathbf{\beta} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \mathbf{\varepsilon} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
Then we can write the model:
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1,\dots,N \]
as:
\[ \, \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
or simply:
\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]
which is a much simpler way to write it.
The least squares equation becomes simpler as well since it is the following cross-product:
\[ (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^\top (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) \]
So now we are ready to determine which values of \(\beta\) minimize the above, which we can do using calculus to find the minimum.
There are a series of rules that permit us to compute partial derivative equations in matrix notation. By equating the derivative to 0 and solving for the \(\beta\), we will have our solution. The only one we need here tells us that the derivative of the above equation is:
\[ 2 \mathbf{X}^\top (\mathbf{Y} - \mathbf{X} \boldsymbol{\hat{\beta}})=0 \]
\[ \mathbf{X}^\top \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^\top \mathbf{Y} \]
\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
and we have our solution. We usually put a hat on the \(\beta\) that solves this, \(\hat{\beta}\) , as it is an estimate of the “real” \(\beta\) that generated the data.
Remember that the least squares are like a square (multiply something by itself) and that this formula is similar to the derivative of \(f(x)^2\) being \(2f(x)f^\prime(x)\).
Let’s see how it works in R:
data(father.son,package="UsingR")
x=father.son$fheight
y=father.son$sheight
X <- cbind(1,x)
betahat <- solve( t(X) %*% X ) %*% t(X) %*% y
###or
betahat <- solve( crossprod(X) ) %*% crossprod( X, y )
Now we can see the results of this by computing the estimated \(\hat{\beta}_0+\hat{\beta}_1 x\) for any value of \(x\):
newx <- seq(min(x),max(x),len=100)
X <- cbind(1,newx)
fitted <- X%*%betahat
plot(x,y,xlab="Father's height",ylab="Son's height")
lines(newx,fitted,col=2)
Galton’s data with fitted regression line.
This \(\hat{\boldsymbol{\beta}}=(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}\) is one of the most widely used results in data analysis. One of the advantages of this approach is that we can use it in many different situations. For example, in our falling object problem:
set.seed(1)
g <- 9.8 #meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) #time in secs, t is a base function
d <- 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1)
Notice that we are using almost the same exact code:
X <- cbind(1,tt,tt^2)
y <- d
betahat <- solve(crossprod(X))%*%crossprod(X,y)
newtt <- seq(min(tt),max(tt),len=100)
X <- cbind(1,newtt,newtt^2)
fitted <- X%*%betahat
plot(tt,y,xlab="Time",ylab="Height")
lines(newtt,fitted,col=2)
Fitted parabola to simulated data for distance travelled versus time of falling object measured with error.
And the resulting estimates are what we expect:
betahat
## [,1]
## 56.5317368
## tt 0.5013565
## -5.0386455
The Tower of Pisa is about 56 meters high. Since we are just dropping the object there is no initial velocity, and half the constant of gravity is 9.8/2=4.9 meters per second squared.
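As a sanity check, here is a small sketch that translates the betahat computed above back into these physical quantities:
betahat[1]    ##estimated starting height; compare to h0 = 56.67
betahat[2]    ##estimated initial velocity; compare to v0 = 0
-2*betahat[3] ##estimated g, since the quadratic coefficient is -0.5*g; compare to 9.8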
The lm Function

R has a very convenient function that fits these models. We will learn more about this function later, but here is a preview:
X <- cbind(tt,tt^2)
fit=lm(y~X)
summary(fit)
##
## Call:
## lm(formula = y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5295 -0.4882 0.2537 0.6560 1.5455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.5317 0.5451 103.701 <2e-16 ***
## Xtt 0.5014 0.7426 0.675 0.507
## X -5.0386 0.2110 -23.884 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9822 on 22 degrees of freedom
## Multiple R-squared: 0.9973, Adjusted R-squared: 0.997
## F-statistic: 4025 on 2 and 22 DF, p-value: < 2.2e-16
Note that we obtain the same values as above.
We have shown how to write linear models using linear algebra. We are going to do this for several examples, many of which are related to designed experiments. We also demonstrated how to obtain least squares estimates. Nevertheless, it is important to remember that because \(y\) is a random variable, these estimates are random as well. In a later section, we will learn how to compute standard error for these estimates and use this to perform inference.
Let’s revisit the falling object example and work through the RSS and LSE computations using matrix algebra in R. First we simulate the data:
set.seed(1)
g <- 9.8 #meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) #time in secs, t is a base function
d <- 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1)
Let’s write a function that computes the RSS for any vector \(\beta\):
rss <- function(Beta0,Beta1,Beta2){
r <- d - (Beta0+Beta1*tt+Beta2*tt^2)
return(sum(r^2))
}
Beta2s<- seq(-10,0,len=100)
RSS<-sapply(Beta2s,rss,Beta0=65,Beta1=0)
plot(Beta2s,sapply(Beta2s,rss,Beta0=55,Beta1=0),
ylab="RSS",xlab="Beta2",type="l")
lines(Beta2s,RSS,type="l",col=3)
X <- cbind(1,tt,tt^2)
## same as above: X <- cbind(rep(1,length(tt)),tt,tt^2)
head(X)
## tt
## [1,] 1 0.0000000 0.00000000
## [2,] 1 0.1416667 0.02006944
## [3,] 1 0.2833333 0.08027778
## [4,] 1 0.4250000 0.18062500
## [5,] 1 0.5666667 0.32111111
## [6,] 1 0.7083333 0.50173611
## let's now compute the RSS for a given beta vector
y <- d
Beta<-matrix(c(55,0,5),3,1)
r<-y-X %*% Beta #residuals
RSS<-t(r) %*% r
## let's check that we get the same answer
rss(55,0,5)
## [1] 66131.18
RSS
## [,1]
## [1,] 66131.18
## crossprod gives a faster way to compute t(r) %*% r
RSS<-crossprod(r)
RSS ## we get the same answer
## [,1]
## [1,] 66131.18
Least Squares Estimates (LSE)
#Using Matrix algebra
betahat<-solve(t(X) %*% X) %*% t(X) %*% y
betahat
## [,1]
## 56.5317368
## tt 0.5013565
## -5.0386455
#or
betahat <- solve(crossprod(X))%*%crossprod(X,y)
betahat
## [,1]
## 56.5317368
## tt 0.5013565
## -5.0386455
#Using R lm function
tt2 <-tt^2
fit <- lm(y~tt+tt2)
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.5317368 0.5451423 103.7008720 4.323897e-31
## tt 0.5013565 0.7425988 0.6751378 5.066226e-01
## tt2 -5.0386455 0.2109615 -23.8841916 3.167820e-17
## Note: solve() can be numerically unstable. A more stable approach uses the QR factorization together with backsolve():
QR<-qr(X)
Q<-qr.Q(QR)
R<-qr.R(QR)
backsolve(R,crossprod(Q,y))
## [,1]
## [1,] 56.5317368
## [2,] 0.5013565
## [3,] -5.0386455
Suppose we are analyzing a set of 4 samples. The first two samples are from a treatment group A and the second two samples are from a treatment group B. This design can be represented with a model matrix like so:
X <- matrix(c(1,1,1,1,0,0,1,1),nrow=4)
rownames(X) <- c("a","a","b","b")
X
## [,1] [,2]
## a 1 0
## a 1 0
## b 1 1
## b 1 1
Suppose that the fitted parameters for a linear model give us:
beta <- c(5, 2)
Use the matrix multiplication operator, %*%, in R to answer the following questions:
Matrix Algebra Examples Exercises #1
What is the fitted value for the A samples? (The fitted Y values.)
The formula for finding beta using least squares estimation is:
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
We can calculate this in R using our matrix multiplication operator %*%, the inverse function solve, and the transpose function t.
To compute the fitted values, recall the model:
\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]
# the fitted values are X %*% beta, given the betas and the design matrix X
fitted = X %*% beta
fitted[ 1:2, ]
## a a
## 5 5
A:5
Matrix Algebra Examples Exercises #2
What is the fitted value for the B samples? (The fitted Y values.)
fitted = X %*% beta
fitted[ 3:4, ]
## b b
## 7 7
A:7
Suppose now we are comparing two treatments B and C to a control group A, each with two samples. This design can be represented with a model matrix like so:
X <- matrix(c(1,1,1,1,1,1,0,0,1,1,0,0,0,0,0,0,1,1),nrow=6)
rownames(X) <- c("a","a","b","b","c","c")
#which results in a matrix that looks like
X
## [,1] [,2] [,3]
## a 1 0 0
## a 1 0 0
## b 1 1 0
## b 1 1 0
## c 1 0 1
## c 1 0 1
Suppose that the fitted parameters for the linear model are given by:
beta <- c(10,3,-3)
Matrix Algebra Examples Exercises #3
What is the fitted value for the B samples?
fitted = X %*% beta
fitted[ 3:4, ]
## b b
## 13 13
A:13
Matrix Algebra Examples Exercises #4
What is the fitted value for the C samples?
fitted = X %*% beta
fitted[ 5:6, ]
## c c
## 7 7
A:7
The standard error of an estimate is the standard deviation of the sampling distribution of an estimate. In PH525.1x, we saw that our estimate of the mean of a population changed depending on the sample that we took from the population. If we repeatedly sampled from the population, and each time estimated the mean, the collection of mean estimates formed the sampling distribution of the estimate. When we took the standard deviation of those estimates, that was the standard error of our mean estimate.
Here, we aren’t sampling individuals from a population, but we do have random noise in our observations Y. The estimate for the linear model terms (beta-hat) will not be the same if we were to re-run the experiment, because the random noise would be different. If we were to re-run the experiment many times, and estimate linear model terms (beta-hat) each time, this is called the sampling distribution of the estimates. If we take the standard deviation of all of these estimates from repetitions of the experiment, this is called the standard error of the estimate. While we are not sampling individuals, you can think of each repetition of the experiment as “sampling” new errors in our observation of Y.
Inference Review Exercises #1
We have shown how to find the least squares estimates with matrix algebra. These estimates are random variables as they are linear combinations of the data. For these estimates to be useful we also need to compute the standard errors.
Here we review standard errors in the context of linear models.
It is useful to think about where randomness comes from. In our falling object example, randomness was introduced through measurement errors. Every time we rerun the experiment a new set of measurement errors will be made which implies our data will be random. This implies that our estimate of, for example, the gravitational constant will change. The constant is fixed, but our estimates are not. To see this we can run a Monte Carlo simulation. Specifically we will generate the data repeatedly and compute the estimate for the quadratic term each time.
g = 9.8 ## meters per second squared
h0 = 56.67
v0 = 0
n = 25
tt = seq(0,3.4,len=n) ##time in secs, t is a base function
y = h0 + v0*tt - 0.5*g*tt^2 + rnorm(n,sd=1)
Now we act as if we didn’t know h0, v0 and -0.5*g and use regression to estimate these. We can rewrite the model as y = b0 + b1 t + b2 t^2 + e and obtain the LSE we have used in this class. Note that g = -2 b2.
To obtain the LSE in R we could write:
X = cbind(1,tt,tt^2)
head(X)
## tt
## [1,] 1 0.0000000 0.00000000
## [2,] 1 0.1416667 0.02006944
## [3,] 1 0.2833333 0.08027778
## [4,] 1 0.4250000 0.18062500
## [5,] 1 0.5666667 0.32111111
## [6,] 1 0.7083333 0.50173611
A = solve(crossprod(X))%*%t(X)
head(A)
## [,1] [,2] [,3] [,4] [,5]
## 0.30803419 0.25948718 0.21435897 0.17264957 0.13435897
## tt -0.35475113 -0.27692308 -0.20539052 -0.14015345 -0.08121188
## 0.08517434 0.06388076 0.04443879 0.02684843 0.01110970
## [,6] [,7] [,8] [,9] [,10]
## 0.099487179 0.06803419 0.04000000 0.01538462 -0.005811966
## tt -0.028565808 0.01778477 0.05783986 0.09159945 0.119063545
## -0.002777424 -0.01481293 -0.02499682 -0.03332909 -0.039809746
## [,11] [,12] [,13] [,14] [,15] [,16]
## -0.02358974 -0.03794872 -0.04888889 -0.05641026 -0.06051282 -0.06119658
## tt 0.14023215 0.15510525 0.16368286 0.16596498 0.16195160 0.15164273
## -0.04443879 -0.04721621 -0.04814202 -0.04721621 -0.04443879 -0.03980975
## [,17] [,18] [,19] [,20] [,21]
## -0.05846154 -0.05230769 -0.04273504 -0.029743590 -0.013333333
## tt 0.13503836 0.11213850 0.08294314 0.047452292 0.005665945
## -0.03332909 -0.02499682 -0.01481293 -0.002777424 0.011109697
## [,22] [,23] [,24] [,25]
## 0.006495726 0.02974359 0.05641026 0.08649573
## tt -0.042415896 -0.09679323 -0.15746606 -0.22443439
## 0.026848434 0.04443879 0.06388076 0.08517434
Given how we have defined A, which of the following is the LSE of g, the acceleration due to gravity (suggestion: try the code in R)?
A %*% y
## [,1]
## 56.4305502
## tt 0.1467666
## -4.8943619
-2 * (A %*% y) [3]
## [1] 9.788724
A[3,3]
##
## 0.04443879
A:-2 * (A %*% y) [3]
EXPLANATION: 9.8 is not the answer because the LSE is a random variable. A %*% y gives us the LSE for all three coefficients. The third entry gives us the coefficient for the quadratic term, which is -0.5 g. We multiply by -2 to get the estimate of g.
Inference Review Exercises #2
In the lines of code above, there was a call to a random function rnorm(). This means that each time the lines of code above are repeated, the estimate of g will be different.
Use the code above in conjunction with the function replicate() to generate 100,000 Monte Carlo simulated datasets. For each dataset compute an estimate of g (remember to multiply by -2) and set the seed to 1.
What is the standard error of this estimate?:
B = 100000
g = 9.8 ## meters per second squared
n = 25
tt = seq(0,3.4,len=n) ##time in secs, t is a base function
X = cbind(1,tt,tt^2)
A = solve(crossprod(X))%*%t(X)
set.seed(1)
betahat = replicate(B,{
y = 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1)
betahats = -2*A%*%y
return(betahats[3])
})
sqrt(mean( (betahat-mean(betahat) )^2))
## [1] 0.4297449
A:0.4297449
Inference Review Exercises #3
In the father and son height examples we have randomness because we have a random sample of father and son pairs. For the sake of illustration let’s assume that this is the entire population:
library(UsingR)
x = father.son$fheight
y = father.son$sheight
n = length(y)
#Now let's run a Monte Carlo simulation in which we take a sample of size 50 over and over again. Here is how we obtain one sample:
N = 50
index = sample(n,N)
sampledat = father.son[index,]
x = sampledat$fheight
y = sampledat$sheight
betahat = lm(y~x)$coef
Use the function replicate to take 10,000 samples.
What is the standard error of the slope estimate? That is, calculate the standard deviation of the estimate from many random samples. Again, set the seed to 1.
N = 50
B = 10000
set.seed(1)
betahat = replicate(B,{
index = sample(n,N)
sampledat = father.son[index,]
x = sampledat$fheight
y = sampledat$sheight
lm(y~x)$coef[2]
})
sqrt ( mean( (betahat - mean(betahat) )^2 ))
## [1] 0.1243209
A:0.1243209
Inference Review Exercises #4
We are defining a new concept: covariance. The covariance of two lists of numbers X=X1,…,Xn and Y=Y1,…,Yn is mean( (Y - mean(Y))*(X-mean(X) ) ).
Which of the following is closest to the covariance between father heights and son heights?
Y=father.son$fheight
X=father.son$sheight
mean( (Y - mean(Y))*(X-mean(X) ) )
## [1] 3.869739
#or
x = father.son$fheight
y = father.son$sheight
cor(x,y)
## [1] 0.5013383
cov<-cor(x,y)*sd(x)*sd(y)
cov
## [1] 3.873333
#or using R in built function
cov(x,y)
## [1] 3.873333
A: 3.873333
We have shown how to find the least squares estimates with matrix algebra. These estimates are random variables since they are linear combinations of the data. For these estimates to be useful, we also need to compute their standard errors. Linear algebra provides a powerful approach for this task. We provide several examples.
It is useful to think about where randomness comes from. In our falling object example, randomness was introduced through measurement errors. Each time we rerun the experiment, a new set of measurement errors will be made. This implies that our data will change randomly, which in turn suggests that our estimates will change randomly. For instance, our estimate of the gravitational constant will change every time we perform the experiment. The constant is fixed, but our estimates are not. To see this we can run a Monte Carlo simulation. Specifically, we will generate the data repeatedly and each time compute the estimate for the quadratic term.
set.seed(1)
B <- 10000
h0 <- 56.67
v0 <- 0
g <- 9.8 ##meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
X <-cbind(1,tt,tt^2)
##create (X'X)^-1 X'
A <- solve(crossprod(X)) %*% t(X)
betahat<-replicate(B,{
y <- h0 + v0*tt - 0.5*g*tt^2 + rnorm(n,sd=1)
betahats <- A%*%y
return(betahats[3])
})
head(betahat)
## [1] -5.038646 -4.894362 -5.143756 -5.220960 -5.063322 -4.777521
As expected, the estimate is different every time. This is because \(\hat{\beta}\) is a random variable. It therefore has a distribution:
library(rafalib)
mypar(1,2)
hist(betahat)
qqnorm(betahat)
qqline(betahat)
Distribution of estimated regression coefficients obtained from Monte Carlo simulated falling object data. The left is a histogram and on the right we have a qq-plot against normal theoretical quantiles.
Since \(\hat{\beta}\) is a linear combination of the data which we made normal in our simulation, it is also normal as seen in the qq-plot above. Also, the mean of the distribution is the true parameter \(-0.5g\), which we can confirm using the Monte Carlo estimates obtained above:
round(mean(betahat),1)
## [1] -4.9
But we will not observe this exact value when we estimate because the standard error of our estimate is approximately:
sd(betahat)
## [1] 0.2129976
Here we will show how we can compute the standard error without a Monte Carlo simulation. Since in practice we do not know exactly how the errors are generated, we can’t use the Monte Carlo approach.
In the father and son height examples, we have randomness because we have a random sample of father and son pairs. For the sake of illustration, let’s assume that this is the entire population:
data(father.son,package="UsingR")
x <- father.son$fheight
y <- father.son$sheight
n <- length(y)
Now let’s run a Monte Carlo simulation in which we take a sample of size 50 over and over again.
N <- 50
B <-1000
betahat <- replicate(B,{
index <- sample(n,N)
sampledat <- father.son[index,]
x <- sampledat$fheight
y <- sampledat$sheight
lm(y~x)$coef
})
betahat <- t(betahat) #have estimates in two columns
By making qq-plots, we see that our estimates are approximately normal random variables:
mypar(1,2)
qqnorm(betahat[,1])
qqline(betahat[,1])
qqnorm(betahat[,2])
qqline(betahat[,2])
Distribution of estimated regression coefficients obtained from Monte Carlo simulated father-son height data. Both panels are qq-plots against normal theoretical quantiles: the intercept estimates on the left and the slope estimates on the right.
We also see that the correlation of our estimates is negative:
cor(betahat[,1],betahat[,2])
## [1] -0.9992293
When we compute linear combinations of our estimates, we will need to know this information to correctly calculate the standard error of these linear combinations.
In the next section, we will describe the variance-covariance matrix. The covariance of two random variables is defined as follows:
mean( (betahat[,1]-mean(betahat[,1] ))* (betahat[,2]-mean(betahat[,2])))
## [1] -1.035291
The covariance is the correlation multiplied by the standard deviations of each random variable:
\[\mbox{Cov}(X,Y) = \mbox{Corr}(X,Y)\,\sigma_X \sigma_Y\]
On its own, the covariance does not have a particularly useful interpretation in practice. However, as we will see, it is a very useful quantity for mathematical derivations. In the next sections, we show useful matrix algebra calculations that can be used to estimate standard errors of linear model estimates.
As a first step we need to define the variance-covariance matrix, \(\boldsymbol{\Sigma}\). For a vector of random variables, \(\mathbf{Y}\), we define \(\boldsymbol{\Sigma}\) as the matrix with the \(i,j\) entry:
\[ \Sigma_{i,j} \equiv \mbox{Cov}(Y_i, Y_j) \]
The covariance is equal to the variance if \(i = j\) and equal to 0 if the variables are independent. In the kinds of vectors considered up to now, for example, a vector \(\mathbf{Y}\) of individual observations \(Y_i\) sampled from a population, we have assumed independence of each observation and assumed the \(Y_i\) all have the same variance \(\sigma^2\), so the variance-covariance matrix has had only two kinds of elements:
\[ \mbox{Cov}(Y_i, Y_i) = \mbox{var}(Y_i) = \sigma^2\]
\[ \mbox{Cov}(Y_i, Y_j) = 0, \mbox{ for } i \neq j\]
which implies that \(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\) with \(\mathbf{I}\), the identity matrix.
Later, we will see a case, specifically the estimate coefficients of a linear model, \(\hat{\boldsymbol{\beta}}\), that has non-zero entries in the off diagonal elements of \(\boldsymbol{\Sigma}\). Furthermore, the diagonal elements will not be equal to a single value \(\sigma^2\).
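As a quick check, we can look at the sample variance-covariance matrix of the Monte Carlo estimates computed above (a sketch, assuming the betahat matrix from the previous chunk is still in the workspace); its diagonal entries differ and its off-diagonal entries are not zero:
cov(betahat)  ## 2x2 matrix: variances on the diagonal, the negative covariance off the diagonal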
A useful result provided by linear algebra is that the variance covariance-matrix of a linear combination \(\mathbf{AY}\) of \(\mathbf{Y}\) can be computed as follows:
\[ \mbox{var}(\mathbf{AY}) = \mathbf{A}\mbox{var}(\mathbf{Y}) \mathbf{A}^\top \]
For example, if \(Y_1\) and \(Y_2\) are independent both with variance \(\sigma^2\) then:
\[\mbox{var}\{Y_1+Y_2\} = \mbox{var}\left\{ \begin{pmatrix}1&1\end{pmatrix}\begin{pmatrix} Y_1\\Y_2\\ \end{pmatrix}\right\}\]
\[ =\begin{pmatrix}1&1\end{pmatrix} \sigma^2 \mathbf{I}\begin{pmatrix} 1\\1\\ \end{pmatrix}=2\sigma^2\]
as we expect. We use this result to obtain the standard errors of the LSE (least squares estimate).
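Here is a minimal numerical sketch of this result; the variance value and the matrix A below are arbitrary choices made only for illustration:
sigma2 <- 4                    ## an arbitrary variance, for illustration only
A <- matrix(c(1,1), nrow=1)    ## the linear combination Y_1 + Y_2
Sigma <- sigma2 * diag(2)      ## var(Y) for two independent variables
A %*% Sigma %*% t(A)           ## returns 2*sigma2 = 8, matching the formula above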
Note that \(\boldsymbol{\hat{\beta}}\) is a linear combination of \(\mathbf{Y}\): \(\mathbf{AY}\) with \(\mathbf{A}=\mathbf{(X^\top X)^{-1}X}^\top\), so we can use the equation above to derive the variance of our estimates:
\[\mbox{var}(\boldsymbol{\hat{\beta}}) = \mbox{var}( \mathbf{(X^\top X)^{-1}X^\top Y} ) = \]
\[\mathbf{(X^\top X)^{-1} X^\top} \mbox{var}(Y) (\mathbf{(X^\top X)^{-1} X^\top})^\top = \]
\[\mathbf{(X^\top X)^{-1} X^\top} \sigma^2 \mathbf{I} (\mathbf{(X^\top X)^{-1} X^\top})^\top = \]
\[\sigma^2 \mathbf{(X^\top X)^{-1} X^\top}\mathbf{X} \mathbf{(X^\top X)^{-1}} = \]
\[\sigma^2\mathbf{(X^\top X)^{-1}}\]
The standard errors of our estimates are the square roots of the diagonal entries of this matrix.
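For the falling object simulation we know \(\sigma=1\), so we can apply this formula directly; a sketch (the result should be close to the Monte Carlo standard error sd(betahat) reported above, up to simulation error):
tt <- seq(0,3.4,len=25)                ## same time points as the simulation
X <- cbind(1,tt,tt^2)
sqrt(diag(solve(crossprod(X))))[3]     ## SE of the quadratic term when sigma = 1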
To obtain an actual estimate in practice from the formulas above, we need to estimate \(\sigma^2\). Previously we estimated the standard errors from the sample. However, the sample standard deviation of \(Y\) is not \(\sigma\) because \(Y\) also includes variability introduced by the deterministic part of the model: \(\mathbf{X}\boldsymbol{\beta}\). The approach we take is to use the residuals.
We form the residuals like this:
\[ \mathbf{r}\equiv\boldsymbol{\hat{\varepsilon}} = \mathbf{Y}-\mathbf{X}\boldsymbol{\hat{\beta}}\]
Both \(\mathbf{r}\) and \(\boldsymbol{\hat{\varepsilon}}\) notations are used to denote residuals.
We then use these residuals to estimate \(\sigma^2\), in a way similar to what we do in the univariate case:
\[ s^2 \equiv \hat{\sigma}^2 = \frac{1}{N-p}\mathbf{r}^\top\mathbf{r} = \frac{1}{N-p}\sum_{i=1}^N r_i^2\]
Here \(N\) is the sample size and \(p\) is the number of columns in \(\mathbf{X}\) or number of parameters (including the intercept term \(\beta_0\)). The reason we divide by \(N-p\) is because mathematical theory tells us that this will give us a better (unbiased) estimate.
Let’s try this in R and see if we obtain the same values as we did with the Monte Carlo simulation above:
n <- nrow(father.son)
N <- 50
index <- sample(n,N)
sampledat <- father.son[index,]
x <- sampledat$fheight
y <- sampledat$sheight
X <- model.matrix(~x)
N <- nrow(X)
p <- ncol(X)
XtXinv <- solve(crossprod(X))
resid <- y - X %*% XtXinv %*% crossprod(X,y)
s <- sqrt( sum(resid^2)/(N-p))
ses <- sqrt(diag(XtXinv))*s
Let’s compare to what lm provides:
summary(lm(y~x))$coef[,2]
## (Intercept) x
## 8.3899781 0.1240767
ses
## (Intercept) x
## 8.3899781 0.1240767
They are identical because they are doing the same thing. Also, note that we approximate the Monte Carlo results:
apply(betahat,2,sd)
## (Intercept) x
## 8.3817556 0.1237362
Frequently, we want to compute the standard deviation of a linear combination of estimates such as \(\hat{\beta}_2 - \hat{\beta}_1\). This is a linear combination of \(\hat{\boldsymbol{\beta}}\):
\[\hat{\beta}_2 - \hat{\beta}_1 = \begin{pmatrix}0&-1&1&0&\dots&0\end{pmatrix} \begin{pmatrix} \hat{\beta}_0\\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots\\ \hat{\beta}_p \end{pmatrix}\]
Using the results above, we can compute the variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\) and, from it, the standard error of any such linear combination.
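As a sketch of how this is done in R for the father and son fit above (the contrast vector C below is a hypothetical example: it gives the fitted son height for a 65 inch father), the variance of \(\mathbf{C}\hat{\boldsymbol{\beta}}\) is \(\mathbf{C}\,\mbox{var}(\hat{\boldsymbol{\beta}})\,\mathbf{C}^\top\):
C <- matrix(c(1,65), nrow=1)   ## hypothetical linear combination: beta0 + 65*beta1
Sigma <- s^2 * XtXinv          ## estimated var-cov matrix of betahat from the chunk above
sqrt( C %*% Sigma %*% t(C) )   ## standard error of the linear combination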
We have shown how we can obtain standard errors for our estimates. However, as we learned in the first chapter, to perform inference we need to know the distribution of these random variables. The reason we went through the effort to compute the standard errors is because the CLT applies in linear models. If \(N\) is large enough, then the LSE will be normally distributed with mean \(\boldsymbol{\beta}\) and standard errors as described. For small samples, if the \(\varepsilon\) are normally distributed, then \((\hat{\beta}-\beta)/\mbox{SE}(\hat{\beta})\) follows a t-distribution. We do not derive this result here, but the result is extremely useful since it is how we construct p-values and confidence intervals in the context of linear models.
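To illustrate how these pieces are used for inference (a sketch, assuming y, x, N, p, and ses from the chunk above are still in the workspace), a 95% confidence interval for the slope built with t quantiles matches what confint reports:
fit <- lm(y~x)
coef(fit)[2] + c(-1,1) * qt(0.975, df=N-p) * ses[2]  ## manual 95% CI for the slope
confint(fit)[2,]                                     ## lm's built-in interval, which should agree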
The standard approach to writing linear models either assumes that \(\mathbf{X}\) is fixed or that we are conditioning on it. Thus \(\mathbf{X} \boldsymbol{\beta}\) has no variance, as \(\mathbf{X}\) is considered fixed. This is why we write \(\mbox{var}(Y_i) = \mbox{var}(\varepsilon_i)=\sigma^2\). This can cause confusion in practice because if you, for example, compute the following:
x = father.son$fheight
beta = c(34,0.5)
var(beta[1]+beta[2]*x)
## [1] 1.883576
it is nowhere near 0. This is an example in which we have to be careful in distinguishing code from math. The function var is simply computing the variance of the list we feed it, while the mathematical definition of variance considers only quantities that are random variables. In the R code above, x is not fixed at all: we are letting it vary, but when we write \(\mbox{var}(Y_i) = \sigma^2\) we are imposing, mathematically, that x be fixed. Similarly, if we use R to compute the variance of \(Y\) in our object dropping example, we obtain something very different than \(\sigma^2=1\) (the known variance):
n <- length(tt)
y <- h0 + v0*tt - 0.5*g*tt^2 + rnorm(n,sd=1)
var(y)
## [1] 329.5136
Again, this is because we are not fixing tt.
In the previous assessment, we used a Monte Carlo technique to see that the linear model coefficients are random variables when the data is a random sample. Now we will use the matrix algebra from the previous video to try to estimate the standard error of the linear model coefficients. Again, take a random sample of the father.son heights data:
library(UsingR)
x = father.son$fheight
y = father.son$sheight
n = length(y)
N = 50
set.seed(1)
index = sample(n,N)
sampledat = father.son[index,]
x = sampledat$fheight
y = sampledat$sheight
betahat = lm(y~x)$coef
The formula for the standard error in the previous video was (the following is not R code):
SE(betahat) = sqrt( var(betahat) )
var(betahat) = sigma^2 (X^T X)^-1
This is also listed in the standard error book page.
We will estimate or calculate each part of this equation and then combine them.
First, we want to estimate sigma^2, the variance of Y. As we have seen in the previous unit, the random part of Y is only coming from epsilon, because we assume X*beta is fixed. So we can try to estimate the variance of the epsilons from the residuals, the Yi minus the fitted values from the linear model.
Standard Errors Exercises #1
Note that the fitted values (Y-hat) from a linear model can be obtained with:
fit = lm(y ~ x)
fit$fitted.values
## 1 2 3 4 5 6 7 8
## 70.62707 70.36129 70.86093 68.73019 65.59181 70.55285 70.21256 68.62521
## 9 10 11 12 13 14 15 16
## 67.06729 69.64913 69.09958 71.70621 68.31598 70.57027 70.39537 70.39613
## 17 18 19 20 21 22 23 24
## 68.73977 68.98874 71.47021 72.03615 69.55975 68.15895 66.63557 71.53651
## 25 26 27 28 29 30 31 32
## 69.57083 69.71050 67.14263 70.99719 67.11046 69.04901 66.65243 67.82895
## 33 34 35 36 37 38 39 40
## 68.24209 70.70156 65.50431 67.36000 69.30065 67.94424 66.35150 71.40489
## 41 42 43 44 45 46 47 48
## 71.64301 66.81654 69.22900 69.11769 69.21793 69.69519 67.00674 68.67869
## 49 50
## 67.40752 69.28800
What is the sum of the squared residuals (where residuals are given by r_i = Y_i - Y-hat_i)?
e<-y-fit$fitted.values
sum(e^2)
## [1] 256.2152
#or
fit = lm(y ~ x)
sum((y - fit$fitted.values)^2)
## [1] 256.2152
A: 256.2152
Our estimate of sigma^2 will be the sum of squared residuals divided by (N - p), the sample size minus the number of terms in the model. Since we have a sample of 50 and 2 terms in the model (an intercept and a slope), our estimate of sigma^2 will be the sum of squared residuals divided by 48. Save this to a variable ‘sigma2’:
sigma2 = SSR / 48, where SSR is the answer to the previous question.
Standard Errors Exercises #2
Form the design matrix X (note: use a capital X!). This can be done by combining a column of 1’s with a column of ‘x’ the father’s heights.
X = cbind(rep(1,N), x)
Now calculate (X^T X)^-1, the inverse of X transpose times X. Use the solve() function for the inverse and t() for the transpose. What is the element in the first row, first column?
solve(t(X) %*% X)[1,1]
## [1] 11.30275
A:11.30275
Standard Errors Exercises #3
Now we are one step away from the standard error of beta-hat. Take the diagonals from the (X^T X)^-1 matrix above, using the diag() function. Now multiply our estimate of sigma^2 and the diagonals of this matrix. This is the estimated variance of beta-hat, so take the square root of this. You should end up with two numbers, the standard error for the intercept and the standard error for the slope.
What is the standard error for the slope?
sqrt(diag(solve(t(X) %*% X)) * (256.2152/48))
## x
## 7.7673678 0.1141966
# or
fit = lm(y ~ x)
sigma2 = sum((y - fit$fitted.values)^2) / (N - 2)
sqrt(sigma2 * diag(solve(t(X) %*% X)))
## x
## 7.7673671 0.1141966
#Compare this value with the value you estimated using Monte Carlo in the previous assessment. It will not be the same, because we are only estimating the standard error given a particular sample of 50 (which we obtained with set.seed(1)).
#Note that the standard error estimate is also printed in the second column of:
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2030 -1.8027 0.2918 1.4226 6.8493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.2542 7.7674 3.766 0.000453 ***
## x 0.5857 0.1142 5.129 5.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.31 on 48 degrees of freedom
## Multiple R-squared: 0.354, Adjusted R-squared: 0.3406
## F-statistic: 26.31 on 1 and 48 DF, p-value: 5.189e-06
A:0.1141966
We have shown how we can obtain standard errors for our estimates. But, as we learned in PH525.1x, to perform inference we need to know the distribution of these random variables. The reason we went through the effort of computing the standard errors is because the CLT applies in linear models. If N is large enough, then the LSE will be normally distributed with mean beta and standard errors as described in our videos. For small samples, if the error term is normally distributed, then betahat - beta follows a t-distribution. Proving this mathematically is rather advanced, but the results are extremely useful, as this is how we construct p-values and confidence intervals in the context of linear models.
Many of the models we use in data analysis can be presented using matrix algebra. We refer to these types of models as linear models. “Linear” here does not refer to lines, but rather to linear combinations. The representations we describe are convenient because we can write models more succinctly and we have the matrix algebra mathematical machinery to facilitate computation. In this chapter, we will describe in some detail how we use matrix algebra to represent and fit linear models.
In this book, we focus on linear models that represent dichotomous groups: treatment versus control, for example. The effect of diet on mice weights is an example of this type of linear model. Here we describe slightly more complicated models, but continue to focus on dichotomous variables.
As we learn about linear models, we need to remember that we are still working with random variables. This means that the estimates we obtain using linear models are also random variables. Although the mathematics is more complex, the concepts we learned in previous chapters apply here. We begin with some exercises to review the concept of random variables in the context of linear models.
Let’s talk a little bit about matrix notation and why we use it. It makes writing formulas easy, it also makes computation easy, it makes mathematics easy.
So here’s an example.
So suppose you have some data and you have a linear model. So that's four variables, it's a little bit more complicated than the ones we saw before. And you have four parameters plus the base level parameter. And you want to find the parameters that fit the data best, so we might consider using least squares. So that's what this is. So we look for the betas that minimize this equation. That equation, with all those subscripts and x's and betas, can be written in matrix notation like the formula down here, which is much faster to write. And once you get used to this, you see the bottom formula and you know that what you're doing is the top one. All right, so, not only is it easier to look at or to write down on a piece of paper, it's actually faster, at least in R. R is made to work on matrices, on matrix algebra computations. So here, in this example, I'm showing a very quick simulation example where we compute that sum of squares value. And if we do it using matrix multiplication, it takes 0.01 seconds.
And if you instead write out every single step and put it in a for loop, or in something that goes one by one instead of using the matrix trick, it takes 3.4 seconds.
So that is a lot slower: over 300 times slower.
So a little bit of a refresher on matrix algebra. So we want to be able to multiply two matrices, so instead of writing down each single model for each element, for each individual, we instead write these matrices.
And these are the elements of the matrix. So when I write X times beta, that's the X beta that we saw. X is going to be a matrix with entries for each individual on the rows. And then the columns represent the different covariates or variables that are used in the model. And then the parameters are put in another little matrix. So we're going to show you how, when you multiply these two matrices, it gives us what we want in terms of the original linear model we wrote out.
So we start by multiplying. If we're going to multiply these two matrices, you take the first row and you multiply it by the first column, and get this. Right?
So you see, now you recognize that. It's what we had at the very beginning in the non-matrix version of this. Now we start going down to the next one and we get the second line, et cetera.
And we do it for several of them.
So now we can rewrite the two groups: instead of the formula with all the indices, we can actually write these matrices. Here we would just write y equals X beta plus epsilon. Now, in this particular formula, I'm actually writing out every individual so you can see what the matrices contain.
All right, so this is the same formula that we had before, in matrix form. And you can see that every individual here is coming from something like this. If you're in group A, you're going to get Beta0 plus an error. If you're in group B, you're going to get Beta0 plus Beta1, so you're up here, plus an error.
Now if you have three groups, again we can write it out in matrix notation.
So if you're in group 1, you multiply this by that. You get Beta0 plus an error, and you're going to be one of these points here. If you're in group 2, group B, you get Beta0 plus Beta1, so you're up here, plus some error. And the same goes for group 3: if you're in group C, it would be Beta0 plus Beta2 plus an error.
Today we're going to talk about a very powerful mathematical technique called linear algebra. And we're going to show it in the context of statistical analysis through linear models.
The t-test that we described in a previous module is actually something that can be derived from the linear model machinery. In this slide I'm showing you a little bit of motivation.
You can think of the t-test as a corkscrew, while linear models are much more applicable and much more general, so you can think of them as a Swiss Army knife. So just to give you an idea of the connection between t-tests and linear models, I'm going to go over a bit of the notation here.
So in this picture, in this graph, I’m showing you two groups.
You can see them to the left and to the right, group A and group B. And you see all these points around two means. So what is this representing? When we're doing the t-test, we take the average in one group and subtract it from the average of the other group. And that was an estimate of the difference in population averages.
Now here, we're going to write it slightly differently. And later you're going to see why we're doing this. What we're going to do is pick a baseline. Now, this average, instead of being called mu_Y as in the previous module, we're going to call it Beta0. And then we're going to call the difference between the two groups Beta1. One of the nice things about this notation is that Beta1 is really what we care about most of the time. It's that difference that we want to estimate and perform statistical tests on.
So writing it out as a linear model, what we have is: group A has a mean of Beta0, and group B has a mean of Beta0 plus that difference. So we can write it like this.
Now, the reason I'm doing all of this is because I want to eventually be able to write these as matrices. And the reason I want to do that is because if I write everything in matrices, everything is much easier and faster to compute on a computer, when we program it up. And it's also more convenient for deriving mathematical properties as well.
So here's how we would write the model that we're talking about. The mean for group A is Beta0 plus 0 times Beta1. And the mean for group B is Beta0 plus 1 times Beta1. OK, so we're going to see more on that later.
So in general, the typical linear model that you will see, if you look it up in Wikipedia or something, or in a book.
You're going to have the observed data represented with Y, and there's going to be a little subscript which usually denotes an individual. In the case of genomics it could be a sample. Then you have the model parameters. These are the things that we want to estimate, the things that we want to find out. In the case of men's and women's heights, Beta1 was the difference we were interested in. So you have a base level, Beta0, then you have these parameters, Beta1, and you can have more. And then we also have these predictors, or variables, or covariates, as statisticians call them, that define the model. In the examples that we're using up to now, they're either 0 or 1. But they could be other things as well; they could be continuous data. And then we almost always also include an error term that describes variability. This error term could be measurement error. In the case of measuring heights, if you're doing it over and over for one person, it would be the measurement error of measuring height. But if you're taking a random sample, then this epsilon really represents the variability you get from sampling. So tall people would have large positive errors and short people would have large negative errors. It's not the best name when we're connecting it to biology, but it is what we call it in statistics.
Now, one way in which linear models immediately become more useful than the t-test is in the case where we have three groups.
So say that instead of men and women, we wanted to know which ethnic group is tallest. So now we have three ethnic groups instead of two genders. In that case, we would have some baseline group, which has the Beta0. And then the other groups would be different. In what way? We're adding Beta1 in one case and Beta2 in the other. And the nice thing about linear models is that we can write this out and, in one quick computation on the computer, estimate Beta1, estimate Beta2, and also get standard errors for those estimates.
So here’s an illustration of what we’ve just shown.
You have a mean for group A, a mean for group B, a mean for group C. But that's not how the model is mathematically describing this. The model says: this here is Beta0, this difference is Beta1, and this difference is Beta2. And the nice thing again is that those differences are what we're interested in, so we estimate them directly.
Here we will show how to use the two R functions, formula and model.matrix, in order to produce design matrices (also known as model matrices) for a variety of linear models. For example, in the mouse diet examples we wrote the model as
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon, i=1,\dots,N \]
with \(Y_i\) the weights and \(x_i\) equal to 1 only when mouse \(i\) receives the high fat diet. We use the term experimental unit to refer to the \(N\) different entities from which we obtain a measurement. In this case, the mice are the experimental units.
This is the type of variable we will focus on in this chapter. We call them indicator variables since they simply indicate if the experimental unit had a certain characteristic or not. As we described earlier, we can use linear algebra to represent this model:
\[ \mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} , \mathbf{X} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} , \mathbf{\beta} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \mathbf{\varepsilon} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
as:
\[ \, \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
or simply:
\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]
The design matrix is the matrix \(\mathbf{X}\).
Once we define a design matrix, we are ready to find the least squares estimates. We refer to this as fitting the model. For fitting linear models in R, we will directly provide a formula to the lm function. In this script, we will use the model.matrix function, which is used internally by the lm function. This will help us to connect the R formula with the matrix \(\mathbf{X}\). It will therefore help us interpret the results from lm.
The choice of design matrix is a critical step in linear modeling since it encodes which coefficients will be fit in the model, as well as the inter-relationship between the samples. A common misunderstanding is that the choice of design follows straightforwardly from a description of which samples were included in the experiment. This is not the case. The basic information about each sample (whether control or treatment group, experimental batch, etc.) does not imply a single ‘correct’ design matrix. The design matrix additionally encodes various assumptions about how the variables in \(\mathbf{X}\) explain the observed values in \(\mathbf{Y}\), on which the investigator must decide.
For the examples we cover here, we use linear models to make comparisons between different groups. Hence, the design matrices that we ultimately work with will have at least two columns: an intercept column, which consists of a column of 1’s, and a second column, which specifies which samples are in a second group. In this case, two coefficients are fit in the linear model: the intercept, which represents the population average of the first group, and a second coefficient, which represents the difference between the population averages of the second group and the first group. The latter is typically the coefficient we are interested in when we are performing statistical tests: we want to know if there is a difference between the two groups.
We encode this experimental design in R with two pieces. We start with a formula with the tilde symbol ~. This means that we want to model the observations using the variables to the right of the tilde. Then we put the name of a variable, which tells us which samples are in which group.
Let’s try an example. Suppose we have two groups, control and high fat diet, with two samples each. For illustrative purposes, we will code these with 1 and 2 respectively. We should first tell R that these values should not be interpreted numerically, but as different levels of a factor. We can then use the paradigm ~ group to, say, model on the variable group.
group <- factor( c(1,1,2,2) )
model.matrix(~ group)
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 1
## 4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
(Don’t worry about the attr lines printed beneath the matrix. We won’t be using this information.)
What about the formula function? We don’t have to include this. By starting an expression with ~, it is equivalent to telling R that the expression is a formula:
model.matrix(formula(~ group))
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 1
## 4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
What happens if we don’t tell R that group should be interpreted as a factor?
group <- c(1,1,2,2)
model.matrix(~ group)
## (Intercept) group
## 1 1 1
## 2 1 1
## 3 1 2
## 4 1 2
## attr(,"assign")
## [1] 0 1
This is not the design matrix we wanted, and the reason is that we provided a numeric variable, as opposed to an indicator, to the formula and model.matrix functions, without saying that these numbers actually referred to different groups. We want the second column to have only 0 and 1, indicating group membership.
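A minimal sketch of the fix: convert the numeric vector to a factor before building the design matrix, so the second column becomes a 0/1 indicator again:
group <- c(1,1,2,2)
model.matrix(~ factor(group))  ## second column is now a 0/1 group indicator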
A note about factors: the names of the levels are irrelevant to model.matrix and lm. All that matters is the order. For example:
group <- factor(c("control","control","highfat","highfat"))
model.matrix(~ group)
## (Intercept) grouphighfat
## 1 1 0
## 2 1 0
## 3 1 1
## 4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
produces the same design matrix as our first code chunk.
Using the same formula, we can accommodate modeling more groups. Suppose we have a third diet:
group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group)
## (Intercept) group2 group3
## 1 1 0 0
## 2 1 0 0
## 3 1 1 0
## 4 1 1 0
## 5 1 0 1
## 6 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
Now we have a third column which specifies which samples belong to the third group.
An alternative formulation of the design matrix is possible by specifying + 0 in the formula:
group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group + 0)
## group1 group2 group3
## 1 1 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 1 0
## 5 0 0 1
## 6 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
This design now fits a separate coefficient for each group. We will explore this design in more depth later on.
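To see what this design fits, here is a small sketch with made-up outcome values (the numbers are assumptions chosen only for illustration, not data from the book):
set.seed(1)
y <- c(2,2,4,4,6,6) + rnorm(6, sd=0.1)  ## made-up outcomes for the three groups
coef( lm(y ~ group + 0) )  ## one coefficient per group: approximately the group means
coef( lm(y ~ group) )      ## compare: an intercept plus differences from group 1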
We have been using a simple case with just one variable (diet) as an example. In the life sciences, it is quite common to perform experiments with more than one variable. For example, we may be interested in the effect of diet and the difference in sexes. In this case, we have four possible groups:
diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
table(diet,sex)
## sex
## diet f m
## 1 2 2
## 2 2 2
If we assume that the diet effect is the same for males and females (this is an assumption), then our linear model is:
\[ Y_{i}= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i \]
To fit this model in R, we can simply add the additional variable with a + sign in order to build a design matrix which fits based on the information in additional variables:
diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
model.matrix(~ diet + sex)
## (Intercept) diet2 sexm
## 1 1 0 0
## 2 1 0 0
## 3 1 0 1
## 4 1 0 1
## 5 1 1 0
## 6 1 1 0
## 7 1 1 1
## 8 1 1 1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
##
## attr(,"contrasts")$sex
## [1] "contr.treatment"
The design matrix includes an intercept, a term for diet and a term for sex. We would say that this linear model accounts for differences in both the group and condition variables. However, as mentioned above, the model assumes that the diet effect is the same for both males and females. We say this is an additive effect. For each variable, we add an effect regardless of the value of the other. Another model is possible here, which fits an additional term and which encodes the potential interaction of group and condition variables. We will cover interaction terms in depth in a later script.
The interaction model can be written in either of the following two formulas:
model.matrix(~ diet + sex + diet:sex)
or
model.matrix(~ diet*sex)
## (Intercept) diet2 sexm diet2:sexm
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 1 0
## 4 1 0 1 0
## 5 1 1 0 0
## 6 1 1 0 0
## 7 1 1 1 1
## 8 1 1 1 1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
##
## attr(,"contrasts")$sex
## [1] "contr.treatment"
The level which is chosen for the reference level is the level which is contrasted against. By default, this is simply the first level alphabetically. We can specify that we want group 2 to be the reference level by either using the relevel function:
group <- factor(c(1,1,2,2))
group <- relevel(group, "2")
model.matrix(~ group)
## (Intercept) group1
## 1 1 1
## 2 1 1
## 3 1 0
## 4 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
or by providing the levels explicitly in the factor call:
group <- factor(group, levels=c("1","2"))
model.matrix(~ group)
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 1
## 4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
The model.matrix function will grab the variable from the R global environment, unless the data is explicitly provided as a data frame to the data argument:
group <- 1:4
model.matrix(~ group, data=data.frame(group=5:8))
## (Intercept) group
## 1 1 5
## 2 1 6
## 3 1 7
## 4 1 8
## attr(,"assign")
## [1] 0 1
Note how the R global environment variable group is ignored.
In this chapter, we focus on models based on indicator values. In certain designs, however, we will be interested in using numeric variables in the design formula, as opposed to converting them to factors first. For example, in the falling object example, time was a continuous variable in the model and time squared was also included:
tt <- seq(0,3.4,len=4)
model.matrix(~ tt + I(tt^2))
## (Intercept) tt I(tt^2)
## 1 1 0.000000 0.000000
## 2 1 1.133333 1.284444
## 3 1 2.266667 5.137778
## 4 1 3.400000 11.560000
## attr(,"assign")
## [1] 0 1 2
The I function above is necessary to specify a mathematical transformation of a variable. See ?I for more information.
In the life sciences, we could be interested in testing various dosages of a treatment, where we expect a specific relationship between a measured quantity and the dosage, e.g. 0 mg, 10 mg, 20 mg.
The assumptions imposed by including continuous data as variables are typically harder to defend and motivate than those imposed by indicator variables. While indicator variables simply assume a different mean between two groups, continuous variables assume a very specific relationship between the outcome and predictor variables.
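To make the distinction concrete, here is a sketch contrasting the two codings for a hypothetical dosage variable (the dose values are only those mentioned above, repeated for two samples each):
dose <- c(0,10,20,0,10,20)
model.matrix(~ factor(dose))  ## indicator coding: a separate mean for each dose
model.matrix(~ dose)          ## continuous coding: assumes a linear effect of dose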
In cases like the falling object, we have the theory of gravitation supporting the model. In the father-son height example, because the data is approximately bivariate normal, it follows that the relationship is linear if we condition on father's height. However, continuous variables are often included in linear models without such justification, simply to “adjust” for variables such as age. We highly discourage this practice unless the data support the model being used.
Suppose we have an experiment with the following design: on three different days, we perform an experiment with two treated and two control samples. We then measure some outcome Y_i, and we want to test the effect of condition, while controlling for whatever differences might have occurred due to the different days (maybe the temperature in the lab affects the measuring device). Assume that the true condition effect is the same for each day (no interaction between condition and day). We then define factors in R for ‘day’ and for ‘condition’.
Expressing Design Formula Exercises #1
Given the factors we have defined above, and not defining any new ones, which of the following R formulas will produce a design matrix (model matrix) that lets us analyze the effect of condition, controlling for the different days:
condition <- factor(c("treated","treated","treated","treated","treated","treated","control","control","control","control","control","control"))
day<- factor(c("A","A","B","B","C","C","A","A","B","B","C","C"))
table(condition,day)
## day
## condition A B C
## control 2 2 2
## treated 2 2 2
model.matrix(~day + condition)
## (Intercept) dayB dayC conditiontreated
## 1 1 0 0 1
## 2 1 0 0 1
## 3 1 1 0 1
## 4 1 1 0 1
## 5 1 0 1 1
## 6 1 0 1 1
## 7 1 0 0 0
## 8 1 0 0 0
## 9 1 1 0 0
## 10 1 1 0 0
## 11 1 0 1 0
## 12 1 0 1 0
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$day
## [1] "contr.treatment"
##
## attr(,"contrasts")$condition
## [1] "contr.treatment"
A: ~day + condition
EXPLANATION: Using the ~ and the names for the two variables we want in the model will produce a design matrix controlling for all levels of day and all levels of condition, so “~ day + condition”. We do not use the levels A, B, C, etc. in the design formula.
We will demonstrate how to analyze the high fat diet data using linear models instead of directly applying a t-test. We will demonstrate how ultimately these two approaches are equivalent.
We start by reading in the data and creating a quick stripchart:
dat <- read.csv("femaleMiceWeights.csv") ##previously downloaded
head(dat)
## Diet Bodyweight
## 1 chow 21.51
## 2 chow 28.14
## 3 chow 24.04
## 4 chow 23.45
## 5 chow 23.68
## 6 chow 19.79
stripchart(dat$Bodyweight ~ dat$Diet, vertical=TRUE, method="jitter",
main="Bodyweight over Diet")
Mice bodyweights stratified by diet.
## Plotting the data
library(ggplot2)
library(dplyr)
dat %>%
ggplot( aes(x = Bodyweight) ) +
geom_density(fill = 1)
Density of all mice bodyweights.
dat %>%
ggplot( aes(x = Bodyweight) ) +
geom_density(fill = 2)
Density of all mice bodyweights, with a different fill color.
dat %>%
ggplot( aes(fill = Diet, x = Bodyweight) ) +
geom_density()
Mice bodyweights stratified by diet.
N <- 100
levels <- c("control","experimental")
group <- rep(levels, N/2)
outcome <- rnorm(N, 42 * 5, 10) + (group == "experimental") * 42
data <- data.frame(group, outcome)
head(data)
## group outcome
## 1 control 203.7876
## 2 experimental 229.8530
## 3 control 221.2493
## 4 experimental 251.5507
## 5 control 209.8381
## 6 experimental 261.4384
## Plotting the data
data %>%
ggplot( aes(x = outcome) ) +
geom_density(fill = 1)
Density of the simulated outcomes.
data %>%
ggplot( aes(fill = group, x = outcome) ) +
geom_density()
Densities of the simulated outcomes stratified by group.
We can see that the high fat diet group appears to have higher weights on average, although there is overlap between the two samples.
For demonstration purposes, we will build the design matrix \(\mathbf{X}\) using the formula ~ Diet. The group with the 1’s in the second column is determined by the level of Diet which comes second; that is, the non-reference level.
levels(dat$Diet)
## [1] "chow" "hf"
X <- model.matrix(~ Diet, data=dat)
class(~ Diet)
## [1] "formula"
X # Diethf is an indicator variable: rows with Diet equal to hf get 1's and all other rows get 0's. The non-reference level gets the 1's. The reference level is the first level returned by levels(), which by default is 'chow' (it can be changed if required).
## (Intercept) Diethf
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
## 11 1 0
## 12 1 0
## 13 1 1
## 14 1 1
## 15 1 1
## 16 1 1
## 17 1 1
## 18 1 1
## 19 1 1
## 20 1 1
## 21 1 1
## 22 1 1
## 23 1 1
## 24 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$Diet
## [1] "contr.treatment"
#changing the levels---releveling
dat$Diet<-relevel(dat$Diet,ref="hf") #make sure 'dat$Diet' is a factor variable
model.matrix(~ Diet, data=dat)
## (Intercept) Dietchow
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 1
## 6 1 1
## 7 1 1
## 8 1 1
## 9 1 1
## 10 1 1
## 11 1 1
## 12 1 1
## 13 1 0
## 14 1 0
## 15 1 0
## 16 1 0
## 17 1 0
## 18 1 0
## 19 1 0
## 20 1 0
## 21 1 0
## 22 1 0
## 23 1 0
## 24 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$Diet
## [1] "contr.treatment"
#again we relevel it to the original 'chow' reference level
dat$Diet<-relevel(dat$Diet,ref="chow")
Before we use our shortcut for running linear models, lm, we want to review what will happen internally. Inside of lm, we will form the design matrix \(\mathbf{X}\) and calculate \(\hat{\boldsymbol{\beta}}\), which minimizes the sum of squares, using the previously described formula. The formula for this solution is:
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
We can calculate this in R using our matrix multiplication operator %*%, the inverse function solve, and the transpose function t.
Y <- dat$Bodyweight
X <- model.matrix(~ Diet, data=dat)
solve(t(X) %*% X) %*% t(X) %*% Y
## [,1]
## (Intercept) 23.813333
## Diethf 3.020833
These coefficients are the average of the control group and the difference of the averages:
s <- split(dat$Bodyweight, dat$Diet)
mean(s[["chow"]])
## [1] 23.81333
mean(s[["hf"]]) - mean(s[["chow"]])
## [1] 3.020833
Finally, we use our shortcut, lm, to run the linear model:
fit <- lm(Bodyweight ~ Diet, data=dat)
summary(fit)
##
## Call:
## lm(formula = Bodyweight ~ Diet, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1042 -2.4358 -0.4138 2.8335 7.1858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.813 1.039 22.912 <2e-16 ***
## Diethf 3.021 1.470 2.055 0.0519 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.6 on 22 degrees of freedom
## Multiple R-squared: 0.1611, Adjusted R-squared: 0.1229
## F-statistic: 4.224 on 1 and 22 DF, p-value: 0.05192
(coefs <- coef(fit)) #wrapping the assignment in parentheses prints the output
## (Intercept) Diethf
## 23.813333 3.020833
The following plot provides a visualization of the meaning of the coefficients with colored arrows (code not shown):
Estimated linear model coefficients for bodyweight data illustrated with arrows.
To make a connection with material presented earlier, this simple linear model is actually giving us the same result (the t-statistic and p-value) for the difference as a specific kind of t-test. This is the t-test between two groups with the assumption that the population standard deviation is the same for both groups. This was encoded into our linear model when we assumed that the errors \(\boldsymbol{\varepsilon}\) were all equally distributed.
Although in this case the linear model is equivalent to a t-test, we will soon explore more complicated designs, where the linear model is a useful extension. Below we demonstrate that one does in fact get the exact same results:
Our lm estimates were:
summary(fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.813333 1.039353 22.911684 7.642256e-17
## Diethf 3.020833 1.469867 2.055174 5.192480e-02
And the t-statistic is the same:
ttest <- t.test(s[["hf"]], s[["chow"]], var.equal=TRUE)
summary(fit)$coefficients[2,3]
## [1] 2.055174
ttest$statistic
## t
## 2.055174
#or
(t.test(s[["hf"]], s[["chow"]], var.equal=TRUE)$statistic)
## t
## 2.055174
#to get t statistic as negative....
ttest <- t.test(s[["chow"]],s[["hf"]], var.equal=TRUE)
ttest$statistic
## t
## -2.055174
t.test(s[["chow"]],s[["hf"]], var.equal=TRUE)$statistic
## t
## -2.055174
In the last videos we saw how to use lm() to run a simple two group linear model, and then compared the t-value from the linear model with the t-value from a t-test with the equal variance assumption. Though the linear model in this case is equivalent to a t-test, we will soon explore more complicated designs, where the linear model is a useful extension (confounding variables, testing contrasts of terms, testing interactions, testing many terms at once, etc.)
Here we will review the mathematics on why these produce the same t-value and therefore p-value.
We already know that the numerator of the t-value in both cases is the difference between the averages of the groups, so we only have to see that the denominator is the same. Of course, it makes sense that the denominator should be the same, since we are calculating the standard error of the same quantity (the difference) under the same assumptions (equal variance), but here we will show the equivalence of the formulas.
In the linear model, we saw how to calculate this standard error using the design matrix \(X\) and the estimate of \(\sigma^2\) from the residuals. The estimate of \(\sigma^2\) was the sum of squared residuals divided by \((N - p)\), where \(N\) is the total number of samples and \(p\) is the number of terms (an intercept and a group indicator, so here \(p=2\)).
In the t-test, the denominator of the t-value is the standard error of the difference. The t-test formula for the standard error of the difference, if we assume equal variance in the two groups is:
\[ SE = \sqrt{\mbox{var}(\mbox{diff})} \]
\[ \mbox{var}(\mbox{diff}) = \left( \frac{1}{n_x} + \frac{1}{n_y} \right) \frac{\sum_{i=1}^{n_x} (x_i - \bar{x})^2 + \sum_{i=1}^{n_y} (y_i - \bar{y})^2}{n_x + n_y - 2} \]
where \(n_x\) is the number of samples in the first group and \(n_y\) is the number of samples in the second group.
If we look carefully, the second part of this equation is the sum of squared residuals, divided by \((N - 2)\).
So all that is left to show is that
\( (X^\top X)^{-1}_{2,2} = \frac{1}{n_x} + \frac{1}{n_y} \)
…where [2,2] indicates the 2nd row, 2nd column, with X as the design matrix of a linear model of two groups.
Linear Models in Practice Exercises #1
You can make a design matrix X for a two group comparison either using model.matrix or simply with:
X = cbind(rep(1,nx + ny),rep(c(0,1),c(nx, ny)))
For a comparison of two groups, where the first group has nx=5 samples, and the second group has ny=7 samples, what is the element in the 1st row and 1st column of \(X^\top X\)?
nx=5
ny=7
X = cbind(rep(1,nx + ny),rep(c(0,1),c(nx, ny)))
XtX = t(X) %*% X
XtX[ 1,1 ]
## [1] 12
A:12
Linear Models in Practice Exercises #2
What are all the other elements of \((X^t X)\)?
t(X) %*% X
## [,1] [,2]
## [1,] 12 7
## [2,] 7 7
A:7
Now we just need to invert the matrix to obtain \((X^\top X)^{-1}\).
The formula for the inverse of a 2x2 matrix is:
\[ \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix} \]
The element of the inverse in the 2nd row and the 2nd column is the element which will be used to calculate the standard error of the second coefficient of the linear model. This is:
\[ \frac{a}{ad - bc} \]
And for our two group comparison, we saw that \(a = n_x + n_y\) and \(b = c = d = n_y\). So it follows that this element is:
\[ \frac{n_x + n_y}{(n_x + n_y)\,n_y - n_y\,n_y} \]
which simplifies to:
\[ \frac{n_x + n_y}{n_x\, n_y} = \frac{1}{n_y} + \frac{1}{n_x} \]
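We can check this numerically with the design matrix X built in the exercises above (nx=5, ny=7); the two quantities below should be identical:
solve(t(X) %*% X)[2,2]  ## the [2,2] element of (X^T X)^-1
1/nx + 1/ny             ## equals 12/35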
One of the most useful aspects of matrix algebra is that it lets us compute the solution to the least squares problem we showed much earlier, where the equation is pretty easy to write. And that's what I'm writing down here. And this is not only easy to write down mathematically, but it's very easy to compute using R or some other program. So this is going to give the least squares estimates.
Once you write out the matrices, as we did in the previous module, you can find the least squares solution, estimating the parameters just by calculating this formula here.
And there’s also a very nice way to compute standard error of the estimates.
So when we're testing to see if a particular estimated parameter is statistically significant, we're going to have to know how much it varies. And so we're going to have to compute its standard error. So this is basically the t-test.
We have an estimate, and the estimate divided by its standard error, in most situations, is going to be well approximated by a normal(0,1) distribution. So here's a formula for the standard error of the estimator of parameter i.
It includes the sample standard deviation, whose square is the variance estimate. And then again, the matrices show up. So I'm not going to actually compute this right now, but it's something that, on a computer, is straightforward to compute. And we get a general formula that lets us do it for any of the parameters in just one shot. It's very, very convenient.
Now we're going to consider something that is called a cross design. So imagine you have two treatments, and you create 4 groups. One receives treatment a, or control; one receives treatment b; another one receives treatment c; and the fourth group receives both b and c. So we can write down a model.
This is now a statistical model that tells us that if you give treatment b, if you’re in group b, then there’ll be a difference between that group and the control, which will be parametrized with beta 1. If you’re in group c, then there is an effect of being in that group that is beta 2. Now what about the fourth group, where you get both treatment b and treatment c?
One thing we can assume is that it’s an additive effect. So that you’ll get the effect that b gets, and you add the effect that c gets. It’s done with matrix notation over here.
So we have, for group a, these two samples. So you have 1 times beta 0. And the other 2 get multiplied by 0, so there's no beta 1 or beta 2. So, therefore, beta 0 is the effect of being in the control group.
If you are in the second group, then you get a beta 1. So it's beta 0 plus the effect of being in group b, that's beta 0 plus beta 1, and then no beta 2. And that gets you up to here.
This arrow here shows you beta 1.
However in the group c, you get that other treatment.
You don't get beta 1. And your average value should be beta 0 plus beta 2. So you have a 0 for the beta 1, and that gives you this quantity here.
Now what about in the fourth group, group d? There we're assuming that the effects are additive. So we have beta 0 plus beta 1 plus beta 2. And that is shown in the matrix algebra notation by just noticing that you have 1, 1, 1, and you add up these 3 numbers. And that gets you up here. That's beta 1 plus beta 2.
But what if the effects aren’t additive?
That could very well be: the fourth group is not where it's expected. It's not at the level of adding beta 1 plus beta 2. In this case we can use a separate model that gives us a different mean for each group. We have four possibilities, and in the experimental design and linear algebra approaches we call that fourth term the interaction. So now the model looks like this, and let's see what each one of these terms represents.
So just like before, in the first group, group a, you just have beta 0. In the second group you have beta 0 plus beta 1. In the third group you have beta 0 plus beta 2. Now in the fourth group, you have something new. You have the additive effects, which was beta 0 plus beta 1 plus beta 2. But notice that if we just use that model, we would predict the data to be here, but it's not there.
It’s actually a little bit higher.
So it looks like adding these two treatments has an interaction effect that makes the values even higher than we would expect had we just added. And that term now is this beta 1 colon 2 that is represented by this quantity here. So by fitting this model, we get these 4 estimates. And using those 4 estimates we can actually predict the mean levels for each of the 4 groups. And we can interpret that last parameter as an interaction effect: the effect that you get on top of just adding the first 2 main effects.
Now you might have heard of these linear models in other contexts. The most common one is in epidemiology, where you hear a lot of media reports saying we controlled for age, race, and sex and they still found an effect.
And so when they say that they control for this and that, basically what they're saying is that they are fitting a linear model that includes all those possible explanatory variables as variables in the model. They fit them and they hope that that corrects for them, and then they report the difference that they're actually interested in.
One word of caution: we see this very often, that people who use linear models say, we have to correct for age. And age is a continuous variable; I'm using it as an example.
So they just stick it into a linear model. So now what you're saying is that you have a baseline effect, and then for every year older that you are, your measured value should increase by beta age, this beta here. And that's a very strong assumption, right? You're saying that every year adds on the same amount. So you're assuming it's a line. It's not a curve, it's not a line that flattens out, it's a line. And if it's anything else, then the model is going to be wrong. So that's something that you want to be careful about. When you use these linear models, don't forget that they're representing something real. You want to understand what it is that the linear models are representing, so that you're not doing something nonsensical and just running with it. The one thing to remember is that the computer is going to give you an answer. Even if the model is wrong, it's going to run. The computer is not going to say, that model doesn't make sense. It's just going to give you the estimates, the confidence intervals, standard errors, and p-values. So keep that in mind when you use linear models. Now we've seen how useful linear models can be. In the labs, we're actually going to use them. We're going to do a lot of things with them. We're going to use a package called limma to do a lot of this, and we're also going to show you how powerful R can be at doing some of these matrix computations.
As a running example to learn about more complex linear models, we will be using a dataset which compares the different frictional coefficients on the different legs of a spider. Specifically, we will be determining whether more friction comes from a pushing or pulling motion of the leg. The original paper from which the data was provided is:
Jonas O. Wolff & Stanislav N. Gorb, Radial arrangement of Janus-like setae permits friction control in spiders, Scientific Reports, 22 January 2013.
The abstract of the paper says,
The hunting spider Cupiennius salei (Arachnida, Ctenidae) possesses hairy attachment pads (claw tufts) at its distal legs, consisting of directional branched setae… Friction of claw tufts on smooth glass was measured to reveal the functional effect of seta arrangement within the pad.
Figure 1 includes some pretty cool electron microscope images of the tufts. We are interested in the comparisons in Figure 4, where the pulling and pushing motions are compared for different leg pairs (for a diagram of pushing and pulling see the top of Figure 3).
We include the data in our dagdata package and can download it from here.
spider <- read.csv("spider_wolff_gorb_2013.csv", skip=1)
head(spider)
## leg type friction
## 1 L1 pull 0.90
## 2 L1 pull 0.91
## 3 L1 pull 0.86
## 4 L1 pull 0.85
## 5 L1 pull 0.80
## 6 L1 pull 0.87
str(spider)
## 'data.frame': 282 obs. of 3 variables:
## $ leg : Factor w/ 4 levels "L1","L2","L3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 2 levels "pull","push": 1 1 1 1 1 1 1 1 1 1 ...
## $ friction: num 0.9 0.91 0.86 0.85 0.8 0.87 0.92 0.83 0.88 0.87 ...
Each measurement comes from one of the spider's legs while it is either pushing or pulling. So we have two variables:
table(spider$leg,spider$type)
##
## pull push
## L1 34 34
## L2 15 15
## L3 52 52
## L4 40 40
We can make a boxplot summarizing the measurements for each of the eight leg pair and type combinations. This is similar to Figure 4 of the original paper:
boxplot(spider$friction ~ spider$type * spider$leg,
col=c("grey90","grey40"), las=2,
main="Comparison of friction coefficients of different leg pairs")
Comparison of friction coefficients of spiders’ different leg pairs. The friction coefficient is calculated as the ratio of two forces (see paper Methods) so it is unitless.
What we can immediately see are two trends: the pulling motion has higher friction than the pushing motion, and the leg pairs toward the back of the spider (L4 being the last) have higher pulling friction.
Another thing to notice is that the groups have different spread around their average, what we call within-group variance. This is somewhat of a problem for the kinds of linear models we will explore below, since we will be assuming that around the population average values, the errors \(\varepsilon_i\) are distributed identically, meaning the same variance within each group. The consequence of ignoring the different variances for the different groups is that comparisons between those groups with small variances will be overly "conservative" (because the overall estimate of variance is larger than an estimate for just these groups), and comparisons between those groups with large variances will be overly confident. If the spread is related to the range of friction, such that groups with large friction values also have larger spread, a possibility is to transform the data with a function such as log or sqrt. This looks like it could be useful here, since three of the four push groups (L1, L2, L3) have the smallest friction values and also the smallest spread.
Some alternative tests for comparing groups without transforming the values first include: t-tests without the equal variance assumption using a “Welch” or “Satterthwaite approximation”, or the Wilcoxon rank sum test mentioned previously. However here, for simplicity of illustration, we will fit a model that assumes equal variance and shows the different kinds of linear model designs using this dataset, setting aside the issue of different within-group variances.
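Purely as an aside (not part of the original analysis), both alternatives are one-liners in R; here is a quick illustration on the L1 leg pair alone:
l1 <- spider[spider$leg == "L1", ]
t.test(friction ~ type, data=l1)        ## Welch t-test: var.equal=FALSE is the default
wilcox.test(friction ~ type, data=l1)   ## rank-based test, no normality assumption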
To remind ourselves how the simple two-group linear model looks, we will subset the data to include only the L1 leg pair, and run lm:
spider.sub <- spider[spider$leg == "L1",]
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)
##
## Call:
## lm(formula = friction ~ type, data = spider.sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33147 -0.10735 -0.04941 -0.00147 0.76853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92147 0.03827 24.078 < 2e-16 ***
## typepush -0.51412 0.05412 -9.499 5.7e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2232 on 66 degrees of freedom
## Multiple R-squared: 0.5776, Adjusted R-squared: 0.5711
## F-statistic: 90.23 on 1 and 66 DF, p-value: 5.698e-14
(coefs <- coef(fit))
## (Intercept) typepush
## 0.9214706 -0.5141176
These two estimated coefficients are the mean of the pull observations (the first estimated coefficient) and the difference between the means of the two groups (the second coefficient). We can show this with R code:
s <- split(spider.sub$friction, spider.sub$type)
mean(s[["pull"]])
## [1] 0.9214706
mean(s[["push"]]) - mean(s[["pull"]])
## [1] -0.5141176
We can form the design matrix, which was used inside lm:
X <- model.matrix(~ type, data=spider.sub)
colnames(X)
## [1] "(Intercept)" "typepush"
head(X)
## (Intercept) typepush
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
tail(X)
## (Intercept) typepush
## 63 1 1
## 64 1 1
## 65 1 1
## 66 1 1
## 67 1 1
## 68 1 1
nrow(X)
## [1] 68
Now we'll make a plot of the \(\mathbf{X}\) matrix by putting a black block for the 1's and a white block for the 0's. This plot will be more interesting for the linear models later on in this script. Along the y-axis is the sample number (the row number of the data) and along the x-axis is the column of the design matrix \(\mathbf{X}\). If you have installed the rafalib library, you can make this plot with the imagemat function:
library(rafalib)
imagemat(X, main="Model matrix for linear model with one variable")
Model matrix for linear model with one variable.
Now we show the coefficient estimates from the linear model in a diagram with arrows (code not shown).
Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the ‘pull’ samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.
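For reference, here is a minimal sketch of how such an arrows diagram could be drawn (this is not the code used for the figure above; the layout constants are arbitrary choices):
stripchart(split(spider.sub$friction, spider.sub$type), vertical=TRUE, pch=1,
           method="jitter", xlim=c(0.5, 2.5), ylim=c(0, 1.2))
a <- -0.25
## green arrow: intercept, from 0 to the pull mean
arrows(1+a, 0, 1+a, coefs[1], lwd=3, col="darkgreen", length=0.1)
## orange arrow: typepush, the push minus pull difference
arrows(2+a, coefs[1], 2+a, coefs[1]+coefs[2], lwd=3, col="orange", length=0.1)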
Now we’ll continue and examine the full dataset, including the observations from all leg pairs. In order to model both the leg pair differences (L1, L2, L3, L4) and the push vs. pull difference, we need to include both terms in the R formula. Let’s see what kind of design matrix will be formed with two variables in the formula:
X <- model.matrix(~ type + leg, data=spider)
colnames(X)
## [1] "(Intercept)" "typepush" "legL2" "legL3" "legL4"
head(X)
## (Intercept) typepush legL2 legL3 legL4
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
nrow(X)
## [1] 282
library(rafalib)
imagemat(X, main="Model matrix for linear model with two factors")
Image of the model matrix for a formula with type + leg
The first column is the intercept, and so it has 1's for all samples. The second column has 1's for the push samples, and we can see that there are four groups of them. Finally, the third, fourth and fifth columns have 1's for the L2, L3 and L4 samples. The L1 samples do not have a column, because L1 is the reference level for leg. Similarly, there is no pull column, because pull is the reference level for the type variable.
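As an aside (not part of the original code), if we preferred a different reference level we could relevel the factor before building the design matrix; a quick hypothetical illustration:
spider2 <- spider
spider2$leg <- relevel(spider2$leg, ref="L2")   ## make L2 the reference leg pair
head(model.matrix(~ type + leg, data=spider2))  ## now there is a legL1 column instead of legL2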
To estimate coefficients for this model, we use lm with the formula ~ type + leg. We'll save the linear model to fitTL, standing for a fit with Type and Leg.
fitTL <- lm(friction ~ type + leg, data=spider)
summary(fitTL)
##
## Call:
## lm(formula = friction ~ type + leg, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46392 -0.13441 -0.00525 0.10547 0.69509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.05392 0.02816 37.426 < 2e-16 ***
## typepush -0.77901 0.02482 -31.380 < 2e-16 ***
## legL2 0.17192 0.04569 3.763 0.000205 ***
## legL3 0.16049 0.03251 4.937 1.37e-06 ***
## legL4 0.28134 0.03438 8.183 1.01e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2084 on 277 degrees of freedom
## Multiple R-squared: 0.7916, Adjusted R-squared: 0.7886
## F-statistic: 263 on 4 and 277 DF, p-value: < 2.2e-16
(coefs <- coef(fitTL))
## (Intercept) typepush legL2 legL3 legL4
## 1.0539153 -0.7790071 0.1719216 0.1604921 0.2813382
R uses the name coefficients to denote the component containing the least squares estimates. It is important to remember that the coefficients are parameters that we do not observe, but only estimate.
The model we are fitting above can be written as
\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \varepsilon_i, i=1,\dots,N \]
with the \(x\) all indicator variables denoting push or pull and which leg pair. For example, a push on leg pair L3 will have \(x_{i,1}\) and \(x_{i,3}\) equal to 1 and the rest would be 0. Throughout this section we will refer to the \(\beta\)s by the effects they represent. For example, we call \(\beta_0\) the intercept, \(\beta_1\) the push effect, \(\beta_2\) the L2 effect, etc. We do not observe the coefficients, e.g. \(\beta_1\), directly, but estimate them with, e.g., \(\hat{\beta}_1\).
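As a concrete check (an illustration added here, using the same formula as below), the design matrix row for the first push observation on leg pair L3 has 1's only in the intercept, typepush and legL3 columns:
X <- model.matrix(~ type + leg, data=spider)
X[which(spider$type == "push" & spider$leg == "L3")[1], ]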
We can now form the matrix \(\mathbf{X}\) depicted above and obtain the least square estimates with:
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
Y <- spider$friction
X <- model.matrix(~ type + leg, data=spider)
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y
t(beta.hat)
## (Intercept) typepush legL2 legL3 legL4
## [1,] 1.053915 -0.7790071 0.1719216 0.1604921 0.2813382
coefs
## (Intercept) typepush legL2 legL3 legL4
## 1.0539153 -0.7790071 0.1719216 0.1604921 0.2813382
We can see that these values agree with the output of lm.
We can make the same plot as before, with arrows for each of the estimated coefficients in the model (code not shown).
Diagram of the estimated coefficients in the linear model. As before, the teal-green arrow represents the Intercept, which fits the mean of the reference group (here, the pull samples for leg L1). The purple, pink, and yellow-green arrows represent differences between the three other leg groups and L1. The orange arrow represents the difference between the push and pull samples for all groups.
In this case, the fitted means for each group, derived from the fitted coefficients, do not line up with those we obtain from simply taking the average from each of the eight possible groups. The reason is that our model uses five coefficients, instead of eight. We are assuming that the effects are additive. However, as we demonstrate in more detail below, this particular dataset is better described with a model including interactions.
s <- split(spider$friction, spider$group)
mean(s[["L1pull"]])
## [1] 0.9214706
coefs[1]
## (Intercept)
## 1.053915
mean(s[["L1push"]])
## [1] 0.4073529
coefs[1] + coefs[2]
## (Intercept)
## 0.2749082
Here we can demonstrate that the push vs. pull estimated coefficient, coefs[2], is a weighted average of the difference of the means for each group. Furthermore, the weighting is determined by the sample size of each group. The math works out simply here because the sample size is equal for the push and pull subgroups within each leg pair. If the sample sizes were not equal for push and pull within each leg pair, the weighting would be more complicated, but still uniquely determined by a formula involving the sample size of each subgroup, the total sample size, and the number of coefficients. This can be worked out from \((\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\).
means <- sapply(s, mean)
##the sample size of push or pull groups for each leg pair
ns <- sapply(s, length)[c(1,3,5,7)]
(w <- ns/sum(ns))
## L1pull L2pull L3pull L4pull
## 0.2411348 0.1063830 0.3687943 0.2836879
sum(w * (means[c(2,4,6,8)] - means[c(1,3,5,7)]))
## [1] -0.7790071
coefs[2]
## typepush
## -0.7790071
Sometimes, the comparison we are interested in is represented directly by a single coefficient in the model, such as the push vs. pull difference, which was coefs[2] above. However, sometimes we want to make a comparison which is not a single coefficient, but a combination of coefficients, which is called a contrast. To introduce the concept of contrasts, first consider the comparisons which we can read off from the linear model summary:
coefs
## (Intercept) typepush legL2 legL3 legL4
## 1.0539153 -0.7790071 0.1719216 0.1604921 0.2813382
Here we have the intercept estimate, the push vs. pull estimated effect across all leg pairs, and the estimates for the L2 vs. L1 effect, the L3 vs. L1 effect, and the L4 vs. L1 effect. What if we want to compare two groups and one of those groups is not L1? The solution to this question is to use contrasts.
A contrast is a linear combination of estimated coefficients: \(\mathbf{c^\top} \hat{\boldsymbol{\beta}}\), where \(\mathbf{c}\) is a column vector with as many rows as the number of coefficients in the linear model. If \(\mathbf{c}\) has a 0 for one or more of its rows, then the corresponding estimated coefficients in \(\hat{\boldsymbol{\beta}}\) are not involved in the contrast.
If we want to compare leg pairs L3 and L2, this is equivalent to contrasting two coefficients from the linear model because, in this contrast, the comparison to the reference level L1 cancels out:
\[ (\mbox{L3} - \mbox{L1}) - (\mbox{L2} - \mbox{L1}) = \mbox{L3} - \mbox{L2 }\]
An easy way to make these contrasts of two groups is to use the contrast function from the contrast package. We just need to specify which groups we want to compare. We have to pick one of the pull or push types, although the answer will not differ, as we will see below.
library(contrast) #Available from CRAN
L3vsL2 <- contrast(fitTL,list(leg="L3",type="pull"),list(leg="L2",type="pull"))
L3vsL2
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## -0.01142949 0.04319685 -0.0964653 0.07360632 -0.26 277 0.7915
The first column, Contrast, gives the L3 vs L2 estimate from the model we fit above.
We can show that the least squares estimate of a linear combination of coefficients is the same linear combination of the least squares estimates. Therefore, the effect size estimate is just the difference between two estimated coefficients. The contrast vector used by contrast is stored as a variable called X within the resulting object (not to be confused with our original \(\mathbf{X}\), the design matrix).
coefs
## (Intercept) typepush legL2 legL3 legL4
## 1.0539153 -0.7790071 0.1719216 0.1604921 0.2813382
coefs[4] - coefs[3]
## legL3
## -0.01142949
(cT <- L3vsL2$X)
## (Intercept) typepush legL2 legL3 legL4
## 1 0 0 -1 1 0
## attr(,"assign")
## [1] 0 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$type
## [1] "contr.treatment"
##
## attr(,"contrasts")$leg
## [1] "contr.treatment"
cT %*% coefs
## [,1]
## 1 -0.01142949
What about the standard error and t-statistic? As before, the t-statistic is the estimate divided by the standard error. The standard error of the contrast estimate is formed by multiplying the contrast vector \(\mathbf{c}\) on either side of the estimated covariance matrix, \(\hat{\Sigma}\), our estimate for \(\mathrm{var}(\hat{\boldsymbol{\beta}})\):
\[ \sqrt{\mathbf{c^\top} \hat{\boldsymbol{\Sigma}} \mathbf{c}} \]
where we saw the covariance of the coefficients earlier:
\[ \boldsymbol{\Sigma} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \]
We estimate \(\sigma^2\) with the sample estimate \(\hat{\sigma}^2\) described above and obtain:
Sigma.hat <- sum(fitTL$residuals^2)/(nrow(X) - ncol(X)) * solve(t(X) %*% X)
signif(Sigma.hat, 2)
## (Intercept) typepush legL2 legL3 legL4
## (Intercept) 0.00079 -3.1e-04 -0.00064 -0.00064 -0.00064
## typepush -0.00031 6.2e-04 0.00000 0.00000 0.00000
## legL2 -0.00064 -6.4e-20 0.00210 0.00064 0.00064
## legL3 -0.00064 -6.4e-20 0.00064 0.00110 0.00064
## legL4 -0.00064 -1.2e-19 0.00064 0.00064 0.00120
sqrt(cT %*% Sigma.hat %*% t(cT))
## 1
## 1 0.04319685
L3vsL2$SE
## [1] 0.04319685
We would have obtained the same result for a contrast of L3 and L2 had we picked type="push". The reason it does not change the contrast is that it leads to addition of the typepush effect on both sides of the difference, which cancels out:
L3vsL2.equiv <- contrast(fitTL,list(leg="L3",type="push"),list(leg="L2",type="push"))
L3vsL2.equiv
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## -0.01142949 0.04319685 -0.0964653 0.07360632 -0.26 277 0.7915
L3vsL2.equiv$X
## (Intercept) typepush legL2 legL3 legL4
## 1 0 0 -1 1 0
## attr(,"assign")
## [1] 0 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$type
## [1] "contr.treatment"
##
## attr(,"contrasts")$leg
## [1] "contr.treatment"
Suppose we have an experiment with two species A and B, and two conditions: control and treated.
species <- factor(c("A","A","B","B"))
condition <- factor(c("control","treated","control","treated"))
And we will use a formula of ‘~ species + condition’.
The model matrix is then:
model.matrix(~ species + condition)
## (Intercept) speciesB conditiontreated
## 1 1 0 0
## 2 1 0 1
## 3 1 1 0
## 4 1 1 1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$species
## [1] "contr.treatment"
##
## attr(,"contrasts")$condition
## [1] "contr.treatment"
Contrasts Exercises #1
Suppose we want to build a contrast of coefficients for the above experimental design.
You can either figure this question out through logic, by looking at the design matrix, or using the contrast() function from the contrast library. If you have not done so already, you should download the contrast library. The contrast vector is returned as contrast(…)$X.
Q: What should the contrast vector be for the contrast of (species=B and condition=control) vs (species=A and condition=treated)? Assume that the beta vector from the model fit by R is: Intercept, speciesB, conditiontreated.
#As you want to compare species B vs A and condition control vs treated, you want +1 for the speciesB coefficient and -1 for the conditiontreated coefficient.
#Another way to find this would be by generating some random y, fitting a model, and using contrast():
y = rnorm(4)
fit = lm(y ~ species + condition)
contrast(fit, list(species="B",condition="control"), list(species="A",condition="treated"))$X
## (Intercept) speciesB conditiontreated
## 1 0 1 -1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$species
## [1] "contr.treatment"
##
## attr(,"contrasts")$condition
## [1] "contr.treatment"
A:0 1 -1
Contrasts Exercises #2
Load the spider dataset like this:
Suppose we build a model using two variables: ~ type + leg.
Q:What is the t-value for the contrast of leg pair L4 vs leg pair L2?
fitTL <- lm(friction ~ type + leg, data=spider)
summary(fitTL)
##
## Call:
## lm(formula = friction ~ type + leg, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46392 -0.13441 -0.00525 0.10547 0.69509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.05392 0.02816 37.426 < 2e-16 ***
## typepush -0.77901 0.02482 -31.380 < 2e-16 ***
## legL2 0.17192 0.04569 3.763 0.000205 ***
## legL3 0.16049 0.03251 4.937 1.37e-06 ***
## legL4 0.28134 0.03438 8.183 1.01e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2084 on 277 degrees of freedom
## Multiple R-squared: 0.7916, Adjusted R-squared: 0.7886
## F-statistic: 263 on 4 and 277 DF, p-value: < 2.2e-16
(coefs <- coef(fitTL))
## (Intercept) typepush legL2 legL3 legL4
## 1.0539153 -0.7790071 0.1719216 0.1604921 0.2813382
library(contrast) #Available from CRAN
L4vsL2 <- contrast(fitTL,list(leg="L4",type="pull"),list(leg="L2",type="pull"))
L4vsL2$testStat
## 1
## 2.451974
A:2.451974
The t-value for the contrast of leg pair L4 vs leg pair L2 is constructed by taking the difference of the coefficients legL4 and legL2, and then dividing by the standard error of the difference. In the last question we will explore how the standard error of the difference is calculated here.
X <- model.matrix(~ type + leg, data=spider)
(Sigma <- sum(fitTL$residuals^2)/(nrow(X) - ncol(X)) * solve(t(X) %*% X))
## (Intercept) typepush legL2 legL3
## (Intercept) 0.0007929832 -3.081306e-04 -0.0006389179 -0.0006389179
## typepush -0.0003081306 6.162612e-04 0.0000000000 0.0000000000
## legL2 -0.0006389179 -6.439411e-20 0.0020871318 0.0006389179
## legL3 -0.0006389179 -6.439411e-20 0.0006389179 0.0010566719
## legL4 -0.0006389179 -1.191291e-19 0.0006389179 0.0006389179
## legL4
## (Intercept) -0.0006389179
## typepush 0.0000000000
## legL2 0.0006389179
## legL3 0.0006389179
## legL4 0.0011819981
#Our contrast matrix is:
C <- matrix(c(0,0,-1,0,1),1,5)
Contrasts Exercises #3
Using Sigma, what is Cov(beta-hat_L4, beta-hat_L2)?
Sigma[3,5]
## [1] 0.0006389179
A:0.0006389179
In the previous linear model, we assumed that the push vs. pull effect was the same for all of the leg pairs (the same orange arrow). You can easily see that this does not capture the trends in the data that well. That is, the tips of the arrows did not line up perfectly with the group averages. For the L1 leg pair, the push vs. pull estimated coefficient was too large, and for the L3 leg pair, the push vs. pull coefficient was somewhat too small.
Interaction terms will help us overcome this problem by introducing additional coefficients to compensate for differences in the push vs. pull effect across the 4 groups. As we already have a push vs. pull term in the model, we only need to add three more terms to have the freedom to find leg-pair-specific push vs. pull differences. As we will see, interaction terms are added to the design matrix by multiplying the columns of the design matrix representing existing terms.
We can rebuild our linear model with an interaction between type and leg, by including an extra term in the formula, type:leg. The : symbol adds an interaction between the two variables surrounding it. An equivalent way to specify this model is ~ type*leg, which will expand to the formula ~ type + leg + type:leg, with main effects for type and leg and an interaction term type:leg.
X <- model.matrix(~ type + leg + type:leg, data=spider)
#Equivalently
X <- model.matrix(~ type*leg, data=spider)
colnames(X)
## [1] "(Intercept)" "typepush" "legL2" "legL3"
## [5] "legL4" "typepush:legL2" "typepush:legL3" "typepush:legL4"
head(X)
## (Intercept) typepush legL2 legL3 legL4 typepush:legL2 typepush:legL3
## 1 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0
## 5 1 0 0 0 0 0 0
## 6 1 0 0 0 0 0 0
## typepush:legL4
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
imagemat(X, main="Model matrix for linear model with interactions")
Image of model matrix with interactions.
Columns 6-8 (typepush:legL2, typepush:legL3, and typepush:legL4) are the product of the 2nd column (typepush) and columns 3-5 (the three leg columns). Looking at the last column, for example, the typepush:legL4 column adds an extra coefficient \(\beta_{\textrm{push,L4}}\) to those samples which are both push samples and leg pair L4 samples. This accounts for a possible difference when the mean of the samples in the L4-push group is not at the location which would be predicted by adding the estimated intercept, the estimated push coefficient typepush, and the estimated L4 coefficient legL4.
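We can verify directly that these interaction columns are element-wise products of the corresponding main-effect columns (a quick check added here, not shown in the original text):
all(X[, "typepush:legL4"] == X[, "typepush"] * X[, "legL4"])
## [1] TRUE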
We can run the linear model using the same code as before:
fitX <- lm(friction ~ type + leg + type:leg, data=spider)
summary(fitX)
##
## Call:
## lm(formula = friction ~ type + leg + type:leg, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46385 -0.10735 -0.01111 0.07848 0.76853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92147 0.03266 28.215 < 2e-16 ***
## typepush -0.51412 0.04619 -11.131 < 2e-16 ***
## legL2 0.22386 0.05903 3.792 0.000184 ***
## legL3 0.35238 0.04200 8.390 2.62e-15 ***
## legL4 0.47928 0.04442 10.789 < 2e-16 ***
## typepush:legL2 -0.10388 0.08348 -1.244 0.214409
## typepush:legL3 -0.38377 0.05940 -6.461 4.73e-10 ***
## typepush:legL4 -0.39588 0.06282 -6.302 1.17e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1904 on 274 degrees of freedom
## Multiple R-squared: 0.8279, Adjusted R-squared: 0.8235
## F-statistic: 188.3 on 7 and 274 DF, p-value: < 2.2e-16
coefs <- coef(fitX)
Here is where the plot with arrows really helps us interpret the coefficients. The estimated interaction coefficients (the yellow, brown and silver arrows) allow leg-pair-specific differences in the push vs. pull difference. The orange arrow now represents the estimated push vs. pull difference only for the reference leg pair, which is L1. If an estimated interaction coefficient is large, this means that the push vs. pull difference for that leg pair is very different than the push vs. pull difference in the reference leg pair.
Now, as we have eight coefficients in the model and eight groups, you can check that the tips of the arrowheads are exactly equal to the group means (code not shown).
Diagram of the estimated coefficients in the linear model. In the design with interaction terms, the orange arrow now indicates the push vs pull difference only for the reference group (L1), while three new arrows (yellow, brown and grey) indicate the additional push vs pull differences in the non-reference groups (L2, L3 and L4) with respect to the reference group.
Again we will show how to combine estimated coefficients from the model using contrasts. For some simple cases, we can use the contrast package. Suppose we want to know the push vs. pull effect for the L2 leg pair samples. We can see from the arrow plot that this is the orange arrow plus the yellow arrow. We can also specify this comparison with the contrast function:
library(contrast) ##Available from CRAN
L2push.vs.pull <- contrast(fitX,
list(leg="L2", type = "push"),
list(leg="L2", type = "pull"))
L2push.vs.pull
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## -0.618 0.0695372 -0.7548951 -0.4811049 -8.89 274 0
coefs[2] + coefs[6] ##we know this is also orange + yellow arrow
## typepush
## -0.618
The question of whether the push vs. pull difference is different in L2 compared to L1 is answered by a single term in the model: the typepush:legL2 estimated coefficient, corresponding to the yellow arrow in the plot. A p-value for whether this coefficient is actually equal to zero can be read off from the table printed with summary(fitX) above. Similarly, we can read off the p-values for the differences of differences for L3 vs L1 and for L4 vs L1.
Suppose we want to know if the push vs. pull difference is different in L3 compared to L2. By examining the arrows in the diagram above, we can see that the push vs. pull effect for a leg pair other than L1 is the typepush arrow plus the interaction term for that leg pair.
If we work out the math for comparing across two non-reference leg pairs, this is:
\[ (\mbox{typepush} + \mbox{typepush:legL3}) - (\mbox{typepush} + \mbox{typepush:legL2}) \]
…which simplifies to:
\[ = \mbox{typepush:legL3} - \mbox{typepush:legL2} \]
We can't make this contrast using the contrast function shown before, but we can make this comparison using the glht (for "general linear hypothesis test") function from the multcomp package. We need to form a 1-row matrix which has a -1 for the typepush:legL2 coefficient and a +1 for the typepush:legL3 coefficient. We provide this matrix to the linfct (linear function) argument, and obtain a summary table for this contrast of estimated interaction coefficients.
Note that there are other ways to perform contrasts using base R, and this is just our preferred way.
library(multcomp) ##Available from CRAN
C <- matrix(c(0,0,0,0,0,-1,1,0), 1)
L3vsL2interaction <- glht(fitX, linfct=C)
summary(L3vsL2interaction)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = friction ~ type + leg + type:leg, data = spider)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.27988 0.07893 -3.546 0.00046 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
coefs[7] - coefs[6] ##we know this is also brown - yellow
## typepush:legL3
## -0.2798846
Suppose that we want to know if the push vs. pull difference is different across leg pairs in general. We do not want to compare any two leg pairs in particular, but rather we want to know if the three interaction terms which represent differences in the push vs. pull difference across leg pairs are larger than we would expect them to be if the push vs pull difference was in fact equal across all leg pairs.
Such a question can be answered by an analysis of variance, which is often abbreviated as ANOVA. ANOVA compares the reduction in the sum of squares of the residuals for models of different complexity. The model with eight coefficients is more complex than the model with five coefficients where we assumed the push vs pull difference was equal across leg pairs. The least complex model would only use a single coefficient, an intercept. Under certain assumptions we can also perform inference that determines the probability of improvements as large as what we observed. Let’s first print the result of an ANOVA in R and then examine the results in detail:
anova(fitX)
## Analysis of Variance Table
##
## Response: friction
## Df Sum Sq Mean Sq F value Pr(>F)
## type 1 42.783 42.783 1179.713 < 2.2e-16 ***
## leg 3 2.921 0.974 26.847 2.972e-15 ***
## type:leg 3 2.098 0.699 19.282 2.256e-11 ***
## Residuals 274 9.937 0.036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first line tells us that adding the variable type (push or pull) to the design is very useful (it reduces the sum of squared residuals) compared to a model with only an intercept. We can see that it is useful, because this single coefficient reduces the sum of squares by 42.783. The original sum of squares of the model with just an intercept is:
mu0 <- mean(spider$friction)
(initial.ss <- sum((spider$friction - mu0)^2))
## [1] 57.73858
Note that this initial sum of squares is just a scaled version of the sample variance:
N <- nrow(spider)
(N - 1) * var(spider$friction)
## [1] 57.73858
Let's see exactly how we get this 42.783. We need to calculate the sum of squared residuals for the model with only the type information. We can do this by calculating the residuals, squaring them, summing them within groups, and then summing across the groups.
s <- split(spider$friction, spider$type)
after.type.ss <- sum( sapply(s, function(x) {
residual <- x - mean(x)
sum(residual^2)
}) )
The reduction in sum of squared residuals from introducing the type coefficient is therefore:
(type.ss <- initial.ss - after.type.ss)
## [1] 42.78307
Through simple arithmetic, this reduction can be shown to be equivalent to the sum of squared differences between the fitted values for the models with formulas ~type and ~1:
sum(sapply(s, length) * (sapply(s, mean) - mu0)^2)
## [1] 42.78307
Keep in mind that the order of terms in the formula, and therefore rows in the ANOVA table, is important: each row considers the reduction in the sum of squared residuals after adding coefficients compared to the model in the previous row.
The other columns in the ANOVA table show the "degrees of freedom" associated with each row. As the type variable introduced only one term in the model, the Df column has a 1. Because the leg variable introduced three terms in the model (legL2, legL3 and legL4), the Df column has a 3.
Finally, there is a column which lists the F value. The F value is the mean of squares for the inclusion of the terms of interest (the sum of squares divided by the degrees of freedom) divided by the mean squared residuals (from the bottom row):
\[ r_i = Y_i - \hat{Y}_i \]
\[ \mbox{Mean Sq Residuals} = \frac{1}{N - p} \sum_{i=1}^N r_i^2 \]
where \(p\) is the number of coefficients in the model (here eight, including the intercept term).
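As a quick check (an addition to the original text), we can recompute this residual mean square directly from the interaction fit; it should match the Residuals row of the ANOVA table above (about 0.036):
p <- length(coef(fitX))                     ## 8 coefficients, including the intercept
sum(fitX$residuals^2) / (nrow(spider) - p)  ## roughly 0.036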
Under the null hypothesis (the true value of the additional coefficient(s) is 0), we have a theoretical result for what the distribution of the F value will be for each row. The assumptions needed for this approximation to hold are similar to those of the t-distribution approximation we described in earlier chapters. We either need a large sample size so that the CLT applies, or we need the population data to follow a normal distribution.
As an example of how one interprets these p-values, let's take the last row, type:leg, which specifies the three interaction coefficients. Under the null hypothesis that the true value for these three additional terms is actually 0, e.g. \(\beta_{\textrm{push,L2}} = 0, \beta_{\textrm{push,L3}} = 0, \beta_{\textrm{push,L4}} = 0\), we can calculate the chance of seeing such a large F-value for this row of the ANOVA table. Remember that we are only concerned with large values here: because the F-value is a ratio of sums of squares, it can only be positive. The p-value in the last column for the type:leg row can be interpreted as follows: under the null hypothesis that there are no differences in the push vs. pull difference across leg pairs, this is the probability of the estimated interaction coefficients explaining so much of the observed variance. If this p-value is small, we would consider rejecting the null hypothesis that the push vs. pull difference is the same across leg pairs.
The F distribution has two parameters: one for the degrees of freedom of the numerator (the terms of interest) and one for the denominator (the residuals). In the case of the interaction coefficients row, these are 3, the number of interaction coefficients, and 274, the number of samples minus the total number of coefficients.
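As an illustration (not shown in the original text), we can reproduce the p-value in the type:leg row from its F value and these two degrees of freedom:
## upper tail of the F(3, 274) distribution at the observed F value
pf(19.282, df1=3, df2=274, lower.tail=FALSE)  ## approximately 2.3e-11, matching the table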
Remember, you can check the book page for interactions and contrasts here.
Start by loading the spider dataset:
# url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv"
# filename <- "spider_wolff_gorb_2013.csv"
# library(downloader)
# if (!file.exists(filename)) download(url, filename)
# spider <- read.csv(filename, skip=1)
head(spider)
## leg type friction group
## 1 L1 pull 0.90 L1pull
## 2 L1 pull 0.91 L1pull
## 3 L1 pull 0.86 L1pull
## 4 L1 pull 0.85 L1pull
## 5 L1 pull 0.80 L1pull
## 6 L1 pull 0.87 L1pull
#Suppose that we notice that the within-group variances for the groups with smaller frictional coefficients are generally smaller, and so we try to apply a transformation to the frictional coefficients to make the within-group variances more constant.
#Add a new variable log2friction to the spider dataframe:
spider$log2friction <- log2(spider$friction)
head(spider)
## leg type friction group log2friction
## 1 L1 pull 0.90 L1pull -0.1520031
## 2 L1 pull 0.91 L1pull -0.1360615
## 3 L1 pull 0.86 L1pull -0.2175914
## 4 L1 pull 0.85 L1pull -0.2344653
## 5 L1 pull 0.80 L1pull -0.3219281
## 6 L1 pull 0.87 L1pull -0.2009127
#The 'Y' values now look like:
boxplot(log2friction ~ type*leg, data=spider)
#Run a linear model of log2friction with type, leg and interactions between type and leg.
fitln <- lm(log2friction ~ type*leg, data=spider)
summary(fitln)
##
## Call:
## lm(formula = log2friction ~ type * leg, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35902 -0.19193 0.00596 0.16315 1.33090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.16828 0.06613 -2.545 0.011487 *
## typepush -1.20656 0.09352 -12.901 < 2e-16 ***
## legL2 0.34681 0.11952 2.902 0.004014 **
## legL3 0.48999 0.08505 5.762 2.24e-08 ***
## legL4 0.64668 0.08995 7.189 6.20e-12 ***
## typepush:legL2 0.09967 0.16903 0.590 0.555906
## typepush:legL3 -0.54075 0.12027 -4.496 1.02e-05 ***
## typepush:legL4 -0.46920 0.12721 -3.689 0.000272 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3856 on 274 degrees of freedom
## Multiple R-squared: 0.8125, Adjusted R-squared: 0.8077
## F-statistic: 169.6 on 7 and 274 DF, p-value: < 2.2e-16
Interactions Exercises #1
What is the t-value for the interaction of type push and leg L4? If this t-value is sufficiently large, we would reject the null hypothesis that the push vs pull effect on log2(friction) is the same in L4 as in L1.
summary(fitln)
##
## Call:
## lm(formula = log2friction ~ type * leg, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35902 -0.19193 0.00596 0.16315 1.33090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.16828 0.06613 -2.545 0.011487 *
## typepush -1.20656 0.09352 -12.901 < 2e-16 ***
## legL2 0.34681 0.11952 2.902 0.004014 **
## legL3 0.48999 0.08505 5.762 2.24e-08 ***
## legL4 0.64668 0.08995 7.189 6.20e-12 ***
## typepush:legL2 0.09967 0.16903 0.590 0.555906
## typepush:legL3 -0.54075 0.12027 -4.496 1.02e-05 ***
## typepush:legL4 -0.46920 0.12721 -3.689 0.000272 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3856 on 274 degrees of freedom
## Multiple R-squared: 0.8125, Adjusted R-squared: 0.8077
## F-statistic: 169.6 on 7 and 274 DF, p-value: < 2.2e-16
A:-3.689
Interactions Exercises #2
What is the F-value for all of the type:leg interaction terms, in an analysis of variance? If this value is sufficiently large, we would reject the null hypothesis that the push vs pull effect on log2(friction) is the same for all leg pairs.
anova(fitln)
## Analysis of Variance Table
##
## Response: log2friction
## Df Sum Sq Mean Sq F value Pr(>F)
## type 1 164.709 164.709 1107.714 < 2.2e-16 ***
## leg 3 7.065 2.355 15.838 1.589e-09 ***
## type:leg 3 4.774 1.591 10.701 1.130e-06 ***
## Residuals 274 40.742 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A:10.70
Interactions Exercises #3
What is the L2 vs L1 estimate in log2friction for the pull samples?
contrast(fitln, list(type="pull",leg="L2"), list(type="pull",leg="L1"))
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## 0.3468125 0.1195246 0.1115092 0.5821157 2.9 274 0.004
#which is simply the legL2 coefficient:
coef(fitln)["legL2"]
## legL2
## 0.3468125
A:0.34681
Interactions Exercises #4
What is the L2 vs L1 estimate in log2friction for the push samples? Remember, because of the interaction terms, this is not the same as the L2 vs L1 difference for the pull samples. If you’re not sure use the contrast() function. Another hint: consider the arrows plot for the model with interactions.
contrast(fitln, list(type="push",leg="L2"), list(type="push",leg="L1"))
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## 0.4464843 0.1195246 0.211181 0.6817875 3.74 274 2e-04
#which is the legL2 coefficient plus the push:L2 interaction:
coef(fitln)["legL2"] + coef(fitln)["typepush:legL2"]
## legL2
## 0.4464843
A:0.4464843
Note that taking the log2 of a Y value and then fitting a linear model has a meaningful effect on the coefficients. If we have
\[ \log_2(Y_1) = \beta_0 \]
and
\[ \log_2(Y_2) = \beta_0 + \beta_1 \]
then
\[ Y_2/Y_1 = 2^{\beta_0 + \beta_1} / 2^{\beta_0} = 2^{\beta_1} \]
So \(\beta_1\) represents a log2 fold change of \(Y_2\) over \(Y_1\). If \(\beta_1 = 1\), then \(Y_2\) is 2 times \(Y_1\). If \(\beta_1 = -1\), then \(Y_2\) is half of \(Y_1\), etc.
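For instance (a small addition using the fit above), the legL2 coefficient can be converted back to a fold change on the original friction scale:
2^coef(fitln)["legL2"]  ## about 1.27: L2 pull friction is roughly 27% higher than L1 pull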
In the video we briefly mentioned the analysis of variance (or ANOVA, performed in R using the anova() function), which allows us to test whether a number of coefficients are equal to zero, by comparing a linear model including these terms to a linear model where these terms are set to 0.
The book page for this section has a section, “Testing all differences of differences”, which explains the ANOVA concept and the F-test in some more detail. You can read over that section before or after the following question.
In this last question, we will use Monte Carlo techniques to observe the distribution of the ANOVA’s “F-value” under the null hypothesis, that there are no differences between groups.
Suppose we have 4 groups, and 10 samples per group, so 40 samples overall:
N <- 40
p <- 4
group <- factor(rep(1:p,each=N/p))
X <- model.matrix(~ group)
X
## (Intercept) group2 group3 group4
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 1 0 0 0
## 6 1 0 0 0
## 7 1 0 0 0
## 8 1 0 0 0
## 9 1 0 0 0
## 10 1 0 0 0
## 11 1 1 0 0
## 12 1 1 0 0
## 13 1 1 0 0
## 14 1 1 0 0
## 15 1 1 0 0
## 16 1 1 0 0
## 17 1 1 0 0
## 18 1 1 0 0
## 19 1 1 0 0
## 20 1 1 0 0
## 21 1 0 1 0
## 22 1 0 1 0
## 23 1 0 1 0
## 24 1 0 1 0
## 25 1 0 1 0
## 26 1 0 1 0
## 27 1 0 1 0
## 28 1 0 1 0
## 29 1 0 1 0
## 30 1 0 1 0
## 31 1 0 0 1
## 32 1 0 0 1
## 33 1 0 0 1
## 34 1 0 0 1
## 35 1 0 0 1
## 36 1 0 0 1
## 37 1 0 0 1
## 38 1 0 0 1
## 39 1 0 0 1
## 40 1 0 0 1
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
We will show here how to calculate the "F-value", and then we will use random numbers to observe the distribution of the F-value under the null hypothesis.
The F-value is the mean sum of squares explained by the terms of interest (in our case, the ‘group’ terms) divided by the mean sum of squares of the residuals of a model including the terms of interest. So it is the explanatory power of the terms divided by the leftover variance.
Intuitively, if this number is large, it means that the group variable explains a lot of the variance in the data, compared to the amount of variance left in the data after using group information. We will calculate these values exactly here:
First generate some random, null data, where the mean is the same for all groups:
Y <- rnorm(N,mean=42,7)
The base model we will compare against is simply Y-hat = mean(Y), which we will call mu0, and the initial sum of squares is the sum of the squared differences between the Y values and mu0:
mu0 <- mean(Y)
initial.ss <- sum((Y - mu0)^2)
We then need to calculate the fitted values for each group, which is simply the mean of each group, and the residuals from this model, which we will call “after.group.ss” for the sum of squares after using the group information:
s <- split(Y, group)
after.group.ss <- sum(sapply(s, function(x) sum((x - mean(x))^2)))
#Then the explanatory power of the group variable is the initial sum of squares minus the residual sum of squares:
(group.ss <- initial.ss - after.group.ss)
## [1] 28.1478
We calculate the means of these sums of squares by dividing by terms which account for the number of fitted parameters. For the group sum of squares, this is the number of parameters used to fit the groups (3, because the intercept is in the initial model). For the after-group sum of squares, this is the number of samples minus the total number of parameters (so N - 4, including the intercept).
group.ms <- group.ss / (p - 1)
after.group.ms <- after.group.ss / (N - p)
#The F-value is simply the ratio of these mean sums of squares.
(f.value <- group.ms / after.group.ms)
## [1] 0.1701274
What’s the point of all these calculations? The point is that, after following these steps, the exact distribution of the F-value has a nice mathematical formula under the null hypothesis. We will see this below.
Interactions Exercises #5
Set the seed to 1 with set.seed(1), then calculate the F-value for 1000 random versions of Y. What is the mean of these F-values?
set.seed(1)
Fs = replicate(1000, {
Y = rnorm(N,mean=42,7)
mu0 = mean(Y)
initial.ss = sum((Y - mu0)^2)
s = split(Y, group)
after.group.ss = sum(sapply(s, function(x) sum((x - mean(x))^2)))
(group.ss = initial.ss - after.group.ss)
group.ms = group.ss / (p - 1)
after.group.ms = after.group.ss / (N - p)
f.value = group.ms / after.group.ms
return(f.value)
})
mean(Fs)
## [1] 1.069771
A:1.069771
#Plot the distribution of the 1000 F-values:
hist(Fs, col="grey", border="white", breaks=50, freq=FALSE)
#Overlay the theoretical F-distribution, with parameters df1=p - 1, df2=N - p.
xs <- seq(from=0,to=6,length=100)
lines(xs, df(xs, df1 = p - 1, df2 = N - p), col="red")
#This is the distribution which is used to calculate the p-values for the ANOVA table produced by anova().
Now we show an alternate specification of the same model, wherein we assume that each combination of type and leg has its own mean value (and so that the push vs. pull effect is not the same for each leg pair). This specification is in some ways simpler, as we will see, but it does not allow us to build the ANOVA table as above, because it does not split interaction coefficients out in the same way.
We start by constructing a factor variable with a level for each unique combination of type and leg. We include a 0 + in the formula because we do not want to include an intercept in the model matrix.
##earlier, we defined the 'group' column:
spider$group <- factor(paste0(spider$leg, spider$type))
X <- model.matrix(~ 0 + group, data=spider)
colnames(X)
## [1] "groupL1pull" "groupL1push" "groupL2pull" "groupL2push" "groupL3pull"
## [6] "groupL3push" "groupL4pull" "groupL4push"
head(X)
## groupL1pull groupL1push groupL2pull groupL2push groupL3pull groupL3push
## 1 1 0 0 0 0 0
## 2 1 0 0 0 0 0
## 3 1 0 0 0 0 0
## 4 1 0 0 0 0 0
## 5 1 0 0 0 0 0
## 6 1 0 0 0 0 0
## groupL4pull groupL4push
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
imagemat(X, main="Model matrix for linear model with group variable")
Image of model matrix for linear model with group variable. This model, also with eight terms, gives a unique fitted value for each combination of type and leg.
We can run the linear model with the familiar call:
fitG <- lm(friction ~ 0 + group, data=spider)
summary(fitG)
##
## Call:
## lm(formula = friction ~ 0 + group, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46385 -0.10735 -0.01111 0.07848 0.76853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## groupL1pull 0.92147 0.03266 28.21 <2e-16 ***
## groupL1push 0.40735 0.03266 12.47 <2e-16 ***
## groupL2pull 1.14533 0.04917 23.29 <2e-16 ***
## groupL2push 0.52733 0.04917 10.72 <2e-16 ***
## groupL3pull 1.27385 0.02641 48.24 <2e-16 ***
## groupL3push 0.37596 0.02641 14.24 <2e-16 ***
## groupL4pull 1.40075 0.03011 46.52 <2e-16 ***
## groupL4push 0.49075 0.03011 16.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1904 on 274 degrees of freedom
## Multiple R-squared: 0.96, Adjusted R-squared: 0.9588
## F-statistic: 821 on 8 and 274 DF, p-value: < 2.2e-16
coefs <- coef(fitG)
Now we have eight arrows, one for each group. The arrow tips align directly with the mean of each group:
Diagram of the estimated coefficients in the linear model, with each term representing the mean of a combination of type and leg.
While we cannot perform an ANOVA with this formulation, we can easily contrast the estimated coefficients for individual groups using the contrast function:
groupL2push.vs.pull <- contrast(fitG,
list(group = "L2push"),
list(group = "L2pull"))
groupL2push.vs.pull
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## 1 -0.618 0.0695372 -0.7548951 -0.4811049 -8.89 274 0
coefs[4] - coefs[3]
## groupL2push
## -0.618
We can also make pair-wise comparisons of the estimated push vs. pull difference across leg pairs. For example, if we want to compare the push vs. pull difference in leg pair L3 vs. leg pair L2:
\[ (\mbox{L3push} - \mbox{L3pull}) - (\mbox{L2push} - \mbox{L2pull}) \]
\[ = \mbox{L3push} + \mbox{L2pull} - \mbox{L3pull} - \mbox{L2push} \]
C <- matrix(c(0,0,1,-1,-1,1,0,0), 1)
library(multcomp)
groupL3vsL2interaction <- glht(fitG, linfct=C)
summary(groupL3vsL2interaction)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = friction ~ 0 + group, data = spider)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.27988 0.07893 -3.546 0.00046 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
names(coefs)
## [1] "groupL1pull" "groupL1push" "groupL2pull" "groupL2push" "groupL3pull"
## [6] "groupL3push" "groupL4pull" "groupL4push"
(coefs[6] - coefs[5]) - (coefs[4] - coefs[3])
## groupL3push
## -0.2798846
If an experiment is designed incorrectly we may not be able to estimate the parameters of interest. Similarly, when analyzing data we may incorrectly decide to use a model that can’t be fit. If we are using linear models then we can detect these problems mathematically by looking for collinearity in the design matrix.
The following system of equations:
\[ \begin{align*} a+c &=1\\ b-c &=1\\ a+b &=2 \end{align*} \]
has more than one solution since there are an infinite number of triplets that satisfy \(a=1-c, b=1+c\). Two examples are \(a=1,b=1,c=0\) and \(a=0,b=2,c=1\).
The system of equations above can be written like this:
\[ \, \begin{pmatrix} 1&0&1\\ 0&1&-1\\ 1&1&0\\ \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ 2 \end{pmatrix} \]
Note that the third column is a linear combination of the first two:
\[ \, \begin{pmatrix} 1\\ 0\\ 1 \end{pmatrix} + -1 \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} = \begin{pmatrix} 1\\ -1\\ 0 \end{pmatrix} \]
We say that the third column is collinear with the first 2. This implies that the system of equations can be written like this:
\[ \, \begin{pmatrix} 1&0&1\\ 0&1&-1\\ 1&1&0 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = a \begin{pmatrix} 1\\ 0\\ 1 \end{pmatrix} + b \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} + c \begin{pmatrix} 1-0\\ 0-1\\ 1-1 \end{pmatrix} \]
\[ =(a+c) \begin{pmatrix} 1\\ 0\\ 1\\ \end{pmatrix} + (b-c) \begin{pmatrix} 0\\ 1\\ 1\\ \end{pmatrix} \]
The third column does not add a constraint and what we really have are three equations and two unknowns: \(a+c\) and \(b-c\). Once we have values for those two quantities, there is an infinite number of triplets that can be used.
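We can confirm this in R (a check added here; the qr function used below is discussed later in this section): the matrix of the system above has rank 2, not 3.
A <- cbind(c(1,0,1), c(0,1,1), c(1,-1,0))  ## the three columns of the system above
qr(A)$rank  ## 2: the third column is a linear combination of the first two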
Consider a design matrix \(\mathbf{X}\) with two collinear columns. Here we create an extreme example in which one column is the opposite of another:
\[ \mathbf{X} = \begin{pmatrix} \mathbf{1}&\mathbf{X}_1&\mathbf{X}_2&\mathbf{X}_3\\ \end{pmatrix} \mbox{ with, say, } \mathbf{X}_3 = - \mathbf{X}_2 \]
This means that we can rewrite the residuals like this:
\[ \mathbf{Y}- \left\{ \mathbf{1}\beta_0 + \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 + \mathbf{X}_3\beta_3\right\}\\ = \mathbf{Y}- \left\{ \mathbf{1}\beta_0 + \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 - \mathbf{X}_2\beta_3\right\}\\ = \mathbf{Y}- \left\{\mathbf{1}\beta_0 + \mathbf{X}_1 \beta_1 + \mathbf{X}_2(\beta_2 - \beta_3)\right\} \]
and if \(\hat{\beta}_1\), \(\hat{\beta}_2\), \(\hat{\beta}_3\) is a least squares solution, then, for example, \(\hat{\beta}_1\), \(\hat{\beta}_2+1\), \(\hat{\beta}_3+1\) is also a solution.
Now we will demonstrate how collinearity helps us determine problems with our design using one of the most common errors made in current experimental design: confounding. To illustrate, let’s use an imagined experiment in which we are interested in the effect of four treatments A, B, C and D. We assign two mice to each treatment. After starting the experiment by giving A and B to female mice, we realize there might be a sex effect. We decide to give C and D to males with hopes of estimating this effect. But can we estimate the sex effect? The described design implies the following design matrix:
\[ \, \begin{pmatrix} Sex & A & B & C & D\\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 1\\ \end{pmatrix} \]
Here we can see that sex and treatment are confounded. Specifically, the sex column can be written as a linear combination of the C and D columns.
\[ \, \begin{pmatrix} Sex \\ 0\\ 0 \\ 0 \\ 0 \\ 1\\ 1\\ 1 \\ 1 \\ \end{pmatrix} = \begin{pmatrix} C \\ 0\\ 0\\ 0\\ 0\\ 1\\ 1\\ 0\\ 0\\ \end{pmatrix} + \begin{pmatrix} D \\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 1\\ 1\\ \end{pmatrix} \]
This implies that a unique least squares estimate is not achievable.
The rank of a matrix is the number of columns that are linearly independent of all the others. If the rank is smaller than the number of columns, then the LSE is not unique. In R, we can obtain the rank of a matrix with the function qr, which we will describe in more detail in a following section.
Sex <- c(0,0,0,0,1,1,1,1)
A <- c(1,1,0,0,0,0,0,0)
B <- c(0,0,1,1,0,0,0,0)
C <- c(0,0,0,0,1,1,0,0)
D <- c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)
X
## Sex A B C D
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## 3 0 0 1 0 0
## 4 0 0 1 0 0
## 5 1 0 0 1 0
## 6 1 0 0 1 0
## 7 1 0 0 0 1
## 8 1 0 0 0 1
## attr(,"assign")
## [1] 1 2 3 4 5
cat("ncol=",ncol(X),"rank=", qr(X)$rank,"\n")
## ncol= 5 rank= 4
Here we will not be able to estimate the effect of sex.
This particular experiment could have been designed better. Using the same number of male and female mice, we can easily design an experiment that allows us to compute the sex effect as well as all the treatment effects. Specifically, when we balance sex and treatments, the confounding is removed as demonstrated by the fact that the rank is now the same as the number of columns:
Sex <- c(0,1,0,1,0,1,0,1)
A <- c(1,1,0,0,0,0,0,0)
B <- c(0,0,1,1,0,0,0,0)
C <- c(0,0,0,0,1,1,0,0)
D <- c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)
cat("ncol=",ncol(X),"rank=", qr(X)$rank,"\n")
## ncol= 5 rank= 5
To answer the first question, consider the following design matrices:
Collinearity Exercises #1
Which of the above design matrices does NOT have the problem of collinearity?
#You can check in R, the rank of the E matrix is equal to the number of columns, so all of the columns are independent.
m = matrix(c(1,1,1,1,0,0,1,1,0,1,0,1,0,0,0,1),4,4)
m
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 1 0 1 0
## [3,] 1 1 0 0
## [4,] 1 1 1 1
qr(m)$rank
## [1] 4
A:E
Let’s use the example from the lecture to visualize how there is not a single best beta-hat, when the design matrix has collinearity of columns.
An example can be made with:
sex <- factor(rep(c("female","male"),each=4))
trt <- factor(c("A","A","B","B","C","C","D","D"))
#The model matrix can then be formed with:
X <- model.matrix( ~ sex + trt)
#And we can see that the number of independent columns is less than the number of columns of X:
qr(X)$rank
## [1] 4
#Suppose we observe some outcome, Y. For simplicity we will use synthetic data:
Y <- 1:8
Now we will fix the values of two betas and optimize the remaining betas. We will fix \(\beta_{male}\) and \(\beta_D\), and then find the values of the remaining betas that minimize \(\sum_i (Y_i - (\mathbf{X}\boldsymbol{\beta})_i)^2\).
The optimal values for the other betas are those that minimize
\[ \sum \left( (Y - X_{male}\,\beta_{male} - X_D\,\beta_D) - X_R\,\boldsymbol{\beta}_R \right)^2 \]
where \(X_{male}\) is the male column of the design matrix, \(X_D\) is the D column, and \(X_R\) holds the remaining columns.
So all we need to do is redefine \(Y^* = Y - X_{male}\,\beta_{male} - X_D\,\beta_D\) and fit a linear model. The following line of code creates this variable \(Y^*\), after fixing \(\beta_{male}\) to a value a, and \(\beta_D\) to a value b:
makeYstar <- function(a,b) Y - X[,2] * a - X[,5] * b
#Now we'll construct a function which, for a given value a and b, gives us back the the sum of squared residuals after fitting the other terms.
fitTheRest <- function(a,b) {
Ystar <- makeYstar(a,b)
Xrest <- X[,-c(2,5)]
betarest <- solve(t(Xrest) %*% Xrest) %*% t(Xrest) %*% Ystar
residuals <- Ystar - Xrest %*% betarest
sum(residuals^2)
}
Collinearity Exercises #2
What is the sum of squared residuals when the male coefficient is 1 and the D coefficient is 2, and the other coefficients are fit using the linear model solution?
fitTheRest(1,2)
## [1] 11
A:11
We can apply our function fitTheRest to a grid of values for beta_male and beta_D, using the expand.grid function in R. expand.grid takes two vectors and returns a data frame whose rows contain all possible combinations. Try it out:
expand.grid(1:3,1:3)
## Var1 Var2
## 1 1 1
## 2 2 1
## 3 3 1
## 4 1 2
## 5 2 2
## 6 3 2
## 7 1 3
## 8 2 3
## 9 3 3
#We can run fitTheRest on a grid of values, applying it to each row of the grid:
betas = expand.grid(-2:8,-2:8)
rss = apply(betas,1,function(x) fitTheRest(x[1],x[2]))
Collinearity Exercises #3
Which of the following pairs of values minimizes the RSS?
## Note that all pairs add to 6
themin= min(rss)
betas[which(rss==themin),]
## Var1 Var2
## 11 8 -2
## 21 7 -1
## 31 6 0
## 41 5 1
## 51 4 2
## 61 3 3
## 71 2 4
## 81 1 5
## 91 0 6
## 101 -1 7
## 111 -2 8
A:All of the above. There is no single minimum.
It's fairly clear from just looking at the numbers, but we can also plot the pairs of values that attain the minimum RSS over our grid:
library(rafalib)
## plot the pairs that attain the minimum
themin=min(rss)
plot(betas[which(rss==themin),])
There is clearly not a single beta which optimizes the least squares equation, due to collinearity, but an infinite line of solutions which produce an identical sum of squares value.
We have seen that in order to calculate the LSE, we need to invert a matrix. In previous sections we used the function solve. However, solve is not numerically stable for this purpose. When coding the LSE computation, we use the QR decomposition.
Remember that to minimize the RSS:
\[ (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^\top (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) \]
We need to solve:
\[ \mathbf{X}^\top \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^\top \mathbf{Y} \]
The solution is:
\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
Thus, we need to compute \((\mathbf{X}^\top \mathbf{X})^{-1}\).
The solve function is numerically unstable. To demonstrate what we mean by numerically unstable, we construct an extreme case:
n <- 50;M <- 500
x <- seq(1,M,len=n)
X <- cbind(1,x,x^2,x^3)
colnames(X) <- c("Intercept","x","x2","x3")
beta <- matrix(c(1,1,1,1),4,1)
set.seed(1)
y <- X%*%beta+rnorm(n,sd=1)
plot(x,y)
The standard R function for inverse gives an error:
solve(crossprod(X)) %*% crossprod(X,y)
To see why this happens, look at \((\mathbf{X}^\top \mathbf{X})\):
options(digits=4)
log10(crossprod(X))
## Intercept x x2 x3
## Intercept 1.699 4.098 6.625 9.203
## x 4.098 6.625 9.203 11.810
## x2 6.625 9.203 11.810 14.434
## x3 9.203 11.810 14.434 17.070
Note the difference of several orders of magnitude between the entries. On a digital computer, we have a limited range of numbers that can be represented precisely. This makes some numbers seem like 0 when we also have to consider very large numbers, which in turn leads to divisions that are effectively divisions by 0 and produce errors.
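As an aside (not in the original text), one way to quantify this instability is the condition number of \(\mathbf{X}^\top \mathbf{X}\). Base R's kappa function returns an estimate; a very large value warns that inverting with solve will be unreliable:
# sketch: an enormous condition number signals that X'X is close to singular
kappa(crossprod(X))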
The QR factorization is based on a mathematical result that tells us that we can decompose any full rank \(N\times p\) matrix \(\mathbf{X}\) as:
\[ \mathbf{X = QR} \]
with \(\mathbf{Q}\) an \(N\times p\) matrix satisfying \(\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}\), and \(\mathbf{R}\) a \(p\times p\) upper triangular matrix.
Upper triangular matrices are very convenient for solving systems of equations.
In the example below, the matrix on the left is upper triangular: it only has 0s below the diagonal. This facilitates solving the system of equations greatly:
\[ \, \begin{pmatrix} 1&2&-1\\ 0&1&2\\ 0&0&1\\ \end{pmatrix} \begin{pmatrix} a\\ b\\ c\\ \end{pmatrix} = \begin{pmatrix} 6\\ 4\\ 1\\ \end{pmatrix} \]
We immediately know that \(c=1\), which implies that \(b+2=4\) and thus \(b=2\). This in turn implies \(a+4-1=6\), so \(a = 3\). Writing an algorithm to do this is straightforward for any upper triangular matrix.
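In R, this back-substitution is implemented by the backsolve function. As a quick sketch using the system above, it recovers \(a=3\), \(b=2\), \(c=1\):
# sketch: solve the upper triangular system above by back-substitution
R_example <- rbind(c(1, 2, -1),
                   c(0, 1, 2),
                   c(0, 0, 1))
backsolve(R_example, c(6, 4, 1)) # returns 3 2 1, i.e. a, b, c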
If we rewrite the equations of the LSE using \(\mathbf{QR}\) instead of \(\mathbf{X}\) we have:
\[\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{Y}\]
\[(\mathbf{Q}\mathbf{R})^\top (\mathbf{Q}\mathbf{R}) \boldsymbol{\beta} = (\mathbf{Q}\mathbf{R})^\top \mathbf{Y}\]
\[\mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{Y}\]
\[\mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{Y}\]
\[(\mathbf{R}^\top)^{-1} \mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = (\mathbf{R}^\top)^{-1} \mathbf{R}^\top \mathbf{Q}^\top \mathbf{Y}\]
\[\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{Y}\]
\(\mathbf{R}\) being upper triangular makes solving this system more stable. Also, because \(\mathbf{Q}^\top\mathbf{Q}=\mathbf{I}\), we know that the columns of \(\mathbf{Q}\) are on the same scale, which stabilizes the right-hand side.
Now we are ready to find the LSE using the QR decomposition. To solve:
\[\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{Y}\]
We use backsolve, which takes advantage of the upper triangular nature of \(\mathbf{R}\):
QR <- qr(X)
Q <- qr.Q( QR )
R <- qr.R( QR )
(betahat <- backsolve(R, crossprod(Q,y) ) )
## [,1]
## [1,] 0.9038
## [2,] 1.0066
## [3,] 1.0000
## [4,] 1.0000
In practice, we do not need to do any of this due to the built-in solve.qr function:
QR <- qr(X)
(betahat <- solve.qr(QR, y))
## [,1]
## Intercept 0.9038
## x 1.0066
## x2 1.0000
## x3 1.0000
This factorization also simplifies the calculation for fitted values:
\[\mathbf{X}\boldsymbol{\hat{\beta}} = (\mathbf{QR})\mathbf{R}^{-1}\mathbf{Q}^\top \mathbf{y}= \mathbf{Q}\mathbf{Q}^\top\mathbf{y} \]
In R, we simply do the following:
library(rafalib)
mypar(1,1)
plot(x,y)
fitted <- tcrossprod(Q)%*%y
lines(x,fitted,col=2)
To obtain the standard errors of the LSE, we note that:
\[(\mathbf{X^\top X})^{-1} = (\mathbf{R^\top Q^\top QR})^{-1} = (\mathbf{R^\top R})^{-1}\]
The function chol2inv is specifically designed to find this inverse. So all we do is the following:
df <- length(y) - QR$rank
sigma2 <- sum((y-fitted)^2)/df
varbeta <- sigma2*chol2inv(qr.R(QR))
SE <- sqrt(diag(varbeta))
cbind(betahat,SE)
## SE
## Intercept 0.9038 4.508e-01
## x 1.0066 7.858e-03
## x2 1.0000 3.662e-05
## x3 1.0000 4.802e-08
This gives us identical results to the lm function.
summary(lm(y~0+X))$coef
## Estimate Std. Error t value Pr(>|t|)
## XIntercept 0.9038 4.508e-01 2.005e+00 5.089e-02
## Xx 1.0066 7.858e-03 1.281e+02 2.171e-60
## Xx2 1.0000 3.662e-05 2.731e+04 1.745e-167
## Xx3 1.0000 4.802e-08 2.082e+07 4.559e-300
We will use the spider dataset to try out the QR decomposition as a solution to linear models. Load the full spider dataset by using the code from the Interactions and Contrasts book page. Run a linear model of the friction coefficient with two variables (no interactions):
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv"
filename <- "spider_wolff_gorb_2013.csv"
library(downloader)
if (!file.exists(filename)) download(url, filename)
spider <- read.csv("spider_wolff_gorb_2013.csv", skip=1)
head(spider)
## leg type friction
## 1 L1 pull 0.90
## 2 L1 pull 0.91
## 3 L1 pull 0.86
## 4 L1 pull 0.85
## 5 L1 pull 0.80
## 6 L1 pull 0.87
fit <- lm(friction ~ type + leg, data=spider)
# The solution we want to reproduce with the QR decomposition is:
betahat <- coef(fit)
So for our matrix work, we define:
Y <- matrix(spider$friction, ncol=1)
X <- model.matrix(~ type + leg, data=spider)
In the material on QR decomposition, we saw that the solution for \(\boldsymbol{\beta}\) satisfies:
\[\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{Y}\]
QR Exercises #1
What is the first row, first column element in the Q matrix for this linear model?
Q = qr.Q(qr(X))
head(Q)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
## [2,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
## [3,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
## [4,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
## [5,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
## [6,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
Q[1,1]
## [1] -0.05955
A:-0.05954913
QR Exercises #2
What is the first row, first column element in the R matrix for this linear model?
R = qr.R(qr(X))
head(R)
## (Intercept) typepush legL2 legL3 legL4
## 1 -16.79 -8.396 -1.786e+00 -6.193e+00 -4.764e+00
## 2 0.00 8.396 1.110e-16 -3.442e-15 -4.330e-15
## 3 0.00 0.000 5.178e+00 -2.137e+00 -1.644e+00
## 4 0.00 0.000 0.000e+00 7.815e+00 -4.225e+00
## 5 0.00 0.000 0.000e+00 0.000e+00 6.063e+00
R[1,1]
## [1] -16.79
A:-16.79286
QR Exercises #3
What is the first row, first column element of \(\mathbf{Q}^\top \mathbf{Y}\)? (Use crossprod() as in the book page.)
crossprod(qr.Q(qr(X)),Y)
## [,1]
## [1,] -13.79873
## [2,] -6.54088
## [3,] 0.08477
## [4,] 0.06578
## [5,] 1.70568
A:-13.79872520
Finally, convince yourself that the QR decomposition gives the least squares solution by putting all the pieces together: compare \(\mathbf{R}^{-1} (\mathbf{Q}^\top \mathbf{Y})\) to betahat, as in the sketch below.
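A minimal sketch of this comparison, assuming the Y, Q, R, and betahat objects created above (backsolve avoids explicitly inverting \(\mathbf{R}\)):
# sketch: the QR-based solution should match the coefficients from lm
cbind(backsolve(R, crossprod(Q, Y)), betahat)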
Linear models can be extended in many directions. Here are some examples of extensions, which you might come across in analyzing data in the life sciences:
In calculating the solution and its estimated error in the standard linear model, we minimize the squared errors. This involves a sum of squares from all the data points, which means that a few outlier data points can have a large influence on the solution. In addition, the errors are assumed to have constant variance (called homoskedasticity), which might not always hold true (when this is not true, it is called heteroskedasticity). Therefore, methods have been developed to generate more robust solutions, which behave well in the presence of outliers, or when the distributional assumptions are not met. A number of these are mentioned on the robust statistics page on the CRAN website. For more background, there is also a Wikipedia article with references.
In the standard linear model, we did not make any assumptions about the distribution of \(\mathbf{Y}\), though in some cases we can gain better estimates if we know that \(\mathbf{Y}\) is, for example, restricted to non-negative integers \(0,1,2,\dots\), or restricted to the interval \([0,1]\). A framework for analyzing such cases is referred to as generalized linear models, commonly abbreviated as GLMs. The two key components of the GLM are the link function and a probability distribution. The link function \(g\) connects our familiar matrix product \(\mathbf{X} \boldsymbol{\beta}\) to the \(\mathbf{Y}\) values through:
\[ \textrm{E}(\mathbf{Y}) = g^{-1}( \mathbf{X} \boldsymbol{\beta} ) \]
R includes the function glm, which fits GLMs using a formula interface similar to that of lm. Additional arguments include family, which can be used to specify the distributional assumption for \(\mathbf{Y}\). Some examples of the use of GLMs are shown at the Quick R website. There are a number of references for GLMs on the Wikipedia page.
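As a small illustration (a sketch with simulated data, not from the text): a Poisson GLM uses a log link, so the fitted mean is \(\exp(\mathbf{X}\boldsymbol{\beta})\):
# sketch: simulate counts whose log mean is linear in a covariate, then fit a Poisson GLM
set.seed(1)
x_sim <- runif(100)
counts <- rpois(100, lambda = exp(1 + 2 * x_sim))
fit_glm <- glm(counts ~ x_sim, family = poisson) # family specifies the distribution and link
summary(fit_glm)$coef # estimates should be close to 1 and 2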
In the standard linear model, we assumed that the matrix \(\mathbf{X}\) was fixed and not random. For example, we measured the frictional coefficients for each leg pair and in the push and pull directions. The fact that an observation had a \(1\) for a given column in \(\mathbf{X}\) was not random, but dictated by the experimental design. However, in the father and son heights example, we did not fix the values of the fathers’ heights, but observed these (and likely these were measured with some error). A framework for studying the effect of randomness in various columns of \(\mathbf{X}\) is referred to as mixed effects models, which implies that some effects are fixed and some effects are random. One of the most popular packages in R for fitting linear mixed effects models is lme4, which has an accompanying paper on Fitting Linear Mixed-Effects Models using lme4. There is also a Wikipedia page with more references.
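For a flavor of the syntax (a hypothetical sketch with simulated data, assuming the lme4 package is installed), a random intercept for each group is written as (1 | group):
# sketch: a per-group random intercept plus a fixed slope
library(lme4)
set.seed(1)
group <- factor(rep(1:6, each = 10))
x_sim <- runif(60)
y_sim <- 1 + 2 * x_sim + rnorm(6, sd = 0.5)[group] + rnorm(60, sd = 0.1)
fit_lmm <- lmer(y_sim ~ x_sim + (1 | group))
summary(fit_lmm)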
The approach presented here assumed \(\boldsymbol{\beta}\) was a fixed (non-random) parameter. We presented methodology that estimates this parameter, along with standard errors that quantify uncertainty in the estimation process. This is referred to as the frequentist approach. An alternative approach is to assume that \(\boldsymbol{\beta}\) is random and that its distribution quantifies our prior beliefs about what \(\boldsymbol{\beta}\) should be. Once we have observed data, we update our prior beliefs by computing the conditional distribution, referred to as the posterior distribution, of \(\boldsymbol{\beta}\) given the data. This is referred to as the Bayesian approach. For example, once we have computed the posterior distribution of \(\boldsymbol{\beta}\), we can report the most likely outcome or an interval that occurs with high probability (a credible interval). In addition, many models can be connected together in what is referred to as a hierarchical model. Note that we provide a brief introduction to Bayesian statistics and hierarchical models in a later chapter. A good reference for Bayesian hierarchical models is Bayesian Data Analysis, and some software for computing Bayesian linear models can be found on the Bayes page on CRAN. Some well-known software packages for computing Bayesian models are stan and BUGS.
Note that if we include enough parameters in a model we can achieve a residual sum of squares of 0. Penalized linear models introduce a penalty term into the least squares equation we minimize. These penalties are typically of the form \(\lambda \sum_{j=1}^p \|\beta_j\|^k\), and they penalize large absolute values of \(\beta\) as well as large numbers of parameters. The motivation for this extra term is to avoid over-fitting. To use these models, we need to pick \(\lambda\), which determines how much we penalize. When \(k=2\), this is referred to as ridge regression, Tikhonov regularization, or L2 regularization. When \(k=1\), this is referred to as the LASSO or L1 regularization. A good reference for these penalized linear models is the Elements of Statistical Learning textbook, which is available as a free pdf. Some R packages which implement penalized linear models are the lm.ridge function in the MASS package, the lars package, and the glmnet package.
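A minimal sketch of the glmnet interface (simulated data, assuming the glmnet package is installed); alpha = 1 gives the LASSO (L1) penalty and alpha = 0 gives ridge (L2):
# sketch: fit a LASSO model and pick the penalty lambda by cross-validation
library(glmnet)
set.seed(1)
x_mat <- matrix(rnorm(100 * 20), 100, 20) # 20 candidate predictors
y_sim <- drop(x_mat[, 1:3] %*% c(3, 2, 1)) + rnorm(100) # only the first 3 matter
cvfit <- cv.glmnet(x_mat, y_sim, alpha = 1)
coef(cvfit, s = "lambda.min") # most other coefficients are shrunk toward 0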
devtools::session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.2.4 (2016-03-10)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_India.1252
## tz Asia/Calcutta
## date 2016-03-20
## Packages ------------------------------------------------------------------
## package * version date source
## acepack 1.3-3.3 2013-05-03 CRAN (R 3.2.0)
## assertthat 0.1 2013-12-06 CRAN (R 3.2.1)
## cluster 2.0.3 2015-07-21 CRAN (R 3.2.4)
## codetools 0.2-14 2015-07-15 CRAN (R 3.2.4)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.1)
## contrast * 0.19 2013-11-16 CRAN (R 3.2.3)
## DBI 0.3.1 2014-09-24 CRAN (R 3.2.1)
## devtools 1.10.0 2016-01-23 CRAN (R 3.2.3)
## digest 0.6.9 2016-01-08 CRAN (R 3.2.3)
## downloader * 0.4 2015-07-09 CRAN (R 3.2.2)
## dplyr * 0.4.3 2015-09-01 CRAN (R 3.2.2)
## evaluate 0.8.3 2016-03-05 CRAN (R 3.2.3)
## foreign 0.8-66 2015-08-19 CRAN (R 3.2.2)
## formatR 1.3 2016-03-05 CRAN (R 3.2.4)
## Formula * 1.2-1 2015-04-07 CRAN (R 3.2.0)
## ggplot2 * 2.1.0 2016-03-01 CRAN (R 3.2.3)
## gridExtra 2.2.1 2016-02-29 CRAN (R 3.2.3)
## gtable 0.2.0 2016-02-26 CRAN (R 3.2.3)
## HistData * 0.7-6 2015-10-16 CRAN (R 3.2.2)
## Hmisc * 3.17-2 2016-02-21 CRAN (R 3.2.3)
## htmltools 0.3 2015-12-29 CRAN (R 3.2.3)
## knitr * 1.12.3 2016-01-22 CRAN (R 3.2.3)
## labeling 0.3 2014-08-23 CRAN (R 3.2.0)
## lattice * 0.20-33 2015-07-14 CRAN (R 3.2.4)
## latticeExtra 0.6-28 2016-02-09 CRAN (R 3.2.3)
## lazyeval 0.1.10 2015-01-02 CRAN (R 3.2.1)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.1)
## MASS * 7.3-45 2015-11-10 CRAN (R 3.2.4)
## Matrix 1.2-4 2016-03-02 CRAN (R 3.2.4)
## MatrixModels 0.4-1 2015-08-22 CRAN (R 3.2.2)
## memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
## multcomp * 1.4-4 2016-02-17 CRAN (R 3.2.3)
## munsell 0.4.3 2016-02-13 CRAN (R 3.2.3)
## mvtnorm * 1.0-5 2016-02-02 CRAN (R 3.2.3)
## nlme 3.1-126 2016-03-14 CRAN (R 3.2.4)
## nnet 7.3-12 2016-02-02 CRAN (R 3.2.3)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.1)
## polspline 1.1.12 2015-07-14 CRAN (R 3.2.3)
## quantreg 5.21 2016-02-13 CRAN (R 3.2.3)
## R6 2.1.2 2016-01-26 CRAN (R 3.2.3)
## rafalib * 1.0.0 2015-08-09 CRAN (R 3.2.2)
## RColorBrewer * 1.1-2 2014-12-07 CRAN (R 3.2.0)
## Rcpp 0.12.3 2016-01-10 CRAN (R 3.2.3)
## rmarkdown 0.9.5 2016-02-22 CRAN (R 3.2.3)
## rms * 4.4-2 2016-02-21 CRAN (R 3.2.3)
## rpart 4.1-10 2015-06-29 CRAN (R 3.2.1)
## sandwich * 2.3-4 2015-09-24 CRAN (R 3.2.2)
## scales 0.4.0 2016-02-26 CRAN (R 3.2.3)
## SparseM * 1.7 2015-08-15 CRAN (R 3.2.2)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.2)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.1)
## survival * 2.38-3 2015-07-02 CRAN (R 3.2.4)
## TH.data * 1.0-7 2016-01-28 CRAN (R 3.2.3)
## UsingR * 2.0-5 2015-08-06 CRAN (R 3.2.2)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.1)
## zoo 1.7-12 2015-03-16 CRAN (R 3.2.1)