In this module we try to minimize mathematical notation as much as possible. Furthermore, we avoid using calculus to motivate statistical concepts. However, Matrix Algebra (also referred to as Linear Algebra) and its mathematical notation greatly facilitate the exposition of the advanced data analysis techniques covered in the remainder of this book. We therefore dedicate a chapter of this book to introducing Matrix Algebra. We do this in the context of data analysis and using one of the main applications: Linear Models.
We will describe three examples from the life sciences: one from physics, one related to genetics, and one from a mouse experiment. They are very different, yet we end up using the same statistical technique: fitting linear models. Linear models are typically taught and described in the language of matrix algebra.
Imagine you are Galileo in the 16th century trying to describe the velocity of a falling object. An assistant climbs the Tower of Pisa and drops a ball, while several other assistants record the position at different times. Let’s simulate some data using the equations we know today and adding some measurement error:
set.seed(1)
g <- 9.8 ##meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, note: we use tt because t is a base function
d <- 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1) ##meters
The assistants hand the data to Galileo and this is what he sees:
mypar()
plot(tt,d,ylab="Distance in meters",xlab="Time in seconds")
Simulated data for distance travelled versus time of falling object measured with error.
He does not know the exact equation, but by looking at the plot above he deduces that the position should follow a parabola. So he models the data with:
\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, i=1,\dots,n \]
With \(Y_i\) representing location, \(x_i\) representing the time, and \(\varepsilon_i\) accounting for measurement error. This is a linear model because it is a linear combination of known quantities (the \(x\)s), referred to as predictors or covariates, and unknown parameters (the \(\beta\)s).
Now imagine you are Francis Galton in the 19th century and you collect paired height data from fathers and sons. You suspect that height is inherited. Your data:
data(father.son,package="UsingR")
x=father.son$fheight
y=father.son$sheight
looks like this:
plot(x,y,xlab="Father's height",ylab="Son's height")
Galton’s data. Son heights versus father heights.
The sons’ heights do seem to increase linearly with the fathers’ heights. In this case, a model that describes the data is as follows:
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1,\dots,N \]
This is also a linear model with \(x_i\) and \(Y_i\), the father and son heights respectively, for the \(i\)-th pair and \(\varepsilon\) a term to account for the extra variability. Here we think of the fathers’ heights as the predictor and being fixed (not random) so we use lower case. Measurement error alone can’t explain all the variability seen in \(\varepsilon\). This makes sense as there are other variables not in the model, for example, mothers’ heights, genetic randomness, and environmental factors.
Here we read-in mouse body weight data from mice that were fed two different diets: high fat and control (chow). We have a random sample of 12 mice for each. We are interested in determining if the diet has an effect on weight. Here is the data:
dat <- read.csv("femaleMiceWeights.csv")
mypar(1,1)
stripchart(Bodyweight~Diet,data=dat,vertical=TRUE,method="jitter",pch=1,main="Mice weights")
Mouse weights under two diets.
We want to estimate the difference in average weight between populations. We demonstrated how to do this using t-tests and confidence intervals, based on the difference in sample averages. We can obtain the same exact results using a linear model:
\[ Y_i = \beta_0 + \beta_1 x_{i} + \varepsilon_i\]
with \(\beta_0\) the chow diet average weight, \(\beta_1\) the difference between averages, \(x_i = 1\) when mouse \(i\) gets the high fat (hf) diet, \(x_i = 0\) when it gets the chow diet, and \(\varepsilon_i\) explains the differences between mice of the same population.
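As a preview, here is a minimal sketch of how the lm function fits this model in R (it assumes the dat object read in above, with columns Bodyweight and Diet); lm builds the 0/1 indicator \(x_i\) from the Diet column automatically:
##sketch: lm() creates the 0/1 dummy variable from the Diet factor
fit <- lm(Bodyweight~Diet, data=dat)
coef(fit) ##first entry: chow average; second entry: hf minus chow difference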
Introduction Exercises #1

If you haven't done so already, install the UsingR package with install.packages("UsingR").
Then once you load it you have access to Galton’s father and son heights:
library(UsingR)
data("father.son",package="UsingR")
head(father.son)
## fheight sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
Q:What is the average height of the sons (don’t round off)?
mean(father.son$sheight)
## [1] 68.68407
A:68.68407
Introduction Exercises #2
One of the defining features of regression is that we stratify one variable based on others. In Statistics we use the verb “condition”. For example, the linear model for son and father heights answers the question how tall do I expect a son to be if I condition on his father being x inches. The regression line answers this question for any x.
Using the father.son dataset described above, we want to know the expected height of sons if we condition on the father being 71 inches. Create a list of son heights for sons that have fathers with heights of 71 inches (round to the nearest inch).
What is the mean of the son heights for fathers that have a height of 71 inches (don’t round off your answer)? (Hint: use the function round() on the fathers’ heights)
fatherht<-round(father.son$fheight,0)
fathersonht<-cbind(fatherht,father.son)
suppressMessages(library(dplyr))
sonht<-filter(fathersonht,fatherht==71) %>% select(sheight) %>% unlist
mean(sonht)
## [1] 70.54082
sonht<-fathersonht %>% filter(fatherht==71) %>% select(sheight) %>% unlist
mean(sonht)
## [1] 70.54082
#or
#filter(father.son,round(fheight)==71) %>% summarize(mean(sheight))
#or
mean(father.son$sheight[round(father.son$fheight)==71])
## [1] 70.54082
A:70.54082
Introduction Exercises #3
We say a statistical model is a linear model when we can write it as a linear combination of parameters and known covariates plus random error terms. In the choices below, Y represents our observations, time t is our only covariate, unknown parameters are represented with letters a,b,c,d and measurement error is represented by the letter e. Note that if t is known, then any transformation of t is also known. So, for example, both Y=a+bt+e and Y=a+b f(t)+e are linear models. Which of the following can't be written as a linear model?
Y = a + bt + e
Y = a + b cos(t) + e
Y = a + b^t + e
Y = a + b t + c t^2 + d t^3 + e
A:Y = a + b^t + e
EXPLANATION:In every other case we can write the model as linear combination of parameters and known covariates. b^t is not a linear combination of b and t.
Introduction Exercises #4
Suppose you model the relationship between weight and height across individuals with a linear model. You assume that the weight of individuals for a fixed height x follows a linear model Y = a + b x + e. Which of the following do you feel best describes what e represents?
A:Between individual variability: people of the same height vary in their weight
EXPLANATION: Remember the model is across individuals and we fix x. People of the same height can vary greatly in other aspects of their physiology: for example different bone density or differing amounts of muscle and fat.
We have seen three very different examples in which linear models can be used. A general model that encompasses all of the above examples is the following:
\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} + \varepsilon_i, i=1,\dots,n \]
\[ Y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{i,j} + \varepsilon_i, i=1,\dots,n \]
Note that we have a general number of predictors \(p\). Matrix algebra provides a compact language and mathematical framework to compute and make derivations with any linear model that fits into the above framework.
For the models above to be useful we have to estimate the unknown \(\beta\) s. In the first example, we want to describe a physical process for which we can’t have unknown parameters. In the second example, we better understand inheritance by estimating how much, on average, the father’s height affects the son’s height. In the final example, we want to determine if there is in fact a difference: if \(\beta_1 \neq 0\).
The standard approach in science is to find the values that minimize the distance of the fitted model to the data. The following is called the least squares (LS) equation and we will see it often in this chapter:
\[ \sum_{i=1}^n \left\{ Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j x_{i,j}\right)\right\}^2 \]
Once we find the minimum, we will call the values the least squares estimates (LSE) and denote them with \(\hat{\beta}\). The quantity obtained when evaluating the least square equation at the estimates is called the residual sum of squares (RSS). Since all these quantities depend on \(Y\), they are random variables. The \(\hat{\beta}\) s are random variables and we will eventually perform inference on them.
Thanks to my high school physics teacher, I know that the equation for the trajectory of a falling object is:
\[d = h_0 + v_0 t - 0.5 \times 9.8 t^2\]
with \(h_0\) and \(v_0\) the starting height and velocity respectively. The data we simulated above followed this equation, adding measurement error to simulate n observations for dropping the ball \((v_0=0)\) from the Tower of Pisa \((h_0=56.67)\). This is why we used this code to simulate data:
g <- 9.8 ##meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
f <- 56.67 - 0.5*g*tt^2
y <- f + rnorm(n,sd=1)
Here is what the data looks like with the solid line representing the true trajectory:
plot(tt,y,ylab="Distance in meters",xlab="Time in seconds")
lines(tt,f,col=2)
Fitted model for simulated data for distance travelled versus time of falling object measured with error.
But we were pretending to be Galileo and so we don’t know the parameters in the model. The data does suggest it is a parabola, so we model it as such:
\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, i=1,\dots,n \]
How do we find the LSE?
The lm function

In R we can fit this model by simply using the lm function. We will describe this function in detail later, but here is a preview:
tt2 <-tt^2
fit <- lm(y~tt+tt2)
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9733469 0.6059749 94.0193248 3.717816e-30
## tt -0.2088678 0.8254655 -0.2530303 8.025942e-01
## tt2 -4.9066385 0.2345028 -20.9235860 5.175646e-16
It gives us the LSE, as well as standard errors and p-values.
Part of what we do in this section is to explain the mathematics behind this function.
Let’s write a function that computes the RSS for any vector \(\beta\):
rss <- function(Beta0,Beta1,Beta2){
r <- y - (Beta0+Beta1*tt+Beta2*tt^2)
return(sum(r^2))
}
So for any three dimensional vector we get an RSS. Here is a plot of the RSS as a function of \(\beta_2\) when we keep the other two fixed:
Beta2s<- seq(-10,0,len=100)
plot(Beta2s,sapply(Beta2s,rss,Beta0=55,Beta1=0),
ylab="RSS",xlab="Beta2",type="l")
##Let's add another curve fixing another pair:
Beta2s<- seq(-10,0,len=100)
lines(Beta2s,sapply(Beta2s,rss,Beta0=65,Beta1=0),col=2)
Residual sum of squares obtained for several values of the parameters.
Trial and error here is not going to work. Instead, we can use calculus: take the partial derivatives, set them to 0 and solve. Of course, if we have many parameters, these equations can get rather complex. Linear algebra provides a compact and general way of solving this problem.
When studying the father-son data, Galton made a fascinating discovery using exploratory analysis.
He noted that if he tabulated the number of father-son height pairs and followed all the x,y values having the same totals in the table, they formed an ellipse. In the plot above, made by Galton, you see the ellipse formed by the pairs having 3 cases. This then led to modeling this data as correlated bivariate normal which we described earlier:
\[ Pr(X<a,Y<b) = \]
\[ \int_{-\infty}^{a} \int_{-\infty}^{b} \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp{ \left\{ -\frac{1}{2(1-\rho^2)} \left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)+ \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right\} } \, dy \, dx \]
We described how we can use math to show that if you keep \(X\) fixed (condition to be \(x\)) the distribution of \(Y\) is normally distributed with mean \(\mu_y +\sigma_y \rho \left(\frac{x-\mu_x}{\sigma_x}\right)\) and standard deviation \(\sigma_y \sqrt{1-\rho^2}\). Note that \(\rho\) is the correlation between \(Y\) and \(X\), which implies that if we fix \(X=x\), \(Y\) does in fact follow a linear model. The \(\beta_0\) and \(\beta_1\) parameters in our simple linear model can be expressed in terms of \(\mu_x,\mu_y,\sigma_x,\sigma_y\), and \(\rho\).
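As a quick numerical sketch, we can plug the sample versions of \(\mu_x,\mu_y,\sigma_x,\sigma_y\), and \(\rho\) from the father.son data into these expressions and check that they give the same slope and intercept that the least squares fit returns:
##sketch: regression coefficients implied by the bivariate normal formulas
data(father.son,package="UsingR")
x <- father.son$fheight
y <- father.son$sheight
rho <- cor(x,y)
beta1 <- rho*sd(y)/sd(x)         ##slope
beta0 <- mean(y) - beta1*mean(x) ##intercept
c(beta0,beta1)
coef(lm(y~x)) ##same values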
Here we introduce the basics of matrix notation. Initially this may seem over-complicated, but once we discuss examples, you will appreciate the power of using this notation to both explain and derive solutions, as well as implement them as R code.
Linear algebra notation actually simplifies the mathematical descriptions and manipulations of linear models, as well as coding in R. We will discuss the basics of this notation and then show some examples in R.
The main point of this entire exercise is to show how we can write the models above using matrix notation, and then explain how this is useful for solving the least squares equation. We start by simply defining notation and matrix multiplication, but bear with us since we eventually get back to the practical application.
Linear algebra was created by mathematicians to solve systems of linear equations such as this:
\[ \begin{align*} a + b + c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 1 \end{align*} \]
It provides very useful machinery to solve these problems generally. We will learn how we can write and solve this system using matrix algebra notation:
\[ \, \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \implies \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}^{-1} \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \]
This section explains the notation used above. It turns out that we can borrow this notation for linear models in statistics as well.
In the falling object, father-son heights, and mouse weight examples, the random variables associated with the data were represented by \(Y_1,\dots,Y_n\). We can think of this as a vector. In fact, in R we are already doing this:
data(father.son,package="UsingR")
y=father.son$fheight
head(y)
## [1] 65.04851 63.25094 64.95532 65.75250 61.13723 63.02254
In math we can also use just one symbol. We usually use bold to distinguish it from the individual entries:
\[ \mathbf{Y} = \begin{pmatrix} Y_1\\\ Y_2\\\ \vdots\\\ Y_N \end{pmatrix} \]
For reasons that will soon become clear, the default representation of data vectors has dimension \(N\times 1\) as opposed to \(1 \times N\).
Here we don’t always use bold because normally one can tell what is a matrix from the context.
Similarly, we can use math notation to represent the covariates or predictors. In a case with two predictors we can represent them like this:
\[ \mathbf{X}_1 = \begin{pmatrix} x_{1,1}\\ \vdots\\ x_{N,1} \end{pmatrix} \mbox{ and } \mathbf{X}_2 = \begin{pmatrix} x_{1,2}\\ \vdots\\ x_{N,2} \end{pmatrix} \]
Note that for the falling object example \(x_{i,1}= t_i\) and \(x_{i,2}=t_i^2\), with \(t_i\) the time of the i-th observation. Also, keep in mind that vectors can be thought of as \(N\times 1\) matrices.
For reasons that will soon become apparent, it is convenient to represent these in matrices:
\[ \mathbf{X} = [ \mathbf{X}_1 \mathbf{X}_2 ] = \begin{pmatrix} x_{1,1}&x_{1,2}\\ \vdots\\ x_{N,1}&x_{N,2} \end{pmatrix} \]
This matrix has dimension \(N \times 2\). We can create this matrix in R this way:
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
X <- cbind(X1=tt,X2=tt^2)
head(X)
## X1 X2
## [1,] 0.0000000 0.00000000
## [2,] 0.1416667 0.02006944
## [3,] 0.2833333 0.08027778
## [4,] 0.4250000 0.18062500
## [5,] 0.5666667 0.32111111
## [6,] 0.7083333 0.50173611
dim(X)
## [1] 25 2
We can also use this notation to denote an arbitrary number of covariates with the following \(N\times p\) matrix:
\[ \mathbf{X} = \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \]
Just as an example, we show you how to make one in R now using matrix instead of cbind:
N <- 100; p <- 5
X <- matrix(1:(N*p),N,p)
head(X)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 101 201 301 401
## [2,] 2 102 202 302 402
## [3,] 3 103 203 303 403
## [4,] 4 104 204 304 404
## [5,] 5 105 205 305 405
## [6,] 6 106 206 306 406
dim(X)
## [1] 100 5
By default, the matrices are filled column by column. The byrow=TRUE
argument lets us change that to row by row:
N <- 100; p <- 5
X <- matrix(1:(N*p),N,p,byrow=TRUE)
head(X)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
## [5,] 21 22 23 24 25
## [6,] 26 27 28 29 30
Finally, we define a scalar. A scalar is just a number, which we call a scalar because we want to distinguish it from vectors and matrices. We usually use lower case and don’t bold. In the next section, we will understand why we make this distinction.
Matrix Notation Exercises #1
In R we have vectors and matrices. You can create your own vectors with the function c.
c(1,5,3,4)
They are also the output of many functions such as
rnorm(10)
You can turn vectors into matrices using functions such as rbind, cbind or matrix.
Create the matrix from the vector 1:1000 like this:
X = matrix(1:1000,100,10)
What is the entry in row 25, column 3 ?
X = matrix(1:1000,100,10)
X[25,3]
## [1] 225
A:225
Matrix Notation Exercises #2
Using the function cbind, create a 10 x 5 matrix with first column x=1:10. Then columns 2x, 3x, 4x and 5x in columns 2 through 5.
What is the sum of the elements of the 7th row?
x=1:10
M<-cbind(x,2*x,3*x,4*x,5*x)
sum(M[7,])
## [1] 105
A:105
Matrix Notation Exercises #3
Which of the following creates a matrix with multiples of 3 in the third column?
matrix(1:60,20,3)
## [,1] [,2] [,3]
## [1,] 1 21 41
## [2,] 2 22 42
## [3,] 3 23 43
## [4,] 4 24 44
## [5,] 5 25 45
## [6,] 6 26 46
## [7,] 7 27 47
## [8,] 8 28 48
## [9,] 9 29 49
## [10,] 10 30 50
## [11,] 11 31 51
## [12,] 12 32 52
## [13,] 13 33 53
## [14,] 14 34 54
## [15,] 15 35 55
## [16,] 16 36 56
## [17,] 17 37 57
## [18,] 18 38 58
## [19,] 19 39 59
## [20,] 20 40 60
matrix(1:60,20,3,byrow=TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
## [5,] 13 14 15
## [6,] 16 17 18
## [7,] 19 20 21
## [8,] 22 23 24
## [9,] 25 26 27
## [10,] 28 29 30
## [11,] 31 32 33
## [12,] 34 35 36
## [13,] 37 38 39
## [14,] 40 41 42
## [15,] 43 44 45
## [16,] 46 47 48
## [17,] 49 50 51
## [18,] 52 53 54
## [19,] 55 56 57
## [20,] 58 59 60
x=11:20;rbind(x,2*x,3*x)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## x 11 12 13 14 15 16 17 18 19 20
## 22 24 26 28 30 32 34 36 38 40
## 33 36 39 42 45 48 51 54 57 60
x=1:40;matrix(3*x,20,2)
## [,1] [,2]
## [1,] 3 63
## [2,] 6 66
## [3,] 9 69
## [4,] 12 72
## [5,] 15 75
## [6,] 18 78
## [7,] 21 81
## [8,] 24 84
## [9,] 27 87
## [10,] 30 90
## [11,] 33 93
## [12,] 36 96
## [13,] 39 99
## [14,] 42 102
## [15,] 45 105
## [16,] 48 108
## [17,] 51 111
## [18,] 54 114
## [19,] 57 117
## [20,] 60 120
A:matrix(1:60,20,3,byrow=TRUE)
EXPLANATION:You can make each of the matrices in R and examine them visually. Or you can check whether the third column has all multiples of 3 with all(X[,3]%%3==0). Note that the fourth choice does not even have a 3rd column.
In a previous section, we motivated the use of matrix algebra with this system of equations:
\[ \begin{align*} a + b + c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 1 \end{align*} \]
We described how this system can be rewritten and solved using matrix algebra:
\[ \, \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \implies \begin{pmatrix} a\\ b\\ c \end{pmatrix}= \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}^{-1} \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \]
Having described matrix notation, we will explain the operations we perform with matrices. For example, above we have matrix multiplication and we also have a symbol representing the inverse of a matrix. The importance of these operations and others will become clear once we present specific examples related to data analysis.
We start with one of the simplest operations: scalar multiplication. If \(a\) is a scalar and \(\mathbf{X}\) is a matrix, then:
\[ \mathbf{X} = \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \implies a \mathbf{X} = \begin{pmatrix} a x_{1,1} & \dots & a x_{1,p}\\ a x_{2,1}&\dots & a x_{2,p} \\ & \vdots & \\ a x_{N,1} & \dots & a x_{N,p} \end{pmatrix} \]
R automatically follows this rule when we multiply a number by a matrix using *:
X <- matrix(1:12,4,3)
print(X)
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
a <- 2
print(a*X)
## [,1] [,2] [,3]
## [1,] 2 10 18
## [2,] 4 12 20
## [3,] 6 14 22
## [4,] 8 16 24
The transpose is an operation that simply changes columns to rows. We use a \(\top\) to denote a transpose. The technical definition is as follows: if X is as we defined it above, here is the transpose which will be \(p\times N\):
\[ \mathbf{X} = \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \implies \mathbf{X}^\top = \begin{pmatrix} x_{1,1}&\dots & x_{p,1} \\ x_{1,2}&\dots & x_{p,2} \\ & \vdots & \\ x_{1,N}&\dots & x_{p,N} \end{pmatrix} \]
In R we simply use t:
X <- matrix(1:12,4,3)
X
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
t(X)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
We start by describing the matrix multiplication shown in the original system of equations example:
\[ \begin{align*} a + b + c &=6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 1 \end{align*} \]
What we are doing is multiplying the rows of the first matrix by the columns of the second. Since the second matrix only has one column, we perform this multiplication by doing the following:
\[ \, \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix}= \begin{pmatrix} a + b + c \\ 3a - 2b + c \\ 2a + b - c \end{pmatrix} \]
Here is a simple example. We can check to see if abc=c(3,2,1) is a solution:
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
abc <- c(3,2,1) #use as an example
rbind( sum(X[1,]*abc), sum(X[2,]*abc), sum(X[3,]*abc))
## [,1]
## [1,] 6
## [2,] 6
## [3,] 7
We can use %*% to perform the matrix multiplication and make this much more compact:
X%*%abc
## [,1]
## [1,] 6
## [2,] 6
## [3,] 7
We can see that c(3,2,1) is not a solution as the answer here is not the required c(6,2,1).
To get the solution, we will need to invert the matrix on the left, a concept we learn about below.
Here is the general definition of matrix multiplication of matrices \(A\) and \(X\):
\[ \mathbf{AX} = \begin{pmatrix} a_{1,1} & a_{1,2} & \dots & a_{1,N}\\ a_{2,1} & a_{2,2} & \dots & a_{2,N}\\ & & \vdots & \\ a_{M,1} & a_{M,2} & \dots & a_{M,N} \end{pmatrix} \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \]
\[ = \begin{pmatrix} \sum_{i=1}^N a_{1,i} x_{i,1} & \dots & \sum_{i=1}^N a_{1,i} x_{i,p}\\ & \vdots & \\ \sum_{i=1}^N a_{M,i} x_{i,1} & \dots & \sum_{i=1}^N a_{M,i} x_{i,p} \end{pmatrix} \]
You can only take the product if the number of columns of the first matrix \(A\) equals the number of rows of the second one \(X\). Also, the final matrix has the same number of rows as the first \(A\) and the same number of columns as the second \(X\). After you study the example below, you may want to come back and re-read the sections above.
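To see the definition in action, here is a small sketch in R that checks one entry of a product against the sum in the formula (the matrices A and X below are arbitrary examples):
A <- matrix(1:6,2,3)  ##a 2 x 3 matrix
X <- matrix(1:12,3,4) ##a 3 x 4 matrix: the columns of A match the rows of X
P <- A%*%X            ##the product is 2 x 4
P[2,3]                ##entry in row 2, column 3 of the product
sum(A[2,]*X[,3])      ##the same number, computed from the definition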
The identity matrix is analogous to the number 1: if you multiply the identity matrix by another matrix, you get the same matrix. For this to happen, we need it to be like this:
\[ \mathbf{I} = \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ \vdots &\vdots & \vdots&\ddots&\vdots&\vdots\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix} \]
By this definition, the identity always has to have the same number of rows as columns or be what we call a square matrix.
If you follow the matrix multiplication rule above, you notice this works out:
\[ \mathbf{XI} = \begin{pmatrix} x_{1,1} & \dots & x_{1,p}\\ & \vdots & \\ x_{N,1} & \dots & x_{N,p} \end{pmatrix} \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ & & &\vdots& &\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix} = \begin{pmatrix} x_{1,1} & \dots & x_{1,p}\\ & \vdots & \\ x_{N,1} & \dots & x_{N,p} \end{pmatrix} \]
In R you can form an identity matrix this way:
n <- 5 #pick dimensions
diag(n)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
The inverse of matrix \(X\), denoted with \(X^{-1}\), has the property that, when multiplied, gives you the identity \(X^{-1}X=I\). Of course, not all matrices have inverses. For example, a \(2\times 2\) matrix with 1s in all its entries does not have an inverse.
As we will see when we get to the section on applications to linear models, being able to compute the inverse of a matrix is quite useful. A very convenient aspect of R is that it includes a predefined function solve to do this. Here is how we would use it to solve the system of equations:
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
y <- matrix(c(6,2,1),3,1)
solve(X)%*%y #equivalent to solve(X,y)
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
Please note that solve is a function that should be used with caution as it is not generally numerically stable. We explain this in much more detail in the QR factorization section.
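As a quick sketch, we can also confirm the two claims made above: multiplying a matrix by its inverse returns the identity, and the all-1s matrix has no inverse:
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
round( solve(X)%*%X )         ##gives back the identity matrix
try( solve( matrix(1,2,2) ) ) ##a 2x2 matrix of all 1s is singular, so solve() errors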
Matrix Operation Exercises #1
Suppose X is a matrix in R. Which of the following is NOT equivalent to X?
X <- matrix(1:12,4,3)
X
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
t( t(X) )
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
X %*% matrix(1,ncol(X) )
## [,1]
## [1,] 15
## [2,] 18
## [3,] 21
## [4,] 24
X*1
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
X%*%diag(ncol(X))
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
A:X %*% matrix(1,ncol(X) )
EXPLANATION: The first transposes the transpose, so we end up with our original X. The third is multiplying each element by 1, and the fourth is multiplying X by the identity. The second is not even guaranteed to have the same dimensions as X.
Matrix Operation Exercises #2
Solve the following system of equations using R:
3a + 4b - 5c + d = 10
2a + 2b + 2c - d = 5
a -b + 5c - 5d = 7
5a + d = 4
What is the solution for c:
X <- matrix(c(3,2,1,5,4,2,-1,0,-5,2,5,0,1,-1,-5,1),4,4)
y <- matrix(c(10,5,7,4),4,1)
solve(X)%*%y #equivalent to solve(X,y)
## [,1]
## [1,] 1.2477876
## [2,] 1.0176991
## [3,] -0.8849558
## [4,] -2.2389381
#or
X = matrix(c(3,2,1,5,4,2,-1,0,-5,2,5,0,1,-1,-5,1),4,4)
y = c(10,5,7,4)
sol = solve(X,y)
#and c is the third entry:
sol[ 3 ]
## [1] -0.8849558
A:-0.8849558
Load the following two matrices into R:
a <- matrix(1:12, nrow=4)
b <- matrix(1:15, nrow=3)
Note the dimension of ‘a’ and the dimension of ‘b’.
In the questions below, we will use the matrix multiplication operator in R, %*%, to multiply these two matrices.
Matrix Operation Exercises #3
What is the value in the 3rd row and the 2nd column of the matrix product of ‘a’ and ‘b’?
a <- matrix(1:12, nrow=4)
b <- matrix(1:15, nrow=3)
c<-a %*% b
c[3,2]
## [1] 113
A:113
Matrix Operation Exercises #4
Multiply the 3rd row of ‘a’ with the 2nd column of ‘b’, using the element-wise vector multiplication with *.
What is the sum of the elements in the resulting vector?
sum(a[3,] * b[,2])
## [1] 113
A:113, which is equivalent to the 3rd row, 2nd column element of the product of the two matrices.
Now we are ready to see how matrix algebra can be useful when analyzing data. We start with some simple examples and eventually arrive at the main one: how to write linear models with matrix algebra notation and solve the least squares problem.
To compute the sample average and variance of our data, we use these formulas \(\bar{Y}=\frac{1}{N} \sum_{i=1}^N Y_i\) and \(\mbox{var}(Y)=\frac{1}{N} \sum_{i=1}^N (Y_i - \bar{Y})^2\). We can represent these with matrix multiplication. First, define this \(N \times 1\) matrix made just of 1s:
\[ A=\begin{pmatrix} 1\\ 1\\ \vdots\\ 1 \end{pmatrix} \]
This implies that:
\[ \frac{1}{N} \mathbf{A}^\top \mathbf{Y} = \frac{1}{N} \begin{pmatrix}1&1&\dots&1\end{pmatrix} \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix}= \frac{1}{N} \sum_{i=1}^N Y_i = \bar{Y} \]
Note that we are multiplying by the scalar \(1/N\). In R, we multiply matrices using %*%:
data(father.son,package="UsingR")
y <- father.son$sheight
print(mean(y))
## [1] 68.68407
N <- length(y)
Y<- matrix(y,N,1)
A <- matrix(1,N,1)
barY=t(A)%*%Y / N
print(barY)
## [,1]
## [1,] 68.68407
As we will see later, multiplying the transpose of a matrix with another is very common in statistics. In fact, it is so common that there is a function in R:
barY=crossprod(A,Y) / N
print(barY)
## [,1]
## [1,] 68.68407
For the variance, we note that if:
\[ \mathbf{r}\equiv \begin{pmatrix} Y_1 - \bar{Y}\\ \vdots\\ Y_N - \bar{Y} \end{pmatrix}, \,\, \frac{1}{N} \mathbf{r}^\top\mathbf{r} = \frac{1}{N}\sum_{i=1}^N (Y_i - \bar{Y})^2 \]
In R, if you only send one matrix into crossprod, it computes \(\mathbf{r}^\top \mathbf{r}\), so we can simply type:
r <- y - barY
crossprod(r)/N
## [,1]
## [1,] 7.915196
Which is almost equivalent to:
library(rafalib)
popvar(y)
## [1] 7.915196
Now we are ready to put all this to use. Let’s start with Galton’s example. If we define these matrices:
\[ \mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} , \mathbf{X} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} , \mathbf{\beta} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \mathbf{\varepsilon} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
Then we can write the model:
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1,\dots,N \]
as:
\[ \, \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
or simply:
\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]
which is a much simpler way to write it.
The least squares equation becomes simpler as well since it is the following cross-product:
\[ (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^\top (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) \]
So now we are ready to determine which values of \(\beta\) minimize the above, which we can do using calculus to find the minimum.
There are a series of rules that permit us to compute partial derivative equations in matrix notation. By equating the derivative to 0 and solving for the \(\beta\), we will have our solution. The only one we need here tells us that the derivative of the above equation is:
\[ 2 \mathbf{X}^\top (\mathbf{Y} - \mathbf{X} \boldsymbol{\hat{\beta}})=0 \]
\[ \mathbf{X}^\top \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^\top \mathbf{Y} \]
\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
and we have our solution. We usually put a hat on the \(\beta\) that solves this, \(\hat{\beta}\) , as it is an estimate of the “real” \(\beta\) that generated the data.
Remember that the least squares are like a square (multiply something by itself) and that this formula is similar to the derivative of \(f(x)^2\) being \(2f(x)f^\prime(x)\).
Let’s see how it works in R:
data(father.son,package="UsingR")
x=father.son$fheight
y=father.son$sheight
X <- cbind(1,x)
betahat <- solve( t(X) %*% X ) %*% t(X) %*% y
###or
betahat <- solve( crossprod(X) ) %*% crossprod( X, y )
Now we can see the results of this by computing the estimated \(\hat{\beta}_0+\hat{\beta}_1 x\) for any value of \(x\):
newx <- seq(min(x),max(x),len=100)
X <- cbind(1,newx)
fitted <- X%*%betahat
plot(x,y,xlab="Father's height",ylab="Son's height")
lines(newx,fitted,col=2)
Galton’s data with fitted regression line.
This \(\hat{\boldsymbol{\beta}}=(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}\) is one of the most widely used results in data analysis. One of the advantages of this approach is that we can use it in many different situations. For example, in our falling object problem:
set.seed(1)
g <- 9.8 #meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) #time in secs, t is a base function
d <- 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1)
Notice that we are using almost the same exact code:
X <- cbind(1,tt,tt^2)
y <- d
betahat <- solve(crossprod(X))%*%crossprod(X,y)
newtt <- seq(min(tt),max(tt),len=100)
X <- cbind(1,newtt,newtt^2)
fitted <- X%*%betahat
plot(tt,y,xlab="Time",ylab="Height")
lines(newtt,fitted,col=2)
Fitted parabola to simulated data for distance travelled versus time of falling object measured with error.
And the resulting estimates are what we expect:
betahat
## [,1]
## 56.5317368
## tt 0.5013565
## -5.0386455
The Tower of Pisa is about 56 meters high. Since we are just dropping the object there is no initial velocity, and half the constant of gravity is 9.8/2=4.9 meters per second squared.
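As a sanity check, here is a small sketch that translates the betahat computed above back into these physical quantities:
betahat[1]    ##estimated starting height; compare to h0 = 56.67
betahat[2]    ##estimated initial velocity; compare to v0 = 0
-2*betahat[3] ##estimated g, since the quadratic coefficient is -0.5*g; compare to 9.8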
The lm Function

R has a very convenient function that fits these models. We will learn more about this function later, but here is a preview:
X <- cbind(tt,tt^2)
fit=lm(y~X)
summary(fit)
##
## Call:
## lm(formula = y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5295 -0.4882 0.2537 0.6560 1.5455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.5317 0.5451 103.701 <2e-16 ***
## Xtt 0.5014 0.7426 0.675 0.507
## X -5.0386 0.2110 -23.884 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9822 on 22 degrees of freedom
## Multiple R-squared: 0.9973, Adjusted R-squared: 0.997
## F-statistic: 4025 on 2 and 22 DF, p-value: < 2.2e-16
Note that we obtain the same values as above.
We have shown how to write linear models using linear algebra. We are going to do this for several examples, many of which are related to designed experiments. We also demonstrated how to obtain least squares estimates. Nevertheless, it is important to remember that because \(y\) is a random variable, these estimates are random as well. In a later section, we will learn how to compute standard error for these estimates and use this to perform inference.
Let’s revisit the falling object example and work through the RSS and LSE computations using matrix algebra in R. First we simulate the data:
set.seed(1)
g <- 9.8 #meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) #time in secs, t is a base function
d <- 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1)
Let’s write a function that computes the RSS for any vector \(\beta\):
rss <- function(Beta0,Beta1,Beta2){
r <- d - (Beta0+Beta1*tt+Beta2*tt^2)
return(sum(r^2))
}
Beta2s<- seq(-10,0,len=100)
RSS<-sapply(Beta2s,rss,Beta0=65,Beta1=0)
plot(Beta2s,sapply(Beta2s,rss,Beta0=55,Beta1=0),
ylab="RSS",xlab="Beta2",type="l")
lines(Beta2s,RSS,type="l",col=3)
X <- cbind(1,tt,tt^2)
## same as above: X <- cbind(rep(1,length(tt)),tt,tt^2)
head(X)
## tt
## [1,] 1 0.0000000 0.00000000
## [2,] 1 0.1416667 0.02006944
## [3,] 1 0.2833333 0.08027778
## [4,] 1 0.4250000 0.18062500
## [5,] 1 0.5666667 0.32111111
## [6,] 1 0.7083333 0.50173611
## let's now compute the RSS for a given beta vector
y <- d
Beta<-matrix(c(55,0,5),3,1)
r<-y-X %*% Beta #residuals
RSS<-t(r) %*% r
## let's check that we get the same answer
rss(55,0,5)
## [1] 66131.18
RSS
## [,1]
## [1,] 66131.18
## crossprod gives a faster way to compute t(r) %*% r
RSS<-crossprod(r)
RSS ## we get the same answer
## [,1]
## [1,] 66131.18
Least Squares Estimates (LSE)
#Using Matrix algebra
betahat<-solve(t(X) %*% X) %*% t(X) %*% y
betahat
## [,1]
## 56.5317368
## tt 0.5013565
## -5.0386455
#or
betahat <- solve(crossprod(X))%*%crossprod(X,y)
betahat
## [,1]
## 56.5317368
## tt 0.5013565
## -5.0386455
#Using R lm function
tt2 <-tt^2
fit <- lm(y~tt+tt2)
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.5317368 0.5451423 103.7008720 4.323897e-31
## tt 0.5013565 0.7425988 0.6751378 5.066226e-01
## tt2 -5.0386455 0.2109615 -23.8841916 3.167820e-17
## Note: solve() can be numerically unstable. A more stable approach uses the QR factorization together with backsolve():
QR<-qr(X)
Q<-qr.Q(QR)
R<-qr.R(QR)
backsolve(R,crossprod(Q,y))
## [,1]
## [1,] 56.5317368
## [2,] 0.5013565
## [3,] -5.0386455
Suppose we are analyzing a set of 4 samples. The first two samples are from a treatment group A and the second two samples are from a treatment group B. This design can be represented with a model matrix like so:
X <- matrix(c(1,1,1,1,0,0,1,1),nrow=4)
rownames(X) <- c("a","a","b","b")
X
## [,1] [,2]
## a 1 0
## a 1 0
## b 1 1
## b 1 1
Suppose that the fitted parameters for a linear model give us:
beta <- c(5, 2)
Use the matrix multiplication operator, %*%, in R to answer the following questions:
Matrix Algebra Examples Exercises #1
What is the fitted value for the A samples? (The fitted Y values.)
The formula for finding beta using least squares estimation is:
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
We can calculate this in R using our matrix multiplication operator %*%, the inverse function solve, and the transpose function t.
To compute the fitted values, recall the model:
\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]
# the fitted values are X %*% beta, given the betas and the design matrix X
fitted = X %*% beta
fitted[ 1:2, ]
## a a
## 5 5
A:5
Matrix Algebra Examples Exercises #2
What is the fitted value for the B samples? (The fitted Y values.)
fitted = X %*% beta
fitted[ 3:4, ]
## b b
## 7 7
A:7
Suppose now we are comparing two treatments B and C to a control group A, each with two samples. This design can be represented with a model matrix like so:
X <- matrix(c(1,1,1,1,1,1,0,0,1,1,0,0,0,0,0,0,1,1),nrow=6)
rownames(X) <- c("a","a","b","b","c","c")
#which results in a matrix that looks like
X
## [,1] [,2] [,3]
## a 1 0 0
## a 1 0 0
## b 1 1 0
## b 1 1 0
## c 1 0 1
## c 1 0 1
Suppose that the fitted parameters for the linear model are given by:
beta <- c(10,3,-3)
Matrix Algebra Examples Exercises #3
What is the fitted value for the B samples?
fitted = X %*% beta
fitted[ 3:4, ]
## b b
## 13 13
A:13
Matrix Algebra Examples Exercises #4
What is the fitted value for the C samples?
fitted = X %*% beta
fitted[ 5:6, ]
## c c
## 7 7
A:7
The standard error of an estimate is the standard deviation of the sampling distribution of an estimate. In PH525.1x, we saw that our estimate of the mean of a population changed depending on the sample that we took from the population. If we repeatedly sampled from the population, and each time estimated the mean, the collection of mean estimates formed the sampling distribution of the estimate. When we took the standard deviation of those estimates, that was the standard error of our mean estimate.
Here, we aren’t sampling individuals from a population, but we do have random noise in our observations Y. The estimate for the linear model terms (beta-hat) will not be the same if we were to re-run the experiment, because the random noise would be different. If we were to re-run the experiment many times, and estimate linear model terms (beta-hat) each time, this is called the sampling distribution of the estimates. If we take the standard deviation of all of these estimates from repetitions of the experiment, this is called the standard error of the estimate. While we are not sampling individuals, you can think of each repetition of the experiment as “sampling” new errors in our observation of Y.
Inference Review Exercises #1
We have shown how to find the least squares estimates with matrix algebra. These estimates are random variables as they are linear combinations of the data. For these estimates to be useful we also need to compute the standard errors.
Here we review standard errors in the context of linear models.
It is useful to think about where randomness comes from. In our falling object example, randomness was introduced through measurement errors. Every time we rerun the experiment a new set of measurement errors will be made which implies our data will be random. This implies that our estimate of, for example, the gravitational constant will change. The constant is fixed, but our estimates are not. To see this we can run a Monte Carlo simulation. Specifically we will generate the data repeatedly and compute the estimate for the quadratic term each time.
g = 9.8 ## meters per second squared
h0 = 56.67
v0 = 0
n = 25
tt = seq(0,3.4,len=n) ##time in secs, t is a base function
y = h0 + v0*tt - 0.5*g*tt^2 + rnorm(n,sd=1)
Now we act as if we didn’t know h0, v0 and -0.5*g and use regression to estimate these. We can rewrite the model as y = b0 + b1 t + b2 t^2 + e and obtain the LSE we have used in this class. Note that g = -2 b2.
To obtain the LSE in R we could write:
X = cbind(1,tt,tt^2)
head(X)
## tt
## [1,] 1 0.0000000 0.00000000
## [2,] 1 0.1416667 0.02006944
## [3,] 1 0.2833333 0.08027778
## [4,] 1 0.4250000 0.18062500
## [5,] 1 0.5666667 0.32111111
## [6,] 1 0.7083333 0.50173611
A = solve(crossprod(X))%*%t(X)
head(A)
## [,1] [,2] [,3] [,4] [,5]
## 0.30803419 0.25948718 0.21435897 0.17264957 0.13435897
## tt -0.35475113 -0.27692308 -0.20539052 -0.14015345 -0.08121188
## 0.08517434 0.06388076 0.04443879 0.02684843 0.01110970
## [,6] [,7] [,8] [,9] [,10]
## 0.099487179 0.06803419 0.04000000 0.01538462 -0.005811966
## tt -0.028565808 0.01778477 0.05783986 0.09159945 0.119063545
## -0.002777424 -0.01481293 -0.02499682 -0.03332909 -0.039809746
## [,11] [,12] [,13] [,14] [,15] [,16]
## -0.02358974 -0.03794872 -0.04888889 -0.05641026 -0.06051282 -0.06119658
## tt 0.14023215 0.15510525 0.16368286 0.16596498 0.16195160 0.15164273
## -0.04443879 -0.04721621 -0.04814202 -0.04721621 -0.04443879 -0.03980975
## [,17] [,18] [,19] [,20] [,21]
## -0.05846154 -0.05230769 -0.04273504 -0.029743590 -0.013333333
## tt 0.13503836 0.11213850 0.08294314 0.047452292 0.005665945
## -0.03332909 -0.02499682 -0.01481293 -0.002777424 0.011109697
## [,22] [,23] [,24] [,25]
## 0.006495726 0.02974359 0.05641026 0.08649573
## tt -0.042415896 -0.09679323 -0.15746606 -0.22443439
## 0.026848434 0.04443879 0.06388076 0.08517434
Given how we have defined A, which of the following is the LSE of g, the acceleration due to gravity (suggestion: try the code in R)?
A %*% y
## [,1]
## 56.4305502
## tt 0.1467666
## -4.8943619
-2 * (A %*% y) [3]
## [1] 9.788724
A[3,3]
##
## 0.04443879
A:-2 * (A %*% y) [3]
EXPLANATION: 9.8 is not the answer because the LSE is a random variable. A %*% y gives us the LSE for all three coefficients. The third entry gives us the coefficient for the quadratic term, which is -0.5 g. We multiply by -2 to get the estimate of g.
Inference Review Exercises #2
In the lines of code above, there was a call to a random function rnorm(). This means that each time the lines of code above are repeated, the estimate of g will be different.
Use the code above in conjunction with the function replicate() to generate 100,000 Monte Carlo simulated datasets. For each dataset compute an estimate of g (remember to multiply by -2) and set the seed to 1.
What is the standard error of this estimate?:
B = 100000
g = 9.8 ## meters per second squared
n = 25
tt = seq(0,3.4,len=n) ##time in secs, t is a base function
X = cbind(1,tt,tt^2)
A = solve(crossprod(X))%*%t(X)
set.seed(1)
betahat = replicate(B,{
y = 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1)
betahats = -2*A%*%y
return(betahats[3])
})
sqrt(mean( (betahat-mean(betahat) )^2))
## [1] 0.4297449
A:0.4297449
Inference Review Exercises #3
In the father and son height examples we have randomness because we have a random sample of father and son pairs. For the sake of illustration let’s assume that this is the entire population:
library(UsingR)
x = father.son$fheight
y = father.son$sheight
n = length(y)
#Now let's run a Monte Carlo simulation in which we take a sample of size 50 over and over again. Here is how we obtain one sample:
N = 50
index = sample(n,N)
sampledat = father.son[index,]
x = sampledat$fheight
y = sampledat$sheight
betahat = lm(y~x)$coef
Use the function replicate to take 10,000 samples.
What is the standard error of the slope estimate? That is, calculate the standard deviation of the estimate from many random samples. Again, set the seed to 1.
N = 50
B = 10000
set.seed(1)
betahat = replicate(B,{
index = sample(n,N)
sampledat = father.son[index,]
x = sampledat$fheight
y = sampledat$sheight
lm(y~x)$coef[2]
})
sqrt ( mean( (betahat - mean(betahat) )^2 ))
## [1] 0.1243209
A:0.1243209
Inference Review Exercises #4
We are defining a new concept: covariance. The covariance of two lists of numbers X=X1,…,Xn and Y=Y1,…,Yn is mean( (Y - mean(Y))*(X-mean(X) ) ).
Which of the following is closest to the covariance between father heights and son heights?
Y=father.son$fheight
X=father.son$sheight
mean( (Y - mean(Y))*(X-mean(X) ) )
## [1] 3.869739
#or
x = father.son$fheight
y = father.son$sheight
cor(x,y)
## [1] 0.5013383
cov<-cor(x,y)*sd(x)*sd(y)
cov
## [1] 3.873333
#or using R in built function
cov(x,y)
## [1] 3.873333
A: 3.873333
We have shown how to find the least squares estimates with matrix algebra. These estimates are random variables since they are linear combinations of the data. For these estimates to be useful, we also need to compute their standard errors. Linear algebra provides a powerful approach for this task. We provide several examples.
It is useful to think about where randomness comes from. In our falling object example, randomness was introduced through measurement errors. Each time we rerun the experiment, a new set of measurement errors will be made. This implies that our data will change randomly, which in turn suggests that our estimates will change randomly. For instance, our estimate of the gravitational constant will change every time we perform the experiment. The constant is fixed, but our estimates are not. To see this we can run a Monte Carlo simulation. Specifically, we will generate the data repeatedly and each time compute the estimate for the quadratic term.
set.seed(1)
B <- 10000
h0 <- 56.67
v0 <- 0
g <- 9.8 ##meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
X <-cbind(1,tt,tt^2)
##create (X'X)^-1 X'
A <- solve(crossprod(X)) %*% t(X)
betahat<-replicate(B,{
y <- h0 + v0*tt - 0.5*g*tt^2 + rnorm(n,sd=1)
betahats <- A%*%y
return(betahats[3])
})
head(betahat)
## [1] -5.038646 -4.894362 -5.143756 -5.220960 -5.063322 -4.777521
As expected, the estimate is different every time. This is because \(\hat{\beta}\) is a random variable. It therefore has a distribution:
library(rafalib)
mypar(1,2)
hist(betahat)
qqnorm(betahat)
qqline(betahat)
Distribution of estimated regression coefficients obtained from Monte Carlo simulated falling object data. The left is a histogram and on the right we have a qq-plot against normal theoretical quantiles.
Since \(\hat{\beta}\) is a linear combination of the data which we made normal in our simulation, it is also normal as seen in the qq-plot above. Also, the mean of the distribution is the true parameter \(-0.5g\), which we can confirm using the Monte Carlo estimates obtained above:
round(mean(betahat),1)
## [1] -4.9
But we will not observe this exact value when we estimate because the standard error of our estimate is approximately:
sd(betahat)
## [1] 0.2129976
Here we will show how we can compute the standard error without a Monte Carlo simulation. Since in practice we do not know exactly how the errors are generated, we can’t use the Monte Carlo approach.
In the father and son height examples, we have randomness because we have a random sample of father and son pairs. For the sake of illustration, let’s assume that this is the entire population:
data(father.son,package="UsingR")
x <- father.son$fheight
y <- father.son$sheight
n <- length(y)
Now let’s run a Monte Carlo simulation in which we take a sample of size 50 over and over again.
N <- 50
B <-1000
betahat <- replicate(B,{
index <- sample(n,N)
sampledat <- father.son[index,]
x <- sampledat$fheight
y <- sampledat$sheight
lm(y~x)$coef
})
betahat <- t(betahat) #have estimates in two columns
By making qq-plots, we see that our estimates are approximately normal random variables:
mypar(1,2)
qqnorm(betahat[,1])
qqline(betahat[,1])
qqnorm(betahat[,2])
qqline(betahat[,2])
Distribution of estimated regression coefficients obtained from Monte Carlo simulated father-son height data. Both panels are qq-plots against normal theoretical quantiles: the intercept estimates on the left and the slope estimates on the right.
We also see that the correlation of our estimates is negative:
cor(betahat[,1],betahat[,2])
## [1] -0.9992293
When we compute linear combinations of our estimates, we will need to know this information to correctly calculate the standard error of these linear combinations.
In the next section, we will describe the variance-covariance matrix. The covariance of two random variables is defined as follows:
mean( (betahat[,1]-mean(betahat[,1] ))* (betahat[,2]-mean(betahat[,2])))
## [1] -1.035291
The covariance is the correlation multiplied by the standard deviations of each random variable:
\[\mbox{Cov}(X,Y) = \mbox{Corr}(X,Y)\,\sigma_X \sigma_Y\]
On its own, the covariance does not have a particularly useful interpretation in practice. However, as we will see, it is a very useful quantity for mathematical derivations. In the next sections, we show useful matrix algebra calculations that can be used to estimate standard errors of linear model estimates.
As a first step we need to define the variance-covariance matrix, \(\boldsymbol{\Sigma}\). For a vector of random variables, \(\mathbf{Y}\), we define \(\boldsymbol{\Sigma}\) as the matrix with the \(i,j\) entry:
\[ \Sigma_{i,j} \equiv \mbox{Cov}(Y_i, Y_j) \]
The covariance is equal to the variance if \(i = j\) and equal to 0 if the variables are independent. In the kinds of vectors considered up to now, for example, a vector \(\mathbf{Y}\) of individual observations \(Y_i\) sampled from a population, we have assumed independence of each observation and assumed the \(Y_i\) all have the same variance \(\sigma^2\), so the variance-covariance matrix has had only two kinds of elements:
\[ \mbox{Cov}(Y_i, Y_i) = \mbox{var}(Y_i) = \sigma^2\]
\[ \mbox{Cov}(Y_i, Y_j) = 0, \mbox{ for } i \neq j\]
which implies that \(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\) with \(\mathbf{I}\), the identity matrix.
Later, we will see a case, specifically the estimate coefficients of a linear model, \(\hat{\boldsymbol{\beta}}\), that has non-zero entries in the off diagonal elements of \(\boldsymbol{\Sigma}\). Furthermore, the diagonal elements will not be equal to a single value \(\sigma^2\).
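As a quick check, we can look at the sample variance-covariance matrix of the Monte Carlo estimates computed above (a sketch, assuming the betahat matrix from the previous chunk is still in the workspace); its diagonal entries differ and its off-diagonal entries are not zero:
cov(betahat)  ## 2x2 matrix: variances on the diagonal, the negative covariance off the diagonal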
A useful result provided by linear algebra is that the variance covariance-matrix of a linear combination \(\mathbf{AY}\) of \(\mathbf{Y}\) can be computed as follows:
\[ \mbox{var}(\mathbf{AY}) = \mathbf{A}\mbox{var}(\mathbf{Y}) \mathbf{A}^\top \]
For example, if \(Y_1\) and \(Y_2\) are independent both with variance \(\sigma^2\) then:
\[\mbox{var}\{Y_1+Y_2\} = \mbox{var}\left\{ \begin{pmatrix}1&1\end{pmatrix}\begin{pmatrix} Y_1\\Y_2\\ \end{pmatrix}\right\}\]
\[ =\begin{pmatrix}1&1\end{pmatrix} \sigma^2 \mathbf{I}\begin{pmatrix} 1\\1\\ \end{pmatrix}=2\sigma^2\]
as we expect. We use this result to obtain the standard errors of the LSE (least squares estimate).
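Here is a minimal numerical sketch of this result; the variance value and the matrix A below are arbitrary choices made only for illustration:
sigma2 <- 4                    ## an arbitrary variance, for illustration only
A <- matrix(c(1,1), nrow=1)    ## the linear combination Y_1 + Y_2
Sigma <- sigma2 * diag(2)      ## var(Y) for two independent variables
A %*% Sigma %*% t(A)           ## returns 2*sigma2 = 8, matching the formula above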
Note that \(\boldsymbol{\hat{\beta}}\) is a linear combination of \(\mathbf{Y}\): \(\mathbf{AY}\) with \(\mathbf{A}=\mathbf{(X^\top X)^{-1}X}^\top\), so we can use the equation above to derive the variance of our estimates:
\[\mbox{var}(\boldsymbol{\hat{\beta}}) = \mbox{var}( \mathbf{(X^\top X)^{-1}X^\top Y} ) = \]
\[\mathbf{(X^\top X)^{-1} X^\top} \mbox{var}(Y) (\mathbf{(X^\top X)^{-1} X^\top})^\top = \]
\[\mathbf{(X^\top X)^{-1} X^\top} \sigma^2 \mathbf{I} (\mathbf{(X^\top X)^{-1} X^\top})^\top = \]
\[\sigma^2 \mathbf{(X^\top X)^{-1} X^\top}\mathbf{X} \mathbf{(X^\top X)^{-1}} = \]
\[\sigma^2\mathbf{(X^\top X)^{-1}}\]
The standard errors of our estimates are the square roots of the diagonal entries of this matrix.
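For the falling object simulation we know \(\sigma=1\), so we can apply this formula directly; a sketch (the result should be close to the Monte Carlo standard error sd(betahat) reported above, up to simulation error):
tt <- seq(0,3.4,len=25)                ## same time points as the simulation
X <- cbind(1,tt,tt^2)
sqrt(diag(solve(crossprod(X))))[3]     ## SE of the quadratic term when sigma = 1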
To obtain an actual estimate in practice from the formulas above, we need to estimate \(\sigma^2\). Previously we estimated the standard errors from the sample. However, the sample standard deviation of \(Y\) is not \(\sigma\) because \(Y\) also includes variability introduced by the deterministic part of the model: \(\mathbf{X}\boldsymbol{\beta}\). The approach we take is to use the residuals.
We form the residuals like this:
\[ \mathbf{r}\equiv\boldsymbol{\hat{\varepsilon}} = \mathbf{Y}-\mathbf{X}\boldsymbol{\hat{\beta}}\]
Both \(\mathbf{r}\) and \(\boldsymbol{\hat{\varepsilon}}\) notations are used to denote residuals.
We then use these residuals to estimate \(\sigma^2\), in a way similar to what we do in the univariate case:
\[ s^2 \equiv \hat{\sigma}^2 = \frac{1}{N-p}\mathbf{r}^\top\mathbf{r} = \frac{1}{N-p}\sum_{i=1}^N r_i^2\]
Here \(N\) is the sample size and \(p\) is the number of columns in \(\mathbf{X}\) or number of parameters (including the intercept term \(\beta_0\)). The reason we divide by \(N-p\) is because mathematical theory tells us that this will give us a better (unbiased) estimate.
Let’s try this in R and see if we obtain the same values as we did with the Monte Carlo simulation above:
n <- nrow(father.son)
N <- 50
index <- sample(n,N)
sampledat <- father.son[index,]
x <- sampledat$fheight
y <- sampledat$sheight
X <- model.matrix(~x)
N <- nrow(X)
p <- ncol(X)
XtXinv <- solve(crossprod(X))
resid <- y - X %*% XtXinv %*% crossprod(X,y)
s <- sqrt( sum(resid^2)/(N-p))
ses <- sqrt(diag(XtXinv))*s
Let’s compare to what lm provides:
summary(lm(y~x))$coef[,2]
## (Intercept) x
## 8.3899781 0.1240767
ses
## (Intercept) x
## 8.3899781 0.1240767
They are identical because they are doing the same thing. Also, note that we approximate the Monte Carlo results:
apply(betahat,2,sd)
## (Intercept) x
## 8.3817556 0.1237362
Frequently, we want to compute the standard deviation of a linear combination of estimates such as \(\hat{\beta}_2 - \hat{\beta}_1\). This is a linear combination of \(\hat{\boldsymbol{\beta}}\):
\[\hat{\beta}_2 - \hat{\beta}_1 = \begin{pmatrix}0&-1&1&0&\dots&0\end{pmatrix} \begin{pmatrix} \hat{\beta}_0\\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots\\ \hat{\beta}_p \end{pmatrix}\]
Using the results above, we can compute the variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\) and, from it, the standard error of any such linear combination.
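As a sketch of how this is done in R for the father and son fit above (the contrast vector C below is a hypothetical example: it gives the fitted son height for a 65 inch father), the variance of \(\mathbf{C}\hat{\boldsymbol{\beta}}\) is \(\mathbf{C}\,\mbox{var}(\hat{\boldsymbol{\beta}})\,\mathbf{C}^\top\):
C <- matrix(c(1,65), nrow=1)   ## hypothetical linear combination: beta0 + 65*beta1
Sigma <- s^2 * XtXinv          ## estimated var-cov matrix of betahat from the chunk above
sqrt( C %*% Sigma %*% t(C) )   ## standard error of the linear combination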
We have shown how we can obtain standard errors for our estimates. However, as we learned in the first chapter, to perform inference we need to know the distribution of these random variables. The reason we went through the effort to compute the standard errors is because the CLT applies in linear models. If \(N\) is large enough, then the LSE will be normally distributed with mean \(\boldsymbol{\beta}\) and standard errors as described. For small samples, if the \(\varepsilon\) are normally distributed, then \((\hat{\beta}-\beta)/\mbox{SE}(\hat{\beta})\) follows a t-distribution. We do not derive this result here, but the result is extremely useful since it is how we construct p-values and confidence intervals in the context of linear models.
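To illustrate how these pieces are used for inference (a sketch, assuming y, x, N, p, and ses from the chunk above are still in the workspace), a 95% confidence interval for the slope built with t quantiles matches what confint reports:
fit <- lm(y~x)
coef(fit)[2] + c(-1,1) * qt(0.975, df=N-p) * ses[2]  ## manual 95% CI for the slope
confint(fit)[2,]                                     ## lm's built-in interval, which should agree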
The standard approach to writing linear models either assumes that \(\mathbf{X}\) is fixed or that we are conditioning on it. Thus \(\mathbf{X} \boldsymbol{\beta}\) has no variance, as \(\mathbf{X}\) is considered fixed. This is why we write \(\mbox{var}(Y_i) = \mbox{var}(\varepsilon_i)=\sigma^2\). This can cause confusion in practice because if you, for example, compute the following:
x = father.son$fheight
beta = c(34,0.5)
var(beta[1]+beta[2]*x)
## [1] 1.883576
it is nowhere near 0. This is an example in which we have to be careful in distinguishing code from math. The function var is simply computing the variance of the list we feed it, while the mathematical definition of variance considers only quantities that are random variables. In the R code above, x is not fixed at all: we are letting it vary, but when we write \(\mbox{var}(Y_i) = \sigma^2\) we are imposing, mathematically, that x be fixed. Similarly, if we use R to compute the variance of \(Y\) in our object dropping example, we obtain something very different than \(\sigma^2=1\) (the known variance):
n <- length(tt)
y <- h0 + v0*tt - 0.5*g*tt^2 + rnorm(n,sd=1)
var(y)
## [1] 329.5136
Again, this is because we are not fixing tt.
In the previous assessment, we used a Monte Carlo technique to see that the linear model coefficients are random variables when the data is a random sample. Now we will use the matrix algebra from the previous video to try to estimate the standard error of the linear model coefficients. Again, take a random sample of the father.son heights data:
library(UsingR)
x = father.son$fheight
y = father.son$sheight
n = length(y)
N = 50
set.seed(1)
index = sample(n,N)
sampledat = father.son[index,]
x = sampledat$fheight
y = sampledat$sheight
betahat = lm(y~x)$coef
The formula for the standard error in the previous video was (the following is not R code):
SE(betahat) = sqrt( var(betahat) )
var(betahat) = sigma^2 (X^T X)^-1
This is also listed in the standard error book page.
We will estimate or calculate each part of this equation and then combine them.
First, we want to estimate sigma^2, the variance of Y. As we have seen in the previous unit, the random part of Y is only coming from epsilon, because we assume X*beta is fixed. So we can try to estimate the variance of the epsilons from the residuals, the Yi minus the fitted values from the linear model.
Standard Errors Exercises #1
Note that the fitted values (Y-hat) from a linear model can be obtained with:
fit = lm(y ~ x)
fit$fitted.values
## 1 2 3 4 5 6 7 8
## 70.62707 70.36129 70.86093 68.73019 65.59181 70.55285 70.21256 68.62521
## 9 10 11 12 13 14 15 16
## 67.06729 69.64913 69.09958 71.70621 68.31598 70.57027 70.39537 70.39613
## 17 18 19 20 21 22 23 24
## 68.73977 68.98874 71.47021 72.03615 69.55975 68.15895 66.63557 71.53651
## 25 26 27 28 29 30 31 32
## 69.57083 69.71050 67.14263 70.99719 67.11046 69.04901 66.65243 67.82895
## 33 34 35 36 37 38 39 40
## 68.24209 70.70156 65.50431 67.36000 69.30065 67.94424 66.35150 71.40489
## 41 42 43 44 45 46 47 48
## 71.64301 66.81654 69.22900 69.11769 69.21793 69.69519 67.00674 68.67869
## 49 50
## 67.40752 69.28800
What is the sum of the squared residuals (where residuals are given by r_i = Y_i - Y-hat_i)?
e<-y-fit$fitted.values
sum(e^2)
## [1] 256.2152
#or
fit = lm(y ~ x)
sum((y - fit$fitted.values)^2)
## [1] 256.2152
A: 256.2152
Our estimate of sigma^2 will be the sum of squared residuals divided by (N - p), the sample size minus the number of terms in the model. Since we have a sample of 50 and 2 terms in the model (an intercept and a slope), our estimate of sigma^2 will be the sum of squared residuals divided by 48. Save this to a variable ‘sigma2’:
sigma2 = SSR / 48, where SSR is the answer to the previous question.
Standard Errors Exercises #2
Form the design matrix X (note: use a capital X!). This can be done by combining a column of 1’s with a column of ‘x’ the father’s heights.
X = cbind(rep(1,N), x)
Now calculate (X^T X)^-1, the inverse of X transpose times X. Use the solve() function for the inverse and t() for the transpose. What is the element in the first row, first column?
solve(t(X) %*% X)[1,1]
## [1] 11.30275
A:11.30275
Standard Errors Exercises #3
Now we are one step away from the standard error of beta-hat. Take the diagonals from the (X^T X)^-1 matrix above, using the diag() function. Now multiply our estimate of sigma^2 and the diagonals of this matrix. This is the estimated variance of beta-hat, so take the square root of this. You should end up with two numbers, the standard error for the intercept and the standard error for the slope.
What is the standard error for the slope?
sqrt(diag(solve(t(X) %*% X)) * (256.2152/48))
## x
## 7.7673678 0.1141966
# or
fit = lm(y ~ x)
sigma2 = sum((y - fit$fitted.values)^2) / (N - 2)
sqrt(sigma2 * diag(solve(t(X) %*% X)))
## x
## 7.7673671 0.1141966
#Compare this value with the value you estimated using Monte Carlo in the previous assessment. It will not be the same, because we are only estimating the standard error given a particular sample of 50 (which we obtained with set.seed(1)).
#Note that the standard error estimate is also printed in the second column of:
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2030 -1.8027 0.2918 1.4226 6.8493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.2542 7.7674 3.766 0.000453 ***
## x 0.5857 0.1142 5.129 5.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.31 on 48 degrees of freedom
## Multiple R-squared: 0.354, Adjusted R-squared: 0.3406
## F-statistic: 26.31 on 1 and 48 DF, p-value: 5.189e-06
A:0.1141966
We have shown how we can obtain standard errors for our estimates. But, as we learned in PH525.1x, to perform inference we need to know the distribution of these random variables. The reason we went through the effort of computing the standard errors is because the CLT applies in linear models. If N is large enough, then the LSE will be normally distributed with mean beta and standard errors as described in our videos. For small samples, if the error term is normally distributed, then betahat - beta follows a t-distribution. Proving this mathematically is rather advanced, but the results are extremely useful, as this is how we construct p-values and confidence intervals in the context of linear models.
Many of the models we use in data analysis can be presented using matrix algebra. We refer to these types of models as linear models. “Linear” here does not refer to lines, but rather to linear combinations. The representations we describe are convenient because we can write models more succinctly and we have the matrix algebra mathematical machinery to facilitate computation. In this chapter, we will describe in some detail how we use matrix algebra to represent and fit linear models.
In this book, we focus on linear models that represent dichotomous groups: treatment versus control, for example. The effect of diet on mice weights is an example of this type of linear model. Here we describe slightly more complicated models, but continue to focus on dichotomous variables.
As we learn about linear models, we need to remember that we are still working with random variables. This means that the estimates we obtain using linear models are also random variables. Although the mathematics is more complex, the concepts we learned in previous chapters apply here. We begin with some exercises to review the concept of random variables in the context of linear models.
Let’s talk a little bit about matrix notation and why we use it. It makes writing formulas easy, it also makes computation easy, it makes mathematics easy.
So here’s an example.
So suppose you have some data and you have a linear model. So that's four variables, it's a little bit more complicated than the ones we saw before. And you have four parameters plus the base level parameter. And you want to find the parameters that fit the data best, so we might consider using least squares. So that's what this is. So we look for the betas that minimize this equation. That equation, with all those subscripts and x's and betas, can be written in matrix notation like the formula down here, which is much faster to write. And once you get used to this, you see the bottom formula and you know that what you're doing is the top one. All right, so, not only is it easier to look at or to write down on a piece of paper, it's actually faster, at least in R. R is made to work on matrices, on matrix algebra computations. So here, in this example, I'm showing a very quick simulation example where we compute that sum of squares value. And if we do it using matrix multiplication, it takes 0.01 seconds.
And if you instead write out every single step and put it in a for loop, or in something that goes one by one instead of using the matrix trick, it takes 3.4 seconds.
So that is a lot slower: over 300 times slower.
So a little bit of a refresher on matrix algebra. So we want to be able to multiply two matrices, so instead of writing down each single model for each element, for each individual, we instead write these matrices.
And these are the elements of the matrix. So when I write X times beta, that's the X beta that we saw. X is going to be a matrix with entries for each individual on the rows. And then the columns represent the different covariates or variables that are used in the model. And then the parameters are put in another little matrix. So we're going to show you how, when you multiply these two matrices, it gives us what we want in terms of the original linear model we wrote out.
So we start by multiplying. If we're going to multiply these two matrices, you take the first row and you multiply it by the first column, and get this. Right?
So you see, now you recognize that. It's what we had at the very beginning in the non-matrix version of this. Now we start going down to the next one and we get the second line, et cetera.
And we do it for several of them.
So now we can rewrite the two groups: instead of the formula with all the indices, we can actually write these matrices. Here we would just write y equals X beta plus epsilon. Now, in this particular formula, I'm actually writing out every individual so you can see what the matrices contain.
All right, so this is the same formula that we had before, in matrix form. And you can see that every individual here is coming from something like this. If you're in group A, you're going to get Beta0 plus an error. If you're in group B, you're going to get Beta0 plus Beta1, so you're up here, plus an error.
Now if you have three groups, again we can write it out in matrix notation.
So if you're in group 1, you multiply this by that. You get Beta0 plus an error, and you're going to be one of these points here. If you're in group 2, group B, you get Beta0 plus Beta1, so you're up here, plus some error. And the same goes for group 3: if you're in group C, it would be Beta0 plus Beta2 plus an error.
Today we're going to talk about a very powerful mathematical technique called linear algebra. And we're going to show it in the context of statistical analysis through linear models.
The t-test that we described in a previous module is actually something that can be derived from the linear model machinery. In this slide I'm showing you a little bit of motivation.
You can think of the t-test as a corkscrew, while linear models are much more applicable and much more general, so you can think of them as a Swiss Army knife. So just to give you an idea of the connection between t-tests and linear models, I'm going to go over a bit of the notation here.
So in this picture, in this graph, I’m showing you two groups.
You can see them to the left and to the right, group A and group B. And you see all these points around two means. So what is this representing? When we're doing the t-test, we take the average in one group and subtract it from the average of the other group. And that was an estimate of the difference in population averages.
Now here, we're going to write it slightly differently. And later you're going to see why we're doing this. What we're going to do is pick a baseline. Now, this average, instead of being called mu_Y as in the previous module, we're going to call it Beta0. And then we're going to call the difference between the two groups Beta1. One of the nice things about this notation is that Beta1 is really what we care about most of the time. It's that difference that we want to estimate and perform statistical tests on.
So writing it out as a linear model, what we have is: group A has a mean of Beta0, and group B has a mean of Beta0 plus that difference. So we can write it like this.
Now, the reason I'm doing all of this is because I want to eventually be able to write these as matrices. And the reason I want to do that is because if I write everything in matrices, everything is much easier and faster to compute on a computer, when we program it up. And it's also more convenient for deriving mathematical properties as well.
So here's how we would write the model that we're talking about. The mean for group A is Beta0 plus 0 times Beta1. And the mean for group B is Beta0 plus 1 times Beta1. OK, so we're going to see more on that later.
So in general, the typical linear model that you will see, if you look it up in Wikipedia or something, or in a book.
You're going to have the observed data represented with Y, and there's going to be a little subscript which usually denotes an individual. In the case of genomics it could be a sample. Then you have the model parameters. These are the things that we want to estimate, the things that we want to find out. In the case of men's and women's heights, Beta1 was the difference we were interested in. So you have a base level, Beta0, then you have these parameters, Beta1, and you can have more. And then we also have these predictors, or variables, or covariates, as statisticians call them, that define the model. In the examples that we're using up to now, they're either 0 or 1. But they could be other things as well; they could be continuous data. And then we almost always also include an error term that describes variability. This error term could be measurement error. In the case of measuring heights, if you're doing it over and over for one person, it would be the measurement error of measuring height. But if you're taking a random sample, then this epsilon really represents the variability you get from sampling. So tall people would have large positive errors and short people would have large negative errors. It's not the best name when we're connecting it to biology, but it is what we call it in statistics.
Now, one way in which linear models immediately become more useful than the t-test is in the case where we have three groups.
So say that instead of men and women, we wanted to know which ethnic group is tallest. So now we have three ethnic groups instead of two genders. In that case, we would have some baseline group, which has the Beta0. And then the other groups would be different. In what way? We're adding Beta1 in one case and Beta2 in the other. And the nice thing about linear models is that we can write this out and, in one quick computation on the computer, estimate Beta1, estimate Beta2, and also get standard errors for those estimates.
So here’s an illustration of what we’ve just shown.
You have a mean for group A, a mean for group B, a mean for group C. But that's not how the model is mathematically describing this. The model says: this here is Beta0, this difference is Beta1, and this difference is Beta2. And the nice thing again is that those differences are what we're interested in, so we estimate them directly.
Here we will show how to use the two R functions, formula and model.matrix, in order to produce design matrices (also known as model matrices) for a variety of linear models. For example, in the mouse diet examples we wrote the model as
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon, i=1,\dots,N \]
with \(Y_i\) the weights and \(x_i\) equal to 1 only when mouse \(i\) receives the high fat diet. We use the term experimental unit to refer to the \(N\) different entities from which we obtain a measurement. In this case, the mice are the experimental units.
This is the type of variable we will focus on in this chapter. We call them indicator variables since they simply indicate if the experimental unit had a certain characteristic or not. As we described earlier, we can use linear algebra to represent this model:
\[ \mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} , \mathbf{X} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} , \mathbf{\beta} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \mathbf{\varepsilon} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
as:
\[ \, \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
or simply:
\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]
The design matrix is the matrix \(\mathbf{X}\).
Once we define a design matrix, we are ready to find the least squares estimates. We refer to this as fitting the model. For fitting linear models in R, we will directly provide a formula to the lm function. In this script, we will use the model.matrix function, which is used internally by the lm function. This will help us to connect the R formula with the matrix \(\mathbf{X}\). It will therefore help us interpret the results from lm.
The choice of design matrix is a critical step in linear modeling since it encodes which coefficients will be fit in the model, as well as the inter-relationship between the samples. A common misunderstanding is that the choice of design follows straightforwardly from a description of which samples were included in the experiment. This is not the case. The basic information about each sample (whether control or treatment group, experimental batch, etc.) does not imply a single ‘correct’ design matrix. The design matrix additionally encodes various assumptions about how the variables in \(\mathbf{X}\) explain the observed values in \(\mathbf{Y}\), on which the investigator must decide.
For the examples we cover here, we use linear models to make comparisons between different groups. Hence, the design matrices that we ultimately work with will have at least two columns: an intercept column, which consists of a column of 1’s, and a second column, which specifies which samples are in a second group. In this case, two coefficients are fit in the linear model: the intercept, which represents the population average of the first group, and a second coefficient, which represents the difference between the population averages of the second group and the first group. The latter is typically the coefficient we are interested in when we are performing statistical tests: we want to know if there is a difference between the two groups.
We encode this experimental design in R with two pieces. We start with a formula with the tilde symbol ~. This means that we want to model the observations using the variables to the right of the tilde. Then we put the name of a variable, which tells us which samples are in which group.
Let’s try an example. Suppose we have two groups, control and high fat diet, with two samples each. For illustrative purposes, we will code these with 1 and 2 respectively. We should first tell R that these values should not be interpreted numerically, but as different levels of a factor. We can then use the paradigm ~ group to, say, model on the variable group.
group <- factor( c(1,1,2,2) )
model.matrix(~ group)
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 1
## 4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
(Don’t worry about the attr lines printed beneath the matrix. We won’t be using this information.)
What about the formula function? We don’t have to include this. By starting an expression with ~, it is equivalent to telling R that the expression is a formula:
model.matrix(formula(~ group))
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 1
## 4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
What happens if we don’t tell R that group should be interpreted as a factor?
group <- c(1,1,2,2)
model.matrix(~ group)
## (Intercept) group
## 1 1 1
## 2 1 1
## 3 1 2
## 4 1 2
## attr(,"assign")
## [1] 0 1
This is not the design matrix we wanted, and the reason is that we provided a numeric variable, as opposed to an indicator, to the formula and model.matrix functions, without saying that these numbers actually referred to different groups. We want the second column to have only 0 and 1, indicating group membership.
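A minimal sketch of the fix: convert the numeric vector to a factor before building the design matrix, so the second column becomes a 0/1 indicator again:
group <- c(1,1,2,2)
model.matrix(~ factor(group))  ## second column is now a 0/1 group indicator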
A note about factors: the names of the levels are irrelevant to model.matrix and lm. All that matters is the order. For example:
group <- factor(c("control","control","highfat","highfat"))
model.matrix(~ group)
## (Intercept) grouphighfat
## 1 1 0
## 2 1 0
## 3 1 1
## 4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
produces the same design matrix as our first code chunk.
Using the same formula, we can accommodate modeling more groups. Suppose we have a third diet:
group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group)
## (Intercept) group2 group3
## 1 1 0 0
## 2 1 0 0
## 3 1 1 0
## 4 1 1 0
## 5 1 0 1
## 6 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
Now we have a third column which specifies which samples belong to the third group.
An alternative formulation of the design matrix is possible by specifying + 0 in the formula:
group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group + 0)
## group1 group2 group3
## 1 1 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 1 0
## 5 0 0 1
## 6 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
This design now fits a separate coefficient for each group. We will explore this design in more depth later on.
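To see what this design fits, here is a small sketch with made-up outcome values (the numbers are assumptions chosen only for illustration, not data from the book):
set.seed(1)
y <- c(2,2,4,4,6,6) + rnorm(6, sd=0.1)  ## made-up outcomes for the three groups
coef( lm(y ~ group + 0) )  ## one coefficient per group: approximately the group means
coef( lm(y ~ group) )      ## compare: an intercept plus differences from group 1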
We have been using a simple case with just one variable (diet) as an example. In the life sciences, it is quite common to perform experiments with more than one variable. For example, we may be interested in the effect of diet and the difference in sexes. In this case, we have four possible groups:
diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
table(diet,sex)
## sex
## diet f m
## 1 2 2
## 2 2 2
If we assume that the diet effect is the same for males and females (this is an assumption), then our linear model is:
\[ Y_{i}= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i \]
To fit this model in R, we can simply add the additional variable with a + sign in order to build a design matrix which fits based on the information in additional variables:
diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
model.matrix(~ diet + sex)
## (Intercept) diet2 sexm
## 1 1 0 0
## 2 1 0 0
## 3 1 0 1
## 4 1 0 1
## 5 1 1 0
## 6 1 1 0
## 7 1 1 1
## 8 1 1 1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
##
## attr(,"contrasts")$sex
## [1] "contr.treatment"
The design matrix includes an intercept, a term for diet and a term for sex. We would say that this linear model accounts for differences in both the group and condition variables. However, as mentioned above, the model assumes that the diet effect is the same for both males and females. We say this is an additive effect. For each variable, we add an effect regardless of the value of the other. Another model is possible here, which fits an additional term and which encodes the potential interaction of group and condition variables. We will cover interaction terms in depth in a later script.
The interaction model can be written in either of the following two formulas:
model.matrix(~ diet + sex + diet:sex)
or
model.matrix(~ diet*sex)
## (Intercept) diet2 sexm diet2:sexm
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 1 0
## 4 1 0 1 0
## 5 1 1 0 0
## 6 1 1 0 0
## 7 1 1 1 1
## 8 1 1 1 1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
##
## attr(,"contrasts")$sex
## [1] "contr.treatment"
The level which is chosen for the reference level is the level which is contrasted against. By default, this is simply the first level alphabetically. We can specify that we want group 2 to be the reference level by either using the relevel function:
group <- factor(c(1,1,2,2))
group <- relevel(group, "2")
model.matrix(~ group)
## (Intercept) group1
## 1 1 1
## 2 1 1
## 3 1 0
## 4 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
or by providing the levels explicitly in the factor call:
group <- factor(group, levels=c("1","2"))
model.matrix(~ group)
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 1
## 4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
The model.matrix function will grab the variable from the R global environment, unless the data is explicitly provided as a data frame to the data argument:
group <- 1:4
model.matrix(~ group, data=data.frame(group=5:8))
## (Intercept) group
## 1 1 5
## 2 1 6
## 3 1 7
## 4 1 8
## attr(,"assign")
## [1] 0 1
Note how the R global environment variable group is ignored.
In this chapter, we focus on models based on indicator values. In certain designs, however, we will be interested in using numeric variables in the design formula, as opposed to converting them to factors first. For example, in the falling object example, time was a continuous variable in the model and time squared was also included:
tt <- seq(0,3.4,len=4)
model.matrix(~ tt + I(tt^2))
## (Intercept) tt I(tt^2)
## 1 1 0.000000 0.000000
## 2 1 1.133333 1.284444
## 3 1 2.266667 5.137778
## 4 1 3.400000 11.560000
## attr(,"assign")
## [1] 0 1 2
The I function above is necessary to specify a mathematical transformation of a variable. See ?I for more information.
In the life sciences, we could be interested in testing various dosages of a treatment, where we expect a specific relationship between a measured quantity and the dosage, e.g. 0 mg, 10 mg, 20 mg.
The assumptions imposed by including continuous data as variables are typically harder to defend and motivate than those imposed by indicator variables. While indicator variables simply assume a different mean between two groups, continuous variables assume a very specific relationship between the outcome and predictor variables.
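To make the distinction concrete, here is a sketch contrasting the two codings for a hypothetical dosage variable (the dose values are only those mentioned above, repeated for two samples each):
dose <- c(0,10,20,0,10,20)
model.matrix(~ factor(dose))  ## indicator coding: a separate mean for each dose
model.matrix(~ dose)          ## continuous coding: assumes a linear effect of dose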
In cases like the falling object, we have the theory of gravitation supporting the model. In the father-son height example, because the data is approximately bivariate normal, it follows that the relationship is linear if we condition on father's height. However, continuous variables are often included in linear models without such justification, simply to “adjust” for variables such as age. We highly discourage this practice unless the data support the model being used.
Suppose we have an experiment with the following design: on three different days, we perform an experiment with two treated and two control samples. We then measure some outcome Y_i, and we want to test the effect of condition, while controlling for whatever differences might have occurred due to the different days (maybe the temperature in the lab affects the measuring device). Assume that the true condition effect is the same for each day (no interaction between condition and day). We then define factors in R for ‘day’ and for ‘condition’.
Expressing Design Formula Exercises #1
Given the factors we have defined above, and not defining any new ones, which of the following R formulas will produce a design matrix (model matrix) that lets us analyze the effect of condition, controlling for the different days:
condition <- factor(c("treated","treated","treated","treated","treated","treated","control","control","control","control","control","control"))
day<- factor(c("A","A","B","B","C","C","A","A","B","B","C","C"))
table(condition,day)
## day
## condition A B C
## control 2 2 2
## treated 2 2 2
model.matrix(~day + condition)
## (Intercept) dayB dayC conditiontreated
## 1 1 0 0 1
## 2 1 0 0 1
## 3 1 1 0 1
## 4 1 1 0 1
## 5 1 0 1 1
## 6 1 0 1 1
## 7 1 0 0 0
## 8 1 0 0 0
## 9 1 1 0 0
## 10 1 1 0 0
## 11 1 0 1 0
## 12 1 0 1 0
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$day
## [1] "contr.treatment"
##
## attr(,"contrasts")$condition
## [1] "contr.treatment"
A: ~day + condition
EXPLANATION: Using the ~ and the names for the two variables we want in the model will produce a design matrix controlling for all levels of day and all levels of condition, so “~ day + condition”. We do not use the levels A, B, C, etc. in the design formula.
We will demonstrate how to analyze the high fat diet data using linear models instead of directly applying a t-test. We will demonstrate how ultimately these two approaches are equivalent.
We start by reading in the data and creating a quick stripchart:
dat <- read.csv("femaleMiceWeights.csv") ##previously downloaded
head(dat)
## Diet Bodyweight
## 1 chow 21.51
## 2 chow 28.14
## 3 chow 24.04
## 4 chow 23.45
## 5 chow 23.68
## 6 chow 19.79
stripchart(dat$Bodyweight ~ dat$Diet, vertical=TRUE, method="jitter",
main="Bodyweight over Diet")
Mice bodyweights stratified by diet.
## Plotting the data
library(ggplot2)
library(dplyr)
dat %>%
ggplot( aes(x = Bodyweight) ) +
geom_density(fill = 1)
Density of all mice bodyweights.
dat %>%
ggplot( aes(x = Bodyweight) ) +
geom_density(fill = 2)
Density of all mice bodyweights, with a different fill color.
dat %>%
ggplot( aes(fill = Diet, x = Bodyweight) ) +
geom_density()
Mice bodyweights stratified by diet.
N <- 100
levels <- c("control","experimental")
group <- rep(levels, N/2)
outcome <- rnorm(N, 42 * 5, 10) + (group == "experimental") * 42
data <- data.frame(group, outcome)
head(data)
## group outcome
## 1 control 203.7876
## 2 experimental 229.8530
## 3 control 221.2493
## 4 experimental 251.5507
## 5 control 209.8381
## 6 experimental 261.4384
## Plotting the data
data %>%
ggplot( aes(x = outcome) ) +
geom_density(fill = 1)
Density of the simulated outcomes.
data %>%
ggplot( aes(fill = group, x = outcome) ) +
geom_density()
Densities of the simulated outcomes stratified by group.
We can see that the high fat diet group appears to have higher weights on average, although there is overlap between the two samples.
For demonstration purposes, we will build the design matrix \(\mathbf{X}\) using the formula ~ Diet. The group with the 1’s in the second column is determined by the level of Diet which comes second; that is, the non-reference level.
levels(dat$Diet)
## [1] "chow" "hf"
X <- model.matrix(~ Diet, data=dat)
class(~ Diet)
## [1] "formula"
X # Diethf is an indicator variable: rows with Diet equal to hf get 1's and all other rows get 0's. The non-reference level gets the 1's. The reference level is the first level returned by levels(), which by default is 'chow' (it can be changed if required).
## (Intercept) Diethf
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
## 11 1 0
## 12 1 0
## 13 1 1
## 14 1 1
## 15 1 1
## 16 1 1
## 17 1 1
## 18 1 1
## 19 1 1
## 20 1 1
## 21 1 1
## 22 1 1
## 23 1 1
## 24 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$Diet
## [1] "contr.treatment"
#changing the levels---releveling
dat$Diet<-relevel(dat$Diet,ref="hf") #make sure 'dat$Diet' is a factor variable
model.matrix(~ Diet, data=dat)
## (Intercept) Dietchow
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 1
## 6 1 1
## 7 1 1
## 8 1 1
## 9 1 1
## 10 1 1
## 11 1 1
## 12 1 1
## 13 1 0
## 14 1 0
## 15 1 0
## 16 1 0
## 17 1 0
## 18 1 0
## 19 1 0
## 20 1 0
## 21 1 0
## 22 1 0
## 23 1 0
## 24 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$Diet
## [1] "contr.treatment"
#again we relevel it to the original 'chow' reference level
dat$Diet<-relevel(dat$Diet,ref="chow")
Before we use our shortcut for running linear models, lm, we want to review what will happen internally. Inside of lm, we will form the design matrix \(\mathbf{X}\) and calculate \(\hat{\boldsymbol{\beta}}\), which minimizes the sum of squares, using the previously described formula. The formula for this solution is:
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
We can calculate this in R using our matrix multiplication operator %*%, the inverse function solve, and the transpose function t.
Y <- dat$Bodyweight
X <- model.matrix(~ Diet, data=dat)
solve(t(X) %*% X) %*% t(X) %*% Y
## [,1]
## (Intercept) 23.813333
## Diethf 3.020833
These coefficients are the average of the control group and the difference of the averages:
s <- split(dat$Bodyweight, dat$Diet)
mean(s[["chow"]])
## [1] 23.81333
mean(s[["hf"]]) - mean(s[["chow"]])
## [1] 3.020833
Finally, we use our shortcut, lm, to run the linear model:
fit <- lm(Bodyweight ~ Diet, data=dat)
summary(fit)
##
## Call:
## lm(formula = Bodyweight ~ Diet, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1042 -2.4358 -0.4138 2.8335 7.1858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.813 1.039 22.912 <2e-16 ***
## Diethf 3.021 1.470 2.055 0.0519 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.6 on 22 degrees of freedom
## Multiple R-squared: 0.1611, Adjusted R-squared: 0.1229
## F-statistic: 4.224 on 1 and 22 DF, p-value: 0.05192
(coefs <- coef(fit)) #wrapping the assignment in parentheses prints the output
## (Intercept) Diethf
## 23.813333 3.020833
The following plot provides a visualization of the meaning of the coefficients with colored arrows (code not shown):
Estimated linear model coefficients for bodyweight data illustrated with arrows.
To make a connection with material presented earlier, this simple linear model is actually giving us the same result (the t-statistic and p-value) for the difference as a specific kind of t-test. This is the t-test between two groups with the assumption that the population standard deviation is the same for both groups. This was encoded into our linear model when we assumed that the errors \(\boldsymbol{\varepsilon}\) were all equally distributed.
Although in this case the linear model is equivalent to a t-test, we will soon explore more complicated designs, where the linear model is a useful extension. Below we demonstrate that one does in fact get the exact same results:
Our lm estimates were:
summary(fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.813333 1.039353 22.911684 7.642256e-17
## Diethf 3.020833 1.469867 2.055174 5.192480e-02
And the t-statistic is the same:
ttest <- t.test(s[["hf"]], s[["chow"]], var.equal=TRUE)
summary(fit)$coefficients[2,3]
## [1] 2.055174
ttest$statistic
## t
## 2.055174
#or
(t.test(s[["hf"]], s[["chow"]], var.equal=TRUE)$statistic)
## t
## 2.055174
#to get t statistic as negative....
ttest <- t.test(s[["chow"]],s[["hf"]], var.equal=TRUE)
ttest$statistic
## t
## -2.055174
t.test(s[["chow"]],s[["hf"]], var.equal=TRUE)$statistic
## t
## -2.055174
In the last videos we saw how to use lm() to run a simple two group linear model, and then compared the t-value from the linear model with the t-value from a t-test with the equal variance assumption. Though the linear model in this case is equivalent to a t-test, we will soon explore more complicated designs, where the linear model is a useful extension (confounding variables, testing contrasts of terms, testing interactions, testing many terms at once, etc.)
Here we will review the mathematics on why these produce the same t-value and therefore p-value.
We already know that the numerator of the t-value in both cases is the difference between the averages of the groups, so we only have to see that the denominator is the same. Of course, it makes sense that the denominator should be the same, since we are calculating the standard error of the same quantity (the difference) under the same assumptions (equal variance), but here we will show the equivalence of the formulas.
In the linear model, we saw how to calculate this standard error using the design matrix \(X\) and the estimate of \(\sigma^2\) from the residuals. The estimate of \(\sigma^2\) was the sum of squared residuals divided by \((N - p)\), where \(N\) is the total number of samples and \(p\) is the number of terms (an intercept and a group indicator, so here \(p=2\)).
In the t-test, the denominator of the t-value is the standard error of the difference. The t-test formula for the standard error of the difference, if we assume equal variance in the two groups is:
\[ SE = \sqrt{\mbox{var}(\mbox{diff})} \]
\[ \mbox{var}(\mbox{diff}) = \left( \frac{1}{n_x} + \frac{1}{n_y} \right) \frac{\sum_{i=1}^{n_x} (x_i - \bar{x})^2 + \sum_{i=1}^{n_y} (y_i - \bar{y})^2}{n_x + n_y - 2} \]
where \(n_x\) is the number of samples in the first group and \(n_y\) is the number of samples in the second group.
If we look carefully, the second part of this equation is the sum of squared residuals, divided by \((N - 2)\).
So all that is left to show is that
\( (X^\top X)^{-1}_{2,2} = \frac{1}{n_x} + \frac{1}{n_y} \)
…where [2,2] indicates the 2nd row, 2nd column, with X as the design matrix of a linear model of two groups.
Linear Models in Practice Exercises #1
You can make a design matrix X for a two group comparison either using model.matrix or simply with:
X = cbind(rep(1,nx + ny),rep(c(0,1),c(nx, ny)))
For a comparison of two groups, where the first group has nx=5 samples, and the second group has ny=7 samples, what is the element in the 1st row and 1st column of \(X^\top X\)?
nx=5
ny=7
X = cbind(rep(1,nx + ny),rep(c(0,1),c(nx, ny)))
XtX = t(X) %*% X
XtX[ 1,1 ]
## [1] 12
A:12
Linear Models in Practice Exercises #2
What are all the other elements of \((X^t X)\)?
t(X) %*% X
## [,1] [,2]
## [1,] 12 7
## [2,] 7 7
A:7
Now we just need to invert the matrix to obtain \((X^\top X)^{-1}\).
The formula for the inverse of a 2x2 matrix is:
\[ \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix} \]
The element of the inverse in the 2nd row and the 2nd column is the element which will be used to calculate the standard error of the second coefficient of the linear model. This is:
\[ \frac{a}{ad - bc} \]
And for our two group comparison, we saw that \(a = n_x + n_y\) and \(b = c = d = n_y\). So it follows that this element is:
\[ \frac{n_x + n_y}{(n_x + n_y)\,n_y - n_y\,n_y} \]
which simplifies to:
\[ \frac{n_x + n_y}{n_x\, n_y} = \frac{1}{n_y} + \frac{1}{n_x} \]
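We can check this numerically with the design matrix X built in the exercises above (nx=5, ny=7); the two quantities below should be identical:
solve(t(X) %*% X)[2,2]  ## the [2,2] element of (X^T X)^-1
1/nx + 1/ny             ## equals 12/35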
One of the most useful aspects of matrix algebra is that it lets us compute the solution to the least squares problem we showed much earlier, where the equation is pretty easy to write. And that's what I'm writing down here. And this is not only easy to write down mathematically, but it's very easy to compute using R or some other program. So this is going to give the least squares estimates.
Once you write out the matrices, as we did in the previous module, you can find the least squares solution, estimating the parameters just by calculating this formula here.
And there’s also a very nice way to compute standard error of the estimates.
So when we're testing to see if a particular estimated parameter is statistically significant, we're going to have to know how much it varies. And so we're going to have to compute its standard error. So this is basically the t-test.
We have an estimate, and the estimate divided by its standard error, in most situations, is going to be well approximated by a normal(0,1) distribution. So here's a formula for the standard error of the estimator of parameter i.
It includes the sample standard deviation, whose square is the variance estimate. And then again, the matrices show up. So I'm not going to actually compute this right now, but it's something that, on a computer, is straightforward to compute. And we get a general formula that lets us do it for any of the parameters in just one shot. It's very, very convenient.
Now we're going to consider something that is called a cross design. So imagine you have two treatments, and you create 4 groups. One receives treatment a, or control; one receives treatment b; another one receives treatment c; and the fourth group receives both b and c. So we can write down a model.
This is now a statistical model that tells us that if you give treatment b, if you’re in group b, then there’ll be a difference between that group and the control, which will be parametrized with beta 1. If you’re in group c, then there is an effect of being in that group that is beta 2. Now what about the fourth group, where you get both treatment b and treatment c?
One thing we can assume is that it’s an additive effect. So that you’ll get the effect that b gets, and you add the effect that c gets. It’s done with matrix notation over here.
So we have, for group a, these two samples. So you have 1 times beta 0. And the other 2 get multiplied by 0, so there's no beta 1 or beta 2. So, therefore, beta 0 is the effect of being in the control group.
If you are in the second group, then you get a beta 1. So it's beta 0 plus the effect of being in group b, that's beta 0 plus beta 1, and then no beta 2. And that gets you up to here.
This arrow here shows you beta 1.
However in the group c, you get that other treatment.
You don't get beta 1. And your average value should be beta 0 plus beta 2. So you have a 0 for the beta 1, and that gives you this quantity here.
Now what about in the fourth group, group d? There we're assuming that the effects are additive. So we have beta 0 plus beta 1 plus beta 2. And that is shown in the matrix algebra notation by just noticing that you have 1, 1, 1, and you add up these 3 numbers. And that gets you up here. That's beta 1 plus beta 2.
But what if the effects aren’t additive?
That could very well be: the fourth group is not where it's expected. It's not at the level of adding beta 1 plus beta 2. In this case we can use a separate model that gives us a different mean for each group. We have four possibilities, and in the experimental design and linear algebra approaches we call that fourth term the interaction. So now the model looks like this, and let's see what each one of these terms represents.
So just like before, in the first group, group a, you just have beta 0. In the second group you have beta 0 plus beta 1. In the third group you have beta 0 plus beta 2. Now in the fourth group, you have something new. You have the additive effects, which was beta 0 plus beta 1 plus beta 2. But notice that if we just use that model, we would predict the data to be here, but it's not there.
It’s actually a little bit higher.
So it looks like adding these two treatments has an interaction effect that makes the values even higher than we would expect had we just added. And that term now is this beta 1 colon 2 that is represented by this quantity here. So by fitting this model, we get these 4 estimates. And using those 4 estimates we can actually predict the mean levels for each of the 4 groups. And we can interpret that last parameter as an interaction effect: the effect that you get on top of just adding the first 2 main effects.
Now you might have heard of these linear models in other contexts. The most common one is in epidemiology, where you hear a lot of media reports saying we controlled for age, race, and sex and they still found an effect.
And so when they say that they control for this and that, basically what they're saying is that they are fitting a linear model that includes all those possible explanatory variables as variables in the model. They fit them and they hope that that corrects for them, and then they report the difference that they're actually interested in.
One word of caution: we see this very often, that people who use linear models say, we have to correct for age. And age is a continuous variable; I'm using it as an example.
So they just stick it into a linear model. So now what you're saying is that you have a baseline effect, and then for every year older that you are, your measured value should increase by beta age, this beta here. And that's a very strong assumption, right? You're saying that every year adds on the same amount. So you're assuming it's a line. It's not a curve, it's not a line that flattens out, it's a line. And if it's anything else, then the model is going to be wrong. So that's something that you want to be careful about. When you use these linear models, don't forget that they're representing something real. You want to understand what it is that the linear models are representing, so that you're not doing something nonsensical and just running with it. The one thing to remember is that the computer is going to give you an answer. Even if the model is wrong, it's going to run. The computer is not going to say, that model doesn't make sense. It's just going to give you the estimates, the confidence intervals, standard errors, and p-values. So keep that in mind when you use linear models. Now we've seen how useful linear models can be. In the labs, we're actually going to use them. We're going to do a lot of things with them. We're going to use a package called limma to do a lot of this, and we're also going to show you how powerful R can be at doing some of these matrix computations.
As a running example to learn about more complex linear models, we will be using a dataset which compares the different frictional coefficients on the different legs of a spider. Specifically, we will be determining whether more friction comes from a pushing or pulling motion of the leg. The original paper from which the data was provided is:
Jonas O. Wolff & Stanislav N. Gorb, Radial arrangement of Janus-like setae permits friction control in spiders, Scientific Reports, 22 January 2013.
The abstract of the paper says,
The hunting spider Cupiennius salei (Arachnida, Ctenidae) possesses hairy attachment pads (claw tufts) at its distal legs, consisting of directional branched setae… Friction of claw tufts on smooth glass was measured to reveal the functional effect of seta arrangement within the pad.
Figure 1 includes some pretty cool electron microscope images of the tufts. We are interested in the comparisons in Figure 4, where the pulling and pushing motions are compared for different leg pairs (for a diagram of pushing and pulling see the top of Figure 3).
We include the data in our dagdata package and can download it from here.
spider <- read.csv("spider_wolff_gorb_2013.csv", skip=1)
head(spider)
## leg type friction
## 1 L1 pull 0.90
## 2 L1 pull 0.91
## 3 L1 pull 0.86
## 4 L1 pull 0.85
## 5 L1 pull 0.80
## 6 L1 pull 0.87
str(spider)
## 'data.frame': 282 obs. of 3 variables:
## $ leg : Factor w/ 4 levels "L1","L2","L3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 2 levels "pull","push": 1 1 1 1 1 1 1 1 1 1 ...
## $ friction: num 0.9 0.91 0.86 0.85 0.8 0.87 0.92 0.83 0.88 0.87 ...
Each measurement comes from one of the spider's legs while it is either pushing or pulling. So we have two variables:
table(spider$leg,spider$type)
##
## pull push
## L1 34 34
## L2 15 15
## L3 52 52
## L4 40 40
We can make a boxplot summarizing the measurements for each of the eight leg pair and type combinations. This is similar to Figure 4 of the original paper:
boxplot(spider$friction ~ spider$type * spider$leg,
col=c("grey90","grey40"), las=2,
main="Comparison of friction coefficients of different leg pairs")
Comparison of friction coefficients of spiders’ different leg pairs. The friction coefficient is calculated as the ratio of two forces (see paper Methods) so it is unitless.
What we can immediately see are two trends: the pulling motion has higher friction than the pushing motion, and the leg pairs toward the back of the spider (L4 being the last) have higher pulling friction.
Another thing to notice is that the groups have different spread around their average, what we call within-group variance. This is somewhat of a problem for the kinds of linear models we will explore below, since we will be assuming that around the population average values, the errors \(\varepsilon_i\) are distributed identically, meaning the same variance within each group. The consequence of ignoring the different variances for the different groups is that comparisons between those groups with small variances will be overly "conservative" (because the overall estimate of variance is larger than an estimate for just these groups), and comparisons between those groups with large variances will be overly confident. If the spread is related to the range of friction, such that groups with large friction values also have larger spread, a possibility is to transform the data with a function such as log or sqrt. This looks like it could be useful here, since three of the four push groups (L1, L2, L3) have the smallest friction values and also the smallest spread.
Some alternative tests for comparing groups without transforming the values first include: t-tests without the equal variance assumption using a “Welch” or “Satterthwaite approximation”, or the Wilcoxon rank sum test mentioned previously. However here, for simplicity of illustration, we will fit a model that assumes equal variance and shows the different kinds of linear model designs using this dataset, setting aside the issue of different within-group variances.
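Purely as an aside (not part of the original analysis), both alternatives are one-liners in R; here is a quick illustration on the L1 leg pair alone:
l1 <- spider[spider$leg == "L1", ]
t.test(friction ~ type, data=l1)        ## Welch t-test: var.equal=FALSE is the default
wilcox.test(friction ~ type, data=l1)   ## rank-based test, no normality assumption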
To remind ourselves how the simple two-group linear model looks, we will subset the data to include only the L1 leg pair, and run lm:
spider.sub <- spider[spider$leg == "L1",]
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)
##
## Call:
## lm(formula = friction ~ type, data = spider.sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33147 -0.10735 -0.04941 -0.00147 0.76853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92147 0.03827 24.078 < 2e-16 ***
## typepush -0.51412 0.05412 -9.499 5.7e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2232 on 66 degrees of freedom
## Multiple R-squared: 0.5776, Adjusted R-squared: 0.5711
## F-statistic: 90.23 on 1 and 66 DF, p-value: 5.698e-14
(coefs <- coef(fit))
## (Intercept) typepush
## 0.9214706 -0.5141176
These two estimated coefficients are the mean of the pull observations (the first estimated coefficient) and the difference between the means of the two groups (the second coefficient). We can show this with R code:
s <- split(spider.sub$friction, spider.sub$type)
mean(s[["pull"]])
## [1] 0.9214706
mean(s[["push"]]) - mean(s[["pull"]])
## [1] -0.5141176
We can form the design matrix, which was used inside lm:
X <- model.matrix(~ type, data=spider.sub)
colnames(X)
## [1] "(Intercept)" "typepush"
head(X)
## (Intercept) typepush
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
tail(X)
## (Intercept) typepush
## 63 1 1
## 64 1 1
## 65 1 1
## 66 1 1
## 67 1 1
## 68 1 1
nrow(X)
## [1] 68
Now we'll make a plot of the \(\mathbf{X}\) matrix by putting a black block for the 1's and a white block for the 0's. This plot will be more interesting for the linear models later on in this script. Along the y-axis is the sample number (the row number of the data) and along the x-axis is the column of the design matrix \(\mathbf{X}\). If you have installed the rafalib library, you can make this plot with the imagemat function:
library(rafalib)
imagemat(X, main="Model matrix for linear model with one variable")
Model matrix for linear model with one variable.
Now we show the coefficient estimates from the linear model in a diagram with arrows (code not shown).
Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the ‘pull’ samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.
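For reference, here is a minimal sketch of how such an arrows diagram could be drawn (this is not the code used for the figure above; the layout constants are arbitrary choices):
stripchart(split(spider.sub$friction, spider.sub$type), vertical=TRUE, pch=1,
           method="jitter", xlim=c(0.5, 2.5), ylim=c(0, 1.2))
a <- -0.25
## green arrow: intercept, from 0 to the pull mean
arrows(1+a, 0, 1+a, coefs[1], lwd=3, col="darkgreen", length=0.1)
## orange arrow: typepush, the push minus pull difference
arrows(2+a, coefs[1], 2+a, coefs[1]+coefs[2], lwd=3, col="orange", length=0.1)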
Now we’ll continue and examine the full dataset, including the observations from all leg pairs. In order to model both the leg pair differences (L1, L2, L3, L4) and the push vs. pull difference, we need to include both terms in the R formula. Let’s see what kind of design matrix will be formed with two variables in the formula:
X <- model.matrix(~ type + leg, data=spider)
colnames(X)
## [1] "(Intercept)" "typepush" "legL2" "legL3" "legL4"
head(X)
## (Intercept) typepush legL2 legL3 legL4
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
nrow(X)
## [1] 282
library(rafalib)
imagemat(X, main="Model matrix for linear model with two factors")
Image of the model matrix for a formula with type + leg
The first column is the intercept, and so it has 1's for all samples. The second column has 1's for the push samples, and we can see that there are four groups of them. Finally, the third, fourth and fifth columns have 1's for the L2, L3 and L4 samples. The L1 samples do not have a column, because L1 is the reference level for leg. Similarly, there is no pull column, because pull is the reference level for the type variable.
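As an aside (not part of the original code), if we preferred a different reference level we could relevel the factor before building the design matrix; a quick hypothetical illustration:
spider2 <- spider
spider2$leg <- relevel(spider2$leg, ref="L2")   ## make L2 the reference leg pair
head(model.matrix(~ type + leg, data=spider2))  ## now there is a legL1 column instead of legL2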
To estimate coefficients for this model, we use lm with the formula ~ type + leg. We'll save the linear model to fitTL, standing for a fit with Type and Leg.
fitTL <- lm(friction ~ type + leg, data=spider)
summary(fitTL)
##
## Call:
## lm(formula = friction ~ type + leg, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46392 -0.13441 -0.00525 0.10547 0.69509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.05392 0.02816 37.426 < 2e-16 ***
## typepush -0.77901 0.02482 -31.380 < 2e-16 ***
## legL2 0.17192 0.04569 3.763 0.000205 ***
## legL3 0.16049 0.03251 4.937 1.37e-06 ***
## legL4 0.28134 0.03438 8.183 1.01e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2084 on 277 degrees of freedom
## Multiple R-squared: 0.7916, Adjusted R-squared: 0.7886
## F-statistic: 263 on 4 and 277 DF, p-value: < 2.2e-16
(coefs <- coef(fitTL))
## (Intercept) typepush legL2 legL3 legL4
## 1.0539153 -0.7790071 0.1719216 0.1604921 0.2813382
R uses the name coefficients to denote the component containing the least squares estimates. It is important to remember that the coefficients are parameters that we do not observe, but only estimate.
The model we are fitting above can be written as
\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \varepsilon_i, i=1,\dots,N \]
with the \(x\) all indicator variables denoting push or pull and which leg pair. For example, a push on leg pair L3 will have \(x_{i,1}\) and \(x_{i,3}\) equal to 1 and the rest would be 0. Throughout this section we will refer to the \(\beta\)s by the effects they represent. For example, we call \(\beta_0\) the intercept, \(\beta_1\) the push effect, \(\beta_2\) the L2 effect, etc. We do not observe the coefficients, e.g. \(\beta_1\), directly, but estimate them with, e.g., \(\hat{\beta}_1\).
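As a concrete check (an illustration added here, using the same formula as below), the design matrix row for the first push observation on leg pair L3 has 1's only in the intercept, typepush and legL3 columns:
X <- model.matrix(~ type + leg, data=spider)
X[which(spider$type == "push" & spider$leg == "L3")[1], ]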
We can now form the matrix \(\mathbf{X}\) depicted above and obtain the least square estimates with:
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
Y <- spider$friction
X <- model.matrix(~ type + leg, data=spider)
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y
t(beta.hat)
## (Intercept) typepush legL2 legL3 legL4
## [1,] 1.053915 -0.7790071 0.1719216 0.1604921 0.2813382
coefs
## (Intercept) typepush legL2 legL3 legL4
## 1.0539153 -0.7790071 0.1719216 0.1604921 0.2813382
We can see that these values agree with the output of lm.
We can make the same plot as before, with arrows for each of the estimated coefficients in the model (code not shown).
Diagram of the estimated coefficients in the linear model. As before, the teal-green arrow represents the Intercept, which fits the mean of the reference group (here, the pull samples for leg L1). The purple, pink, and yellow-green arrows represent differences between the three other leg groups and L1. The orange arrow represents the difference between the push and pull samples for all groups.
In this case, the fitted means for each group, derived from the fitted coefficients, do not line up with those we obtain from simply taking the average from each of the eight possible groups. The reason is that our model uses five coefficients, instead of eight. We are assuming that the effects are additive. However, as we demonstrate in more detail below, this particular dataset is better described with a model including interactions.
s <- split(spider$friction, spider$group)
mean(s[["L1pull"]])
## [1] 0.9214706
coefs[1]
## (Intercept)
## 1.053915
mean(s[["L1push"]])
## [1] 0.4073529
coefs[1] + coefs[2]
## (Intercept)
## 0.2749082
Here we can demonstrate that the push vs. pull estimated coefficient, coefs[2], is a weighted average of the difference of the means for each group. Furthermore, the weighting is determined by the sample size of each group. The math works out simply here because the sample size is equal for the push and pull subgroups within each leg pair. If the sample sizes were not equal for push and pull within each leg pair, the weighting would be more complicated, but still uniquely determined by a formula involving the sample size of each subgroup, the total sample size, and the number of coefficients. This can be worked out from \((\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\).
means <- sapply(s, mean)
##the sample size of push or pull groups for each leg pair
ns <- sapply(s, length)[c(1,3,5,7)]
(w <- ns/sum(ns))
## L1pull L2pull L3pull L4pull
## 0.2411348 0.1063830 0.3687943 0.2836879
sum(w * (means[c(2,4,6,8)] - means[c(1,3,5,7)]))
## [1] -0.7790071
coefs[2]
## typepush
## -0.7790071
Sometimes, the comparison we are interested in is represented directly by a single coefficient in the model, such as the push vs. pull difference, which was coefs[2] above. However, sometimes we want to make a comparison which is not a single coefficient, but a combination of coefficients, which is called a contrast. To introduce the concept of contrasts, first consider the comparisons which we can read off from the linear model summary:
coefs
## (Intercept) typepush legL2 legL3 legL4
## 1.0539153 -0.7790071 0.1719216 0.1604921 0.2813382
Here we have the intercept estimate, the push vs. pull estimated effect across all leg pairs, and the estimates for the L2 vs. L1 effect, the L3 vs. L1 effect, and the L4 vs. L1 effect. What if we want to compare two groups and one of those groups is not L1? The solution to this question is to use contrasts.
A contrast is a linear combination of estimated coefficients: \(\mathbf{c^\top} \hat{\boldsymbol{\beta}}\), where \(\mathbf{c}\) is a column vector with as many rows as the number of coefficients in the linear model. If \(\mathbf{c}\) has a 0 for one or more of its rows, then the corresponding estimated coefficients in \(\hat{\boldsymbol{\beta}}\) are not involved in the contrast.
If we want to compare leg pairs L3 and L2, this is equivalent to contrasting two coefficients from the linear model because, in this contrast, the comparison to the reference level L1 cancels out:
\[ (\mbox{L3} - \mbox{L1}) - (\mbox{L2} - \mbox{L1}) = \mbox{L3} - \mbox{L2 }\]
An easy way to make these contrasts of two groups is to use the contrast function from the contrast package. We just need to specify which groups we want to compare. We have to pick one of the pull or push types, although the answer will not differ, as we will see below.
library(contrast) #Available from CRAN
L3vsL2 <- contrast(fitTL,list(leg="L3",type="pull"),list(leg="L2",type="pull"))
L3vsL2
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## -0.01142949 0.04319685 -0.0964653 0.07360632 -0.26 277 0.7915
The first column, Contrast, gives the L3 vs L2 estimate from the model we fit above.
We can show that the least squares estimate of a linear combination of coefficients is the same linear combination of the least squares estimates. Therefore, the effect size estimate is just the difference between two estimated coefficients. The contrast vector used by contrast is stored as a variable called X within the resulting object (not to be confused with our original \(\mathbf{X}\), the design matrix).
coefs
## (Intercept) typepush legL2 legL3 legL4
## 1.0539153 -0.7790071 0.1719216 0.1604921 0.2813382
coefs[4] - coefs[3]
## legL3
## -0.01142949
(cT <- L3vsL2$X)
## (Intercept) typepush legL2 legL3 legL4
## 1 0 0 -1 1 0
## attr(,"assign")
## [1] 0 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$type
## [1] "contr.treatment"
##
## attr(,"contrasts")$leg
## [1] "contr.treatment"
cT %*% coefs
## [,1]
## 1 -0.01142949
What about the standard error and t-statistic? As before, the t-statistic is the estimate divided by the standard error. The standard error of the contrast estimate is formed by multiplying the contrast vector \(\mathbf{c}\) on either side of the estimated covariance matrix, \(\hat{\Sigma}\), our estimate for \(\mathrm{var}(\hat{\boldsymbol{\beta}})\):
\[ \sqrt{\mathbf{c^\top} \hat{\boldsymbol{\Sigma}} \mathbf{c}} \]
where we saw the covariance of the coefficients earlier:
\[ \boldsymbol{\Sigma} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \]
We estimate \(\sigma^2\) with the sample estimate \(\hat{\sigma}^2\) described above and obtain:
Sigma.hat <- sum(fitTL$residuals^2)/(nrow(X) - ncol(X)) * solve(t(X) %*% X)
signif(Sigma.hat, 2)
## (Intercept) typepush legL2 legL3 legL4
## (Intercept) 0.00079 -3.1e-04 -0.00064 -0.00064 -0.00064
## typepush -0.00031 6.2e-04 0.00000 0.00000 0.00000
## legL2 -0.00064 -6.4e-20 0.00210 0.00064 0.00064
## legL3 -0.00064 -6.4e-20 0.00064 0.00110 0.00064
## legL4 -0.00064 -1.2e-19 0.00064 0.00064 0.00120
sqrt(cT %*% Sigma.hat %*% t(cT))
## 1
## 1 0.04319685
L3vsL2$SE
## [1] 0.04319685
We would have obtained the same result for a contrast of L3 and L2 had we picked type="push". The reason it does not change the contrast is that it leads to addition of the typepush effect on both sides of the difference, which cancels out:
L3vsL2.equiv <- contrast(fitTL,list(leg="L3",type="push"),list(leg="L2",type="push"))
L3vsL2.equiv
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## -0.01142949 0.04319685 -0.0964653 0.07360632 -0.26 277 0.7915
L3vsL2.equiv$X
## (Intercept) typepush legL2 legL3 legL4
## 1 0 0 -1 1 0
## attr(,"assign")
## [1] 0 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$type
## [1] "contr.treatment"
##
## attr(,"contrasts")$leg
## [1] "contr.treatment"
Suppose we have an experiment with two species A and B, and two conditions: control and treated.
species <- factor(c("A","A","B","B"))
condition <- factor(c("control","treated","control","treated"))
And we will use a formula of ‘~ species + condition’.
The model matrix is then:
model.matrix(~ species + condition)
## (Intercept) speciesB conditiontreated
## 1 1 0 0
## 2 1 0 1
## 3 1 1 0
## 4 1 1 1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$species
## [1] "contr.treatment"
##
## attr(,"contrasts")$condition
## [1] "contr.treatment"
Contrasts Exercises #1
Suppose we want to build a contrast of coefficients for the above experimental design.
You can either figure this question out through logic, by looking at the design matrix, or using the contrast() function from the contrast library. If you have not done so already, you should download the contrast library. The contrast vector is returned as contrast(…)$X.
Q: What should the contrast vector be for the contrast of (species=B and condition=control) vs (species=A and condition=treated)? Assume that the beta vector from the model fit by R is: Intercept, speciesB, conditiontreated.
#As you want to compare species B vs A and condition control vs treated, you want +1 for the speciesB coefficient and -1 for the conditiontreated coefficient.
#Another way to find this would be by generating some random y, fitting a model, and using contrast():
y = rnorm(4)
fit = lm(y ~ species + condition)
contrast(fit, list(species="B",condition="control"), list(species="A",condition="treated"))$X
## (Intercept) speciesB conditiontreated
## 1 0 1 -1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$species
## [1] "contr.treatment"
##
## attr(,"contrasts")$condition
## [1] "contr.treatment"
A:0 1 -1
Contrasts Exercises #2
Load the spider dataset like this:
Suppose we build a model using two variables: ~ type + leg.
Q:What is the t-value for the contrast of leg pair L4 vs leg pair L2?
fitTL <- lm(friction ~ type + leg, data=spider)
summary(fitTL)
##
## Call:
## lm(formula = friction ~ type + leg, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46392 -0.13441 -0.00525 0.10547 0.69509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.05392 0.02816 37.426 < 2e-16 ***
## typepush -0.77901 0.02482 -31.380 < 2e-16 ***
## legL2 0.17192 0.04569 3.763 0.000205 ***
## legL3 0.16049 0.03251 4.937 1.37e-06 ***
## legL4 0.28134 0.03438 8.183 1.01e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2084 on 277 degrees of freedom
## Multiple R-squared: 0.7916, Adjusted R-squared: 0.7886
## F-statistic: 263 on 4 and 277 DF, p-value: < 2.2e-16
(coefs <- coef(fitTL))
## (Intercept) typepush legL2 legL3 legL4
## 1.0539153 -0.7790071 0.1719216 0.1604921 0.2813382
library(contrast) #Available from CRAN
L4vsL2 <- contrast(fitTL,list(leg="L4",type="pull"),list(leg="L2",type="pull"))
L4vsL2$testStat
## 1
## 2.451974
A:2.451974
The t-value for the contrast of leg pair L4 vs leg pair L2 is constructed by taking the difference of the coefficients legL4 and legL2, and then dividing by the standard error of the difference. In the last question we will explore how the standard error of the difference is calculated here.
X <- model.matrix(~ type + leg, data=spider)
(Sigma <- sum(fitTL$residuals^2)/(nrow(X) - ncol(X)) * solve(t(X) %*% X))
## (Intercept) typepush legL2 legL3
## (Intercept) 0.0007929832 -3.081306e-04 -0.0006389179 -0.0006389179
## typepush -0.0003081306 6.162612e-04 0.0000000000 0.0000000000
## legL2 -0.0006389179 -6.439411e-20 0.0020871318 0.0006389179
## legL3 -0.0006389179 -6.439411e-20 0.0006389179 0.0010566719
## legL4 -0.0006389179 -1.191291e-19 0.0006389179 0.0006389179
## legL4
## (Intercept) -0.0006389179
## typepush 0.0000000000
## legL2 0.0006389179
## legL3 0.0006389179
## legL4 0.0011819981
#Our contrast matrix is:
C <- matrix(c(0,0,-1,0,1),1,5)
Contrasts Exercises #3
Using Sigma, what is Cov(beta-hat_L4, beta-hat_L2)?
Sigma[3,5]
## [1] 0.0006389179
A:0.0006389179
In the previous linear model, we assumed that the push vs. pull effect was the same for all of the leg pairs (the same orange arrow). You can easily see that this does not capture the trends in the data that well. That is, the tips of the arrows did not line up perfectly with the group averages. For the L1 leg pair, the push vs. pull estimated coefficient was too large, and for the L3 leg pair, the push vs. pull coefficient was somewhat too small.
Interaction terms will help us overcome this problem by introducing additional coefficients to compensate for differences in the push vs. pull effect across the 4 groups. As we already have a push vs. pull term in the model, we only need to add three more terms to have the freedom to find leg-pair-specific push vs. pull differences. As we will see, interaction terms are added to the design matrix by multiplying the columns of the design matrix representing existing terms.
We can rebuild our linear model with an interaction between type and leg, by including an extra term in the formula, type:leg. The : symbol adds an interaction between the two variables surrounding it. An equivalent way to specify this model is ~ type*leg, which will expand to the formula ~ type + leg + type:leg, with main effects for type and leg and an interaction term type:leg.
X <- model.matrix(~ type + leg + type:leg, data=spider)
#Equivalently
X <- model.matrix(~ type*leg, data=spider)
colnames(X)
## [1] "(Intercept)" "typepush" "legL2" "legL3"
## [5] "legL4" "typepush:legL2" "typepush:legL3" "typepush:legL4"
head(X)
## (Intercept) typepush legL2 legL3 legL4 typepush:legL2 typepush:legL3
## 1 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0
## 5 1 0 0 0 0 0 0
## 6 1 0 0 0 0 0 0
## typepush:legL4
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
imagemat(X, main="Model matrix for linear model with interactions")
Image of model matrix with interactions.
Columns 6-8 (typepush:legL2, typepush:legL3, and typepush:legL4) are the product of the 2nd column (typepush) and columns 3-5 (the three leg columns). Looking at the last column, for example, the typepush:legL4 column adds an extra coefficient \(\beta_{\textrm{push,L4}}\) to those samples which are both push samples and leg pair L4 samples. This accounts for a possible difference when the mean of the samples in the L4-push group is not at the location which would be predicted by adding the estimated intercept, the estimated push coefficient typepush, and the estimated L4 coefficient legL4.
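We can verify directly that these interaction columns are element-wise products of the corresponding main-effect columns (a quick check added here, not shown in the original text):
all(X[, "typepush:legL4"] == X[, "typepush"] * X[, "legL4"])
## [1] TRUE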
We can run the linear model using the same code as before:
fitX <- lm(friction ~ type + leg + type:leg, data=spider)
summary(fitX)
##
## Call:
## lm(formula = friction ~ type + leg + type:leg, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46385 -0.10735 -0.01111 0.07848 0.76853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92147 0.03266 28.215 < 2e-16 ***
## typepush -0.51412 0.04619 -11.131 < 2e-16 ***
## legL2 0.22386 0.05903 3.792 0.000184 ***
## legL3 0.35238 0.04200 8.390 2.62e-15 ***
## legL4 0.47928 0.04442 10.789 < 2e-16 ***
## typepush:legL2 -0.10388 0.08348 -1.244 0.214409
## typepush:legL3 -0.38377 0.05940 -6.461 4.73e-10 ***
## typepush:legL4 -0.39588 0.06282 -6.302 1.17e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1904 on 274 degrees of freedom
## Multiple R-squared: 0.8279, Adjusted R-squared: 0.8235
## F-statistic: 188.3 on 7 and 274 DF, p-value: < 2.2e-16
coefs <- coef(fitX)
Here is where the plot with arrows really helps us interpret the coefficients. The estimated interaction coefficients (the yellow, brown and silver arrows) allow leg-pair-specific differences in the push vs. pull difference. The orange arrow now represents the estimated push vs. pull difference only for the reference leg pair, which is L1. If an estimated interaction coefficient is large, this means that the push vs. pull difference for that leg pair is very different than the push vs. pull difference in the reference leg pair.
Now, as we have eight coefficients in the model and eight groups, you can check that the tips of the arrowheads are exactly equal to the group means (code not shown).
Diagram of the estimated coefficients in the linear model. In the design with interaction terms, the orange arrow now indicates the push vs pull difference only for the reference group (L1), while three new arrows (yellow, brown and grey) indicate the additional push vs pull differences in the non-reference groups (L2, L3 and L4) with respect to the reference group.
Again we will show how to combine estimated coefficients from the model using contrasts. For some simple cases, we can use the contrast package. Suppose we want to know the push vs. pull effect for the L2 leg pair samples. We can see from the arrow plot that this is the orange arrow plus the yellow arrow. We can also specify this comparison with the contrast function:
library(contrast) ##Available from CRAN
L2push.vs.pull <- contrast(fitX,
list(leg="L2", type = "push"),
list(leg="L2", type = "pull"))
L2push.vs.pull
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## -0.618 0.0695372 -0.7548951 -0.4811049 -8.89 274 0
coefs[2] + coefs[6] ##we know this is also orange + yellow arrow
## typepush
## -0.618
The question of whether the push vs. pull difference is different in L2 compared to L1 is answered by a single term in the model: the typepush:legL2 estimated coefficient, corresponding to the yellow arrow in the plot. A p-value for whether this coefficient is actually equal to zero can be read off from the table printed with summary(fitX) above. Similarly, we can read off the p-values for the differences of differences for L3 vs L1 and for L4 vs L1.
Suppose we want to know if the push vs. pull difference is different in L3 compared to L2. By examining the arrows in the diagram above, we can see that the push vs. pull effect for a leg pair other than L1 is the typepush arrow plus the interaction term for that leg pair.
If we work out the math for comparing across two non-reference leg pairs, this is:
\[ (\mbox{typepush} + \mbox{typepush:legL3}) - (\mbox{typepush} + \mbox{typepush:legL2}) \]
…which simplifies to:
\[ = \mbox{typepush:legL3} - \mbox{typepush:legL2} \]
We can't make this contrast using the contrast function shown before, but we can make this comparison using the glht (for "general linear hypothesis test") function from the multcomp package. We need to form a 1-row matrix which has a -1 for the typepush:legL2 coefficient and a +1 for the typepush:legL3 coefficient. We provide this matrix to the linfct (linear function) argument, and obtain a summary table for this contrast of estimated interaction coefficients.
Note that there are other ways to perform contrasts using base R, and this is just our preferred way.
library(multcomp) ##Available from CRAN
C <- matrix(c(0,0,0,0,0,-1,1,0), 1)
L3vsL2interaction <- glht(fitX, linfct=C)
summary(L3vsL2interaction)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = friction ~ type + leg + type:leg, data = spider)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.27988 0.07893 -3.546 0.00046 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
coefs[7] - coefs[6] ##we know this is also brown - yellow
## typepush:legL3
## -0.2798846
Suppose that we want to know if the push vs. pull difference is different across leg pairs in general. We do not want to compare any two leg pairs in particular, but rather we want to know if the three interaction terms which represent differences in the push vs. pull difference across leg pairs are larger than we would expect them to be if the push vs pull difference was in fact equal across all leg pairs.
Such a question can be answered by an analysis of variance, which is often abbreviated as ANOVA. ANOVA compares the reduction in the sum of squares of the residuals for models of different complexity. The model with eight coefficients is more complex than the model with five coefficients where we assumed the push vs pull difference was equal across leg pairs. The least complex model would only use a single coefficient, an intercept. Under certain assumptions we can also perform inference that determines the probability of improvements as large as what we observed. Let’s first print the result of an ANOVA in R and then examine the results in detail:
anova(fitX)
## Analysis of Variance Table
##
## Response: friction
## Df Sum Sq Mean Sq F value Pr(>F)
## type 1 42.783 42.783 1179.713 < 2.2e-16 ***
## leg 3 2.921 0.974 26.847 2.972e-15 ***
## type:leg 3 2.098 0.699 19.282 2.256e-11 ***
## Residuals 274 9.937 0.036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first line tells us that adding the variable type (push or pull) to the design is very useful (it reduces the sum of squared residuals) compared to a model with only an intercept. We can see that it is useful, because this single coefficient reduces the sum of squares by 42.783. The original sum of squares of the model with just an intercept is:
mu0 <- mean(spider$friction)
(initial.ss <- sum((spider$friction - mu0)^2))
## [1] 57.73858
Note that this initial sum of squares is just a scaled version of the sample variance:
N <- nrow(spider)
(N - 1) * var(spider$friction)
## [1] 57.73858
Let's see exactly how we get this 42.783. We need to calculate the sum of squared residuals for the model with only the type information. We can do this by calculating the residuals, squaring them, summing them within groups, and then summing across the groups.
s <- split(spider$friction, spider$type)
after.type.ss <- sum( sapply(s, function(x) {
residual <- x - mean(x)
sum(residual^2)
}) )
The reduction in sum of squared residuals from introducing the type coefficient is therefore:
(type.ss <- initial.ss - after.type.ss)
## [1] 42.78307
Through simple arithmetic, this reduction can be shown to be equivalent to the sum of squared differences between the fitted values for the models with formulas ~type and ~1:
sum(sapply(s, length) * (sapply(s, mean) - mu0)^2)
## [1] 42.78307
Keep in mind that the order of terms in the formula, and therefore rows in the ANOVA table, is important: each row considers the reduction in the sum of squared residuals after adding coefficients compared to the model in the previous row.
The other columns in the ANOVA table show the "degrees of freedom" associated with each row. As the type variable introduced only one term in the model, the Df column has a 1. Because the leg variable introduced three terms in the model (legL2, legL3 and legL4), the Df column has a 3.
Finally, there is a column which lists the F value. The F value is the mean of squares for the inclusion of the terms of interest (the sum of squares divided by the degrees of freedom) divided by the mean squared residuals (from the bottom row):
\[ r_i = Y_i - \hat{Y}_i \]
\[ \mbox{Mean Sq Residuals} = \frac{1}{N - p} \sum_{i=1}^N r_i^2 \]
where \(p\) is the number of coefficients in the model (here eight, including the intercept term).
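As a quick check (an addition to the original text), we can recompute this residual mean square directly from the interaction fit; it should match the Residuals row of the ANOVA table above (about 0.036):
p <- length(coef(fitX))                     ## 8 coefficients, including the intercept
sum(fitX$residuals^2) / (nrow(spider) - p)  ## roughly 0.036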
Under the null hypothesis (the true value of the additional coefficient(s) is 0), we have a theoretical result for what the distribution of the F value will be for each row. The assumptions needed for this approximation to hold are similar to those of the t-distribution approximation we described in earlier chapters. We either need a large sample size so that the CLT applies, or we need the population data to follow a normal distribution.
As an example of how one interprets these p-values, let's take the last row, type:leg, which specifies the three interaction coefficients. Under the null hypothesis that the true value for these three additional terms is actually 0, e.g. \(\beta_{\textrm{push,L2}} = 0, \beta_{\textrm{push,L3}} = 0, \beta_{\textrm{push,L4}} = 0\), we can calculate the chance of seeing such a large F-value for this row of the ANOVA table. Remember that we are only concerned with large values here: because the F-value is a ratio of sums of squares, it can only be positive. The p-value in the last column for the type:leg row can be interpreted as follows: under the null hypothesis that there are no differences in the push vs. pull difference across leg pairs, this is the probability of the estimated interaction coefficients explaining so much of the observed variance. If this p-value is small, we would consider rejecting the null hypothesis that the push vs. pull difference is the same across leg pairs.
The F distribution has two parameters: one for the degrees of freedom of the numerator (the terms of interest) and one for the denominator (the residuals). In the case of the interaction coefficients row, these are 3, the number of interaction coefficients, and 274, the number of samples minus the total number of coefficients.
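As an illustration (not shown in the original text), we can reproduce the p-value in the type:leg row from its F value and these two degrees of freedom:
## upper tail of the F(3, 274) distribution at the observed F value
pf(19.282, df1=3, df2=274, lower.tail=FALSE)  ## approximately 2.3e-11, matching the table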
Remember, you can check the book page for interactions and contrasts here.
Start by loading the spider dataset:
# url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv"
# filename <- "spider_wolff_gorb_2013.csv"
# library(downloader)
# if (!file.exists(filename)) download(url, filename)
# spider <- read.csv(filename, skip=1)
head(spider)
## leg type friction group
## 1 L1 pull 0.90 L1pull
## 2 L1 pull 0.91 L1pull
## 3 L1 pull 0.86 L1pull
## 4 L1 pull 0.85 L1pull
## 5 L1 pull 0.80 L1pull
## 6 L1 pull 0.87 L1pull
#Suppose that we notice that the within-group variances for the groups with smaller frictional coefficients are generally smaller, and so we try to apply a transformation to the frictional coefficients to make the within-group variances more constant.
#Add a new variable log2friction to the spider dataframe:
spider$log2friction <- log2(spider$friction)
head(spider)
## leg type friction group log2friction
## 1 L1 pull 0.90 L1pull -0.1520031
## 2 L1 pull 0.91 L1pull -0.1360615
## 3 L1 pull 0.86 L1pull -0.2175914
## 4 L1 pull 0.85 L1pull -0.2344653
## 5 L1 pull 0.80 L1pull -0.3219281
## 6 L1 pull 0.87 L1pull -0.2009127
#The 'Y' values now look like:
boxplot(log2friction ~ type*leg, data=spider)
#Run a linear model of log2friction with type, leg and interactions between type and leg.
fitln <- lm(log2friction ~ type*leg, data=spider)
summary(fitln)
##
## Call:
## lm(formula = log2friction ~ type * leg, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35902 -0.19193 0.00596 0.16315 1.33090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.16828 0.06613 -2.545 0.011487 *
## typepush -1.20656 0.09352 -12.901 < 2e-16 ***
## legL2 0.34681 0.11952 2.902 0.004014 **
## legL3 0.48999 0.08505 5.762 2.24e-08 ***
## legL4 0.64668 0.08995 7.189 6.20e-12 ***
## typepush:legL2 0.09967 0.16903 0.590 0.555906
## typepush:legL3 -0.54075 0.12027 -4.496 1.02e-05 ***
## typepush:legL4 -0.46920 0.12721 -3.689 0.000272 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3856 on 274 degrees of freedom
## Multiple R-squared: 0.8125, Adjusted R-squared: 0.8077
## F-statistic: 169.6 on 7 and 274 DF, p-value: < 2.2e-16
Interactions Exercises #1
What is the t-value for the interaction of type push and leg L4? If this t-value is sufficiently large, we would reject the null hypothesis that the push vs pull effect on log2(friction) is the same in L4 as in L1.
summary(fitln)
##
## Call:
## lm(formula = log2friction ~ type * leg, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35902 -0.19193 0.00596 0.16315 1.33090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.16828 0.06613 -2.545 0.011487 *
## typepush -1.20656 0.09352 -12.901 < 2e-16 ***
## legL2 0.34681 0.11952 2.902 0.004014 **
## legL3 0.48999 0.08505 5.762 2.24e-08 ***
## legL4 0.64668 0.08995 7.189 6.20e-12 ***
## typepush:legL2 0.09967 0.16903 0.590 0.555906
## typepush:legL3 -0.54075 0.12027 -4.496 1.02e-05 ***
## typepush:legL4 -0.46920 0.12721 -3.689 0.000272 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3856 on 274 degrees of freedom
## Multiple R-squared: 0.8125, Adjusted R-squared: 0.8077
## F-statistic: 169.6 on 7 and 274 DF, p-value: < 2.2e-16
A:-3.689
Interactions Exercises #2
What is the F-value for all of the type:leg interaction terms, in an analysis of variance? If this value is sufficiently large, we would reject the null hypothesis that the push vs pull effect on log2(friction) is the same for all leg pairs.
anova(fitln)
## Analysis of Variance Table
##
## Response: log2friction
## Df Sum Sq Mean Sq F value Pr(>F)
## type 1 164.709 164.709 1107.714 < 2.2e-16 ***
## leg 3 7.065 2.355 15.838 1.589e-09 ***
## type:leg 3 4.774 1.591 10.701 1.130e-06 ***
## Residuals 274 40.742 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A:10.70
Interactions Exercises #3
What is the L2 vs L1 estimate in log2friction for the pull samples?
contrast(fitln, list(type="pull",leg="L2"), list(type="pull",leg="L1"))
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## 0.3468125 0.1195246 0.1115092 0.5821157 2.9 274 0.004
#which is simply the legL2 coefficient:
coef(fitln)["legL2"]
## legL2
## 0.3468125
A:0.34681
Interactions Exercises #4
What is the L2 vs L1 estimate in log2friction for the push samples? Remember, because of the interaction terms, this is not the same as the L2 vs L1 difference for the pull samples. If you’re not sure use the contrast() function. Another hint: consider the arrows plot for the model with interactions.
contrast(fitln, list(type="push",leg="L2"), list(type="push",leg="L1"))
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## 0.4464843 0.1195246 0.211181 0.6817875 3.74 274 2e-04
#which is the legL2 coefficient plus the push:L2 interaction:
coef(fitln)["legL2"] + coef(fitln)["typepush:legL2"]
## legL2
## 0.4464843
A:0.4464843
Note that taking the log2 of a Y value and then fitting a linear model has a meaningful effect on the coefficients. If we have
\[ \log_2(Y_1) = \beta_0 \]
and
\[ \log_2(Y_2) = \beta_0 + \beta_1 \]
then
\[ Y_2/Y_1 = 2^{\beta_0 + \beta_1} / 2^{\beta_0} = 2^{\beta_1} \]
So \(\beta_1\) represents a log2 fold change of \(Y_2\) over \(Y_1\). If \(\beta_1 = 1\), then \(Y_2\) is 2 times \(Y_1\). If \(\beta_1 = -1\), then \(Y_2\) is half of \(Y_1\), etc.
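For instance (a small addition using the fit above), the legL2 coefficient can be converted back to a fold change on the original friction scale:
2^coef(fitln)["legL2"]  ## about 1.27: L2 pull friction is roughly 27% higher than L1 pull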
In the video we briefly mentioned the analysis of variance (or ANOVA, performed in R using the anova() function), which allows us to test whether a number of coefficients are equal to zero, by comparing a linear model including these terms to a linear model where these terms are set to 0.
The book page for this section has a section, “Testing all differences of differences”, which explains the ANOVA concept and the F-test in some more detail. You can read over that section before or after the following question.
In this last question, we will use Monte Carlo techniques to observe the distribution of the ANOVA’s “F-value” under the null hypothesis, that there are no differences between groups.
Suppose we have 4 groups, and 10 samples per group, so 40 samples overall:
N <- 40
p <- 4
group <- factor(rep(1:p,each=N/p))
X <- model.matrix(~ group)
X
## (Intercept) group2 group3 group4
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 1 0 0 0
## 6 1 0 0 0
## 7 1 0 0 0
## 8 1 0 0 0
## 9 1 0 0 0
## 10 1 0 0 0
## 11 1 1 0 0
## 12 1 1 0 0
## 13 1 1 0 0
## 14 1 1 0 0
## 15 1 1 0 0
## 16 1 1 0 0
## 17 1 1 0 0
## 18 1 1 0 0
## 19 1 1 0 0
## 20 1 1 0 0
## 21 1 0 1 0
## 22 1 0 1 0
## 23 1 0 1 0
## 24 1 0 1 0
## 25 1 0 1 0
## 26 1 0 1 0
## 27 1 0 1 0
## 28 1 0 1 0
## 29 1 0 1 0
## 30 1 0 1 0
## 31 1 0 0 1
## 32 1 0 0 1
## 33 1 0 0 1
## 34 1 0 0 1
## 35 1 0 0 1
## 36 1 0 0 1
## 37 1 0 0 1
## 38 1 0 0 1
## 39 1 0 0 1
## 40 1 0 0 1
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
We will show here how to calculate the "F-value", and then we will use random numbers to observe the distribution of the F-value under the null hypothesis.
The F-value is the mean sum of squares explained by the terms of interest (in our case, the ‘group’ terms) divided by the mean sum of squares of the residuals of a model including the terms of interest. So it is the explanatory power of the terms divided by the leftover variance.
Intuitively, if this number is large, it means that the group variable explains a lot of the variance in the data, compared to the amount of variance left in the data after using group information. We will calculate these values exactly here:
First generate some random, null data, where the mean is the same for all groups:
Y <- rnorm(N,mean=42,7)
The base model we will compare against is simply Y-hat = mean(Y), which we will call mu0, and the initial sum of squares is the sum of the squared differences between the Y values and mu0:
mu0 <- mean(Y)
initial.ss <- sum((Y - mu0)^2)
We then need to calculate the fitted values for each group, which is simply the mean of each group, and the residuals from this model, which we will call “after.group.ss” for the sum of squares after using the group information:
s <- split(Y, group)
after.group.ss <- sum(sapply(s, function(x) sum((x - mean(x))^2)))
#Then the explanatory power of the group variable is the initial sum of squares minus the residual sum of squares:
(group.ss <- initial.ss - after.group.ss)
## [1] 28.1478
We calculate the means of these sums of squares by dividing by terms which account for the number of fitted parameters. For the group sum of squares, this is the number of parameters used to fit the groups (3, because the intercept is in the initial model). For the after-group sum of squares, this is the number of samples minus the total number of parameters (so N - 4, including the intercept).
group.ms <- group.ss / (p - 1)
after.group.ms <- after.group.ss / (N - p)
#The F-value is simply the ratio of these mean sums of squares.
(f.value <- group.ms / after.group.ms)
## [1] 0.1701274
What’s the point of all these calculations? The point is that, after following these steps, the exact distribution of the F-value has a nice mathematical formula under the null hypothesis. We will see this below.
Interactions Exercises #5
Set the seed to 1 with set.seed(1), then calculate the F-value for 1000 random versions of Y. What is the mean of these F-values?
set.seed(1)
Fs = replicate(1000, {
Y = rnorm(N,mean=42,7)
mu0 = mean(Y)
initial.ss = sum((Y - mu0)^2)
s = split(Y, group)
after.group.ss = sum(sapply(s, function(x) sum((x - mean(x))^2)))
(group.ss = initial.ss - after.group.ss)
group.ms = group.ss / (p - 1)
after.group.ms = after.group.ss / (N - p)
f.value = group.ms / after.group.ms
return(f.value)
})
mean(Fs)
## [1] 1.069771
A:1.069771
#Plot the distribution of the 1000 F-values:
hist(Fs, col="grey", border="white", breaks=50, freq=FALSE)
#Overlay the theoretical F-distribution, with parameters df1=p - 1, df2=N - p.
xs <- seq(from=0,to=6,length=100)
lines(xs, df(xs, df1 = p - 1, df2 = N - p), col="red")
#This is the distribution which is used to calculate the p-values for the ANOVA table produced by anova().
Now we show an alternate specification of the same model, wherein we assume that each combination of type and leg has its own mean value (and so that the push vs. pull effect is not the same for each leg pair). This specification is in some ways simpler, as we will see, but it does not allow us to build the ANOVA table as above, because it does not split interaction coefficients out in the same way.
We start by constructing a factor variable with a level for each unique combination of type and leg. We include a 0 + in the formula because we do not want to include an intercept in the model matrix.
##earlier, we defined the 'group' column:
spider$group <- factor(paste0(spider$leg, spider$type))
X <- model.matrix(~ 0 + group, data=spider)
colnames(X)
## [1] "groupL1pull" "groupL1push" "groupL2pull" "groupL2push" "groupL3pull"
## [6] "groupL3push" "groupL4pull" "groupL4push"
head(X)
## groupL1pull groupL1push groupL2pull groupL2push groupL3pull groupL3push
## 1 1 0 0 0 0 0
## 2 1 0 0 0 0 0
## 3 1 0 0 0 0 0
## 4 1 0 0 0 0 0
## 5 1 0 0 0 0 0
## 6 1 0 0 0 0 0
## groupL4pull groupL4push
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
imagemat(X, main="Model matrix for linear model with group variable")
Image of model matrix for linear model with group variable. This model, also with eight terms, gives a unique fitted value for each combination of type and leg.
We can run the linear model with the familiar call:
fitG <- lm(friction ~ 0 + group, data=spider)
summary(fitG)
##
## Call:
## lm(formula = friction ~ 0 + group, data = spider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46385 -0.10735 -0.01111 0.07848 0.76853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## groupL1pull 0.92147 0.03266 28.21 <2e-16 ***
## groupL1push 0.40735 0.03266 12.47 <2e-16 ***
## groupL2pull 1.14533 0.04917 23.29 <2e-16 ***
## groupL2push 0.52733 0.04917 10.72 <2e-16 ***
## groupL3pull 1.27385 0.02641 48.24 <2e-16 ***
## groupL3push 0.37596 0.02641 14.24 <2e-16 ***
## groupL4pull 1.40075 0.03011 46.52 <2e-16 ***
## groupL4push 0.49075 0.03011 16.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1904 on 274 degrees of freedom
## Multiple R-squared: 0.96, Adjusted R-squared: 0.9588
## F-statistic: 821 on 8 and 274 DF, p-value: < 2.2e-16
coefs <- coef(fitG)
Now we have eight arrows, one for each group. The arrow tips align directly with the mean of each group:
Diagram of the estimated coefficients in the linear model, with each term representing the mean of a combination of type and leg.
While we cannot perform an ANOVA with this formulation, we can easily contrast the estimated coefficients for individual groups using the contrast function:
groupL2push.vs.pull <- contrast(fitG,
list(group = "L2push"),
list(group = "L2pull"))
groupL2push.vs.pull
## lm model parameter contrast
##
## Contrast S.E. Lower Upper t df Pr(>|t|)
## 1 -0.618 0.0695372 -0.7548951 -0.4811049 -8.89 274 0
coefs[4] - coefs[3]
## groupL2push
## -0.618
We can also make pair-wise comparisons of the estimated push vs. pull difference across leg pairs. For example, if we want to compare the push vs. pull difference in leg pair L3 vs. leg pair L2:
\[ (\mbox{L3push} - \mbox{L3pull}) - (\mbox{L2push} - \mbox{L2pull}) \]
\[ = \mbox{L3push} + \mbox{L2pull} - \mbox{L3pull} - \mbox{L2push} \]
C <- matrix(c(0,0,1,-1,-1,1,0,0), 1)
library(multcomp)
groupL3vsL2interaction <- glht(fitG, linfct=C)
summary(groupL3vsL2interaction)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = friction ~ 0 + group, data = spider)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.27988 0.07893 -3.546 0.00046 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
names(coefs)
## [1] "groupL1pull" "groupL1push" "groupL2pull" "groupL2push" "groupL3pull"
## [6] "groupL3push" "groupL4pull" "groupL4push"
(coefs[6] - coefs[5]) - (coefs[4] - coefs[3])
## groupL3push
## -0.2798846
If an experiment is designed incorrectly we may not be able to estimate the parameters of interest. Similarly, when analyzing data we may incorrectly decide to use a model that can’t be fit. If we are using linear models then we can detect these problems mathematically by looking for collinearity in the design matrix.
The following system of equations:
\[ \begin{align*} a+c &=1\\ b-c &=1\\ a+b &=2 \end{align*} \]
has more than one solution since there are an infinite number of triplets that satisfy \(a=1-c, b=1+c\). Two examples are \(a=1,b=1,c=0\) and \(a=0,b=2,c=1\).
The system of equations above can be written like this:
\[ \, \begin{pmatrix} 1&0&1\\ 0&1&-1\\ 1&1&0\\ \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ 2 \end{pmatrix} \]
Note that the third column is a linear combination of the first two:
\[ \, \begin{pmatrix} 1\\ 0\\ 1 \end{pmatrix} + -1 \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} = \begin{pmatrix} 1\\ -1\\ 0 \end{pmatrix} \]
We say that the third column is collinear with the first 2. This implies that the system of equations can be written like this:
\[ \, \begin{pmatrix} 1&0&1\\ 0&1&-1\\ 1&1&0 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = a \begin{pmatrix} 1\\ 0\\ 1 \end{pmatrix} + b \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} + c \begin{pmatrix} 1-0\\ 0-1\\ 1-1 \end{pmatrix} \]
\[ =(a+c) \begin{pmatrix} 1\\ 0\\ 1\\ \end{pmatrix} + (b-c) \begin{pmatrix} 0\\ 1\\ 1\\ \end{pmatrix} \]
The third column does not add a constraint and what we really have are three equations and two unknowns: \(a+c\) and \(b-c\). Once we have values for those two quantities, there is an infinite number of triplets that can be used.
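We can confirm this in R (a check added here; the qr function used below is discussed later in this section): the matrix of the system above has rank 2, not 3.
A <- cbind(c(1,0,1), c(0,1,1), c(1,-1,0))  ## the three columns of the system above
qr(A)$rank  ## 2: the third column is a linear combination of the first two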
Consider a design matrix \(\mathbf{X}\) with two collinear columns. Here we create an extreme example in which one column is the opposite of another:
\[ \mathbf{X} = \begin{pmatrix} \mathbf{1}&\mathbf{X}_1&\mathbf{X}_2&\mathbf{X}_3\\ \end{pmatrix} \mbox{ with, say, } \mathbf{X}_3 = - \mathbf{X}_2 \]
This means that we can rewrite the residuals like this:
\[ \mathbf{Y}- \left\{ \mathbf{1}\beta_0 + \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 + \mathbf{X}_3\beta_3\right\}\\ = \mathbf{Y}- \left\{ \mathbf{1}\beta_0 + \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 - \mathbf{X}_2\beta_3\right\}\\ = \mathbf{Y}- \left\{\mathbf{1}\beta_0 + \mathbf{X}_1 \beta_1 + \mathbf{X}_2(\beta_2 - \beta_3)\right\} \]
and if \(\hat{\beta}_1\), \(\hat{\beta}_2\), \(\hat{\beta}_3\) is a least squares solution, then, for example, \(\hat{\beta}_1\), \(\hat{\beta}_2+1\), \(\hat{\beta}_3+1\) is also a solution.
Now we will demonstrate how collinearity helps us determine problems with our design using one of the most common errors made in current experimental design: confounding. To illustrate, let’s use an imagined experiment in which we are interested in the effect of four treatments A, B, C and D. We assign two mice to each treatment. After starting the experiment by giving A and B to female mice, we realize there might be a sex effect. We decide to give C and D to males with hopes of estimating this effect. But can we estimate the sex effect? The described design implies the following design matrix:
\[ \, \begin{pmatrix} Sex & A & B & C & D\\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 1\\ \end{pmatrix} \]
Here we can see that sex and treatment are confounded. Specifically, the sex column can be written as a linear combination of the C and D columns.
\[ \, \begin{pmatrix} Sex \\ 0\\ 0 \\ 0 \\ 0 \\ 1\\ 1\\ 1 \\ 1 \\ \end{pmatrix} = \begin{pmatrix} C \\ 0\\ 0\\ 0\\ 0\\ 1\\ 1\\ 0\\ 0\\ \end{pmatrix} + \begin{pmatrix} D \\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 1\\ 1\\ \end{pmatrix} \]
This implies that a unique least squares estimate is not achievable.
The rank of a matrix is the number of columns that are linearly independent of all the others. If the rank is smaller than the number of columns, then the LSE is not unique. In R, we can obtain the rank of a matrix with the function qr, which we will describe in more detail in a following section.
Sex <- c(0,0,0,0,1,1,1,1)
A <- c(1,1,0,0,0,0,0,0)
B <- c(0,0,1,1,0,0,0,0)
C <- c(0,0,0,0,1,1,0,0)
D <- c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)
X
## Sex A B C D
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## 3 0 0 1 0 0
## 4 0 0 1 0 0
## 5 1 0 0 1 0
## 6 1 0 0 1 0
## 7 1 0 0 0 1
## 8 1 0 0 0 1
## attr(,"assign")
## [1] 1 2 3 4 5
cat("ncol=",ncol(X),"rank=", qr(X)$rank,"\n")
## ncol= 5 rank= 4
Here we will not be able to estimate the effect of sex.
This particular experiment could have been designed better. Using the same number of male and female mice, we can easily design an experiment that allows us to compute the sex effect as well as all the treatment effects. Specifically, when we balance sex and treatments, the confounding is removed as demonstrated by the fact that the rank is now the same as the number of columns:
Sex <- c(0,1,0,1,0,1,0,1)
A <- c(1,1,0,0,0,0,0,0)
B <- c(0,0,1,1,0,0,0,0)
C <- c(0,0,0,0,1,1,0,0)
D <- c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)
cat("ncol=",ncol(X),"rank=", qr(X)$rank,"\n")
## ncol= 5 rank= 5
To answer the first question, consider the following design matrices:
Collinearity Exercises #1
Which of the above design matrices does NOT have the problem of collinearity?
#You can check in R, the rank of the E matrix is equal to the number of columns, so all of the columns are independent.
m = matrix(c(1,1,1,1,0,0,1,1,0,1,0,1,0,0,0,1),4,4)
m
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 1 0 1 0
## [3,] 1 1 0 0
## [4,] 1 1 1 1
qr(m)$rank
## [1] 4
A:E
Let’s use the example from the lecture to visualize how there is not a single best beta-hat, when the design matrix has collinearity of columns.
An example can be made with:
sex <- factor(rep(c("female","male"),each=4))
trt <- factor(c("A","A","B","B","C","C","D","D"))
#The model matrix can then be formed with:
X <- model.matrix( ~ sex + trt)
#And we can see that the number of independent columns is less than the number of columns of X:
qr(X)$rank
## [1] 4
#Suppose we observe some outcome, Y. For simplicity we will use synthetic data:
Y <- 1:8
Now we will fix the values of two betas and optimize the remaining betas. We will fix \(\beta_{male}\) and \(\beta_D\), and then find the values of the remaining betas that minimize \(\sum_i (Y_i - (\mathbf{X}\boldsymbol{\beta})_i)^2\).
The optimal values for the other betas are those that minimize
\[ \sum \left( (Y - X_{male}\,\beta_{male} - X_D\,\beta_D) - X_R\,\boldsymbol{\beta}_R \right)^2 \]
where \(X_{male}\) is the male column of the design matrix, \(X_D\) is the D column, and \(X_R\) holds the remaining columns.
So all we need to do is redefine \(Y^* = Y - X_{male}\,\beta_{male} - X_D\,\beta_D\) and fit a linear model. The following line of code creates this variable \(Y^*\), after fixing \(\beta_{male}\) to a value a, and \(\beta_D\) to a value b:
makeYstar <- function(a,b) Y - X[,2] * a - X[,5] * b
#Now we'll construct a function which, for a given value a and b, gives us back the the sum of squared residuals after fitting the other terms.
fitTheRest <- function(a,b) {
Ystar <- makeYstar(a,b)
Xrest <- X[,-c(2,5)]
betarest <- solve(t(Xrest) %*% Xrest) %*% t(Xrest) %*% Ystar
residuals <- Ystar - Xrest %*% betarest
sum(residuals^2)
}
Collinearity Exercises #2
What is the sum of squared residuals when the male coefficient is 1 and the D coefficient is 2, and the other coefficients are fit using the linear model solution?
fitTheRest(1,2)
## [1] 11
A:11
We can apply our function fitTheRest to a grid of values for beta_male and beta_D, using the expand.grid function in R. expand.grid takes two vectors and returns a data frame whose rows contain all possible combinations. Try it out:
expand.grid(1:3,1:3)
## Var1 Var2
## 1 1 1
## 2 2 1
## 3 3 1
## 4 1 2
## 5 2 2
## 6 3 2
## 7 1 3
## 8 2 3
## 9 3 3
#We can run fitTheRest on a grid of values, applying it to each row of the grid:
betas = expand.grid(-2:8,-2:8)
rss = apply(betas,1,function(x) fitTheRest(x[1],x[2]))
Collinearity Exercises #3
Which of the following pairs of values minimizes the RSS?
## Note that all pairs add to 6
themin= min(rss)
betas[which(rss==themin),]
## Var1 Var2
## 11 8 -2
## 21 7 -1
## 31 6 0
## 41 5 1
## 51 4 2
## 61 3 3
## 71 2 4
## 81 1 5
## 91 0 6
## 101 -1 7
## 111 -2 8
A:All of the above. There is no single minimum.
It's fairly clear from just looking at the numbers, but we can also plot the pairs of values that attain the minimum RSS over our grid:
library(rafalib)
## plot the pairs that attain the minimum
themin=min(rss)
plot(betas[which(rss==themin),])
There is clearly not a single beta which optimizes the least squares equation, due to collinearity, but an infinite line of solutions which produce an identical sum of squares value.
We have seen that in order to calculate the LSE, we need to invert a matrix. In previous sections we used the function solve. However, solve is not numerically stable for this purpose. When coding the LSE computation, we use the QR decomposition.
Remember that to minimize the RSS:
\[ (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^\top (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) \]
We need to solve:
\[ \mathbf{X}^\top \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^\top \mathbf{Y} \]
The solution is:
\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
Thus, we need to compute \((\mathbf{X}^\top \mathbf{X})^{-1}\).
The solve function is numerically unstable. To demonstrate what we mean by numerically unstable, we construct an extreme case:
n <- 50;M <- 500
x <- seq(1,M,len=n)
X <- cbind(1,x,x^2,x^3)
colnames(X) <- c("Intercept","x","x2","x3")
beta <- matrix(c(1,1,1,1),4,1)
set.seed(1)
y <- X%*%beta+rnorm(n,sd=1)
plot(x,y)
The standard R function for inverse gives an error:
solve(crossprod(X)) %*% crossprod(X,y)
To see why this happens, look at \((\mathbf{X}^\top \mathbf{X})\):
options(digits=4)
log10(crossprod(X))
## Intercept x x2 x3
## Intercept 1.699 4.098 6.625 9.203
## x 4.098 6.625 9.203 11.810
## x2 6.625 9.203 11.810 14.434
## x3 9.203 11.810 14.434 17.070
Note the difference of several orders of magnitude between the entries. On a digital computer, we have a limited range of numbers that can be represented precisely. This makes some numbers seem like 0 when we also have to consider very large numbers, which in turn leads to divisions that are effectively divisions by 0 and produce errors.
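As an aside (not in the original text), one way to quantify this instability is the condition number of \(\mathbf{X}^\top \mathbf{X}\). Base R's kappa function returns an estimate; a very large value warns that inverting with solve will be unreliable:
# sketch: an enormous condition number signals that X'X is close to singular
kappa(crossprod(X))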
The QR factorization is based on a mathematical result that tells us that we can decompose any full rank \(N\times p\) matrix \(\mathbf{X}\) as:
\[ \mathbf{X = QR} \]
with \(\mathbf{Q}\) an \(N\times p\) matrix satisfying \(\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}\), and \(\mathbf{R}\) a \(p\times p\) upper triangular matrix.
Upper triangular matrices are very convenient for solving systems of equations.
In the example below, the matrix on the left is upper triangular: it only has 0s below the diagonal. This facilitates solving the system of equations greatly:
\[ \, \begin{pmatrix} 1&2&-1\\ 0&1&2\\ 0&0&1\\ \end{pmatrix} \begin{pmatrix} a\\ b\\ c\\ \end{pmatrix} = \begin{pmatrix} 6\\ 4\\ 1\\ \end{pmatrix} \]
We immediately know that \(c=1\), which implies that \(b+2=4\) and thus \(b=2\). This in turn implies \(a+4-1=6\), so \(a = 3\). Writing an algorithm to do this is straightforward for any upper triangular matrix.
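In R, this back-substitution is implemented by the backsolve function. As a quick sketch using the system above, it recovers \(a=3\), \(b=2\), \(c=1\):
# sketch: solve the upper triangular system above by back-substitution
R_example <- rbind(c(1, 2, -1),
                   c(0, 1, 2),
                   c(0, 0, 1))
backsolve(R_example, c(6, 4, 1)) # returns 3 2 1, i.e. a, b, c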
If we rewrite the equations of the LSE using \(\mathbf{QR}\) instead of \(\mathbf{X}\) we have:
\[\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{Y}\]
\[(\mathbf{Q}\mathbf{R})^\top (\mathbf{Q}\mathbf{R}) \boldsymbol{\beta} = (\mathbf{Q}\mathbf{R})^\top \mathbf{Y}\]
\[\mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{Y}\]
\[\mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{Y}\]
\[(\mathbf{R}^\top)^{-1} \mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = (\mathbf{R}^\top)^{-1} \mathbf{R}^\top \mathbf{Q}^\top \mathbf{Y}\]
\[\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{Y}\]
\(\mathbf{R}\) being upper triangular makes solving this system more stable. Also, because \(\mathbf{Q}^\top\mathbf{Q}=\mathbf{I}\), we know that the columns of \(\mathbf{Q}\) are on the same scale, which stabilizes the right-hand side.
Now we are ready to find the LSE using the QR decomposition. To solve:
\[\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{Y}\]
We use backsolve, which takes advantage of the upper triangular nature of \(\mathbf{R}\):
QR <- qr(X)
Q <- qr.Q( QR )
R <- qr.R( QR )
(betahat <- backsolve(R, crossprod(Q,y) ) )
## [,1]
## [1,] 0.9038
## [2,] 1.0066
## [3,] 1.0000
## [4,] 1.0000
In practice, we do not need to do any of this due to the built-in solve.qr function:
QR <- qr(X)
(betahat <- solve.qr(QR, y))
## [,1]
## Intercept 0.9038
## x 1.0066
## x2 1.0000
## x3 1.0000
This factorization also simplifies the calculation for fitted values:
\[\mathbf{X}\boldsymbol{\hat{\beta}} = (\mathbf{QR})\mathbf{R}^{-1}\mathbf{Q}^\top \mathbf{y}= \mathbf{Q}\mathbf{Q}^\top\mathbf{y} \]
In R, we simply do the following:
library(rafalib)
mypar(1,1)
plot(x,y)
fitted <- tcrossprod(Q)%*%y
lines(x,fitted,col=2)
To obtain the standard errors of the LSE, we note that:
\[(\mathbf{X^\top X})^{-1} = (\mathbf{R^\top Q^\top QR})^{-1} = (\mathbf{R^\top R})^{-1}\]
The function chol2inv is specifically designed to find this inverse. So all we do is the following:
df <- length(y) - QR$rank
sigma2 <- sum((y-fitted)^2)/df
varbeta <- sigma2*chol2inv(qr.R(QR))
SE <- sqrt(diag(varbeta))
cbind(betahat,SE)
## SE
## Intercept 0.9038 4.508e-01
## x 1.0066 7.858e-03
## x2 1.0000 3.662e-05
## x3 1.0000 4.802e-08
This gives us identical results to the lm function.
summary(lm(y~0+X))$coef
## Estimate Std. Error t value Pr(>|t|)
## XIntercept 0.9038 4.508e-01 2.005e+00 5.089e-02
## Xx 1.0066 7.858e-03 1.281e+02 2.171e-60
## Xx2 1.0000 3.662e-05 2.731e+04 1.745e-167
## Xx3 1.0000 4.802e-08 2.082e+07 4.559e-300
We will use the spider dataset to try out the QR decomposition as a solution to linear models. Load the full spider dataset by using the code from the Interactions and Contrasts book page. Run a linear model of the friction coefficient with two variables (no interactions):
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv"
filename <- "spider_wolff_gorb_2013.csv"
library(downloader)
if (!file.exists(filename)) download(url, filename)
spider <- read.csv("spider_wolff_gorb_2013.csv", skip=1)
head(spider)
## leg type friction
## 1 L1 pull 0.90
## 2 L1 pull 0.91
## 3 L1 pull 0.86
## 4 L1 pull 0.85
## 5 L1 pull 0.80
## 6 L1 pull 0.87
fit <- lm(friction ~ type + leg, data=spider)
# The solution we want to reproduce with the QR decomposition is:
betahat <- coef(fit)
So for our matrix work, we define:
Y <- matrix(spider$friction, ncol=1)
X <- model.matrix(~ type + leg, data=spider)
In the material on QR decomposition, we saw that the solution for \(\boldsymbol{\beta}\) satisfies:
\[\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{Y}\]
QR Exercises #1
What is the first row, first column element in the Q matrix for this linear model?
Q = qr.Q(qr(X))
head(Q)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
## [2,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
## [3,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
## [4,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
## [5,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
## [6,] -0.05955 -0.05955 -0.02055 -0.05281 -0.08916
Q[1,1]
## [1] -0.05955
A:-0.05954913
QR Exercises #2
What is the first row, first column element in the R matrix for this linear model?
R = qr.R(qr(X))
head(R)
## (Intercept) typepush legL2 legL3 legL4
## 1 -16.79 -8.396 -1.786e+00 -6.193e+00 -4.764e+00
## 2 0.00 8.396 1.110e-16 -3.442e-15 -4.330e-15
## 3 0.00 0.000 5.178e+00 -2.137e+00 -1.644e+00
## 4 0.00 0.000 0.000e+00 7.815e+00 -4.225e+00
## 5 0.00 0.000 0.000e+00 0.000e+00 6.063e+00
R[1,1]
## [1] -16.79
A:-16.79286
QR Exercises #3
What is the first row, first column element of \(\mathbf{Q}^\top \mathbf{Y}\)? (Use crossprod() as in the book page.)
crossprod(qr.Q(qr(X)),Y)
## [,1]
## [1,] -13.79873
## [2,] -6.54088
## [3,] 0.08477
## [4,] 0.06578
## [5,] 1.70568
A:-13.79872520
Finally, convince yourself that the QR decomposition gives the least squares solution by putting all the pieces together: compare \(\mathbf{R}^{-1} (\mathbf{Q}^\top \mathbf{Y})\) to betahat, as in the sketch below.
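A minimal sketch of this comparison, assuming the Y, Q, R, and betahat objects created above (backsolve avoids explicitly inverting \(\mathbf{R}\)):
# sketch: the QR-based solution should match the coefficients from lm
cbind(backsolve(R, crossprod(Q, Y)), betahat)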
Linear models can be extended in many directions. Here are some examples of extensions, which you might come across in analyzing data in the life sciences:
In calculating the solution and its estimated error in the standard linear model, we minimize the squared errors. This involves a sum of squares from all the data points, which means that a few outlier data points can have a large influence on the solution. In addition, the errors are assumed to have constant variance (called homoskedasticity), which might not always hold true (when this is not true, it is called heteroskedasticity). Therefore, methods have been developed to generate more robust solutions, which behave well in the presence of outliers, or when the distributional assumptions are not met. A number of these are mentioned on the robust statistics page on the CRAN website. For more background, there is also a Wikipedia article with references.
In the standard linear model, we did not make any assumptions about the distribution of \(\mathbf{Y}\), though in some cases we can gain better estimates if we know that \(\mathbf{Y}\) is, for example, restricted to non-negative integers \(0,1,2,\dots\), or restricted to the interval \([0,1]\). A framework for analyzing such cases is referred to as generalized linear models, commonly abbreviated as GLMs. The two key components of the GLM are the link function and a probability distribution. The link function \(g\) connects our familiar matrix product \(\mathbf{X} \boldsymbol{\beta}\) to the \(\mathbf{Y}\) values through:
\[ \textrm{E}(\mathbf{Y}) = g^{-1}( \mathbf{X} \boldsymbol{\beta} ) \]
R includes the function glm, which fits GLMs using a formula interface similar to that of lm. Additional arguments include family, which can be used to specify the distributional assumption for \(\mathbf{Y}\). Some examples of the use of GLMs are shown at the Quick R website. There are a number of references for GLMs on the Wikipedia page.
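As a small illustration (a sketch with simulated data, not from the text): a Poisson GLM uses a log link, so the fitted mean is \(\exp(\mathbf{X}\boldsymbol{\beta})\):
# sketch: simulate counts whose log mean is linear in a covariate, then fit a Poisson GLM
set.seed(1)
x_sim <- runif(100)
counts <- rpois(100, lambda = exp(1 + 2 * x_sim))
fit_glm <- glm(counts ~ x_sim, family = poisson) # family specifies the distribution and link
summary(fit_glm)$coef # estimates should be close to 1 and 2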
In the standard linear model, we assumed that the matrix \(\mathbf{X}\) was fixed and not random. For example, we measured the frictional coefficients for each leg pair and in the push and pull directions. The fact that an observation had a \(1\) for a given column in \(\mathbf{X}\) was not random, but dictated by the experimental design. However, in the father and son heights example, we did not fix the values of the fathers’ heights, but observed these (and likely these were measured with some error). A framework for studying the effect of randomness in various columns of \(\mathbf{X}\) is referred to as mixed effects models, which implies that some effects are fixed and some effects are random. One of the most popular packages in R for fitting linear mixed effects models is lme4, which has an accompanying paper on Fitting Linear Mixed-Effects Models using lme4. There is also a Wikipedia page with more references.
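For a flavor of the syntax (a hypothetical sketch with simulated data, assuming the lme4 package is installed), a random intercept for each group is written as (1 | group):
# sketch: a per-group random intercept plus a fixed slope
library(lme4)
set.seed(1)
group <- factor(rep(1:6, each = 10))
x_sim <- runif(60)
y_sim <- 1 + 2 * x_sim + rnorm(6, sd = 0.5)[group] + rnorm(60, sd = 0.1)
fit_lmm <- lmer(y_sim ~ x_sim + (1 | group))
summary(fit_lmm)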
The approach presented here assumed \(\boldsymbol{\beta}\) was a fixed (non-random) parameter. We presented methodology that estimates this parameter, along with standard errors that quantify uncertainty in the estimation process. This is referred to as the frequentist approach. An alternative approach is to assume that \(\boldsymbol{\beta}\) is random and that its distribution quantifies our prior beliefs about what \(\boldsymbol{\beta}\) should be. Once we have observed data, we update our prior beliefs by computing the conditional distribution, referred to as the posterior distribution, of \(\boldsymbol{\beta}\) given the data. This is referred to as the Bayesian approach. For example, once we have computed the posterior distribution of \(\boldsymbol{\beta}\), we can report the most likely outcome or an interval that occurs with high probability (a credible interval). In addition, many models can be connected together in what is referred to as a hierarchical model. Note that we provide a brief introduction to Bayesian statistics and hierarchical models in a later chapter. A good reference for Bayesian hierarchical models is Bayesian Data Analysis, and some software for computing Bayesian linear models can be found on the Bayes page on CRAN. Some well-known software packages for computing Bayesian models are stan and BUGS.
Note that if we include enough parameters in a model we can achieve a residual sum of squares of 0. Penalized linear models introduce a penalty term into the least squares equation we minimize. These penalties are typically of the form \(\lambda \sum_{j=1}^p \|\beta_j\|^k\), and they penalize large absolute values of \(\beta\) as well as large numbers of parameters. The motivation for this extra term is to avoid over-fitting. To use these models, we need to pick \(\lambda\), which determines how much we penalize. When \(k=2\), this is referred to as ridge regression, Tikhonov regularization, or L2 regularization. When \(k=1\), this is referred to as the LASSO or L1 regularization. A good reference for these penalized linear models is the Elements of Statistical Learning textbook, which is available as a free pdf. Some R packages which implement penalized linear models are the lm.ridge function in the MASS package, the lars package, and the glmnet package.
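A minimal sketch of the glmnet interface (simulated data, assuming the glmnet package is installed); alpha = 1 gives the LASSO (L1) penalty and alpha = 0 gives ridge (L2):
# sketch: fit a LASSO model and pick the penalty lambda by cross-validation
library(glmnet)
set.seed(1)
x_mat <- matrix(rnorm(100 * 20), 100, 20) # 20 candidate predictors
y_sim <- drop(x_mat[, 1:3] %*% c(3, 2, 1)) + rnorm(100) # only the first 3 matter
cvfit <- cv.glmnet(x_mat, y_sim, alpha = 1)
coef(cvfit, s = "lambda.min") # most other coefficients are shrunk toward 0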
devtools::session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.2.4 (2016-03-10)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_India.1252
## tz Asia/Calcutta
## date 2016-03-20
## Packages ------------------------------------------------------------------
## package * version date source
## acepack 1.3-3.3 2013-05-03 CRAN (R 3.2.0)
## assertthat 0.1 2013-12-06 CRAN (R 3.2.1)
## cluster 2.0.3 2015-07-21 CRAN (R 3.2.4)
## codetools 0.2-14 2015-07-15 CRAN (R 3.2.4)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.1)
## contrast * 0.19 2013-11-16 CRAN (R 3.2.3)
## DBI 0.3.1 2014-09-24 CRAN (R 3.2.1)
## devtools 1.10.0 2016-01-23 CRAN (R 3.2.3)
## digest 0.6.9 2016-01-08 CRAN (R 3.2.3)
## downloader * 0.4 2015-07-09 CRAN (R 3.2.2)
## dplyr * 0.4.3 2015-09-01 CRAN (R 3.2.2)
## evaluate 0.8.3 2016-03-05 CRAN (R 3.2.3)
## foreign 0.8-66 2015-08-19 CRAN (R 3.2.2)
## formatR 1.3 2016-03-05 CRAN (R 3.2.4)
## Formula * 1.2-1 2015-04-07 CRAN (R 3.2.0)
## ggplot2 * 2.1.0 2016-03-01 CRAN (R 3.2.3)
## gridExtra 2.2.1 2016-02-29 CRAN (R 3.2.3)
## gtable 0.2.0 2016-02-26 CRAN (R 3.2.3)
## HistData * 0.7-6 2015-10-16 CRAN (R 3.2.2)
## Hmisc * 3.17-2 2016-02-21 CRAN (R 3.2.3)
## htmltools 0.3 2015-12-29 CRAN (R 3.2.3)
## knitr * 1.12.3 2016-01-22 CRAN (R 3.2.3)
## labeling 0.3 2014-08-23 CRAN (R 3.2.0)
## lattice * 0.20-33 2015-07-14 CRAN (R 3.2.4)
## latticeExtra 0.6-28 2016-02-09 CRAN (R 3.2.3)
## lazyeval 0.1.10 2015-01-02 CRAN (R 3.2.1)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.1)
## MASS * 7.3-45 2015-11-10 CRAN (R 3.2.4)
## Matrix 1.2-4 2016-03-02 CRAN (R 3.2.4)
## MatrixModels 0.4-1 2015-08-22 CRAN (R 3.2.2)
## memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
## multcomp * 1.4-4 2016-02-17 CRAN (R 3.2.3)
## munsell 0.4.3 2016-02-13 CRAN (R 3.2.3)
## mvtnorm * 1.0-5 2016-02-02 CRAN (R 3.2.3)
## nlme 3.1-126 2016-03-14 CRAN (R 3.2.4)
## nnet 7.3-12 2016-02-02 CRAN (R 3.2.3)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.1)
## polspline 1.1.12 2015-07-14 CRAN (R 3.2.3)
## quantreg 5.21 2016-02-13 CRAN (R 3.2.3)
## R6 2.1.2 2016-01-26 CRAN (R 3.2.3)
## rafalib * 1.0.0 2015-08-09 CRAN (R 3.2.2)
## RColorBrewer * 1.1-2 2014-12-07 CRAN (R 3.2.0)
## Rcpp 0.12.3 2016-01-10 CRAN (R 3.2.3)
## rmarkdown 0.9.5 2016-02-22 CRAN (R 3.2.3)
## rms * 4.4-2 2016-02-21 CRAN (R 3.2.3)
## rpart 4.1-10 2015-06-29 CRAN (R 3.2.1)
## sandwich * 2.3-4 2015-09-24 CRAN (R 3.2.2)
## scales 0.4.0 2016-02-26 CRAN (R 3.2.3)
## SparseM * 1.7 2015-08-15 CRAN (R 3.2.2)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.2)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.1)
## survival * 2.38-3 2015-07-02 CRAN (R 3.2.4)
## TH.data * 1.0-7 2016-01-28 CRAN (R 3.2.3)
## UsingR * 2.0-5 2015-08-06 CRAN (R 3.2.2)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.1)
## zoo 1.7-12 2015-03-16 CRAN (R 3.2.1)