Let us talk a little about matrix notation and why we use it.
Suppose you have some data and a linear model with four variables. The least-squares objective is \[\sum_{i=1}^{N}(y_{i}-\beta_{0}-\beta_{1}x_{i1}-\beta_{2}x_{i2}-\beta_{3}x_{i3}-\beta_{4}x_{i4})^{2}\] It has four covariates (\(x_{ij}\), \(j=1,\dots,4\)), which makes it a little more complicated than the models we saw before. There are four slope parameters (\(\beta_{j}\) for \(j=1,\dots,4\)) plus the base level (intercept) parameter \(\beta_{0}\). We want to find the parameters that fit the data best, so we consider least squares.
So we look for the betas that minimize this expression. Before proceeding to the optimization, we can rewrite it in matrix notation as follows:
First, let us define the following matrices: \[\begin{align}X=\begin{pmatrix}1 & x_{11} & x_{12} & x_{13} & x_{14}\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_{N1} & x_{N2} & x_{N3} & x_{N4}\end{pmatrix} & & \text{and} & & \beta=\begin{pmatrix}\beta_{0}\\ \beta_{1}\\ \beta_{2}\\ \beta_{3}\\ \beta_{4}\end{pmatrix}\end{align}\] where each row of \(X\) corresponds to one observation, with a leading 1 for the intercept. The sum of squares then becomes \[(Y-X\beta)^{T}(Y-X\beta)\] where \(Y\) is the vector containing the observations \(y_{i}\) for \(i=1,\dots,N\).
Note that this formula is equivalent to the first one but easier to write down and work with. It is not only easier to write on paper, it is also faster to compute, at least in R.
Here is a simulation example that illustrates how quickly R can compute the sum of squares using matrix algebra compared with an element-by-element formula:
y<-rnorm(1e6)
x<-cbind(rep(1,1e6), rep(0:1, each=5e5))
beta<-c(1,1)
system.time({sum((y-x%*%beta)^2)})
user system elapsed
0.124 0.003 0.129
system.time({sum(sapply(seq_along(y),function(i) y[i]-x[i,1]*beta[1]-x[i,2]*beta[2])^2)})
user system elapsed
0.955 0.017 0.981
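As a quick sanity check (a small sketch using the y, x, and beta objects defined above; the names ss_matrix and ss_loop are just illustrative), both expressions return the same sum of squares:
ss_matrix<-sum((y-x%*%beta)^2) # matrix form
ss_loop<-sum(sapply(seq_along(y),function(i) y[i]-x[i,1]*beta[1]-x[i,2]*beta[2])^2) # element-by-element form
all.equal(ss_matrix,ss_loop) # should return TRUE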
So instead of writing down a separate equation for each individual, we write these matrices. \[\begin{align}\underbrace{\begin{pmatrix}x_{10} & x_{11}\\ x_{20} & x_{21}\\ \vdots & \vdots \\ x_{N0} & x_{N1}\end{pmatrix}}_{X} \underbrace{\begin{pmatrix}\beta_{0}\\ \beta_{1}\end{pmatrix}}_{\beta}=\begin{pmatrix}\beta_{0}x_{10}+\beta_{1}x_{11}\\ \beta_{0}x_{20}+\beta_{1}x_{21}\\ \vdots \\ \beta_{0}x_{N0}+\beta_{1}x_{N1} \end{pmatrix}\end{align}\] So when we see \(X\beta\) in our model, we understand that \(X\) is a matrix with one row per individual and one column per covariate (variable) used in the model, while \(\beta\) is a column vector containing the parameters of the model.
Instead of the formula with all the indices, we can simply write these matrices.
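For instance, here is a minimal sketch with made-up numbers (the names X_example and beta_example are purely illustrative) showing that \(X\beta\) reproduces the per-row linear combinations above:
X_example<-cbind(rep(1,3),c(2,5,9)) # intercept column plus one covariate for three individuals
beta_example<-c(10,2) # beta0 = 10, beta1 = 2
X_example%*%beta_example # rows are 10+2*2, 10+2*5, 10+2*9, i.e. 14, 20, 28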
In this module we'll show how to use two base R functions:
formula()
model.matrix()
in order to produce design matrices for a variety of linear models.
We're going to be building matrices filled with zeros and ones that get multiplied by \(\beta\) in a linear model.
x<-c(1,1,2,2)
f<-formula(~x) # a one-sided formula describing the model terms, as used in e.g. lm(y~x)
f
~x
The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modeled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators.
X<-model.matrix(f) # design matrix: an intercept column of ones plus a column containing the variable x
X
(Intercept) x
1 1 1
2 1 1
3 1 2
4 1 2
attr(,"assign")
[1] 0 1
class(x)
[1] "numeric"
class(X)
[1] "matrix" "array"
model.matrix() creates a design matrix from the formula. If you don't provide any data, it looks up the variables in the environment attached to the formula object f.
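For example (a small sketch with a made-up data frame d), passing a data argument makes model.matrix() take the variable from that data frame instead of from the environment:
d<-data.frame(x=c(10,20,30)) # hypothetical data frame
model.matrix(~x,data=d) # uses d$x rather than the x defined above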
If we instead use a factor and build a model matrix from the factor x, it looks different:
x<-factor(c(1,1,2,2))
x
[1] 1 1 2 2
Levels: 1 2
X<-model.matrix(~x)
X
(Intercept) x2
1 1 0
2 1 0
3 1 1
4 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
Instead of a column with the vector \((1,1,2,2)\), we get a column with the vector \((0,0,1,1)\): an indicator that takes the value 0 when \(x_{i}=1\) and 1 when \(x_{i}=2\) (for each \(i\in\{1,...,N\}\)).
We again have an intercept column of ones, plus a column x2 that represents the variable \(x\) as an indicator variable.
Now let’s construct a factor \(x\) which has three different levels:
x<-factor(c(1,1,2,2,3,3))
model.matrix(~x)
(Intercept) x2 x3
1 1 0 0
2 1 0 0
3 1 1 0
4 1 1 0
5 1 0 1
6 1 0 1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
We can also build design matrices from factors with non-numerical levels, and from more than one factor at a time:
x<-factor(c(1,1,1,1,2,2,2,2))
y<-factor(c("a","a","b","a","b","a","a","b"))
model.matrix(~x+y)
(Intercept) x2 yb
1 1 0 0
2 1 0 0
3 1 0 1
4 1 0 0
5 1 1 1
6 1 1 0
7 1 1 0
8 1 1 1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
attr(,"contrasts")$y
[1] "contr.treatment"
As before we have an intercept and two columns expressed as indicator variables. The "x2" column takes the value 0 when \(x_{i}=1\) and 1 otherwise. Likewise, the "yb" column takes the value 0 when \(y_{i}=\)"a" and 1 otherwise.
You can also specify an interaction between the variables using ":":
model.matrix(~x+y+x:y) # main effects plus the x:y interaction term
(Intercept) x2 yb x2:yb
1 1 0 0 0
2 1 0 0 0
3 1 0 1 0
4 1 0 0 0
5 1 1 1 1
6 1 1 0 0
7 1 1 0 0
8 1 1 1 1
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
attr(,"contrasts")$y
[1] "contr.treatment"
model.matrix(~x*y) # x*y is shorthand for x + y + x:y, so this gives the same design matrix
(Intercept) x2 yb x2:yb
1 1 0 0 0
2 1 0 0 0
3 1 0 1 0
4 1 0 0 0
5 1 1 1 1
6 1 1 0 0
7 1 1 0 0
8 1 1 1 1
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
attr(,"contrasts")$y
[1] "contr.treatment"
So the "x2:yb" column takes the value 1 if and only if both \(x_{i}=2\) and \(y_{i}=\)"b" hold, and 0 otherwise.
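A quick way to convince yourself of this (using the factors x and y defined above; the name M is just illustrative) is to check that the interaction column is the element-wise product of the two indicator columns:
M<-model.matrix(~x*y)
all(M[,"x2:yb"]==M[,"x2"]*M[,"yb"]) # TRUE: the interaction column is the product of the indicators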
We can also flip the coding of the indicator variable for \(x\) by changing which level is treated as the reference:
x<- factor(c(1,1,2,2))
model.matrix(~x)
(Intercept) x2
1 1 0
2 1 0
3 1 1
4 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
x<-relevel(x, "2")
model.matrix(~x)
(Intercept) x1
1 1 1
2 1 1
3 1 0
4 1 0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
x<-factor(x,levels=c("1","2")) # reset the level order so "1" is the reference again
x
[1] 1 1 2 2
Levels: 1 2
Some other operations:
z<-1:4
model.matrix(~z) #z=(1,2,3,4)
(Intercept) z
1 1 1
2 1 2
3 1 3
4 1 4
attr(,"assign")
[1] 0 1
model.matrix(~0+z) # no intercept vector
z
1 1
2 2
3 3
4 4
attr(,"assign")
[1] 1
model.matrix(~z+I(z^2)) #adds another column that is the square of z
(Intercept) z I(z^2)
1 1 1 1
2 1 2 4
3 1 3 9
4 1 4 16
attr(,"assign")
[1] 0 1 2
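The same formula can be passed straight to lm(). Here is a minimal sketch with simulated data of our own (the data frame z2_data is purely illustrative) fitting a quadratic trend via the I(z^2) term:
set.seed(1)
z2_data<-data.frame(z=1:20)
z2_data$y<-2+3*z2_data$z-0.5*z2_data$z^2+rnorm(20,sd=0.5) # simulated quadratic relationship
coef(lm(y~z+I(z^2),data=z2_data)) # estimates should be close to 2, 3, and -0.5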
Now we're going to show how to fit linear models in R, with a simple two-group comparison. First we need to load in some data on the weights of mice that were given two diets: a high-fat diet and a control diet.
#install.packages(c("downloader","ggplot2"))
library(downloader)
library(ggplot2)
url<-"https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
filename<-"femaleMiceWeights.csv"
if (!file.exists(filename))download(url,filename)
dat<-read.csv(filename)
p<-ggplot(dat,aes(x=Diet,y=Bodyweight))
p<-p+geom_jitter(width=0.1)
p<-p+xlab("Diet")+ylab("Body Weight")+ggtitle("Body Weight over Diet")
p
p
"chow" is the control diet and "hf" is the high-fat diet. In the data:
dat[11:14,]
The first column indicates whether a mouse received the high-fat diet or the control diet.
Looking at the chart, we can see that the mice given the control diet have slightly lower body weights, on average, than the mice given the high-fat diet, although there is overlap between the two groups.
We want to analyze the difference between these two groups using a linear model.
First, let us construct the matrix \(X\) from the linear model formula, so we can see what is happening internally and how R encodes the data:
diet<-factor(dat$Diet)
levels(diet)
[1] "chow" "hf"
X<-model.matrix(~diet,data=dat)
X
(Intercept) diethf
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
10 1 0
11 1 0
12 1 0
13 1 1
14 1 1
15 1 1
16 1 1
17 1 1
18 1 1
19 1 1
20 1 1
21 1 1
22 1 1
23 1 1
24 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$diet
[1] "contr.treatment"
Now, we’re going to just run the model.
You don't actually have to specify the \(X\) matrix yourself. We use the lm(y~x) function, which builds the design matrix internally from the formula.
fit<-lm(Bodyweight~Diet, data=dat)
summary(fit)
Call:
lm(formula = Bodyweight ~ Diet, data = dat)
Residuals:
Min 1Q Median 3Q Max
-6.1042 -2.4358 -0.4138 2.8335 7.1858
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.813 1.039 22.912 <2e-16 ***
Diethf 3.021 1.470 2.055 0.0519 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.6 on 22 degrees of freedom
Multiple R-squared: 0.1611, Adjusted R-squared: 0.1229
F-statistic: 4.224 on 1 and 22 DF, p-value: 0.05192
(coefs<-coef(fit))
(Intercept) Diethf
23.813333 3.020833
So we have the intercept \(\hat{\beta}_{0}=23.813333\), the average weight of the control (chow) group, and \(\hat{\beta}_{1}=3.020833\), the difference in average weight between the high-fat and control diets.
To show that our mathematical model is consistent with these results, recall \[\hat{\beta}=\begin{pmatrix}\hat{\beta}_{0}\\ \hat{\beta}_{1}\end{pmatrix}=(X^{T}X)^{-1}X^{T}Y\] thus,
Y<-matrix(dat$Bodyweight,ncol=1)
beta.hat<-solve(t(X)%*%X)%*%(t(X)%*%Y)
beta.hat
[,1]
(Intercept) 23.813333
diethf 3.020833
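Equivalently, the normal equations can be solved with crossprod(), a common alternative (not used above) that avoids forming the transpose explicitly:
solve(crossprod(X),crossprod(X,Y)) # same estimates: 23.813333 and 3.020833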
There is another way to arrive at these same numbers, by computing the group means directly:
s<-split(dat$Bodyweight,dat$Diet)
mean(s[["chow"]]) # split divides the data in the vector x into the groups defined by f. The replacement forms replace values corresponding to such a division. unsplit reverses the effect of split.
[1] 23.81333
mean(s[["hf"]])-mean(s[["chow"]])
[1] 3.020833
summary(fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.813333 1.039353 22.911684 7.642256e-17
Diethf 3.020833 1.469867 2.055174 5.192480e-02
(ttest<-t.test(s[["chow"]], s[["hf"]],var.equal = T))
Two Sample t-test
data: s[["chow"]] and s[["hf"]]
t = -2.0552, df = 22, p-value = 0.05192
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.06915183 0.02748516
sample estimates:
mean of x mean of y
23.81333 26.83417
summary(fit)$coefficients[2,3]
[1] 2.055174
ttest$statistic
t
-2.055174
t.test( s[["hf"]],s[["chow"]],var.equal = T)$statistic
t
2.055174
Linear Models in Practice Exercises #1
You can make a design matrix \(X\) for a two group comparison either using model.matrix() or simply with:
X = cbind(rep(1,nx + ny),
rep(c(0,1),
c(nx, ny)))
For a comparison of two groups, where the first group has \(nx=5\) samples, and the second group has \(ny=7\) samples, what is the element in the 1st row and 1st column of \(X^{T}X\)?
nx<-5
ny<-7
X <- cbind(rep(1,nx + ny),
rep(c(0,1),
c(nx, ny)))
XtX<-(t(X)%*%X)
XtX[1,1]
[1] 12
Linear Models in Practice Exercises #2
What are all the other elements of \(X^{T}X\)?
XtX
[,1] [,2]
[1,] 12 7
[2,] 7 7
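To see why, note that with the first column all ones and the second column an indicator for the second group, the product works out to \[X^{T}X=\begin{pmatrix}n_{x}+n_{y} & n_{y}\\ n_{y} & n_{y}\end{pmatrix}=\begin{pmatrix}12 & 7\\ 7 & 7\end{pmatrix}\]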
Recall the LSE estimates: \[\hat{\beta}=(X^{T}X)^{-1}X^{T}Y\] Once you write down the matrices as we did before, you can solve for the parameter estimates.
When we're testing whether a particular estimated parameter is statistically significant, we need to know how much it varies, so we compute its standard error. This is basically the t-test.
We compare the estimate to its standard error: \[\frac{\text{signal}}{\text{noise}}=\frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}\] \[SE(\hat{\beta}_{i})=\sqrt{s^{2}(X^{T}X)^{-1}_{ii}}\] where \(s^{2}\) is the estimate of the error variance, computed from the residuals as \(s^{2}=\frac{1}{N-p}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}\), with \(p\) the number of parameters.
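For the mice example, this can be computed directly, a sketch using the dat and fit objects from above (rebuilding the design matrix so we don't reuse the 12-row X from the exercises):
X<-model.matrix(~Diet,data=dat) # 24-row design matrix for the mice data
s2<-sum(residuals(fit)^2)/(nrow(X)-ncol(X)) # s^2: residual sum of squares over N - p degrees of freedom
sqrt(diag(s2*solve(t(X)%*%X))) # standard errors; they match summary(fit)$coefficients[,"Std. Error"]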