QUESTION 1.1.1: R REFRESHER (1/1 point) If you haven’t done so already, install the library UsingR

install.packages("UsingR")

Then once you load it you have access to Galton’s father and son heights:

library(UsingR)
## Loading required package: MASS
## Loading required package: HistData
## Loading required package: Hmisc
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
## 
## 
## Attaching package: 'UsingR'
## 
## The following object is masked from 'package:ggplot2':
## 
##     movies
## 
## The following object is masked from 'package:survival':
## 
##     cancer
?father.son

data1 <- father.son
head(data1)
##    fheight  sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
mean(data1$sheight) # ANS: 68.68407
## [1] 68.68407

What is the average height of the sons (don’t round off)? ANS: 68.68407


QUESTION 1.1.2 R REFRESHER (1 point possible) One of the defining features of regression is that we stratify one variable based on others. In statistics we use the verb "condition". For example, the linear model for son and father heights answers the question: how tall do I expect a son to be if I condition on his father being x inches? The regression line answers this question for any x. Using the father.son dataset described above, we want to know the expected height of sons if we condition on the father being 71 inches. Create a list of son heights for sons whose fathers are 71 inches tall (round father heights to the nearest inch).
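A minimal sketch of one way to do this, using the data1 object from above (the name sons71 is introduced here; the numeric answer is left to the reader):

# round father heights to the nearest inch, then keep the sons
# whose father rounds to 71 inches
sons71 <- data1$sheight[round(data1$fheight) == 71]
mean(sons71) # estimate of the conditional expectation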
QUESTION 1.3.2 (1 point possible) Solve the following system of equations using R:

3a + 4b - 5c + d = 10
2a + 2b + 2c - d = 5
a - b + 5c - 5d = 7
5a + d = 4

Note: convert this system of equations into matrix form Xa = b, and solve with solve(X,b).
xx <- matrix(c(3,2,1,5,4,2,-1,0,-5,2,5,0,1,-1,-5,1),4,4)
bb <- c(10,5,7,4)
ans.vec <- solve(xx,bb)
paste("c is",ans.vec[3]) # ANS: -0.884955752212389
## [1] "c is -0.884955752212389"

What is the solution for c? ANS: -0.884955752212389

Load the following two matrices into R:

a <- matrix(1:12, nrow=4)
b <- matrix(1:15, nrow=3)
dim(a);dim(b)
## [1] 4 3
## [1] 3 5

Note the dimension of ‘a’ and the dimension of ‘b’.

In the question below, we will use the matrix multiplication operator in R, %*%, to multiply these two matrices.

QUESTION 1.4.1 (1 point possible) What is the value in the 3rd row and the 2nd column of the matrix product of 'a' and 'b'?
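A one-line check using the matrices defined above:

(a %*% b)[3,2] # row 3 of a is (3,7,11), column 2 of b is (4,5,6): 3*4 + 7*5 + 11*6
## [1] 113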
Suppose we have an experiment with two species A and B, and two conditions: control and treated.
species <- factor(c("A","A","B","B"))
condition <- factor(c("control","treated","control","treated"))
And we will use a formula of ‘~ species + condition’.
The model matrix is then:
model.matrix(~ species + condition)
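For reference, with the default treatment contrasts this produces (attributes omitted):

##   (Intercept) speciesB conditiontreated
## 1           1        0                0
## 2           1        0                1
## 3           1        1                0
## 4           1        1                1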
# Note: To follow along in lecture notes, did the following:
# library() # checked for 'downloader' package. Wasn't there, so installed it
# install.packages("downloader")
# library(downloader) # to ensure it was there
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv"
filename <- "spider_wolff_gorb_2013.csv"
if (!file.exists(filename)) download(url, filename)
spider <- read.csv(filename, skip=1)
# View(spider)

boxplot(spider$friction ~ spider$type * spider$leg, 
        col=c("grey90","grey40"), las=2, 
        main="Comparison of friction coefficients of different leg pairs ")

# Just to remind ourselves about the simple two-group linear model, let’s subset to the L1 leg pair, and run lm:

spider.sub <- spider[spider$leg == "L1",]
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)
## 
## Call:
## lm(formula = friction ~ type, data = spider.sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33147 -0.10735 -0.04941 -0.00147  0.76853 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.92147    0.03827  24.078  < 2e-16 ***
## typepush    -0.51412    0.05412  -9.499  5.7e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2232 on 66 degrees of freedom
## Multiple R-squared:  0.5776, Adjusted R-squared:  0.5711 
## F-statistic: 90.23 on 1 and 66 DF,  p-value: 5.698e-14
(coefs <- coef(fit))
## (Intercept)    typepush 
##   0.9214706  -0.5141176
# We remember that the coefficients are just the mean of the pull observations, and the difference between the means of the two groups:

s <- split(spider.sub$friction, spider.sub$type)
mean(s[["pull"]])
## [1] 0.9214706
mean(s[["push"]]) - mean(s[["pull"]])
## [1] -0.5141176

# We can form the design matrix which was used internally to lm:

X <- model.matrix(~ type, data=spider.sub)
colnames(X)
## [1] "(Intercept)" "typepush"
## [1] "(Intercept)" "typepush"

So sqrt(Var(beta-hat_L4) + Var(beta-hat_L2) - 2*Cov(beta-hat_L4, beta-hat_L2)) is sqrt(0.03438^2 + 0.04569^2 - 2*0.0006389179) = 0.04462874, the same (up to rounding of the standard errors) as sqrt(cc %*% Sigma %*% t(cc)).

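Note: cc and Sigma are not defined in this excerpt. Following the course book, they would be the 1x5 contrast matrix picking out beta-hat_L4 minus beta-hat_L2 and the estimated covariance matrix of beta-hat from the friction ~ type + leg fit on the full spider data; a sketch (object names assumed):

fitTL <- lm(friction ~ type + leg, data=spider)
X <- model.matrix(~ type + leg, data=spider)
# estimated Var(beta-hat) = sigma^2-hat * (X^T X)^-1
Sigma <- sum(fitTL$residuals^2)/(nrow(X) - ncol(X)) * solve(t(X) %*% X)
cc <- matrix(c(0,0,-1,0,1), 1, 5) # contrast: legL4 minus legL2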
# sqrt(Var(beta-hat_L4 - beta-hat_L2)) = sqrt(Var(beta-hat_L4) + Var(beta-hat_L2) - 2*Cov(beta-hat_L4, beta-hat_L2))
sqrt(cc %*% Sigma %*% t(cc)) # equal to 0.04462392
##            [,1]
## [1,] 0.04462392
sqrt(0.03438^2 + 0.04569^2 - 2*0.0006389179)
## [1] 0.04462874

Suppose that we notice that the within-group variances for the groups with smaller frictional coefficients are generally smaller, and so we try to apply a transformation to the frictional coefficients to make the within-group variances more constant.

Add a new variable log2friction to the spider dataframe:

spider$log2friction <- log2(spider$friction)

The ‘Y’ values now look like:

boxplot(log2friction ~ type*leg, data=spider)

Run a linear model of log2friction with type, leg and interactions between type and leg.

spider$log2friction <- log2(spider$friction)
boxplot(log2friction ~ type*leg, data=spider)

fitTLlog2 <- lm(formula = log2friction ~ type + leg + type:leg, data = spider)
summary(fitTLlog2)
## 
## Call:
## lm(formula = log2friction ~ type + leg + type:leg, data = spider)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35902 -0.19193  0.00596  0.16315  1.33090 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.16828    0.06613  -2.545 0.011487 *  
## typepush       -1.20656    0.09352 -12.901  < 2e-16 ***
## legL2           0.34681    0.11952   2.902 0.004014 ** 
## legL3           0.48999    0.08505   5.762 2.24e-08 ***
## legL4           0.64668    0.08995   7.189 6.20e-12 ***
## typepush:legL2  0.09967    0.16903   0.590 0.555906    
## typepush:legL3 -0.54075    0.12027  -4.496 1.02e-05 ***
## typepush:legL4 -0.46920    0.12721  -3.689 0.000272 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3856 on 274 degrees of freedom
## Multiple R-squared:  0.8125, Adjusted R-squared:  0.8077 
## F-statistic: 169.6 on 7 and 274 DF,  p-value: < 2.2e-16

QUESTION 2.6.1 (1/1 point) What is the t-value for the interaction of type push and leg L4? If this t-value is sufficiently large, we would reject the null hypothesis that the push vs pull effect on log2(friction) is the same in L4 as in L1. ANS: from summary(fitTLlog2), the t-value is -3.689.

QUESTION 2.6.2 (1 point possible) What is the F-value for all of the type:leg interaction terms, in an analysis of variance? ANS: 10.701

For the ANOVA table, run anova() on the fit:

anova(fitTLlog2)
## Analysis of Variance Table
## 
## Response: log2friction
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## type        1 164.709 164.709 1107.714 < 2.2e-16 ***
## leg         3   7.065   2.355   15.838 1.589e-09 ***
## type:leg    3   4.774   1.591   10.701 1.130e-06 ***
## Residuals 274  40.742   0.149                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
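As a cross-check on Question 2.6.2, the same F-statistic can be obtained by comparing the nested models with and without the interaction terms (a sketch; the name fitNoInt is introduced here):

fitNoInt <- lm(log2friction ~ type + leg, data=spider)
anova(fitNoInt, fitTLlog2) # F value for the added type:leg terms matches 10.701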

QUESTION 2.6.3 (1/1 point) What is the L2 vs L1 estimate in log2friction for the pull samples? ANS: 0.34681

summary(fitTLlog2) # From this summary, legL2 estimate is  0.34681
## (same output as the summary shown above)
# also can answer using 
# contrast(fitTLlog2, list(type="pull",leg="L2"), list(type="pull",leg="L1"))
# which is simply the legL2 coefficient: coef(fitTLlog2)["legL2"]


Let’s use the example from the lecture to visualize how there is not a single best beta-hat, when the design matrix has collinearity of columns.

An example can be made with:

sex <- factor(rep(c("female","male"),each=4))
trt <- factor(c("A","A","B","B","C","C","D","D"))

The model matrix can then be formed with:

X <- model.matrix( ~ sex + trt)

And we can see that the number of independent columns is less than the number of columns of X:

qr(X)$rank

Suppose we observe some outcome, Y. For simplicity we will use synthetic data:

Y <- 1:8

Now, we will fix the values for two betas and optimize the remaining betas. We will fix beta_male and beta_D, and then find the optimal values for the remaining betas, in terms of minimizing sum((Y - X beta)^2).

The optimal value for the other betas is the one that minimizes:

sum( ( (Y - X* beta_male - X** beta_D) - X*** beta*** )^2 )

Where X* is the male column of the design matrix, X** is the D column, and X*** has the remaining columns.

So all we need to do is redefine Y as Y* = Y - X* beta_male - X** beta_D and fit a linear model. The following line of code creates this variable Y*, after fixing beta_male to a value ‘a’, and beta_D to a value, ‘b’:

makeYstar <- function(a,b) Y - X[,2] * a - X[,5] * b

Now we'll construct a function which, for given values a and b, gives us back the sum of squared residuals after fitting the other terms.

fitTheRest <- function(a,b) {
  Ystar <- makeYstar(a,b)
  Xrest <- X[,-c(2,5)]
  betarest <- solve(t(Xrest) %*% Xrest) %*% t(Xrest) %*% Ystar
  residuals <- Ystar - Xrest %*% betarest
  sum(residuals^2)
}

QUESTION 2.7.2 (1 point possible) What is the sum of squared residuals when the male coefficient is 1 and the D coefficient is 2, and the other coefficients are fit using the linear model solution? ANS 11

sex <- factor(rep(c("female","male"),each=4))
trt <- factor(c("A","A","B","B","C","C","D","D"))

X <- model.matrix( ~ sex + trt)

qr(X)$rank
## [1] 4
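This rank of 4 is less than the number of columns, confirming the collinearity:

ncol(X)
## [1] 5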
Y <- 1:8



makeYstar <- function(a,b) {Y - X[,2]*a - X[,5]*b}


fitTheRest <- function(a=1,b=2) {
  Ystar <- makeYstar(a,b)
  Xrest <- X[,-c(2,5)]
  betarest <- solve(t(Xrest)%*% Xrest)%*% t(Xrest)%*% Ystar
  residuals <- Ystar - Xrest%*% betarest
  sum(residuals^2)
}
fitTheRest() # ANS 11
## [1] 11

We can apply our function fitTheRest to a grid of values for beta_male and beta_D, using the outer() function in R. outer() takes three arguments, a grid of values for the first argument, a grid of values for the second argument, and finally a function which takes two arguments.

Try it out:

outer(1:3,1:3,"*")
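which returns the 3x3 multiplication table:

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    2    4    6
## [3,]    3    6    9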

We can run fitTheRest on a grid of values, using the following code (the Vectorize() is necessary as outer() requires only vectorized functions):

outer(-2:8,-2:8,Vectorize(fitTheRest))

QUESTION 2.7.3 (1 point possible) In the grid of values, what is the smallest sum of squared residuals? ANS 2

outer(-2:8,-2:8,Vectorize(fitTheRest)) # ANS 2
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
##  [1,]  102   83   66   51   38   27   18   11    6     3     2
##  [2,]   83   66   51   38   27   18   11    6    3     2     3
##  [3,]   66   51   38   27   18   11    6    3    2     3     6
##  [4,]   51   38   27   18   11    6    3    2    3     6    11
##  [5,]   38   27   18   11    6    3    2    3    6    11    18
##  [6,]   27   18   11    6    3    2    3    6   11    18    27
##  [7,]   18   11    6    3    2    3    6   11   18    27    38
##  [8,]   11    6    3    2    3    6   11   18   27    38    51
##  [9,]    6    3    2    3    6   11   18   27   38    51    66
## [10,]    3    2    3    6   11   18   27   38   51    66    83
## [11,]    2    3    6   11   18   27   38   51   66    83   102
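We can also confirm the grid minimum directly:

min(outer(-2:8,-2:8,Vectorize(fitTheRest))) # ANS: 2
## [1] 2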

It’s fairly clear from just looking at the numbers, but we can also visualize the sum of squared residuals over our grid with the imagemat() function from rafalib:

library(devtools) # install_github() comes from the devtools package
install_github("ririzarr/rafalib")

library(rafalib)
imagemat(outer(-2:8,-2:8,Vectorize(fitTheRest)))

There is clearly not a single beta which optimizes the least squares equation, due to collinearity, but an infinite line of solutions which produce an identical sum of squares value.

We will use the spider dataset to try out the QR decomposition as a solution to linear models. Load the full spider dataset, using the code from the Interactions and Contrasts book page (a slightly modified version is used below). Run a linear model of the friction coefficient with two variables (no interactions):

filename <- "spider_wolff_gorb_2013.csv"
if (file.exists(filename)) spider <- read.csv(filename, skip=1)
fit <- lm(friction ~ type + leg, data=spider)

# The coefficient solution we want to reproduce is:
betahat <- coef(fit)

# So for our matrix work, 
Y <- matrix(spider$friction, ncol=1)
X <- model.matrix(~ type + leg, data=spider)

QR <- qr(X)
Q <- qr.Q( QR )
paste("The first row, first column element in the Q matrix for this linear model is",format(Q[1,1], digits=3))
## [1] "The first row, first column element in the Q matrix for this linear model is -0.0595"
R <- qr.R( QR )
paste("The first row, first column element in the R matrix for this linear model is",format(R[1,1], digits=3))
## [1] "The first row, first column element in the R matrix for this linear model is -16.8"
# In the material on QR decomposition, we saw that the solution for beta satisfies:
# R beta = Q^T Y
Rbetahat <- crossprod(Q,Y)
paste("The first row, first column element of Q^T Y is",format(Rbetahat[1,1], digits=6))
## [1] "The first row, first column element of Q^T Y is -13.7987"

QUESTION 2.8.1 (1 point possible) What is the first row, first column element in the Q matrix for this linear model? ANS -0.0595

QUESTION 2.8.2 (1 point possible) What is the first row, first column element in the R matrix for this linear model? ANS -16.8

QUESTION 2.8.3 (1 point possible) What is the first row, first column element of Q^T Y (use crossprod() as in the book page)? ANS: -13.7987

Finally convince yourself that the QR gives the least squares solution by putting all the pieces together:

R^-1 (Q^T Y) compared to betahat
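A sketch of that final check, using backsolve() to apply the inverse of the upper-triangular R (objects Q, R, Rbetahat, and betahat as defined above; betahat.qr is a name introduced here):

betahat.qr <- backsolve(R, Rbetahat) # solves R %*% beta = Q^T Y
cbind(betahat.qr, betahat) # the two solutions should agree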