Matrix Algebra and Linear Models

Week 2 - Testing Linear Models

Question 2.1.1

set.seed(12345)
g <- 9.8 ## meters per second
h0 <- 56.67
v0 <- 0
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
y <- h0 + v0*tt - 0.5*g*tt^2 + rnorm(n,sd=1)

Now we act as if we didn’t know h0, v0 and -0.5*g and use regression to estimate these. We can rewrite the model as y = b0 + b1 t + b2 t^2 + e and obtain the LSE we have used in this class. Note that g = -2 b2.

To obtain the LSE in R we could write:

X <- cbind(1, tt, tt^2)
A <- solve(crossprod(X), t(X))

Given how we have defined A, which of the following is the LSE of g, the acceleration due to gravity (suggestion: try the code in R)?

-2 * (A %*% y)

##           [,1]
##    -113.028558
## tt   -1.035275
##      10.168842

Question 2.1.2

In the lines of code above, there was a call to a random function rnorm(). This means that each time the lines of code above are repeated, the estimate of g will be different.

Use the code above in conjunction with the function replicate() to generate 100,000 Monte Carlo simulated datasets. For each dataset compute an estimate of g (remember to multiply by -2)

What is the standard error of this estimate?:

betahat <- replicate(100000, {
  y <- h0 + v0*tt - 0.5*g*tt^2 + rnorm(n,sd=1)
  betahats <- -2 * (A %*% y)
  return(betahats[3])
})
sd(betahat)

## [1] 0.427602

Question 2.1.3

In the father and son height examples we have randomness because we have a random sample of father and son pairs. For the sake of illustration let’s assume that this is the entire population:

library(UsingR)

## Warning: package 'UsingR' was built under R version 3.1.3

## Loading required package: MASS
## Loading required package: HistData

## Warning: package 'HistData' was built under R version 3.1.3

## Loading required package: Hmisc
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
## 
## 
## Attaching package: 'UsingR'
## 
## The following object is masked from 'package:survival':
## 
##     cancer

x <- father.son$fheight
y <- father.son$sheight
n <- length(y)

Now let’s run a Monte Carlo simulation in which we take a sample of size 50 over and over again.

Use the function replicate to take 10,000 samples.

What is the standard error of the slope estimate? That is, calculate the standard deviation of the estimate from many random samples.

betahat <- replicate(10000, {
  index <- sample(n, 50)
  sampledat <- father.son[index,]
  x <- sampledat$fheight
  y <- sampledat$sheight
  return(lm(y ~ x)$coef[2])
})
sd(betahat)

## [1] 0.1250985

Question 2.1.4

We are defining a new concept: covariance. The covariance of two lists of numbers X=X1,…,Xn and Y=Y1,…,Yn is mean( (Y - mean(Y))*(X-mean(X) ) ).

Which of the following is closest to the covariance between father heights and son heights

cov(father.son$fheight, father.son$sheight)

## [1] 3.873333