Homework Assigned 10/13/16 and 10/18/16

Created by Robin Cunningham, UNC Chapel Hill

Intro to Multiple Linear Regression



GPA Data Set

Note that the first question of this homework is the same as in the previous homework. This is so that you have the matrices and model runs we need to continue our analysis of this model. The new material in this assignment begins with Question 2.

1. Get the data set “College GPA.csv” and put it in your working directory along with this document, and set that directory as the working directory in the RStudio console. Note that all of the code boxes below are set to eval=FALSE so that your file will knit; you will need to change this to eval=TRUE to print out your answers.

a. Read in the GPA file and assign it to the handy name provided.

```r
gpa <- read.csv("College GPA.csv", header = TRUE)
gpa
```

```
##    First.Yr.GPA Math.SAT Verbal.SAT HS.Math.GPA HS.English.GPA
## 1          1.97      321        247        2.30           2.63
## 2          2.74      718        436        3.80           3.57
## 3          2.19      358        578        2.98           2.57
## 4          2.60      403        447        3.58           2.21
## 5          2.98      640        563        3.38           3.48
## 6          1.65      237        342        1.48           2.14
## 7          1.89      270        472        1.67           2.64
## 8          2.38      418        356        3.73           2.52
## 9          2.66      443        327        3.09           3.20
## 10         1.96      359        385        1.54           3.46
## 11         3.14      669        664        3.21           3.37
## 12         1.96      409        518        2.77           2.60
## 13         2.20      582        364        1.47           2.90
## 14         3.90      750        632        3.14           3.49
## 15         2.02      451        435        1.54           3.20
## 16         3.61      645        704        3.50           3.74
## 17         3.07      791        341        3.20           2.93
## 18         2.63      521        483        3.59           3.32
## 19         3.11      594        665        3.42           2.70
## 20         3.20      653        606        3.69           3.52
```
b. (i) Create each of the matrices X, Y, and Beta_hat that you will need to solve the normal equations \[ X^{T}X\hat{\beta} = X^{T}Y. \] Begin by defining the variables \(Y = \text{First-year GPA}\), \(X_1 = \text{Math SAT}\), \(X_2 = \text{Verbal SAT}\), \(X_3 = \text{HS Math GPA}\), and \(X_4 = \text{HS English GPA}\). Then initialize a Beta_hat vector of all zeros and of the appropriate length (one entry per coefficient, so \(k + 1\)).

```r
# Initialize vectors
k <- ncol(gpa) - 1
n <- nrow(gpa)
Beta_hat <- numeric(length = k + 1)  # one entry per coefficient
Y  <- numeric(n)
X1 <- numeric(n)
X2 <- numeric(n)
X3 <- numeric(n)
X4 <- numeric(n)

# Assign values to vectors
Y  <- gpa$First.Yr.GPA
X1 <- gpa$Math.SAT
X2 <- gpa$Verbal.SAT
X3 <- gpa$HS.Math.GPA
X4 <- gpa$HS.English.GPA
```
(ii) We have our variables now, but they are all stored as vectors, and we must convert them to matrices. Use cbind to create the matrix X; you will need to create a vector of 1’s of the appropriate length first. Also, use the matrix command to turn Y and Beta_hat into matrices with the correct dimensions. After you have made them, print all three matrices for the normal equations, with labels.

```r
# Change Beta_hat to a matrix
Beta_hat <- matrix(Beta_hat, nrow = k + 1, ncol = 1)
print("Beta_hat = ")
Beta_hat
```

```
## [1] "Beta_hat = "
##      [,1]
## [1,]    0
## [2,]    0
## [3,]    0
## [4,]    0
## [5,]    0
```

```r
# Change Y to a matrix
mY <- matrix(Y, nrow = n, ncol = 1)
print("Y = ")
mY
```

```
## [1] "Y = "
##       [,1]
##  [1,] 1.97
##  [2,] 2.74
##  [3,] 2.19
##  [4,] 2.60
##  [5,] 2.98
##  [6,] 1.65
##  [7,] 1.89
##  [8,] 2.38
##  [9,] 2.66
## [10,] 1.96
## [11,] 3.14
## [12,] 1.96
## [13,] 2.20
## [14,] 3.90
## [15,] 2.02
## [16,] 3.61
## [17,] 3.07
## [18,] 2.63
## [19,] 3.11
## [20,] 3.20
```

```r
# Create X: first a column of ones, then each Xi as a column
mones <- matrix(1, nrow = n)
mX1 <- matrix(X1, nrow = n, ncol = 1)
mX2 <- matrix(X2, nrow = n, ncol = 1)
mX3 <- matrix(X3, nrow = n, ncol = 1)
mX4 <- matrix(X4, nrow = n, ncol = 1)

# Bind the columns together to make the matrix X
X <- cbind(mones, mX1, mX2, mX3, mX4)
X
```

```
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1  321  247 2.30 2.63
##  [2,]    1  718  436 3.80 3.57
##  [3,]    1  358  578 2.98 2.57
##  [4,]    1  403  447 3.58 2.21
##  [5,]    1  640  563 3.38 3.48
##  [6,]    1  237  342 1.48 2.14
##  [7,]    1  270  472 1.67 2.64
##  [8,]    1  418  356 3.73 2.52
##  [9,]    1  443  327 3.09 3.20
## [10,]    1  359  385 1.54 3.46
## [11,]    1  669  664 3.21 3.37
## [12,]    1  409  518 2.77 2.60
## [13,]    1  582  364 1.47 2.90
## [14,]    1  750  632 3.14 3.49
## [15,]    1  451  435 1.54 3.20
## [16,]    1  645  704 3.50 3.74
## [17,]    1  791  341 3.20 2.93
## [18,]    1  521  483 3.59 3.32
## [19,]    1  594  665 3.42 2.70
## [20,]    1  653  606 3.69 3.52
```


c. Calculate the least-squares values of Beta_hat. I recommend calculating \(X^{T}X\) first, then its inverse, then Beta_hat. I went ahead and printed out the Beta_i values for you.

```r
# Calculate Beta_hat = (X^T X)^{-1} X^T Y
XTX <- t(X) %*% X
Inverse <- solve(XTX)
Beta_hat <- Inverse %*% t(X) %*% Y
Beta_hat
```

```
##             [,1]
## [1,] 0.161549575
## [2,] 0.002010168
## [3,] 0.001252197
## [4,] 0.189440168
## [5,] 0.087563741
```

```r
# Print individual values of Beta_i
labels <- matrix(c("Beta_0", "Beta_1", "Beta_2", "Beta_3", "Beta_4"), nrow = 5)
beta_summary <- cbind(labels, Beta_hat)
beta_summary
```

```
##      [,1]     [,2]                 
## [1,] "Beta_0" "0.161549575051402"  
## [2,] "Beta_1" "0.00201016759533022"
## [3,] "Beta_2" "0.00125219730826483"
## [4,] "Beta_3" "0.189440168465175"  
## [5,] "Beta_4" "0.0875637413857484"
```
d. Now that you know how lm calculates these coefficients, it is OK to use lm directly to calculate the least-squares statistics. Do that here and assign the model to the name gpa.mlr. Then get the summary output of the model.

```r
gpa.mlr <- lm(Y ~ X1 + X2 + X3 + X4)
summary(gpa.mlr)
```

```
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44328 -0.12837  0.00257  0.13400  0.53900 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 0.1615496  0.4375321   0.369  0.71712   
## X1          0.0020102  0.0005844   3.439  0.00365 **
## X2          0.0012522  0.0005515   2.270  0.03835 * 
## X3          0.1894402  0.0918680   2.062  0.05697 . 
## X4          0.0875637  0.1764963   0.496  0.62700   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2685 on 15 degrees of freedom
## Multiple R-squared:  0.8528, Adjusted R-squared:  0.8135 
## F-statistic: 21.72 on 4 and 15 DF,  p-value: 4.255e-06
```

e. How do your estimates of the parameters compare to R’s? (Comment box below.)

answer here
f. Interpret each of the regression parameters in words. That is, explain what each value means in terms of the scenario.

    answer here
g. Find a 95% confidence interval for \(\beta_4\). (Use either a code box, a comment box, or both to hold your answer. A sketch of one approach follows.)
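
For reference, a minimal sketch of one possible approach (not necessarily the intended one): R’s confint function gives the interval directly, and the hand calculation uses \(\hat{\beta}_4 \pm t_{0.975,\, n-k-1}\, SE(\hat{\beta}_4)\).

```r
# Sketch: confidence interval straight from the fitted model
confint(gpa.mlr, "X4", level = 0.95)

# Or by hand, from the summary output
b4  <- coef(gpa.mlr)["X4"]
se4 <- summary(gpa.mlr)$coefficients["X4", "Std. Error"]
b4 + c(-1, 1) * qt(0.975, df = n - k - 1) * se4
```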

h. For a new individual with
    MSAT = 640
    VSAT = 540
    HSM = 3.8
    HSE = 3.2
what is your best estimate for their 1st-year GPA? Please use R to give your answer. (A sketch follows.)
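
A minimal sketch, assuming the new values map onto the model variables as X1 = MSAT, X2 = VSAT, X3 = HSM, X4 = HSE:

```r
# Sketch: point estimate via predict() on a one-row data frame
new_student <- data.frame(X1 = 640, X2 = 540, X3 = 3.8, X4 = 3.2)
predict(gpa.mlr, new_student)
```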

OMIT i. For the person described in Part (h), find a 95% confidence interval for their 1st-year GPA. (Note that \(s^2\) is given in the R output of the linear model.)

j. We should have done this at the start of the regression analysis, but make plots of Y versus each of the predictors (4 plots) and discuss what you see. (Try to get everything into this RMarkdown document. No paper this time. A sketch follows.)
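
A minimal plotting sketch using base graphics; a 2-by-2 grid keeps all four plots in the knitted document:

```r
# Sketch: Y against each predictor
par(mfrow = c(2, 2))
plot(X1, Y, xlab = "Math SAT",       ylab = "First-year GPA")
plot(X2, Y, xlab = "Verbal SAT",     ylab = "First-year GPA")
plot(X3, Y, xlab = "HS Math GPA",    ylab = "First-year GPA")
plot(X4, Y, xlab = "HS English GPA", ylab = "First-year GPA")
par(mfrow = c(1, 1))
```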

k. Make histograms of each of the variables to see if there are any obvious outliers or other odd behavior. Print the histograms and any comments. (A sketch follows.)
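
A minimal sketch along the same lines:

```r
# Sketch: a histogram of each variable
par(mfrow = c(2, 3))
hist(Y,  main = "First-year GPA", xlab = "")
hist(X1, main = "Math SAT",       xlab = "")
hist(X2, main = "Verbal SAT",     xlab = "")
hist(X3, main = "HS Math GPA",    xlab = "")
hist(X4, main = "HS English GPA", xlab = "")
par(mfrow = c(1, 1))
```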

l. Plot the residuals versus the fitted values (\(\hat{Y}_i\)) and the QQ plot of the residuals. It is OK to use lm for this. Include any comments on what these diagnostic plots indicate. (A sketch follows.)
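
One way to produce both diagnostic plots from the fitted model object:

```r
# Sketch: residuals vs fitted values, and a normal QQ plot of the residuals
par(mfrow = c(1, 2))
plot(fitted(gpa.mlr), resid(gpa.mlr),
     xlab = "Fitted values (Y-hat)", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(gpa.mlr))
qqline(resid(gpa.mlr))
par(mfrow = c(1, 1))
```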

m. Find a 95% prediction interval for the individual described in Part (h) above. It is OK to use R’s predict function for this. Interpret the result in terms of the scenario. (A sketch follows.)
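
A sketch using the same kind of call that Question 3(a) below gives in full:

```r
# Sketch: 95% prediction interval for the individual from Part (h)
new_student <- data.frame(X1 = 640, X2 = 540, X3 = 3.8, X4 = 3.2)
predict(gpa.mlr, new_student, interval = "prediction", level = 0.95)
```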




2. Parts (a) through (e) above hopefully built our confidence in R, in the matrix algebra involved in finding the regression coefficients, and in our ability to produce the regression coefficients by hand.

In this question, we will see how to reproduce the Standard Errors of each coefficient using matrix algebra. It turns out there is not much work for this, just a couple of tricks.
Please render all answers using RMarkdown (no paper).

a. We learned in class that the matrix \((X^{T}X)^{-1}\) basically gives us the Standard Errors for the coefficients. This matrix is so important that it is often called \(C\). Assign the appropriate matrix to the variable name \(C\) and print the matrix. (A sketch follows the stub.)

```r
# Assign the right matrix to the variable C.
# C <- 
```
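
A sketch of one way to fill in the stub; note this is the same matrix that was computed as Inverse in Question 1(c):

```r
# Sketch: C = (X^T X)^{-1}
C <- solve(t(X) %*% X)
C
```
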
b. As we learned in class, the squares of the Standard Errors of the coefficients are given by the diagonal elements of the matrix \(s^{2}C\). So to reproduce them, we need to produce \(s^2\). We could just square the residual standard error (0.2685) in the output of lm, but we can also calculate it with matrices by going through SSE and the degrees of freedom of SSE. In the code block below, use the matrix equations from class to calculate the matrix of residuals (“e-hat”) and SSE. Print out SSE. (A sketch follows.)
```r
# Calculate the residual matrix and SSE by hand
# ehat <- 
# SSE  <- 
```
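
A sketch using the matrix objects built in Question 1 (mY, X, and the least-squares Beta_hat):

```r
# Sketch: ehat = Y - X * Beta_hat, and SSE = ehat^T * ehat
ehat <- mY - X %*% Beta_hat
SSE  <- t(ehat) %*% ehat
SSE
```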


c. Now calculate \(s = \sqrt{MSE}\) using the fact that the degrees of freedom of SSE is \(n-k-1\). Print \(s\) and verify that it equals the residual standard error.

```r
# Calculate s by hand.
# n and k were found above.
```
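
A sketch; the result should match the residual standard error (0.2685) from the lm summary:

```r
# Sketch: s = sqrt(MSE) = sqrt(SSE / (n - k - 1))
s <- sqrt(SSE / (n - k - 1))
s
```
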
d. Well, that is reassuring. It looks like we can trust R’s output for \(s\). Now print the matrix \(s^{2}C\) and view the squares of the SE(Beta_i)’s on the diagonal. Unfortunately, R still considers \(s\) to be a 1×1 matrix, so we have to convert it to a scalar first, before multiplying by \(C\). I have done this step for you.
```r
# View the diagonal after converting s to a number and calculating s^2 * C.
# s <- as.numeric(s)
```
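
A sketch completing the step:

```r
# Sketch: convert s to a scalar, then inspect the diagonal of s^2 * C
s <- as.numeric(s)
diag(s^2 * C)
```
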
e. You can view the actual Standard Errors of the coefficients by taking the square root of the diagonal of \(s^{2}C\), but it would be more satisfying to produce a vector of my_SEs. Write a for-loop to do this.
```r
# Make a vector of my_SEs
# I did it for you, but inspect the code please!!
my_SEs <- numeric(k + 1)  # one term for each coefficient
for (i in 1:(k + 1)) {    # parentheses matter: 1:k+1 would mean (1:k) + 1
  my_SEs[i] <- sqrt(s^2 * C[i, i])
}
my_SEs
```




3. The last thing we have to look at before model selection is prediction intervals for \(\hat{Y}\) and confidence intervals for \(\mu_Y\). The idea is that a new student comes along with a set of SAT scores and high school grades, and we want to make an interval estimate for their 1st-year GPA at university.
a. We will switch up the order this time and first use R to calculate the confidence interval for \(\mu_Y\) and the prediction interval for \(\hat{Y}\). I will even give you the code:

```r
# Use gpa.mlr for the prediction interval
new_student <- data.frame(X1 = 600.0, X2 = 650.0, X3 = 2.5, X4 = 3.8)
predict(gpa.mlr, new_student, interval = "predict")
```

```
##        fit      lwr      upr
## 1 2.987921 2.334746 3.641096
```

```r
# Use gpa.mlr for the confidence interval for mu_Y
predict(gpa.mlr, new_student, interval = "confidence")
```

```
##        fit      lwr      upr
## 1 2.987921 2.673148 3.302695
```


b. Explain in plain English what each interval represents. We will verify below that they are 95% confidence intervals. Needless to say, your two explanations should not be identical.



c. Calculate the standard errors \(SE(\mu_Y)\) and \(SE(\hat{Y})\) using the matrix formulas from class. Print out each standard error with a label. (Note: I defined the new matrix x as a column matrix. If you define it as a row matrix, it will affect some of your formulas. A sketch follows the stub.)

```r
# First define x, then calculate x^T * C * x and convert it to a scalar
# new_x <- matrix(c(1, 600, 650, 2.5, 3.8), ncol = 1)

# Standard error of mu_Y

# Standard error of yhat for a new observation
```
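
A sketch, assuming the class formulas \(SE(\mu_Y) = s\sqrt{x^{T}Cx}\) and \(SE(\hat{Y}) = s\sqrt{1 + x^{T}Cx}\):

```r
# Sketch: define new_x as a column matrix, then make x^T * C * x a scalar
new_x <- matrix(c(1, 600, 650, 2.5, 3.8), ncol = 1)
xCx <- as.numeric(t(new_x) %*% C %*% new_x)

# Standard error of mu_Y
SE_muY <- s * sqrt(xCx)
print(c("SE(mu_Y)" = SE_muY))

# Standard error of yhat for a single new observation
SE_yhat <- s * sqrt(1 + xCx)
print(c("SE(Y-hat)" = SE_yhat))
```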


d. Note that both types of interval for the new predictor values (600, 650, 2.5, 3.8) use the same point estimate \(\hat{Y} = x^{T}\hat{\beta}\). Calculate this point estimate and verify that it equals the value we found in Part (a). Also, turn it into a scalar so we can use it for the confidence intervals in Part (e). (A sketch follows.)
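
A sketch of the point estimate:

```r
# Sketch: yhat = x^T * Beta_hat, converted to a scalar for later use
yhat <- as.numeric(t(new_x) %*% Beta_hat)
yhat  # should match the fit value 2.987921 from Part (a)
```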


e. Now use \(t_{0.975,\, n-k-1}\) to reproduce the 95% confidence intervals we found in Part (a). Print each with a label (the label can be on the previous line) along with lower.bound and upper.bound. (A sketch follows the stub.)

```r
# Confidence interval for the average of all new observations
# with predictors = (600, 650, 2.5, 3.8)

# CI_muy.lower <- 
# CI_muy.upper <- 
# print("The upper and lower bounds of the confidence interval for mu_Y are")
# print(c(CI_muy.lower, CI_muy.upper))

# Now do the prediction interval for the GPA of a single new observation
# with predictors = (600, 650, 2.5, 3.8)
```
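
A sketch of how the stub might be completed, reusing yhat, SE_muY, and SE_yhat from the earlier sketches:

```r
# Sketch: the same critical value works for both intervals
t_crit <- qt(0.975, df = n - k - 1)

# Confidence interval for mu_Y
CI_muy.lower <- yhat - t_crit * SE_muY
CI_muy.upper <- yhat + t_crit * SE_muY
print("The upper and lower bounds of the confidence interval for mu_Y are")
print(c(CI_muy.lower, CI_muy.upper))

# Prediction interval for a single new observation
PI.lower <- yhat - t_crit * SE_yhat
PI.upper <- yhat + t_crit * SE_yhat
print("The upper and lower bounds of the prediction interval are")
print(c(PI.lower, PI.upper))
```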

f. Feel some satisfaction at being able to reproduce the output of lm for a multiple linear regression. You now know how to calculate a linear regression and how to properly use it for prediction. Next in the course, we will learn to decide between different models and choose the “best” one.