Created by Robin Cunningham, UNC Chapel Hill output: html_document
GPA Data Set
Note that the first question of this homework is the same as the previous homework. This is so that you have the matrices and job runs we need to continue our analysis of this model. This assignment begins with Question 2.
1. Get the data set “College GPA.csv” and put it in your working directory along with this document and the console of R-Studio. Note that all of the code boxes below are set to eval=FALSE
so that your file will knit
but you will need to change this to put out your answers.
a. Read in the GPA file and assign it to the handy name provided.
```r
gpa <- read.csv("College GPA.csv", header = TRUE)
gpa
```
```
## First.Yr.GPA Math.SAT Verbal.SAT HS.Math.GPA HS.English.GPA
## 1 1.97 321 247 2.30 2.63
## 2 2.74 718 436 3.80 3.57
## 3 2.19 358 578 2.98 2.57
## 4 2.60 403 447 3.58 2.21
## 5 2.98 640 563 3.38 3.48
## 6 1.65 237 342 1.48 2.14
## 7 1.89 270 472 1.67 2.64
## 8 2.38 418 356 3.73 2.52
## 9 2.66 443 327 3.09 3.20
## 10 1.96 359 385 1.54 3.46
## 11 3.14 669 664 3.21 3.37
## 12 1.96 409 518 2.77 2.60
## 13 2.20 582 364 1.47 2.90
## 14 3.90 750 632 3.14 3.49
## 15 2.02 451 435 1.54 3.20
## 16 3.61 645 704 3.50 3.74
## 17 3.07 791 341 3.20 2.93
## 18 2.63 521 483 3.59 3.32
## 19 3.11 594 665 3.42 2.70
## 20 3.20 653 606 3.69 3.52
```
Create each of the Matrices, X, Y, and Beta_hat that you will need to solve the normal equations. \[ X\hat{B} =Y \] Begin by defining variables \[ Y = First year GPA \] \[ X1 = Math SAT \] \[ X2 = Verbal SAT \] \[ X3 = HS Math GPA \] \[ X4 = HS English GPA. \] Then initialize a Beta_hat vector of all zeros and the appropriate length.
#Initialize vectors
k <- ncol(gpa)-1
n <- nrow(gpa)
Beta_hat = numeric(length = n)
Y <- numeric(n)
X1 <- numeric(n)
X2 <- numeric(n)
X3 <- numeric(n)
X4 <- numeric(n)
#Assign values to vectors
Y <- gpa$First.Yr.GPA
X1 <- gpa$Math.SAT
X2 <- gpa$Verbal.SAT
X3 <- gpa$HS.Math.GPA
X4 <- gpa$HS.English.GPA
We have our variables now, but they are all stored as vectors and we must convert them to matrices. Use cbind
to create the matrix X. You will need to create a vector of 1’s of the appropriate length first. Also, use the matrix
command to turn Y and Beta_hat into matrices with the correct dimensions. After you have made them, go ahead and print (with labels) all three Matrices for the normal equations with labels.
#Change Beta_hat to a matrix
Beta_hat = matrix(Beta_hat, nrow = n, ncol=1)
print("Beta_hat = ")
## [1] "Beta_hat = "
Beta_hat
## [,1]
## [1,] 0
## [2,] 0
## [3,] 0
## [4,] 0
## [5,] 0
## [6,] 0
## [7,] 0
## [8,] 0
## [9,] 0
## [10,] 0
## [11,] 0
## [12,] 0
## [13,] 0
## [14,] 0
## [15,] 0
## [16,] 0
## [17,] 0
## [18,] 0
## [19,] 0
## [20,] 0
#Change Y to a matrix
mY = matrix(Y, nrow = n, ncol=1)
print("Y = ")
## [1] "Y = "
mY
## [,1]
## [1,] 1.97
## [2,] 2.74
## [3,] 2.19
## [4,] 2.60
## [5,] 2.98
## [6,] 1.65
## [7,] 1.89
## [8,] 2.38
## [9,] 2.66
## [10,] 1.96
## [11,] 3.14
## [12,] 1.96
## [13,] 2.20
## [14,] 3.90
## [15,] 2.02
## [16,] 3.61
## [17,] 3.07
## [18,] 2.63
## [19,] 3.11
## [20,] 3.20
#create X
# First create column of ones and convert Xi to columns
mones <- matrix(1, nrow=n)
mX1 = matrix(X1, nrow = n, ncol=1)
mX2 = matrix(X2, nrow=n, ncol=1)
mX3 = matrix(X3, nrow=n, ncol=1)
mX4 = matrix(X4, nrow=n, ncol=1)
# make matrix X
X <- cbind(mones, mX1, mX2, mX3, mX4)
X
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 321 247 2.30 2.63
## [2,] 1 718 436 3.80 3.57
## [3,] 1 358 578 2.98 2.57
## [4,] 1 403 447 3.58 2.21
## [5,] 1 640 563 3.38 3.48
## [6,] 1 237 342 1.48 2.14
## [7,] 1 270 472 1.67 2.64
## [8,] 1 418 356 3.73 2.52
## [9,] 1 443 327 3.09 3.20
## [10,] 1 359 385 1.54 3.46
## [11,] 1 669 664 3.21 3.37
## [12,] 1 409 518 2.77 2.60
## [13,] 1 582 364 1.47 2.90
## [14,] 1 750 632 3.14 3.49
## [15,] 1 451 435 1.54 3.20
## [16,] 1 645 704 3.50 3.74
## [17,] 1 791 341 3.20 2.93
## [18,] 1 521 483 3.59 3.32
## [19,] 1 594 665 3.42 2.70
## [20,] 1 653 606 3.69 3.52
c. Calculate the least squares values of Beta_hat. In the calculation of the least squares values, I recommend calculating X^tX first, then the inverse, then Beta_hat.
I went ahead and printed out the Beta_i values for you.
#Calculate Beta_hat
XTX = t(X)%*%X
Inverse = solve(XTX)
Beta_hat = Inverse%*%t(X)%*%Y
Beta_hat
## [,1]
## [1,] 0.161549575
## [2,] 0.002010168
## [3,] 0.001252197
## [4,] 0.189440168
## [5,] 0.087563741
#Print individual values of Beta_i
labels = matrix(c("Beta_0", "Beta_1", "Beta_2", "Beta_3", "Beta_4"), nrow = 5)
beta_summary <- cbind(labels, Beta_hat)
beta_summary
## [,1] [,2]
## [1,] "Beta_0" "0.161549575051402"
## [2,] "Beta_1" "0.00201016759533022"
## [3,] "Beta_2" "0.00125219730826483"
## [4,] "Beta_3" "0.189440168465175"
## [5,] "Beta_4" "0.0875637413857484"
Now that you know how lm
calculates these coefficients, it is ok to use lm
directly to calculate the least-squares statistics. Do that here and assign the model to the name gpa.mlr. Then get the summary output of the model.
gpa.mlr <- lm(Y ~ X1 + X2 +X3 + X4)
summary(gpa.mlr)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44328 -0.12837 0.00257 0.13400 0.53900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1615496 0.4375321 0.369 0.71712
## X1 0.0020102 0.0005844 3.439 0.00365 **
## X2 0.0012522 0.0005515 2.270 0.03835 *
## X3 0.1894402 0.0918680 2.062 0.05697 .
## X4 0.0875637 0.1764963 0.496 0.62700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2685 on 15 degrees of freedom
## Multiple R-squared: 0.8528, Adjusted R-squared: 0.8135
## F-statistic: 21.72 on 4 and 15 DF, p-value: 4.255e-06
How do your estimates of the parameters compare to R’s?(Comment box below)
answer here
Interpret each of the regression paramaters in words. That is, explain what each value means in terms of the scenario.
answer here
Find a 95% confidence interval for \(\beta_4\).(Use either a code-box, a comment box, or both to hold your answer.)
For a new individual with
MSAT = 640
VSAT = 540
HSM = 3.8
HSE = 3.2
What is your best estimate for their 1st-year gpa?
Please use R to give your answer.
OMIT i. For the person described in Part (h), find a 95% confidence interval for their 1st-year gpa. (Note that s^2 is given in the R output of the linear model.)
We should have done this to start the regression analysis, but make plots of Y versus each of the predictors (4 plots) and discuss what you see. (Try to get everything into this RMarkdown document. No paper this time.)
Makes histograms of each of the variables to see if there are any obvious outliers or other odd behavior. Print the histograms and any comments.
Plot residuals versus fitted values (Y-hat_i) and the QQ plot of residuals. It is ok to use lm
for this. Include any comments on what these diagnostic plots indicate.
Find a 95% prediction interval for the individual described in Part (h) above. It is ok to use R’s predict
function for this. Interpret the result in terms of the scenario.
2. Parts (a) through (e) above hopefully built our confidence regarding R, regarding the matrix algebra involved in finding the regression coefficients and in our ability to produce the regression coefficients by hand.
In this question, we will see how to reproduce the Standard Errors of each coefficient using matrix algebra. It turns out there is not much work for this, just a couple of tricks.
Please render all answers using RMarkdown (no paper).
a. We learned in class that the matrix \[(X^{t}X)^{-1}\] basically gives us the Standard Errors for the coefficients. This matrix is so important that it is often called \(C\). Assign the appropriate matrix to the variable name \(C\) and print the matrix.
# Assign the right matrix to the variable C.
# C <-
# Calculate SE(Beta_3) by hand
# ehat =
#SSE <-
c. Now calculate \(s = \sqrt{MSE}\) using the fact that the degrees of freedom of SSE is \(n-k-1\). Print \(s\) and verify that it equals the residual standard error.
# Calculate s by hand.
#n and k were found above.
# view the diagonal after converting s to a number and calculating s^2*C.
#s <- as.numeric(s)
my_SEs
. Write a do-loop to do this# Make a vector of my_SEs
# I did it for you, but inspect the code please!!
my_SEs <- numeric(k+1) # one term for each coefficient
for (i in 1:k+1){
my_SEs[i] = sqrt(s^2*C[i,i])
}
my_SEs
3. The last thing we have to look at before model selection is prediction intervals for \(\hat{Y}\) and confidence intervals for \(\mu_Y\). The idea is that a new student comes along with a set of SAT scores and High School grades and we want to make a confidence interval for their 1st-year GPA at university.
a. We will switch up the order this time and first use R to calculate the confidence interval for \(\mu_Y\) and the prediction interval for \(\hat{Y}\). I will even give you the code:
#Use gpa.mlr for predition interval
new_student <- data.frame(X1=600.0, X2=650.0, X3=2.5, X4=3.8)
predict(gpa.mlr,new_student, interval="predict")
## fit lwr upr
## 1 2.987921 2.334746 3.641096
#Use gpa.mlr for confidence interval for mu_Y
predict(gpa.mlr,new_student, interval="confidence")
## fit lwr upr
## 1 2.987921 2.673148 3.302695
b. Explain in plain english what each interval represents.We will verify below that they are 95%-confidence intervals. Needless to say, your two explanations may not be identical.
c. Calculate the standard errors \(SE(\mu_Y)\) and \(SE(\hat{Y})\) using the matrix formulas from class. Print out each standard error with a label. (Note: I defined the new matrix x as a column matrix. If you define it as a row-matrix, it will affect some of your formulas.)
# first define x and calculate x^T*C*x and convert it to a scalar
#new_x <- matrix(c(1, 600,650,2.5,3.8),ncol=1)
# Standard error of mu_Y
# Standard error of yhat for new obs.
d. Note that both the types of interval for the new predictor values (600, 650, 2.4, 3.8) use the same point-estimate \(\hat{Y} = x^T\hat{\beta}\). Calculate this point-estimate and verify that it equals the value we found in Part (a). Also, turn it into a scalar so we can use it for the confidence intervals in Part (e).
e. Now use \(t_{0.975, n-k-1}\) to reproduce the 95% - confidence intervals we found in Part (a). Print each with a label (the label can be on the previous line) along with lower.bound and upper.bound.
# confidence interval for average of all new observations
# with predictors = (600, 650, 2.4, 3.8)
#CI_muy.lower <-
#CI_muy.upper <-
# print("The upper and lower bounds of the confidence interval for mu_Y are")
# print(c(CI_muy.lower, CI_muy.upper))
# Now do prediction interval for gpa of a single new observation
# with predictors = (600, 650, 2.4, 3.8)
f. Feel some satisfaction at being able to reproduce the output of lm
for a multi-linear regression. You now know how to calculate a linear regression and how to properly use it for prediction. Next in the course, we will learn to decide between different models to choose the “best” one.