Problem Set 6

Problem 1

(a) k-fold cross-validation is implemented by randomly dividing the set of n observations into k non-overlapping groups (folds) of approximately equal size, usually with k=5 or k=10. The first fold (i=1) serves as the validation set and is held out, while the model is fit on the other k-1 folds. The mean squared error is then computed on the observations in the held-out fold. This is repeated k times, with a different fold serving as the validation set in each iteration. The process yields k estimates of the test error, which are then averaged.
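The procedure described in (a) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `kfold_mse` is a hypothetical helper written for this example (it is not part of any package), and the toy data are simulated here just so the block runs on its own.

```r
# Minimal k-fold cross-validation for a linear model, base R only.
# kfold_mse is a hypothetical helper for illustration, not a package function.
kfold_mse <- function(formula, data, k = 5) {
  n <- nrow(data)
  # Randomly assign each observation to one of k roughly equal folds
  fold <- sample(rep(seq_len(k), length.out = n))
  response <- all.vars(formula)[1]
  fold_mse <- numeric(k)
  for (i in seq_len(k)) {
    held_out <- data[fold == i, , drop = FALSE]                 # validation fold
    fit <- lm(formula, data = data[fold != i, , drop = FALSE])  # fit on the other k-1 folds
    pred <- predict(fit, newdata = held_out)
    fold_mse[i] <- mean((held_out[[response]] - pred)^2)        # MSE on the held-out fold
  }
  mean(fold_mse)  # average the k fold-level estimates
}

# Toy data (assumed for illustration): quadratic signal plus standard normal noise
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
sim <- data.frame(x = x, y = y)

cv_linear <- kfold_mse(y ~ x, sim, k = 5)
cv_quadratic <- kfold_mse(y ~ poly(x, 2), sim, k = 5)
c(linear = cv_linear, quadratic = cv_quadratic)
```

On data like these, the quadratic fit should show a much smaller cross-validated error than the straight-line fit.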

(b.i.) Compared with k-fold cross-validation, the validation set approach can produce a highly variable estimate of the test error, because the result depends on which observations happen to fall in the training set and which in the validation set. The validation set approach also uses only a subset of the observations to fit the model, which makes it likely to overestimate the test error of the model fit on our whole data set.
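The variability of the validation set approach can be seen by repeating a single 50/50 split several times. The sketch below assumes simulated toy data and a hypothetical helper, `validation_mse`, both introduced only for this illustration.

```r
# Toy data (assumed for illustration)
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
sim <- data.frame(x = x, y = y)

# One validation-set estimate: fit on a random half, score on the other half.
# validation_mse is a hypothetical helper, not a package function.
validation_mse <- function(data) {
  train_idx <- sample(nrow(data), nrow(data) / 2)
  fit <- lm(y ~ poly(x, 2), data = data[train_idx, ])
  pred <- predict(fit, newdata = data[-train_idx, ])
  mean((data$y[-train_idx] - pred)^2)
}

# Repeat with 10 different random splits of the same data
estimates <- replicate(10, validation_mse(sim))
range(estimates)  # the spread reflects dependence on the particular split
```

Each estimate uses the same data and the same model; only the random split changes.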

(b.ii.) Compared with k-fold cross-validation, LOOCV produces a test error estimate with higher variance. The n fitted models are trained on nearly identical sets of observations, so their outputs are highly correlated, and we’ve learned that the mean of highly correlated values has greater variance than the mean of less correlated values. LOOCV also requires fitting the model n times, which can be computationally expensive. However, there is a bias-variance trade-off between the two approaches: because each LOOCV fit uses n-1 observations, nearly the full data set, LOOCV tends to have less bias.
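For least squares fits specifically, the cost of n refits can be avoided: the leave-one-out residual for observation i equals the ordinary residual divided by (1 - h_i), where h_i is the leverage. The sketch below, on simulated toy data assumed for illustration, compares the explicit n-fit loop with this one-fit shortcut. The shortcut applies to linear least squares and does not carry over to arbitrary models.

```r
# Toy data (assumed for illustration)
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
sim <- data.frame(x = x, y = y)

# Explicit LOOCV: refit the model n times, each time leaving out one observation
loocv_loop <- mean(sapply(seq_len(nrow(sim)), function(i) {
  fit <- lm(y ~ x + I(x^2), data = sim[-i, ])
  (sim$y[i] - predict(fit, newdata = sim[i, ]))^2
}))

# Shortcut: a single fit, with each residual rescaled by its leverage
fit_all <- lm(y ~ x + I(x^2), data = sim)
loocv_shortcut <- mean((residuals(fit_all) / (1 - hatvalues(fit_all)))^2)

c(loop = loocv_loop, shortcut = loocv_shortcut)
```

The two numbers should agree up to floating-point error.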

Problem 2

(a)

set.seed(1)
x=rnorm(100)
y=x-2*x^2+rnorm(100)

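Here n=100 and p=2 (the predictors are X and X^2). The true coefficients are B0 = 0, B1 = 1, and B2 = -2, and the error term epsilon is standard normal, generated by rnorm(100). The model that generated the data is Y = X - 2X^2 + epsilon, so the corresponding prediction equation is Yi hat = Xi - 2Xi^2.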
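As a quick sanity check on the model written above, a least squares fit with raw x and x^2 terms should return estimates close to the generating coefficients 0, 1, and -2. This is a sketch; the object names `quad_fit` and `est` are introduced here for illustration only.

```r
# Regenerate the data exactly as in (a)
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)

# Fit the quadratic model with raw (non-orthogonal) terms so the
# coefficients are directly comparable to B0, B1, B2
quad_fit <- lm(y ~ x + I(x^2))
est <- unname(coef(quad_fit))
round(est, 3)  # compare with the true values 0, 1, -2
```

Raw terms are used here because the coefficients of `poly(x, 2)` are on an orthogonalized scale and cannot be read off against B1 and B2 directly.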

(b)

plot(x,y)

We see an inverted U-shaped curvature in the scatter of points, which suggests a non-linear (roughly quadratic) relationship between x and y. The points are most dense in the middle of the distribution, near the top of the curve, reflecting the greater frequency of x values between -1 and 1.

(c)

set.seed(2)
data=data.frame(x=x,y=y)
library(tidyverse)
library(boot)
glm.fit<-glm(y~x, data=data)
cv.err<-cv.glm(data, glm.fit)

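# LOOCV error for polynomial fits of degree 1 through 4
cv.error<-rep(0, 4)
for(i in 1:4){
  glm.fit<-glm(y~poly(x, i), data=data)
  # cv.glm defaults to K=n (LOOCV); delta[1] is the raw CV estimate of prediction error
  cv.error[i]<-cv.glm(data, glm.fit)$delta[1]
}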

cvDF<-data.frame(degree=1:4, cv.error)
cv.error
## [1] 7.2881616 0.9374236 0.9566218 0.9539049

(c.i.) 7.2881616

(c.ii.) 0.9374236

(c.iii.) 0.9566218

(c.iv.) 0.9539049

(d)

set.seed(52)
data=data.frame(x=x,y=y)
library(tidyverse)
library(boot)
glm.fit<-glm(y~x, data=data)
cv.err<-cv.glm(data, glm.fit)

cv.error<-rep(0, 4)
for(i in 1:4){
  glm.fit<-glm(y~poly(x, i), data=data)
  cv.error[i]<-cv.glm(data, glm.fit)$delta[1]
}

cvDF<-data.frame(degree=1:4, cv.error)
cv.error
## [1] 7.2881616 0.9374236 0.9566218 0.9539049

The results are identical to those in (c). LOOCV involves no randomness in the training/validation splits: each of the n iterations holds out exactly one observation and fits the model on the remaining n-1, so the same n splits are evaluated regardless of the seed.
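By contrast, k-fold cross-validation with k < n does depend on the seed, because the fold assignment is random. The sketch below uses `cv.glm` from the boot package with the two seeds from (c) and (d); the wrapper `cv_with_seed` is a hypothetical helper introduced for this illustration.

```r
library(boot)

# Regenerate the data exactly as in (a)
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
sim <- data.frame(x = x, y = y)
fit <- glm(y ~ poly(x, 2), data = sim)

# Hypothetical helper: set the seed, then return the raw CV error estimate
cv_with_seed <- function(seed, K) {
  set.seed(seed)
  cv.glm(sim, fit, K = K)$delta[1]
}

# LOOCV (K = n): the same n single-observation folds under any seed
loocv_a <- cv_with_seed(2, K = nrow(sim))
loocv_b <- cv_with_seed(52, K = nrow(sim))

# 10-fold: fold membership is random, so the estimate changes with the seed
tenfold_a <- cv_with_seed(2, K = 10)
tenfold_b <- cv_with_seed(52, K = 10)

c(loocv_a = loocv_a, loocv_b = loocv_b, tenfold_a = tenfold_a, tenfold_b = tenfold_b)
```

The two LOOCV values should match up to floating-point error, while the two 10-fold values will generally differ slightly.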

(e) The second model (ii), the quadratic, had the smallest LOOCV error. This is what we expected, because the data were generated from a quadratic function of x.

(f)

# i is still 4 from the loop above, so this fits the degree-4 polynomial (model iv)
glm.fit<-glm(y~poly(x, i), data=data)
summary(glm.fit)
## 
## Call:
## glm(formula = y ~ poly(x, i), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0550  -0.6212  -0.1567   0.5952   2.2267  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.55002    0.09591 -16.162  < 2e-16 ***
## poly(x, i)1   6.18883    0.95905   6.453 4.59e-09 ***
## poly(x, i)2 -23.94830    0.95905 -24.971  < 2e-16 ***
## poly(x, i)3   0.26411    0.95905   0.275    0.784    
## poly(x, i)4   1.25710    0.95905   1.311    0.193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9197797)
## 
##     Null deviance: 700.852  on 99  degrees of freedom
## Residual deviance:  87.379  on 95  degrees of freedom
## AIC: 282.3
## 
## Number of Fisher Scoring iterations: 2

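We find that the coefficient estimates of the linear and quadratic terms are statistically significant (p < 0.001), while the cubic and quartic terms are not (p = 0.784 and p = 0.193). This agrees with the conclusions drawn from the cross-validation results, where the second (quadratic) model had the smallest LOOCV error.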