Problem 1

(a) Explain how k-fold cross validation is implemented

k-fold cross-validation randomly splits the data into k "folds" of approximately equal size. The model is trained on all folds but one and then tested on the fold that was left out, and this process is repeated so that each fold serves as the test set exactly once. For each of the k fits, the mean squared error on the held-out fold is computed, and the k-fold CV estimate of the test error is the average of these k test MSEs.
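
A minimal sketch of this procedure in R (the helper name kfold_mse, the model y ~ x, and the generic data frame df are illustrative assumptions, not part of the assignment):

# k-fold CV by hand for a data frame df with columns x and y
kfold_mse <- function(df, k = 10) {
  folds <- sample(rep(1:k, length.out = nrow(df)))   # randomly assign each row to one of k folds
  fold_mse <- sapply(1:k, function(j) {
    fit  <- lm(y ~ x, data = df[folds != j, ])       # train on every fold except fold j
    pred <- predict(fit, newdata = df[folds == j, ]) # predict the held-out fold
    mean((df$y[folds == j] - pred)^2)                # test MSE for fold j
  })
  mean(fold_mse)                                     # average of the k test MSEs
}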

(b) What are the advantages and disadvantages of k-fold cross-validation relative to:

(i) The (single) validation set approach?

The (single) validation set approach splits the data into one training group and one testing group. The model is fit on the training group, and the testing group is treated as "new" data and fed into the fitted model. The advantage over k-fold is that the validation set approach is easy to implement and the model only has to be fit once. However, the validation set estimate of the test error is more variable than the k-fold estimate, because it depends heavily on which observations happen to land in each group. It also suffers from a trade-off: a smaller training group means the model is fit on less data, which tends to overestimate the test error, while a smaller testing group makes the error estimate itself noisier. In either case, because only part of the data is used to fit the model, the validation set approach tends to overstate the test error rate of a model fit on the full data set.
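
A minimal sketch of a single validation-set split, again assuming a generic data frame df with columns x and y:

set.seed(1)
train <- sample(nrow(df), floor(nrow(df) / 2))          # randomly choose half the rows for training
fit <- lm(y ~ x, data = df[train, ])                    # fit the model on the training half only
mean((df$y[-train] - predict(fit, df[-train, ]))^2)     # validation MSE on the held-out half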

(ii) LOOCV?

Leave-one-out cross-validation (LOOCV) fits a model on n - 1 observations and tests it on the single remaining observation, repeating this until every observation has been held out once. The advantage of LOOCV over k-fold is that each model is fit on n - 1 observations, nearly the entire data set, so the test error estimate has less bias. However, LOOCV tends to have higher variance than k-fold because the n fitted models are highly correlated with each other: since only one observation is left out at a time, any two training sets share all but two observations. k-fold training sets also overlap (the more folds, the more overlap), but less than in LOOCV.
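
With the boot package, both estimates come from the same function; a minimal sketch (cv.glm() defaults to LOOCV when K is not supplied, and the generic df is again an assumption):

library(boot)
glm_fit <- glm(y ~ x, data = df)           # default gaussian family, so the same fit as lm(), but usable with cv.glm()
cv.glm(df, glm_fit)$delta[1]               # LOOCV error: every observation is left out once, no randomness
cv.glm(df, glm_fit, K = 10)$delta[1]       # 10-fold CV error: folds are assigned at random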

Problem 2

(a) Generate a simulated data set. Write out the MLR model and indicate n and p.

set.seed(1)
x = rnorm(100)
y = x-2*x^2+rnorm(100)

data <- data.frame(x,y)
str(data)
## 'data.frame':    100 obs. of  2 variables:
##  $ x: num  -0.626 0.184 -0.836 1.595 0.33 ...
##  $ y: num  -2.032 0.158 -3.143 -3.337 -0.542 ...
mlr_mod <- lm(y ~ x)   # linear fit of y on x only; the quadratic term is not included
summary(mlr_mod)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5161 -0.6800  0.6812  1.5491  3.8183 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.6254     0.2619  -6.205 1.31e-08 ***
## x             0.6925     0.2909   2.380   0.0192 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.6 on 98 degrees of freedom
## Multiple R-squared:  0.05465,    Adjusted R-squared:  0.045 
## F-statistic: 5.665 on 1 and 98 DF,  p-value: 0.01924

The data were generated from the model y = x - 2x^2 + ε, where ε ~ N(0, 1). Written as a multiple linear regression in x and x², this gives n = 100 observations and p = 2 predictors.

(b) Create a scatterplot of x against y.

library(ggplot2)
data <- data.frame(x,y)
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  theme_bw()

The scatterplot indicates a clearly non-linear (roughly quadratic) relationship between x and y. Although x is a significant linear predictor of y, a model that accounts for the curvature would fit the data much better.
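
For illustration, overlaying a quadratic fit (the formula argument is my addition, not part of the original code) makes the curvature explicit:

ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ poly(x, 2), se = FALSE) +  # quadratic rather than linear smooth
  theme_bw()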

(c) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares

library(boot)
set.seed(87)
glm.fit <- glm(y ~ x, data = data)   # glm() with the default family gives the same fit as lm(), but works with cv.glm()
coef(glm.fit)
## (Intercept)           x 
##   -1.625427    0.692497
cv.err <- cv.glm(data, glm.fit)      # LOOCV by default (K = n)
cv.err$delta                         # raw and bias-corrected LOOCV estimates of the test MSE
## [1] 7.288162 7.284744
cv.error <- rep(0, 4)
for (i in 1:4) {
  # fit a degree-i polynomial and store its LOOCV error (the raw delta)
  glm.fit <- glm(y ~ poly(x, i), data = data)
  cv.error[i] <- cv.glm(data, glm.fit)$delta[1]
}

cvDF<-data.frame(degree=1:4, cv.error)
cvDF
##   degree  cv.error
## 1      1 7.2881616
## 2      2 0.9374236
## 3      3 0.9566218
## 4      4 0.9539049
ggplot(cvDF, aes(x = degree, y = cv.error))+
  geom_point()+
  geom_line()

(d) Repeat using another random seed, and report your results. Are your results the same as what you got in (c)?

set.seed(43)
glm.fit<-glm(y~x, data=data)
coef(glm.fit)
## (Intercept)           x 
##   -1.625427    0.692497
cv.err<-cv.glm(data, glm.fit)
cv.err$delta
## [1] 7.288162 7.284744
cv.error<-rep(0, 4)
for(i in 1:4){
  glm.fit<-glm(y~poly(x, i), data=data)
  cv.error[i]<-cv.glm(data, glm.fit)$delta[1]
}

cvDF<-data.frame(degree=1:4, cv.error)
cvDF
##   degree  cv.error
## 1      1 7.2881616
## 2      2 0.9374236
## 3      3 0.9566218
## 4      4 0.9539049
ggplot(cvDF, aes(x = degree, y = cv.error))+
  geom_point()+
  geom_line()

The results are identical for both seeds. LOOCV evaluates all n possible splits into a training set of n - 1 observations and a single held-out observation, so there is no random assignment of observations to folds and the seed has no effect on the estimate.
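
Because every split is enumerated, the same number can be reproduced by hand without any random number generation; a minimal sketch (the helper name loocv_mse is my own):

loocv_mse <- function(degree, df) {
  errs <- sapply(seq_len(nrow(df)), function(i) {
    fit <- lm(y ~ poly(x, degree), data = df[-i, ])   # fit on the n - 1 remaining observations
    (df$y[i] - predict(fit, newdata = df[i, ]))^2     # squared error on the single held-out observation
  })
  mean(errs)                                          # no sampling anywhere, so no seed is needed
}
loocv_mse(2, data)   # should match cv.glm's delta[1] for the quadratic model above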

(e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.

The second-degree model had the smallest LOOCV error. This is what I expected, because the data were simulated with a second-degree (quadratic) relationship between x and y, so including the x^2 term produces a fitted curve that most closely follows the true relationship.
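
A quick check against the table from (c), assuming cvDF is still in the workspace:

cvDF$degree[which.min(cvDF$cv.error)]   # returns 2, the degree with the smallest LOOCV error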

(f) Comment on the statistical significance of the coefficient estimates that results from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on cross-validation results?

lm4 <- lm(y ~ poly(x, 4))
summary(lm4)
## 
## Call:
## lm(formula = y ~ poly(x, 4))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0550 -0.6212 -0.1567  0.5952  2.2267 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.55002    0.09591 -16.162  < 2e-16 ***
## poly(x, 4)1   6.18883    0.95905   6.453 4.59e-09 ***
## poly(x, 4)2 -23.94830    0.95905 -24.971  < 2e-16 ***
## poly(x, 4)3   0.26411    0.95905   0.275    0.784    
## poly(x, 4)4   1.25710    0.95905   1.311    0.193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9591 on 95 degrees of freedom
## Multiple R-squared:  0.8753, Adjusted R-squared:  0.8701 
## F-statistic: 166.7 on 4 and 95 DF,  p-value: < 2.2e-16

Because poly() fits orthogonal polynomials, the degree-4 summary shows the contribution of each term separately: the first- and second-degree terms are highly significant, while the third- and fourth-degree terms are not. This agrees with the conclusions drawn from cross-validation, namely that a second-degree polynomial regression is the most effective model for minimizing the estimated test error.
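
As an additional check (not part of the original output), sequential F-tests on the nested polynomial fits tell the same story; a minimal sketch:

fits <- lapply(1:4, function(d) lm(y ~ poly(x, d), data = data))
do.call(anova, fits)   # only the step from degree 1 to degree 2 should be significant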