Homework 6

Problem 1

Review the big ideas for k-fold cross validation.

#1A) Explain how k-fold cross-validation is implemented.

#K-fold cross-validation randomly splits the data into k groups ("folds") of approximately equal size. For each fold in turn, the model is trained on the other k - 1 folds and tested on the held-out fold, and the test MSE (mean squared error) on that fold is recorded. Averaging the k fold-level MSEs gives the overall cross-validation estimate of the model's test error.
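
#To make the procedure concrete, below is a minimal hand-rolled sketch of k-fold CV on a toy simulated data set. The choices here (k = 5, the data frame df, the seed) are illustrative, not part of the assignment.

set.seed(10)
k <- 5
df <- data.frame(x = rnorm(100))
df$y <- df$x - 2*df$x^2 + rnorm(100)
folds <- sample(rep(1:k, length.out = nrow(df)))  # randomly assign each row to one of k folds
mse <- numeric(k)
for (j in 1:k) {
  fit <- lm(y ~ x, data = df[folds != j, ])       # train on the other k-1 folds
  preds <- predict(fit, newdata = df[folds == j, ])
  mse[j] <- mean((df$y[folds == j] - preds)^2)    # test MSE on held-out fold j
}
mean(mse)                                         # k-fold CV estimate: average of the k fold MSEs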

#1B) What are the advantages and disadvantages of k-fold cross validation relative to the single validation set approach? LOOCV?

#(i) The single validation set approach splits a data set (typically 50/50, 70/30, or 80/20), trains the model on one part, and tests it on the other. The advantage is simplicity: the held-out observations have known responses, so predicted values can be compared directly to actual values and the error assessed immediately, all within one data set. The disadvantage is a bias-variance trade-off in the split: the fewer observations used for validation, the more the error estimate varies from split to split, while the fewer observations used for training, the worse the model fits and the more the estimate overstates the test error. There is no consensus on the ideal split, so the test error estimate changes with the particular split and tends to overestimate the true test error. This method is much easier than k-fold, but also more variable.
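
#For contrast, here is the same toy df from the sketch above put through the single validation-set approach (the 70/30 split is an illustrative choice):

train_idx <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))
fit <- lm(y ~ x, data = df[train_idx, ])
preds <- predict(fit, newdata = df[-train_idx, ])
mean((df$y[-train_idx] - preds)^2)   # one split, one estimate; it varies with the split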

#(ii) Leave-one-out cross-validation (LOOCV) trains the model on n - 1 observations and tests on the single observation left out, repeating this for every observation in the data set. The advantage is that the estimate is approximately unbiased, because each training set is nearly the full data set. The drawback is high variance: the n fitted models are trained on almost identical data and so are highly correlated. Compared to LOOCV, k-fold has lower variance and is less computationally expensive (k model fits instead of n).
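
#And a sketch of LOOCV by hand on the same toy df, to show the n separate fits described above:

errs <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  fit <- lm(y ~ x, data = df[-i, ])                 # train on the n-1 other observations
  errs[i] <- (df$y[i] - predict(fit, df[i, ]))^2    # squared error on the one left out
}
mean(errs)   # LOOCV estimate of test MSE; matches cv.glm's default (K = n) delta[1]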

Problem 2

#2A) Generate a simulated data set as follows:

set.seed(1)
x <- rnorm(100)
y <- x - 2*x^2 + rnorm(100)
str(x)
##  num [1:100] -0.626 0.184 -0.836 1.595 0.33 ...

#n = 100, since rnorm(100) draws 100 observations from a standard normal distribution

#There is one explanatory variable, p = 1

#Simple linear regression model (y on x)

mlr_mod2 <- lm(y ~ x)
summary(mlr_mod2)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5161 -0.6800  0.6812  1.5491  3.8183 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.6254     0.2619  -6.205 1.31e-08 ***
## x             0.6925     0.2909   2.380   0.0192 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.6 on 98 degrees of freedom
## Multiple R-squared:  0.05465,    Adjusted R-squared:  0.045 
## F-statistic: 5.665 on 1 and 98 DF,  p-value: 0.01924

#2B) Create a scatter plot of X against Y. Comment on what you find.

library(tidyverse)
## -- Attaching packages ----------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts -------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data <- data.frame(x, y)
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()

#The scatterplot indicates a nonlinear, concave relationship between X and Y. As X increases, Y first increases, peaks around X = 0.25, and then decreases, which is the shape expected from the quadratic y = x - 2x^2 + e used to generate the data. The straight-line fit clearly misses this curvature.
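
#Given that curvature, a quadratic smooth should track the points much better than the straight line above; a quick check (only the formula argument changes from the plot above):

ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  theme_bw()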

#2C) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models (polynomials in X of degree 1 through 4) using least squares.

glm.fit <- glm(y ~ x, data = data)   # same fit as lm(y ~ x), but compatible with cv.glm()
coef(glm.fit)
## (Intercept)           x 
##   -1.625427    0.692497
library(boot)
cv.error <- cv.glm(data, glm.fit)   # LOOCV is the default (K = n)
cv.error$delta
## [1] 7.288162 7.284744
set.seed(2)   # seed set per the prompt, though LOOCV itself involves no randomness
cv.error <- rep(0, 4)
for (i in 1:4) {
  glm.fit <- glm(y ~ poly(x, i), data = data)     # polynomial fit of degree i
  cv.error[i] <- cv.glm(data, glm.fit)$delta[1]   # LOOCV estimate of test MSE
}

cvDF <- data.frame(degree=1:4, cv.error)
cvDF
##   degree  cv.error
## 1      1 7.2881616
## 2      2 0.9374236
## 3      3 0.9566218
## 4      4 0.9539049
ggplot(cvDF, aes(x = degree, y = cv.error)) +
  geom_point() +
  geom_line()

#2D) Repeat (C) using another random seed, and report your results. Are your results the same as those you got in (C)? Why?

set.seed(345)   # different seed, same deterministic LOOCV procedure
cv.error <- rep(0, 4)
for (i in 1:4) {
  glm.fit <- glm(y ~ poly(x, i), data = data)
  cv.error[i] <- cv.glm(data, glm.fit)$delta[1]
}

cvDF <- data.frame(degree=1:4, cv.error)
cvDF
##   degree  cv.error
## 1      1 7.2881616
## 2      2 0.9374236
## 3      3 0.9566218
## 4      4 0.9539049
ggplot(cvDF, aes(x = degree, y = cv.error)) +
  geom_point() +
  geom_line()

#The results using a different seed are identical: cv.error drops sharply between degree 1 and degree 2 and then stays low as the degree increases, with exactly the same values as in (C). This is because LOOCV involves no random splitting: every observation is left out exactly once, so the same n training sets of n - 1 observations are used regardless of the seed.

#2E) Which of the models in (C) had the smallest LOOCV error? Is this what you expected?

#Model (ii), the 2nd-degree polynomial, had the smallest cv.error. This is what I expected, because the true relationship between X and Y is a 2nd-degree polynomial (and nothing higher). A model that includes X^2 can capture the shape of the underlying relationship and therefore predicts better than the degree-1 model, while higher-degree models are unnecessary: the 2nd-degree model already fits the data well, and the extra terms add variance without reducing bias.

#2F) Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (C) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?

summary(glm.fit)   # glm.fit holds the degree-4 fit left over from the loop, which nests the lower degrees
## 
## Call:
## glm(formula = y ~ poly(x, i), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0550  -0.6212  -0.1567   0.5952   2.2267  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.55002    0.09591 -16.162  < 2e-16 ***
## poly(x, i)1   6.18883    0.95905   6.453 4.59e-09 ***
## poly(x, i)2 -23.94830    0.95905 -24.971  < 2e-16 ***
## poly(x, i)3   0.26411    0.95905   0.275    0.784    
## poly(x, i)4   1.25710    0.95905   1.311    0.193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9197797)
## 
##     Null deviance: 700.852  on 99  degrees of freedom
## Residual deviance:  87.379  on 95  degrees of freedom
## AIC: 282.3
## 
## Number of Fisher Scoring iterations: 2

#The coefficient estimates for the 1st- and 2nd-degree terms are statistically significant, showing that those terms help predict Y, while the estimates beyond the 2nd degree are not (p = 0.784 and p = 0.193). Since poly() fits orthogonal polynomials, each term's t-test can be read independently of the others. This agrees with the conclusions from cross-validation: the LOOCV MSE is lowest for the degree-2 model, meaning the model fits the data best at that point. Both approaches therefore identify the 2nd-degree polynomial model as the best choice for these data.
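
#As a cross-check (not part of the assignment), sequential F-tests on the nested polynomial fits tell the same story: the degree-2 term is the last addition that significantly improves the fit.

fits <- lapply(1:4, function(d) lm(y ~ poly(x, d), data = data))
do.call(anova, fits)   # compares each fit against the next lower degree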