Review the big ideas for k-fold cross-validation

  1. Explain how k-fold cross-validation is implemented.

Essentially, this method randomly splits the data into K folds of approximately equal size, where K is typically 5 or 10. The model is fit on K-1 of the folds and validated on the one held-out fold, and this is repeated K times so that each fold serves as the validation set exactly once; the K validation errors are then averaged. Because the training sets overlap, the fitted models are similar, but not so similar that the correlation between them inflates the variance of the estimated mean squared error. A rough sketch of these mechanics is shown below.
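Here is a minimal sketch of manual k-fold CV in R, assuming a data frame data with columns x and y (cv.glm() from the boot package, used later in this assignment, automates all of this):

k<-5
set.seed(1)
folds<-sample(rep(1:k, length.out=nrow(data)))   # randomly assign each row to one of k folds
fold.mse<-rep(0, k)
for(j in 1:k){
  fit<-glm(y~x, data=data[folds!=j, ])           # fit on the other k-1 folds
  held.out<-data[folds==j, ]                     # hold out fold j as the validation set
  fold.mse[j]<-mean((held.out$y-predict(fit, held.out))^2)
}
mean(fold.mse)   # the k-fold estimate of the test MSE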

  2. What are the advantages and disadvantages of k-fold cross-validation relative to:

  a. The (single) validation set approach? The single validation set approach produces a highly variable estimate of the test error rate, because the estimate depends on exactly which observations happen to be placed in the validation set. Moreover, this method tends to overestimate the test error: with a 50-50 split, the model is trained on only half of the data available to you. K-fold cross-validation, by contrast, gives a less variable estimate of the MSE because it averages over several fits whose errors are not too highly correlated. It should also be noted that a k-fold split can leave influential outliers out of a training fold, which raises a prediction problem: how would the fitted model predict that outlier if it was left out?

  b. LOOCV? LOOCV fits n models on training sets that are nearly identical, so the resulting error estimates are highly correlated, and averaging highly correlated quantities yields an estimate with higher variance; the k-fold fits are less correlated, so the k-fold average is more stable. LOOCV also leaves exactly one observation out at a time, which can be problematic when that observation is an influential outlier: the corresponding fit can produce a very different MSE and, as stated above, it is unclear how well the model would predict such a left-out point. A quick numerical comparison of all three approaches is sketched below.
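For concreteness, all three estimates can be computed on the simulated data from Problem 2 below (a sketch, assuming the data frame data defined there; cv.glm() performs LOOCV by default and k-fold CV when K is supplied):

library(boot)
set.seed(1)
train<-sample(nrow(data), nrow(data)/2)          # random 50-50 split
fit.tr<-glm(y~x, data=data, subset=train)
mean((data$y-predict(fit.tr, data))[-train]^2)   # single validation-set MSE

fit<-glm(y~x, data=data)
cv.glm(data, fit, K=10)$delta[1]   # 10-fold CV estimate
cv.glm(data, fit)$delta[1]         # LOOCV estimate (K defaults to n)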

Problem 2: We will now perform cross-validation on a simulated data set. (a) Generate a simulated data set as follows: n = 100 observations and p = 2 explanatory variables, with the true model Y = X - 2X^2 + ε.

set.seed(1)
x=rnorm(100)           # n = 100 draws of the predictor
y=x-2*x^2+rnorm(100)   # true model: Y = X - 2X^2 + N(0, 1) noise
plot(x,y)

The scatterplot shows an inverted U-shaped curve, which suggests that the relationship is non-linear (quadratic). Moreover, most of the observations cluster near the top of the curve, between x = -1 and x = 1, indicating that the curve peaks around x = 0.

set.seed(5)
data=data.frame(x=x,y=y)


library(boot)
genfit<-glm(y~x, data=data)
cv.new<-cv.glm(data, genfit)   # cv.glm() performs LOOCV by default

error.cv<-rep(0, 4)
for(i in 1:4){
  glm.fit<-glm(y~poly(x, i), data=data)         # polynomial fit of degree i
  error.cv[i]<-cv.glm(data, glm.fit)$delta[1]   # LOOCV estimate of the test MSE
}

dframe.cv<-data.frame(degree=1:4, error.cv)
error.cv
## [1] 7.2881616 0.9374236 0.9566218 0.9539049
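To confirm programmatically which degree minimizes the LOOCV error, which.min() can be applied to the vector above (it returns 2, the quadratic fit):

which.min(error.cv)   # index of the smallest LOOCV error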
set.seed(505)
data=data.frame(x=x,y=y)

library(boot)
genfit<-glm(y~x, data=data)
cv.new<-cv.glm(data, genfit)

error.cv<-rep(0, 4)
for(i in 1:4){
  glm.fit<-glm(y~poly(x, i), data=data)
  error.cv[i]<-cv.glm(data, glm.fit)$delta[1]
}

dframe.cv<-data.frame(degree=1:4, error.cv)
error.cv
## [1] 7.2881616 0.9374236 0.9566218 0.9539049

The outputs of steps (c) and (d) are the same because LOOCV involves no random splitting: each of the n observations is held out exactly once, so the procedure is fully deterministic and the choice of seed has no effect.
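By contrast, k-fold cross-validation does assign observations to folds at random, so its estimate would change with the seed. A quick illustration of the difference, using K = 10 on the quadratic fit:

set.seed(5)
cv.glm(data, glm(y~poly(x, 2), data=data), K=10)$delta[1]
set.seed(505)
cv.glm(data, glm(y~poly(x, 2), data=data), K=10)$delta[1]
# the two 10-fold estimates will generally differ slightly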

  1. The second model in part (c) had the smallest error. This is what I would have expected, because the quadratic form matches the original model, Y = X - 2X^2 + ε, that we used to generate the data.
glm.fit<-glm(y~poly(x, i), data=data)   # i is still 4 after the loop, so this is the degree-4 fit
summary(glm.fit)
## 
## Call:
## glm(formula = y ~ poly(x, i), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0550  -0.6212  -0.1567   0.5952   2.2267  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.55002    0.09591 -16.162  < 2e-16 ***
## poly(x, i)1   6.18883    0.95905   6.453 4.59e-09 ***
## poly(x, i)2 -23.94830    0.95905 -24.971  < 2e-16 ***
## poly(x, i)3   0.26411    0.95905   0.275    0.784    
## poly(x, i)4   1.25710    0.95905   1.311    0.193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9197797)
## 
##     Null deviance: 700.852  on 99  degrees of freedom
## Residual deviance:  87.379  on 95  degrees of freedom
## AIC: 282.3
## 
## Number of Fisher Scoring iterations: 2

The linear and quadratic terms have p-values below the 0.05 threshold, meaning we can reject the null hypothesis that their coefficients are zero and conclude that they are statistically significant, while the cubic and quartic terms are not. Importantly, the smallest p-value belongs to the quadratic term, which agrees with our cross-validation results in the previous steps.
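As a cross-check on these coefficient p-values, the nested polynomial fits can also be compared directly with an F-test (a sketch; for a Gaussian glm, anova() needs test = "F"):

fit1<-glm(y~poly(x, 1), data=data)
fit2<-glm(y~poly(x, 2), data=data)
fit3<-glm(y~poly(x, 3), data=data)
fit4<-glm(y~poly(x, 4), data=data)
anova(fit1, fit2, fit3, fit4, test="F")   # the jump from degree 1 to 2 should be the last significant improvement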