Review the big ideas for k-fold cross-validation
Essentially, this method splits the data into K folds of approximately equal size, where K is typically 5 or 10. You fit the model on k-1 folds, compute the test error on the one held-out fold, and repeat the procedure k times so that each fold serves as the validation set exactly once. The k fitted models share most of their training data, so they are similar, but they are not as highly correlated as the models in leave-one-out CV, which keeps the variance of the averaged test-error estimate (the mean squared error) down.
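The mechanics can be sketched in a few lines of R (the fold-assignment approach and the toy data below are illustrative, not taken from the assignment):

set.seed(1)
k <- 5                                       # number of folds
x <- rnorm(100)
y <- x - 2*x^2 + rnorm(100)                  # toy data for illustration
folds <- sample(rep(1:k, length.out = 100))  # randomly assign each observation to a fold
cv.errors <- rep(0, k)
for (j in 1:k) {
  fit <- lm(y ~ x, subset = (folds != j))    # train on the k-1 remaining folds
  pred <- predict(fit, data.frame(x = x[folds == j]))
  cv.errors[j] <- mean((y[folds == j] - pred)^2)  # test MSE on the held-out fold
}
mean(cv.errors)                              # k-fold CV estimate of the test error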
What are the advantages and disadvantages of k-fold cross-validation relative to:
The (single) validation set approach? The single validation set approach produces a highly variable test-error estimate, because that estimate depends on which observations happen to land in the training set versus the validation set. Moreover, the validation set approach tends to overestimate the test error: with a 50-50 split, the model is trained on only half of the available data. K-fold CV, by contrast, averages over k fits that each use most of the data, giving a less variable estimate of the MSE without the high correlation between fits that inflates the variance of LOOCV. It should also be noted that any split can leave an influential outlier out of the training set, which raises a prediction question: how well would a model fit without that outlier predict it?
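The variability point can be demonstrated directly. A minimal sketch (again on illustrative simulated data) that recomputes the validation MSE under several random 50-50 splits shows how much the estimate moves around:

set.seed(2)
x <- rnorm(100)
y <- x - 2*x^2 + rnorm(100)                 # toy data for illustration
val.mse <- sapply(1:10, function(s) {
  train <- sample(100, 50)                  # a random half for training
  fit <- lm(y ~ x, subset = train)
  mean((y[-train] - predict(fit, data.frame(x = x[-train])))^2)  # MSE on the other half
})
range(val.mse)                              # spread across splits shows the high variability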
Problem 2: We will now perform cross-validation on a simulated data set. (a) Generate a simulated data set as follows: n = 100 observations and p = 2 explanatory variables, with the model Y = X - 2X^2 + ε, where ε is standard normal noise.
set.seed(1)
x <- rnorm(100)                # 100 draws from N(0, 1)
y <- x - 2*x^2 + rnorm(100)    # quadratic signal plus standard normal noise
plot(x, y)
This is an inverted U-shaped curve, which tells us that the relationship is likely non-linear (quadratic). Moreover, we can see a cluster of observations towards the top of the curve, between x = -1 and x = 1, indicating that the curve is centered around 0.
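As a sanity check (not part of the required answer), one can overlay the true mean function on the scatterplot to confirm the inverted-U reading:

plot(x, y)
curve(x - 2*x^2, add = TRUE, col = "red")   # true mean function E[Y|X] = X - 2X^2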
set.seed(5)
data <- data.frame(x = x, y = y)
library(boot)
genfit <- glm(y ~ x, data = data)
cv.new <- cv.glm(data, genfit)   # cv.glm with no K argument performs leave-one-out CV (LOOCV)
error.cv <- rep(0, 4)
for (i in 1:4) {
  glm.fit <- glm(y ~ poly(x, i), data = data)     # polynomial fit of degree i
  error.cv[i] <- cv.glm(data, glm.fit)$delta[1]   # delta[1] is the raw LOOCV error estimate
}
dframe.cv <- data.frame(degree = 1:4, error.cv)
error.cv
## [1] 7.2881616 0.9374236 0.9566218 0.9539049
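As a quick programmatic check of the result above:

which.min(error.cv)   # 2: the quadratic (degree-2) model has the lowest LOOCV error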
set.seed(505)   # repeat part (c) with a different random seed
data <- data.frame(x = x, y = y)
genfit <- glm(y ~ x, data = data)
cv.new <- cv.glm(data, genfit)
error.cv <- rep(0, 4)
for (i in 1:4) {
  glm.fit <- glm(y ~ poly(x, i), data = data)
  error.cv[i] <- cv.glm(data, glm.fit)$delta[1]
}
dframe.cv <- data.frame(degree = 1:4, error.cv)
error.cv
## [1] 7.2881616 0.9374236 0.9566218 0.9539049
The outputs in parts (c) and (d) are identical because cv.glm with its default K performs leave-one-out cross-validation, which involves no randomness: every observation is held out exactly once, so the training and testing splits do not depend on the seed.
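By contrast, if we asked cv.glm for true k-fold cross-validation by passing its K argument (a hypothetical variation, not part of the assignment), the fold assignments would be random and different seeds would generally give slightly different error estimates:

set.seed(5)
cv.glm(data, glm(y ~ poly(x, 2), data = data), K = 10)$delta[1]   # one 10-fold estimate
set.seed(505)
cv.glm(data, glm(y ~ poly(x, 2), data = data), K = 10)$delta[1]   # typically differs slightly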
glm.fit <- glm(y ~ poly(x, i), data = data)   # i is 4 here, left over from the final loop iteration
summary(glm.fit)
##
## Call:
## glm(formula = y ~ poly(x, i), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0550 -0.6212 -0.1567 0.5952 2.2267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.55002 0.09591 -16.162 < 2e-16 ***
## poly(x, i)1 6.18883 0.95905 6.453 4.59e-09 ***
## poly(x, i)2 -23.94830 0.95905 -24.971 < 2e-16 ***
## poly(x, i)3 0.26411 0.95905 0.275 0.784
## poly(x, i)4 1.25710 0.95905 1.311 0.193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9197797)
##
## Null deviance: 700.852 on 99 degrees of freedom
## Residual deviance: 87.379 on 95 degrees of freedom
## AIC: 282.3
##
## Number of Fisher Scoring iterations: 2
The linear and quadratic terms have p-values below the 0.05 threshold, so we reject the null hypothesis that their coefficients are zero and conclude that they are statistically significant; the cubic and quartic terms are not. Importantly, the smallest p-value belongs to the quadratic term, which agrees with our cross-validation results in the previous steps, where the degree-2 model had the lowest estimated test error.