k-fold cross-validation splits the data into k "folds" of approximately equal size. The model is trained on all folds but one and tested on the fold that was left out, and this process is repeated so that each fold serves as the test set exactly once. For each held-out fold, the mean squared error of the fitted model is computed, and the k resulting errors are averaged to give the cross-validation estimate of the test error.
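As a rough sketch of how k-fold cross-validation can be run in R, the cv.glm() function in the boot package takes a K argument; the data frame sim, the model, and the choice K = 10 below are placeholders used only for illustration.

library(boot)
# placeholder data, simulated only to illustrate the mechanics of k-fold CV
set.seed(10)
sim <- data.frame(x = rnorm(100))
sim$y <- sim$x - 2 * sim$x^2 + rnorm(100)
fit <- glm(y ~ x, data = sim)
# K = 10 requests 10-fold CV; delta[1] is the cross-validated estimate of the test MSE
cv.glm(sim, fit, K = 10)$delta[1]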
The (single) validation approach splits the data into a training set and a test set. The model is fit using only the training set, and the test set is treated as "new" data that is plugged into the model to estimate the test error. The benefit over k-fold is that the (single) validation method is easy to implement and only has to be done once. However, the (single) validation approach has higher variability than k-fold, because the estimate depends on which observations happen to fall into each group. It also suffers from a trade-off: a smaller training set leads to greater variability in the estimates of the test error, while a smaller test set leads to greater variability in the measured performance. In either case, splitting the data means the model is trained on fewer observations, which tends to inflate the estimated test error rate.
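A minimal sketch of the validation set approach on placeholder simulated data (the 50/50 split, the seed, and the object names are arbitrary choices for illustration):

# placeholder data for illustrating a single train/test split
set.seed(20)
sim <- data.frame(x = rnorm(100))
sim$y <- sim$x - 2 * sim$x^2 + rnorm(100)
train <- sample(nrow(sim), nrow(sim) / 2)     # indices of the training half
fit <- lm(y ~ x, data = sim, subset = train)
mean((sim$y - predict(fit, sim))[-train]^2)   # test MSE on the held-out half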
Leave-one-out cross-validation (LOOCV) creates a model based on n-1 observations and tests that model on the single remaining observation, repeating the process until every observation has been held out once. The benefit of LOOCV over k-fold is that each model is fit on n-1 observations, so it is trained on nearly the full data set, which reduces the bias of the error estimate. However, LOOCV has higher variance than k-fold because the n fitted models are highly correlated with each other: since only one observation is left out at a time, the training sets overlap almost completely. k-fold training sets also overlap (the more folds, the more overlap), but less than in LOOCV.
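And a sketch of LOOCV written out by hand on placeholder data, fitting n models that each leave out a single observation (the cv.glm() calls later in this document do the same thing automatically):

# placeholder data for illustrating LOOCV "by hand"
set.seed(30)
sim <- data.frame(x = rnorm(100))
sim$y <- sim$x - 2 * sim$x^2 + rnorm(100)
errs <- sapply(seq_len(nrow(sim)), function(i) {
  fit <- lm(y ~ x, data = sim[-i, ])        # train on the other n - 1 rows
  (sim$y[i] - predict(fit, sim[i, ]))^2     # squared error on row i
})
mean(errs)                                  # LOOCV estimate of the test MSE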
set.seed(1)
x = rnorm(100)
y = x-2*x^2+rnorm(100)
data <- data.frame(x,y)
str(data)
## 'data.frame': 100 obs. of 2 variables:
## $ x: num -0.626 0.184 -0.836 1.595 0.33 ...
## $ y: num -2.032 0.158 -3.143 -3.337 -0.542 ...
mlr_mod <- lm(y~x)
summary(mlr_mod)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5161 -0.6800 0.6812 1.5491 3.8183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.6254 0.2619 -6.205 1.31e-08 ***
## x 0.6925 0.2909 2.380 0.0192 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.6 on 98 degrees of freedom
## Multiple R-squared: 0.05465, Adjusted R-squared: 0.045
## F-statistic: 5.665 on 1 and 98 DF, p-value: 0.01924
Here n = 100 (the number of observations) and p = 1 (the single predictor x). The data were generated from the model y = x - 2x^2 + ε, with ε drawn from a standard normal distribution.
library(ggplot2)
data <- data.frame(x,y)
ggplot(data, aes(x = x, y = y))+
geom_point()+
geom_smooth(method = lm,se=FALSE)+
theme_bw()
The scatterplot indicates a non-linear relationship between x and y. Although x is a significant predictor of y, the relationship would be captured much more strongly if the model accounted for the curvature.
library(boot)
set.seed(87)
glm.fit<-glm(y~x, data=data)
coef(glm.fit)
## (Intercept) x
## -1.625427 0.692497
# LOOCV (cv.glm() defaults to K = n); delta[1] is the raw cross-validation
# estimate of the test MSE and delta[2] is the bias-corrected version
cv.err<-cv.glm(data, glm.fit)
cv.err$delta
## [1] 7.288162 7.284744
# LOOCV error for polynomial fits of degree 1 through 4
cv.error<-rep(0, 4)
for(i in 1:4){
  glm.fit<-glm(y~poly(x, i), data=data)
  cv.error[i]<-cv.glm(data, glm.fit)$delta[1]
}
cvDF<-data.frame(degree=1:4, cv.error)
cvDF
## degree cv.error
## 1 1 7.2881616
## 2 2 0.9374236
## 3 3 0.9566218
## 4 4 0.9539049
ggplot(cvDF, aes(x = degree, y = cv.error))+
geom_point()+
geom_line()
set.seed(43)
glm.fit<-glm(y~x, data=data)
coef(glm.fit)
## (Intercept) x
## -1.625427 0.692497
cv.err<-cv.glm(data, glm.fit)
cv.err$delta
## [1] 7.288162 7.284744
cv.error<-rep(0, 4)
for(i in 1:4){
  glm.fit<-glm(y~poly(x, i), data=data)
  cv.error[i]<-cv.glm(data, glm.fit)$delta[1]
}
cvDF<-data.frame(degree=1:4, cv.error)
cvDF
## degree cv.error
## 1 1 7.2881616
## 2 2 0.9374236
## 3 3 0.9566218
## 4 4 0.9539049
ggplot(cvDF, aes(x = degree, y = cv.error))+
geom_point()+
geom_line()
The results are the same for both seeds. This is because LOOCV is deterministic: it fits every possible split into n-1 training observations and one held-out observation, so every observation is left out exactly once and the seed has no effect on which models are fit or on the resulting error estimate.
The second-degree model had the smallest error. This is what I expected, because the data were simulated with a second-degree polynomial relationship between x and y; including the squared term therefore produces a regression curve that most closely resembles the data.
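To visualise this, one could overlay a quadratic least-squares fit on the scatterplot, re-using the data frame data and ggplot2 from above (this plot is only an illustrative sketch):

ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  # fit and draw y ~ x + x^2 rather than the default straight line
  geom_smooth(method = lm, formula = y ~ poly(x, 2), se = FALSE) +
  theme_bw()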
lm4 <- lm(y ~ poly(x, 4))
summary(lm4)
##
## Call:
## lm(formula = y ~ poly(x, 4))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0550 -0.6212 -0.1567 0.5952 2.2267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.55002 0.09591 -16.162 < 2e-16 ***
## poly(x, 4)1 6.18883 0.95905 6.453 4.59e-09 ***
## poly(x, 4)2 -23.94830 0.95905 -24.971 < 2e-16 ***
## poly(x, 4)3 0.26411 0.95905 0.275 0.784
## poly(x, 4)4 1.25710 0.95905 1.311 0.193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9591 on 95 degrees of freedom
## Multiple R-squared: 0.8753, Adjusted R-squared: 0.8701
## F-statistic: 166.7 on 4 and 95 DF, p-value: < 2.2e-16
The results suggest that predictors of the first and second degree are sufficient to create the best model; the third- and fourth-degree terms are not significant. This is in line with the conclusions drawn from the cross-validation: a second-degree polynomial regression is the most effective model for minimizing the test error.