Explain how k-fold cross-validation is implemented. k-fold cross-validation is an intermediate between the single validation and leave-one-out approaches: the data are split into k folds of approximately equal size. The model is then trained on k-1 of the folds and validated on the remaining fold, and this process is repeated k times so that each fold is used once for validation; the k error estimates are averaged to give the cross-validation estimate. The rule of thumb is to use k = 5 or 10.
Single validation approach - In comparison to the single validation approach, the k-fold cross-validation estimate of the test error is less variable, because it averages over k splits rather than depending on a single random split, and each fit uses a larger share of the data for training, so the estimate is also less biased. (Relative to LOOCV, by contrast, the smaller training sets give k-fold somewhat more bias, but the less correlated fits give it a smaller variance.)
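As a rough sketch of the mechanics (an addition for illustration, not part of the original analysis; the data frame df, the simple linear model, and k = 5 are all placeholders), k-fold cross-validation can be coded by hand. In practice, cv.glm() from the boot package, which is used below for LOOCV, performs k-fold cross-validation when its K argument is supplied.
set.seed(99)
df <- data.frame(x = rnorm(50))                      # placeholder data set
df$y <- df$x + rnorm(50)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))     # randomly assign each row to one of k folds
cv.errors <- rep(NA, k)
for (j in 1:k) {
  fit <- lm(y ~ x, data = df[folds != j, ])          # train on the other k-1 folds
  pred <- predict(fit, newdata = df[folds == j, ])   # predict the held-out fold
  cv.errors[j] <- mean((df$y[folds == j] - pred)^2)  # test MSE on fold j
}
mean(cv.errors)                                      # k-fold CV estimate of the test error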
set.seed(1)
x=rnorm(100)
y=x-2*x^2+rnorm(100)
n (number of observations) = 100; p (number of explanatory variables) = 2, since the model used to generate the data is Y = X - 2X^2 + ε with ε ~ N(0, 1).
plot(x,y)
Based on the scatterplot, it is clear that there is curvature in the data and that a least squares regression line (a first-degree polynomial) would not be the best fit for the data.
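To make the curvature concrete (this overlay is an added illustration, not part of the original output), the least squares line can be drawn on top of the scatterplot:
plot(x, y)
abline(lm(y ~ x), col = "red")   # the single straight line clearly misses the bend in the data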
set.seed(44)
DATA<-data.frame(x, y)
library(boot)
glm.fit<-glm(y~x, data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 7.288162 7.284744
glm.fit<-glm(y~poly(x, 2), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9374236 0.9371789
glm.fit<-glm(y~poly(x, 3), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9566218 0.9562538
glm.fit<-glm(y~poly(x, 4), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9539049 0.9534453
set.seed(13)
DATA<-data.frame(x, y)
glm.fit<-glm(y~x, data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 7.288162 7.284744
glm.fit<-glm(y~poly(x, 2), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9374236 0.9371789
glm.fit<-glm(y~poly(x, 3), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9566218 0.9562538
glm.fit<-glm(y~poly(x, 4), data=DATA)
cv.err<-cv.glm(DATA, glm.fit)
cv.err$delta
## [1] 0.9539049 0.9534453
Rerunning the models with a different seed (set.seed(13)) gives exactly the same LOOCV errors, as expected: LOOCV involves no random splitting (each observation is left out exactly once), so the estimate does not depend on the seed. The second degree polynomial had the smallest LOOCV error. Based on the shape of the original scatterplot it seemed probable that a second degree polynomial would be the correct fit; however, I also expected the third and fourth degree polynomials to have low errors as well. The next question to consider is whether it is "worth it" to increase the degree (and risk overfitting) when doing so gives essentially no improvement in the error; the third and fourth degree errors are in fact slightly larger than the second degree error.
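As a side note (a sketch rather than the original workflow; loocv.errors is just an illustrative name), the four LOOCV fits above can be collected in one loop:
library(boot)
loocv.errors <- rep(NA, 4)
for (d in 1:4) {
  glm.fit <- glm(y ~ poly(x, d), data = DATA)          # polynomial of degree d
  loocv.errors[d] <- cv.glm(DATA, glm.fit)$delta[1]    # raw LOOCV estimate
}
loocv.errors                                           # errors for degrees 1 through 4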
mod4<-lm(y~poly(x, 4), data=DATA)
summary(mod4)
##
## Call:
## lm(formula = y ~ poly(x, 4), data = DATA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0550 -0.6212 -0.1567 0.5952 2.2267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.55002 0.09591 -16.162 < 2e-16 ***
## poly(x, 4)1 6.18883 0.95905 6.453 4.59e-09 ***
## poly(x, 4)2 -23.94830 0.95905 -24.971 < 2e-16 ***
## poly(x, 4)3 0.26411 0.95905 0.275 0.784
## poly(x, 4)4 1.25710 0.95905 1.311 0.193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9591 on 95 degrees of freedom
## Multiple R-squared: 0.8753, Adjusted R-squared: 0.8701
## F-statistic: 166.7 on 4 and 95 DF, p-value: < 2.2e-16
In the fourth degree fit, the first and second degree (linear and quadratic) terms have statistically significant coefficient estimates (with the quadratic term's p-value even smaller than the linear term's), while the third and fourth degree terms do not. These results support the conclusion that fitting a third or fourth degree polynomial is not worth it, as both would be examples of overfitting. The correct model for the data is a second degree polynomial.
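As an additional check (a sketch beyond the original write-up; fit2 is an illustrative name that duplicates the mod2 fit below), the quadratic and quartic models can be compared with a nested-model F-test:
fit2 <- lm(y ~ poly(x, 2), data = DATA)
anova(fit2, mod4)   # tests whether the third and fourth degree terms improve the fit
A large p-value for this comparison would point to the same conclusion as the t-tests above.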
mod2<-lm(y~poly(x, 2), data=DATA)
summary(mod2)
##
## Call:
## lm(formula = y ~ poly(x, 2), data = DATA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9650 -0.6254 -0.1288 0.5803 2.2700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5500 0.0958 -16.18 < 2e-16 ***
## poly(x, 2)1 6.1888 0.9580 6.46 4.18e-09 ***
## poly(x, 2)2 -23.9483 0.9580 -25.00 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.958 on 97 degrees of freedom
## Multiple R-squared: 0.873, Adjusted R-squared: 0.8704
## F-statistic: 333.3 on 2 and 97 DF, p-value: < 2.2e-16
plot(x,y)
plot(mod2)