In this method, the set of observations are randomly divided into k equally sized groups, or folds. The first fold is the validation set and the remaining k – 1 folds are the training set. The model is fit on the remaining folds and the MSE is computed on the validation set. After being repeated k times there are k estimates of the MSE. Averaging the K different MSEs results in the k-fold CV estimate.
Although it is simpler and easier to implement, the variability in test error estimates from the validation set approach tends to be much higher than the variability in k-fold CV estimates. The validation set method can also lead to overestimates of the test error rate.
The LOOCV approach is potentially computationally expensive, whereas the k-fold CV is typically performed with k = 5 or k = 10, making it more feasible. The k-fold approach also gives more accurate estimates of test error rates. However if our main goal was to decrease bias, then LOOCV would be preferred over k-fold.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
set.seed(1)
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
glm.fit=glm(default~income+balance, data=Default, family=binomial)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
train=sample(dim(Default)[1], dim(Default)[1]/2)
glm.fit2=glm(default~income+balance, data=Default, family="binomial", subset=train)
summary(glm.fit2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
glm.prob=predict(glm.fit2, Default[-train, ], type="response")
glm.pred=rep("No", dim(Default)[1])
glm.pred[glm.prob > 0.5]="Yes"
mean(glm.pred != Default[-train, "default"])
## [1] 0.0254
train=sample(dim(Default)[1], dim(Default)[1]/2)
glm.fit3=glm(default~income+balance, data=Default, family="binomial", subset=train)
glm.prob=predict(glm.fit3, Default[-train, ], type="response")
glm.pred=rep("No", dim(Default)[1])
glm.pred[glm.prob > 0.5]="Yes"
mean(glm.pred != Default[-train, "default"])
## [1] 0.0274
train=sample(dim(Default)[1], dim(Default)[1]/2)
glm.fit4=glm(default~income+balance, data=Default, family="binomial", subset=train)
glm.prob=predict(glm.fit4, Default[-train, ], type="response")
glm.pred=rep("No", dim(Default)[1])
glm.pred[glm.prob > 0.5]="Yes"
mean(glm.pred != Default[-train, "default"])
## [1] 0.0244
train=sample(dim(Default)[1], dim(Default)[1]/2)
glm.fit5=glm(default~income+balance, data=Default, family="binomial", subset=train)
glm.prob=predict(glm.fit5, Default[-train, ], type="response")
glm.pred=rep("No", dim(Default)[1])
glm.pred[glm.prob > 0.5]="Yes"
mean(glm.pred != Default[-train, "default"])
## [1] 0.0244
The results of the validation set errors are variable, but fairly close in range.
train=sample(dim(Default)[1], dim(Default)[1]/2)
glm.fit6=glm(default~income+balance+student, data=Default, family="binomial", subset=train)
glm.prob=predict(glm.fit6, Default[-train, ], type="response")
glm.pred=rep("No", dim(Default)[1])
glm.pred[glm.prob > 0.5]="Yes"
mean(glm.pred != Default[-train, "default"])
## [1] 0.0278
Adding the dummy variable does not seem to affect the test error rate, as it is still within the same range as the other rates.
set.seed(1)
glm.fit = glm(default~income+balance, data=Default, family="binomial")
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
boot.fn=function (data ,index)
return(coef(glm(default~income+balance, data=data, subset=index, family="binomial")))
library(boot)
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
The standard error estimates from the glm() and bootstrap functions are quite close to each other.
library(MASS)
attach(Boston)
mu.hat=mean(medv)
mu.hat
## [1] 22.53281
sd(medv)/sqrt(dim(Boston)[1])
## [1] 0.4088611
set.seed(1)
boot.fn=function(data, index)
return(mean(data[index]))
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
The standard errors computed in (b) and (c) are very close in range.
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
[22.53281 - 2(0.4106622), 22.53281 + 2(0.4106622)] = [21.71149, 23.35413]
The confidence interval results are very similar.
median(medv)
## [1] 21.2
boot.fn=function(data, index)
return(median(data[index]))
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0386 0.3770241
The median estimate is the same and the standard error is relatively small.
mu.percentile=quantile(medv, c(0.1))
mu.percentile
## 10%
## 12.75
boot.fn=function(data, index)
return(quantile(data[index], c(0.1)))
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0186 0.4925766
The percentile estimate is the same and the standard error is relatively small.