K-fold validation is the approach taken to validate a model by separating the available data into k sets, where one of the folds is selected as the test set and the remaining are used for training. In doing so, we obtain the error of using this particular fold for testing. The process is then repeated once for each fold, and all the errors obtained are averaged.
The main disadvantage of using an approach such as k-fold is the computational cost involved. This may provide too time consuming or costly for some models. The validation set approach has a clear benefit in this case, since the model only needs be learnt from the data once, and tested once. However, a validation set may not be available (due to only a small number of observations being available) or may be too costly to obtain in practice. In these cases, the k-fold cross validation is a clear winner, for it allows us to fine tune our model. Furthermore, a validation set may tend to over estimate the error since less data is used for training.
The Leave-One-Out Cross Validation (LOOCV) approach has a worse (or the same, in the case k=n) computational cost to K-Fold Cross Validation since the model needs to be trained and tested n times, instead of k. Furthermore, LOOCV suffers from a higher variance in result, since they are typically highly correlated (most models trained are very similar). There are no significant advance to using LOOCV since 5-fold or 10-fold will significantly reduce the computational cost and will neither suffer from high bias or variance.
glm.fit=glm(default~income+balance,family='binomial',data=Default)
set.seed(3)
subset=sample(1:1000,500) # i. done by the subset parameter.
glm.fit=glm(default~income+balance,family='binomial', # ii. fitting the model.
data=Default,subset=subset)
glm.resp=predict(glm.fit,Default[-subset,],type='response') # iii. computing the posterior.
glm.pred=ifelse(glm.resp>0.5,'Yes','No')
mean(glm.pred!=Default[-subset,'default'])
## [1] 0.02631579
DefaultValidationFn = function(formula=default~income+balance,n=1000,s=500,seed){
set.seed(seed)
subset=sample(n,s)
glm.fit=glm(formula,family='binomial',
data=Default,subset=subset)
glm.resp=predict(glm.fit,Default[-subset,],type='response')
mean(glm.pred!=Default[-subset,'default'])
}
for(i in 1:3) print(DefaultValidationFn(seed=i))
## [1] 0.02715789
## [1] 0.02621053
## [1] 0.02631579
for(i in 1:3) print(DefaultValidationFn(formula=default~income+balance+student,seed=i))
## [1] 0.02715789
## [1] 0.02621053
## [1] 0.02631579
A reduction is not observed when including the variable student.
summary(glm(default~income+balance,data=Default,family='binomial'))
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
boot.fn = function(data,index){
coef(glm(default~income+balance,data=data,subset=index,family='binomial'))
}
set.seed(1)
boot.fn(Default,sample(1000,500,replace = T))
## (Intercept) income balance
## -1.029272e+01 1.558656e-05 5.236697e-03
boot(Default,boot.fn,R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.933672e-02 4.343265e-01
## t2* 2.080898e-05 1.671659e-07 4.867306e-06
## t3* 5.647103e-03 1.850346e-05 2.298227e-04
mu=mean(medv) # estimate of population mean
print(mu)
## [1] 22.53281
sterr = function(x,index){ sd(x[index])/sqrt(length(x[index])) } # estimate of standard error.
print(sterr(medv))
## [1] 0.4088611
set.seed(456)
boot(medv,function(x,index){mean(x[index])},R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = function(x, index) {
## mean(x[index])
## }, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.01222925 0.3939195
The results are different but approximate; 0.404878 for the boot estimate and 0.4088611 for the sample estimate.
mu-2*0.404878
## [1] 21.72305
mu+2*0.404878
## [1] 23.34256
print(median(medv))
## [1] 21.2
boot(medv,function(data,index){median(data[index])},1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = function(data, index) {
## median(data[index])
## }, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.02755 0.375179
There is a standard error or 0.3756641 for the estimating the median value.
quantile(medv,.1)
## 10%
## 12.75
boot(medv,function(data,index){quantile(medv[index],.1)},R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = function(data, index) {
## quantile(medv[index], 0.1)
## }, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.01485 0.50522
There is a standard error of 0.5166154. One can interpret this value to mean the amount that the quarterly deviates from the estimate in each sample.