K-fold cross-validation divides the data set into k groups (folds) of approximately equal size. In turn, each fold is treated as the validation (test) set, and the model is fit on the remaining k-1 folds. LOOCV is the special case of k-fold CV with k = n (the number of observations); in practice we usually use k = 5 or k = 10.
Advantages: The validation set approach is very simple and easy to implement, since the model only has to be fitted once.
Disadvantages: However, since only one portion of the data set is used for training and the rest for validation, the test error rate (MSE) estimate can be highly variable: it depends on exactly which observations end up in the training set and which in the validation set. Moreover, because the model is fit on only a subset of the observations, the validation MSE tends to overestimate the test error rate of the model fit on the entire data set.
Advantages: The best-known advantage of k-fold CV over LOOCV is computational: LOOCV requires fitting the model n times, which can be prohibitively expensive on large data sets. When it comes to bias reduction, LOOCV is preferred over k-fold CV: LOOCV gives approximately unbiased estimates of the test error, since each training set contains n - 1 observations, almost as many as the full data set. Performing k-fold CV with k = 5 or k = 10 leads to an intermediate level of bias, since each training set contains (k - 1)n/k observations, fewer than in the LOOCV approach but substantially more than in the validation set approach.
Disadvantages: On the other hand, when it comes to variance, LOOCV has higher variance than k-fold CV with k < n. LOOCV averages the outputs of n fitted models, each trained on an almost identical set of observations, so those outputs are highly correlated with each other, and the mean of highly correlated quantities has higher variance than the mean of less correlated ones. As a result, the test error estimate from LOOCV tends to have higher variance than the one from k-fold CV.
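To make the comparison concrete, here is a minimal k-fold sketch using cv.glm() from the boot package on the Default data analyzed below; the cost function turns fitted probabilities into a misclassification rate, and the exact estimate will vary with the seed.
library(ISLR)
library(boot)
set.seed(1)
cost = function(r, pi = 0) mean(abs(r - pi) > 0.5) # misclassification cost for a binary response
glm.cv = glm(default~income+balance, family = binomial, data = Default)
cv.glm(Default, glm.cv, cost, K = 10)$delta[1] # 10-fold CV error estimate; omitting K gives LOOCV (n model fits, slow here)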
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
library(MASS)
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.2
library(boot)
attach(Default)
summary(Default)
##  default    student       balance           income
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340
##                        Median : 823.6   Median :34553
##                        Mean   : 835.4   Mean   :33517
##                        3rd Qu.:1166.3   3rd Qu.:43808
##                        Max.   :2654.3   Max.   :73554
set.seed(1)
glm.fit = glm(default~income+balance, family="binomial", data=Default)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
sample_data = sample.split(Default, SplitRatio = 0.50) # 50/50 split; note sample.split() normally takes the outcome vector, e.g. Default$default
training.set = subset(Default, sample_data==TRUE)
test.set=subset(Default,sample_data==FALSE)
test.default = test.set$default
glm.train = glm(default~income+balance, data=training.set, family = binomial)
summary(glm.train)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = training.set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5287 -0.1436 -0.0570 -0.0204 3.3287
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.162e+01 6.066e-01 -19.164 < 2e-16 ***
## income 1.944e-05 6.931e-06 2.805 0.00504 **
## balance 5.772e-03 3.178e-04 18.161 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1569.84 on 4999 degrees of freedom
## Residual deviance: 821.43 on 4997 degrees of freedom
## AIC: 827.43
##
## Number of Fisher Scoring iterations: 8
glm.probs = predict(glm.train, test.set, type = "response")
glm.preds = rep("No",length(test.set$default))
glm.preds[glm.probs>0.5]="Yes"
table(glm.preds,test.default)
## test.default
## glm.preds No Yes
## No 4827 100
## Yes 23 50
Test error rate is: \((100+23)/5000 = 2.46\%\)
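Equivalently, the misclassification rate can be read off the predictions directly; given the table above, this should return 0.0246.
mean(glm.preds != test.default) # (100+23)/5000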
Another random split
set.seed(12)
sample_data = sample.split(Default, SplitRatio = 0.50)
training.set = subset(Default, sample_data==TRUE)
test.set=subset(Default,sample_data==FALSE)
test.default = test.set$default
Fitting the model
glm.train = glm(default~income+balance, data=training.set, family = binomial)
glm.probs = predict(glm.train, test.set, type = "response")
glm.preds = rep("No",length(test.set$default))
glm.preds[glm.probs>0.5]="Yes"
table(glm.preds,test.default)
## test.default
## glm.preds No Yes
## No 4798 122
## Yes 27 53
Test error rate is: \((122+27)/5000 = 2.98\%\). Note that the estimate changes noticeably with a different random split, illustrating the variability of the validation set approach.
Validation set approach with a 70/30 split (original seed = 1)
set.seed(1)
index = sample(1:nrow(Default), 0.7*nrow(Default))
training.set = Default[index,]
test.set=Default[-index,]
test.default = test.set$default
Fitting the model
glm.train = glm(default~income+balance, data=training.set, family = binomial)
glm.probs = predict(glm.train, test.set, type = "response")
glm.preds = rep("No",length(test.set$default))
glm.preds[glm.probs>0.5]="Yes"
table(glm.preds,test.default)
## test.default
## glm.preds No Yes
## No 2896 78
## Yes 2 24
Test error rate is: \((2+78)/3000 = 2.67\%\)
Validation set approach with an 80/20 split (original seed = 1)
set.seed(1)
index = sample(1:nrow(Default), 0.8*nrow(Default))
training.set = Default[index,]
test.set=Default[-index,]
test.default = test.set$default
Fitting the model
glm.train = glm(default~income+balance, data=training.set, family = binomial)
glm.probs = predict(glm.train, test.set, type = "response")
glm.preds = rep("No",length(test.set$default))
glm.preds[glm.probs>0.5]="Yes"
table(glm.preds,test.default)
## test.default
## glm.preds No Yes
## No 1928 50
## Yes 2 20
Test error rate is: \((2+50)/2000 = 2.6\%\)
Fitting the model (with the 80/20 split)
glm.train = glm(default~income+balance+student, data=training.set, family = binomial)
glm.probs = predict(glm.train, test.set, type = "response")
glm.preds = rep("No",length(test.set$default))
glm.preds[glm.probs>0.5]="Yes"
table(glm.preds,test.default)
## test.default
## glm.preds No Yes
## No 1928 53
## Yes 2 17
Test error rate is: \((2+53)/2000 = 2.75\%\). Including the student dummy variable does not reduce the test error rate; it is slightly higher than the 2.6% obtained without it.
set.seed(1)
glm.fit2 = glm(default~income+balance, data=Default, family=binomial)
summary(glm.fit2)$coefficients[2:3,2]
## income balance
## 4.985167e-06 2.273731e-04
boot.fn = function(data, index)
return(coef(glm(default~income+balance, data=data, subset=index, family = binomial)))
boot(Default,boot.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
summary(glm.fit2)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
Comments: The standard errors from the original glm fit are 4.347564e-01 for the intercept, 4.985167e-06 for income, and 2.273731e-04 for balance; the corresponding bootstrap standard errors, 4.344722e-01, 4.866284e-06, and 2.298949e-04, are not very different at all.
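As a sketch of where those numbers come from: the bootstrap standard errors are simply the column standard deviations of the replicate matrix stored in the boot object (here saved to a hypothetical boot.out instead of being printed directly).
boot.out = boot(Default, boot.fn, 1000) # same call as above, result saved
apply(boot.out$t, 2, sd) # SEs for the (Intercept), income and balance replicates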
detach(Default)
attach(Boston)
summary(medv)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.02 21.20 22.53 25.00 50.00
mu = mean(medv)
n = length(medv)
sd.hat = sqrt(sum((medv-mu)^2)/(n-1)) # sample standard deviation, equivalent to sd(medv)
se = sd.hat/sqrt(n)
se
## [1] 0.4088611
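This is the usual plug-in formula for the standard error of a sample mean: \(\widehat{SE}(\hat{\mu}) = \hat{\sigma}/\sqrt{n} = 9.197/\sqrt{506} \approx 0.4089\).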
Comments: The estimated standard error of the sample mean of medv is 0.4089.
set.seed(1)
boot.fn2 = function(data,index)
return(mean(data[index]))
boot(medv,boot.fn2,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn2, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
Comments: The standard error from bootstrapping with R = 1000 is 0.411, almost the same as the 0.4089 computed with the analytic formula above.
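The interval below applies the familiar two-standard-error rule: \(\hat{\mu} \pm 2\,\widehat{SE}(\hat{\mu}) = 22.533 \pm 2 \times 0.4089 \approx (21.72,\ 23.35)\).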
mu-2*se #lower bound
## [1] 21.71508
mu+2*se # upper bound
## [1] 23.35053
T-Test
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
Comments: The 95 percent confidence interval from the t-test and the interval we calculated above are almost the same.
median.medv = median(Boston$medv)
median.medv
## [1] 21.2
set.seed(1)
boot.fn3 = function(data,index)
return(median(data[index]))
boot(medv,boot.fn3,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn3, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
Comments: The bootstrap estimate of the median is 21.2, the same as the directly computed value; its standard error, 0.378, is small relative to the median itself.
tenth.perc = quantile(medv, 0.1)
tenth.perc
## 10%
## 12.75
set.seed(1)
boot.fn4 = function(data,index){
X=data[index]
Y=quantile(X,0.1)
return(Y)
}
boot(medv,boot.fn4,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn4, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526
Comments: The directly computed 10th percentile is 12.75 and the bootstrap estimate is exactly the same; the bootstrap standard error is 0.477, again small relative to the estimate.
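For an interval estimate of the percentile itself, the boot package can also compute a percentile bootstrap confidence interval directly; a minimal sketch, assuming the boot() result above is saved to an object (here called out).
out = boot(medv, boot.fn4, 1000) # same call as above, result saved
boot.ci(out, type = "perc") # percentile bootstrap CI for the 10th percentile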