With k-fold cross-validation the data is separated into k folds and a model is trained on k-1 of those folds, then tested on the fold that was held out of training. This is repeated k times, once with each fold serving as the test set, and the k error estimates are averaged.
K-fold cross-validation has lower variability than the validation set approach, especially for smaller datasets. Also, because the validation set approach fits the model on only part of the data, it tends to overestimate the test error.
LOOCV is less biased than k-fold cross-validation, but it has higher variance. K-fold is also much better suited to larger datasets because it is far less computationally demanding.
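For reference, k-fold cross-validation can be run directly with cv.glm() from the boot package. A minimal sketch for the logistic model used below, with the cost function and K = 10 taken from the cv.glm() documentation rather than from this write-up:
library(ISLR)
library(boot)
# misclassification rate at a 0.5 probability cutoff
cost = function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.fit = glm(default ~ income + balance, family = "binomial", data = Default)
# raw 10-fold cross-validation estimate of the test error rate
cv.glm(Default, cv.fit, cost, K = 10)$delta[1]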
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
attach(Default)
glm.fit=glm(default~income+balance, family="binomial", data=Default)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(3)
train = sample(1:1000,500)
length(train)
## [1] 500
glm.fit = glm(default ~ income+balance, family = "binomial", subset = train)
# note: predict.glm() takes newdata=, not data=, so the data= argument here is silently ignored and these are the 500 fitted values on the training set
glm.pred = factor(ifelse(predict(glm.fit, data = Default[-train,], type = "response") > 0.5, "Yes", "No"))
summary(glm.pred)
## No Yes
## 496 4
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
mean(Default[-train,'default'] != glm.pred)
## [1] 0.04084211
The validation set error estimate is about 0.0408.
set.seed(3)
train = sample(1:2000,1000)
glm.fit = glm(default ~ income+balance, family = "binomial", subset = train)
glm.pred = factor(ifelse(predict(glm.fit, data = Default[-train,], type = "response") > 0.5, "Yes", "No"))
summary(glm.pred)
## No Yes
## 985 15
mean(Default[-train,'default'] != glm.pred)
## [1] 0.04522222
set.seed(3)
train = sample(1:4000,2000)
glm.fit = glm(default ~ income+balance, family = "binomial", subset = train)
glm.pred = factor(ifelse(predict(glm.fit, data = Default[-train,], type = "response") > 0.5, "Yes", "No"))
summary(glm.pred)
## No Yes
## 1962 38
mean(Default[-train,'default'] != glm.pred)
## [1] 0.05125
Across all three splits the validation set error stayed roughly the same, ranging from about 0.041 to 0.051.
ValidationLoop = function(totalSize, subtotal, seed) {
  set.seed(seed)
  train = sample(1:totalSize, subtotal)
  glm.fit = glm(default ~ income+balance+student, family = "binomial", subset = train)
  glm.pred = factor(ifelse(predict(glm.fit, data = Default[-train,], type = "response") > 0.5, "Yes", "No"))
  mean(Default[-train,'default'] != glm.pred)
}
for(i in list(1,2,5)) print(ValidationLoop(i*2000, i*1000, i))
## [1] 0.05333333
## [1] 0.05
## [1] 0.0486
Adding the dummy variable student does not appear to have reduced the test error much.
library(boot)
summary(glm(default~income+balance,data=Default,family="binomial"))
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
boot.fn = function(dataSet, obvIndex){
  coef(glm(default~income+balance, data=dataSet, subset=obvIndex, family = "binomial"))
}
set.seed(1)
boot.fn(Default,sample(2000,1000,replace=T))
## (Intercept) income balance
## -1.370125e+01 3.073263e-05 6.996657e-03
boot(Default, boot.fn, R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.920364e-02 0.4344275948
## t2* 2.080898e-05 1.646916e-07 0.0000048674
## t3* 5.647103e-03 1.845423e-05 0.0002298672
The standard errors estimated by the glm() function are almost exactly the same as those produced by my bootstrap function.
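To line the two up directly, the glm() standard errors can be pulled out of the coefficient table and compared against the std. error column of the boot() output above; a quick sketch:
# standard errors from the asymptotic formula used by glm()
summary(glm(default ~ income + balance, data = Default, family = "binomial"))$coefficients[, "Std. Error"]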
library(MASS)
attach(Boston)
mu=mean(medv)
print(mu)
## [1] 22.53281
sd(Boston$medv)/sqrt(length(Boston$medv))
## [1] 0.4088611
set.seed(42)
boot.fn = function(x,index){
  mean(x[index])
}
boot(medv,statistic = boot.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.02671186 0.4009216
Although they are not exactly the same, the two estimates are extremely close: 0.4089 from the formula and 0.4009 from the bootstrap.
print(mu-2*0.4088611)
## [1] 21.71508
print(mu-2*0.4009216)
## [1] 21.73096
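The same arithmetic gives the upper bound as well; a quick sketch of the full mu ± 2*SE interval using the bootstrap standard error:
# approximate 95% confidence interval, mu plus or minus two bootstrap standard errors
c(mu - 2*0.4009216, mu + 2*0.4009216)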
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The estimated confidence intervals are quite close.
median(Boston$medv)
## [1] 21.2
set.seed(42)
boot2.fn = function(x,index){
  median(x[index])
}
boot(medv,statistic = boot2.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot2.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.0106 0.3661785
Just as with the mean in part c, the standard error of the median is small compared to the estimate itself.
quantile(Boston$medv, 0.1)
## 10%
## 12.75
set.seed(42)
boot3.fn = function(x,index){
  quantile(x[index], 0.1)
}
boot(medv,statistic = boot3.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot3.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0215 0.4948966
The standard error of the tenth percentile is larger than that of the median, but still small relative to the estimate.
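A confidence interval for the tenth percentile can also be read off the bootstrap replicates with boot.ci(); a minimal sketch, where boot10 is just my name for the stored boot object:
# store the bootstrap output and request a 95% percentile interval
boot10 = boot(medv, statistic = boot3.fn, R = 1000)
boot.ci(boot10, type = "perc")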