K-fold cross-validation breaks the dataset into k roughly equal parts (folds) and validates on each fold in turn, rather than on each individual observation. We average the k resulting MSEs (as opposed to the n MSEs of LOOCV) to get an estimated test error rate. A runnable sketch follows the lists below.
Advantages over the validation set approach
K-fold CV is more stable than the validation set approach.
K-fold CV generally has much lower variability in its error estimates than the validation set approach.
Disadvantages relative to the validation set approach
K-fold CV is computationally more intensive than the validation set approach.
Advantages over LOOCV
K-fold CV is less computationally intensive than LOOCV, while still having similar stability.
K-fold CV has less variance than LOOCV.
Disadvantages relative to LOOCV
When k < n, k-fold CV has more bias than LOOCV.
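As a concrete illustration, here is a minimal sketch of 10-fold CV for the logistic model fit below, assuming the boot package and the ISLR Default data are loaded; the cost function and the choice K = 10 are mine, not part of the original analysis.
library(boot)
cv.fit <- glm(default ~ income + balance, data = Default, family = binomial)
# cv.glm() passes the 0/1 response and fitted probabilities to cost(), which
# converts them into a misclassification rate at a 0.5 cutoff
cost <- function(r, pi) mean(abs(r - pi) > 0.5)
set.seed(1)
cv.err <- cv.glm(Default, cv.fit, cost = cost, K = 10)
cv.err$delta[1]  # estimated 10-fold test error rate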
logist.Default <- glm(default ~ income + balance, data = Default, family = binomial)
summary(logist.Default)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(1)
set.training <- sample(nrow(Default), nrow(Default)/2)
training.default <- Default[set.training, ]
#head(training.default)
test.default <- Default[-set.training, ]
va.Default <- glm(default ~ income + balance, data = training.default, family = binomial)
summary(va.Default)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = training.default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
prob.Default.glm <- predict(va.Default, newdata = test.default, type = 'response')
pred.Default.glm <- rep('No', length(prob.Default.glm))
pred.Default.glm[prob.Default.glm > 0.5] <- 'Yes'
table(pred.Default.glm, test.default$default, dnn = c('Predicted Default', "Actual Default"))
## Actual Default
## Predicted Default No Yes
## No 4824 108
## Yes 19 49
The majority of predictions are "No" for default status, and they are relatively accurate.
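To make "relatively accurate" concrete, the table above can be summarized numerically; a short optional sketch reusing the objects already defined (conf.mat is a name introduced here):
conf.mat <- table(pred.Default.glm, test.default$default)
sum(diag(conf.mat)) / sum(conf.mat)              # overall accuracy on the test set
conf.mat["Yes", "Yes"] / sum(conf.mat[, "Yes"])  # sensitivity: share of actual defaulters flagged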
mean(pred.Default.glm != test.default$default)
## [1] 0.0254
The validation error is 0.0254, which is very small. This means the model is fairly accurate at predicting default status.
logi.error <- rep(0, 3)
for (i in 1:3) {
  set.seed(i + 1)
  set.training2 <- sample(nrow(Default), nrow(Default)/2)
  training.default2 <- Default[set.training2, ]
  test.default2 <- Default[-set.training2, ]
  va.Default2 <- glm(default ~ income + balance, data = training.default2, family = binomial)
  prob.Default.glm2 <- predict(va.Default2, newdata = test.default2, type = 'response')
  pred.Default.glm2 <- rep('No', length(prob.Default.glm2))
  pred.Default.glm2[prob.Default.glm2 > 0.5] <- 'Yes'
  logi.error[i] <- mean(pred.Default.glm2 != test.default2$default)
}
print(logi.error)
## [1] 0.0238 0.0264 0.0256
All three validation error rates produced by the different splits differ slightly, but each is very close to the original error rate of 0.0254.
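To quantify that variability, one could summarize the three error rates directly (a small optional check, not part of the original analysis):
mean(logi.error)  # average validation error rate across the three splits
sd(logi.error)    # spread across splits, reflecting the variability of the validation approach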
Default$sdummy <- ifelse(Default$student == 'No', 0, 1)
dummy.error <- rep(0, 4)
for (i in 1:4) {
  set.seed(i)
  set.training3 <- sample(nrow(Default), nrow(Default)/2)
  training.default3 <- Default[set.training3, ]
  test.default3 <- Default[-set.training3, ]
  va.Default3 <- glm(default ~ income + balance + sdummy, data = training.default3, family = binomial)
  prob.Default.glm3 <- predict(va.Default3, newdata = test.default3, type = 'response')
  pred.Default.glm3 <- rep('No', length(prob.Default.glm3))
  pred.Default.glm3[prob.Default.glm3 > 0.5] <- 'Yes'
  dummy.error[i] <- mean(pred.Default.glm3 != test.default3$default)
}
print(dummy.error)
## [1] 0.0260 0.0246 0.0272 0.0262
The error rates do not change noticeably with the inclusion of the dummy variable for student. If anything, they are negligibly higher than those of the logistic model without the dummy variable.
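A rough numeric comparison of the two models' average error rates, reusing the vectors above (note the two sets of splits use different seeds, so this is only indicative):
mean(logi.error)   # average error rate without the student dummy
mean(dummy.error)  # average error rate with the student dummy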
logist.Default.6 <- glm(default ~ income + balance, data = Default, family = binomial)
summary(logist.Default.6)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
boot.fn <- function(data, index) {
  return(coef(glm(default ~ income + balance, data = data, subset = index, family = 'binomial')))
}
set.seed(1)
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
The standard errors from the glm() summary and from the bootstrap are almost indistinguishable.
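To line the two sets of standard errors up directly, one could store the bootstrap output and tabulate both; a sketch under the same seed (glm.se, boot.out, and boot.se are names introduced here):
glm.se <- summary(logist.Default.6)$coefficients[, "Std. Error"]  # formula-based SEs from glm()
set.seed(1)
boot.out <- boot(Default, boot.fn, 1000)
boot.se <- apply(boot.out$t, 2, sd)  # bootstrap SEs: SD of each coefficient across the 1000 replicates
cbind(glm.se, boot.se)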
detach(Default)
attach(Boston)
mu.hat <- mean(medv)
mu.hat
## [1] 22.53281
mu.hat.se <- sd(medv)/sqrt(nrow(Boston))
mu.hat.se
## [1] 0.4088611
The estimated standard error of the sample mean is 0.4088611.
boot.fn2 <- function(data, index) {
  return(mean(data[index]))
}
set.seed(1)
btstrp <- boot(medv, boot.fn2, 1000)
Estimated Standard Error: 0.4088611
Bootstrap Estimated SE: 0.4106622
The two standard errors are similar; the bootstrap estimate is only slightly higher than the formula-based estimate.
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
95% Confidence Interval: [21.72953, 23.33608]
Based on the t-test, we can be 95% confident that the true mean of medv is between 21.72953 and 23.33608.
ci.mu <- c(btstrp$t0 - 2 * 0.4106622, btstrp$t0 + 2 * 0.4106622)
ci.mu
## [1] 21.71148 23.35413
95% Confidence Interval: [21.71148, 23.35413]
Based on the bootstrap confidence interval, we can be 95% confident that the true mean of medv is between 21.71148 and 23.35413.
The two confidence intervals are very similar, suggesting the bootstrap captures the sampling variability of the mean well.
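For comparison, the boot package can also construct an interval directly from the bootstrap replicates; a minimal sketch applying the percentile method to the btstrp object above:
boot.ci(btstrp, type = "perc")  # percentile bootstrap CI for the mean of medv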
median(medv)
## [1] 21.2
The median estimate for medv is 21.2.
boot.fn3 <- function(data, index) {
  return(median(data[index]))
}
set.seed(1)
btstrp2 <- boot(medv, boot.fn3, 1000)
btstrp2
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn3, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
Median Estimate: 21.2
Bootstrap Median Estimate: 21.2
The two estimates are identical, as expected: the bootstrap reports the statistic computed on the full sample as its "original" value. The new information here is the bootstrap standard error of the median, roughly 0.378.
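Mirroring the interval built for the mean earlier, a rough 95% interval for the median can be formed from the bootstrap standard error (a sketch, not part of the original analysis):
ci.med <- c(btstrp2$t0 - 2 * 0.3778075, btstrp2$t0 + 2 * 0.3778075)
ci.med  # approximate 95% CI for the median of medv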
perc.10 <- quantile(medv, .10)
perc.10
## 10%
## 12.75
The estimate for the 10th percentile is 12.75.
boot.fn4 <- function(data, index) {
  return(quantile(data[index], 0.1))
}
boot(medv, boot.fn4, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn4, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.01455 0.4823468
The bootstrap estimate of the 10th percentile is 12.75, identical to the original estimate, again because the bootstrap reports the statistic computed on the full sample. Its bootstrap standard error is about 0.482, somewhat larger than that of the median.
detach(Boston)