K-fold cross-validation randomly splits the data into k equal-sized folds. Each fold in turn serves as the validation set while the remaining k-1 folds form the training set; the error on the held-out fold is computed each time and the k estimates are averaged to estimate the test error.
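As a concrete sketch of k-fold CV on the Default data used throughout this section, boot::cv.glm() with K = 10 and a misclassification cost function estimates the test error rate directly (the cost function follows the one in the cv.glm help page):

library(ISLR)  # Default data
library(boot)  # cv.glm()
set.seed(1)
fit <- glm(default ~ income + balance, data = Default, family = "binomial")
# Misclassification rate at a 0.5 threshold: r is the observed 0/1 response,
# pi the cross-validated predicted probability
cost <- function(r, pi) mean(abs(r - pi) > 0.5)
cv.glm(Default, fit, cost = cost, K = 10)$delta[1]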
library(ISLR)  # Default data
library(MASS)  # Boston data
library(boot)  # boot()
set.seed(1)
glm.fit <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Validation-set approach: one random 50/50 split of the Default data
val_error <- function(seed) {
set.seed(seed)
train_idx <- sample(1:nrow(Default), nrow(Default)/2)
train <- Default[train_idx, ]
test <- Default[-train_idx, ]
glm.fit <- glm(default ~ income + balance, data = train, family = "binomial")
probs <- predict(glm.fit, test, type = "response")
preds <- ifelse(probs > 0.5, "Yes", "No")
mean(preds != test$default)
}
val_error(1)
## [1] 0.0254
val_error(2)
## [1] 0.0238
val_error(3)
## [1] 0.0264
The validation error varies slightly from seed to seed, showing that the validation-set estimate is sensitive to the particular train/test split; the sketch below quantifies that spread.
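To quantify the variability, the split can be repeated over many seeds (a quick sketch; the choice of 20 seeds is arbitrary):

errs <- sapply(1:20, val_error)      # validation error for 20 random splits
c(mean = mean(errs), sd = sd(errs))  # average error and its spread across splits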
# Same validation-set split, now adding the student dummy to the model
val_error_student <- function(seed) {
set.seed(seed)
train_idx <- sample(1:nrow(Default), nrow(Default)/2)
train <- Default[train_idx, ]
test <- Default[-train_idx, ]
glm.fit <- glm(default ~ income + balance + student, data = train, family = "binomial")
probs <- predict(glm.fit, test, type = "response")
preds <- ifelse(probs > 0.5, "Yes", "No")
mean(preds != test$default)
}
val_error_student(1)
## [1] 0.026
Including student does not reduce the validation error (0.026, versus 0.0254 without it), suggesting it adds little predictive power beyond income and balance; a paired comparison across several splits is sketched below.
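Because both functions seed the split identically, the two models can be compared on the same train/test partitions (a sketch; seeds 1:5 are arbitrary):

seeds <- 1:5
rbind(without_student = sapply(seeds, val_error),
      with_student    = sapply(seeds, val_error_student))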
glm.fit <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(glm.fit)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
# Bootstrap statistic: income and balance coefficients refit on the resampled rows
boot.fn <- function(data, index) {
fit <- glm(default ~ income + balance, data = data[index, ], family = "binomial")
return(coef(fit)[2:3])
}
set.seed(1)
boot(Default, boot.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04
The bootstrap standard errors (4.87e-06 for income, 2.30e-04 for balance) closely match the formula-based SEs from summary(), supporting the reliability of the standard formulas for this model.
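To see the comparison directly, the bootstrap replicates can be saved and their column standard deviations set beside the formula-based SEs (a sketch; boot_out and se_boot are names introduced here):

set.seed(1)
boot_out <- boot(Default, boot.fn, R = 1000)  # same call as above, stored
se_boot <- apply(boot_out$t, 2, sd)           # bootstrap SEs for income, balance
se_formula <- summary(glm.fit)$coefficients[2:3, "Std. Error"]
cbind(formula = se_formula, bootstrap = se_boot)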
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281
se_mu_hat <- sd(Boston$medv) / sqrt(nrow(Boston))
se_mu_hat
## [1] 0.4088611
boot.fn.mean <- function(data, index) {
mean(data[index])
}
set.seed(1)
boot_mean <- boot(Boston$medv, boot.fn.mean, R = 1000)
boot_mean
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.mean, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
ci_boot <- c(boot_mean$t0 - 2*sd(boot_mean$t), boot_mean$t0 + 2*sd(boot_mean$t))
ci_boot
## [1] 21.71148 23.35413
t.test(Boston$medv)$conf.int
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95
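The bootstrap and t-based intervals agree closely. As an alternative to the ±2 SE rule, boot.ci() reads a percentile interval straight off the replicates (a sketch using the boot_mean object above):

boot.ci(boot_mean, type = "perc")  # 95% bootstrap percentile interval for the mean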
mu_med <- median(Boston$medv)
mu_med
## [1] 21.2
boot.fn.med <- function(data, index) {
median(data[index])
}
set.seed(1)
boot_med <- boot(Boston$medv, boot.fn.med, R = 1000)
boot_med
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.med, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
There is no simple closed-form formula for the standard error of the median, so the bootstrap is especially useful here; the estimated SE (0.378) is slightly smaller than that of the mean (0.411).
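An approximate 95% interval for the median can likewise be read off the empirical distribution of the replicates (a sketch using the boot_med object above):

quantile(boot_med$t, c(0.025, 0.975))  # bootstrap percentile interval for the median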
mu_10 <- quantile(Boston$medv, 0.1)
mu_10
## 10%
## 12.75
boot.fn.q10 <- function(data, index) {
quantile(data[index], 0.1)
}
set.seed(1)
boot_q10 <- boot(Boston$medv, boot.fn.q10, R = 1000)
boot_q10
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.q10, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526
The bootstrap makes it straightforward to estimate the variability of statistics such as the median and the 10th percentile, for which no closed-form SE formula exists.
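As a closing sketch, the same ±2 SE rule used for the mean gives an approximate 95% interval for the 10th percentile (using the boot_q10 object above; est is a name introduced here):

est <- unname(boot_q10$t0)  # drop the "10%" name attached by quantile()
c(lower = est - 2 * sd(boot_q10$t), upper = est + 2 * sd(boot_q10$t))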