library(ISLR2)
library(boot)
k-fold cross-validation works as follows:
\[CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i\]
Advantages of k-fold CV:
Disadvantages of k-fold CV:
Advantages of k-fold CV:
Disadvantages of k-fold CV:
Conclusion: k = 5 or k = 10 is generally preferred due to the bias-variance tradeoff.
set.seed(1)
glm.fit.full <- glm(default ~ income + balance,
data = Default,
family = binomial)
summary(glm.fit.full)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(1)
n <- nrow(Default)
# i. Split into training (50%) and validation (50%) sets
train <- sample(n, n / 2)
# ii. Fit logistic regression on training data only
glm.fit <- glm(default ~ income + balance,
data = Default,
family = binomial,
subset = train)
# iii. Predict on validation set
glm.probs <- predict(glm.fit, Default[-train, ], type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")
# iv. Compute validation set error
val_error_b <- mean(glm.pred != Default[-train, "default"])
cat("Validation set error (seed 1):", round(val_error_b * 100, 2), "%\n")
## Validation set error (seed 1): 2.54 %
compute_val_error <- function(seed) {
set.seed(seed)
train <- sample(n, n / 2)
fit <- glm(default ~ income + balance,
data = Default, family = binomial, subset = train)
probs <- predict(fit, Default[-train, ], type = "response")
pred <- ifelse(probs > 0.5, "Yes", "No")
mean(pred != Default[-train, "default"])
}
errors <- sapply(c(2, 3, 4), compute_val_error)
for (i in seq_along(errors)) {
cat(sprintf("Validation error (seed %d): %.2f%%\n", i + 1, errors[i] * 100))
}
## Validation error (seed 2): 2.38%
## Validation error (seed 3): 2.64%
## Validation error (seed 4): 2.56%
Comment: The validation error varies across seeds (roughly 2.4%–2.6%), illustrating the variability inherent in the validation set approach — different splits yield different error estimates.
set.seed(1)
train <- sample(n, n / 2)
glm.fit.student <- glm(default ~ income + balance + student,
data = Default,
family = binomial,
subset = train)
probs.student <- predict(glm.fit.student, Default[-train, ], type = "response")
pred.student <- ifelse(probs.student > 0.5, "Yes", "No")
val_error_d <- mean(pred.student != Default[-train, "default"])
cat("Validation error with student variable:", round(val_error_d * 100, 2), "%\n")
## Validation error with student variable: 2.6 %
Comment: Including student does
not meaningfully reduce the test error. The dummy
variable for student status does not appear to improve predictive
accuracy in this logistic regression model.
set.seed(1)
glm.fit.default <- glm(default ~ income + balance,
data = Default,
family = binomial)
summary(glm.fit.default)$coefficients[c("income", "balance"), ]
## Estimate Std. Error z value Pr(>|z|)
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
boot.fn <- function(data, index) {
fit <- glm(default ~ income + balance,
data = data,
family = binomial,
subset = index)
return(coef(fit)[c("income", "balance")])
}
# Quick test on full data
boot.fn(Default, 1:nrow(Default))
## income balance
## 2.080898e-05 5.647103e-03
set.seed(1)
boot.out <- boot(Default, boot.fn, R = 1000)
boot.out
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04
# Compare glm SE vs bootstrap SE
glm_se <- summary(glm.fit.default)$coefficients[c("income", "balance"), "Std. Error"]
boot_se <- c(sd(boot.out$t[, 1]), sd(boot.out$t[, 2]))
comparison <- data.frame(
Coefficient = c("income", "balance"),
GLM_SE = glm_se,
Bootstrap_SE = boot_se
)
print(comparison)
## Coefficient GLM_SE Bootstrap_SE
## income income 4.985167e-06 4.866284e-06
## balance balance 2.273731e-04 2.298949e-04
Comment: The bootstrap standard errors are very
close to the formula-based glm() standard errors. This
agreement is reassuring — it suggests the standard logistic regression
assumptions (e.g., about the shape of the likelihood) hold reasonably
well here. The bootstrap makes no distributional assumptions, so this
similarity validates both approaches.
mu_hat <- mean(Boston$medv)
cat("Estimated population mean of medv:", mu_hat, "\n")
## Estimated population mean of medv: 22.53281
n_boston <- nrow(Boston)
se_hat <- sd(Boston$medv) / sqrt(n_boston)
cat("Standard error of mu_hat:", round(se_hat, 4), "\n")
## Standard error of mu_hat: 0.4089
Interpretation: The sample mean of medv
is estimated to be $22,532, with a standard error of
about $409. This means repeated samples would typically
produce means within roughly $409 of the true population mean.
set.seed(1)
boot.mean <- function(data, index) mean(data[index])
boot.mu <- boot(Boston$medv, boot.mean, R = 1000)
boot.mu
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.mean, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
boot_se_mu <- sd(boot.mu$t)
cat("Bootstrap SE of mu_hat:", round(boot_se_mu, 4), "\n")
## Bootstrap SE of mu_hat: 0.4107
cat("Formula-based SE: ", round(se_hat, 4), "\n")
## Formula-based SE: 0.4089
Comment: The bootstrap SE (≈ 0.412) is very close to the formula-based SE (≈ 0.409). Both methods agree well.
# Bootstrap-based CI
ci_bootstrap <- c(mu_hat - 2 * boot_se_mu, mu_hat + 2 * boot_se_mu)
cat("Bootstrap 95% CI: (", round(ci_bootstrap[1], 3), ",", round(ci_bootstrap[2], 3), ")\n")
## Bootstrap 95% CI: ( 21.711 , 23.354 )
# t-test CI
t_result <- t.test(Boston$medv)
cat("t-test 95% CI: (", round(t_result$conf.int[1], 3), ",",
round(t_result$conf.int[2], 3), ")\n")
## t-test 95% CI: ( 21.73 , 23.336 )
Comment: The bootstrap-based CI is nearly identical to the t-test CI. Both methods agree closely, further validating the bootstrap approach.
mu_med <- median(Boston$medv)
cat("Estimated population median of medv:", mu_med, "\n")
## Estimated population median of medv: 21.2
set.seed(1)
boot.median <- function(data, index) median(data[index])
boot.med.out <- boot(Boston$medv, boot.median, R = 1000)
boot.med.out
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.median, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
boot_se_med <- sd(boot.med.out$t)
cat("Bootstrap SE of median:", round(boot_se_med, 4), "\n")
## Bootstrap SE of median: 0.3778
Comment: The median is estimated at $21,200 with a bootstrap SE of about $380. There is no simple closed-form formula for the SE of the median, making the bootstrap especially valuable here. The SE is slightly smaller than that of the mean, suggesting the median is a relatively stable estimate.
mu_01 <- quantile(Boston$medv, 0.1)
cat("Estimated 10th percentile of medv:", mu_01, "\n")
## Estimated 10th percentile of medv: 12.75
set.seed(1)
boot.q01 <- function(data, index) quantile(data[index], 0.1)
boot.q01.out <- boot(Boston$medv, boot.q01, R = 1000)
boot.q01.out
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.q01, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526
boot_se_q01 <- sd(boot.q01.out$t)
cat("Bootstrap SE of 10th percentile:", round(boot_se_q01, 4), "\n")
## Bootstrap SE of 10th percentile: 0.4768
Comment: The 10th percentile is estimated at $12,750 with a bootstrap SE of about $477. The SE is larger relative to the estimate compared to the mean and median, which is expected since tail quantiles are estimated with greater uncertainty. Again, the bootstrap is the natural tool here since no closed-form formula exists for the SE of a quantile.