Problem 3

(a) Explain how k-fold cross-validation is implemented

K-fold cross-validation randomly divides the observations into k roughly equal-sized folds. Each fold in turn serves as the validation set while the model is fit on the remaining k-1 folds. The error on each held-out fold is recorded, and the k estimates are averaged to produce the cross-validation estimate of the test error.
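
As a concrete sketch of this procedure by hand (assuming a hypothetical data frame dat with a numeric response y, and squared-error loss):

k <- 5
set.seed(1)
folds <- sample(rep(1:k, length.out = nrow(dat)))   # random fold assignment
cv_errors <- sapply(1:k, function(j) {
  fit <- lm(y ~ ., data = dat[folds != j, ])        # train on the other k-1 folds
  preds <- predict(fit, newdata = dat[folds == j, ])
  mean((dat$y[folds == j] - preds)^2)               # MSE on held-out fold j
})
mean(cv_errors)                                     # k-fold CV estimate of test error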

(b)

(i) Compared to Validation Set Approach

  • Advantage: Lower variance, since the error estimate is averaged over k splits rather than depending on a single random split, and each model is trained on (k-1)/k of the data rather than roughly half, which reduces the upward bias of the error estimate.
  • Disadvantage: Higher computational cost (k model fits instead of one).

(ii) Compared to LOOCV

  • Advantage: Far less computationally intensive (k fits instead of n), and typically lower variance, since the n LOOCV training sets are nearly identical and so produce highly correlated error estimates.
  • Disadvantage: More bias than LOOCV, because each model is trained on fewer observations. (Both procedures are sketched below.)
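
For reference, here is a sketch of both procedures on the Default model using boot::cv.glm (not part of the original solution); the cost function is an assumption, counting misclassifications at the 0.5 cutoff:

library(ISLR)   # Default data
library(boot)   # cv.glm
glm.fit <- glm(default ~ income + balance, data = Default, family = "binomial")
cost <- function(y, prob) mean((prob > 0.5) != y)   # misclassification rate
set.seed(1)
cv.glm(Default, glm.fit, cost, K = 10)$delta[1]     # 10-fold CV: 10 refits
# cv.glm(Default, glm.fit, cost)$delta[1]           # LOOCV: n = 10,000 refits, slow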

Problem 5 - Validation Set Approach on Default

(a) Fit logistic regression

library(ISLR)  # contains the Default data set
set.seed(1)
glm.fit <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b) Validation Set Error

val_error <- function(seed) {
  set.seed(seed)
  # randomly split the observations in half: training and validation sets
  train_idx <- sample(1:nrow(Default), nrow(Default)/2)
  train <- Default[train_idx, ]
  test <- Default[-train_idx, ]
  
  glm.fit <- glm(default ~ income + balance, data = train, family = "binomial")
  probs <- predict(glm.fit, test, type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")  # classify at the 0.5 cutoff
  mean(preds != test$default)                # validation misclassification rate
}

val_error(1)
## [1] 0.0254

(c) Repeat 3 times

val_error(1)
## [1] 0.0254
val_error(2)
## [1] 0.0238
val_error(3)
## [1] 0.0264

The validation error varies from split to split (0.0254, 0.0238, 0.0264), showing that the validation set approach is sensitive to which observations end up in the training and validation sets.
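
To make this variability visible, one can run val_error over a larger set of seeds and inspect the spread (a quick check, not part of the original output):

errs <- sapply(1:10, val_error)  # validation error for ten random splits
range(errs)
sd(errs)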

(d) Add Student Dummy Variable

val_error_student <- function(seed) {
  set.seed(seed)
  train_idx <- sample(1:nrow(Default), nrow(Default)/2)
  train <- Default[train_idx, ]
  test <- Default[-train_idx, ]
  
  # same split as val_error for a given seed; the model now includes student
  glm.fit <- glm(default ~ income + balance + student, data = train, family = "binomial")
  probs <- predict(glm.fit, test, type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")
  mean(preds != test$default)
}

val_error_student(1)
## [1] 0.026

With the same split (seed 1), the validation error is 0.026 versus 0.0254 without student, so adding the student dummy variable does not reduce the test error estimate.
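
Because both functions seed the split identically, each seed gives the two models the same training/validation partition, so a paired comparison is straightforward (a sketch; the column names are arbitrary):

seeds <- 1:5
cbind(without_student = sapply(seeds, val_error),
      with_student    = sapply(seeds, val_error_student))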


Problem 6 - Bootstrap vs GLM SE

(a) SE from glm()

glm.fit <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(glm.fit)$coefficients
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
## balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136

(b) Define boot.fn

boot.fn <- function(data, index) {
  # refit the model on the bootstrap sample given by index;
  # return the income and balance coefficients
  fit <- glm(default ~ income + balance, data = data[index, ], family = "binomial")
  return(coef(fit)[2:3])
}

(c) Bootstrap Estimate

library(boot)
set.seed(1)
boot(Default, boot.fn, R = 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04

(d) Comment

The bootstrap SEs (4.87e-06 for income, 2.30e-04 for balance) are very close to the formula-based SEs from glm() (4.99e-06 and 2.27e-04), suggesting that the model-based standard errors are reliable for these data.
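
One way to line the two estimates up directly is to store the bootstrap object and take the empirical SD of the replicates (a sketch repeating the call from (c)):

set.seed(1)
boot.out <- boot(Default, boot.fn, R = 1000)
rbind(glm_formula = summary(glm.fit)$coefficients[c("income", "balance"), "Std. Error"],
      bootstrap   = apply(boot.out$t, 2, sd))  # SD of the 1000 replicates per coefficient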


Problem 9 - Bootstrap on Boston Data

(a) Estimate of mean medv

library(MASS)  # contains the Boston data set
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281

(b) SE estimate using formula

se_mu_hat <- sd(Boston$medv) / sqrt(nrow(Boston))
se_mu_hat
## [1] 0.4088611

(c) Bootstrap SE estimate

boot.fn.mean <- function(data, index) {
  mean(data[index])
}
set.seed(1)
boot_mean <- boot(Boston$medv, boot.fn.mean, R = 1000)
boot_mean
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.mean, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.007650791   0.4106622

(d) 95% CI from bootstrap and t-test

# approximate 95% CI: estimate +/- 2 bootstrap standard errors
ci_boot <- c(boot_mean$t0 - 2*sd(boot_mean$t), boot_mean$t0 + 2*sd(boot_mean$t))
ci_boot
## [1] 21.71148 23.35413
t.test(Boston$medv)$conf.int
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95

(e) Estimate of median

mu_med <- median(Boston$medv)
mu_med
## [1] 21.2

(f) Bootstrap SE of median

boot.fn.med <- function(data, index) {
  median(data[index])
}
set.seed(1)
boot_med <- boot(Boston$medv, boot.fn.med, R = 1000)
boot_med
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.med, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 0.02295   0.3778075

There is no simple closed-form formula for the standard error of the median, so the bootstrap is particularly useful here. The estimated SE (0.378) is somewhat smaller than that of the mean (0.411).
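
The same replicates also yield a confidence interval without any formula, for example a percentile interval via boot.ci (not part of the original solution):

boot.ci(boot_med, type = "perc")  # 95% percentile CI for the median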

(g) 10th percentile

mu_10 <- quantile(Boston$medv, 0.1)
mu_10
##   10% 
## 12.75

(h) Bootstrap SE of 10th percentile

boot.fn.q10 <- function(data, index) {
  quantile(data[index], 0.1)
}
set.seed(1)
boot_q10 <- boot(Boston$medv, boot.fn.q10, R = 1000)
boot_q10
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.q10, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0339   0.4767526

The bootstrap makes it straightforward to estimate the variability of statistics like the median and the 10th percentile, for which no simple closed-form SE formula exists. Note that the 10th percentile's SE (0.477) is larger than the median's (0.378).
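
All three bootstrap calls above share one pattern, which could be wrapped in a small helper; boot_se below is a hypothetical convenience function, not part of the original solution:

boot_se <- function(x, stat, R = 1000) {
  fn <- function(data, index) stat(data[index])  # statistic on a resampled vector
  sd(boot(x, fn, R = R)$t[, 1])                  # SD of the R replicates
}
set.seed(1)
boot_se(Boston$medv, median)  # reproduces the SE reported in (f)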