library(ISLR2)
library(boot)

Question 3: k-Fold Cross-Validation

(a) How k-fold cross-validation is implemented

k-fold cross-validation works as follows:

  1. The dataset of n observations is randomly divided into k roughly equal-sized folds.
  2. For each fold i (from 1 to k):
    • The ith fold is held out as the validation set
    • The model is trained on the remaining k−1 folds
    • The test error is estimated on the held-out fold
  3. The k resulting error estimates are averaged:

\[CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i\]

(b) Advantages and Disadvantages

Relative to the Validation Set Approach:

Advantages of k-fold CV:

  • Lower variance in the test error estimate, since it averages over k splits rather than relying on a single random split.
  • Uses more data for training (all but 1/k of the data), leading to less bias in the error estimate.
  • Not sensitive to which observations happen to end up in the training vs. validation set.

Disadvantages of k-fold CV:

  • Computationally more expensive — the model must be fit k times instead of once.

Relative to LOOCV:

Advantages of k-fold CV:

  • Much less computationally expensive: fits the model k times (typically 5 or 10) vs. n times for LOOCV.
  • Lower variance than LOOCV: LOOCV averages n models trained on nearly identical data, so the estimates are highly correlated, leading to higher variance in the average.

Disadvantages of k-fold CV:

  • Slightly higher bias than LOOCV, since each model is trained on (k−1)/k of the data rather than (n−1)/n.

Conclusion: k = 5 or k = 10 is generally preferred due to the bias-variance tradeoff.


Question 5: Default Dataset — Validation Set Approach

(a) Fit logistic regression using income and balance

set.seed(1)

glm.fit.full <- glm(default ~ income + balance,
                    data = Default,
                    family = binomial)

summary(glm.fit.full)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b) Validation set approach — estimate test error

set.seed(1)

n <- nrow(Default)

# i. Split into training (50%) and validation (50%) sets
train <- sample(n, n / 2)

# ii. Fit logistic regression on training data only
glm.fit <- glm(default ~ income + balance,
               data = Default,
               family = binomial,
               subset = train)

# iii. Predict on validation set
glm.probs <- predict(glm.fit, Default[-train, ], type = "response")
glm.pred  <- ifelse(glm.probs > 0.5, "Yes", "No")

# iv. Compute validation set error
val_error_b <- mean(glm.pred != Default[-train, "default"])
cat("Validation set error (seed 1):", round(val_error_b * 100, 2), "%\n")
## Validation set error (seed 1): 2.54 %

(c) Repeat three times with different random seeds

compute_val_error <- function(seed) {
  set.seed(seed)
  train  <- sample(n, n / 2)
  fit    <- glm(default ~ income + balance,
                data = Default, family = binomial, subset = train)
  probs  <- predict(fit, Default[-train, ], type = "response")
  pred   <- ifelse(probs > 0.5, "Yes", "No")
  mean(pred != Default[-train, "default"])
}

errors <- sapply(c(2, 3, 4), compute_val_error)
for (i in seq_along(errors)) {
  cat(sprintf("Validation error (seed %d): %.2f%%\n", i + 1, errors[i] * 100))
}
## Validation error (seed 2): 2.38%
## Validation error (seed 3): 2.64%
## Validation error (seed 4): 2.56%

Comment: The validation error varies across seeds (roughly 2.4%–2.6%), illustrating the variability inherent in the validation set approach — different splits yield different error estimates.

(d) Add student dummy variable

set.seed(1)
train <- sample(n, n / 2)

glm.fit.student <- glm(default ~ income + balance + student,
                       data = Default,
                       family = binomial,
                       subset = train)

probs.student <- predict(glm.fit.student, Default[-train, ], type = "response")
pred.student  <- ifelse(probs.student > 0.5, "Yes", "No")
val_error_d   <- mean(pred.student != Default[-train, "default"])

cat("Validation error with student variable:", round(val_error_d * 100, 2), "%\n")
## Validation error with student variable: 2.6 %

Comment: Including student does not meaningfully reduce the test error. The dummy variable for student status does not appear to improve predictive accuracy in this logistic regression model.


Question 6: Bootstrap Standard Errors for Default Dataset

(a) Standard errors via glm()

set.seed(1)

glm.fit.default <- glm(default ~ income + balance,
                       data = Default,
                       family = binomial)

summary(glm.fit.default)$coefficients[c("income", "balance"), ]
##             Estimate   Std. Error   z value      Pr(>|z|)
## income  2.080898e-05 4.985167e-06  4.174178  2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136

(b) Write boot.fn()

boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance,
             data = data,
             family = binomial,
             subset = index)
  return(coef(fit)[c("income", "balance")])
}

# Quick test on full data
boot.fn(Default, 1:nrow(Default))
##       income      balance 
## 2.080898e-05 5.647103e-03

(c) Use boot() to estimate standard errors

set.seed(1)

boot.out <- boot(Default, boot.fn, R = 1000)
boot.out
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04

(d) Comment on results

# Compare glm SE vs bootstrap SE
glm_se   <- summary(glm.fit.default)$coefficients[c("income", "balance"), "Std. Error"]
boot_se  <- c(sd(boot.out$t[, 1]), sd(boot.out$t[, 2]))

comparison <- data.frame(
  Coefficient  = c("income", "balance"),
  GLM_SE       = glm_se,
  Bootstrap_SE = boot_se
)
print(comparison)
##         Coefficient       GLM_SE Bootstrap_SE
## income       income 4.985167e-06 4.866284e-06
## balance     balance 2.273731e-04 2.298949e-04

Comment: The bootstrap standard errors are very close to the formula-based glm() standard errors. This agreement is reassuring — it suggests the standard logistic regression assumptions (e.g., about the shape of the likelihood) hold reasonably well here. The bootstrap makes no distributional assumptions, so this similarity validates both approaches.


Question 9: Boston Housing Dataset

(a) Estimate population mean of medv

mu_hat <- mean(Boston$medv)
cat("Estimated population mean of medv:", mu_hat, "\n")
## Estimated population mean of medv: 22.53281

(b) Standard error of mu_hat

n_boston <- nrow(Boston)
se_hat   <- sd(Boston$medv) / sqrt(n_boston)
cat("Standard error of mu_hat:", round(se_hat, 4), "\n")
## Standard error of mu_hat: 0.4089

Interpretation: The sample mean of medv is estimated to be $22,532, with a standard error of about $409. This means repeated samples would typically produce means within roughly $409 of the true population mean.

(c) Bootstrap SE of mu_hat

set.seed(1)

boot.mean <- function(data, index) mean(data[index])
boot.mu   <- boot(Boston$medv, boot.mean, R = 1000)
boot.mu
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.mean, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.007650791   0.4106622
boot_se_mu <- sd(boot.mu$t)
cat("Bootstrap SE of mu_hat:", round(boot_se_mu, 4), "\n")
## Bootstrap SE of mu_hat: 0.4107
cat("Formula-based SE:       ", round(se_hat, 4), "\n")
## Formula-based SE:        0.4089

Comment: The bootstrap SE (≈ 0.412) is very close to the formula-based SE (≈ 0.409). Both methods agree well.

(d) 95% Confidence Interval for mean of medv

# Bootstrap-based CI
ci_bootstrap <- c(mu_hat - 2 * boot_se_mu, mu_hat + 2 * boot_se_mu)
cat("Bootstrap 95% CI: (", round(ci_bootstrap[1], 3), ",", round(ci_bootstrap[2], 3), ")\n")
## Bootstrap 95% CI: ( 21.711 , 23.354 )
# t-test CI
t_result <- t.test(Boston$medv)
cat("t-test 95% CI:    (", round(t_result$conf.int[1], 3), ",",
    round(t_result$conf.int[2], 3), ")\n")
## t-test 95% CI:    ( 21.73 , 23.336 )

Comment: The bootstrap-based CI is nearly identical to the t-test CI. Both methods agree closely, further validating the bootstrap approach.

(e) Estimate median of medv

mu_med <- median(Boston$medv)
cat("Estimated population median of medv:", mu_med, "\n")
## Estimated population median of medv: 21.2

(f) Bootstrap SE of the median

set.seed(1)

boot.median  <- function(data, index) median(data[index])
boot.med.out <- boot(Boston$medv, boot.median, R = 1000)
boot.med.out
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.median, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 0.02295   0.3778075
boot_se_med <- sd(boot.med.out$t)
cat("Bootstrap SE of median:", round(boot_se_med, 4), "\n")
## Bootstrap SE of median: 0.3778

Comment: The median is estimated at $21,200 with a bootstrap SE of about $380. There is no simple closed-form formula for the SE of the median, making the bootstrap especially valuable here. The SE is slightly smaller than that of the mean, suggesting the median is a relatively stable estimate.

(g) 10th percentile of medv

mu_01 <- quantile(Boston$medv, 0.1)
cat("Estimated 10th percentile of medv:", mu_01, "\n")
## Estimated 10th percentile of medv: 12.75

(h) Bootstrap SE of the 10th percentile

set.seed(1)

boot.q01    <- function(data, index) quantile(data[index], 0.1)
boot.q01.out <- boot(Boston$medv, boot.q01, R = 1000)
boot.q01.out
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.q01, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0339   0.4767526
boot_se_q01 <- sd(boot.q01.out$t)
cat("Bootstrap SE of 10th percentile:", round(boot_se_q01, 4), "\n")
## Bootstrap SE of 10th percentile: 0.4768

Comment: The 10th percentile is estimated at $12,750 with a bootstrap SE of about $477. The SE is larger relative to the estimate compared to the mean and median, which is expected since tail quantiles are estimated with greater uncertainty. Again, the bootstrap is the natural tool here since no closed-form formula exists for the SE of a quantile.