STA 6543 Predictive Modeling

Problem 3: k-Fold Cross-Validation

(a) How k-fold cross-validation is implemented

k-fold cross-validation works by randomly dividing the full training data set into \(k\) roughly equal-sized groups, or folds. Then, for each fold \(i = 1, \ldots, k\):

Hold out fold \(i\) as the validation set.
Fit the model on the remaining \(k - 1\) folds (the training portion).
Evaluate prediction error on the held-out fold \(i\).

After repeating this for every fold, you have \(k\) error estimates. The k-fold CV estimate is the average of these \(k\) numbers:

\[\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i\]

The most common choices are \(k = 5\) or \(k = 10\). When \(k = n\) (one observation left out each time), this becomes LOOCV.

(b) Advantages and disadvantages of k-fold CV

i. Relative to the validation set approach:

The main advantage over the validation set approach is that you’re not stuck depending on one lucky (or unlucky) 50/50 split. Averaging over \(k\) different hold-outs lowers the variance of the error estimate, and since each training set is larger, the bias goes down too. The cost is computation: the model has to be fit \(k\) times rather than once. That’s usually fine, but for slow models on big data it adds up. The validation set approach wins on simplicity when a rough estimate is all you need.

ii. Relative to LOOCV:

LOOCV uses \(n - 1\) observations for training each time, so it barely underestimates the true error. The catch is that all \(n\) models are trained on almost the same data, making their individual errors highly correlated — the average ends up noisier than you’d expect. k-fold CV with \(k = 5\) or \(10\) trades a small amount of bias for a big drop in variance, and the estimate tends to be more stable overall. It’s also much faster since you only fit the model \(k\) times instead of \(n\).

Problem 5: Validation Set Approach on the Default Data Set

data(Default, package = "ISLR2")
cat("Dimensions:", nrow(Default), "rows x", ncol(Default), "columns\n")

## Dimensions: 10000 rows x 4 columns

head(Default)

##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559

(a) Fit a logistic regression model using income and balance

set.seed(1)
fit_full <- glm(default ~ income + balance,
                data    = Default,
                family  = binomial)
summary(fit_full)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b) Validation set estimate of the test error

set.seed(1)

n        <- nrow(Default)
train_idx <- sample(n, n / 2)

fit_val  <- glm(default ~ income + balance,
                data    = Default,
                family  = binomial,
                subset  = train_idx)

prob_val  <- predict(fit_val,
                     newdata = Default[-train_idx, ],
                     type    = "response")
pred_val  <- ifelse(prob_val > 0.5, "Yes", "No")
actual    <- Default[-train_idx, "default"]

val_err_b <- mean(pred_val != actual)
cat("Validation set error (seed 1):", round(val_err_b, 4), "\n")

## Validation set error (seed 1): 0.0254

The estimated test error is about 2.54%.

(c) Repeat with three different random splits

seeds <- c(42, 7, 123)
errs  <- numeric(3)

for (i in seq_along(seeds)) {
  set.seed(seeds[i])
  tr  <- sample(n, n / 2)
  fit <- glm(default ~ income + balance,
             data   = Default,
             family = binomial,
             subset = tr)
  prb <- predict(fit, newdata = Default[-tr, ], type = "response")
  prd <- ifelse(prb > 0.5, "Yes", "No")
  errs[i] <- mean(prd != Default[-tr, "default"])
  cat("Seed", seeds[i], "validation error:", round(errs[i], 4), "\n")
}

## Seed 42 validation error: 0.026 
## Seed 7 validation error: 0.0242 
## Seed 123 validation error: 0.0276

The error rate shifts around a bit across the three splits, which illustrates the main issue with this approach: the result depends on which observations you happen to assign to training vs. validation. There’s no way to tell which estimate is closest to the truth. Running a few different splits and looking at the range at least gives a sense of how much the estimate is bouncing around.

(d) Add the student dummy variable

set.seed(1)
tr_d  <- sample(n, n / 2)
fit_d <- glm(default ~ income + balance + student,
             data   = Default,
             family = binomial,
             subset = tr_d)
prb_d <- predict(fit_d, newdata = Default[-tr_d, ], type = "response")
prd_d <- ifelse(prb_d > 0.5, "Yes", "No")
err_d <- mean(prd_d != Default[-tr_d, "default"])
cat("Validation error with student:", round(err_d, 4), "\n")

## Validation error with student: 0.026

cat("Validation error without student (b):", round(val_err_b, 4), "\n")

## Validation error without student (b): 0.0254

Adding student barely moves the error rate. The two-predictor model does about the same thing, which lines up with what Chapter 4 showed — once balance is in the model it already accounts for most of what student status is correlated with.

Problem 6: Bootstrap Estimates of Standard Errors on the Default Data Set

(a) Standard errors from glm()

fit_6 <- glm(default ~ income + balance,
             data   = Default,
             family = binomial)
se_glm <- summary(fit_6)$coefficients[, "Std. Error"]
cat("GLM standard errors:\n")

## GLM standard errors:

print(round(se_glm, 8))

## (Intercept)      income     balance 
##  0.43475636  0.00000499  0.00022737

The standard error for the income coefficient is approximately 4.985e-06 and for balance it is approximately 2.274e-04.

(b) Write a boot_fn() function

boot_fn <- function(data, index) {
  fit <- glm(default ~ income + balance,
             data   = data,
             family = binomial,
             subset = index)
  coef(fit)[c("income", "balance")]
}

# Quick test on the full data set
boot_fn(Default, 1:nrow(Default))

##       income      balance 
## 2.080898e-05 5.647103e-03

(c) Bootstrap standard errors

set.seed(1)
boot_out <- boot(Default, boot_fn, R = 1000)
boot_out

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot_fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04

cat("Bootstrap standard errors:\n")

## Bootstrap standard errors:

cat("  income :", round(sd(boot_out$t[, 1]), 8), "\n")

##   income : 4.87e-06

cat("  balance:", round(sd(boot_out$t[, 2]), 8), "\n")

##   balance: 0.00022989

(d) Comparison: sm.GLM() (glm) vs bootstrap

The bootstrap SEs and the glm() SEs come out very close for both coefficients, so either way you get roughly the same picture. The glm() formula relies on the model being correctly specified; the bootstrap just resamples the data directly without needing that. Since the two agree here, it suggests the model assumptions are holding reasonably well. If the model were off, the bootstrap would be the more trustworthy number.

Problem 9: Boston Housing Data Set

data(Boston, package = "ISLR2")
cat("Dimensions:", nrow(Boston), "rows x", ncol(Boston), "columns\n")

## Dimensions: 506 rows x 13 columns

(a) Estimate the population mean of medv

mu_hat <- mean(Boston$medv)
cat("Estimated population mean (mu_hat):", round(mu_hat, 4), "\n")

## Estimated population mean (mu_hat): 22.5328

The estimate for the population mean of medv is \(\hat{\mu} =\) 22.5328.

(b) Standard error of the sample mean

se_formula <- sd(Boston$medv) / sqrt(nrow(Boston))
cat("SE of mu_hat (formula):", round(se_formula, 4), "\n")

## SE of mu_hat (formula): 0.4089

Plugging in: \(\text{SE}(\hat{\mu}) = s / \sqrt{n}\) gives 0.4089. So the sample mean is off from the true population mean by about 0.41 thousand dollars on average.

(c) Bootstrap estimate of SE(μ̂)

mean_fn <- function(data, index) {
  mean(data$medv[index])
}

set.seed(1)
boot_mean <- boot(Boston, mean_fn, R = 1000)
se_boot   <- sd(boot_mean$t)
cat("Bootstrap SE of mu_hat:", round(se_boot, 4), "\n")

## Bootstrap SE of mu_hat: 0.4107

cat("Formula SE of mu_hat  :", round(se_formula, 4), "\n")

## Formula SE of mu_hat  : 0.4089

The bootstrap SE is 0.4107, essentially the same as the formula estimate of 0.4089. They match because the sample mean is well-behaved and the standard formula works fine for it. The bootstrap is more useful for statistics where no closed-form SE exists.

(d) 95% confidence interval for the mean of medv

ci_boot <- c(mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)
ci_t    <- t.test(Boston$medv)$conf.int
cat("Bootstrap 95% CI (mu_hat ± 2 * SE_boot): [",
    round(ci_boot[1], 4), ",", round(ci_boot[2], 4), "]\n")

## Bootstrap 95% CI (mu_hat ± 2 * SE_boot): [ 21.7115 , 23.3541 ]

cat("t.test 95% CI                           : [",
    round(ci_t[1], 4), ",", round(ci_t[2], 4), "]\n")

## t.test 95% CI                           : [ 21.7295 , 23.3361 ]

The bootstrap-based 95% CI is [21.711, 23.354] and the t.test() interval is [21.73, 23.336]. The two are nearly the same, which makes sense since the SEs were nearly identical. Both put the true mean of medv somewhere around 21 to 23 thousand dollars.

(e) Estimate of the population median

mu_med <- median(Boston$medv)
cat("Estimated population median (mu_med):", mu_med, "\n")

## Estimated population median (mu_med): 21.2

The estimate for the population median is \(\hat{\mu}_{med} =\) 21.2.

(f) Bootstrap standard error of the median

median_fn <- function(data, index) {
  median(data$medv[index])
}

set.seed(1)
boot_med <- boot(Boston, median_fn, R = 1000)
se_med   <- sd(boot_med$t)
cat("Bootstrap SE of mu_med:", round(se_med, 4), "\n")

## Bootstrap SE of mu_med: 0.3778

The bootstrap SE for the median is 0.3778, a bit smaller than the SE for the mean. The median is less pulled around by extreme values, so that’s not surprising. There’s no closed-form formula for this, so the bootstrap is the practical way to go.

(g) Estimate of the tenth percentile of medv

mu_01 <- quantile(Boston$medv, 0.1)
cat("Estimated 10th percentile (mu_0.1):", mu_01, "\n")

## Estimated 10th percentile (mu_0.1): 12.75

The estimate for the tenth percentile of medv in the population is \(\hat{\mu}_{0.1} =\) 12.75.

(h) Bootstrap standard error of the tenth percentile

p10_fn <- function(data, index) {
  quantile(data$medv[index], 0.1)
}

set.seed(1)
boot_p10 <- boot(Boston, p10_fn, R = 1000)
se_p10   <- sd(boot_p10$t)
cat("Bootstrap SE of mu_0.1:", round(se_p10, 4), "\n")

## Bootstrap SE of mu_0.1: 0.4768

The SE for the 10th percentile is 0.4768, larger than what we got for the mean or median. A tail quantile is pinned down by fewer observations, so there’s more variability in the estimate. There’s no formula to use here either, which is why the bootstrap is handy.