Question 3 — k-Fold Cross-Validation

(a) Explain how k-fold cross-validation is implemented

In k-fold cross-validation, the observations are randomly divided into k roughly equal-sized folds. For each fold in turn, the model is fit on the other k − 1 folds and its error is computed on the held-out fold, producing k estimates of the test error. The k-fold CV estimate is the average of these k estimates.
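
For concreteness, here is a minimal sketch of the procedure in R. It is not required by the question; it uses the Default data and the logistic model from Question 5 purely as an illustration, and the seed and k = 10 are arbitrary choices.

library(ISLR2)                                # Default data set, used here only for illustration
set.seed(1)                                   # arbitrary seed, for reproducibility
k <- 10
n <- nrow(Default)
folds <- sample(rep(1:k, length.out = n))     # randomly assign each observation to one of k folds
cv_errors <- rep(NA, k)
for (j in 1:k) {
  fit <- glm(default ~ income + balance, data = Default[folds != j, ], family = binomial)
  probs <- predict(fit, Default[folds == j, ], type = "response")
  pred <- ifelse(probs > 0.5, "Yes", "No")
  cv_errors[j] <- mean(pred != Default[folds == j, ]$default)   # error on held-out fold j
}
mean(cv_errors)                               # k-fold CV estimate of the test error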

(b) Advantages and disadvantages

Compared to the validation set approach, k-fold CV gives a less biased estimate of the test error, because each model is trained on a larger fraction of the data, and a less variable one, because the estimate is averaged over k splits rather than depending on a single random split. Compared to LOOCV, k-fold CV (with k = 5 or 10) is far less computationally expensive, requiring k fits instead of n, and typically has lower variance because the n fits in LOOCV are highly correlated with one another; LOOCV, however, has slightly lower bias since each fit uses n − 1 observations.
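
To illustrate the computational contrast, one option (an assumption about tooling; the question only asks for a conceptual comparison) is cv.glm() from the boot package. Its default cost is squared error, so a misclassification cost function, in the style of the cv.glm documentation, is supplied here.

library(boot)
library(ISLR2)
cost <- function(r, pi) mean(abs(r - pi) > 0.5)   # misclassification cost for a 0/1 response
glm.full <- glm(default ~ income + balance, data = Default, family = binomial)
cv.glm(Default, glm.full, cost, K = 10)$delta[1]  # 10-fold CV: only 10 model fits
# cv.glm(Default, glm.full, cost)$delta[1]        # LOOCV (K defaults to n): 10,000 fits, far slower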

Question 5 — Validation Set Approach with Logistic Regression

(a) Fit Logistic Regression Model

library(ISLR2)   # provides the Default (and later Boston) data sets
library(boot)    # boot() used in Questions 6 and 9
glm.fit <- glm(default ~ income + balance, data=Default, family=binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Answer: Both balance and income are statistically significant predictors of default (p < 0.001 for each); balance has by far the stronger association (z ≈ 24.8 versus z ≈ 4.2 for income).

(b) Validation Set Error

train <- sample(1:nrow(Default), nrow(Default)/2)                      # random 50/50 split
glm.fit.train <- glm(default ~ income + balance, data=Default, subset=train, family=binomial)
glm.probs <- predict(glm.fit.train, Default[-train,], type="response") # P(default) on validation set
glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")                       # classify with a 0.5 threshold
error1 <- mean(glm.pred != Default[-train,]$default)                   # validation set error rate
error1
## [1] 0.0254

Answer: The validation set error for this split is about 2.54%; the exact value varies slightly with the random split.

(c) Repeat Three Times

error_rates <- rep(NA,3)
for (i in 1:3) {
  set.seed(i)
  train <- sample(1:nrow(Default), nrow(Default)/2)
  glm.fit.train <- glm(default ~ income + balance, data=Default, subset=train, family=binomial)
  glm.probs <- predict(glm.fit.train, Default[-train,], type="response")
  glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")
  error_rates[i] <- mean(glm.pred != Default[-train,]$default)
}
error_rates
## [1] 0.0254 0.0238 0.0264

Answer: The error rates across the three splits are slightly different but stable, coming out to 2.54%, 2.38%, and 2.64%.

(d) Include Student Variable

error_rates_student <- rep(NA,3)
for (i in 1:3) {
  set.seed(i)
  train <- sample(1:nrow(Default), nrow(Default)/2)
  glm.fit.train <- glm(default ~ income + balance + student, data=Default, subset=train, family=binomial)
  glm.probs <- predict(glm.fit.train, Default[-train,], type="response")
  glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")
  error_rates_student[i] <- mean(glm.pred != Default[-train,]$default)
}
error_rates_student
## [1] 0.0260 0.0246 0.0272

Answer: Including the student dummy variable does not reduce the validation error; the three splits give 2.60%, 2.46%, and 2.72%, essentially the same as (or marginally worse than) the model without student.

Question 6 — Bootstrap for Standard Errors

(a) glm() Standard Errors

summary(glm.fit)$coefficients
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
## balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136

Answer: The standard errors from glm() are approximately 4.98e-06 for income and 2.27e-04 for balance.

(b) Bootstrap Function

boot.fn <- function(data, index){
  # refit the logistic regression on the bootstrap sample defined by index
  fit <- glm(default ~ income + balance, data=data, family=binomial, subset=index)
  return(coef(fit))
}
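
As a quick sanity check (not part of the required output), applying boot.fn() to the full index set should reproduce the coefficients from glm() in part (a):

boot.fn(Default, 1:nrow(Default))   # should match coef(glm.fit) from part (a)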

(c) Bootstrap Estimate

boot(Default, boot.fn, R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -4.138024e-02 4.381705e-01
## t2*  2.080898e-05  2.071209e-07 4.860557e-06
## t3*  5.647103e-03  1.975295e-05 2.310331e-04

Answer: The bootstrap standard errors are approximately 4.86e-06 for income and 2.31e-04 for balance.

(d) Comment on the estimated standard errors

Answer: The bootstrap standard errors (4.86e-06 for income, 2.31e-04 for balance) are very close to the formula-based estimates from glm() (4.99e-06 and 2.27e-04). The glm() standard errors rely on the assumed logistic model and asymptotic approximations, whereas the bootstrap does not, so their close agreement suggests those assumptions are reasonable for these data.

Question 9 — Bootstrap on Boston Housing

(a) Estimate Mean

mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281

Answer: The mean of medv is approximately 22.53.

(b) Standard Error Formula

se_mu <- sd(Boston$medv) / sqrt(nrow(Boston))
se_mu
## [1] 0.4088611

Answer: The estimated standard error of the sample mean is about 0.41.
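
For reference, plugging the sample values into the formula SE(mu_hat) = s / sqrt(n) gives s ≈ 9.197 and n = 506, so 9.197 / 22.49 ≈ 0.409, matching the value above.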

(c) Bootstrap SE for Mean

boot.mean.fn <- function(data, index){
  return(mean(data[index]))
}
boot.out <- boot(Boston$medv, boot.mean.fn, R=1000)
boot.out
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.mean.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 22.53281 0.02053913    0.405457

Answer: The bootstrap SE of the mean (about 0.405) is very close to the formula-based estimate of 0.409 from part (b).

(d) Bootstrap CI vs t-test

boot_se <- sd(boot.out$t)                                  # SE of the mean from the bootstrap replicates
CI_boot <- c(mu_hat - 2 * boot_se, mu_hat + 2 * boot_se)   # approximate 95% CI: mean ± 2 SE
CI_boot
## [1] 21.72189 23.34372
t.test(Boston$medv)$conf.int
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95

Answer: The bootstrap CI (21.72, 23.34) and the t-test CI (21.73, 23.34) are nearly identical.
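
As a side note (a sketch, not required by the question), the boot package can also compute confidence intervals directly from the replicates stored in boot.out, for example a percentile interval:

boot.ci(boot.out, type = "perc")   # percentile bootstrap CI for the mean of medv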

(e) Median Estimate

mu_med <- median(Boston$medv)
mu_med
## [1] 21.2

Answer: The median of medv is 21.2.

(f) Bootstrap SE for Median

boot.median.fn <- function(data, index){
  return(median(data[index]))
}
boot.out.median <- boot(Boston$medv, boot.median.fn, R=1000)
boot.out.median
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.median.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     21.2 -0.02225   0.3786632

Answer: The bootstrap SE for the median is about 0.38, slightly smaller than the bootstrap SE of the mean (0.41).

(g) Estimate 10th Percentile

mu_0.1 <- quantile(Boston$medv, 0.1)
mu_0.1
##   10% 
## 12.75

Answer: The estimated 10th percentile of medv is 12.75.

(h) Bootstrap SE for 10th Percentile

boot.quantile.fn <- function(data, index){
  return(quantile(data[index], 0.1))
}
boot.out.quantile <- boot(Boston$medv, boot.quantile.fn, R=1000)
boot.out.quantile
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.quantile.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0012   0.5098416

Answer: The bootstrap SE for the 10th percentile is about 0.51, larger than the SEs for the mean and median but still small relative to the estimate of 12.75.