Assignment 4

STA6543

Author

Stephen Garcia (wqr974)

Published

July 3, 2025

Chapter 5

Resampling Methods

Conceptual Exercises:

Problem 3

We now review k-fold cross-validation.

(a)
Explain how k-fold cross-validation is implemented.
Response:
K-fold cross-validation is a resampling technique used to evaluate a model’s generalization performance. It works as follows:

  1. Data Splitting: The dataset is randomly partitioned into k equally sized (or nearly equal) subsets called folds.

  2. Iterative Training and Validation: The cross-validation process runs k times. In each iteration:

    • One fold is held out as the validation set.
    • The remaining k − 1 folds are combined to form the training set.
    • The model is trained on the training set and evaluated on the validation set.
  3. Performance Aggregation: A performance metric (e.g., accuracy, F1 score, RMSE) is computed for each of the k validation folds.

  4. Final Estimate: The k scores are averaged to provide an overall estimate of the model’s performance.

Compared with a single train-validation split, this approach makes fuller use of the available data and yields a less variable (and typically less biased) estimate of test error, making it easier to judge how well the model generalizes to independent data and to detect overfitting.
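As an illustration, below is a minimal hand-coded sketch of the procedure in R; the choice of the built-in mtcars data, a linear model, and MSE as the error metric is purely illustrative and not part of the exercise.

# A minimal hand-coded sketch of k-fold cross-validation (k = 5).
# The mtcars data, the linear model, and MSE are illustrative choices only.
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold assignment

cv_errors <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]  # the other k - 1 folds form the training set
  test  <- mtcars[folds == i, ]  # fold i is held out for validation
  fit   <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)  # validation MSE for fold i
})

mean(cv_errors)  # average of the k fold errors = CV estimate of test error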

Applied Exercises:

Problem 5

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

(a)
Fit a logistic regression model that uses income and balance to predict default.

# Load required packages
library(ISLR)
library(caret)
Warning: package 'caret' was built under R version 4.4.2
Loading required package: ggplot2
Loading required package: lattice
# Set random seed for reproducibility
set.seed(123)

# View the structure of the data
# str(Default)

# Fit a multiple logistic regression model using income and balance on the full data set
model_full <- train(
  default ~ income + balance,
  data = Default,
  method = "glm",
  family = "binomial"
)

# View the model summary
summary(model_full)

Call:
NULL

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1579.0  on 9997  degrees of freedom
AIC: 1585

Number of Fisher Scoring iterations: 8

(b)
Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

# i. Split the sample set into a training set and a validation set
train_index <- createDataPartition(Default$default, p = 0.7, list = FALSE)
train_data <- Default[train_index, ]
validation_data <- Default[-train_index, ]

# ii. Fit a multiple logistic regression model using only the training observations
model_fit <- train(
  default ~ income + balance,
  data = train_data,
  method = "glm",
  family = "binomial"
)

# iii. Obtain posterior probabilities and classify as 'Yes' if > 0.5
pred_probs <- predict(model_fit, newdata = validation_data, type = "prob")
pred_class <- ifelse(pred_probs[, "Yes"] > 0.5, "Yes", "No")

# iv. Compute the validation set error (fraction misclassified)
actual <- validation_data$default
misclassified <- mean(pred_class != actual)
misclassified
[1] 0.02200734

Extra
I wanted to see the output in a confusion matrix format.

# Generate confusion matrix using caret
conf_matrix <- confusionMatrix(
  factor(pred_class, levels = c("No", "Yes")),
  factor(actual, levels = c("No", "Yes"))
)

# Print the confusion matrix
conf_matrix
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  2887   53
       Yes   13   46
                                          
               Accuracy : 0.978           
                 95% CI : (0.9721, 0.9829)
    No Information Rate : 0.967           
    P-Value [Acc > NIR] : 0.0002241       
                                          
                  Kappa : 0.5717          
                                          
 Mcnemar's Test P-Value : 1.582e-06       
                                          
            Sensitivity : 0.9955          
            Specificity : 0.4646          
         Pos Pred Value : 0.9820          
         Neg Pred Value : 0.7797          
             Prevalence : 0.9670          
         Detection Rate : 0.9627          
   Detection Prevalence : 0.9803          
      Balanced Accuracy : 0.7301          
                                          
       'Positive' Class : No              
                                          

(c)
Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

# Initialize vector to store test errors
set.seed(123)
errors <- numeric(3)

for (i in 1:3) {
  # Split the data (a new random split is drawn each iteration; the seed is set once before the loop)
  train_index <- createDataPartition(Default$default, p = 0.7, list = FALSE)
  train_data <- Default[train_index, ]
  validation_data <- Default[-train_index, ]

  # Fit logistic regression model
  model_fit <- train(
    default ~ income + balance,
    data = train_data,
    method = "glm",
    family = "binomial"
  )

  # Predict on validation set
  pred_probs <- predict(model_fit, newdata = validation_data, type = "prob")
  pred_class <- ifelse(pred_probs[, "Yes"] > 0.5, "Yes", "No")

  # Compute test error (misclassification rate)
  actual <- validation_data$default
  errors[i] <- mean(pred_class != actual)
}

# Show test errors for each run
errors
[1] 0.02534178 0.02700900 0.02467489
mean_error <- mean(errors)
mean_error
[1] 0.02567523

Response:
The average test error across the three trials is approximately 2.57%, indicating a relatively low misclassification rate. The consistency across the three estimates suggests that the model’s performance is stable and not highly sensitive to the specific training-validation split. This reinforces the predictive strength of the variables income and especially balance, which is known to be a strong predictor of default in this dataset.

(d)
Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

# Set seed for reproducibility
set.seed(123)

# Repeat the validation set approach 3 times
errors_with_student <- numeric(3)

for (i in 1:3) {
  # Split the data into training and validation sets
  train_index <- createDataPartition(Default$default, p = 0.7, list = FALSE)
  train_data <- Default[train_index, ]
  validation_data <- Default[-train_index, ]

  # Fit logistic regression including 'student' as a predictor
  model_fit <- train(
    default ~ income + balance + student,
    data = train_data,
    method = "glm",
    family = "binomial"
  )

  # Predict on validation set
  pred_probs <- predict(model_fit, newdata = validation_data, type = "prob")
  pred_class <- ifelse(pred_probs[, "Yes"] > 0.5, "Yes", "No")

  # Compute test error (fraction misclassified)
  actual <- validation_data$default
  errors_with_student[i] <- mean(pred_class != actual)
}

# Print the results
errors_with_student
[1] 0.02634211 0.02734245 0.02600867
mean(errors_with_student)
[1] 0.02656441

Response:
The average validation set error for the logistic regression model including income, balance, and the student dummy variable is 2.66%, compared to 2.57% for the model without the student variable.

This result suggests that including the student variable did not reduce test error, and in fact resulted in a slightly higher misclassification rate. This implies that student may not provide meaningful additional predictive value beyond what is already captured by income and balance. Therefore, for this dataset and model, the inclusion of student does not improve generalization performance and may be unnecessary for prediction.

Problem 6

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

(a)
Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

# Set seed for reproducibility
set.seed(123)

# Fit logistic regression model using income and balance
logit_model <- glm(default ~ income + balance, data = Default, family = "binomial")

# Display model summary to extract standard errors
summary(logit_model)$coefficients
                 Estimate   Std. Error    z value      Pr(>|z|)
(Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136

(b)
Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

# Define boot.fn function
boot.fn <- function(data, index) {
  # Fit logistic regression model on selected observations
  fit <- glm(default ~ income + balance, data = data, subset = index, family = "binomial")
  
  # Return coefficient estimates for income and balance
  return(coef(fit)[c("income", "balance")])
}

# Example: Call boot.fn on a random sample
set.seed(123)
boot.fn(Default, sample(1:nrow(Default), nrow(Default), replace = TRUE))
      income      balance 
1.803436e-05 5.448714e-03 

(c)
Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

# Load necessary library
library(boot)

Attaching package: 'boot'
The following object is masked from 'package:lattice':

    melanoma
# Set seed for reproducibility
set.seed(123)

# Perform bootstrap with 1000 replications
boot_results <- boot(data = Default, statistic = boot.fn, R = 1000)

# View bootstrap estimates and standard errors
boot_results

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Default, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
        original       bias     std. error
t1* 2.080898e-05 1.582518e-07 4.729534e-06
t2* 5.647103e-03 1.296980e-05 2.217214e-04

(d)
Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

Response:

Comparison of Standard Errors for Logistic Regression Coefficients

Coefficient   Std. Error (glm)   Std. Error (bootstrap)
income        4.985 × 10⁻⁶       4.730 × 10⁻⁶
balance       2.274 × 10⁻⁴       2.217 × 10⁻⁴

Comment

The bootstrap standard errors are very close to those reported by glm() for both coefficients. For balance, the two estimates (2.217 × 10⁻⁴ vs. 2.274 × 10⁻⁴) are nearly identical, and for income the bootstrap estimate (4.730 × 10⁻⁶) is only slightly smaller than the glm() value (4.985 × 10⁻⁶). This close agreement suggests that the assumptions behind the standard formula used by glm() are reasonable for these data.

The glm() standard errors rely on the model being correctly specified and on large-sample approximations, whereas the bootstrap estimates variability directly by resampling the data. Bootstrap estimates therefore offer a data-driven way to assess variability and can be more robust when those model assumptions are questionable; here, the two approaches agree closely.
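For reference, the two sets of standard errors in the table above can be placed side by side directly in R. This is a small sketch that assumes logit_model from part (a) and boot_results from part (c) are still in the workspace:

# Standard errors from glm() vs. the bootstrap, side by side
se_glm  <- summary(logit_model)$coefficients[c("income", "balance"), "Std. Error"]
se_boot <- apply(boot_results$t, 2, sd)  # SD of each coefficient's bootstrap replicates
data.frame(glm = se_glm, bootstrap = se_boot)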

Problem 9

We will now consider the Boston housing data set, from the ISLR2 library.

(a)
Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆμ.

# Load the ISLR2 package
library(ISLR2)
Warning: package 'ISLR2' was built under R version 4.4.3

Attaching package: 'ISLR2'
The following objects are masked from 'package:ISLR':

    Auto, Credit
# (a) Estimate for the population mean of medv
mu_hat <- mean(Boston$medv)
mu_hat
[1] 22.53281

(b)
Provide an estimate of the standard error of ˆμ. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

# (b) Standard error using the formula
sample_sd <- sd(Boston$medv)
n <- nrow(Boston)
se_mu_hat <- sample_sd / sqrt(n)
se_mu_hat
[1] 0.4088611

(c)
Now estimate the standard error of ˆμ using the bootstrap. How does this compare to your answer from (b)?

# Define bootstrap function for the mean
boot.fn <- function(data, index) {
  mean(data[index])
}

# Perform bootstrap
set.seed(123)
boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000)
boot_results

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Boston$medv, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
    original      bias    std. error
t1* 22.53281 -0.01607372   0.4045557
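
The bootstrap standard error (about 0.405) is very close to the formula-based estimate of 0.409 from (b), as expected. A small sketch for placing the two side by side (se_mu_hat is the value computed in part (b)):

# Formula-based SE from (b) next to the bootstrap SE from (c)
c(formula = se_mu_hat, bootstrap = sd(boot_results$t))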

(d)
Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95 % confidence interval using the formula [ˆμ − 2SE(ˆμ), ˆμ + 2SE(ˆμ)].

# Extract values from bootstrap
mu_hat <- mean(Boston$medv)
se_boot <- sd(boot_results$t)  # bootstrap SE of the mean: SD of the replicate means stored in boot_results$t

# Approximate 95% CI using bootstrap SE
ci_bootstrap <- c(mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)
ci_bootstrap
[1] 21.72369 23.34192
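
To make the comparison with t.test() that the problem asks for, a minimal sketch is shown below (output not reproduced here); because the standard errors agree closely, the t-based interval is expected to be very similar to the ±2·SE bootstrap interval above.

# Exact t-based 95% confidence interval for the mean of medv, for comparison
t.test(Boston$medv)$conf.int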

(e)
Based on this data set, provide an estimate, ˆμmed, for the median value of medv in the population.

# Estimate the population median of medv
mu_med_hat <- median(Boston$medv)
mu_med_hat
[1] 21.2

(f)
We now would like to estimate the standard error of ˆμmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

# Define bootstrap function to return the sample median
boot.fn.median <- function(data, index) {
  median(data[index])
}

# Set seed for reproducibility
set.seed(123)

# Perform bootstrap with 1000 replications
boot_median <- boot(data = Boston$medv, statistic = boot.fn.median, R = 1000)

# Display bootstrap results
boot_median

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Boston$medv, statistic = boot.fn.median, R = 1000)


Bootstrap Statistics :
    original  bias    std. error
t1*     21.2 -0.0203   0.3676453

Response:
The bootstrap results for the sample median medv are:

  • Sample Median (ˆμmed): 21.2
  • Bootstrap Bias: -0.020
  • Bootstrap Standard Error: 0.368

Interpretation

Using 1,000 bootstrap replications, we estimate the standard error of the median home value to be approximately 0.368. This tells us how much the median medv would vary across repeated samples from the population.

The small bias (−0.0203) suggests that the bootstrap median estimates are centered closely around the original sample median.

This standard error reflects the variability of a robust statistic (the median), and supports the conclusion that 21.2 is a stable and reasonable estimate of the population median for medv in the Boston dataset.
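
If an interval estimate for the median is also of interest, the same ±2·SE approximation used in part (d) could be applied with the bootstrap standard error computed above (a sketch):

# Approximate 95% CI for the median of medv using the bootstrap SE
se_median <- sd(boot_median$t)
c(mu_med_hat - 2 * se_median, mu_med_hat + 2 * se_median)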

(g)
Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity ˆμ0.1. (You can use the quantile() function.)

# Estimate the 10th percentile of medv
mu_0.1_hat <- quantile(Boston$medv, probs = 0.1)
mu_0.1_hat
  10% 
12.75 

(h)
Use the bootstrap to estimate the standard error of ˆμ0.1. Comment on your findings.

# Define bootstrap function to return 10th percentile
boot.fn.p10 <- function(data, index) {
  quantile(data[index], probs = 0.1)
}

# Perform bootstrap with 1000 replications
set.seed(123)
boot_p10 <- boot(data = Boston$medv, statistic = boot.fn.p10, R = 1000)

# View bootstrap results
boot_p10

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Boston$medv, statistic = boot.fn.p10, R = 1000)


Bootstrap Statistics :
    original  bias    std. error
t1*    12.75  -0.012    0.527868

Response:
Using 1,000 bootstrap samples, we estimate the 10th percentile (ˆμ0.1) of medv to be 12.75. The bootstrap output shows:

  • Bootstrap Bias: −0.012
  • Bootstrap Standard Error: 0.528

Interpretation

This means that the estimated standard error for the 10th percentile is approximately 0.528, which reflects the typical variability we might expect in ˆμ0.1 across different random samples from the Boston housing population.

The very small bias (−0.012) suggests that the bootstrap distribution is well-centered around the original estimate, indicating that the sample-based 10th percentile is a stable and reliable estimate of the population 10th percentile.

The bootstrap once again proves useful for estimating standard errors when no closed-form solution exists, especially for quantiles and other non-linear statistics.