Implementation of K-Fold Cross-Validation:
K-fold cross-validation divides the dataset into k equal-sized folds. Each fold serves as the validation set exactly once, while the remaining k-1 folds are used for training. This process repeats k times, and the k performance estimates are averaged to assess the model's overall performance.
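As a concrete illustration, here is a minimal sketch of the procedure described above, applied to the logistic regression used later in this section; k = 5 and the fold assignment are illustrative choices, and cv.glm() in the boot package automates the same computation.
# Minimal sketch of manual 5-fold cross-validation (k = 5 is an illustrative choice)
library(ISLR2)
set.seed(1)
k <- 5
# Randomly assign each observation to one of k folds
folds <- sample(rep(1:k, length.out = nrow(Default)))
cv_errors <- sapply(1:k, function(i) {
  # Train on all folds except fold i
  fit <- glm(default ~ income + balance,
             data = Default[folds != i, ], family = binomial)
  # Validate on the held-out fold i
  probs <- predict(fit, newdata = Default[folds == i, ], type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")
  mean(preds != Default$default[folds == i])
})
mean(cv_errors) # average misclassification error across the k folds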
Advantages and Disadvantages:
Relative to the validation set approach: the advantage is that the test error estimate has lower variance and less bias, since every observation is used for both training and validation; the disadvantage is the extra computation and complexity of fitting the model k times.
Relative to LOOCV: the advantage is that k-fold is far less computationally intensive and reduces the high variance seen in LOOCV's performance estimates; the disadvantage is that LOOCV attains slightly less bias by training on nearly the entire dataset in each iteration.
# Load the necessary library
library(ISLR2)
# Set a random seed for reproducibility
set.seed(123)
# Fit the logistic regression model
default_model <- glm(default ~ income + balance, data = Default, family = binomial)
# Display the summary of the model
summary(default_model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Split the data into training and validation sets (e.g., 80% training, 20% validation)
train_indices <- sample(1:nrow(Default), size = 0.8 * nrow(Default))
train_set <- Default[train_indices, ]
validation_set <- Default[-train_indices, ]
# Check the split
nrow(train_set) # Number of observations in the training set
## [1] 8000
nrow(validation_set) # Number of observations in the validation set
## [1] 2000
# Fit the logistic regression model using the training observations
train_model <- glm(default ~ income + balance, data = train_set, family = binomial)
# Display the summary of the model
summary(train_model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = train_set)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.161e+01  4.884e-01  -23.78  < 2e-16 ***
## income       2.174e-05  5.602e-06    3.88 0.000104 ***
## balance      5.677e-03  2.548e-04   22.28  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2340.6 on 7999 degrees of freedom
## Residual deviance: 1258.1 on 7997 degrees of freedom
## AIC: 1264.1
##
## Number of Fisher Scoring iterations: 8
# Obtain posterior probabilities of default for the validation set
validation_probs <- predict(train_model, newdata = validation_set, type = "response")
# Classify individuals to the "default" category if the probability is greater than 0.5
validation_preds <- ifelse(validation_probs > 0.5, "Yes", "No")
# Add predictions to the validation set for reference
validation_set$predicted_default <- validation_preds
# View the first few rows of the validation set with predictions
head(validation_set)
##    default student   balance    income predicted_default
## 6       No     Yes  919.5885  7491.559                No
## 19      No      No  485.9369 61566.106                No
## 23      No      No 1055.9566 51317.883                No
## 34      No      No  913.5872 46907.225                No
## 35      No     Yes 1423.9389 22634.488                No
## 38      No      No  351.4535 35087.489                No
# Compute the validation set error
validation_error <- mean(validation_set$predicted_default != validation_set$default)
# Display the validation set error
validation_error
## [1] 0.026
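Since only about 3% of customers default, a raw error rate can hide how the errors split between the two classes. A quick confusion matrix is an optional way to check:
# Cross-tabulate predicted vs. actual default status on the validation set
table(Predicted = validation_set$predicted_default,
      Actual = validation_set$default)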
# Define a function to calculate validation error for each split
calculate_validation_error <- function(seed_value) {
  # Set the random seed for reproducibility
  set.seed(seed_value)
  # Split the data into training (80%) and validation (20%) sets
  train_indices <- sample(1:nrow(Default), size = 0.8 * nrow(Default))
  train_set <- Default[train_indices, ]
  validation_set <- Default[-train_indices, ]
  # Fit the logistic regression model on the training set
  train_model <- glm(default ~ income + balance, data = train_set, family = binomial)
  # Obtain posterior probabilities for the validation set
  validation_probs <- predict(train_model, newdata = validation_set, type = "response")
  # Classify based on the 0.5 threshold
  validation_preds <- ifelse(validation_probs > 0.5, "Yes", "No")
  # Compute the validation error
  validation_error <- mean(validation_preds != validation_set$default)
  return(validation_error)
}
# Run the process three times with different random seeds
error1 <- calculate_validation_error(123)
error2 <- calculate_validation_error(456)
error3 <- calculate_validation_error(789)
# Display validation errors for each split
validation_errors <- c(error1, error2, error3)
validation_errors
## [1] 0.0260 0.0275 0.0280
Comments: The validation errors obtained from the three different splits are 0.0260, 0.0275, and 0.0280. These values are very close to each other, indicating that the model’s performance is stable across different training-validation splits.
This consistency suggests that the logistic regression model, using income and balance to predict default, generalizes well and is not overly sensitive to the particular training-validation split. The error rates are low in absolute terms, but note that only about 3% of customers in Default actually default, so a classifier that always predicts "No" would achieve a similar error rate; raw accuracy should be read with that class imbalance in mind.
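To quantify that stability beyond three seeds, the split can be repeated many times and the spread of the errors summarized; the choice of 100 repetitions below is arbitrary:
# Repeat the validation-set split over 100 seeds and summarize the errors
many_errors <- sapply(1:100, calculate_validation_error)
c(mean = mean(many_errors), sd = sd(many_errors))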
# Fit the logistic regression model with income, balance, and student
train_model_with_student <- glm(default ~ income + balance + student, data = train_set, family = binomial)
# Obtain posterior probabilities for the validation set
validation_probs_with_student <- predict(train_model_with_student, newdata = validation_set, type = "response")
# Classify individuals based on probabilities (threshold = 0.5)
validation_preds_with_student <- ifelse(validation_probs_with_student > 0.5, "Yes", "No")
# Compute the validation set error
validation_error_with_student <- mean(validation_preds_with_student != validation_set$default)
validation_error_with_student
## [1] 0.026
Comment: The validation error for the model that also includes the student variable is 0.026, identical to the error obtained on the same split for the model without student. This suggests that adding the student variable does not lead to a meaningful reduction in the test error rate for predicting default.
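A likelihood ratio test offers an optional, more formal comparison of the two nested models fit on the same training set:
# Likelihood ratio test: does adding student improve the fit significantly?
anova(train_model, train_model_with_student, test = "Chisq")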
# Set random seed for reproducibility
set.seed(123)
# Fit the logistic regression model, storing it under a new object name
model_income_balance <- glm(default ~ income + balance, data = Default, family = binomial)
# Display the summary of the model
model_summary_new <- summary(model_income_balance)
# Extract the estimated standard errors for income and balance
std_errors_new <- model_summary_new$coefficients[, "Std. Error"]
std_errors_new
## (Intercept) income balance
## 4.347564e-01 4.985167e-06 2.273731e-04
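These standard errors are the square roots of the diagonal of the estimated coefficient covariance matrix, so the same values can be recovered directly as a quick consistency check:
# glm() standard errors are the square roots of the diagonal of vcov()
sqrt(diag(vcov(model_income_balance)))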
set.seed(123)
boot.fn <- function(data, index) {
  # Subset the data using the provided index
  subset_data <- data[index, ]
  # Fit the logistic regression model
  model <- glm(default ~ income + balance, data = subset_data, family = "binomial")
  # Extract the coefficients for income and balance only
  coefs <- coef(model)[c("income", "balance")]
  # Return the coefficients
  return(coefs)
}
boot.fn(Default, 1:nrow(Default))
## income balance
## 2.080898e-05 5.647103e-03
library(boot)
# Set random seed for reproducibility
set.seed(123)
# Define the bootstrap function (from previous step)
boot.fn <- function(data, index) {
  subset_data <- data[index, ]
  model <- glm(default ~ income + balance, data = subset_data, family = "binomial")
  coefs <- coef(model)[c("income", "balance")]
  return(coefs)
}
# Perform bootstrap estimation
boot_result <- boot(Default, boot.fn, R = 1000)
# Display the bootstrap results
print(boot_result)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
##         original         bias    std. error
## t1* 2.080898e-05 1.582518e-07  4.729534e-06
## t2* 5.647103e-03 1.296980e-05  2.217214e-04
# Extract standard errors (standard deviation of bootstrap estimates)
income_se <- sd(boot_result$t[, 1])
balance_se <- sd(boot_result$t[, 2])
# Print the estimated standard errors
cat("Bootstrap Standard Error for income:", income_se, "\n")
## Bootstrap Standard Error for income: 4.729534e-06
cat("Bootstrap Standard Error for balance:", balance_se, "\n")
## Bootstrap Standard Error for balance: 0.0002217214
Comments: The standard errors from glm() and from the bootstrap are very similar (income: 4.985e-06 vs. 4.730e-06; balance: 2.274e-04 vs. 2.217e-04), indicating that both methods give consistent estimates of the variability of the logistic regression coefficients. This close agreement suggests the model is reasonably well specified; the small discrepancy is attributable to the randomness of resampling with 1,000 replicates.
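The bootstrap replicates can also be turned into confidence intervals directly. As an optional follow-up, boot.ci() computes normal and percentile intervals from the boot object; index = 1 selects the income coefficient (t1*), and index = 2 would select balance:
# 95% CIs for the income coefficient from the bootstrap replicates
boot.ci(boot_result, type = c("norm", "perc"), index = 1)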
# Calculate the sample mean of medv
mu_hat <- mean(Boston$medv)
# Display the result
mu_hat
## [1] 22.53281
The estimate for the population mean of medv, denoted µ̂, is 22.53281.
# Sample standard deviation and sample size of medv
s <- sd(Boston$medv)
n <- nrow(Boston)
# Compute the standard error of mu-hat: SE(mu_hat) = s / sqrt(n)
se_mu_hat <- s / sqrt(n)
# Display the result
se_mu_hat
## [1] 0.4088611
# Set seed for reproducibility
set.seed(123)
# Define the function to compute the mean of medv
mean.fn <- function(data, index) {
  return(mean(data$medv[index]))
}
# Run the bootstrap with 1000 replicates
boot_result <- boot(Boston, mean.fn, R = 1000)
# Extract the standard error (standard deviation of bootstrap means)
se_mu_hat_boot <- sd(boot_result$t)
# Original sample mean for reference
mu_hat <- mean(Boston$medv)
# Display the result
mu_hat
## [1] 22.53281
se_mu_hat_boot
## [1] 0.4045557
The bootstrap SE (0.4045557) closely matches the analytical SE (0.4088611), supporting both approaches. The small difference reflects resampling randomness with a finite number of replicates and is negligible given how close the two values are.
# Sample mean
mu_hat <- mean(Boston$medv)
# Bootstrap SE from (c)
se_mu_hat_boot <- 0.4045557
# Approximate bootstrap 95% CI using the rule of thumb [mu_hat - 2*SE, mu_hat + 2*SE]
boot_ci_lower <- mu_hat - 2 * se_mu_hat_boot
boot_ci_upper <- mu_hat + 2 * se_mu_hat_boot
# t-test 95% CI
t_test_result <- t.test(Boston$medv)
t_ci_lower <- t_test_result$conf.int[1]
t_ci_upper <- t_test_result$conf.int[2]
# Print results
cat("Bootstrap 95% CI for mean of medv: [", boot_ci_lower, ", ", boot_ci_upper, "]\n")
## Bootstrap 95% CI for mean of medv: [ 21.72369 , 23.34192 ]
cat("t.test 95% CI for mean of medv: [", t_ci_lower, ", ", t_ci_upper, "]\n")
## t.test 95% CI for mean of medv: [ 21.72953 , 23.33608 ]
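The two intervals nearly coincide ([21.72, 23.34] vs. [21.73, 23.34]), as expected for a roughly symmetric sampling distribution. As an optional cross-check, boot.ci() can compute intervals directly from the bootstrap replicates instead of the 2-SE rule (boot_result here is the mean bootstrap from the step above):
# Normal and percentile 95% CIs from the bootstrap distribution of the mean
boot.ci(boot_result, type = c("norm", "perc"))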
# Calculate the sample median of medv
mu_hat_med <- median(Boston$medv)
# Print the result
cat("The estimate for the population median of medv, ˆµmed, is ", mu_hat_med, "\n")
## The estimate for the population median of medv, ˆµmed, is 21.2
# Set seed for reproducibility
set.seed(123)
# Define the function to compute the median of medv
median.fn <- function(data, index) {
  return(median(data$medv[index]))
}
# Run the bootstrap with 1000 replicates
boot_result <- boot(Boston, median.fn, R = 1000)
# Extract the standard error (standard deviation of bootstrap medians)
se_mu_hat_med <- sd(boot_result$t)
# Print the result
cat("The bootstrap standard error of ˆµmed is ", se_mu_hat_med, "\n")
## The bootstrap standard error of ˆµmed is 0.3676453
Comment: A standard error of 0.3676453 implies that the median estimate is fairly stable and does not vary widely across resamples. It is in fact slightly smaller than the bootstrap standard error of the mean (0.4046), consistent with the median being less sensitive to the high-value tail of medv.
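Following the same 2-SE rule of thumb used earlier for the mean, an approximate 95% confidence interval for the median can be formed as an optional extension:
# Approximate 95% CI for the median via the 2-SE rule of thumb
c(lower = mu_hat_med - 2 * se_mu_hat_med,
  upper = mu_hat_med + 2 * se_mu_hat_med)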
# Compute the tenth percentile of medv
mu_hat_0.1 <- quantile(Boston$medv, probs = 0.1)
# Print the result
cat("The tenth percentile of medv (ˆµ0.1) is:", mu_hat_0.1, "\n")
## The tenth percentile of medv (ˆµ0.1) is: 12.75
# Define the function to compute the 10th percentile of medv
percentile.fn <- function(data, index) {
  return(quantile(data$medv[index], probs = 0.1))
}
# Set seed for reproducibility
set.seed(123)
# Perform bootstrap with 1000 replicates
bootstrap_results <- boot(data = Boston, statistic = percentile.fn, R = 1000)
# Extract the standard error (standard deviation of bootstrap estimates)
se_mu_hat_0.1 <- sd(bootstrap_results$t)
# Print the result
cat("The bootstrap standard error of ˆµ0.1 is:", se_mu_hat_0.1, "\n")
## The bootstrap standard error of ˆµ0.1 is: 0.527868
Comment: The bootstrap standard error of µ̂_0.1, calculated as 0.527868, measures the sampling variability in estimating the 10th percentile of medv. It is larger than the standard errors of the mean and median, which is expected: quantiles toward the tail of a distribution are determined by fewer nearby observations and are therefore estimated less stably. Relative to the estimate of 12.75, however, the uncertainty remains moderate.