Lab 6

5. In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
(a) Fit a logistic regression model that uses income and balance to predict default.

library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.4.3

library(tidyverse)

## Warning: package 'tidyr' was built under R version 4.4.2

## Warning: package 'readr' was built under R version 4.4.2

## Warning: package 'dplyr' was built under R version 4.4.3

## Warning: package 'lubridate' was built under R version 4.4.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

set.seed(1)

# Fit logistic regression model
glm.fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
i. Split the sample set into a training set and a validation set.

set.seed(1)
train <- sample(nrow(Default), nrow(Default) / 2) # Randomly split data
Default.train <- Default[train, ]
Default.validation <- Default[-train, ]

ii. Fit a multiple logistic regression model using only the training observations.

glm.train <- glm(default ~ income + balance, data = Default.train, family = binomial)

iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

glm.probs <- predict(glm.train, Default.validation, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")

iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

validation.error <- mean(glm.pred != Default.validation$default)
validation.error

## [1] 0.0254

(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

# Repeat splitting, fitting, predicting, and calculating error for a given seed.
# The seed is set inside the function, so no global set.seed() call is needed here.
repeat_validation <- function(seed) {
  set.seed(seed)
  train <- sample(nrow(Default), nrow(Default) / 2)
  Default.train <- Default[train, ]
  Default.validation <- Default[-train, ]
  
  glm.train <- glm(default ~ income + balance, data = Default.train, family = binomial)
  glm.probs <- predict(glm.train, Default.validation, type = "response")
  glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")
  
  mean(glm.pred != Default.validation$default)
}

errors <- sapply(1:3, repeat_validation)
errors

## [1] 0.0254 0.0238 0.0264

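Explanation: repeat_validation() reruns the split, fit, predict, and error steps from (b) for a given seed, and sapply() applies it to seeds 1, 2, and 3. The validation errors are 0.0254, 0.0238, and 0.0264, so the estimated test error ranges from about 2.4% to 2.6% depending on which observations land in the training and validation sets. Note that seed 1 reproduces the split from (b), so only two of the three splits are new. The estimates are close but not identical, which illustrates the variability of the validation set approach.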
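To get a fuller picture of how much the validation-set estimate moves between splits, the same procedure can be repeated many more times and summarised. The sketch below is only an illustration: it uses simulated data rather than Default, and the names sim, validation_error(), and split_errors are hypothetical.

```r
# Simulated stand-in for a binary-outcome data set (assumption: NOT the Default data)
set.seed(42)
n <- 2000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(
  ifelse(runif(n) < plogis(-2 + 1.5 * sim$x1 + 0.5 * sim$x2), "Yes", "No"),
  levels = c("No", "Yes")
)

# One validation-set estimate: split in half, fit, classify at 0.5, return error rate
validation_error <- function(data) {
  train <- sample(nrow(data), nrow(data) / 2)
  fit <- glm(y ~ x1 + x2, data = data[train, ], family = binomial)
  probs <- predict(fit, data[-train, ], type = "response")
  pred <- ifelse(probs > 0.5, "Yes", "No")
  mean(pred != data$y[-train])
}

# Repeat over 50 random splits and summarise the spread of the estimates
split_errors <- replicate(50, validation_error(sim))
c(mean = mean(split_errors), sd = sd(split_errors),
  min = min(split_errors), max = max(split_errors))
```

The standard deviation across splits gives a rough sense of how far a single validation-set estimate can be from the average.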

(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

glm.student <- glm(default ~ income + balance + student, data = Default.train, family = binomial)
glm.student.probs <- predict(glm.student, Default.validation, type = "response")
glm.student.pred <- ifelse(glm.student.probs > 0.5, "Yes", "No")
student.error <- mean(glm.student.pred != Default.validation$default)
student.error

## [1] 0.026

Explanation: Adding the student dummy variable gives a validation error of 0.026 on the split from (b), compared with 0.0254 for the model that uses income and balance only. On this split, including student does not reduce the test error rate. The difference (0.0006, or 3 of the 5,000 validation observations) is well within the split-to-split variation seen in (c).
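Because a single split can favour either model by chance, one way to make the comparison more reliable is to evaluate both models on the same sequence of splits and look at the average difference. The following is a minimal sketch under stated assumptions: the data are simulated, the factor group stands in for a dummy variable that has no true effect, and split_error() and compare_once() are hypothetical helpers.

```r
# Simulated data (assumption): 'group' is a dummy variable unrelated to the outcome
set.seed(7)
n <- 2000
sim <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  group = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
sim$y <- factor(
  ifelse(runif(n) < plogis(-2 + 1.5 * sim$x1 + 0.5 * sim$x2), "Yes", "No"),
  levels = c("No", "Yes")
)

# Validation error of one model formula for a given set of training indices
split_error <- function(formula, data, train) {
  fit <- glm(formula, data = data[train, ], family = binomial)
  probs <- predict(fit, data[-train, ], type = "response")
  mean(ifelse(probs > 0.5, "Yes", "No") != data$y[-train])
}

# Evaluate both models on the SAME random split so the comparison is paired
compare_once <- function(data) {
  train <- sample(nrow(data), nrow(data) / 2)
  c(base     = split_error(y ~ x1 + x2, data, train),
    extended = split_error(y ~ x1 + x2 + group, data, train))
}

results <- replicate(20, compare_once(sim))  # 2 x 20 matrix, one column per split
rowMeans(results)                            # average error of each model
mean(results["extended", ] - results["base", ])  # average paired difference
```

An average paired difference close to zero would suggest that the extra predictor does not help classification.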

6. We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

set.seed(1)
glm.fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)$coefficients[, "Std. Error"]

##  (Intercept)       income      balance 
## 4.347564e-01 4.985167e-06 2.273731e-04

(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn <- function(data, index) {
  coef(glm(default ~ income + balance, data = data[index, ], family = binomial))
}

(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

library(boot)

set.seed(1)
boot.results <- boot(Default, boot.fn, R = 1000)
boot.results$t0 # Original coefficients

##   (Intercept)        income       balance 
## -1.154047e+01  2.080898e-05  5.647103e-03

apply(boot.results$t, 2, sd) # Bootstrap standard errors

## [1] 4.344722e-01 4.866284e-06 2.298949e-04

(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

# Compare results
summary(glm.fit)$coefficients[, "Std. Error"]

##  (Intercept)       income      balance 
## 4.347564e-01 4.985167e-06 2.273731e-04

apply(boot.results$t, 2, sd)

## [1] 4.344722e-01 4.866284e-06 2.298949e-04

The two sets of standard errors are very close. For the intercept, glm() gives 0.4348 and the bootstrap gives 0.4345; for income, 4.985e-06 versus 4.866e-06; for balance, 2.274e-04 versus 2.299e-04. The glm() values come from the standard formula, which relies on the model's assumptions being correct, while the bootstrap values are obtained empirically by refitting the model on resampled data. Their agreement suggests that the formula-based standard errors are reasonable for this model.
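The comparison can also be carried out without the boot package by resampling rows directly. The sketch below is illustrative only: it uses simulated data rather than Default, a smaller number of bootstrap samples (B = 200) to keep it quick, and hypothetical names such as boot_coefs and formula_se.

```r
# Simulated logistic-regression data (assumption: NOT the Default data)
set.seed(1)
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 1.0 * sim$x1 + 0.5 * sim$x2))

# Formula-based standard errors from glm()
fit <- glm(y ~ x1 + x2, data = sim, family = binomial)
formula_se <- summary(fit)$coefficients[, "Std. Error"]

# Hand-rolled bootstrap: resample rows with replacement, refit, keep coefficients
B <- 200
boot_coefs <- t(replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  coef(glm(y ~ x1 + x2, data = sim[idx, ], family = binomial))
}))
boot_se <- apply(boot_coefs, 2, sd)  # one standard deviation per coefficient

# Side-by-side comparison; ratios near 1 indicate agreement
round(cbind(formula_se, boot_se, ratio = boot_se / formula_se), 4)
```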

7. In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4).

(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2.

data("Weekly")
set.seed(1)

glm.weekly <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
summary(glm.weekly)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22122    0.06147   3.599 0.000319 ***
## Lag1        -0.03872    0.02622  -1.477 0.139672    
## Lag2         0.06025    0.02655   2.270 0.023232 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1488.2  on 1086  degrees of freedom
## AIC: 1494.2
## 
## Number of Fisher Scoring iterations: 4

(b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation.

glm.exclude.first <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = binomial)

# Predict direction for first observation
predict.first <- predict(glm.exclude.first, Weekly[1, ], type = "response")
predicted.direction.first <- ifelse(predict.first > 0.5, "Up", "Down")
predicted.direction.first == Weekly$Direction[1]

## [1] FALSE

(c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(Direction = “Up”|Lag1, Lag2) > 0.5. Was this observation correctly classified?

data("Weekly")

# Fit logistic regression excluding the first observation
glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = binomial)

# Predict the direction of the first observation
predicted.prob <- predict(glm.fit, Weekly[1, ], type = "response")
predicted.direction <- ifelse(predicted.prob > 0.5, "Up", "Down")

# Check if the prediction is correct
actual.direction <- Weekly$Direction[1]
correct <- predicted.direction == actual.direction

# Output results
list(
  Predicted_Probability = predicted.prob,
  Predicted_Direction = predicted.direction,
  Actual_Direction = actual.direction,
  Correct_Classification = correct
)

## $Predicted_Probability
##         1 
## 0.5713923 
## 
## $Predicted_Direction
##    1 
## "Up" 
## 
## $Actual_Direction
## [1] Down
## Levels: Down Up
## 
## $Correct_Classification
## [1] FALSE

Explanation:

The logistic regression model is trained on all but the first observation.
The posterior probability for the first observation is 0.571. Since P(Direction = “Up” | Lag1, Lag2) > 0.5, the observation is classified as “Up”.
The actual direction is “Down”, so the first observation was not correctly classified.

(d) Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:

i. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.

# Initialize the number of observations
n <- nrow(Weekly)

# Example for the first observation (i = 1)
i <- 1

# Fit logistic regression model excluding the ith observation
model_loocv <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
summary(model_loocv)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly[-i, 
##     ])
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22324    0.06150   3.630 0.000283 ***
## Lag1        -0.03843    0.02622  -1.466 0.142683    
## Lag2         0.06085    0.02656   2.291 0.021971 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1494.6  on 1087  degrees of freedom
## Residual deviance: 1486.5  on 1085  degrees of freedom
## AIC: 1492.5
## 
## Number of Fisher Scoring iterations: 4

ii. Compute the posterior probability of the market moving up for the ith observation.

# Compute posterior probability for the ith observation
posterior_prob <- predict(model_loocv, newdata = Weekly[i, ], type = "response")
posterior_prob

##         1 
## 0.5713923

iii. Use the posterior probability for the ith observation in order to predict whether or not the market moves up.

# Predict direction based on posterior probability
predicted_direction <- ifelse(posterior_prob > 0.5, "Up", "Down")
predicted_direction

##    1 
## "Up"

iv. Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

# Check if prediction is incorrect (error = 1 if incorrect)
actual_direction <- Weekly$Direction[i]
error <- ifelse(predicted_direction != actual_direction, 1, 0)
error

## [1] 1

# Initialize vector to store errors
errors <- numeric(n)

for (i in 1:n) {
  # Fit model excluding the ith observation
  model_loocv <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
  
  # Compute posterior probability for the ith observation
  posterior_prob <- predict(model_loocv, newdata = Weekly[i, ], type = "response")
  
  # Predict direction based on posterior probability
  predicted_direction <- ifelse(posterior_prob > 0.5, "Up", "Down")
  
  # Check if prediction is incorrect (error = 1 if incorrect)
  errors[i] <- ifelse(predicted_direction != Weekly$Direction[i], 1, 0)
}

# Compute LOOCV error rate
loocv_error_rate <- mean(errors)
loocv_error_rate

## [1] 0.4499541

(e) Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results.
The LOOCV error rate is 0.4499541, so the model misclassifies about 45% of weeks when each week is predicted from a model fit to the other n - 1 observations. This is only a modest improvement over the 50% error expected from random guessing, which indicates that Lag1 and Lag2 have limited power to predict Direction. Because each fit uses n - 1 of the n observations, LOOCV gives an approximately unbiased estimate of the test error, and unlike the validation set approach it involves no randomness in the splits.
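A useful reference point for any classification error rate is the error of always predicting the most common class. The sketch below computes a LOOCV error alongside that baseline. It is an illustration under stated assumptions: the data are simulated (not Weekly), n is kept small so the loop runs quickly, and the names loocv_errors and baseline_error are hypothetical.

```r
# Simulated weak-signal classification data (assumption: NOT the Weekly data)
set.seed(3)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(
  ifelse(runif(n) < plogis(0.2 + 0.3 * sim$x1), "Up", "Down"),
  levels = c("Down", "Up")
)

# LOOCV: for each i, fit without row i and record 1 if row i is misclassified
loocv_errors <- vapply(seq_len(n), function(i) {
  fit <- glm(y ~ x1 + x2, data = sim[-i, ], family = binomial)
  prob_up <- predict(fit, newdata = sim[i, ], type = "response")
  pred <- ifelse(prob_up > 0.5, "Up", "Down")
  as.numeric(pred != sim$y[i])
}, numeric(1))

loocv_error <- mean(loocv_errors)

# Baseline: error rate from always predicting the majority class
baseline_error <- 1 - max(table(sim$y)) / n

c(loocv = loocv_error, baseline = baseline_error)
```

If the LOOCV error is not clearly below the baseline, the predictors add little beyond knowing which class is more common.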

9. We will now consider the Boston housing data set, from the ISLR2 library.

(a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate μ̂.

data("Boston")
mu_hat <- mean(Boston$medv)
mu_hat

## [1] 22.53281

(b) Provide an estimate of the standard error of μ̂. Interpret this result.
Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

std_error <- sd(Boston$medv) / sqrt(nrow(Boston))
std_error

## [1] 0.4088611

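Explanation: The standard error quantifies the variability of the estimate μ̂ across repeated samples. Because medv is recorded in thousands of dollars, a standard error of 0.409 means that the sample mean of 22.53 would typically differ from the population mean by about $409.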
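The formula from the hint can be checked by simulation: draw many samples from a known population, compute the mean of each, and compare the standard deviation of those means with sd / sqrt(n). This is a sketch under stated assumptions: the population is a normal distribution with made-up parameters, and the variable names are hypothetical.

```r
# Known population (assumption): normal with mean 22 and standard deviation 9
set.seed(10)
pop_sd <- 9
n <- 500

# Formula-based standard error from a single sample
one_sample <- rnorm(n, mean = 22, sd = pop_sd)
formula_se <- sd(one_sample) / sqrt(n)

# Empirical standard error: spread of the sample mean over many fresh samples
sample_means <- replicate(2000, mean(rnorm(n, mean = 22, sd = pop_sd)))
empirical_se <- sd(sample_means)

# All three values should be close to one another
c(formula = formula_se, empirical = empirical_se, theoretical = pop_sd / sqrt(n))
```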

(c) Now estimate the standard error of μ̂ using the bootstrap. How does this compare to your answer from (b)?

set.seed(123)
bootstrap_se <- function(data, B = 1000) {
  n <- length(data)
  boot_means <- numeric(B)
  
  for (b in 1:B) {
    sample_data <- sample(data, size = n, replace = TRUE)
    boot_means[b] <- mean(sample_data)
  }
  
  return(sd(boot_means))
}

bootstrap_se_medv <- bootstrap_se(Boston$medv)
bootstrap_se_medv

## [1] 0.4185474

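The bootstrap estimate of the standard error (0.4185) is close to the formula-based estimate from (b) (0.4089). The small difference reflects resampling variability and would change slightly with a different seed or number of bootstrap samples.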

(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv).

# Bootstrap-based confidence interval
ci_bootstrap <- c(mu_hat - 2 * bootstrap_se_medv, mu_hat + 2 * bootstrap_se_medv)

# Confidence interval using t.test()
ci_t_test <- t.test(Boston$medv)$conf.int

list(Bootstrap_CI = ci_bootstrap, T_Test_CI = ci_t_test)

## $Bootstrap_CI
## [1] 21.69571 23.36990
## 
## $T_Test_CI
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95

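Both methods give similar 95% confidence intervals: approximately (21.70, 23.37) from the bootstrap and (21.73, 23.34) from t.test(). The bootstrap interval is slightly wider because the bootstrap standard error is slightly larger and the multiplier 2 is slightly larger than the t quantile used by t.test(). The bootstrap standard error itself makes no distributional assumption, although the interval μ̂ ± 2 SE still assumes that the sampling distribution of μ̂ is approximately normal.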
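An alternative that avoids the ± 2 SE approximation is the percentile bootstrap interval, which takes the 2.5% and 97.5% quantiles of the bootstrap means directly. This is a different technique from the one used above, shown here only as a sketch: the data are simulated from a skewed distribution (not medv), and the names boot_means, normal_ci, and percentile_ci are hypothetical.

```r
# Skewed simulated sample (assumption: NOT the medv variable)
set.seed(5)
x <- rexp(300, rate = 1 / 20)

# Bootstrap distribution of the sample mean
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# Interval 1: estimate +/- 2 bootstrap standard errors (as in the lab)
normal_ci <- mean(x) + c(-2, 2) * sd(boot_means)

# Interval 2: percentile bootstrap, using quantiles of the bootstrap means
percentile_ci <- unname(quantile(boot_means, probs = c(0.025, 0.975)))

rbind(normal_ci, percentile_ci)
```

For a roughly symmetric bootstrap distribution the two intervals should nearly coincide; they can differ more when the distribution is skewed.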

(e) Based on this data set, provide an estimate, μ̂_med, for the median value of medv in the population.

median_hat <- median(Boston$medv)
median_hat

## [1] 21.2

(f) We now would like to estimate the standard error of μ̂_med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

bootstrap_se_median <- function(data, B = 1000) {
  n <- length(data)
  boot_medians <- numeric(B)
  
  for (b in 1:B) {
    sample_data <- sample(data, size = n, replace = TRUE)
    boot_medians[b] <- median(sample_data)
  }
  
  return(sd(boot_medians))
}

bootstrap_se_median_medv <- bootstrap_se_median(Boston$medv)
bootstrap_se_median_medv

## [1] 0.3779944

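The bootstrap standard error of the median is about 0.378, which is small relative to the estimate of 21.2 and slightly smaller than the standard error of the mean (about 0.41). The bootstrap is useful here because there is no simple formula for the standard error of the median.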

(g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity μ̂_0.1. (You can use the quantile() function.)

tenth_percentile_hat <- quantile(Boston$medv, probs = 0.1)
tenth_percentile_hat

##   10% 
## 12.75

(h) Use the bootstrap to estimate the standard error of μ̂_0.1. Comment on your findings.

bootstrap_se_percentile <- function(data, percentile = 0.1, B = 1000) {
  n <- length(data)
  boot_percentiles <- numeric(B)
  
  for (b in 1:B) {
    sample_data <- sample(data, size = n, replace = TRUE)
    boot_percentiles[b] <- quantile(sample_data, probs = percentile)
  }
  
  return(sd(boot_percentiles))
}

bootstrap_se_tenth_percentile_medv <- bootstrap_se_percentile(Boston$medv, percentile = 0.1)
bootstrap_se_tenth_percentile_medv

## [1] 0.5069193

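The bootstrap standard error of the tenth percentile is about 0.507, which is small relative to the estimate of 12.75 but larger than the standard errors of the mean and the median. This is plausible because fewer observations lie in the lower tail of the distribution, so the tenth percentile is estimated less precisely.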
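The three bootstrap functions used for the mean, the median, and the tenth percentile differ only in the statistic they compute, so they could be collapsed into one function that takes the statistic as an argument. The following is a minimal sketch of that idea, assuming simulated data and a hypothetical function name bootstrap_se_stat().

```r
# Generic bootstrap standard error: 'statistic' is any function of a numeric vector;
# extra arguments (e.g. probs for quantile) are passed through via '...'
bootstrap_se_stat <- function(data, statistic, B = 1000, ...) {
  n <- length(data)
  boot_stats <- replicate(B, statistic(sample(data, size = n, replace = TRUE), ...))
  sd(boot_stats)
}

# Simulated right-skewed sample (assumption: NOT the medv variable)
set.seed(123)
x <- rgamma(500, shape = 4, scale = 5)

se_mean   <- bootstrap_se_stat(x, mean)
se_median <- bootstrap_se_stat(x, median)
se_q10    <- bootstrap_se_stat(x, quantile, probs = 0.1)

c(mean = se_mean, median = se_median, q10 = se_q10)
```

Passing the statistic as a function keeps the resampling logic in one place, which makes it harder for the three versions to drift apart.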

Akshay Kumar

03/24/2025