In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.3
set.seed(1)
# Fit logistic regression
model1 <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Split the sample set into a training set and a validation set.
Fit a multiple logistic regression model using only the training observations
Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
Compute the validation set error, which is the fraction of the observations in the validation set that are misclassifed.
# Split the data: 50% training, 50% validation
set.seed(1)
trainIndex <- sample(1:nrow(Default), nrow(Default) / 2)
trainSet <- Default[trainIndex, ]
validSet <- Default[-trainIndex, ]
# Fit model on training data
model2 <- glm(default ~ income + balance, data = trainSet, family = "binomial")
# Predict on validation set
probs <- predict(model2, newdata = validSet, type = "response")
predictions <- ifelse(probs > 0.5, "Yes", "No")
# Calculate validation error
actual <- validSet$default
mean(predictions != actual)
## [1] 0.0254
set.seed(2)
# First split
train1 <- sample(1:nrow(Default), nrow(Default) / 2)
model3 <- glm(default ~ income + balance, data = Default[train1, ], family = "binomial")
pred1 <- predict(model3, newdata = Default[-train1, ], type = "response")
error1 <- mean(ifelse(pred1 > 0.5, "Yes", "No") != Default[-train1, ]$default)
set.seed(3)
# Second split
train2 <- sample(1:nrow(Default), nrow(Default) / 2)
model4 <- glm(default ~ income + balance, data = Default[train2, ], family = "binomial")
pred2 <- predict(model4, newdata = Default[-train2, ], type = "response")
error2 <- mean(ifelse(pred2 > 0.5, "Yes", "No") != Default[-train2, ]$default)
set.seed(4)
# Third split
train3 <- sample(1:nrow(Default), nrow(Default) / 2)
model5 <- glm(default ~ income + balance, data = Default[train3, ], family = "binomial")
pred3 <- predict(model5, newdata = Default[-train3, ], type = "response")
error3 <- mean(ifelse(pred3 > 0.5, "Yes", "No") != Default[-train3, ]$default)
# Print all errors
error1; error2; error3
## [1] 0.0238
## [1] 0.0264
## [1] 0.0256
set.seed(5)
train4 <- sample(1:nrow(Default), nrow(Default) / 2)
model6 <- glm(default ~ income + balance + student, data = Default[train4, ], family = "binomial")
pred4 <- predict(model6, newdata = Default[-train4, ], type = "response")
error4 <- mean(ifelse(pred4 > 0.5, "Yes", "No") != Default[-train4, ]$default)
# Compare with previous errors
error4
## [1] 0.029
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coeffcients in two diferent ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
set.seed(1)
# Fit logistic regression model
model <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(model)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Define the bootstrap function
boot.fn <- function(data, index) {
model <- glm(default ~ income + balance, data = data[index, ], family = "binomial")
return(coef(model)[2:3]) # return only income and balance coefficients
}
library(boot)
set.seed(1)
# Run bootstrap with 1000 replications
bootResults <- boot(data = Default, statistic = boot.fn, R = 1000)
# Display bootstrap standard errors
bootResults
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04
apply(bootResults$t, 2, sd)
## [1] 4.866284e-06 2.298949e-04
The bootstrap estimates of the standard errors are very similar to those obtained from the glm() function. Specifically, for both the income and balance coefficients, the differences in standard errors are very small, indicating that the parametric assumptions made by glm() are reasonable in this case. This agreement gives us more confidence in the reliability of the model’s inference.
In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classifcation problems, the LOOCV error is given in (5.4).
data("Weekly")
# Fit logistic regression model
modelA <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = "binomial")
summary(modelA)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = "binomial", data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22122 0.06147 3.599 0.000319 ***
## Lag1 -0.03872 0.02622 -1.477 0.139672
## Lag2 0.06025 0.02655 2.270 0.023232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1488.2 on 1086 degrees of freedom
## AIC: 1494.2
##
## Number of Fisher Scoring iterations: 4
modelB <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = "binomial")
# Predict using the model from (B) on the first observation
probC <- predict(modelB, newdata = Weekly[1, ], type = "response")
predC <- ifelse(probC > 0.5, "Up", "Down")
# Check if it was correctly classified
actualC <- Weekly$Direction[1]
predC == actualC # TRUE if correct, FALSE if not
## [1] FALSE
Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.
Compute the posterior probability of the market moving up for the ith observation.
Use the posterior probability for the ith observation in order to predict whether or not the market moves up.
Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.
# Initialize vector to store errors
n <- nrow(Weekly)
errors <- rep(NA, n) # or errors <- numeric(n)
# LOOCV loop
for (i in 1:n) {
# Fit model on all but the i-th observation
modelLOO <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = "binomial")
# Predict the i-th observation
prob <- predict(modelLOO, newdata = Weekly[i, , drop = FALSE], type = "response")
pred <- ifelse(prob > 0.5, "Up", "Down")
# Store 1 if misclassified, 0 if correct
actual <- Weekly$Direction[i]
errors[i] <- ifelse(pred != actual, 1, 0)
}
# View first few errors
head(errors)
## [1] 1 1 0 1 0 1
# LOOCV error estimate
loocv_error <- mean(errors)
loocv_error
## [1] 0.4499541
The LOOCV estimate for the test error rate is approximately 45%, which indicates that the logistic regression model using only Lag1 and Lag2 correctly classifies the market direction about 55% of the time. Since this is only slightly better than random guessing (which would be 50% for a binary classification problem), it suggests that Lag1 and Lag2 have limited predictive power for forecasting market direction in the Weekly dataset. To improve predictive accuracy, it may be helpful to explore additional lag variables or alternative modeling approaches.
We will now consider the Boston housing data set, from the ISLR2 library.
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.3
##
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
data("Boston")
# Estimate mean
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281
se_mean <- sd(Boston$medv) / sqrt(nrow(Boston))
se_mean
## [1] 0.4088611
This value tells us that, on average, the sample mean of medv (median home value in Boston census tracts) would vary by about $0.41K from one random sample to another, assuming repeated sampling from the same population. This SE is essential for constructing confidence intervals and assessing the precision of the estimated mean.
# Define bootstrap function
boot.fn.mean <- function(data, index) {
mean(data[index])
}
# Bootstrap with 1000 replications
set.seed(1)
boot.mean <- boot(Boston$medv, statistic = boot.fn.mean, R = 1000)
# View bootstrap SE
boot.mean
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.mean, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
# Extract bootstrap standard error
boot_se_mean <- sd(boot.mean$t)
boot_se_mean
## [1] 0.4106622
The standard error estimates from both methods are very close — differing by less than 0.002. This small difference suggests that:
The normal-theory (formula-based) approach is reliable in this case.
The bootstrap provides a nearly identical estimate without assuming normality of the sampling distribution.
This agreement reinforces confidence in the accuracy and stability of the sample mean as an estimator for the population mean.
# 95% CI using bootstrap percentiles
quantile(boot.mean$t, c(0.025, 0.975))
## 2.5% 97.5%
## 21.76333 23.37662
# Compare
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
mu_hat_med <- median(Boston$medv)
mu_hat_med
## [1] 21.2
# Define bootstrap function for median
boot.fn.median <- function(data, index) {
median(data[index])
}
# Bootstrap
set.seed(1)
boot.median <- boot(Boston$medv, statistic = boot.fn.median, R = 1000)
# Bootstrap SE of the median
boot_se_median <- sd(boot.median$t)
boot_se_median
## [1] 0.3778075
Your result indicates that the sample median home value (medv) varies by about $0.38K across different resamples from the population. This level of uncertainty is slightly less than the standard error of the mean you obtained earlier (~0.41), which is expected because the median is less sensitive to outliers, though also slightly less efficient (i.e., higher variance) under normality. So overall, the bootstrap has successfully provided a reasonable estimate of variability for the median, which couldn’t have been obtained via traditional methods.
mu_hat_0.1 <- quantile(Boston$medv, 0.1)
mu_hat_0.1
## 10%
## 12.75
# Define bootstrap function for 10th percentile
boot.fn.p10 <- function(data, index) {
quantile(data[index], 0.1)
}
# Bootstrap
set.seed(1)
boot.p10 <- boot(Boston$medv, statistic = boot.fn.p10, R = 1000)
# Bootstrap SE
boot_se_p10 <- sd(boot.p10$t)
boot_se_p10
## [1] 0.4767526
Since the 10th percentile represents the lower end of the distribution, it’s often more sensitive to data variability and outliers compared to the mean or median. The standard error of ~0.48 tells us how much the 10th percentile of median home values would fluctuate from sample to sample.
This result makes sense in the context of real estate data — lower-end home values can vary more across neighborhoods, leading to slightly higher variability.
Once again, the bootstrap method proves useful here, as there’s no closed-form expression for the standard error of a percentile.