Question 5

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

A. Fit a logistic regression model that uses income and balance to predict default.

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.3
set.seed(1)

# Fit logistic regression
model1 <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

B. Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

  1. Split the sample set into a training set and a validation set.

  2. Fit a multiple logistic regression model using only the training observations

  3. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

  4. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassifed.

# Split the data: 50% training, 50% validation
set.seed(1)
trainIndex <- sample(1:nrow(Default), nrow(Default) / 2)
trainSet <- Default[trainIndex, ]
validSet <- Default[-trainIndex, ]

# Fit model on training data
model2 <- glm(default ~ income + balance, data = trainSet, family = "binomial")

# Predict on validation set
probs <- predict(model2, newdata = validSet, type = "response")
predictions <- ifelse(probs > 0.5, "Yes", "No")

# Calculate validation error
actual <- validSet$default
mean(predictions != actual)
## [1] 0.0254

C. Repeat the process in (b) three times, using three diferent splits of the observations into a training set and a validation set. Comment on the results obtained.

set.seed(2)
# First split
train1 <- sample(1:nrow(Default), nrow(Default) / 2)
model3 <- glm(default ~ income + balance, data = Default[train1, ], family = "binomial")
pred1 <- predict(model3, newdata = Default[-train1, ], type = "response")
error1 <- mean(ifelse(pred1 > 0.5, "Yes", "No") != Default[-train1, ]$default)

set.seed(3)
# Second split
train2 <- sample(1:nrow(Default), nrow(Default) / 2)
model4 <- glm(default ~ income + balance, data = Default[train2, ], family = "binomial")
pred2 <- predict(model4, newdata = Default[-train2, ], type = "response")
error2 <- mean(ifelse(pred2 > 0.5, "Yes", "No") != Default[-train2, ]$default)

set.seed(4)
# Third split
train3 <- sample(1:nrow(Default), nrow(Default) / 2)
model5 <- glm(default ~ income + balance, data = Default[train3, ], family = "binomial")
pred3 <- predict(model5, newdata = Default[-train3, ], type = "response")
error3 <- mean(ifelse(pred3 > 0.5, "Yes", "No") != Default[-train3, ]$default)

# Print all errors
error1; error2; error3
## [1] 0.0238
## [1] 0.0264
## [1] 0.0256

D. Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

set.seed(5)
train4 <- sample(1:nrow(Default), nrow(Default) / 2)
model6 <- glm(default ~ income + balance + student, data = Default[train4, ], family = "binomial")
pred4 <- predict(model6, newdata = Default[-train4, ], type = "response")
error4 <- mean(ifelse(pred4 > 0.5, "Yes", "No") != Default[-train4, ]$default)

# Compare with previous errors
error4
## [1] 0.029

Question 6

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coeffcients in two diferent ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

A. Using the summary() and glm() functions, determine the estimated standard errors for the coeffcients associated with income and balance in a multiple logistic regression model that uses both predictors.

set.seed(1)

# Fit logistic regression model
model <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

B. Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coeffcient estimates for income and balance in the multiple logistic regression model.

# Define the bootstrap function
boot.fn <- function(data, index) {
  model <- glm(default ~ income + balance, data = data[index, ], family = "binomial")
  return(coef(model)[2:3])  # return only income and balance coefficients
}

C. Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coeffcients for income and balance.

library(boot)
set.seed(1)

# Run bootstrap with 1000 replications
bootResults <- boot(data = Default, statistic = boot.fn, R = 1000)

# Display bootstrap standard errors
bootResults
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04
apply(bootResults$t, 2, sd)
## [1] 4.866284e-06 2.298949e-04

D. Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

The bootstrap estimates of the standard errors are very similar to those obtained from the glm() function. Specifically, for both the income and balance coefficients, the differences in standard errors are very small, indicating that the parametric assumptions made by glm() are reasonable in this case. This agreement gives us more confidence in the reliability of the model’s inference.

Question 7

In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classifcation problems, the LOOCV error is given in (5.4).

A. Fit a logistic regression model that predicts Direction using Lag1 and Lag2.

data("Weekly")

# Fit logistic regression model
modelA <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = "binomial")
summary(modelA)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = "binomial", data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22122    0.06147   3.599 0.000319 ***
## Lag1        -0.03872    0.02622  -1.477 0.139672    
## Lag2         0.06025    0.02655   2.270 0.023232 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1488.2  on 1086  degrees of freedom
## AIC: 1494.2
## 
## Number of Fisher Scoring iterations: 4

B. Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the frst observation.

modelB <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = "binomial")

C. se the model from (b) to predict the direction of the frst observation. You can do this by predicting that the frst observation will go up if P(Direction = “Up”|Lag1, Lag2) > 0.5. Was this observation correctly classifed?

# Predict using the model from (B) on the first observation
probC <- predict(modelB, newdata = Weekly[1, ], type = "response")
predC <- ifelse(probC > 0.5, "Up", "Down")

# Check if it was correctly classified
actualC <- Weekly$Direction[1]
predC == actualC  # TRUE if correct, FALSE if not
## [1] FALSE

D. Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:

  1. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.

  2. Compute the posterior probability of the market moving up for the ith observation.

  3. Use the posterior probability for the ith observation in order to predict whether or not the market moves up.

  4. Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

# Initialize vector to store errors
n <- nrow(Weekly)
errors <- rep(NA, n)  # or errors <- numeric(n)

# LOOCV loop
for (i in 1:n) {
  # Fit model on all but the i-th observation
  modelLOO <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = "binomial")
  
  # Predict the i-th observation
  prob <- predict(modelLOO, newdata = Weekly[i, , drop = FALSE], type = "response")
  pred <- ifelse(prob > 0.5, "Up", "Down")
  
  # Store 1 if misclassified, 0 if correct
  actual <- Weekly$Direction[i]
  errors[i] <- ifelse(pred != actual, 1, 0)
}

# View first few errors
head(errors)
## [1] 1 1 0 1 0 1

E. Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results.

# LOOCV error estimate
loocv_error <- mean(errors)
loocv_error
## [1] 0.4499541

The LOOCV estimate for the test error rate is approximately 45%, which indicates that the logistic regression model using only Lag1 and Lag2 correctly classifies the market direction about 55% of the time. Since this is only slightly better than random guessing (which would be 50% for a binary classification problem), it suggests that Lag1 and Lag2 have limited predictive power for forecasting market direction in the Weekly dataset. To improve predictive accuracy, it may be helpful to explore additional lag variables or alternative modeling approaches.

Question 9

We will now consider the Boston housing data set, from the ISLR2 library.

A. Based on this data set, provide an estimate for the population mean of medv. Call this estimate µˆ.

library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.3
## 
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit
data("Boston")

# Estimate mean
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281

B. Provide an estimate of the standard error of µˆ. Interpret this result.

se_mean <- sd(Boston$medv) / sqrt(nrow(Boston))
se_mean
## [1] 0.4088611

This value tells us that, on average, the sample mean of medv (median home value in Boston census tracts) would vary by about $0.41K from one random sample to another, assuming repeated sampling from the same population. This SE is essential for constructing confidence intervals and assessing the precision of the estimated mean.

C. Now estimate the standard error of µˆ using the bootstrap. How does this compare to your answer from (b)?

# Define bootstrap function
boot.fn.mean <- function(data, index) {
  mean(data[index])
}

# Bootstrap with 1000 replications
set.seed(1)
boot.mean <- boot(Boston$medv, statistic = boot.fn.mean, R = 1000)

# View bootstrap SE
boot.mean
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.mean, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.007650791   0.4106622
# Extract bootstrap standard error
boot_se_mean <- sd(boot.mean$t)
boot_se_mean
## [1] 0.4106622

The standard error estimates from both methods are very close — differing by less than 0.002. This small difference suggests that:

  • The normal-theory (formula-based) approach is reliable in this case.

  • The bootstrap provides a nearly identical estimate without assuming normality of the sampling distribution.

This agreement reinforces confidence in the accuracy and stability of the sample mean as an estimator for the population mean.

D. Based on your bootstrap estimate from (c), provide a 95 % confdence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv).

# 95% CI using bootstrap percentiles
quantile(boot.mean$t, c(0.025, 0.975))
##     2.5%    97.5% 
## 21.76333 23.37662
# Compare
t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

E. Based on this data set, provide an estimate, µˆmed, for the median value of medv in the population.

mu_hat_med <- median(Boston$medv)
mu_hat_med
## [1] 21.2

F. We now would like to estimate the standard error of µˆmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your fndings

# Define bootstrap function for median
boot.fn.median <- function(data, index) {
  median(data[index])
}

# Bootstrap
set.seed(1)
boot.median <- boot(Boston$medv, statistic = boot.fn.median, R = 1000)

# Bootstrap SE of the median
boot_se_median <- sd(boot.median$t)
boot_se_median
## [1] 0.3778075

Your result indicates that the sample median home value (medv) varies by about $0.38K across different resamples from the population. This level of uncertainty is slightly less than the standard error of the mean you obtained earlier (~0.41), which is expected because the median is less sensitive to outliers, though also slightly less efficient (i.e., higher variance) under normality. So overall, the bootstrap has successfully provided a reasonable estimate of variability for the median, which couldn’t have been obtained via traditional methods.

G. Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity µˆ0.1. (You can use the quantile() function.)

mu_hat_0.1 <- quantile(Boston$medv, 0.1)
mu_hat_0.1
##   10% 
## 12.75

H. Use the bootstrap to estimate the standard error of µˆ0.1. Comment on your fndings.

# Define bootstrap function for 10th percentile
boot.fn.p10 <- function(data, index) {
  quantile(data[index], 0.1)
}

# Bootstrap
set.seed(1)
boot.p10 <- boot(Boston$medv, statistic = boot.fn.p10, R = 1000)

# Bootstrap SE
boot_se_p10 <- sd(boot.p10$t)
boot_se_p10
## [1] 0.4767526

Since the 10th percentile represents the lower end of the distribution, it’s often more sensitive to data variability and outliers compared to the mean or median. The standard error of ~0.48 tells us how much the 10th percentile of median home values would fluctuate from sample to sample.

This result makes sense in the context of real estate data — lower-end home values can vary more across neighborhoods, leading to slightly higher variability.

Once again, the bootstrap method proves useful here, as there’s no closed-form expression for the standard error of a percentile.