# Load necessary library and data
library(ISLR2)
set.seed(42)
# Fit logistic regression model
fit <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Split data into training and validation sets
train <- sample(nrow(Default), nrow(Default) / 2)
# Fit logistic regression model on training data
fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
# Predict on validation set
pred <- ifelse(predict(fit, newdata = Default[-train, ], type = "response") > 0.5, "Yes", "No")
# Compute validation set error
table(pred, Default$default[-train])
##
## pred No Yes
## No 4817 110
## Yes 20 53
validation_error <- mean(pred != Default$default[-train])
validation_error
## [1] 0.026
# Repeat the process 3 times using replicate
set.seed(42)
errors <- replicate(3, {
train <- sample(nrow(Default), nrow(Default) / 2)
fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
pred <- ifelse(predict(fit, newdata = Default[-train, ], type = "response") > 0.5, "Yes", "No")
mean(pred != Default$default[-train])
})
errors
## [1] 0.0260 0.0260 0.0294
# Logistic regression with student variable
errors_with_student <- replicate(3, {
train <- sample(nrow(Default), nrow(Default) / 2)
fit <- glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train)
pred <- ifelse(predict(fit, newdata = Default[-train, ], type = "response") > 0.5, "Yes", "No")
mean(pred != Default$default[-train])
})
errors_with_student
## [1] 0.0268 0.0278 0.0256
Including the student
variable does not significantly
reduce the test error compared to the previous model.
6. We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
# Fit logistic regression model
fit <- glm(default ~ income + balance, data = Default, family = "binomial")
# Summary to get standard errors
summary(fit)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
boot.fn <- function(data, index) {
fit <- glm(default ~ income + balance, data = data[index, ], family = "binomial")
return(coef(fit))
}
library(boot)
# Set seed for reproducibility
set.seed(42)
# Bootstrap to estimate standard errors
boot_results <- boot(Default, boot.fn, R = 1000)
boot_results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -2.292405e-02 4.435269e-01
## t2* 2.080898e-05 2.737444e-08 5.073444e-06
## t3* 5.647103e-03 1.176249e-05 2.299133e-04
The standard errors obtained via bootstrapping are comparable to
those provided by glm()
, validating the results.
7. In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4).
# Load Weekly data and fit logistic regression model
fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = "binomial")
summary(fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = "binomial", data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22122 0.06147 3.599 0.000319 ***
## Lag1 -0.03872 0.02622 -1.477 0.139672
## Lag2 0.06025 0.02655 2.270 0.023232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1488.2 on 1086 degrees of freedom
## AIC: 1494.2
##
## Number of Fisher Scoring iterations: 4
# Fit model using all but the first observation
fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = "binomial")
# Predict direction for the first observation
prediction <- predict(fit, newdata = Weekly[1, ], type = "response") > 0.5
prediction
## 1
## TRUE
# Initialize error vector
n <- nrow(Weekly)
errors <- rep(NA, n)
# Perform LOOCV using a for loop
for (i in 1:n) {
fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = "binomial")
prob <- predict(fit, newdata = Weekly[i, ], type = "response")
pred <- ifelse(prob > 0.5, "Up", "Down")
errors[i] <- (pred != Weekly$Direction[i])
}
# Compute LOOCV test error
loocv_error <- mean(errors)
loocv_error
## [1] 0.4499541
The LOOCV test error is approximately 45%, indicating that the model’s predictions are only marginally better than random guessing.
9. We will now consider the Boston housing data set, from the ISLR2 library.
# Load Boston dataset and estimate population mean
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281
# Standard error of the mean
se_mu <- sd(Boston$medv) / sqrt(length(Boston$medv))
se_mu
## [1] 0.4088611
# Bootstrap to estimate standard error
set.seed(42)
boot_mean <- boot(Boston$medv, function(data, index) mean(data[index]), R = 1000)
boot_mean
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = function(data, index) mean(data[index]),
## R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.02671186 0.4009216
# Standard error from bootstrap
boot_se <- sd(boot_mean$t)
boot_se
## [1] 0.4009216
The standard error obtained using the bootstrap is close to the one calculated manually, confirming the validity of both methods.
Using the bootstrap standard error obtained in part (c), we can
compute a 95% confidence interval for the population mean of
medv
as:
# Bootstrap standard error for the mean
se_boot <- sd(boot_mean$t)
# 95% confidence interval
ci_bootstrap <- c(mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)
ci_bootstrap
## [1] 21.73096 23.33465
To compare with the results obtained using t.test()
:
# Using t.test to get 95% confidence interval
ci_ttest <- t.test(Boston$medv)$conf.int
ci_ttest
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95
The confidence intervals obtained using both methods should be close, but the bootstrap method might slightly differ due to random sampling. Both methods should provide a reasonable range for the population mean.
# Estimate for the population median of medv
mu_med <- median(Boston$medv)
mu_med
## [1] 21.2
# Bootstrap to estimate the standard error of the median
set.seed(42)
boot_median <- boot(Boston$medv, function(v, i) median(v[i]), 10000)
# Standard error of the median
se_median <- sd(boot_median$t)
se_median
## [1] 0.3744634
The standard error of the median is typically lower than the standard error of the mean. In this case, the standard error of the median is approximately 0.374, which indicates a smaller variability for the median than for the mean.
# Estimate for the 10th percentile of medv
mu_0_1 <- quantile(Boston$medv, 0.1)
mu_0_1
## 10%
## 12.75
# Bootstrap to estimate the standard error of the 10th percentile
set.seed(42)
boot_percentile <- boot(Boston$medv, function(v, i) quantile(v[i], 0.1), 10000)
# Standard error of the 10th percentile
se_percentile <- sd(boot_percentile$t)
se_percentile
## [1] 0.497298
The standard error of the 10th percentile is higher than that of the
median. In this case, it is approximately 0.5, which suggests that there
is more variability in the lower tail of the medv
distribution.