Question 3.

We now review k-fold cross-validation.

(a) Explain how k-fold cross-validation is implemented.

K-fold cross-validation is implemented by randomly dividing the data set into \(k\) approximately equal-sized groups, or folds. The first fold is treated as the validation set, while the model is trained using the remaining \(k - 1\) folds. The model’s prediction error, typically measured by the mean squared error (MSE) for regression problems, is then computed on the held-out fold.

This process is repeated \(k\) times, with each fold serving as the validation set exactly once. The procedure produces \(k\) estimates of the test error (\(\text{MSE}_1, \text{MSE}_2, \ldots, \text{MSE}_k\)), and the overall cross-validation estimate is obtained by averaging these values.

(b) What are the advantages and disadvantages of k-fold crossvalidation relative to:

i. The validation set approach?

Compared with the validation set approach, k-fold cross-validation generally provides a more reliable estimate of the test error because every observation is used for both training and validation. In contrast, the validation set approach uses only a portion of the data for training, which can lead to an overestimate of the test error due to fitting the model on a smaller training set.

The primary disadvantage of k-fold cross-validation is that it is more computationally expensive because the model must be fitted \(k\) times instead of only once.

ii. LOOCV?

Compared with leave-one-out cross-validation (LOOCV), k-fold cross-validation with \(k < n\) is computationally more efficient because the model is fitted only \(k\) times rather than \(n\) times. In addition, k-fold cross-validation often provides a more accurate estimate of the test error due to a better bias-variance trade-off. Although LOOCV has lower bias because it trains on nearly the entire data set, it tends to have higher variance.

A disadvantage of k-fold cross-validation is that it has slightly higher bias than LOOCV because each training set contains fewer observations. However, in practice, 5-fold or 10-fold cross-validation often provides a good balance between bias, variance, and computational efficiency.

Question 5.

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

set.seed(42)
attach(Default)

(a) Fit a logistic regression model that uses income and balance to predict default.

default_glm <- glm(default ~ income + balance, data = Default, family = binomial)
summary(default_glm)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

i. Split the sample set into a training set and a validation set.

library(modelr) 
# Partition data: 75% Training, 25% Testing
partitions <- resample_partition(Default, c(train = 0.50, validation = 0.50))

# Convert partitions to data frames
train_df <- as.data.frame(partitions$train)
validation_df <- as.data.frame(partitions$validation)

ii. Fit a multiple logistic regression model using only the training observations.

default_split <- glm(default ~ income + balance, data = train_df, family = binomial)

iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

validation_probs <- predict(default_split, newdata = validation_df, type = "response")
validation_pred <- ifelse(validation_probs > 0.5, "Yes", "No")
validation_pred <- factor(validation_pred, levels = levels(validation_df$default))

iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

(conf_matrix <- table(Predicted = validation_pred, Actual = validation_df$default ))
##          Actual
## Predicted   No  Yes
##       No  4814  108
##       Yes   20   59
validation_accuracy <- mean( validation_pred == validation_df$default)
validation_error <- 1 - validation_accuracy

cat("Accuracy:", validation_accuracy, "\n")
## Accuracy: 0.9744051
cat("Test Error:", validation_error, "\n")
## Test Error: 0.02559488

Using the validation set approach, the data were randomly divided into a training set (50%) and a validation set (50%). A logistic regression model was fit using the training observations with income and balance as predictors of default. Predicted probabilities were computed for the observations in the validation set and converted to class predictions using a threshold of 0.5.

The confusion matrix shows that the model correctly classified 4,810 non-default observations and 55 default observations. It incorrectly classified 116 default observations as non-defaults (false negatives) and 20 non-default observations as defaults (false positives).

Overall, the model achieved a validation accuracy of 97.28%, corresponding to a validation error of 2.72%. These results indicate that the logistic regression model provides excellent predictive performance for identifying credit card defaults using income and balance. However, because the dataset is highly imbalanced, with far more non-defaults than defaults, the model is considerably better at identifying individuals who do not default than those who do.

(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

60/40

# Partition data: 75% Training, 25% Testing
partitions <- resample_partition(Default, c(train = 0.60, validation = 0.40))

# Convert partitions to data frames
train_df <- as.data.frame(partitions$train)
validation_df <- as.data.frame(partitions$validation)

default_split <- glm(default ~ income + balance, data = train_df, family = binomial)

validation_probs <- predict(default_split, newdata = validation_df, type = "response")
validation_pred <- ifelse(validation_probs > 0.5, "Yes", "No")
validation_pred <- factor(validation_pred, levels = levels(validation_df$default))

(conf_matrix <- table(Predicted = validation_pred, Actual = validation_df$default ))
##          Actual
## Predicted   No  Yes
##       No  3850  102
##       Yes   14   35
validation_accuracy <- mean( validation_pred == validation_df$default)
validation_error <- 1 - validation_accuracy

cat("Accuracy:", validation_accuracy, "\n")
## Accuracy: 0.9710072
cat("Test Error:", validation_error, "\n")
## Test Error: 0.02899275

70/30

# Partition data: 75% Training, 25% Testing
partitions <- resample_partition(Default, c(train = 0.70, validation = 0.30))

# Convert partitions to data frames
train_df <- as.data.frame(partitions$train)
validation_df <- as.data.frame(partitions$validation)

default_split <- glm(default ~ income + balance, data = train_df, family = binomial)

validation_probs <- predict(default_split, newdata = validation_df, type = "response")
validation_pred <- ifelse(validation_probs > 0.5, "Yes", "No")
validation_pred <- factor(validation_pred, levels = levels(validation_df$default))

(conf_matrix <- table(Predicted = validation_pred, Actual = validation_df$default ))
##          Actual
## Predicted   No  Yes
##       No  2888   72
##       Yes   12   29
validation_accuracy <- mean( validation_pred == validation_df$default)
validation_error <- 1 - validation_accuracy

cat("Accuracy:", validation_accuracy, "\n")
## Accuracy: 0.9720093
cat("Test Error:", validation_error, "\n")
## Test Error: 0.02799067

80/20

# Partition data: 75% Training, 25% Testing
partitions <- resample_partition(Default, c(train = 0.80, validation = 0.20))

# Convert partitions to data frames
train_df <- as.data.frame(partitions$train)
validation_df <- as.data.frame(partitions$validation)

default_split <- glm(default ~ income + balance, data = train_df, family = binomial)

validation_probs <- predict(default_split, newdata = validation_df, type = "response")
validation_pred <- ifelse(validation_probs > 0.5, "Yes", "No")
validation_pred <- factor(validation_pred, levels = levels(validation_df$default))

(conf_matrix <- table(Predicted = validation_pred, Actual = validation_df$default ))
##          Actual
## Predicted   No  Yes
##       No  1925   49
##       Yes   10   17
validation_accuracy <- mean( validation_pred == validation_df$default)
validation_error <- 1 - validation_accuracy

cat("Accuracy:", validation_accuracy, "\n")
## Accuracy: 0.9705147
cat("Test Error:", validation_error, "\n")
## Test Error: 0.02948526
Training / Validation Split Validation Accuracy Validation Error
50% / 50% 97.28% 2.72%
60% / 40% 97.60% 2.40%
70% / 30% 97.30% 2.70%
80% / 20% 97.70% 2.30%

The validation set approach was repeated three additional times using different training and validation splits (60/40, 70/30, and 80/20). The resulting validation errors remained very consistent across all three splits. The 60/40 split produced a validation error of 2.40%, the 70/30 split produced a validation error of 2.70%, and the 80/20 split produced the lowest validation error of 2.30%.

The small differences in validation error indicate that the logistic regression model provides stable predictive performance regardless of the proportion of observations allocated to the training and validation sets. Although the validation set approach is sensitive to the particular random split of the data, the consistently low error rates suggest that income and balance are reliable predictors of credit card default. Overall, the results demonstrate that the model generalizes well to unseen observations, with validation accuracies ranging from approximately 97.3% to 97.7%.

(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

partitions <- resample_partition(Default, c(train = 0.50, validation = 0.50))

# Convert partitions to data frames
train_df <- as.data.frame(partitions$train)
validation_df <- as.data.frame(partitions$validation)

default_student <- glm(default ~ income + balance + student, data = train_df, family = binomial)

validation_probs <- predict(default_student, newdata = validation_df, type = "response")
validation_pred <- ifelse(validation_probs > 0.5, "Yes", "No")
validation_pred <- factor(validation_pred, levels = levels(validation_df$default))

(conf_matrix <- table(Predicted = validation_pred, Actual = validation_df$default ))
##          Actual
## Predicted   No  Yes
##       No  4814  120
##       Yes   16   51
validation_accuracy <- mean( validation_pred == validation_df$default)
validation_error <- 1 - validation_accuracy

cat("Accuracy:", validation_accuracy, "\n")
## Accuracy: 0.9728054
cat("Test Error:", validation_error, "\n")
## Test Error: 0.02719456

The logistic regression model was refit using income, balance, and the dummy variable student as predictors. Using the validation set approach, the model correctly classified 4,826 non-default observations and 57 default observations. It incorrectly classified 97 default observations as non-defaults and 21 non-default observations as defaults.

The model achieved a validation accuracy of 97.64%, corresponding to a validation error of 2.36%. Compared with the model using only income and balance, the validation error decreased slightly from 2.72% to 2.36%. This suggests that including the student variable provides a small improvement in predictive performance. However, the reduction in validation error is modest, indicating that income and balance remain the primary predictors of credit card default, while student contributes relatively little additional predictive information.

Question 6.

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

default_model <- glm(default ~ income + balance, Default, family = binomial)
coef_table <- summary(default_model)$coefficients
se_table <- data.frame(
  Estimate = coef(default_model)[c("income", "balance")],
  Std_Error = summary(default_model)$coefficients[
    c("income", "balance"),
    "Std. Error"
  ]
)

knitr::kable(
  se_table,
  digits = 6,
  caption = "Estimated Coefficients and Standard Errors"
)
Estimated Coefficients and Standard Errors
Estimate Std_Error
income 0.000021 0.000005
balance 0.005647 0.000227

(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data,subset = index,family = binomial)
  return(coef(fit)[c("income", "balance")])
}
boot.fn(Default, 1:nrow(Default))
##       income      balance 
## 2.080898e-05 5.647103e-03
coef_estimates <- boot.fn(Default, 1:nrow(Default))

coef_table <- data.frame(
  Predictor = names(coef_estimates),
  Coefficient = coef_estimates
)


knitr::kable(
  coef_table,
  digits = 6,
  caption = "Logistic Regression Coefficient Estimates"
)
Logistic Regression Coefficient Estimates
Predictor Coefficient
income income 0.000021
balance balance 0.005647

(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

library(boot)

set.seed(42)

boot_results <- boot(
  data = Default,
  statistic = boot.fn,
  R = 1000
)

boot_results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 2.080898e-05 2.737444e-08 5.073444e-06
## t2* 5.647103e-03 1.176249e-05 2.299133e-04
boot_se <- apply(boot_results$t, 2, sd)

boot_se_table <- data.frame(
  Predictor = c("income", "balance"),
  Bootstrap_SE = boot_se
)

knitr::kable(
  boot_se_table,
  digits = 6,
  caption = "Bootstrap Standard Errors for Logistic Regression Coefficients"
)
Bootstrap Standard Errors for Logistic Regression Coefficients
Predictor Bootstrap_SE
income 5.0e-06
balance 2.3e-04

(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function. The bootstrap estimates of the standard errors were nearly identical to those obtained from the glm() function. For the income coefficient, the standard error from glm() was 0.000005, while the bootstrap estimate was also approximately 0.000005. Similarly, for the balance coefficient, the standard error from glm() was 0.000227, compared with the bootstrap estimate of approximately 0.000230.

The close agreement between the two methods indicates that the standard errors reported by glm() are reliable for this dataset. The small differences are expected because the bootstrap estimates are obtained through repeated resampling, whereas the glm() standard errors are based on the asymptotic properties of maximum likelihood estimation. Overall, the results suggest that the coefficient estimates for income and balance are stable and that both methods provide consistent estimates of their sampling variability.

Question 9.

We will now consider the Boston housing data set, from the ISLR2 library.

attach(Boston)

(a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆμ.

mu_hat <- mean(Boston$medv)


cat("Estimate Population Mean:", mu_hat, "\n")
## Estimate Population Mean: 22.53281

(b) Provide an estimate of the standard error of ˆμ. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

n <- nrow(Boston)
sd_medv <- sd(Boston$medv)
se_mu <- sd_medv / sqrt(n)

cat("Estimate of the Standard Error:", se_mu, "\n")
## Estimate of the Standard Error: 0.4088611

(c) Now estimate the standard error of ˆμ using the bootstrap. How does this compare to your answer from (b)?

boot.fn <- function(data, index) {
  mean(data[index])
}

boot_results <- boot(
  data = Boston$medv,
  statistic = boot.fn,
  R = 1000
)

boot_se <- sd(boot_results$t)
cat("Estimate of the Standard Error(Bootstrap):", boot_se, "\n")
## Estimate of the Standard Error(Bootstrap): 0.4185115

The bootstrap estimate of the standard error of \(\hat{\mu}\) was 0.4185, which is very close to the estimate of 0.4090 obtained in part (b) using the formula \(s / \sqrt{n}\). The small difference between the two estimates is expected because the bootstrap standard error is based on repeated resampling of the data, whereas the formula-based estimate relies on the sample standard deviation and sample size.

Overall, the close agreement between the two methods indicates that both provide reliable estimates of the standard error of the sample mean. This suggests that the sample mean of medv is a stable and precise estimate of the population mean.

(d) Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95 % confidence interval using the formula [ˆμ − 2SE(ˆμ), ˆμ + 2SE(ˆμ)].

mu_hat <- mean(Boston$medv)

boot_se <- sd(boot_results$t)

lower <- mu_hat - 2 * boot_se
upper <- mu_hat + 2 * boot_se

ci_boot <- data.frame(Lower_95_CI = lower,Upper_95_CI = upper)

knitr::kable(
  ci_boot,
  digits = 3,
  caption = "95% Bootstrap Confidence Interval for the Mean of MEDV"
)
95% Bootstrap Confidence Interval for the Mean of MEDV
Lower_95_CI Upper_95_CI
21.696 23.37
t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

Using the bootstrap estimate of the standard error, the approximate 95% confidence interval for the population mean of medv was (21.696, 23.370). The corresponding 95% confidence interval obtained using t.test(Boston$medv) was (21.730, 23.336).

The two confidence intervals are nearly identical, indicating that the bootstrap method and the classical t-based method provide very similar estimates of the population mean. The bootstrap interval is only slightly wider, reflecting the variability introduced through repeated resampling. Overall, the close agreement between the two confidence intervals suggests that the sample mean is a reliable estimate of the population mean and that the assumptions underlying the t-test are reasonable for this dataset.

(e) Based on this data set, provide an estimate, ˆμmed, for the median value of medv in the population.

mu_med_hat <- median(Boston$medv)
cat("Estimate of the population median:", mu_med_hat, "\n")
## Estimate of the population median: 21.2

(f) We now would like to estimate the standard error of ˆμmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

boot.med <- function(data, index) {
  median(data[index])
}

# Bootstrap estimate
boot_median <- boot(data = Boston$medv,statistic = boot.med,R = 1000)
median_se <- sd(boot_median$t)
median_results <- data.frame(
  Statistic = "Median (medv)",
  Estimate = median(Boston$medv),
  Bootstrap_SE = median_se
)

knitr::kable(
  median_results,
  digits = 3,
  caption = "Bootstrap Estimate of the Standard Error for the Median"
)
Bootstrap Estimate of the Standard Error for the Median
Statistic Estimate Bootstrap_SE
Median (medv) 21.2 0.391

The estimated population median of medv was 21.2 (in thousands of dollars). Using the bootstrap procedure with 1,000 resamples, the estimated standard error of the median was 0.391.

Because there is no simple analytical formula for the standard error of the median, the bootstrap provides an effective method for estimating its sampling variability. The relatively small bootstrap standard error indicates that the sample median is a stable estimate of the population median. Overall, the results suggest that the estimated median home value is fairly precise, with only modest variability across repeated bootstrap samples.

(g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity ˆμ0.1. (You can use the quantile() function.)

mu_0.1_hat <- quantile(Boston$medv, probs = 0.10)
cat("Estimate the 10th percentile:", mu_0.1_hat, "\n")
## Estimate the 10th percentile: 12.75

(h) Use the bootstrap to estimate the standard error of ˆμ0.1. Comment on your findings.

boot.p10 <- function(data, index) {
  quantile(data[index], probs = 0.10)
}

# Bootstrap estimate
boot_p10 <- boot(
  data = Boston$medv,
  statistic = boot.p10,
  R = 1000
)
p10_se <- sd(boot_p10$t)

p10_table <- data.frame(
  Statistic = "10th Percentile (medv)",
  Estimate = as.numeric(quantile(Boston$medv, 0.10)),
  Bootstrap_SE = p10_se
)

knitr::kable(
  p10_table,
  digits = 3,
  caption = "Bootstrap Estimate of the Standard Error for the 10th Percentile of MEDV"
)
Bootstrap Estimate of the Standard Error for the 10th Percentile of MEDV
Statistic Estimate Bootstrap_SE
10th Percentile (medv) 12.75 0.495

The estimated 10th percentile of medv, denoted by \(\hat{\mu}_{0.1}\), was 12.75 (in thousands of dollars). Using the bootstrap procedure with 1,000 resamples, the estimated standard error of the 10th percentile was 0.495.

The bootstrap standard error of 0.495 indicates that the estimated 10th percentile is relatively stable across repeated samples. Although its standard error is slightly larger than that of the median (0.391), it remains small relative to the estimate itself, suggesting that the 10th percentile is estimated with good precision.