(a)
Explain how k-fold cross-validation is implemented.
Response:
K-fold cross-validation is a resampling technique used to evaluate a model’s generalization performance. It works as follows:
Data Splitting: The dataset is randomly partitioned into k equally sized (or nearly equal) subsets called folds.
Iterative Training and Validation: The cross-validation process runs k times. In each iteration:
One fold is held out as the validation set.
The remaining k − 1 folds are combined to form the training set.
The model is trained on the training set and evaluated on the validation set.
Performance Aggregation: A performance metric (e.g., accuracy, F1 score, RMSE) is computed for each of the k validation folds.
Final Estimate: The k scores are averaged to provide an overall estimate of the model’s performance.
This method gives a more reliable estimate of test error than a single train-test split, because every observation is used for validation exactly once, and it helps assess how well the model generalizes to independent data, making it useful for detecting overfitting and for model selection.
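To make the procedure concrete, here is a minimal base-R sketch of manual k-fold cross-validation. It uses the built-in mtcars data and a simple linear model purely for illustration; none of these objects come from the exercises below.

# Minimal manual k-fold cross-validation (illustrative sketch)
set.seed(1)
k <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))  # randomly assign each row to one of k folds
cv_errors <- numeric(k)
for (j in 1:k) {
  train <- mtcars[folds != j, ]            # k - 1 folds form the training set
  test  <- mtcars[folds == j, ]            # the held-out fold is the validation set
  fit <- lm(mpg ~ wt, data = train)
  cv_errors[j] <- mean((test$mpg - predict(fit, test))^2)  # per-fold MSE
}
mean(cv_errors)                            # averaged k-fold CV estimate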
Applied Exercises:
Problem 5
In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
(a)
Fit a logistic regression model that uses income and balance to predict default.
Warning: package 'caret' was built under R version 4.4.2
Loading required package: ggplot2
Loading required package: lattice
# Load caret (for train() and createDataPartition()) and ISLR (for the Default data)
library(caret)
library(ISLR)

# Set random seed for reproducibility
set.seed(123)

# View the structure of the data
# str(Default)

# Fit a logistic regression model using income and balance on the full data set
model_full <- train(
  default ~ income + balance,
  data = Default,
  method = "glm",
  family = "binomial"
)

# View the model summary
summary(model_full)
Call:
NULL
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1579.0 on 9997 degrees of freedom
AIC: 1585
Number of Fisher Scoring iterations: 8
(b)
Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
# i. Split the sample into a training set and a validation set
train_index <- createDataPartition(Default$default, p = 0.7, list = FALSE)
train_data <- Default[train_index, ]
validation_data <- Default[-train_index, ]

# ii. Fit a multiple logistic regression model using only the training observations
model_fit <- train(
  default ~ income + balance,
  data = train_data,
  method = "glm",
  family = "binomial"
)

# iii. Obtain posterior probabilities and classify as 'Yes' if > 0.5
pred_probs <- predict(model_fit, newdata = validation_data, type = "prob")
pred_class <- ifelse(pred_probs[, "Yes"] > 0.5, "Yes", "No")

# iv. Compute the validation set error (fraction of observations misclassified)
actual <- validation_data$default
misclassified <- mean(pred_class != actual)
misclassified
[1] 0.02200734
Extra
I wanted to see the output in a confusion matrix format.
# Generate confusion matrix using caret
conf_matrix <- confusionMatrix(
  factor(pred_class, levels = c("No", "Yes")),
  factor(actual, levels = c("No", "Yes"))
)

# Print the confusion matrix
conf_matrix
Confusion Matrix and Statistics
          Reference
Prediction   No  Yes
       No  2887   53
       Yes   13   46
Accuracy : 0.978
95% CI : (0.9721, 0.9829)
No Information Rate : 0.967
P-Value [Acc > NIR] : 0.0002241
Kappa : 0.5717
Mcnemar's Test P-Value : 1.582e-06
Sensitivity : 0.9955
Specificity : 0.4646
Pos Pred Value : 0.9820
Neg Pred Value : 0.7797
Prevalence : 0.9670
Detection Rate : 0.9627
Detection Prevalence : 0.9803
Balanced Accuracy : 0.7301
'Positive' Class : No
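The validation error from (b) can be recovered directly from this matrix: the off-diagonal cells are the misclassified observations.

# Misclassification rate = off-diagonal counts / total observations
(53 + 13) / (2887 + 53 + 13 + 46)  # = 0.02200734, matching the error in (b)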
(c)
Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.
# Set the seed once; each loop iteration then draws a different random split
set.seed(123)

# Initialize vector to store test errors
errors <- numeric(3)

for (i in 1:3) {
  # Split the data (a new random split on each iteration)
  train_index <- createDataPartition(Default$default, p = 0.7, list = FALSE)
  train_data <- Default[train_index, ]
  validation_data <- Default[-train_index, ]

  # Fit logistic regression model on the training observations
  model_fit <- train(
    default ~ income + balance,
    data = train_data,
    method = "glm",
    family = "binomial"
  )

  # Predict on the validation set
  pred_probs <- predict(model_fit, newdata = validation_data, type = "prob")
  pred_class <- ifelse(pred_probs[, "Yes"] > 0.5, "Yes", "No")

  # Compute test error (misclassification rate)
  actual <- validation_data$default
  errors[i] <- mean(pred_class != actual)
}

# Show test errors for each run
errors
[1] 0.02534178 0.02700900 0.02467489
mean_error <- mean(errors)
mean_error
[1] 0.02567523
Response:
The average test error across the three trials is approximately 2.57%, indicating a relatively low misclassification rate. The consistency across the three estimates suggests that the model’s performance is stable and not highly sensitive to the specific training-validation split. This reinforces the predictive strength of the variables income and especially balance, which is known to be a strong predictor of default in this dataset.
(d)
Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
# Set seed for reproducibility
set.seed(123)

# Repeat the validation set approach 3 times
errors_with_student <- numeric(3)

for (i in 1:3) {
  # Split the data into training and validation sets
  train_index <- createDataPartition(Default$default, p = 0.7, list = FALSE)
  train_data <- Default[train_index, ]
  validation_data <- Default[-train_index, ]

  # Fit logistic regression including 'student' as a predictor
  model_fit <- train(
    default ~ income + balance + student,
    data = train_data,
    method = "glm",
    family = "binomial"
  )

  # Predict on the validation set
  pred_probs <- predict(model_fit, newdata = validation_data, type = "prob")
  pred_class <- ifelse(pred_probs[, "Yes"] > 0.5, "Yes", "No")

  # Compute test error (fraction misclassified)
  actual <- validation_data$default
  errors_with_student[i] <- mean(pred_class != actual)
}

# Print the results
errors_with_student
[1] 0.02634211 0.02734245 0.02600867
mean(errors_with_student)
[1] 0.02656441
Response:
The average validation set error for the logistic regression model including income, balance, and the student dummy variable is 2.66%, compared to 2.57% for the model without the student variable.
This result suggests that including the student variable did not reduce test error, and in fact resulted in a slightly higher misclassification rate. This implies that student may not provide meaningful additional predictive value beyond what is already captured by income and balance. Therefore, for this dataset and model, the inclusion of student does not improve generalization performance and may be unnecessary for prediction.
Problem 6
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
(a)
Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
# Set seed for reproducibility
set.seed(123)

# Fit logistic regression model using income and balance
logit_model <- glm(default ~ income + balance, data = Default, family = "binomial")

# Display the coefficient table to extract the standard errors
summary(logit_model)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
(b)
Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
# Define boot.fn: fit the model on the observations selected by 'index'
boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data, subset = index,
             family = "binomial")
  # Return coefficient estimates for income and balance
  return(coef(fit)[c("income", "balance")])
}

# Example: call boot.fn on a single bootstrap sample
set.seed(123)
boot.fn(Default, sample(1:nrow(Default), nrow(Default), replace = TRUE))
income balance
1.803436e-05 5.448714e-03
(c)
Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.
# Load the boot library
library(boot)
Attaching package: 'boot'
The following object is masked from 'package:lattice':
melanoma
# Set seed for reproducibility
set.seed(123)

# Perform bootstrap with 1000 replications
boot_results <- boot(data = Default, statistic = boot.fn, R = 1000)

# View bootstrap estimates and standard errors
boot_results
(d)
Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.
Response:
Comparison of standard errors for the logistic regression coefficients:

Coefficient   SE (glm)       SE (bootstrap)
income        4.985 × 10⁻⁶   4.730 × 10⁻⁶
balance       2.274 × 10⁻⁴   2.217 × 10⁻⁴
Comment
Both bootstrap standard errors closely match those reported by glm(): roughly 4.73 × 10⁻⁶ versus 4.99 × 10⁻⁶ for income, and 2.217 × 10⁻⁴ versus 2.274 × 10⁻⁴ for balance. This agreement suggests that the distributional assumptions underlying the formula-based glm() standard errors are reasonable for this data set.
Bootstrap estimates offer a data-driven way to assess variability and can be more robust when those model assumptions are questionable, since they do not depend on the formula-based approximations used by glm().
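To reproduce this comparison directly, here is a short sketch that assumes the logit_model object from (a) and the boot_results object from (c) are still in the workspace:

# glm() formula-based standard errors for income and balance
se_glm <- summary(logit_model)$coefficients[c("income", "balance"), "Std. Error"]

# Bootstrap standard errors: the SD of each column of replicates in boot_results$t
se_boot <- apply(boot_results$t, 2, sd)

# Side-by-side comparison
cbind(glm = se_glm, bootstrap = se_boot)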
Problem 9
We will now consider the Boston housing data set, from the ISLR2 library.
(a)
Based on this data set, provide an estimate for the population mean of medv. Call this estimate μ̂.
# Load the ISLR2 package
library(ISLR2)
Warning: package 'ISLR2' was built under R version 4.4.3
Attaching package: 'ISLR2'
The following objects are masked from 'package:ISLR':
Auto, Credit
# (a) Estimate for the population mean of medv
mu_hat <- mean(Boston$medv)
mu_hat
[1] 22.53281
(b)
Provide an estimate of the standard error of μ̂. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.
# (b) Standard error of the sample mean: sd / sqrt(n)
sample_sd <- sd(Boston$medv)
n <- nrow(Boston)
se_mu_hat <- sample_sd / sqrt(n)
se_mu_hat
[1] 0.4088611
Response:
The estimated standard error of μ̂ is about 0.41: across repeated samples of this size, the sample mean of medv would typically deviate from the population mean by roughly 0.41, so μ̂ ≈ 22.53 is a fairly precise estimate.
(c)
Now estimate the standard error of μ̂ using the bootstrap. How does this compare to your answer from (b)?
# Define bootstrap function for the mean
boot.fn <- function(data, index) {
  mean(data[index])
}

# Perform bootstrap with 1000 replications
set.seed(123)
boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000)
boot_results

Response:
The bootstrap standard error of μ̂ (about 0.40, consistent with the interval width in (d)) is nearly identical to the formula-based estimate of 0.409 from (b), as expected for a statistic like the sample mean whose standard error has a simple closed form.
(d)
Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95% confidence interval using the formula [μ̂ − 2·SE(μ̂), μ̂ + 2·SE(μ̂)].
# Extract values from the bootstrap
mu_hat <- mean(Boston$medv)
se_boot <- sd(boot_results$t)   # SD of the bootstrap replicates estimates SE(mu_hat)

# Approximate 95% CI using the bootstrap SE
ci_bootstrap <- c(mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)
ci_bootstrap
[1] 21.72369 23.34192
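For the comparison requested in the prompt (not run in the original output), the t-based interval can be obtained directly; it comes out to approximately (21.73, 23.34), essentially identical to the bootstrap interval:

# t-based 95% confidence interval for the mean of medv
t.test(Boston$medv)$conf.int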
(e)
Based on this data set, provide an estimate, μ̂_med, for the median value of medv in the population.
# Estimate the population median of medv
mu_med_hat <- median(Boston$medv)
mu_med_hat
[1] 21.2
(f)
We now would like to estimate the standard error of μ̂_med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.
# Define bootstrap function to return the sample median
boot.fn.median <- function(data, index) {
  median(data[index])
}

# Set seed for reproducibility
set.seed(123)

# Perform bootstrap with 1000 replications
boot_median <- boot(data = Boston$medv, statistic = boot.fn.median, R = 1000)

# Display bootstrap results
boot_median
Response:
The bootstrap results for the sample median of medv are:
Sample median (μ̂_med): 21.2
Bootstrap bias: −0.020
Bootstrap standard error: 0.368
Interpretation
Using 1,000 bootstrap replications, we estimate the standard error of the median home value to be approximately 0.368. This tells us how much the median medv would vary across repeated samples from the population.
The small bias (−0.0203) suggests that the bootstrap median estimates are centered closely around the original sample median.
This standard error reflects the variability of a robust statistic (the median), and supports the conclusion that 21.2 is a stable and reasonable estimate of the population median for medv in the Boston dataset.
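As an optional follow-up (not part of the original output), the same bootstrap replicates can be turned into a percentile confidence interval with boot.ci(), assuming the boot_median object from the chunk above:

# Percentile-based 95% CI for the median from the bootstrap replicates
boot.ci(boot_median, type = "perc")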
(g)
Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity μ̂_0.1. (You can use the quantile() function.)
# Estimate the 10th percentile of medv
mu_0.1_hat <- quantile(Boston$medv, probs = 0.1)
mu_0.1_hat
10%
12.75
(h)
Use the bootstrap to estimate the standard error of μ̂_0.1. Comment on your findings.
# Define bootstrap function to return the 10th percentile
boot.fn.p10 <- function(data, index) {
  quantile(data[index], probs = 0.1)
}

# Perform bootstrap with 1000 replications
set.seed(123)
boot_p10 <- boot(data = Boston$medv, statistic = boot.fn.p10, R = 1000)

# View bootstrap results
boot_p10
Response:
Using 1,000 bootstrap samples, we estimate the 10th percentile of medv (μ̂_0.1) to be 12.75. The bootstrap output shows:
Bootstrap Bias: −0.012
Bootstrap Standard Error: 0.528
Interpretation
This means that the estimated standard error for the 10th percentile is approximately 0.528, which reflects the typical variability we might expect in μ̂_0.1 across different random samples from the Boston housing population.
The very small bias (−0.012) suggests that the bootstrap distribution is well-centered around the original estimate, indicating that the sample-based 10th percentile is a stable and reliable estimate of the population 10th percentile.
The bootstrap once again proves useful for estimating standard errors when no closed-form solution exists, especially for quantiles and other non-linear statistics.