(3)

(a) Explain how k-fold cross-validation is implemented.

k-Fold Cross-Validation involves dividing the dataset into k equal-sized subsets or folds. A model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, ensuring each fold acts as the test set once. The final performance metric is obtained by averaging the results across all folds, providing a robust estimate of the model’s out-of-sample performance.
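
To make the procedure concrete, here is a minimal sketch of the loop in R for a regression setting; the data frame df and its response y are hypothetical placeholders, and MSE is used as the metric.

# Minimal k-fold CV sketch (hypothetical data frame `df` with response `y`)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))     # random fold assignment
cv_errors <- numeric(k)
for (j in 1:k) {
  fit <- lm(y ~ ., data = df[folds != j, ])          # train on the other k-1 folds
  pred <- predict(fit, newdata = df[folds == j, ])   # predict on the held-out fold
  cv_errors[j] <- mean((df$y[folds == j] - pred)^2)  # fold-level MSE
}
mean(cv_errors)                                      # the k-fold CV estimate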

The choice of performance metric is flexible and not limited to Mean Squared Error (MSE) or Accuracy. Other metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Area Under the Receiver Operating Characteristic Curve (AUROC), and Area Under the Precision-Recall Curve (AUPRC) can also be used if they are applicable to the model and dataset.

(b) What are the advantages and disadvantages of k-fold cross-validation compared to:

(i) The Validation Set Approach

Advantages:

  • k-Fold Cross-Validation reduces variability in performance estimation, especially for smaller datasets, making results more reliable.
  • It ensures that all data points are used for both training and testing, maximizing the utilization of the dataset.
  • The validation set approach may overestimate the test error, since a significant portion of the data is excluded from training; k-Fold Cross-Validation minimizes this issue.

Disadvantages:

  • The validation set approach is easier to understand and communicate, making it useful in industry settings where stakeholders require a straightforward explanation of model evaluation.
  • k-Fold Cross-Validation is computationally more expensive, as it requires training k models, whereas the validation set approach trains and tests a model only once.

(ii) Leave-One-Out Cross-Validation (LOOCV)

Advantages:

  • k-Fold Cross-Validation is computationally more efficient than LOOCV, particularly when dealing with large datasets. For example, with k = 10 and n = 10,000, k-Fold CV requires training 10 models, whereas LOOCV necessitates training 10,000 models.
  • From a bias-variance perspective, k-Fold Cross-Validation provides a better tradeoff. LOOCV has low bias but high variance, whereas k-Fold CV balances the two, potentially yielding a more accurate test error estimate.

Disadvantages:

  • k-Fold Cross-Validation introduces some randomness, as different splits can result in slightly different out-of-sample errors, whereas LOOCV does not suffer from this variability.
  • In certain cases, LOOCV can be computationally more efficient than k-Fold CV. In least squares regression, for instance, the LOOCV statistic can be computed from leverage values, requiring roughly the same computational cost as fitting a single model instead of n (see the sketch below).
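
As a sketch of that leverage shortcut for a least squares fit: the LOOCV error falls out of the residuals and leverage values of a single full fit. The medv ~ lstat model on the Boston data is used purely for illustration.

# LOOCV shortcut for least squares: CV(n) = (1/n) * sum(((y - yhat) / (1 - h))^2)
fit <- lm(medv ~ lstat, data = MASS::Boston)  # one least squares fit
h <- hatvalues(fit)                           # leverage value of each observation
mean((residuals(fit) / (1 - h))^2)            # n-fold CV error, no refitting needed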

# Load necessary libraries
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(boot)
## 
## Attaching package: 'boot'
## 
## The following object is masked from 'package:lattice':
## 
##     melanoma
library(MASS) # Boston data
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## The following object is masked from 'package:ISLR2':
## 
##     Boston
theme_set(theme_light())

(5)

# Set seed for reproducibility
set.seed(1)
data(Default)
glimpse(Default)
## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
## $ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…

(a) Predicting default with income & balance - Logistic Regression

# Fit logistic regression model
glm_fit <- glm(default ~ income + balance, family = binomial, data = Default)

# Display model summary
summary(glm_fit)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
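
With the fitted model, predicted default probabilities follow from predict() with type = "response"; the applicant values below are hypothetical, chosen only to illustrate the call.

# Predicted probability of default for a hypothetical applicant
predict(glm_fit, newdata = data.frame(income = 40000, balance = 1500), type = "response")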

(b) Validation Approach - Estimating test Error

(i) train/test Split
set.seed(123, sample.kind = "Rounding")
## Warning in set.seed(123, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
index <- createDataPartition(y = Default$default, p = 0.7, list = FALSE)

train <- Default[index, ]
test <- Default[-index, ]

nrow(train) / nrow(Default)
## [1] 0.7001
(ii) train - Fitting a Logistic Regression Model
glm_fit_2 <- glm(default ~ income + balance, data = train, family = "binomial")

summary(glm_fit_2)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.141e+01  5.099e-01 -22.379  < 2e-16 ***
## income       2.256e-05  5.898e-06   3.825 0.000131 ***
## balance      5.531e-03  2.651e-04  20.866  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1119.2  on 6998  degrees of freedom
## AIC: 1125.2
## 
## Number of Fisher Scoring iterations: 8
(iii) test Predictions
test_pred <- factor(ifelse(predict(glm_fit_2, newdata = test, type = "response") > 0.5, "Yes", "No"))
test_pred[1:10]
##  2  3  4 12 13 14 15 17 18 20 
## No No No No No No No No No No 
## Levels: No Yes
(iv) test Error
round(mean(test$default != test_pred), 5)
## [1] 0.02601
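
A confusion table gives a more detailed view than the raw error rate; the line below is a quick sketch using the objects created above.

table(predicted = test_pred, actual = test$default)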

(c) Repeating the train/test Splits

glm_fit_errors <- c()
glm_fit_errors[1] <- mean(test$default != test_pred)

set.seed(42, sample.kind = "Rounding")
## Warning in set.seed(42, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
for (i in 2:4) {
  index <- createDataPartition(y = Default$default, p = 0.7, list = FALSE)
  train <- Default[index, ]
  test <- Default[-index, ]
  
  glm_fit_new <- glm(default ~ income + balance, data = train, family = "binomial")
  test_pred <- factor(ifelse(predict(glm_fit_new, newdata = test, type = "response") > 0.5, "Yes", "No"))
  glm_fit_errors[i] <- mean(test$default != test_pred)
}

round(glm_fit_errors, 5)
## [1] 0.02601 0.02534 0.02601 0.02534
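
The average of the four estimates can be computed directly from the stored error vector:

round(mean(glm_fit_errors), 5)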

Inference: There is some variation across the splits, as we would expect with the validation set approach, since each random split produces a different test set. The average test error is 0.02568.

(d) Does Adding student Improve test Performance?

set.seed(43, sample.kind = "Rounding")
## Warning in set.seed(43, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
index <- createDataPartition(y = Default$default, p = 0.7, list = FALSE)
train <- Default[index, ]
test <- Default[-index, ]

glm_fit_3 <- glm(default ~ ., data = train, family = "binomial")
summary(glm_fit_3)
## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.100e+01  5.957e-01 -18.473   <2e-16 ***
## studentYes  -6.353e-01  2.825e-01  -2.249   0.0245 *  
## balance      5.780e-03  2.789e-04  20.721   <2e-16 ***
## income       4.235e-06  9.788e-06   0.433   0.6653    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1093.3  on 6997  degrees of freedom
## AIC: 1101.3
## 
## Number of Fisher Scoring iterations: 8
test_pred <- factor(ifelse(predict(glm_fit_3, newdata = test, type = "response") > 0.5, "Yes", "No"))
round(mean(test$default != test_pred), 5)
## [1] 0.02634

Inference: The test error for this model is larger than the test error for the smaller model (larger, in fact, than all four of the estimates above), so including a dummy variable for student does not reduce the test error rate.

(6)

(a) Logistic Regression - Coefficient SE Estimates

glm_fit_4 <- glm(default ~ income + balance, data = Default, family = "binomial")

summary(glm_fit_4)$coefficients[, 2]
##  (Intercept)       income      balance 
## 4.347564e-01 4.985167e-06 2.273731e-04

(b) Creating boot.fn() - Estimating Logistic Coefficients

boot.fn <- function(data, index = 1:nrow(data)) {
  # Refit the logistic regression on the rows given by `index` and
  # return the income & balance coefficients (the intercept is dropped)
  coef(glm(default ~ income + balance, data = data, subset = index, family = "binomial"))[-1]
}

boot.fn(Default)
##       income      balance 
## 2.080898e-05 5.647103e-03
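
Each bootstrap replicate amounts to calling boot.fn() with a resampled index vector; the line below mimics one such replicate by hand.

# One hand-rolled bootstrap replicate: resample row indices with replacement
boot.fn(Default, sample(nrow(Default), replace = TRUE))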

(c) boot.fn() & boot::boot() - Bootstrap Estimate of Coefficient SE’s

set.seed(101, sample.kind = "Rounding")
## Warning in set.seed(101, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Default, statistic = boot.fn, R = 1000))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 2.080898e-05 9.797648e-08 4.710812e-06
## t2* 5.647103e-03 1.345215e-05 2.293757e-04
as.data.frame(boot_results$t) %>%
  rename(income = V1, balance = V2) %>%
  gather(key = "variable", value = "estimate") %>%
  ggplot(aes(x = estimate, fill = factor(variable))) + 
  geom_histogram(bins = 20) + 
  facet_wrap(~ variable, scales = "free_x") + 
  labs(title = "1,000 Bootstrap Parameter Estimates - 'balance' & 'income'", 
       subtitle = paste0("SE(balance) = ", formatC(sd(boot_results$t[ ,2]), format = "e", digits = 6), 
                         ", SE(income) = ", formatC(sd(boot_results$t[ ,1]), format = "e", digits = 6)), 
       x = "Parameter Estimate", 
       y = "Count") + 
  theme(legend.position = "none")

(d) Logistic SE’s & Bootstrap SE’s - Differences

sapply(data.frame(income = boot_results$t[ ,1], balance = boot_results$t[ ,2]), sd)
##       income      balance 
## 4.710812e-06 2.293757e-04
summary(glm_fit_4)$coefficients[2:3, 2]
##       income      balance 
## 4.985167e-06 2.273731e-04

Inference:

Both the bootstrap method and the logistic regression (glm) method provide very similar standard error estimates for the coefficients of income and balance. The estimates are:

  • Bootstrap Method:

    • income: 4.710812e-06

    • balance: 2.293757e-04

  • GLM Method:

    • income: 4.985167e-06

    • balance: 2.273731e-04

The small differences between the two methods suggest that the assumptions underlying the logistic regression standard error formulas are well satisfied for this dataset.

(9)

glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

(a) Estimate the medv Population Mean - μ̂

mean(Boston$medv)
## [1] 22.53281

(b) Estimate SE(μ̂) - Using Theory

sd(Boston$medv) / sqrt(length(Boston$medv))
## [1] 0.4088611

(c) Estimate SE(μ̂) - Using the Bootstrap

boot.fn <- function(vector, index) {
  mean(vector[index])
}

set.seed(66, sample.kind = "Rounding")
## Warning in set.seed(66, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1* 22.53281 0.0116587   0.4081538

(d) 95% Confidence Interval for the Mean - Using the Bootstrap

boot_results_SE <- sd(boot_results$t)
round(c(mean(Boston$medv) - 2*boot_results_SE, mean(Boston$medv) + 2*boot_results_SE), 4)
## [1] 21.7165 23.3491
t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

Inference: The bootstrap and t.test confidence interval estimates agree to one decimal place.
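
As a cross-check, the boot package's boot.ci() can construct an interval directly from the same bootstrap output; a percentile interval is sketched below (type = "perc" selects the percentile method).

boot.ci(boot_results, type = "perc")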

(e) Estimate the medv Population Median - μ̂_med

median(Boston$medv)
## [1] 21.2

(f) Estimate SE(μ̂_med) - Using the Bootstrap

boot.fn <- function(vector, index) {
  median(vector[index])
}

set.seed(77, sample.kind = "Rounding")
## Warning in set.seed(77, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2  0.0094   0.3700318

Inference: As with the mean, the standard error is quite small relative to the estimate.

(g) Estimate the medv Population 10th Percentile - μ̂_0.1

quantile(Boston$medv, 0.1)
##   10% 
## 12.75

(h) Estimate SE(μ̂_0.1) - Using the Bootstrap

boot.fn <- function(vector, index) {
  quantile(vector[index], 0.1)
}

set.seed(77, sample.kind = "Rounding")
## Warning in set.seed(77, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 0.02085    0.488873

Inference: The standard error is slightly larger relative to μ̂_0.1 than it was for the mean and median, but it is still small.