k-Fold Cross-Validation involves dividing the dataset into k equal-sized subsets or folds. A model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, ensuring each fold acts as the test set once. The final performance metric is obtained by averaging the results across all folds, providing a robust estimate of the model’s out-of-sample performance.
The choice of performance metric is flexible and not limited to Mean Squared Error (MSE) or Accuracy. Other metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Area Under the Receiver Operating Characteristic Curve (AUROC), and Area Under the Precision-Recall Curve (AUPRC) can also be used if they are applicable to the model and dataset.
Advantages over the validation set approach:
- k-Fold Cross-Validation reduces variability in performance estimation, especially for smaller datasets, making results more reliable.
- It ensures that every data point is used for both training and testing, maximizing the utilization of the dataset.
- The validation set approach may overestimate the test error because a significant portion of the data is excluded from training; k-Fold Cross-Validation minimizes this issue.
Disadvantages relative to the validation set approach:
- The validation set approach is easier to understand and communicate, which is useful in industry settings where stakeholders need a straightforward explanation of model evaluation.
- k-Fold Cross-Validation is computationally more expensive: it requires training k models, whereas the validation set approach trains and tests a single model.
Advantages over LOOCV:
- k-Fold Cross-Validation is computationally more efficient, particularly for large datasets. For example, with k = 10 and n = 10,000, 10-fold CV requires training 10 models, whereas LOOCV requires training 10,000.
- From a bias-variance perspective, k-Fold Cross-Validation provides a better tradeoff: LOOCV has low bias but high variance, whereas k-Fold CV balances the two, potentially yielding a more accurate estimate of the test error.
Disadvantages relative to LOOCV:
- k-Fold Cross-Validation introduces some randomness: different splits can yield slightly different out-of-sample error estimates, whereas LOOCV is deterministic.
- In certain cases, LOOCV is computationally cheaper. In least squares regression, for instance, the LOOCV statistic can be computed from the leverage values of a single fit, requiring roughly the cost of fitting one model instead of n (see the sketch below).
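To make the procedure concrete, here is a minimal, self-contained sketch of 10-fold cross-validation in R using cv.glm() from the boot package on the Default data analyzed below. The cost function is where the choice of metric enters; the 0.5-threshold misclassification rate used here is one option among the metrics discussed above.
# 10-fold CV sketch with boot::cv.glm()
# cost(): misclassification rate at a 0.5 threshold
# (r = observed 0/1 response, pi = fitted probability)
library(boot)
library(ISLR2)
cost <- function(r, pi) mean(abs(r - pi) > 0.5)
fit <- glm(default ~ income + balance, family = binomial, data = Default)
set.seed(1)
cv.glm(Default, fit, cost = cost, K = 10)$delta[1]  # 10-fold CV error estimate
The leverage shortcut for least squares can be sketched just as briefly; medv ~ lstat on the Boston data is used here purely as an illustrative model. Since the LOOCV statistic equals the mean of ((y_i − ŷ_i) / (1 − h_i))², where h_i are the leverage values, a single fit suffices.
# LOOCV via the leverage shortcut: one lm() fit instead of n
library(MASS)
fit_lm <- lm(medv ~ lstat, data = Boston)
h <- hatvalues(fit_lm)                 # leverage values h_i
mean((residuals(fit_lm) / (1 - h))^2)  # closed-form LOOCV MSE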
# Load necessary libraries
library(ISLR2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(boot)
##
## Attaching package: 'boot'
##
## The following object is masked from 'package:lattice':
##
## melanoma
library(MASS) # Boston data
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
##
## The following object is masked from 'package:ISLR2':
##
## Boston
theme_set(theme_light())
# Set seed for reproducibility
set.seed(1)
data(Default)
glimpse(Default)
## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
## $ income <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
# Fit logistic regression model
glm_fit <- glm(default ~ income + balance, family = binomial, data = Default)
# Display model summary
summary(glm_fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(123, sample.kind = "Rounding")
## Warning in set.seed(123, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
index <- createDataPartition(y = Default$default, p = 0.7, list = FALSE)
train <- Default[index, ]
test <- Default[-index, ]
nrow(train) / nrow(Default)
## [1] 0.7001
glm_fit_2 <- glm(default ~ income + balance, data = train, family = "binomial")
summary(glm_fit_2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.141e+01 5.099e-01 -22.379 < 2e-16 ***
## income 2.256e-05 5.898e-06 3.825 0.000131 ***
## balance 5.531e-03 2.651e-04 20.866 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1119.2 on 6998 degrees of freedom
## AIC: 1125.2
##
## Number of Fisher Scoring iterations: 8
test_pred <- factor(ifelse(predict(glm_fit_2, newdata = test, type = "response") > 0.5, "Yes", "No"))
test_pred[1:10]
## 2 3 4 12 13 14 15 17 18 20
## No No No No No No No No No No
## Levels: No Yes
round(mean(test$default != test_pred), 5)
## [1] 0.02601
glm_fit_errors <- c()
glm_fit_errors[1] <- mean(test$default != test_pred)
set.seed(42, sample.kind = "Rounding")
## Warning in set.seed(42, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
for (i in 2:4) {
  index <- createDataPartition(y = Default$default, p = 0.7, list = FALSE)
  train <- Default[index, ]
  test <- Default[-index, ]
  glm_fit_new <- glm(default ~ income + balance, data = train, family = "binomial")
  test_pred <- factor(ifelse(predict(glm_fit_new, newdata = test, type = "response") > 0.5, "Yes", "No"))
  glm_fit_errors[i] <- mean(test$default != test_pred)
}
round(glm_fit_errors, 5)
## [1] 0.02601 0.02534 0.02601 0.02534
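The average of these four estimates can be computed directly:
# average validation-set error across the four splits (approx. 0.02568)
mean(glm_fit_errors)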
Inference: There is some variation across the four random splits, as we would expect with the validation set approach. The average test error estimate is 0.02568.
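Since caret is already loaded, the same model can also be assessed with 10-fold cross-validation instead of repeated validation splits; a minimal sketch (output not shown; train() reports Accuracy, so the CV error rate is 1 − Accuracy):
# 10-fold CV with caret::train(); family = "binomial" is passed through to glm()
set.seed(123)
cv_model <- train(default ~ income + balance, data = Default,
                  method = "glm", family = "binomial",
                  trControl = trainControl(method = "cv", number = 10))
cv_model$results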
set.seed(43, sample.kind = "Rounding")
## Warning in set.seed(43, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
index <- createDataPartition(y = Default$default, p = 0.7, list = FALSE)
train <- Default[index, ]
test <- Default[-index, ]
glm_fit_3 <- glm(default ~ ., data = train, family = "binomial")
summary(glm_fit_3)
##
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.100e+01 5.957e-01 -18.473 <2e-16 ***
## studentYes -6.353e-01 2.825e-01 -2.249 0.0245 *
## balance 5.780e-03 2.789e-04 20.721 <2e-16 ***
## income 4.235e-06 9.788e-06 0.433 0.6653
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1093.3 on 6997 degrees of freedom
## AIC: 1101.3
##
## Number of Fisher Scoring iterations: 8
test_pred <- factor(ifelse(predict(glm_fit_3, newdata = test, type = "response") > 0.5, "Yes", "No"))
round(mean(test$default != test_pred), 5)
## [1] 0.02634
Inference: The test error for this model (0.02634) is larger than that of the smaller model (in fact, larger than all four estimates above), so including a dummy variable for student does not reduce the test error rate.
glm_fit_4 <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(glm_fit_4)$coefficients[, 2]
## (Intercept) income balance
## 4.347564e-01 4.985167e-06 2.273731e-04
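# boot.fn() returns the income and balance coefficient estimates; [-1] drops the intercept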
boot.fn <- function(data, index = 1:nrow(data)) {
  coef(glm(default ~ income + balance, data = data, subset = index, family = "binomial"))[-1]
}
boot.fn(Default)
## income balance
## 2.080898e-05 5.647103e-03
set.seed(101, sample.kind = "Rounding")
## Warning in set.seed(101, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Default, statistic = boot.fn, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 9.797648e-08 4.710812e-06
## t2* 5.647103e-03 1.345215e-05 2.293757e-04
as.data.frame(boot_results$t) %>%
  rename(income = V1, balance = V2) %>%
  gather(key = "variable", value = "estimate") %>%
  ggplot(aes(x = estimate, fill = factor(variable))) +
  geom_histogram(bins = 20) +
  facet_wrap(~ variable, scales = "free_x") +
  labs(title = "1,000 Bootstrap Parameter Estimates - 'balance' & 'income'",
       subtitle = paste0("SE(balance) = ", formatC(sd(boot_results$t[, 2]), format = "e", digits = 6),
                         ", SE(income) = ", formatC(sd(boot_results$t[, 1]), format = "e", digits = 6)),
       x = "Parameter Estimate",
       y = "Count") +
  theme(legend.position = "none")
sapply(data.frame(income = boot_results$t[, 1], balance = boot_results$t[, 2]), sd)
## income balance
## 4.710812e-06 2.293757e-04
summary(glm_fit_4)$coefficients[2:3, 2]
## income balance
## 4.985167e-06 2.273731e-04
Inference: Both the bootstrap and the logistic regression (glm) formula give very similar standard error estimates for the coefficients of income and balance:
- Bootstrap: SE(income) = 4.710812e-06, SE(balance) = 2.293757e-04
- GLM formula: SE(income) = 4.985167e-06, SE(balance) = 2.273731e-04
The small differences between the two methods suggest that the assumptions underlying the logistic regression standard error estimates are well satisfied.
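For a side-by-side view, the two sets of standard errors can be collected into a small table (a convenience sketch; output not shown):
# compare bootstrap SEs with the formula-based SEs from glm()
data.frame(
  method  = c("bootstrap", "glm formula"),
  income  = c(sd(boot_results$t[, 1]), summary(glm_fit_4)$coefficients["income", 2]),
  balance = c(sd(boot_results$t[, 2]), summary(glm_fit_4)$coefficients["balance", 2])
)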
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
mean(Boston$medv)
## [1] 22.53281
sd(Boston$medv) / sqrt(length(Boston$medv))
## [1] 0.4088611
boot.fn <- function(vector, index) {
  mean(vector[index])
}
set.seed(66, sample.kind = "Rounding")
## Warning in set.seed(66, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.0116587 0.4081538
boot_results_SE <- sd(boot_results$t)
round(c(mean(Boston$medv) - 2*boot_results_SE, mean(Boston$medv) + 2*boot_results_SE), 4)
## [1] 21.7165 23.3491
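The boot package can produce this interval directly as well; a sketch using the normal approximation, which should closely match the manual ±2·SE computation above (output not shown):
boot.ci(boot_results, type = "norm")  # normal-approximation bootstrap CI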
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
Inference: The two confidence intervals agree to one decimal place: roughly (21.7, 23.3) in both cases.
median(Boston$medv)
## [1] 21.2
boot.fn <- function(vector, index) {
  median(vector[index])
}
set.seed(77, sample.kind = "Rounding")
## Warning in set.seed(77, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.0094 0.3700318
Inference: As with the mean, the standard error (0.3700) is quite small relative to the estimate (21.2).
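A rough 95% confidence interval for the median can be formed the same way as for the mean above (a sketch; output not shown):
# median ± 2·SE, mirroring the interval computed for the mean
boot_median_SE <- sd(boot_results$t)
round(c(median(Boston$medv) - 2 * boot_median_SE,
        median(Boston$medv) + 2 * boot_median_SE), 4)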
quantile(Boston$medv, 0.1)
## 10%
## 12.75
boot.fn <- function(vector, index) {
  quantile(vector[index], 0.1)
}
set.seed(77, sample.kind = "Rounding")
## Warning in set.seed(77, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.02085 0.488873
Inference: Relative to the estimate μ̂0.1 = 12.75, the standard error (0.4889) is slightly larger than in the previous cases, but it is still small.