Model validation is an important step in the modeling process and helps in assessing the reliability of models before they can be used in decision making. Model validity can be judged by the stability and reasonableness of the regression coefficients, the plausibility and usability of the regression function and ability to generalize inference drawn from the regression analysis. We introduce the basic concept of bootstrap and review some of its applications in regression model validation.
The bootstrap is a simulation-based procedure for estimating and validating distributions. Generally, it is useful when:
To illustrate, we use bootstrap to derive an empirical distribution and confidence intervals for a sample median, which lacks a theoretical distribution.
Quantify a statistical confidence interval around the median of the values observed in a given data sample.
Suppose we have a sample x of size n=9, which is simply numbers 1 through 9. We assume that x is a representative sample of independent observation from a larger unknown population X (this can be a heroic assumption, and should be addressed in the analysis).
# generate hypothetical sample x = {1,2, ... ,8,9}
x = seq(1:9)
# print x
matrix(x, nrow = 1, dimnames = list("x", paste('Obs', 1:9))) %>%
pandoc.table("Sample Observations x")
Obs 1 | Obs 2 | Obs 3 | Obs 4 | Obs 5 | Obs 6 | Obs 7 | Obs 8 | Obs 9 | |
---|---|---|---|---|---|---|---|---|---|
x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
# print summary of x
summary(x) %>% t() %>% pandoc.table("Summary of sample x")
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
1 | 3 | 5 | 5 | 7 | 9 |
The observed median = 5 (the middle Observation 5), but if x is a representative sample from a population X and 5 is the estimate of median for that population, what is the 90% confidence interval around that estimate?
There is no analytical way to derive the distribution of the median. Plus, the sample is too small. However, we can use bootstrap to derive it empirically from the data.
To illustrate the basic setup, we begin by randomly drawing R=10 samples with replacement from our sample x. Each random draw is of the same length (n=9) as our original sample.
# set up a empty matrix to populate Bootstrap draws
B = matrix(nrow = 10, ncol = 9,
dimnames = list(paste('Bootstrap Sample',1:10),
LETTERS[1:9]))
# loop 10 times
set.seed(111)
for(i in 1:10){
# draw random samples from x
B[i,] = sample(x, size = length(x), replace = TRUE) %>% sort()
}
As a result, we have a R x n matrix of bootstrapped observations.
Obs 1 | Obs 2 | Obs 3 | Obs 4 | Obs 5 | Obs 6 | Obs 7 | Obs 8 | Obs 9 | |
---|---|---|---|---|---|---|---|---|---|
Original Sample x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Bootstrap Sample 1 | 1 | 4 | 4 | 4 | 4 | 5 | 5 | 6 | 7 |
Bootstrap Sample 2 | 1 | 1 | 1 | 2 | 2 | 5 | 6 | 6 | 9 |
Bootstrap Sample 3 | 3 | 3 | 3 | 4 | 4 | 4 | 6 | 6 | 9 |
Bootstrap Sample 4 | 1 | 3 | 4 | 5 | 5 | 5 | 6 | 7 | 8 |
Bootstrap Sample 5 | 2 | 3 | 6 | 6 | 6 | 6 | 8 | 8 | 9 |
Bootstrap Sample 6 | 1 | 3 | 4 | 4 | 6 | 6 | 7 | 8 | 8 |
Bootstrap Sample 7 | 1 | 1 | 2 | 5 | 5 | 5 | 6 | 8 | 9 |
Bootstrap Sample 8 | 1 | 1 | 3 | 3 | 3 | 3 | 6 | 8 | 9 |
Bootstrap Sample 9 | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 8 | 9 |
Bootstrap Sample 10 | 1 | 2 | 4 | 4 | 6 | 7 | 7 | 7 | 8 |
Note: Since we drew 10 random samples of the same size as our original sample (9), any given observation xi will appear more than once in some samples and none at all in others. For example, the number 4 appears four times in Sample 1, but is missing completely from bootstrap Sample 2.
The middle column with Observation 5 contains the medians of each bootstrapped sample. It provides us with the empirical distribution of the median, from which we can infer the mean, quantiles, and extreme values of the population median:
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
2 | 4 | 4.5 | 4.5 | 5.75 | 6 |
We can also visualize the distribution of the bootstrapped median:
Of course, R=10 random samples are not enough to generate any meaningful distributions. We can repeat the same bootstrap exercise with R=1000 random samples, which gives us a matrix B sized 1000 by 9.
R = 1000 # number of bootstrap samples
n = 9 # sample size
# set up a empty Rxn matrix B
B = matrix(nrow = R, ncol = n,
dimnames = list(1:R, LETTERS[1:n]))
# loop R times
set.seed(111)
for(i in 1:R){
# draw random samples from x
B[i,] = sample(x, size = n, replace = TRUE)
}
Drawing a sufficiently large amounts of bootstrap samples and taking the row medians of bootstrap matrix B results in a nice bell-shaped distribution of our median estimate.
Empirical Quantiles of Bootstrpped Median
0% | 1% | 5% | 10% | 50% | 90% | 95% | 99% | 100% |
---|---|---|---|---|---|---|---|---|
1 | 2 | 3 | 3 | 5 | 7 | 7 | 8 | 8 |
Given the quantiles above, we can infer with 90% confidence that the median of the population X is between 3 and 7 (5% and 95% quantiles). In other words, if we drew 100 samples from the same population, 90 samples would have the median within that range, assuming our original sample was representative of the population.
The basic technique illustrated above can be extended to a variety of use cases in the area of model validation:
We can use bootstrapping to examine the distribution of these characteristics and their sensitivity to the natural variation in the sample without the need for collecting additional data.
Below, we provide just a few examples of its application, but the possibilities of using bootstrap methodology for model validation are limitless.
Once a regression model is estimated, the statistical significance of coefficients is tested with parametric tests. In this section we supplement traditional tests by examining the distribution of coefficient estimates through bootstrapping.
We use a dataset consisting of 1,000 consumer credit profiles obtained from a German bank. For each consumer the binary response variable “credit” (Good/Bad) is available. In addition, 20 covariates that are assumed to influence creditworthiness were recorded. Examples of included predictors are listed below.
The full dataset description is available here.
# load credit data
library(caret)
data(GermanCredit)
data = GermanCredit %>% rename(credit = Class)
For simplicity, we use only a handful of variables to fit a simple logistic regression that estimates the linear effect of the logit of those variables on the probability of having a Good vs. Bad credit quality.
set.seed(304)
logit = glm(credit ~
Amount + Age + Duration +
Personal.Male.Single+
Purpose.UsedCar+
Property.RealEstate
,
data=data, family = binomial(link = "logit"))
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | 0.7901 | 0.2841 | 2.7813 | 0.0054 |
Amount | -0.0001 | 0.0000 | -1.7244 | 0.0846 |
Age | 0.0137 | 0.0069 | 1.9960 | 0.0459 |
Duration | -0.0325 | 0.0075 | -4.3174 | 0.0000 |
Personal.Male.Single | 0.4640 | 0.1513 | 3.0680 | 0.0022 |
Purpose.UsedCar | 1.2100 | 0.2912 | 4.1549 | 0.0000 |
Property.RealEstate | 0.4897 | 0.1767 | 2.7723 | 0.0056 |
All included variables appear to be significant at 10% based on the estimated standard errors included in the output. We can go a step further and validate the the model’s assumptions about the the shape of the estimated parameters’ distributions, based on which the standard errors were calculated.
As part of model validation we often test the fit of the model and the significance of its coefficients. These parametric tests rely on assumptions about the underlying distributions (normal, F, t, Chi-Squared, etc.)
We can use bootstrapping to calculate any complex test statistic and infer confidence intervals without making any assumptions about the distribution of these statistics.
In this section, we apply the same bootstrapping technique to approximate the distribution and test the stability of regression coefficients estimates.
R = 10 # number of bootstrap samples
n = nrow(data) # sample size
k = length(coef(logit)) # number of coefficients
# set up a empty Rxn matrix B
B = matrix(nrow = R, ncol = k,
dimnames = list(paste("Sample",1:R),
names(coef(logit))))
# loop R times
set.seed(111)
for(i in 1:R){
# sample credit data with replacement
boot.data = data[sample(x = 1:n, size = n, replace = TRUE), ]
# fit the model on the boostrapped sample
boot.logit = glm(logit$formula,
data=boot.data, family = binomial(link = "logit"))
# store the coefficients
B[i,] = coef(boot.logit)
}
As a result, we have a R x k matrix of bootstrapped regression coefficients, where each row is a set of coefficients estimated from each one of the bootstrapped samples.
(Intercept) | Amount | Age | Duration | Personal Male Single | Purpose UsedCar | Property RealEstate | |
---|---|---|---|---|---|---|---|
Full Sample | 0.79008 | -0.00006 | 0.01372 | -0.03251 | 0.46405 | 1.20996 | 0.48974 |
Sample 1 | 0.50144 | -0.00008 | 0.01665 | -0.02573 | 0.59468 | 1.61972 | 0.34213 |
Sample 2 | 0.67629 | -0.00009 | 0.02026 | -0.03445 | 0.52069 | 1.28117 | 0.36390 |
Sample 3 | 0.62652 | -0.00007 | 0.01803 | -0.02821 | 0.42826 | 1.33093 | 0.52020 |
Sample 4 | 0.54848 | -0.00009 | 0.01788 | -0.02664 | 0.59014 | 1.59539 | 0.39089 |
Sample 5 | 0.89008 | -0.00011 | 0.01960 | -0.03864 | 0.60542 | 1.36213 | 0.37428 |
Sample 6 | 0.61319 | -0.00009 | 0.01730 | -0.01899 | 0.28116 | 1.02529 | 0.76220 |
Sample 7 | 0.62361 | -0.00008 | 0.01142 | -0.01900 | 0.50472 | 0.89934 | 0.73478 |
Sample 8 | 0.98536 | -0.00004 | 0.01103 | -0.03477 | 0.68881 | 0.86006 | 0.50175 |
Sample 9 | 1.39722 | -0.00003 | -0.00010 | -0.04043 | 0.50344 | 1.11560 | 0.25355 |
Sample 10 | 0.90840 | -0.00006 | 0.00537 | -0.02372 | 0.38076 | 1.78454 | 0.68073 |
R’s boot package provides a convenient function to perform bootstrapping for a variety of measures. Its usage requires a user-defined function that takes in the original data sample and outputs the calculated statistic of interest.
Parallel processing: Generating a high number of random samples and fitting the model is computationally intensive and time-consuming for complex models and large datasets. The boot function provides the option to split the bootstrapping job across a cluster of computers or multiple cores of the local CPU. This allows the user to handle multiple bootstrap samples in parallel, which cuts down on processing time and allows to keep the number of repetitions R sufficiently large.
library(boot)
library(parallel)
# function to return bootstrapped coefficients
myLogitCoef <- function(data, indices, formula) {
d <- data[indices,]
fit <- glm(formula, data=d, family = binomial(link = "logit"))
return(coef(fit))
}
# set up cluster of 4 CPU cores
cl<-makeCluster(4)
clusterExport(cl, 'myLogitCoef')
set.seed(373)
coef.boot <- boot(data=data, statistic=myLogitCoef, R=1000,
formula= logit$formula,
# process in parallel across 4 CPU cores
parallel = 'snow', ncpus=4, cl=cl)
stopCluster(cl)
The plots below illustrate the distribution of bootstrapped model coefficients and 95% empirical confidence intervals (thick black line at the bottom of the chart). For comparison, the graphs also show the normal distributions (orange bell curves) and 95% confidence intervals (thick orange line) that are implied by the estimated coefficients and their standard errors (assuming normality).
We can also visualize the correlation between two model coefficients by plotting their bi-variate distributions.
Model assumptions are important for the quality of predictions from a regression model. Better predictions will result from a model that satisfies its underlying assumptions. However, assumptions can never be fully met in empirical data.
The plots above suggest that in the case of our data, the model coefficients distributions approximate normal. The bootstrapped confidence intervals can provide a sanity check for relying on the distributional assumptions inherent to parametric tests.
The same methodology can be used to bootstrap the distribution of any other complex regression model tests statistics or diagnostics (i.e. R2, MSE, AIC, BIC, etc.)
For predictive models, the Bootstrap can be also examined the context of three types of model performance validation:
In the following section, we demonstrate how to use bootstrap to perform internal validation of model prediction accuracy, as measured by Area under the ROC curve.
The Receiver Operating Characteristic (ROC) curve for our default predicting logistic regression model is plotted below. The Area under the ROC curve (AUC) is a widely-used measure of accuracy for classification models. Its meaning can be interpreted as follows:
We can derive the ROC curve and calculate AUC for the logistic regression fitted above using R’s ROCR package.
library(ROCR)
# score the same training data set on which the model was fit
prob = predict.glm(logit, type='response', newdata = data)
pred = prediction(prob, data$credit)
# AUC
auc = performance(pred,"auc")@y.values[[1]][1]
# plot the ROC curve
perf <- performance(pred,"tpr","fpr")
plot(perf, col="navyblue", cex.main=1,
main= paste("Logistic Regression ROC Curve: AUC =", round(auc,3)))
abline(a=0, b = 1, col='darkorange1')
The logit model’s apparent, in-sample AUC=0.684, which is unrealistically high, because the predictions where made for the same dataset on which the model was estimated. As mentioned before, to get a more realistic measure of the model’s predictive accuracy, we need to apply internal (cross-validation, bootstrap) and/or external (new data, if available) validation techniques.
In the following section, we demonstrate the use of the bootstrap to adjust the apparent accuracy rate to get an idea of its hypothetical performance in predicting the outcomes for additional data sampled from a similar population, without actually collecting more data.
As a basic illustration, we generate R=10 bootstrap samples from our consumer credit profiles dataset, estimate the model on each, and then apply each fitted model to the original sample to give R estimates AUC. The overall estimate of prediction accuracy is the average of these R estimates (see first column in the table below). Further, we recalculate prediction accuracy when the fitted model is applied to the bootstrap sample itself (see the second column below).
R = 10
n = nrow(data)
# empty Rx2 matrix for bootstrap results
B = matrix(nrow = R, ncol = 2,
dimnames = list(paste('Sample',1:R),
c("auc_orig","auc_boot")))
set.seed(701)
for(i in 1:R){
# draw a random sample
obs.boot <- sample(x = 1:n, size = n, replace = T)
data.boot <- data[obs.boot, ]
# fit the model on bootstrap sample
logit.boot <- glm(logit$formula ,
data=data.boot,
family = binomial(link = "logit"))
# apply model to original data
prob1 = predict(logit.boot, type='response', data)
pred1 = prediction(prob1, data$credit)
auc1 = performance(pred1,"auc")@y.values[[1]][1]
B[i, 1] = auc1
# apply model to bootstrap data
prob2 = predict(logit.boot, type='response', data.boot)
pred2 = prediction(prob2, data.boot$credit)
auc2 = performance(pred2,"auc")@y.values[[1]][1]
B[i, 2] = auc2
}
AUC on original sample | AUC on bootstrap sample | Optimism | |
---|---|---|---|
Bootstrap Sample 1 | 0.675 | 0.689 | 0.014 |
Bootstrap Sample 2 | 0.676 | 0.685 | 0.009 |
Bootstrap Sample 3 | 0.674 | 0.691 | 0.017 |
Bootstrap Sample 4 | 0.681 | 0.707 | 0.026 |
Bootstrap Sample 5 | 0.675 | 0.68 | 0.005 |
Bootstrap Sample 6 | 0.675 | 0.671 | -0.005 |
Bootstrap Sample 7 | 0.671 | 0.689 | 0.017 |
Bootstrap Sample 8 | 0.677 | 0.688 | 0.012 |
Bootstrap Sample 9 | 0.682 | 0.705 | 0.023 |
Bootstrap Sample 10 | 0.674 | 0.714 | 0.04 |
AVERAGE | 0.676 | 0.692 | 0.016 |
Not surprisingly, the values in the second column are higher on the average than those in the first column. The improved bootstrap AUC estimate focuses on the difference between the first and second columns, called appropriately the “optimism”; it is the amount by which the average AUC (or " the apparent prediction accuracy“) overestimates the true prediction accuracy. The overall estimate of optimism is the average of the R differences between the first and second columns, a value of 0.016 in this example.
Once an estimate of optimism is obtained, it is subtracted from the apparent AUC to obtain an improved estimate of prediction accuracy. Here we obtain 0.684 - 0.016 = 0.668.
Of course 10 bootstrap samples are too few; repeating with 200 samples gave a value of for the simple bootstrap estimate, and an estimate of 0.009 for the optimism leading to the value 0.684 - 0.009 = 0.675 for the improved estimate of prediction accuracy. Essentially, we have added a bias correction to the apparent AUC.
To summarize:Apparent ‘in-sample’ AUC: | 0.684 |
Optimism | 0.009 |
Bootstrapped AUC: | 0.675 |
For more details on this approach see:
Efron, B., & Tibshirani, R. (1993). “An Introduction to the Bootstrap”. Chapman and Hall