1 Introduction

Model validation is an important step in the modeling process and helps in assessing the reliability of models before they are used in decision making. Model validity can be judged by the stability and reasonableness of the regression coefficients, the plausibility and usability of the regression function, and the ability to generalize inferences drawn from the regression analysis. We introduce the basic concept of the bootstrap and review some of its applications in regression model validation.

1.1 When to Use the Bootstrap?

The bootstrap is a simulation-based procedure for estimating and validating the sampling distribution of a statistic. Generally, it is useful when:

  1. The theoretical distribution of a statistic is complicated or unknown.
  2. The sample size is insufficient for straightforward statistical inference.

To illustrate, we use the bootstrap to derive an empirical distribution and confidence intervals for a sample median, a statistic whose sampling distribution has no simple closed form.

1.2 Basic Example: Confidence Intervals of a Sample Median

1.2.1 The Goal

Quantify a statistical confidence interval around the median of the values observed in a given data sample.

1.2.2 The Data

Suppose we have a sample x of size n=9, which is simply the numbers 1 through 9. We assume that x is a representative sample of independent observations from a larger, unknown population X (this can be a heroic assumption, and should be addressed in the analysis).

# load packages used throughout (magrittr for %>%, pander for pandoc.table)
library(magrittr)
library(pander)

# generate hypothetical sample x = {1,2, ... ,8,9}
x = 1:9

# print x
matrix(x, nrow = 1, dimnames = list("x", paste('Obs', 1:9))) %>% 
    pandoc.table("Sample Observations x")
Sample Observations x
  Obs 1 Obs 2 Obs 3 Obs 4 Obs 5 Obs 6 Obs 7 Obs 8 Obs 9
x 1 2 3 4 5 6 7 8 9
# print summary of x
summary(x) %>% t() %>%  pandoc.table("Summary of sample x")
Summary of sample x
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 3 5 5 7 9

The observed median is 5 (the middle value, Observation 5). But if x is a representative sample from a population X and 5 is our estimate of the population median, what is the 90% confidence interval around that estimate?

There is no simple analytical way to derive the sampling distribution of the median, and the sample is too small to rely on asymptotic approximations. However, we can use the bootstrap to derive the distribution empirically from the data.

1.2.3 Basic Bootstrap

To illustrate the basic setup, we begin by randomly drawing R=10 samples with replacement from our sample x. Each random draw is of the same length (n=9) as our original sample.

# set up an empty matrix to hold the bootstrap draws
B = matrix(nrow = 10, ncol = 9,
           dimnames = list(paste('Bootstrap Sample',1:10), 
                           LETTERS[1:9]))
# loop 10 times
set.seed(111)
for(i in 1:10){
    # draw a sample from x with replacement; sort for display
    B[i,] = sample(x, size = length(x), replace = TRUE) %>% sort()
}

As a result, we have an R x n matrix of bootstrapped observations.

Bootstrapped Samples
  Obs 1 Obs 2 Obs 3 Obs 4 Obs 5 Obs 6 Obs 7 Obs 8 Obs 9
Original Sample x 1 2 3 4 5 6 7 8 9
Bootstrap Sample 1 1 4 4 4 4 5 5 6 7
Bootstrap Sample 2 1 1 1 2 2 5 6 6 9
Bootstrap Sample 3 3 3 3 4 4 4 6 6 9
Bootstrap Sample 4 1 3 4 5 5 5 6 7 8
Bootstrap Sample 5 2 3 6 6 6 6 8 8 9
Bootstrap Sample 6 1 3 4 4 6 6 7 8 8
Bootstrap Sample 7 1 1 2 5 5 5 6 8 9
Bootstrap Sample 8 1 1 3 3 3 3 6 8 9
Bootstrap Sample 9 1 1 2 3 4 5 6 8 9
Bootstrap Sample 10 1 2 4 4 6 7 7 7 8

Note: Since we drew 10 random samples of the same size as our original sample (n=9), any given observation xi will appear more than once in some samples and not at all in others. For example, the number 4 appears four times in Bootstrap Sample 1 but is missing entirely from Bootstrap Sample 2.

1.2.4 Bootstrapping the Statistic of Interest

Because each bootstrapped sample was sorted, the middle column (Observation 5) contains the median of each sample. It provides us with an empirical distribution of the median, from which we can infer the mean, quantiles, and extreme values of the population median:
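
A minimal sketch of extracting and summarizing these medians (med.boot is a name introduced here for illustration):

# the 5th of 9 sorted values is the median of each row
med.boot = B[, 5]     # equivalently: apply(B, 1, median)

# summarize the empirical distribution of the median
summary(med.boot)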

Min. 1st Qu. Median Mean 3rd Qu. Max.
2 4 4.5 4.5 5.75 6

We can also visualize the distribution of the bootstrapped median:

Of course, R=10 random samples are not enough to generate any meaningful distributions. We can repeat the same bootstrap exercise with R=1000 random samples, which gives us a matrix B sized 1000 by 9.

R = 1000  # number of bootstrap samples
n = 9     # sample size

# set up an empty R x n matrix B
B = matrix(nrow = R, ncol = n,
           dimnames = list(1:R, LETTERS[1:n]))
# loop R times
set.seed(111)
for(i in 1:R){
    # draw random samples from x
    B[i,] = sample(x, size = n, replace = TRUE)
}

Drawing a sufficiently large number of bootstrap samples and taking the row medians of the bootstrap matrix B results in a nice bell-shaped distribution of our median estimate.
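
A sketch of computing the row medians and their empirical quantiles (again introducing med.boot for illustration):

# compute the median of each bootstrap sample (row-wise)
med.boot = apply(B, 1, median)

# empirical quantiles of the bootstrapped median
quantile(med.boot, probs = c(0, 0.01, 0.05, 0.10, 0.50, 0.90, 0.95, 0.99, 1))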

Empirical Quantiles of the Bootstrapped Median

0% 1% 5% 10% 50% 90% 95% 99% 100%
1 2 3 3 5 7 7 8 8

1.2.5 Bootstrap Confidence Interval

Given the quantiles above, we can infer with 90% confidence that the median of the population X is between 3 and 7 (the 5% and 95% quantiles). In other words, if we repeatedly drew samples from the same population and constructed an interval this way each time, roughly 90 out of 100 such intervals would contain the true population median, assuming our original sample is representative.
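
Extracting this interval from the bootstrapped medians is a one-liner (using med.boot from above):

# 90% bootstrap confidence interval: 5% and 95% empirical quantiles
quantile(med.boot, probs = c(0.05, 0.95))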

1.3 Key Points

  • Estimator of interest
    • can be an algorithm of almost any complexity
    • the bootstrapped estimate must be a smooth function of the data to be useful (no jumps)
  • The original sample must be a fair representation of the underlying population
    • often a heroic assumption
  • Simulation replaces theoretical calculation
    • removes the need to look up distribution formulas
    • does not remove the need for thought, domain knowledge, and creativity
  • Mitigate risk
    • validate code very carefully - garbage in, garbage out
    • ensure the original sample is representative and observations are independent and identically distributed
    • increase the number of random samples (as large as computationally feasible)

2 Bootstrap Applications in Model Validation

The basic technique illustrated above can be extended to a variety of use cases in the area of model validation:

  1. Goodness of fit (F-stat, R2, AIC)
  2. Statistical significance of the estimated parameters (t-test and p-values)
  3. Residuals diagnostics (normality, heteroscedasticity)
  4. Prediction Accuracy (Mean Squared Errors, Accuracy, Sensitivity/Specificity, Kolmogorov-Smirnov stat, etc.)
  5. Estimates of model bias & variance (similar to cross-validation)

We can use bootstrapping to examine the distribution of these characteristics and their sensitivity to the natural variation in the sample without the need for collecting additional data.

Below, we provide just a few examples of its application, but the possibilities of using bootstrap methodology for model validation are limitless.

2.1 Stability of Regression Coefficients

Once a regression model is estimated, the statistical significance of coefficients is tested with parametric tests. In this section we supplement traditional tests by examining the distribution of coefficient estimates through bootstrapping.

2.1.1 The Data

We use a dataset consisting of 1,000 consumer credit profiles obtained from a German bank. For each consumer the binary response variable “credit” (Good/Bad) is available. In addition, 20 covariates that are assumed to influence creditworthiness were recorded. Examples of included predictors are listed below.

  • Account Balance
  • Payment Status
  • Savings/Stocks Value
  • Length of Current Employment
  • Installment % of Income
  • Marital Status / Gender
  • Occupation
  • Duration at Current Address
  • Rent/Own

The full dataset description is available in the UCI Machine Learning Repository (Statlog German Credit Data).

# load credit data
library(caret)
library(dplyr)   # rename() and %>%
data(GermanCredit)
data = GermanCredit %>% rename(credit = Class)

2.1.2 The Model

For simplicity, we use only a handful of variables to fit a simple logistic regression that estimates the linear effect of those variables on the logit (log-odds) of having a Good vs. Bad credit profile.

set.seed(304)
logit = glm(credit ~
                Amount + Age + Duration +
                Personal.Male.Single+
                Purpose.UsedCar+
                Property.RealEstate
            , 
            data=data, family = binomial(link = "logit"))
Logistic Regression: Probability of ‘Good’ Credit Profile
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.7901 0.2841 2.7813 0.0054
Amount -0.0001 0.0000 -1.7244 0.0846
Age 0.0137 0.0069 1.9960 0.0459
Duration -0.0325 0.0075 -4.3174 0.0000
Personal.Male.Single 0.4640 0.1513 3.0680 0.0022
Purpose.UsedCar 1.2100 0.2912 4.1549 0.0000
Property.RealEstate 0.4897 0.1767 2.7723 0.0056


All included variables appear to be significant at the 10% level based on the estimated standard errors included in the output. We can go a step further and validate the model’s assumptions about the shape of the estimated parameters’ distributions, on which those standard errors are based.

2.1.3 Bootstrapping Regression Coefficients

As part of model validation, we often test the fit of the model and the significance of its coefficients. These parametric tests rely on assumptions about the underlying distributions (normal, F, t, Chi-Squared, etc.).

We can use bootstrapping to calculate any complex test statistic and infer confidence intervals without making any assumptions about the distribution of these statistics.

In this section, we apply the same bootstrapping technique to approximate the distribution and test the stability of regression coefficients estimates.

R = 10                      # number of bootstrap samples
n = nrow(data)              # sample size
k = length(coef(logit))     # number of coefficients

# set up an empty R x k matrix B
B = matrix(nrow = R, ncol = k,
           dimnames = list(paste("Sample",1:R), 
                           names(coef(logit))))
# loop R times
set.seed(111)
for(i in 1:R){
    # sample credit data with replacement
    boot.data = data[sample(x = 1:n, size = n, replace = TRUE), ]
    # fit the model on the bootstrapped sample
    boot.logit = glm(logit$formula, 
                     data=boot.data, family = binomial(link = "logit"))
    # store the coefficients
    B[i,] = coef(boot.logit)
}

As a result, we have an R x k matrix of bootstrapped regression coefficients, where each row is the set of coefficients estimated from one bootstrapped sample.

Bootstrapped Logistic Regression Model Coefficients
  (Intercept) Amount Age Duration Personal Male Single Purpose UsedCar Property RealEstate
Full Sample 0.79008 -0.00006 0.01372 -0.03251 0.46405 1.20996 0.48974
Sample 1 0.50144 -0.00008 0.01665 -0.02573 0.59468 1.61972 0.34213
Sample 2 0.67629 -0.00009 0.02026 -0.03445 0.52069 1.28117 0.36390
Sample 3 0.62652 -0.00007 0.01803 -0.02821 0.42826 1.33093 0.52020
Sample 4 0.54848 -0.00009 0.01788 -0.02664 0.59014 1.59539 0.39089
Sample 5 0.89008 -0.00011 0.01960 -0.03864 0.60542 1.36213 0.37428
Sample 6 0.61319 -0.00009 0.01730 -0.01899 0.28116 1.02529 0.76220
Sample 7 0.62361 -0.00008 0.01142 -0.01900 0.50472 0.89934 0.73478
Sample 8 0.98536 -0.00004 0.01103 -0.03477 0.68881 0.86006 0.50175
Sample 9 1.39722 -0.00003 -0.00010 -0.04043 0.50344 1.11560 0.25355
Sample 10 0.90840 -0.00006 0.00537 -0.02372 0.38076 1.78454 0.68073
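
One quick stability check is to compare the spread of the bootstrapped coefficients against the parametric standard errors reported by glm. A minimal sketch using the matrix B above (with only R=10 samples these numbers are illustrative at best):

# bootstrap standard deviations vs. model standard errors
cbind(boot.sd  = apply(B, 2, sd),
      model.se = summary(logit)$coefficients[, "Std. Error"])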

2.1.4 Using the boot function in R

R’s boot package provides a convenient function to perform bootstrapping for a variety of measures. Its usage requires a user-defined function that takes in the original data sample and outputs the calculated statistic of interest.

Parallel processing: Generating a large number of random samples and refitting the model is computationally intensive and time-consuming for complex models and large datasets. The boot function provides the option to split the bootstrapping job across a cluster of computers or multiple cores of the local CPU. Processing bootstrap samples in parallel cuts down on processing time and allows us to keep the number of repetitions R sufficiently large.

library(boot)
library(parallel)

# function to return bootstrapped coefficients
myLogitCoef <- function(data, indices, formula) {
    d <- data[indices,]
    fit <- glm(formula, data=d, family = binomial(link = "logit"))
    return(coef(fit))
}

# set up a cluster of 4 CPU cores
cl <- makeCluster(4)
clusterExport(cl, 'myLogitCoef')

set.seed(373)
coef.boot <- boot(data=data, statistic=myLogitCoef, R=1000, 
                  formula= logit$formula,
                  # process in parallel across 4 CPU cores
                  parallel = 'snow', ncpus=4, cl=cl)
stopCluster(cl)
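
With the boot object in hand, empirical confidence intervals for any coefficient can be extracted with boot.ci. A sketch for the Age coefficient (index 3 in the coefficient vector, counting the intercept first; the percentile method mirrors the quantile approach used earlier):

# 95% percentile confidence interval for the Age coefficient
boot.ci(coef.boot, type = "perc", index = 3)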

2.1.5 Univariate Distribution of Bootstrapped Coefficients

The plots below illustrate the distribution of bootstrapped model coefficients and 95% empirical confidence intervals (thick black line at the bottom of the chart). For comparison, the graphs also show the normal distributions (orange bell curves) and 95% confidence intervals (thick orange line) that are implied by the estimated coefficients and their standard errors (assuming normality).
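
A minimal sketch of how one such panel can be drawn, using the coef.boot object and the fitted logit model from above (the 95% empirical interval here uses the percentile method):

# bootstrapped estimates for one coefficient (Age, index 3)
b = coef.boot$t[, 3]
hist(b, breaks = 40, freq = FALSE, main = "Age Coefficient", xlab = "Estimate")

# normal density implied by the parametric estimate and its standard error
est = summary(logit)$coefficients["Age", "Estimate"]
se  = summary(logit)$coefficients["Age", "Std. Error"]
curve(dnorm(x, mean = est, sd = se), add = TRUE, col = "darkorange1", lwd = 2)

# 95% empirical confidence interval
abline(v = quantile(b, probs = c(0.025, 0.975)), lty = 2)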

We can also visualize the correlation between two model coefficients by plotting their bivariate distributions.

  • Each dot represents a set of model coefficient estimates from each bootstrap iteration.
  • The orange dashed lines mark the joint 50%, 95%, and 99% confidence regions.

2.1.6 Conclusion

Model assumptions are important for the quality of predictions from a regression model. Better predictions result from a model that satisfies its underlying assumptions; however, those assumptions are never fully met in empirical data.

The plots above suggest that, for our data, the distributions of the model coefficients are approximately normal. The bootstrapped confidence intervals thus provide a sanity check for relying on the distributional assumptions inherent to parametric tests.

The same methodology can be used to bootstrap the distribution of any other regression model test statistic or diagnostic (e.g., R2, MSE, AIC, BIC).

2.2 Validation of Model Prediction Accuracy

For predictive models, the bootstrap can also be examined in the context of three types of model performance validation:

  1. Apparent:
    • Performance on same data used to train model (aka “in-sample” fit)
    • Easy to perform (often part of standard regression output)
    • Optimistic estimates of performance
    • Over-fitting - Model fine-tuned to training data. Testing on new data may show disappointing results.
  2. Internal:
    • Performance on same population underlying the sample, but with a twist
    • Honest estimate of performance on observations similar to the training sample
    • Indicates upper limit to expected performance in other settings
    • Three common techniques:
      • Split Sample: Training / Testing
      • Cross-validation (repeated training/testing splits)
      • Bootstrap (demonstrated below)
  3. External:
    • Performance on related but slightly different population (aka “out-of-sample”)
    • Sample from a nearby geography or from the next time period

In the following section, we demonstrate how to use the bootstrap to perform internal validation of model prediction accuracy, as measured by the area under the ROC curve (AUC).

2.2.1 Area under the ROC curve

The Receiver Operating Characteristic (ROC) curve for our default predicting logistic regression model is plotted below. The Area under the ROC curve (AUC) is a widely-used measure of accuracy for classification models. Its meaning can be interpreted as follows:

  • When AUC=1.00, the model assigns all observations to their true class with perfect accuracy.
  • When AUC=0.50, the ROC curve is equivalent to the 45-degree line. It indicates the model is as accurate as guessing at random.
  • When AUC<0.50, the model accuracy is worse than guessing at random.

We can derive the ROC curve and calculate AUC for the logistic regression fitted above using R’s ROCR package.

library(ROCR)

# score the same training data set on which the model was fit
prob = predict.glm(logit, type='response', newdata =  data)
pred = prediction(prob, data$credit)

# AUC
auc = performance(pred,"auc")@y.values[[1]][1]

# plot the ROC curve
perf <- performance(pred,"tpr","fpr")
plot(perf, col="navyblue", cex.main=1,
     main= paste("Logistic Regression ROC Curve: AUC =", round(auc,3)))
abline(a=0, b = 1, col='darkorange1')

The logit model’s apparent, in-sample AUC = 0.684 is optimistically high, because the predictions were made on the same dataset on which the model was estimated. As mentioned before, to get a more realistic measure of the model’s predictive accuracy, we need to apply internal (cross-validation, bootstrap) and/or external (new data, if available) validation techniques.

In the following section, we demonstrate the use of the bootstrap to adjust the apparent accuracy rate to get an idea of its hypothetical performance in predicting the outcomes for additional data sampled from a similar population, without actually collecting more data.

2.2.2 Bootstrap Estimates of Prediction Accuracy

As a basic illustration, we generate R=10 bootstrap samples from our consumer credit profiles dataset, estimate the model on each, and then apply each fitted model to the original sample to obtain R estimates of AUC. The overall estimate of prediction accuracy is the average of these R estimates (the first column in the table below). We then recalculate prediction accuracy when each fitted model is applied to its own bootstrap sample (the second column below).

R = 10
n = nrow(data)

# empty Rx2 matrix for bootstrap results 
B = matrix(nrow = R, ncol = 2,
           dimnames = list(paste('Sample',1:R),
                           c("auc_orig","auc_boot")))

set.seed(701)
for(i in 1:R){
    
    # draw a random sample
    obs.boot <- sample(x = 1:n, size = n, replace = TRUE)
    data.boot <- data[obs.boot, ]
    
    # fit the model on bootstrap sample
    logit.boot <- glm(logit$formula , 
                      data=data.boot,
                      family = binomial(link = "logit"))
    
    # apply model to original data
    prob1 = predict(logit.boot, type='response', data)
    pred1 = prediction(prob1, data$credit)
    auc1 = performance(pred1,"auc")@y.values[[1]][1]
    B[i, 1] = auc1
    
    # apply model to bootstrap data
    prob2 = predict(logit.boot, type='response', data.boot)
    pred2 = prediction(prob2, data.boot$credit)
    auc2 = performance(pred2,"auc")@y.values[[1]][1]
    B[i, 2] = auc2
}
  AUC on original sample AUC on bootstrap sample Optimism
Bootstrap Sample 1 0.675 0.689 0.014
Bootstrap Sample 2 0.676 0.685 0.009
Bootstrap Sample 3 0.674 0.691 0.017
Bootstrap Sample 4 0.681 0.707 0.026
Bootstrap Sample 5 0.675 0.68 0.005
Bootstrap Sample 6 0.675 0.671 -0.005
Bootstrap Sample 7 0.671 0.689 0.017
Bootstrap Sample 8 0.677 0.688 0.012
Bootstrap Sample 9 0.682 0.705 0.023
Bootstrap Sample 10 0.674 0.714 0.04
AVERAGE 0.676 0.692 0.016

Not surprisingly, the values in the second column are on average higher than those in the first. The improved bootstrap AUC estimate focuses on the difference between the first and second columns, appropriately called the “optimism”; it is the amount by which the apparent prediction accuracy overestimates the true prediction accuracy. The overall estimate of optimism is the average of the R differences between the first and second columns, a value of 0.016 in this example.

Once an estimate of optimism is obtained, it is subtracted from the apparent AUC to obtain an improved estimate of prediction accuracy. Here we obtain 0.684 - 0.016 = 0.668.
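
A sketch of this correction, computed directly from the results matrix B and the apparent AUC (auc) obtained earlier:

# average optimism across bootstrap samples
optimism = mean(B[, "auc_boot"] - B[, "auc_orig"])

# optimism-corrected estimate of prediction accuracy
auc.corrected = auc - optimism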

Of course, 10 bootstrap samples are too few; repeating the exercise with R=200 samples gave an estimate of 0.009 for the optimism, leading to the value 0.684 - 0.009 = 0.675 for the improved estimate of prediction accuracy. Essentially, we have applied a bias correction to the apparent AUC.

To summarize:

Logistic Regression: Internally-Validated Prediction Accuracy
Apparent ‘in-sample’ AUC: 0.684
Optimism: 0.009
Bootstrapped AUC: 0.675


For more details on this approach, see:
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall.