The Bootstrap

Harold Nelson

2025-07-29

Intro

I want to show you how to create a bootstrap distribution without using the infer package. Going through this should give you a clearer picture of the process.

We’ll use the dataframe age_at_mar from the openintro package.

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
glimpse(age_at_mar)
## Rows: 5,534
## Columns: 1
## $ age <int> 32, 25, 24, 26, 32, 29, 23, 23, 29, 27, 23, 21, 29, 40, 22, 20, 31…

The Bootstrap Code

I’ll use the standard deviation as an example statistic. We can do this with any statistic, not just the mean.

# How big a bootstrap distribution?
boot_size = 1000
# Create a vector to hold the results
statistics = rep(0,boot_size)

for(i in 1:boot_size){
  samp = sample(age_at_mar$age,
                size = length(age_at_mar$age),
                replace = TRUE)
  statistics[i] = sd(samp)
}

summary(statistics)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.493   4.682   4.721   4.720   4.757   4.898
hist(statistics)

Confidence Interval (normal)

lower = mean(statistics - 1.96 * sd(statistics))
upper = mean(statistics + 1.96 * sd(statistics))

ci_normal = c(lower,upper)
ci_normal
## [1] 4.612517 4.826794

Confidence Interval (Alternative)

This does not assume normality.

lower = quantile(statistics,.025)
upper = quantile(statistics,.975)

ci_alternative = c(lower,upper)
ci_alternative
##     2.5%    97.5% 
## 4.615843 4.821501

80%?

We can use the alternative method to change the confidence level easily.

For 80%, we cut off the top 10% and the bottom 10%.

lower = quantile(statistics,.1)
upper = quantile(statistics,.9)

ci_alternative_80 = c(lower,upper)
ci_alternative_80
##      10%      90% 
## 4.652647 4.789212

Comparison

ci_alternative
##     2.5%    97.5% 
## 4.615843 4.821501
ci_alternative_80
##      10%      90% 
## 4.652647 4.789212

A Normal Distribution Example

Generate a random sample of size 500 from a normal distribution with a mean of 100 and a standard deviation of of 20. Call it n100_20

Solution

set.seed(123)
n100_20 = rnorm(500,mean = 100,sd = 20)

The Median

Find the true median of a normal distribution with this mean and sd. Use qnorm().

Solution

qnorm(.5,mean = 100, sd = 20)
## [1] 100

Bootstrap for the Median.

Use n100_20. Get a 90% CI.

Solution

# How big a bootstrap distribution?
boot_size = 1000
# Create a vector to hold the results
statistics = rep(0,boot_size)

for(i in 1:boot_size){
  samp = sample(n100_20,
                size = length(n100_20),
                replace = TRUE)
  statistics[i] = median(samp)
}

summary(statistics)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   95.99   99.43  100.42  100.34  101.19  103.63
hist(statistics)

upper = quantile(statistics,.95)
lower = quantile(statistics,.05)
CI = c(lower,upper)
CI
##        5%       95% 
##  98.55136 101.85080