The Bootstrap

Harold Nelson

2024-07-18

Intro

I want to show you how to create a bootstrap distribution without using the infer package. Building it by hand should give you a clearer picture of the process: resample the data with replacement, recompute the statistic, and repeat many times.

We’ll use the dataframe age_at_mar from the openintro package.

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
glimpse(age_at_mar)
## Rows: 5,534
## Columns: 1
## $ age <int> 32, 25, 24, 26, 32, 29, 23, 23, 29, 27, 23, 21, 29, 40, 22, 20, 31…

The Bootstrap Code

I’ll use the standard deviation as an example statistic.

# How big a bootstrap distribution?
boot_size = 1000
# Create a vector to hold the results
statistics = rep(0,boot_size)

for(i in 1:boot_size){
  samp = sample(age_at_mar$age,
                size = length(age_at_mar$age),
                replace = TRUE)
  statistics[i] = sd(samp)
}

summary(statistics)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.573   4.689   4.729   4.726   4.761   4.903
hist(statistics)
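
The loop above can also be written without preallocating a results vector. The following is a minimal sketch using base R's replicate(); the vector ages is simulated stand-in data (an assumption, not the real age_at_mar values) so that the example runs on its own.

```r
# Hypothetical stand-in data; the document itself uses age_at_mar$age.
set.seed(123)
ages = round(rnorm(200, mean = 23, sd = 4.7))

boot_size = 1000

# replicate() evaluates the expression boot_size times and collects
# the results in a numeric vector, so no explicit loop is needed.
statistics = replicate(
  boot_size,
  sd(sample(ages, size = length(ages), replace = TRUE))
)

length(statistics)
summary(statistics)
```

To apply this to the real data, replace ages with age_at_mar$age. Calling set.seed() first makes the resampling reproducible; the results shown in this document were produced without a seed, so rerunning the code will give slightly different numbers.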

Confidence Interval (IMS Way)

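# Center the interval on the mean of the bootstrap statistics and
# extend 1.96 bootstrap standard errors in each direction
lower = mean(statistics) - 1.96 * sd(statistics)
upper = mean(statistics) + 1.96 * sd(statistics)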

ci_ims = c(lower,upper)
ci_ims
## [1] 4.620330 4.831133
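
The multiplier 1.96 is the 0.975 quantile of the standard normal distribution, which is why it gives a 95% interval. As a sketch, the hypothetical helper se_interval() below computes the multiplier with qnorm() so the same approach works for other confidence levels. The vector fake_stats is simulated purely for illustration and is not the bootstrap distribution computed above.

```r
# Hypothetical helper: normal-approximation interval centered on the
# mean of the bootstrap statistics, as in the code above.
se_interval = function(stats, level = 0.95) {
  z = qnorm(1 - (1 - level) / 2)   # about 1.96 when level = 0.95
  center = mean(stats)
  se = sd(stats)                   # bootstrap standard error
  c(lower = center - z * se, upper = center + z * se)
}

# Simulated stand-in for a bootstrap distribution (an assumption)
set.seed(1)
fake_stats = rnorm(1000, mean = 4.73, sd = 0.054)

se_interval(fake_stats)                # 95% interval
se_interval(fake_stats, level = 0.80)  # narrower 80% interval
```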

Confidence Interval (Alternative)

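This method takes the interval endpoints directly from the percentiles of the bootstrap distribution, so it does not assume that the distribution is normal.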

lower = quantile(statistics,.025)
upper = quantile(statistics,.975)

ci_alternative = c(lower,upper)
ci_alternative
##     2.5%    97.5% 
## 4.621390 4.834643
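
The difference between the two methods shows up when the bootstrap distribution is skewed. The sketch below uses a deliberately exaggerated, simulated right-skewed vector (skewed_stats, drawn with rexp(); an assumption for illustration only, not a real bootstrap distribution) to compare the two intervals.

```r
# Hypothetical, strongly right-skewed values that are always positive
set.seed(42)
skewed_stats = rexp(5000, rate = 1)

# Normal-approximation interval: symmetric around the mean
normal_ci = mean(skewed_stats) + c(-1, 1) * 1.96 * sd(skewed_stats)

# Percentile interval: endpoints come from the values themselves
percentile_ci = quantile(skewed_stats, c(0.025, 0.975))

normal_ci      # lower end falls below 0, outside the possible range
percentile_ci  # both ends stay within the range of the values
```

For the standard deviations computed in this document the bootstrap distribution is close to symmetric, so the two methods give nearly the same answer.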

An 80% Confidence Interval

We can use the alternative method to change the confidence level easily.

For 80%, we cut off the top 10% and the bottom 10%.

lower = quantile(statistics,.1)
upper = quantile(statistics,.9)

ci_alternative_80 = c(lower,upper)
ci_alternative_80
##      10%      90% 
## 4.654242 4.791636
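
The same calculation can be wrapped in a function that takes the confidence level as an argument. This is a minimal sketch; the name percentile_interval() is hypothetical, and fake_stats is a simulated stand-in for the bootstrap distribution rather than the real results.

```r
# Hypothetical helper: percentile interval at an arbitrary level.
# For level = 0.80, alpha is 0.20, so 10% is cut from each tail.
percentile_interval = function(stats, level = 0.95) {
  alpha = 1 - level
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2))
}

# Simulated stand-in for a bootstrap distribution (an assumption)
set.seed(2)
fake_stats = rnorm(1000, mean = 4.73, sd = 0.054)

percentile_interval(fake_stats)                # 95% interval
percentile_interval(fake_stats, level = 0.80)  # 80% interval
```

With the statistics vector computed earlier, percentile_interval(statistics, level = 0.80) would reproduce the quantile(statistics, .1) and quantile(statistics, .9) calls above.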

Comparison

ci_ims
## [1] 4.620330 4.831133
ci_alternative
##     2.5%    97.5% 
## 4.621390 4.834643
ci_alternative_80
##      10%      90% 
## 4.654242 4.791636

The two 95% intervals agree to about two decimal places, which is consistent with a roughly symmetric bootstrap distribution. The 80% interval is narrower because it leaves out more of each tail.