Code-lab:: Bootstrapping

What is bootstrapping anyways?

Bootstrapping is one of many resmpling techniques where you create multiple samples from the original dataset by doing a random sampling with replacement. There are other resampling methods we can use other than bootstrapping. For example, Jackknife resampling is where each observation is systematically left out of the sample one at a time, and recalculates the statistic with this all-but-one sample. Bootstrapping was originally suggested as an improvement to the jackknife method, which limited the number of resamples to the number of observations, leading to poor performance when the sample sizes are smaller. The upside of jackknife compared to bootstrapping is that it is always reproducible.

The name apparently comes from the the phrase “pull yourself up by your bootstraps” (which itself come from a late-1800s physics textbook that included the question, “Why can not a man lift himself by pulling up on his bootstraps?” If you are interested… https://uselessetymology.com/2019/11/07/the-origins-of-the-phrase-pull-yourself-up-by-your-bootstraps/), widely used in situations emphasizing effort and determination. Hence this nifty method is called bootstrapping, as it allows us to do a lot with not much.

Enough of etymology! We are here to code. Below, I will show you how bootstrapping is done, through a comparison between for loop approach and purrr::map approach. Then, I will present you with an exercise that will hopefully cement our understanding of what we are doing when we use bootstrapping. I also wanted to show you various applications of bootstrapping, but I ran out of time…

As always, let’s set up the environment.

library(tidyverse)
library(magrittr)
library(broom)
library(boot)
library(ggridges)
library(tidymodels)

Before diving in, let’s first make sure we are good with sampling in R. sample_n() and sample_frac are dplyr ways of doing sampling, but apparently it has been superseded in favour of slice_sample(). I will introduce all three, since I’ve seen the older versions still being used frequently. Let’s sample!

You can sample by using the number of rows in sample_n(), while in sample_frac() you can sample with the fraction of the rows to select.

# Our toy tibble
df <- tibble(
  x = 1:10, 
  w = c(0.1, 0.2, 0.3, 4, 5, 6, 70, 80, 90, 1)
  )

# sample_n()
sample_n(df, 3)
## # A tibble: 3 × 2
##       x     w
##   <int> <dbl>
## 1     9  90  
## 2     2   0.2
## 3     8  80
# sample_n(), with replacement
sample_n(df, 8, replace = TRUE)
## # A tibble: 8 × 2
##       x     w
##   <int> <dbl>
## 1     1   0.1
## 2     2   0.2
## 3    10   1  
## 4     9  90  
## 5     6   6  
## 6     4   4  
## 7     4   4  
## 8     1   0.1
# sample_frac()
sample_frac(df, 0.5)
## # A tibble: 5 × 2
##       x     w
##   <int> <dbl>
## 1    10   1  
## 2     8  80  
## 3     9  90  
## 4     2   0.2
## 5     1   0.1
# sample_frac() for shuffling data 
sample_frac(df, 1) ## Look how this can be a way to shuffle
## # A tibble: 10 × 2
##        x     w
##    <int> <dbl>
##  1     3   0.3
##  2    10   1  
##  3     7  70  
##  4     2   0.2
##  5     6   6  
##  6     9  90  
##  7     5   5  
##  8     4   4  
##  9     8  80  
## 10     1   0.1
# sample_frac(), with replacement
sample_frac(df, 0.9, replace = TRUE)
## # A tibble: 9 × 2
##       x     w
##   <int> <dbl>
## 1     4   4  
## 2     4   4  
## 3     7  70  
## 4     5   5  
## 5     1   0.1
## 6     9  90  
## 7     3   0.3
## 8     8  80  
## 9     4   4
# sample_n() -> slice_sample()
slice_sample(df, n = 3)
## # A tibble: 3 × 2
##       x     w
##   <int> <dbl>
## 1     1   0.1
## 2    10   1  
## 3     6   6
slice_sample(df, n = 3, replace = TRUE)
## # A tibble: 3 × 2
##       x     w
##   <int> <dbl>
## 1     5   5  
## 2     8  80  
## 3     1   0.1
# sample_frac() -> slice_sample
slice_sample(df, prop = 0.5)
## # A tibble: 5 × 2
##       x     w
##   <int> <dbl>
## 1     4   4  
## 2     5   5  
## 3     2   0.2
## 4     9  90  
## 5     1   0.1
slice_sample(df, prop = 0.5, replace = TRUE)
## # A tibble: 5 × 2
##       x     w
##   <int> <dbl>
## 1     5   5  
## 2     3   0.3
## 3     6   6  
## 4    10   1  
## 5    10   1

Let’s revisit an old problem that I once brought to class. I was working on a project (alas, I am still working on it) where I had three treatment arms, and wished to plot the bootstrapped results for each arm, by party ID. My lovely co-author had used a for loop approach, and Tom had blessed us with the magic of map. However, I remember thinking that the code looked very complex, even as I marveled at its grace. Back then, we were not the map wizards we are now. Upon revisiting the code, you might share the feeling I had - it is so clear how it works!

Just for pedagogical reasons, let me present both approaches.

# Load data
us_rival <- "https://github.com/hyunso-christy-oh/io_support/raw/refs/heads/main/us_rival.RDS" %>% 
  url %>% 
  gzcon %>% 
  readRDS


us_ally <- "https://github.com/hyunso-christy-oh/io_support/raw/refs/heads/main/us_rival.RDS" %>% 
  url %>% 
  gzcon %>% 
  readRDS

A for loop approach looks something like below:

# We will first write a bootstrapping function
boot_treat <- function(rival_data,B){
  
  boot_res_rival <- rep(NA, B)
  
  for (i in seq_len(B)){
    # Resample and fit for rival
    sample_rival <- sample(1:nrow(rival_data), nrow(rival_data), replace = TRUE)
    temp_rival <- rival_data[sample_rival, ]
    model_rival <- lm(dv ~ rival, data = temp_rival)
    coeff_rival <- coef(model_rival)["rival"]
    boot_res_rival[i] <- coeff_rival
  }
  
  return(list(boot_rival=boot_res_rival, ATE_rival=mean(boot_res_rival, na.rm=TRUE)))
}

# We will do separate bootstrapping for PID by subsetting data

rep <- boot_treat(subset(us_rival, pid == 1), B = 1000)

dem <- boot_treat(subset(us_rival, pid == 2), B = 1000)

ind <- boot_treat(subset(us_rival, pid == 3), B = 1000)

# We can organize this into a list 
results_list <- list(
  rep = c(rep$boot_rival), 
  dem = c(dem$boot_rival),
  ind = c(ind$boot_rival)
)

A purrry approach to achieve the same results will look like below:

us_rival_2 <- us_rival %>% 
  mutate(
    pid_lab = pid %>% 
  case_match(
    1 ~ "Republican",
    2 ~ "Democrat",
    3 ~ "Independent",
    .default = NA 
  ) %>% 
  factor(c("Democrat", "Independent", "Republican"))
  )


us_rival_n <- us_rival_2 %>% 
  group_by(pid_lab) %>% 
  nest %>% 
  filter(
    pid_lab %>% 
      is.na %>% 
      not
  )

us_rival_n$party_boots <- us_rival_n$data %>% 
  map(
    \(i)
    1:1000 %>% 
      map(
        \(j)
        i %>% 
          sample_frac(size = 1, replace = T) %$% 
          lm(
            dv ~ rival
          ) %>% 
          tidy(conf.int = T) %>% 
          slice(2)
      ) %>% 
      list_rbind
  ) 

t1 <- map2(
  us_rival_n$pid_lab,
  us_rival_n$party_boots,
  \(i, j)
  j %>%
    mutate(
      party = i
    )
) %>%
  list_rbind

t1_mu <- t1 %>% 
  group_by(party) %>% 
  select(estimate, party) %>% 
  summarise(
    mu = estimate %>% mean
  )

We can plot the above results of bootstrapping:

t1 %>% 
  ggplot(
    aes(
      x = estimate, 
      y = party,
      fill = party
    )
  ) +
  geom_density_ridges(
    scale = .8,
    alpha = 0.6
  ) +
  geom_point(
    aes(
      mu, y = party
    ),
    shape = 21,
    size = 12,
    fill = "grey97",
    data = t1_mu
  ) +
  geom_text(
    aes(
      mu, party, label = mu %>% round(2)
    ),
    data = t1_mu
  ) +
  geom_vline(
    aes(
      xintercept = 0
    ),
    linetype = "dashed"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    x = "",
    y = "",
  ) +
    scale_fill_brewer(
      palette = "Set3"
  )
## Warning in plot_theme(plot): The `x` theme element is not defined in the
## element hierarchy.
## Warning in plot_theme(plot): The `y` theme element is not defined in the
## element hierarchy.

Cool!

Just to cement our understanding of bootstrapping, we will do a little exercise. We are going to use a toy dataset that contains daily average temperatures in Rapid City, from 1995 to 2011. We are going to plot the results of 100 bootstrapping efforts, and see how it stands with regards to the ground truth.

# Loading data
dt <- read_csv("rapidcity.csv")
## Rows: 6159 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): Year, Month, Day, Temp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We will first compare the real population mean, with a sampling mean.

# Population mean
dt %>% 
  summarize(
    avg_temp = Temp %>% 
      mean
  )
## # A tibble: 1 × 1
##   avg_temp
##      <dbl>
## 1     47.3
# Mean from a sample size of 50
sp1 <- dt %>% 
  slice_sample(n = 50)

sp1 %>% 
  summarize(
    avg_sp1_temp = Temp %>% 
      mean
  )
## # A tibble: 1 × 1
##   avg_sp1_temp
##          <dbl>
## 1         46.9

Obviously, our sample mean is off compared to the population mean. What if we bootstrapped this sample?

# Bootstrap our sp1 (slice_sample() only works on data frames)
sp1 <- sp1 %>% 
  as_tibble()

boot1_results <-  1:1000 %>% 
      map(
        ~sp1 %>% 
          slice_sample(prop = 1, replace = T) %>% 
          summarise(
            avg_boot1_temp = Temp %>% 
              mean
          ) 
        ) %>% 
  list_rbind

# Did our bootstrapping 'work'?
con_boot1 <- tibble(
  estimate = mean(boot1_results$avg_boot1_temp), # Mean of bootstrap samples
  conf.low = quantile(boot1_results$avg_boot1_temp, 0.025),
  conf.high = quantile(boot1_results$avg_boot1_temp, 0.975)
)

Let’s plot 100 confidence intervals from 100 bootstrapping efforts and check the results. We can plot the results.

# Bootstrapping a hundred times

boot <- map_dfr(
  1:100,
  ~{
    sample_for_boot <- dt %>% 
      slice_sample(n = 50)
    
    boot_results <- map_dbl(
      1:1000,
      ~sample_for_boot %>% 
        slice_sample(prop = 1, replace = TRUE) %>% 
        summarise(
          avg_boot_temp = mean(Temp) 
        ) %>% 
        pull(avg_boot_temp)
    )
    tibble(
      sample_no = .x,
      estimate = mean(boot_results),
      conf.low = quantile(boot_results, 0.025),
      conf.high = quantile(boot_results, 0.975),
      contains_truth = ifelse(
        conf.low %>% 
          is_weakly_less_than(47.28159) &
          conf.high %>% 
          is_weakly_greater_than(47.28159), 
        "yes",
        "no"
      )
    )
  }
)

We can plot the results.

boot %>% 
  ggplot() +
  geom_errorbar(
    aes(
      x = sample_no,
      ymin = conf.low,
      ymax = conf.high,
      color = contains_truth
    )
  ) +
  geom_hline(
    aes(
      yintercept = 47.28159
    ), 
    color = "grey30"
  ) +
  theme_minimal() +
  labs(
    title = "Bootstrapping 100 times",
    x = "",
    y = ""
  ) +
    scale_color_manual(
    name = "Did we capture the truth?",  # Correct way to rename legend
    values = c("yes" = "#66c2a5", "no" = "#fc8d62")  # Optional: specify colors
  )

I also want to briefly introduce the tidymodels package. The rsample package, which is one of the core packages in tidymodels, come with the bootstraps() function. I will use the Fish Market data that is a record of 7 common different fish species in fish market sales. I mean, why not.

# Load data
fish_dt <- "https://github.com/Ankit152/Fish-Market/raw/refs/heads/main/Fish.csv" %>% 
  url %>% 
  gzcon %>% 
  read_csv
## Rows: 159 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Species
## dbl (6): Weight, Length1, Length2, Length3, Height, Width
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# See what the data looks like
fish_dt %>% 
  glimpse
## Rows: 159
## Columns: 7
## $ Species <chr> "Bream", "Bream", "Bream", "Bream", "Bream", "Bream", "Bream",…
## $ Weight  <dbl> 242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500, 50…
## $ Length1 <dbl> 23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5, 28…
## $ Length2 <dbl> 25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7, 31…
## $ Length3 <dbl> 30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2, 36…
## $ Height  <dbl> 11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.1795,…
## $ Width   <dbl> 4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6900…
# Using bootstraps() to do make resampled datasets 
boots_fish <- bootstraps(fish_dt, times = 250, apparent = TRUE)

What do we get as the results?

# To access each resampled dataset 
boots_fish$splits[[1]] %>% 
  analysis
## # A tibble: 159 × 7
##    Species Weight Length1 Length2 Length3 Height Width
##    <chr>    <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl>
##  1 Perch     85      17.8    19.6    20.8   5.14  3.04
##  2 Perch   1000      40.2    43.5    46    12.6   8.14
##  3 Pike     540      40.1    43      45.8   7.79  5.13
##  4 Pike     300      34.8    37.3    39.8   6.29  4.02
##  5 Perch    820      37.1    40      42.5  11.1   6.63
##  6 Perch    685      34      36.5    39    10.9   6.86
##  7 Pike    1550      56      60      64     9.6   6.14
##  8 Pike     300      34.8    37.3    39.8   6.29  4.02
##  9 Smelt     19.9    13.8    15      16.2   2.93  1.88
## 10 Pike     770      44.8    48      51.2   7.68  5.38
## # ℹ 149 more rows
#id has 251 rows: 250 for the resampled datasets and 1 for `Apparent` (original dataset)
boots_fish %>% 
  slice_tail(n = 5)
## # A tibble: 5 × 2
##   splits            id          
##   <list>            <chr>       
## 1 <split [159/54]>  Bootstrap247
## 2 <split [159/61]>  Bootstrap248
## 3 <split [159/60]>  Bootstrap249
## 4 <split [159/57]>  Bootstrap250
## 5 <split [159/159]> Apparent

Let’s try using above result for a bootstrapping exercise we might plausibly do.

# OLS rules!
boots_fish <- boots_fish %>%    
  mutate(
    model = splits %>% 
      map(
        \(i)
        lm(Height ~ sqrt(Weight), analysis(i))
      )
  )

# We can get information on confidence intervals using broom::tidy.
boots_fish <- boots_fish %>% 
  mutate(
    conf_info = model %>% 
      map(
        \(i)
        i %>% 
          tidy(conf.int = TRUE)
      )
  )

# We can mutate the dataset so we have all the ingredients for plotting
boots_fish <- boots_fish %>% 
  mutate(
    estimate = conf_info %>% 
      map_dbl(
        \(i)
        
        # i <- boots_fish$conf_info[[1]]
        
        i %>% 
          filter(
            term %>% 
              equals("sqrt(Weight)")
          ) %>% 
          pull(estimate)
        
      ),
    conf.low = conf_info %>% 
      map_dbl(
        \(i)
        
        i %>% 
          filter(
            term %>% 
              equals("sqrt(Weight)")
          ) %>% 
          pull(conf.low)
      ),
    conf.high  = conf_info %>% 
      map_dbl(
        \(i)
        
        i %>% 
          filter(
            term %>% 
              equals("sqrt(Weight)")
          ) %>% 
          pull(conf.high)
      ),
    intercept = conf_info %>% 
      map_dbl(
        \(i)
        
        i %>% 
          filter(
            term %>% 
              equals("(Intercept)")
          ) %>% 
          pull(estimate)
      )
  )

Now, we are ready to plot. First of all, I will give you an overview of our overall bootstrapping approach.

# Plotting the coefficients

boots_fish %>% 
  ggplot() +
  geom_errorbar(
    aes(
      x = id,
      ymin = conf.low,
      ymax = conf.high
    ),
    data = boots_fish %>% 
      filter(
        id %>% 
          equals("Apparent") %>% 
          not
      ),
    color = "grey30"
  ) +
    geom_errorbar(
    aes(
      x = id,
      ymin = conf.low,
      ymax = conf.high
    ),
    data = boots_fish %>% 
      filter(
        id %>% 
          equals("Apparent") 
      ),
    color = "#1b9e77"
  ) +
  geom_hline(
    aes(
      yintercept = boots_fish %>% 
        filter(
          id %>% 
            equals("Apparent")
        ) %>% 
        pull(estimate)
    ), 
    color = "#7570b3"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank()
  ) + 
  labs(
    y = "Coefficient estimates",
    x = ""
  )

We can also show how the regression lines look. The deep purple is the regression from our original dataset.

# Plotting regression lines 
boots_fish$model[[1]] %>% 
  summary
## 
## Call:
## lm(formula = Height ~ sqrt(Weight), data = analysis(i))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0753 -1.3009 -0.4642  2.4819  4.6489 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.98203    0.42586   4.654 6.87e-06 ***
## sqrt(Weight)  0.39861    0.02177  18.311  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.545 on 157 degrees of freedom
## Multiple R-squared:  0.6811, Adjusted R-squared:  0.679 
## F-statistic: 335.3 on 1 and 157 DF,  p-value: < 2.2e-16
boots_fish %>% 
  ggplot() +
  geom_point(
    data = fish_dt, 
    aes(
      x = sqrt(Weight),
      y = Height
    ),
    alpha = 0.3
  ) +
  geom_abline(
    aes(
      slope = estimate, 
      intercept = intercept, 
      group = id  
    ),
    alpha = 0.05,  
    color = "#7570b3"
  ) +
  geom_abline(
    aes(
      slope = boots_fish %>% 
        filter(
          id %>% 
            equals("Apparent")
          ) %>% 
            pull(estimate),
          intercept = boots_fish %>% 
            filter(
              id %>% 
                equals("Apparent")
              ) %>% 
              pull(intercept)
            ),
          color = "#7570b3",
          alpha = 1
        ) +
  theme_minimal() +
  labs(
    title = "Bootstrapped Regression Lines",
    x = "sqrt(Weight)",
    y = "Height"
  )