Please answer the 4 questions below in a Google Doc as you work through this lab.
You can log in to RStudio here or work locally. If you work locally, you will need to install your own packages, but if you're logged in as yourself, you should only need to do this once.
Bootstrapping is a method of estimating the uncertainty in the value of a parameter from data.
Counter-intuitively, statisticians are generally more interested in parameters than statistics. What's the difference? A statistic is an observation, for instance, "I just made 60 out of 100 free throw attempts." A parameter is something often unobservable, for instance, my "true" free throw shooting percentage. There's an important distinction between saying that I made 60% and saying that I'm a 60% shooter. Bernie Sanders's polling average in Iowa is a statistic. His true standing among those who will caucus in Iowa is the parameter we're interested in. Statistics are really just a means for estimating parameters.
In bootstrapping, we aim to estimate the uncertainty in a parameter that we estimated from n observations or measurements. We do this by repeatedly sampling n times, with replacement, from those n observations.
Let's say we're trying to find the weight of a kangaroo. We manage to coax it onto a scale ten times, but it's moving around and our measurements are noisy. We observe the following weights in pounds: 93 77 62 78 75 85 66 83 91 72
We can estimate the kangaroo’s true weight by averaging these values:
k_weights <- c(93, 77, 62, 78, 75, 85, 66, 83, 91, 72)
mean(k_weights)
We can use the sample() function to sample 10 weights, with replacement, from our 10 observations. Note that there's (pseudo)randomness here, so you (probably) won't get the same answer as your neighbor and (probably) won't get the same answer if you do this twice.
sample(k_weights, size=10, replace=TRUE)
We can repeat this sampling with replacement 100 times over using the replicate() function
replicate(100, sample(k_weights, size=10, replace=TRUE))
and we can average each set of ten samples. This gives us 100 possible weights for the kangaroo.
colMeans(replicate(100, sample(k_weights, size=10, replace=TRUE)))
We can plot these possible weights and take their mean and standard deviation.
sample_means <- colMeans(replicate(100, sample(k_weights, size=10, replace=TRUE)))
hist(sample_means)
mean(sample_means)
sd(sample_means)
The standard deviation of these sample means gives us an estimate of the uncertainty in the kangaroo's true weight. We can reduce some of the randomness of our procedure by simply sampling with replacement more times (don't go nuts and crash our server/your computer).
sample_means <- colMeans(replicate(1000, sample(k_weights, size=10, replace=TRUE)))
hist(sample_means)
mean(sample_means)
sd(sample_means)
Is it plausible that this kangaroo really weighs 80 pounds?
The first time we weighed this kangaroo, we recorded a weight of 93 pounds. Is it plausible that this kangaroo really weighs 93 pounds?
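One way to approach these two questions is to see where 80 and 93 pounds fall relative to the spread of the bootstrapped means. As a minimal sketch (assuming you still have sample_means from the code above), the quantile() function gives the middle 95% of the bootstrapped means:
# Middle 95% of the bootstrapped mean weights (a percentile interval)
quantile(sample_means, c(0.025, 0.975))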
This procedure gives us an estimate of the uncertainty in the kangaroo's weight due to chance measurement errors. If there were a systematic bias in my measurements (the kangaroo's tail was always resting on the ground, for instance), bootstrapping will not account for that error. Similarly, I can use bootstrapping to estimate the uncertainty in polling averages if they deviate from the truth in chance/fluky ways. If polls are systematically under- or overestimating the support of one candidate, our statistical tomfoolery can't reveal that truth.
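To see why, here is a small illustration (the 5-pound offset is hypothetical, not from our data): if every measurement were 5 pounds too heavy, the bootstrap distribution would simply shift by 5 pounds, and its standard deviation would look just as reassuring as before.
# Hypothetical: every measurement reads 5 pounds too heavy
biased_weights <- k_weights + 5
biased_means <- colMeans(replicate(1000, sample(biased_weights, size=10, replace=TRUE)))
mean(biased_means)  # shifted up by about 5 pounds
sd(biased_means)    # essentially unchanged -- the bias is invisible to the bootstrap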
Let's try this again, but using Bernie Sanders's polling results in Iowa in 2020. First, let's collect those results.
library(readr)
polls_url <- "https://projects.fivethirtyeight.com/polls-page/president_primary_polls.csv"
polls <- read_csv(polls_url)
library(dplyr)
dem_primary_polls <- polls %>% filter(stage=="primary", party=="DEM")
dem_primary_polls <- dem_primary_polls %>%
  mutate(start_date=as.Date(start_date, "%m/%d/%y"),
         end_date=as.Date(end_date, "%m/%d/%y"),
         create_date=as.Date(created_at, "%m/%d/%y"))
sanders_2020_poll_results <- dem_primary_polls %>%
  filter(end_date >= "2020/01/01", answer=="Sanders", state=="Iowa") %>%
  select(pct)
Let’s find the number of polls and the polling average.
num.polls <- length(sanders_2020_poll_results[[1]])
num.polls
mean.polls <- mean(sanders_2020_poll_results[[1]])
mean.polls
Next, let's do as we did with the kangaroo weights and figure out the uncertainty.
sample_means <- colMeans(replicate(1000, sample(sanders_2020_poll_results[[1]], size=num.polls, replace=TRUE)))
hist(sample_means)
mean(sample_means)
sd(sample_means)
Just during the month of January, over 90,000 people have been asked about Sanders. The standard deviation we calculated above is several times higher than the uncertainty we would expect based on the formula for the standard deviation of a binomial proportion, \(\sqrt{p(1-p)/n}\).
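As a rough back-of-the-envelope check (assuming n of about 90,000 respondents, per the figure above, and plugging in the polling average for p):
# Expected standard deviation of a binomial proportion, in percentage points
n <- 90000               # approximate total respondents (assumption from the text)
p <- mean.polls / 100    # polling average as a proportion
100 * sqrt(p * (1 - p) / n)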
Why is it so much higher? What is the difference between these two standard deviations?
Do you think the standard deviation in Sanders's support calculated using bootstrapping is higher or lower than the true uncertainty in how he will perform in the Iowa caucus? Why? Please explain your thinking as fully as possible (are there other sources of uncertainty to consider?).