Bootstrapping v. Alternatives

Jared Cross

3/15/2021

Bootstrapping Revisited

Bootstrapping is a way of estimating the uncertainty in a statistic from the data itself. It’s useful because the same procedure works for many different kinds of uncertainty: in means and medians, in correlations, and (as we’ll soon see) in best-fit lines and all sorts of other things.

There are also formulas for estimating uncertainty, each of which works in a particular situation. These formulas contain bits of insight, and we can compare the results they give to the results we get from bootstrapping.

Standard Error in the Mean

The uncertainty (standard error) in the mean of a set of values is equal to the standard deviation in the values divided by the square root of the number of data points.

\[s_{\bar{x}} = \frac{s}{\sqrt{n}}\] (where s is the standard deviation and n is the number of data points)

Kangaroo Test

Using a formula:

k_weights <- c(93, 77, 62, 78, 75, 85, 66, 83, 91, 72)
sd(k_weights)/sqrt(10)
## [1] 3.186081

Using bootstrapping:

sample_means <- 
  colMeans(replicate(500, sample(k_weights, 
                          size=10, replace=TRUE)))

sd(sample_means)
## [1] 3.064381
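
The same resampling recipe works for statistics other than the mean. As a minimal sketch, the helper below (bootstrap_se is a hypothetical name, not something from these slides or from base R) wraps the steps above so the statistic can be swapped out. The seed value and the choice of 500 replicates are assumptions made only so the run is repeatable.

```r
# Hypothetical helper (not part of the original slides): bootstrap the
# standard error of any statistic computed on a numeric vector.
bootstrap_se <- function(x, stat = mean, reps = 500) {
  # Each replicate resamples x with replacement at its original size,
  # then applies the statistic to the resample.
  boot_stats <- replicate(reps, stat(sample(x, size = length(x), replace = TRUE)))
  sd(boot_stats)
}

set.seed(1)  # arbitrary seed, assumed here for repeatability

k_weights <- c(93, 77, 62, 78, 75, 85, 66, 83, 91, 72)

# Formula-based standard error of the mean
se_formula <- sd(k_weights) / sqrt(length(k_weights))

# Bootstrap standard errors for the mean and for the median
se_boot_mean   <- bootstrap_se(k_weights, mean)
se_boot_median <- bootstrap_se(k_weights, median)
```

The bootstrap estimate for the mean should land near the formula value, while the median gets an uncertainty estimate even though these slides give no formula for it.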

Insights from the Formula

\[s_{\bar{x}} = \frac{s}{\sqrt{n}}\]

  1. The more spread out the kangaroo measurements, the more uncertainty there is in the true weight of the kangaroo.

  2. The more measurements we take, the less uncertainty there is in the true kangaroo weight.

  3. BUT! There are diminishing returns. We need to take four times as many measurements to cut the uncertainty in half.
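
The diminishing returns in point 3 can be checked with a little arithmetic. This sketch assumes a standard deviation of 10 (a made-up value chosen only for illustration) and quadruples the sample size twice.

```r
s <- 10                 # assumed standard deviation, for illustration only
n <- c(10, 40, 160)     # each sample size is four times the previous one

se <- s / sqrt(n)       # standard error of the mean at each sample size
se

# Ratio of each standard error to the one before it
ratios <- se[-1] / se[-length(se)]
ratios
```

Each ratio comes out to one half: four times the measurements buys only half the uncertainty.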

Standard Error in a Correlation

\[s_r = \sqrt{\frac{1-r^2}{n-2}}\]

(where r is the correlation and n is the number of data points in each vector)

Cubit and Foot Test

cubit <- c(45.7, 44, 53.3, 40.6, 42, 44.4, 
           47.7, 44, 42, 36.8, 43.2, 35)

foot <- c(25.4, 28, 24, 22.86, 24, 25.4, 
          29.9, 26.5, 26, 23.6, 24.9, 23.5)

r <- cor(cubit, foot)
r
## [1] 0.4087348
n <- 12

sqrt((1-r^2)/(n-2))
## [1] 0.2886063

Compared to Bootstrapping

sqrt((1-r^2)/(n-2))
## [1] 0.2886063
sample_correlations <- replicate(1000, {
  # resample student indices so each cubit stays paired with its foot
  students <- sample.int(12, size = 12, replace = TRUE)
  cor(cubit[students], foot[students])
})

sd(sample_correlations)
## [1] 0.276099

Insights from the Formula

\[s_r = \sqrt{\frac{1-r^2}{n-2}}\]

  1. You can’t really talk about correlations until you have at least 3 data points. Two points MUST lie on a line. If the third point lies on (or near) that same line, that’s really one data point suggesting a correlation, not three. Hence the presence of n-2 in this formula.

  2. There’s less uncertainty when we observe correlations near 1 or -1 and more uncertainty when we observe correlations near 0.
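
Both insights can be seen by plugging numbers into the formula. In this sketch, se_r is a hypothetical helper name and the r values are arbitrary illustrations. Only n = 12 comes from the cubit and foot data.

```r
# Hypothetical helper: the standard error formula for a correlation
se_r <- function(r, n) sqrt((1 - r^2) / (n - 2))

# Insight 2: with n fixed at 12, uncertainty shrinks as |r| approaches 1
r_values <- c(0, 0.5, 0.9, 0.99)
se_n12 <- se_r(r_values, n = 12)
se_n12

# Insight 1: with only two points the denominator n - 2 is zero,
# so the formula gives an infinite standard error
se_two_points <- se_r(0.5, n = 2)
se_two_points
```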

Standard Error in a Proportion

\[s = \sqrt{\frac{p(1-p)}{n}}\] (where p is rate of “success” and n is the number of trials)

Example: We see someone hit 30 of 50 free throws (60% or p=0.6). What is the uncertainty in their true free throw shooting percentage?

sqrt(0.6*(1-0.6)/50)
## [1] 0.06928203

What if we used Bootstrapping?

shots <- c(rep(0, 20), rep(1, 30))
mean(shots)
## [1] 0.6
sample_props <- 
  colMeans(replicate(500, sample(shots, 
                          size=50, replace=TRUE)))

sd(sample_props)
## [1] 0.06681383

Insights from the Formula

\[s = \sqrt{\frac{p(1-p)}{n}}\]

  1. Once again, uncertainty shrinks in proportion to 1 over the square root of n. This is a rule-of-thumb worth remembering!

  2. There is less uncertainty when p is near 0 or 1 and more when p is near 0.5.
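
Both points can be checked numerically. In this sketch, se_p is a hypothetical helper name and the grid of p values is an arbitrary illustration. The p = 0.6, n = 50 case is the free throw example from above.

```r
# Hypothetical helper: the standard error formula for a proportion
se_p <- function(p, n) sqrt(p * (1 - p) / n)

# Insight 2: with n fixed at 50, uncertainty peaks at p = 0.5
p_values <- c(0.1, 0.3, 0.5, 0.7, 0.9)
se_n50 <- se_p(p_values, n = 50)
se_n50

# Insight 1: quadrupling the number of shots halves the uncertainty
se_50_shots  <- se_p(0.6, n = 50)
se_200_shots <- se_p(0.6, n = 200)
se_200_shots / se_50_shots
```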