Jared Cross
3/15/2021
Bootstrapping is a way of estimating the uncertainty in data from the data itself. It’s useful because we can estimate many different types of uncertainty this way: uncertainty in means, medians, correlations, and (as we’ll soon see) best fit lines and all sorts of other things.
There are also formulas we can use to estimate uncertainty, each of which works in particular situations. These formulas contain bits of insight, and we can compare the results they give to our bootstrapping results.
The uncertainty (standard error) in the mean of a set of values is equal to the standard deviation in the values divided by the square root of the number of data points.
\[s_{\bar{x}} = \frac{s}{\sqrt{n}}\] (where s is the standard deviation and n is the number of data points)
Using a formula:
## [1] 3.186081
Using bootstrapping:
## [1] 3.064381
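Here’s a minimal sketch of both computations in R, using a made-up vector of kangaroo weight measurements:

# hypothetical kangaroo weight measurements (kg) -- not the original data
weights <- c(28.4, 31.2, 29.8, 33.1, 30.5, 27.9, 32.2, 30.8)

# formula: standard deviation divided by the square root of n
sd(weights) / sqrt(length(weights))

# bootstrap: resample with replacement many times and take the
# standard deviation of the resampled means
sample_means <- replicate(1000, mean(sample(weights, replace = TRUE)))
sd(sample_means)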
\[s_{\bar{x}} = \frac{s}{\sqrt{n}}\]
The more spread out the kangaroo measurements, the more uncertainty there is in our estimate of the kangaroo’s true weight.
The more measurements we take, the less uncertainty there is in that estimate.
BUT! There are diminishing returns. We need to take four times as many measurements to cut the uncertainty in half.
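In symbols:
\[\frac{s}{\sqrt{4n}} = \frac{1}{2} \cdot \frac{s}{\sqrt{n}}\]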
\[s_r = \sqrt{\frac{1-r^2}{n-2}}\]
(where r is the correlation and n is the number of data points in each vector)
# cubit (forearm) and foot lengths for 12 students
cubit <- c(45.7, 44, 53.3, 40.6, 42, 44.4,
           47.7, 44, 42, 36.8, 43.2, 35)
foot <- c(25.4, 28, 24, 22.86, 24, 25.4,
          29.9, 26.5, 26, 23.6, 24.9, 23.5)
r <- cor(cubit, foot)
r
## [1] 0.4087348
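Using the formula (a sketch of the arithmetic, with n = 12):

n <- 12
sqrt((1 - r^2) / (n - 2))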
## [1] 0.2886063
# bootstrap: resample the 12 students with replacement and
# recompute the correlation each time
sample_correlations <- replicate(1000, {
  students <- sample.int(12, size = 12, replace = TRUE)
  cor(cubit[students], foot[students])
})
sd(sample_correlations)
## [1] 0.276099
\[s_r = \sqrt{\frac{1-r^2}{n-2}}\]
You can’t really talk about correlations until you have at least 3 data points. Two points MUST lie on a line. If the third point lies on (or near) that same line, that’s really one data point suggesting a correlation, not three. Thus the presence of n - 2 in this formula.
There’s less uncertainty when we observe correlations near 1 or -1 and more uncertainty when we observe correlations near 0.
\[s_p = \sqrt{\frac{p(1-p)}{n}}\] (where p is the rate of “success” and n is the number of trials)
Example: We see someone hit 30 of 50 free throws (60%, or p = 0.6). What is the uncertainty in their true free throw shooting percentage?
## [1] 0.06928203
## [1] 0.6
## [1] 0.06681383
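Here’s a minimal sketch of both computations in R, assuming the 50 shots are coded as 1s (makes) and 0s (misses):

# formula
p <- 0.6
n <- 50
sqrt(p * (1 - p) / n)

# bootstrap: resample the 50 shots with replacement
shots <- c(rep(1, 30), rep(0, 20))
mean(shots)  # observed rate
sample_rates <- replicate(1000, mean(sample(shots, replace = TRUE)))
sd(sample_rates)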
\[s_p = \sqrt{\frac{p(1-p)}{n}}\]
Once again, uncertainty decreases with the square root of n. This is a rule-of-thumb worth remembering!
There is less uncertainty when p is near 0 or 1 and more when p is near 0.5.
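This follows from the numerator of the formula: p(1-p) peaks at p = 0.5 and shrinks toward 0 as p approaches 0 or 1:
\[p(1-p) \leq 0.5 \times 0.5 = 0.25\]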