Please answer the questions below in a Google Doc as you work through this lab. It is entirely okay if you don’t have time to finish this.
To do this you can log in to RStudio Cloud and create a new project in your workspace.
Bootstrapping is a method of estimating the uncertainty in the value of a parameter from data.
Counter-intuitively, statisticians are generally more interested in parameters than statistics. What’s the difference? A statistic is an observation, for instance “I just made 60 out of 100 free throw attempts.” A parameter is something often unobservable, for instance my “true” free throw shooting percentage. There’s an important distinction between saying that I made 60% of my attempts and saying that I’m a 60% shooter.
In bootstrapping, we’re aiming to estimate the uncertainty in a parameter that we estimated by making n observations or measurements. We do this by sampling n times with replacement from the n observations.
Let’s say we’re trying to find the weight of a kangaroo. We manage to coax it onto a scale ten times, but it’s moving around and our measurements are noisy. We observe the following weights in pounds: 93 77 62 78 75 85 66 83 91 72
We can estimate the kangaroo’s true weight by averaging these values:
k_weights <- c(93, 77, 62, 78, 75, 85, 66, 83, 91, 72)
mean(k_weights)
We can use the sample() function to sample 10 weights, with replacement, from our 10 observations. Note that there’s (pseudo)randomness here, so you (probably) won’t get the same answer as your neighbor and (probably) won’t get the same answer if you run this twice.
sample(k_weights, size=10, replace=TRUE)
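If you’d like your “random” results to be reproducible (so that you, your neighbor, or your future self can get the same numbers), you can set the random seed before sampling. The seed value 2024 below is arbitrary; any number works.
set.seed(2024)   # fix the random seed so this sample is reproducible
sample(k_weights, size=10, replace=TRUE)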
We can repeat this sampling with replacement 100 times over using the replicate() function:
kangaroo_boots <- replicate(100, sample(k_weights, size=10, replace=TRUE))
and we can average each set of ten samples. This gives us 100 possible averages of 10 kangaroo weights.
kangaroo_boot_means <- apply(kangaroo_boots, 2, mean)
kangaroo_boot_means
We can plot these possible weights, or take their mean and standard deviation:
hist(kangaroo_boot_means)
mean(kangaroo_boot_means)
sd(kangaroo_boot_means)
The standard deviation of these bootstrap means gives us an estimate of the uncertainty in the true weight of the kangaroo. We can reduce some of the randomness in our procedure by simply resampling more times (don’t go nuts and crash our server/your computer).
kangaroo_boots <- replicate(1000, sample(k_weights, size=10, replace=TRUE))
kangaroo_boot_means <- apply(kangaroo_boots, 2, mean)
hist(kangaroo_boot_means)
mean(kangaroo_boot_means)
sd(kangaroo_boot_means)
This procedure gives us an estimate of the uncertainty in the kangaroo’s weight due to chance measurement errors. If there were a systematic bias in my measurements (the kangaroo’s tail always resting on the ground, for instance), bootstrapping would not account for that error.
We can construct an 80% confidence interval for the weight of the kangaroo. This involves thinking of possible weights for the kangaroo in terms of z-scores. To include the middle 80% of the curve, we want to exclude the 10% of z-scores at the low end and another 10% at the high end.
Try running the following:
qnorm(0.1); qnorm(.9)
This tells us that a z-score of -1.28 is higher than 10% of values and a z-score of +1.28 is higher than 90% of values.
I can then get the range of my 80% confidence interval by doing:
mean(kangaroo_boot_means)-1.28*sd(kangaroo_boot_means)
mean(kangaroo_boot_means)+1.28*sd(kangaroo_boot_means)
or, perhaps more elegantly, I can do this in one line:
mean(kangaroo_boot_means)+qnorm(c(0.1, 0.9))*sd(kangaroo_boot_means)
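As a cross-check on the z-score approach (this isn’t part of the procedure above, just another way to look at the same bootstrap means), you can read the 10th and 90th percentiles straight off the bootstrap means with quantile(); the two intervals should be similar, though not identical.
quantile(kangaroo_boot_means, probs=c(0.1, 0.9))   # percentile-based 80% interval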
A class of Saint Ann’s Statistics students measured their cubit lengths.
Create a vector of cubit lengths using the code below:
cubit <- c(45.7, 44, 53.3, 40.6, 42, 44.4, 47.7, 44, 42, 36.8, 43.2, 35)
We’re interested in the mean cubit length of Saint Ann’s Statistics students more generally, which might well be different from the mean of this sample of students. Try using code similar to the code we used above with the kangaroo weights to create bootstrap samples of these cubit lengths and find the mean of each sample. Careful! While there were only 10 kangaroo weights, there are 12 cubit lengths, so you’ll have to adjust your code accordingly.
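If you get stuck, here is a rough sketch of one possible adaptation (assuming 1000 resamples, as in the kangaroo example); try writing your own version before peeking.
cubit_boots <- replicate(1000, sample(cubit, size=12, replace=TRUE))   # resample 12 cubit lengths, 1000 times
cubit_boot_means <- apply(cubit_boots, 2, mean)   # average each bootstrap sample
hist(cubit_boot_means)
mean(cubit_boot_means)
sd(cubit_boot_means)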