Please answer the questions below in a Google Doc as you work through this lab.
To run the code in this lab, you can log in to RStudio Cloud and create a new project in your workspace.
Bootstrapping is a method of estimating the uncertainty in the value of a parameter from data.
Counter-intuitively, statisticians are generally more interested in parameters than in statistics. What’s the difference? A statistic is something computed from observations, for instance “I just made 60 out of 100 free throw attempts”. A parameter is something often unobservable, for instance my “true” free throw shooting percentage. There’s an important distinction between saying that I made 60% and saying that I’m a 60% shooter. The correlation between the self-measured cubit lengths and foot lengths of a group of Statistics students is a statistic. The true correlation is a parameter.
In bootstrapping, we’re aiming to estimate the uncertainty in our estimate of a parameter, where that estimate comes from n observations or measurements. We do this by sampling n times with replacement from the n observations, recomputing the estimate on that resample, and repeating the process many times.
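To make the distinction concrete, here is a small simulation sketch. It assumes a hypothetical shooter whose true make probability is exactly 0.6 (the parameter, which in real life we would never know) and simulates five sessions of 100 attempts each with rbinom(). The variable names are illustrative.

```r
# Assumption: the shooter's "true" make probability is 0.6 (the parameter).
true_pct <- 0.6

# Simulate 5 sessions of 100 free throw attempts each.
# rbinom() returns the number of makes in each session.
makes <- rbinom(5, size=100, prob=true_pct)

# The observed shooting percentage in each session (a statistic).
observed_pct <- makes / 100
observed_pct
```

The parameter stays fixed at 0.6 the whole time, while the observed percentages vary from session to session.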
Let’s say we’re trying to find the weight of a kangaroo. We manage to coax it onto a scale ten times, but it’s moving around and our measurements are noisy. We observe the following weights in pounds: 93 77 62 78 75 85 66 83 91 72
We can estimate the kangaroo’s true weight by averaging these values:
k_weights <- c(93, 77, 62, 78, 75, 85, 66, 83, 91, 72)
mean(k_weights)
We can use the sample() function to sample 10 weights, with replacement, from our 10 observations. Note that there’s (pseudo) randomness here, so you (probably) won’t get the same answer as your neighbor and (probably) won’t get the same answer if you do this twice.
sample(k_weights, size=10, replace=TRUE)
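If you ever do want the same "random" sample twice (for instance, to compare results with a neighbor), one option is to fix the seed of R's random number generator with set.seed() before sampling. The seed value 42 below is an arbitrary choice, and first_draw and second_draw are names introduced for illustration.

```r
k_weights <- c(93, 77, 62, 78, 75, 85, 66, 83, 91, 72)

# Fix the random number generator's starting point (42 is arbitrary).
set.seed(42)
first_draw <- sample(k_weights, size=10, replace=TRUE)

# Resetting the same seed reproduces the same sample.
set.seed(42)
second_draw <- sample(k_weights, size=10, replace=TRUE)

# Check that the two draws match exactly.
identical(first_draw, second_draw)
```

Without the calls to set.seed(), the two draws would almost certainly differ.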
We can repeat this sampling with replacement 100 times over using the replicate() function:
replicate(100, sample(k_weights, size=10, replace=TRUE))
and we can average each set of ten samples. This gives us 100 possible weights for the kangaroo.
colMeans(replicate(100, sample(k_weights, size=10, replace=TRUE)))
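To see why colMeans() is the right function here, it may help to look at the shape of what replicate() returns. In the sketch below, boot_samples is a name introduced for illustration.

```r
k_weights <- c(93, 77, 62, 78, 75, 85, 66, 83, 91, 72)

# Each call to sample() returns 10 weights; replicate() stacks the
# 100 results side by side as the columns of a matrix.
boot_samples <- replicate(100, sample(k_weights, size=10, replace=TRUE))

dim(boot_samples)     # 10 rows (weights per sample), 100 columns (samples)
boot_samples[, 1]     # the first bootstrap sample

# One mean per column, so one mean per bootstrap sample.
length(colMeans(boot_samples))
```

Because each bootstrap sample is a column, averaging down the columns gives one possible kangaroo weight per sample.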
We can plot these possible weights or take their standard deviation:
sample_means <- colMeans(replicate(100, sample(k_weights, size=10, replace=TRUE)))
hist(sample_means)
mean(sample_means)
sd(sample_means)
The standard deviation of these bootstrap sample means gives us an estimate of the uncertainty in our estimate of the kangaroo’s true weight. We can reduce some of the randomness of our procedure by simply sampling with replacement more times (don’t go nuts and crash our server/your computer). This makes our estimate of the uncertainty more stable; it does not make the uncertainty itself any smaller.
sample_means <- colMeans(replicate(1000, sample(k_weights, size=10, replace=TRUE)))
hist(sample_means)
mean(sample_means)
sd(sample_means)
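As a rough sanity check, the bootstrap standard deviation of the mean can be compared with the textbook standard error formula, sd(x) / sqrt(n). The two should be similar but not identical: the bootstrap value depends on the random resamples, and for a small sample like this one it tends to come out slightly smaller. The names formula_se and bootstrap_se are illustrative.

```r
k_weights <- c(93, 77, 62, 78, 75, 85, 66, 83, 91, 72)

# Textbook standard error of the mean: sample sd divided by sqrt(n).
formula_se <- sd(k_weights) / sqrt(length(k_weights))

# Bootstrap estimate: sd of the means of 1000 resamples.
sample_means <- colMeans(replicate(1000, sample(k_weights, size=10, replace=TRUE)))
bootstrap_se <- sd(sample_means)

# Show the two estimates side by side.
c(formula=formula_se, bootstrap=bootstrap_se)
```

The advantage of the bootstrap is that the same recipe still works for statistics, such as a correlation, where no equally simple formula is available.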
Is it plausible that this kangaroo really weighs 80 pounds?
The first time we weighed this kangaroo, we recorded a weight of 93 pounds. Is it plausible that this kangaroo really weighs 93 pounds?
This procedure gives us an estimate of the uncertainty in the kangaroo’s weight due to chance measurement errors. If there were a systematic bias in my measurements (the kangaroo’s tail was always resting on the ground, for instance), bootstrapping would not account for that error.
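The sketch below illustrates this limitation. It assumes a hypothetical systematic bias that makes every reading 5 pounds too low (the 5 is invented for illustration) and reuses the same seed so that both runs resample the same positions.

```r
k_weights <- c(93, 77, 62, 78, 75, 85, 66, 83, 91, 72)

# Hypothetical: every reading is 5 pounds too low.
biased_weights <- k_weights - 5

set.seed(1)
clean_means <- colMeans(replicate(1000, sample(k_weights, size=10, replace=TRUE)))

set.seed(1)  # same seed, so the same positions are resampled
biased_means <- colMeans(replicate(1000, sample(biased_weights, size=10, replace=TRUE)))

mean(biased_means) - mean(clean_means)  # the estimate shifts by the bias
sd(clean_means)                         # but the bootstrap uncertainty...
sd(biased_means)                        # ...is the same either way
```

The bootstrap only sees the spread of the measurements, so a shift that affects every measurement equally is invisible to it.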
A class of Saint Ann’s Statistics students measured their cubit lengths (the distance from the elbow to the tip of the middle finger).
Create a vector of cubit lengths using the code below:
cubit <- c(45.7, 44, 53.3, 40.6, 42, 44.4, 47.7, 44, 42, 36.8, 43.2, 35)
We’re interested in the mean cubit length of Saint Ann’s Statistics students more generally, which might well be different from the mean of this sample of students. Try using code similar to the code we used above with the kangaroo weights to create bootstrap samples of these cubit lengths and find the mean of each sample. Careful! While there were only 10 kangaroo weights, there are 12 cubit lengths, and you’ll have to adjust your code accordingly.
These same Statistics students also measured the lengths of their feet:
foot <- c(25.4, 28, 24, 22.86, 24, 25.4, 29.9, 26.5, 26, 23.6, 24.9, 23.5)
We can find the correlation between the cubit lengths and the foot lengths in this sample as follows. Keep in mind that for this to work the values in the two vectors must be measurements of the same students and in the same order.
cor(cubit, foot)
How confident are we in this correlation? We can estimate the uncertainty in the correlation by taking bootstrap samples of cubit and foot lengths. We do need to sample the same students from both vectors in each bootstrap sample.
We can sample integers (with replacement) between 1 and 12 using the code below. This will be our way of picking students for each bootstrap sample.
sample.int(12, size=12, replace=TRUE)
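To see why sampling indices works, note that using the same index vector on both cubit and foot keeps each student's two measurements together. The index vector c(1, 1, 3) below is hand-picked for illustration rather than random.

```r
cubit <- c(45.7, 44, 53.3, 40.6, 42, 44.4, 47.7, 44, 42, 36.8, 43.2, 35)
foot <- c(25.4, 28, 24, 22.86, 24, 25.4, 29.9, 26.5, 26, 23.6, 24.9, 23.5)

# Pick student 1 twice and student 3 once.
students <- c(1, 1, 3)

# Each row is one picked student: their cubit length and their foot length.
cbind(cubit=cubit[students], foot=foot[students])
```

Sampling the two vectors separately with sample() would instead pair one student's cubit with another student's foot, which would destroy the relationship we are trying to measure.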
We can get the correlation in one bootstrap sample using the code below. Keep in mind that each time you run this code, you will get a different answer.
students <- sample.int(12, size=12, replace=TRUE)
cor(cubit[students], foot[students])
Lastly, we can use replicate() again to do this 1000 times over and collect all of the bootstrap sample correlations.
sample_correlations <- replicate(1000, {
  students <- sample.int(12, size=12, replace=TRUE)  # pick 12 students, with replacement
  cor(cubit[students], foot[students])               # correlation in this bootstrap sample
})
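Once the bootstrap correlations are collected, they can be summarized in much the same way as the kangaroo means. The sketch below also computes a 95% percentile interval with quantile(), which is one common (but not the only) way to turn bootstrap samples into an interval. The 95% level is a conventional choice, not something this lab prescribes.

```r
cubit <- c(45.7, 44, 53.3, 40.6, 42, 44.4, 47.7, 44, 42, 36.8, 43.2, 35)
foot <- c(25.4, 28, 24, 22.86, 24, 25.4, 29.9, 26.5, 26, 23.6, 24.9, 23.5)

# Collect 1000 bootstrap correlations, resampling students in pairs.
sample_correlations <- replicate(1000, {
  students <- sample.int(12, size=12, replace=TRUE)
  cor(cubit[students], foot[students])
})

mean(sample_correlations)   # center of the bootstrap correlations
sd(sample_correlations)     # bootstrap estimate of the uncertainty

# Middle 95% of the bootstrap correlations.
quantile(sample_correlations, c(0.025, 0.975))
```

Running hist(sample_correlations) afterwards shows the shape of the distribution, which need not be symmetric for a correlation.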
What, if anything, can we say about the true correlation between foot and cubit lengths among Saint Ann’s Statistics students?
What would you do differently if you wanted a better estimate of the correlation between cubit lengths and foot lengths?