Click here for a video intro to bootstrapping.

Intro to cabbages

Today we will use the cabbages data set, which is built into the R package MASS. Before diving in, we need to first download the MASS package (Tools, Install Packages); you only need to do this the first time. Once it’s downloaded, you can call up the MASS package. You only need to call the package once per session.

library(MASS)

Once the package is loaded, you can use data from within the package.

data(cabbages)
attach(cabbages)

You can learn about the data with the following commands. What does each command tell you?

dim(cabbages)
## [1] 60  4
names(cabbages)
## [1] "Cult"   "Date"   "HeadWt" "VitC"
str(cabbages)
## 'data.frame':    60 obs. of  4 variables:
##  $ Cult  : Factor w/ 2 levels "c39","c52": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date  : Factor w/ 3 levels "d16","d20","d21": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HeadWt: num  2.5 2.2 3.1 4.3 2.5 4.3 3.8 4.3 1.7 3.1 ...
##  $ VitC  : int  51 55 45 42 53 50 50 52 56 49 ...
summary(cabbages)
##   Cult     Date        HeadWt           VitC      
##  c39:30   d16:20   Min.   :1.000   Min.   :41.00  
##  c52:30   d20:20   1st Qu.:1.875   1st Qu.:50.75  
##           d21:20   Median :2.550   Median :56.00  
##                    Mean   :2.593   Mean   :57.95  
##                    3rd Qu.:3.125   3rd Qu.:66.25  
##                    Max.   :4.300   Max.   :84.00

Intro to the sample command

Now, let’s learn about the sample command. The sample command will sample from a vector with or without replacement. For example, here is how you can take a sample of size 10 of the cabbage head weights, with replacement.

sample(HeadWt, size=10, replace= TRUE)
##  [1] 2.2 1.5 2.3 1.6 1.9 1.4 3.5 1.7 2.7 4.3

Re-run that a couple times to remind yourself there is randomness.

Bootstrap time!

You know that you want to procure a sample that is of the same size as the original sample. Therefore, you need to find the sample size of the original data set. You can find the sample size by finding the length of the variable.

(sampsize <- length(HeadWt))
## [1] 60

Equivalently, you can find the number of rows of the original data set. Whatever floats your boat.

nrow(cabbages)
## [1] 60

Next, lay the groundwork. Decide how many random samples you want to take: we’ll call that m, the Monte Carlo sample size. Then create a vector in which you’ll store your simulated means. I filled this vector with zeros so that I can easily recognize any potential issues (because the mean weight of cabbage heads should be strictly positive).

m <- 10^4
mymeans <- rep(0, m)

Now we are ready to repeatedly sample and calculate sample means from the simulated sample. The following loop performs two steps iteratively: it takes a sample of size 60 with replacement, stores it as mysamp, calculates the mean, and stores the mean in the ith entry of the vector mymeans.

for(i in 1:m){
    mysamp <- sample(HeadWt, size=sampsize, replace= TRUE)
    
    mymeans[i] <- mean(mysamp)
}

You now have a simulated sampling distribution. You can now look at it and learn from it. Here is a histogram of your sampling distribution.

hist(mymeans)

Do a quick check: does this make sense with our original data? Is the sampling distribution centered in the “right” place?

To construct a 95 percent confidence interval for the mean headweight, we use the quantile command. The first argument is the vector of means and the second argument says what proportion we want below a certain cutoff.

quantile(mymeans, .025)
##     2.5% 
## 2.376667
quantile(mymeans, .975)
##    97.5% 
## 2.813333

This says that 2.5 percent of means are below 2.3766667 and 97.5 percent of means are below 2.8133333. If you want to get two quantiles at once, you can use a vector as the second argument, as shown below.

quantile(mymeans, c(.025, .975))
##     2.5%    97.5% 
## 2.376667 2.813333

Our simulated sampling distribution looked pretty normally distributed. Do you think that a confidence interval constructed using the t-distribution would be similar? Construct it and see! Why does this make sense?

Setting the random seed

If you re-run the loop above, you will get different answers. To get the same answers next time you run it, you need to set the random seed.

set.seed(1234)
for(i in 1:m){
    mysamp <- sample(HeadWt, size=sampsize, replace= TRUE)
    mymeans[i] <- mean(mysamp)
}
quantile(mymeans, c(.025, .975))
##     2.5%    97.5% 
## 2.378333 2.816667

Run this chunk a couple times to convince yourself the resulting bootstrap confidence intervals are the same every time.

Practice Problems

1. More cabbages

a.

You can subset the data-set to isolate certain observations. For example, here I’ve found just the cabbages planted on date d16.

newdata <- subset(cabbages, Date =="d16" )
dim(newdata)
## [1] 20  4

Use this mini data set to find the mean headweight of cabbages planted on d16. Create a 85 percent bootstrap confidence interval for this mean.

There’s many ways to do this. Here’s one way:

m <- 10^4
out <- rep(0, m)
n <- nrow(newdata)
set.seed(1234)
for(i in 1:m){
  grabthese <- sample(1:n, n, replace = TRUE)
  tempdata <- newdata[grabthese, "HeadWt"]
  out[i] <- mean(tempdata)
}
quantile(out, c(.075, .925))
##  7.5% 92.5% 
## 2.455 3.000

Here’s another way:

m <- 10^4
out <- rep(0, m)
n <- nrow(newdata)
set.seed(1234)
for(i in 1:m){
  grabthese <- sample(1:n, n, replace = TRUE)
  tempdata <- newdata[grabthese, ]
  out[i] <- mean(tempdata$HeadWt)
}
quantile(out, c(.075, .925))
##  7.5% 92.5% 
## 2.455 3.000

Here’s another way:

m <- 10^4
out <- rep(0, m)
n <- nrow(newdata)
set.seed(1234)
for(i in 1:m){
  grabthese <- sample(newdata$HeadWt, n, replace = TRUE)
  out[i] <- mean(grabthese)
}
quantile(out, c(.075, .925))
##  7.5% 92.5% 
## 2.455 3.000

b.

Find a 90 percent bootstrap confidence interval for the mean ascorbic acid content of cabbages planted on d21.

3. Crabs

a.

Use the crabs data from the MASS package to construct a 95 percent confidence interval for crabs’ mean carapace width.

data(crabs)
summary(crabs)
##  sp      sex         index            FL              RW       
##  B:100   F:100   Min.   : 1.0   Min.   : 7.20   Min.   : 6.50  
##  O:100   M:100   1st Qu.:13.0   1st Qu.:12.90   1st Qu.:11.00  
##                  Median :25.5   Median :15.55   Median :12.80  
##                  Mean   :25.5   Mean   :15.58   Mean   :12.74  
##                  3rd Qu.:38.0   3rd Qu.:18.05   3rd Qu.:14.30  
##                  Max.   :50.0   Max.   :23.10   Max.   :20.20  
##        CL              CW              BD       
##  Min.   :14.70   Min.   :17.10   Min.   : 6.10  
##  1st Qu.:27.27   1st Qu.:31.50   1st Qu.:11.40  
##  Median :32.10   Median :36.80   Median :13.90  
##  Mean   :32.11   Mean   :36.41   Mean   :14.03  
##  3rd Qu.:37.23   3rd Qu.:42.00   3rd Qu.:16.60  
##  Max.   :47.60   Max.   :54.60   Max.   :21.60
attach(crabs)

b.

Use the crabs data from the MASS package to construct a 95 percent confidence interval for female crabs’ mean rear width. (Answer is below in part D.)

c. 

Use the crabs data from the MASS package to construct a 95 percent confidence interval for male crabs’ mean rear width. (Answer is below in part D.)

d. (aka “Let’s get crazy!”)

The code below calculate the mean rear width for females and for males, respectively.

mean(RW[sex == "F"])
## [1] 13.487
mean(RW[sex == "M"])
## [1] 11.99

It looks like females have wider rears than males. In other words, the females’ mean minus the males’ mean is positive:

mean(RW[sex == "F"]) - mean(RW[sex == "M"])
## [1] 1.497

Adapt your bootstrap loop to calculate both the mean for males and the mean for females. For each iteration of the loop, record the difference in the two simulated means (female mean minus male mean). Use this to construct a 95 percent confidence interval that gives us an idea of how much wider females’ rears are compared to males’ rears, on average.

malemeans <- femalemeans <- diffs <- rep(0,m)
Fdata <- subset(crabs, sex == "F")
Mdata <- subset(crabs, sex == "M")
nM <- nrow(Mdata)
nF <- nrow(Fdata)
set.seed(1234)
for(i in 1:m){
  tempM <- sample(1:nM, nM, replace = TRUE)
  tempF <- sample(1:nF, nF, replace = TRUE)
  femalemeans[i] <- meanF <- mean(Fdata[tempF, "RW"])
  malemeans[i] <- meanM <- mean(Mdata[tempM, "RW"])
  diffs[i] <- meanF - meanM
}

hist(diffs)

quantile(diffs, c(.025, .975))
##     2.5%    97.5% 
## 0.811000 2.184025
#answer to b
quantile(malemeans, c(.025, .975))
##   2.5%  97.5% 
## 11.564 12.405
quantile(femalemeans, c(.025, .975))
##     2.5%    97.5% 
## 12.95195 14.02000