Purpose:

  • Illustrate the effect of sample size on the bootstrap distribution and confidence intervals
  • Demonstrate that the number of bootstrap samples has relatively little effect on the bootstrap distribution and confidence intervals (once the number of bootstrap samples is large enough)

Preamble

Load the mosaic, Lock5Data, and gridExtra libraries

library(mosaic)
library(Lock5Data)
library(gridExtra)

Set the seed of the random number generator so we get the same results each time we run the code

set.seed(1500)

Read in data

ddat<-read.csv("../data/Deercaptures.csv")

Let's create a smaller data set with 1/4 the number of observations

ddatsmall<-sample(ddat, 318/4)

Let's consider the sampling distribution of the proportion of captured deer that were fawns. The sample proportions are similar in the full and reduced data sets.

prop(~ageclass, data=ddat, success="Fawn") 
## prop_Fawn 
## 0.1886792
prop(~ageclass, data=ddatsmall, success="Fawn")
## prop_Fawn 
## 0.1898734

Effect of sample size

Let's create 5 bootstrap distributions and confidence intervals for each of:

  • the reduced data set with 79 cases
  • the original data set with 318 cases

I am going to wrap mosaic’s “do” function in a “for loop”, rather than using “do” on its own, since the simulation is a bit more complicated than the things we normally do in biometry.

# Set up objects to store results
confintSmall<-matrix(NA, 5,2) # confidence intervals for small sample size
confintLarge<-matrix(NA, 5,2) # confidence intervals for full data set 
par(mfrow=c(5,2)) # sets up a plotting window
for(i in 1:5){
  #Bootstrap and CI for small data set
  bootsmall<-do(1000)*{prop(~ageclass, data=resample(ddatsmall), success="Fawn")}
  confintSmall[i,]<-qdata(~prop_Fawn, p=c(0.025, 0.975), data=bootsmall)[,1]
  
  #Bootstrap and CI for full data set
  bootslarge<-do(1000)*{prop(~ageclass, data=resample(ddat), success="Fawn")}
  confintLarge[i,]<-qdata(~prop_Fawn, p=c(0.025, 0.975), data=bootslarge)[,1] 
  hist(bootsmall$prop_Fawn, main="Bootstrap Distribution: Quarter Data")
  hist(bootslarge$prop_Fawn, main="Bootstrap Distribution: Full Data")
} 
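
To make explicit what one pass of the loop is doing, here is a minimal base-R sketch of a single percentile bootstrap interval. It does not use the deer file: the vector `ageclass` below is a stand-in that assumes 60 fawns among 318 deer (inferred from the reported proportion 0.1886792), and the label "Other" is a placeholder for the non-fawn age classes. Because the random draws differ from mosaic's, the endpoints will not match the output in this document exactly.

```r
# Stand-in for the ageclass column: 60 fawns among 318 deer (an assumption
# inferred from the reported proportion); "Other" is a placeholder label.
ageclass <- c(rep("Fawn", 60), rep("Other", 258))

set.seed(1500)
B <- 1000                  # number of bootstrap samples
boot_props <- numeric(B)   # storage for the bootstrap proportions
for (b in seq_len(B)) {
  # Resample the data with replacement, keeping the original sample size
  resampled <- sample(ageclass, size = length(ageclass), replace = TRUE)
  boot_props[b] <- mean(resampled == "Fawn")
}

# 95% percentile interval: the 2.5th and 97.5th percentiles
ci <- unname(quantile(boot_props, probs = c(0.025, 0.975)))
ci
```

The `do(1000)*{...}` call plays the role of the inner loop here, and `qdata()` plays the role of `quantile()`.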

Confidence intervals

Let's look at the confidence intervals

Bootstrap applications using the small data set

confintSmall # data set of 79 cases
##           [,1]      [,2]
## [1,] 0.1139241 0.2784810
## [2,] 0.1139241 0.2784810
## [3,] 0.1139241 0.2784810
## [4,] 0.1012658 0.2911392
## [5,] 0.1012658 0.2784810

Bootstrap applications using the full data set

confintLarge # data set with 318 cases
##           [,1]      [,2]
## [1,] 0.1446541 0.2327044
## [2,] 0.1477201 0.2327044
## [3,] 0.1446541 0.2327830
## [4,] 0.1477987 0.2327044
## [5,] 0.1446541 0.2358491

Summary

  1. The bootstrap distributions for the smaller data set (left panels) are more spread out. This makes sense since we have less information in this data set and, therefore, our bootstrap sample proportions will vary more from sample to sample.
  2. Our confidence intervals are also wider when analyzing the smaller data set. Again, this makes sense as we have more uncertainty about the population proportion because we have less information in our data set.
  3. The confidence intervals vary a little bit across repeated applications of the bootstrap, but not much. This also makes sense. We start with the same data set each time we repeat the bootstrap, but because we create only 1000 bootstrap samples, we get slightly different results each time.
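
Points 1 and 2 can be checked against the usual large-sample standard error of a sample proportion, sqrt(p(1 - p)/n). This formula is not used elsewhere in this document; it is brought in here only as a rough check, using the sample proportions and sample sizes reported above.

```r
# Sample proportions and sample sizes reported above
p_small <- 0.1898734; n_small <- 79
p_full  <- 0.1886792; n_full  <- 318

# Large-sample standard error of a sample proportion: sqrt(p * (1 - p) / n)
se_small <- sqrt(p_small * (1 - p_small) / n_small)
se_full  <- sqrt(p_full  * (1 - p_full)  / n_full)

# Ratio of the two standard errors; close to sqrt(318 / 79), i.e. about 2
se_ratio <- se_small / se_full
c(se_small = se_small, se_full = se_full, ratio = se_ratio)
```

With about a quarter of the observations, the standard error roughly doubles, which agrees with the intervals for the 79-case data set being about twice as wide as those for the full data set.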

Effect of number of bootstrap samples

Now, let's consider creating a bootstrap distribution for the full data set, but this time we will vary the number of bootstrap samples (either 500 or 5000)

confint500 <-matrix(NA, 5,2) # confidence intervals using 500 bootstraps
confint5000 <-matrix(NA, 5,2) # confidence intervals using 5000 bootstraps
par(mfrow=c(5,2)) # sets up a plotting window
for(i in 1:5){
  #Bootstrap and CI using 500 bootstrap samples (full data set)
  boot500<-do(500)*{prop(~ageclass, data=resample(ddat), success="Fawn")}
  confint500[i,]<-qdata(~prop_Fawn, p=c(0.025, 0.975), data=boot500)[,1]
  
  #Bootstrap and CI using 5000 bootstrap samples (full data set)
  boot5000<-do(5000)*{prop(~ageclass, data=resample(ddat), success="Fawn")}
  confint5000[i,]<-qdata(~prop_Fawn, p=c(0.025, 0.975), data=boot5000)[,1] 
  hist(boot500$prop_Fawn, main="500 Bootstraps")
  hist(boot5000$prop_Fawn, main="5000 Bootstraps")
}

Confidence intervals

Confidence intervals with 500 bootstrap samples

confint500
##           [,1]      [,2]
## [1,] 0.1477987 0.2375000
## [2,] 0.1492925 0.2327044
## [3,] 0.1540881 0.2343553
## [4,] 0.1477987 0.2280660
## [5,] 0.1492925 0.2295597

Confidence intervals with 5000 bootstrap samples

confint5000
##           [,1]      [,2]
## [1,] 0.1446541 0.2295597
## [2,] 0.1446541 0.2327044
## [3,] 0.1446541 0.2327044
## [4,] 0.1477987 0.2327044
## [5,] 0.1477987 0.2327044

Summary

  1. The bootstrap distributions and confidence intervals are similar regardless of whether we use 500 or 5000 bootstrap samples to calculate the confidence interval. This makes sense. We are starting with the same (full) data set each time we repeat the bootstrap. We don’t get narrower confidence intervals when we use a larger number of bootstrap samples because adding more bootstrap samples does NOT add any information to our original data set.
  2. As in the previous comparison, our confidence intervals vary a little bit across repeated applications of the bootstrap, but not much. Also, note that the confidence intervals vary less when we use 5000 bootstrap samples than when we use 500. This also makes sense. We should get more stable results when we calculate the confidence interval using 5000 (versus 500) bootstrap samples. Ideally, we would use an infinite number of bootstrap samples and get the exact same result each time!
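
Both points can be explored with more than 5 repetitions. The sketch below is a base-R illustration, not the code used to produce the output above. It assumes a 0/1 stand-in vector `fawn` with 60 fawns among 318 deer (inferred from the reported proportion 0.1886792), and `percentile_ci` is a hypothetical helper written for this example.

```r
# Stand-in for the full data: 60 fawns among 318 deer (an assumption
# inferred from the reported proportion); 1 = fawn, 0 = not a fawn.
fawn <- c(rep(1, 60), rep(0, 258))

# One 95% percentile interval from B bootstrap samples (hypothetical helper)
percentile_ci <- function(x, B) {
  boot_props <- replicate(B, mean(sample(x, replace = TRUE)))
  unname(quantile(boot_props, probs = c(0.025, 0.975)))
}

set.seed(1500)
reps <- 20  # number of repeated applications of the bootstrap
ci500  <- t(replicate(reps, percentile_ci(fawn, 500)))   # reps x 2 matrix
ci5000 <- t(replicate(reps, percentile_ci(fawn, 5000)))  # reps x 2 matrix

# Average interval width for each number of bootstrap samples
width500  <- mean(ci500[, 2]  - ci500[, 1])
width5000 <- mean(ci5000[, 2] - ci5000[, 1])
c(width500 = width500, width5000 = width5000)

# Run-to-run standard deviation of the lower and upper endpoints
sd500  <- apply(ci500, 2, sd)
sd5000 <- apply(ci5000, 2, sd)
rbind(sd500, sd5000)
```

If the reasoning in the summary holds, the two average widths should be nearly equal, while the endpoint standard deviations should typically be smaller with 5000 bootstrap samples.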