Introduction to Bootstrapping in R

In statistics, we often want to resample to test the accuracy of our sample estimates. This is called bootstrapping—a test based on multiple random sampling with replacement. In this recitation, we will: (1) generate 1000 new random samples from a vector (with replacement), (2) calculate a given estimate (ex.: mean, median, etc.) for each of these samples and (3) calculate the standard deviation of the distribution of that estimate.

In this recitation, I build on the code made available by Carsey and Harden (2014: 217).

1. Generating Data

Instead of loading a dataset, we will create a vector of normally distributed numbers named myData.

1.1 Creating a random normal distribution using `rnorm()`

The rnorm() function allows us to generate a random sequence of numbers from an hypothetical normal distribution centered with a mean of 20 and a standard deviation of 4.5. Here, we chose to generate 2000 observations.

set.seed(300) # Setting the seed for replication purposes

myData <- rnorm(2000,20,4.5) # Creating a random normal distribution (n=300, mean=20, sd=4.5)

1.2 Performing sanity checks on myData using `length()`, `mean()` and `sd()`

To confirm that the myData vector was created correctly, we confirm that (1) there are 2000 observations (length), (2) the mean is approximately 20 (mean) and (3) the standard deviation is approximately 4.5 (sd). Note: We shouldn’t expect the mean and standard deviation to have the exact value we inputted in rnorm, as randomness is involved when using this function (rnorm generates a random normal distribution around a given mean and standard deviation).

length(myData) # How many observations?

## [1] 2000

mean(myData) # What is the mean?

## [1] 20.25773

sd(myData) # What is the standard deviation?

## [1] 4.590852

1.3 Graphing myData

For visual purposes, here is a graph of the myData vector. The dotted red line represents the mean (20.25773) of this distribution.

2. Bootstrapping using basic R functions

What is bootstrapping? It is a way to account for uncertaintly when we measure sample estimates. Using random sampling with replacement, we calculate our estimates for various random samples drawn from our original distribution.

Similarly to what we did last week in recitation when calculating a “mean of means” for dice rolls (make sure to review this if needed), we are estimating how likely we would be to observe a given sample estimate (ex.: mean, standard deviation) if we randomly resampled from this underlying distribution with replacement a large number of times. The reason we are resampling with replacement is to allow for certain numbers to be selected multiple times, which creates greater variance in the generated distributions.

2.1 Resampling from myData 1000 times using `for(i in x)`

Simply put, what we are doing here is calculating the mean of 1000 samples of 2000 observations from myData using for(i in x). The mean of each of these samples is stored withing a vector (bootstrap.results).

What we are interested in is the uncertainty in the distribution of our estimate (in this case, the mean). We want to measure the standard deviation of each mean we calculated using random sampling with replacement.

set.seed(200) # Setting the seed for replication purposes

sample.size <- 2000 # Sample size

n.samples <- 1000 # Number of bootstrap samples

bootstrap.results <- c() # Creating an empty vector to hold the results

for (i in 1:n.samples)
{
    obs <- sample(1:sample.size, replace=TRUE)
    bootstrap.results[i] <- mean(myData[obs]) # Mean of the bootstrap sample
}

length(bootstrap.results) # Sanity check: this should contain the mean of 1000 different samples

## [1] 1000

summary(bootstrap.results) # Sanity check

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.94   20.19   20.26   20.26   20.33   20.65

sd(bootstrap.results) # Checking the standard deviation of the distribution of means (this is what we are interested in!)

## [1] 0.103674

par(mfrow=c(2,1), pin=c(5.8,0.98)) # Combining plots (2 rows, 1 column) and setting the plots size

hist(bootstrap.results, # Creating an histogram
     col="#d83737", # Changing the color
     xlab="Mean", # Giving a label to the x axis
     main=paste("Means of 1000 bootstrap samples from myData")) # Giving a title to the graph

hist(myData, # Creating an histogram
     col="#37aad8", # Changing the color
     xlab="Value", # Giving a label to the x axis
     main=paste("Distribution of myData")) # Giving a title to the graph

2.2 Resampling a 1000 times from the actual data generating process using `for(i in x)`

Similarly, you can also draw 1000 random samples from the original data generating process.

set.seed(200) # Setting the seed for replication purposes

sample.size <- 2000 # Sample size

n.samples <- 1000 # Number of bootstrap samples

bootstrap.results <- c() # Creating an empty vector to hold the results

for (i in 1:n.samples)
{
    bootstrap.results[i] <- mean(rnorm(2000,20,4.5)) # Mean of the bootstrap sample
}

length(bootstrap.results) # Sanity check: this should contain the mean of 1000 different samples

## [1] 1000

summary(bootstrap.results) # Sanity check

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.64   19.93   20.00   20.00   20.07   20.32

sd(bootstrap.results) # Checking the standard deviation of the distribution of means (this is what we are interested in!)

## [1] 0.1041927

par(mfrow=c(2,1), pin=c(5.8,0.98)) # Combining plots (2 rows, 1 column) and setting the plots size

hist(bootstrap.results, # Creating an histogram
     col="#d83737", # Changing the color
     xlab="Mean", # Giving a label to the x axis
     main=paste("Means of 1000 bootstrap samples from the DGP")) # Giving a title to the graph

hist(myData, # Creating an histogram
     col="#37aad8", # Changing the color
     xlab="Value", # Giving a label to the x axis
     main=paste("Distribution of myData")) # Giving a title to the graph

Exercises

Exercise 1

Set your seed at 150. Generate a random normal distribution of 1000 observations, with a mean of 30 and a standard deviation of 2.5. Calculate the mean of 50 samples of 1000 observations from that dataset. Store your results in a vector.

Relevant functions: set.seed(), rnorm(), for(i in x), sample().

Exercise 2

Generate two histograms to graphically display the distribution of means obtained in Exercise 1 as well as the values of the 1000 observations within your original data set. Combine these histograms into one overall graph.

Relevant functions: par(), hist().

Introduction to Bootstrapping in R

Evelyne Brie