In statistics, we often want to resample to test the accuracy of our sample estimates. This is called bootstrapping—a test based on multiple random sampling with replacement. In this recitation, we will: (1) generate 1000 new random samples from a vector (with replacement), (2) calculate a given estimate (ex.: mean, median, etc.) for each of these samples and (3) calculate the standard deviation of the distribution of that estimate.
In this recitation, I build on the code made available by Carsey and Harden (2014: 217).
Instead of loading a dataset, we will create a vector of normally distributed numbers named myData.
rnorm()
The rnorm() function allows us to generate a random sequence of numbers from an hypothetical normal distribution centered with a mean of 20 and a standard deviation of 4.5. Here, we chose to generate 2000 observations.
set.seed(300) # Setting the seed for replication purposes
myData <- rnorm(2000,20,4.5) # Creating a random normal distribution (n=300, mean=20, sd=4.5)
length(), mean() and sd()
To confirm that the myData vector was created correctly, we confirm that (1) there are 2000 observations (length), (2) the mean is approximately 20 (mean) and (3) the standard deviation is approximately 4.5 (sd). Note: We shouldn’t expect the mean and standard deviation to have the exact value we inputted in rnorm, as randomness is involved when using this function (rnorm generates a random normal distribution around a given mean and standard deviation).
length(myData) # How many observations?
## [1] 2000
mean(myData) # What is the mean?
## [1] 20.25773
sd(myData) # What is the standard deviation?
## [1] 4.590852
For visual purposes, here is a graph of the myData vector. The dotted red line represents the mean (20.25773) of this distribution.
What is bootstrapping? It is a way to account for uncertaintly when we measure sample estimates. Using random sampling with replacement, we calculate our estimates for various random samples drawn from our original distribution.
Similarly to what we did last week in recitation when calculating a “mean of means” for dice rolls (make sure to review this if needed), we are estimating how likely we would be to observe a given sample estimate (ex.: mean, standard deviation) if we randomly resampled from this underlying distribution with replacement a large number of times. The reason we are resampling with replacement is to allow for certain numbers to be selected multiple times, which creates greater variance in the generated distributions.
for(i in x)
Simply put, what we are doing here is calculating the mean of 1000 samples of 2000 observations from myData using for(i in x). The mean of each of these samples is stored withing a vector (bootstrap.results).
What we are interested in is the uncertainty in the distribution of our estimate (in this case, the mean). We want to measure the standard deviation of each mean we calculated using random sampling with replacement.
set.seed(200) # Setting the seed for replication purposes
sample.size <- 2000 # Sample size
n.samples <- 1000 # Number of bootstrap samples
bootstrap.results <- c() # Creating an empty vector to hold the results
for (i in 1:n.samples)
{
obs <- sample(1:sample.size, replace=TRUE)
bootstrap.results[i] <- mean(myData[obs]) # Mean of the bootstrap sample
}
length(bootstrap.results) # Sanity check: this should contain the mean of 1000 different samples
## [1] 1000
summary(bootstrap.results) # Sanity check
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.94 20.19 20.26 20.26 20.33 20.65
sd(bootstrap.results) # Checking the standard deviation of the distribution of means (this is what we are interested in!)
## [1] 0.103674
par(mfrow=c(2,1), pin=c(5.8,0.98)) # Combining plots (2 rows, 1 column) and setting the plots size
hist(bootstrap.results, # Creating an histogram
col="#d83737", # Changing the color
xlab="Mean", # Giving a label to the x axis
main=paste("Means of 1000 bootstrap samples from myData")) # Giving a title to the graph
hist(myData, # Creating an histogram
col="#37aad8", # Changing the color
xlab="Value", # Giving a label to the x axis
main=paste("Distribution of myData")) # Giving a title to the graph
for(i in x)
Similarly, you can also draw 1000 random samples from the original data generating process.
set.seed(200) # Setting the seed for replication purposes
sample.size <- 2000 # Sample size
n.samples <- 1000 # Number of bootstrap samples
bootstrap.results <- c() # Creating an empty vector to hold the results
for (i in 1:n.samples)
{
bootstrap.results[i] <- mean(rnorm(2000,20,4.5)) # Mean of the bootstrap sample
}
length(bootstrap.results) # Sanity check: this should contain the mean of 1000 different samples
## [1] 1000
summary(bootstrap.results) # Sanity check
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.64 19.93 20.00 20.00 20.07 20.32
sd(bootstrap.results) # Checking the standard deviation of the distribution of means (this is what we are interested in!)
## [1] 0.1041927
par(mfrow=c(2,1), pin=c(5.8,0.98)) # Combining plots (2 rows, 1 column) and setting the plots size
hist(bootstrap.results, # Creating an histogram
col="#d83737", # Changing the color
xlab="Mean", # Giving a label to the x axis
main=paste("Means of 1000 bootstrap samples from the DGP")) # Giving a title to the graph
hist(myData, # Creating an histogram
col="#37aad8", # Changing the color
xlab="Value", # Giving a label to the x axis
main=paste("Distribution of myData")) # Giving a title to the graph
Set your seed at 150. Generate a random normal distribution of 1000 observations, with a mean of 30 and a standard deviation of 2.5. Calculate the mean of 50 samples of 1000 observations from that dataset. Store your results in a vector.
Relevant functions: set.seed(), rnorm(), for(i in x), sample().
Generate two histograms to graphically display the distribution of means obtained in Exercise 1 as well as the values of the 1000 observations within your original data set. Combine these histograms into one overall graph.
Relevant functions: par(), hist().