Probability Theory 2
In statistics, we often want to resample to test the accuracy of our sample estimates. This is called bootstrapping—a test based on multiple random sampling with replacement. In this lab, we will: (1) generate random samples from distribution functions (with replacement), (2) introduce loops and (3) calculate the standard deviation of the distribution of our bootstrap samples.
In this material, I build on the code made available by Carsey and Harden (2014: 217).
Relevant functions: rnorm(),
rbernoulli(), rbinom(), rpois(),
for()
1. Generating Data
1.1 Normal distribution
The rnorm() function allows us to generate a
random sequence of numbers from an hypothetical normal
distribution centered with a mean of 20 and a standard deviation of 4.5.
Here, we chose to generate 2000 observations.
set.seed(300) # Setting the seed for replication purposes
myData <- rnorm(n=2000,mean=20,sd=4.5) # Creating a random normal distribution
To confirm that the myData vector was created
correctly, we confirm that (1) there are 2000 observations
(length), (2) the mean is approximately 20
(mean) and (3) the standard deviation is approximately 4.5
(sd). Note: We shouldn’t expect the mean
and standard deviation to have the exact value we inputted in
rnorm, as randomness is involved when using this function
(rnorm generates a random normal distribution
around a given mean and standard deviation).
length(myData) # How many observations?
## [1] 2000
mean(myData) # What is the mean?
## [1] 20.25773
sd(myData) # What is the standard deviation?
## [1] 4.590852
For visual purposes, here is a graph of the myData vector. The dotted red line represents the mean (20.25773) of this distribution.
1.2 Other distributions
Note that you can create a variety of distributions in R. Here is how to create those we discussed on class earlier this week.
1.2.1 Bernoulli
# Setting the seed for replication purposes
set.seed(300)
# Loading the purrr package
library(purrr)
# Creating a random Bernoulli variable with n=1 (1 sample)
# and p=0.5 (probability of success of 50%)
rbernoulli(n=1, p=0.5)
## [1] TRUE
It returns a logical statement (TRUE or FALSE) indicating whether you sampled a success (i.e. TRUE) or not (i.e. FALSE).
1.2.2 Binomial
Let’s say we have 50 people who flipped a fair coin 10 times, and we define a random variable that takes a value of 1 (i.e. success) every time these people get a Head. The range of this random variable (i.e. the set of all possible values that this rv can take) is therefore \(\{0,1,2,...,10\}\).
Exercise 1
If I tell you that the function to create a random binomial
distribution is rbinom(n, size, prob), where
n is the number of observations you want to sample,
size is the number of trials, and prob
is the probability of success on each trial…
How would you sample a distribution of 50 people flipping a fair coin 10 times?
# Don't forget to set your seed at 123!
set.seed(123)
# You should input the correct arguments in this function:
# rbinom(n=?, size=?, prob=?)
Your results should look like this:
## [1] 4 6 5 7 7 2 5 7 5 5 8 5 6 5 3 7 4 2 4 8 7 6 6 9 6 6 5 5 4 3 8 7 6 6 2 5 6 4
## [39] 4 4 3 5 5 4 3 3 4 5 4 7
1.2.3 Poisson
Let’s say we want to create a Poisson random variable that counts the number of emails you get in a day. The expected number of emails you get in a day is 30 (i.e. \(\lambda = 30\)).
I want to simulate the number of emails I will get in the next 14 days, according to this distribution function.
set.seed(300) # Setting the seed for replication purposes
rpois(n=14, lambda=30) # Creating a random Poisson variable (n=14, lambda=30)
## [1] 37 34 32 33 29 38 34 32 36 31 42 29 22 36
2. Loops
Let’s introduce loops using a very simple example. We’ll need loops
to perform bootstrapping in a few minutes. Here, we simply want to print
5 observations from myData using sample()
and print(). That is to say, we sample and print one
observation, five times in a row.
set.seed(300) # Setting the seed for replication purposes
for (i in 1:5) # Specifying the number of iterations
{ # Opening the curly brackets
obs <- sample(myData,size=1) # Sampling one observation from the myData vector and storing it into the "obs" object
print(obs) # Printing the value of that "obs" object
cat("I have finished", i,"iterations \n") # Printing a string of characters after each iteration
} # Closing the curly brackets
## [1] 13.85058
## I have finished 1 iterations
## [1] 18.4574
## I have finished 2 iterations
## [1] 18.57355
## I have finished 3 iterations
## [1] 22.5185
## I have finished 4 iterations
## [1] 22.79222
## I have finished 5 iterations
We can use loops to automatize a bunch of things. In the next subsections, we’ll use them to demonstrate two important principles in statistics: the Law of Large Numbers and the Central Limit Theorem.
2.1 Law of Large Numbers
Let’s create a die_mean() function to calculate the mean
of n number of die rolls.
die <- 1:6 # Creating a "die" vector with all numbers from 1 to 6
die_mean <- function(n) {
# Sampling with replacement "n" number of times
# from the following vector: c(1,2,3,4,5,6)
mean(sample(die, size = n, replace = TRUE))
}
Calculating the mean of n number of die rolls using
die_mean().
set.seed(500) # Setting the seed for replication purposes
for (n in c(10,100,1000,10000,100000)) # Here, we are using 5 iterations
# with the values within this vector for "n"
{
result <- die_mean(n) # Calculating the mean of n die rolls
cat("The mean of", n, "number of die rolls is", result,"\n") # Printing a string of characters after each iteration
}
## The mean of 10 number of die rolls is 3.2
## The mean of 100 number of die rolls is 3.44
## The mean of 1000 number of die rolls is 3.492
## The mean of 10000 number of die rolls is 3.4994
## The mean of 1e+05 number of die rolls is 3.50887
The mean gets closer to the expected value (or population parameter) as the sample size increases. This is the Law of Large Numbers.
2.2 Central Limit Theorem
Drawing i number of observations from a vector of means
of 15 die rolls using replicate() and graphing the results
using hist().
set.seed(100) # Setting the seed for replication purposes
for (i in c(100,1000,100000))
{
x <- replicate(i, { # Using replicate to repeat the
# "randomized experiment" several times without getting
# the same answer, and storing the results within a x
# vector (in other words, replicate reevaluates the
# expression for each replication)
mm <- die_mean(15) # Calculating the mean of 15 die
# rolls i number of times
mean(mm) # Calculating the mean of these means
# (this isn't as confusing as it sounds: we basically
# calculate the mean of 15 die rolls multiple times
# (either 100, 1000 or 100000 times) and calculate the
# mean of these 100, 1000 or 100000 observations)
})
a <- round(sd(x),digits =3)
hist(x, # Specifying which data will be plotted
xlab="Mean of 15 die rolls", # Labeling the x axis
xlim=c(1,6), # Delimiting the x axis
col="#4286f4", # Check out "color picker" on Google if you want to select a custom color!
main=paste("Histogram for ", i," random draws (sd = ", a,")", sep="")) # Using the paste command to generate the main plot title
}
You can click on these graphs to take a closer look!
The distribution of the sample mean tends toward a normal distribution as the sample size increases. This is the Central Limit Theorem.
3. Bootstrapping
What is bootstrapping? It is a way to account for uncertaintly when we measure sample estimates. Using random sampling with replacement, we calculate our estimates for various random samples drawn from our original distribution.
Similarly to what we just did when calculating a “mean of means” for dice rolls, we are estimating how likely we would be to observe a given sample estimate (ex.: mean, standard deviation) if we randomly resampled from this underlying distribution with replacement a large number of times. The reason we are resampling with replacement is to allow for certain numbers to be selected multiple times, which creates greater variance in the generated distributions.
Simply put, what we are doing here is calculating the mean of 1000
samples from a random normal distribution using
for(i in x). The mean of each of these samples is stored
withing a vector (bootstrap.results).
Why are we doing this? What we are interested in is the uncertainty in the distribution of our estimate (in this case, the mean). We want to measure the standard deviation of each mean we calculated using random sampling with replacement.
set.seed(200) # Setting the seed for replication purposes
n.samples <- 1000 # Number of bootstrap samples
bootstrap.results <- c() # Creating an empty vector to hold the results
for (i in 1:n.samples)
{
bootstrap.results[i] <- mean(rnorm(2000,20,4.5)) # Mean of the bootstrap sample
}
length(bootstrap.results) # Sanity check: this should contain the mean of 1000 different samples
## [1] 1000
summary(bootstrap.results) # Sanity check
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.64 19.93 20.00 20.00 20.07 20.32
sd(bootstrap.results) # Checking the standard deviation of the distribution of means (this is what we are interested in!)
## [1] 0.1041927
par(mfrow=c(2,1), pin=c(5.8,0.98)) # Combining plots (2 rows, 1 column) and setting the plots size
hist(bootstrap.results, # Creating an histogram
col="#d83737", # Changing the color
xlab="Mean", # Giving a label to the x axis
main=paste("Means of 1000 bootstrap samples from the DGP")) # Giving a title to the graph
hist(myData, # Creating an histogram
col="#37aad8", # Changing the color
xlab="Value", # Giving a label to the x axis
main=paste("Distribution of myData")) # Giving a title to the graph
Exercise 2
Set your seed at 120. Calculate the mean of 50 samples from random
Poisson variable encompassing 200 observations with a lambda of 7. Store
your results in a vector. What is the summary of that variable, using
the summary() function? What is its sd()?
# Don't forget to set your seed at 123!
set.seed(120)