PSCI 9590A - Introduction to Quantitative Methods

Evelyne Brie

Fall 2023

Probability Theory 2

In statistics, we often want to resample to test the accuracy of our sample estimates. This is called bootstrapping—a test based on multiple random sampling with replacement. In this lab, we will: (1) generate random samples from distribution functions (with replacement), (2) introduce loops and (3) calculate the standard deviation of the distribution of our bootstrap samples.

In this material, I build on the code made available by Carsey and Harden (2014: 217).

Relevant functions: rnorm(), rbernoulli(), rbinom(), rpois(), for()

1. Generating Data

1.1 Normal distribution

The rnorm() function allows us to generate a random sequence of numbers from an hypothetical normal distribution centered with a mean of 20 and a standard deviation of 4.5. Here, we chose to generate 2000 observations.

set.seed(300) # Setting the seed for replication purposes

myData <- rnorm(n=2000,mean=20,sd=4.5) # Creating a random normal distribution 

To confirm that the myData vector was created correctly, we confirm that (1) there are 2000 observations (length), (2) the mean is approximately 20 (mean) and (3) the standard deviation is approximately 4.5 (sd). Note: We shouldn’t expect the mean and standard deviation to have the exact value we inputted in rnorm, as randomness is involved when using this function (rnorm generates a random normal distribution around a given mean and standard deviation).

length(myData) # How many observations?
## [1] 2000
mean(myData) # What is the mean?
## [1] 20.25773
sd(myData) # What is the standard deviation?
## [1] 4.590852

For visual purposes, here is a graph of the myData vector. The dotted red line represents the mean (20.25773) of this distribution.

1.2 Other distributions

Note that you can create a variety of distributions in R. Here is how to create those we discussed on class earlier this week.

1.2.1 Bernoulli

# Setting the seed for replication purposes
set.seed(300) 

# Loading the purrr package
library(purrr)

# Creating a random Bernoulli variable with n=1 (1 sample)
# and p=0.5 (probability of success of 50%)
rbernoulli(n=1, p=0.5)
## [1] TRUE

It returns a logical statement (TRUE or FALSE) indicating whether you sampled a success (i.e. TRUE) or not (i.e. FALSE).

1.2.2 Binomial

Imagine someone flips a fair coin 10 times, and we define a random variable that takes a value of 1 (i.e. success) every time that person gets Heads. The range of this random variable (i.e. the set of all possible values that this rv can take) is therefore \(\{0,1,2,...,10\}\).

Exercise 1

If I tell you that the function to create data from a random binomial distribution is rbinom(n, size, prob), where n is the number of observations you want to sample (i.e. the number of observations, so n=1 if only one person flips a coin 10 times, n=2 if two people flip a coin 10 times each, etc.), size is the number of trials (i.e. in this case, 10 coin flips), and prob is the probability of success on each trial…

How would you generate a random output for the number of heads obtained by 50 people flipping a fair coin 10 times each? Here, we want to see 50 numbers appear: the number of heads out of 10 trials obtained by each person.

# Don't forget to set your seed at 123!
set.seed(123)

# You should input the correct arguments in this function:
# rbinom(n=?, size=?, prob=?)
rbinom(n=50, size=10, prob=0.5)
##  [1] 4 6 5 7 7 2 5 7 5 5 8 5 6 5 3 7 4 2 4 8 7 6 6 9 6 6 5 5 4 3 8 7 6 6 2 5 6 4
## [39] 4 4 3 5 5 4 3 3 4 5 4 7

1.2.3 Poisson

Let’s say we want to create data from an underlying Poisson distributions that counts the number of emails you get in a day. The expected number of emails you get in a day is 30 (i.e. \(\lambda = 30\)).

I want to simulate the number of emails I will get in the next 14 days, according to this distribution function.

set.seed(300) # Setting the seed for replication purposes

rpois(n=14, lambda=30) # Creating a random Poisson variable (n=14, lambda=30)
##  [1] 37 34 32 33 29 38 34 32 36 31 42 29 22 36

2. Loops

Let’s introduce loops using a very simple example. We’ll need loops to perform bootstrapping in a few minutes. Here, we simply want to print 5 observations from myData using sample() and print(). That is to say, we sample and print one observation, five times in a row.

set.seed(300) # Setting the seed for replication purposes

for (i in 1:5) # Specifying the number of iterations 
{ # Opening the curly brackets
    obs <- sample(myData,size=1) # Sampling one observation from the myData vector and storing it into the "obs" object
    print(obs) # Printing the value of that "obs" object
    cat("I have finished", i,"iterations \n") # Printing a string of characters after each iteration
} # Closing the curly brackets
## [1] 13.85058
## I have finished 1 iterations 
## [1] 18.4574
## I have finished 2 iterations 
## [1] 18.57355
## I have finished 3 iterations 
## [1] 22.5185
## I have finished 4 iterations 
## [1] 22.79222
## I have finished 5 iterations

We can use loops to automatize a bunch of things. In the next subsections, we’ll use them to demonstrate two important principles in statistics: the Law of Large Numbers and the Central Limit Theorem.

2.1 Law of Large Numbers

Let’s create a die_mean() function to calculate the mean of n number of die rolls.

die <- 1:6 # Creating a "die" vector with all numbers from 1 to 6 

die_mean <- function(n) {
# Sampling with replacement "n" number of times
# from the following vector: c(1,2,3,4,5,6)
  mean(sample(die, size = n, replace = TRUE))
}

Calculating the mean of n number of die rolls using die_mean().

set.seed(500) # Setting the seed for replication purposes

for (n in c(10,100,1000,10000,100000)) # Here, we are using 5 iterations 
# with the values within this vector for "n"
{
    result <- die_mean(n) # Calculating the mean of n die rolls
    cat("The mean of", n, "number of die rolls is",  result,"\n") # Printing a string of characters after each iteration
}
## The mean of 10 number of die rolls is 3.2 
## The mean of 100 number of die rolls is 3.44 
## The mean of 1000 number of die rolls is 3.492 
## The mean of 10000 number of die rolls is 3.4994 
## The mean of 1e+05 number of die rolls is 3.50887

The sample mean gets closer to the expected value (or population parameter) as the sample size increases. This is the Law of Large Numbers.

2.2 Central Limit Theorem

Drawing i number of observations from a vector of means of 15 die rolls using replicate() and graphing the results using hist().

set.seed(100) # Setting the seed for replication purposes

for (i in c(100,1000,100000))
{
  x <- replicate(i, { # Using replicate to repeat the
    # "randomized experiment" several times without getting
    # the same answer, and storing the results within a x
    # vector (in other words, replicate reevaluates the
    # expression for each replication)
   mm <- die_mean(15) # Calculating the mean of 15 die
   # rolls i number of times
   mean(mm) # Calculating the mean of these means 
   # (this isn't as confusing as it sounds: we basically
   # calculate the mean of 15 die rolls multiple times
   # (either 100, 1000 or 100000 times) and calculate the
   # mean of these 100, 1000 or 100000 observations)
   })
  a <- round(sd(x),digits =3)
  hist(x, # Specifying which data will be plotted
       xlab="Mean of 15 die rolls", # Labeling the x axis
       xlim=c(1,6), # Delimiting the x axis
       col="#4286f4", # Check out "color picker" on Google if you want to select a custom color!
       main=paste("Histogram for ", i," random draws (sd = ", a,")", sep="")) # Using the paste command to generate the main plot title
}

You can click on these graphs to take a closer look!

The distribution of the sample mean tends toward a normal distribution as the sample size increases. This is the Central Limit Theorem.

3. Bootstrapping

What is bootstrapping? It is a way to account for uncertaintly when we measure sample estimates. Using random sampling with replacement, we calculate our estimates for various random samples drawn from our original distribution.

Similarly to what we just did when calculating a “mean of means” for dice rolls, we are estimating how likely we would be to observe a given sample estimate (ex.: mean, standard deviation) if we randomly resampled from this underlying distribution with replacement a large number of times. The reason we are resampling with replacement is to allow for certain numbers to be selected multiple times, which creates greater variance in the generated distributions.

Simply put, what we are doing here is calculating the mean of 1000 samples from a random normal distribution using for(i in x). The mean of each of these samples is stored withing a vector (bootstrap.results).

Why are we doing this? What we are interested in is the uncertainty in the distribution of our estimate (in this case, the mean). We want to measure the standard deviation of each mean we calculated using random sampling with replacement.

set.seed(200) # Setting the seed for replication purposes

n.samples <- 1000 # Number of bootstrap samples

bootstrap.results <- c() # Creating an empty vector to hold the results

for (i in 1:n.samples)
{
    bootstrap.results[i] <- mean(rnorm(2000,20,4.5)) # Mean of the bootstrap sample
}

length(bootstrap.results) # Sanity check: this should contain the mean of 1000 different samples
## [1] 1000
summary(bootstrap.results) # Sanity check
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.64   19.93   20.00   20.00   20.07   20.32
sd(bootstrap.results) # Checking the standard deviation of the distribution of means (this is what we are interested in!)
## [1] 0.1041927
par(mfrow=c(2,1), pin=c(5.8,0.98)) # Combining plots (2 rows, 1 column) and setting the plots size

hist(bootstrap.results, # Creating an histogram
     col="#d83737", # Changing the color
     xlab="Mean", # Giving a label to the x axis
     main=paste("Means of 1000 bootstrap samples from the DGP")) # Giving a title to the graph

hist(myData, # Creating an histogram
     col="#37aad8", # Changing the color
     xlab="Value", # Giving a label to the x axis
     main=paste("Distribution of myData")) # Giving a title to the graph

Exercise 2

Set your seed at 120. Calculate 50 means (i.e. for 50 different samples) from the following distribution: a random Poisson variable encompassing 200 observations with a lambda of 7. Store your results (i.e. these 50 means) in a vector called bootstrap.results. What is the summary of that vector, using the summary() function?

# Don't forget to set your seed at 123!
set.seed(120)