POLISCI 3325G - Data Science for Political Science

Evelyne Brie

Winter 2023

Distributions

In this lab, we will: (1) generate random samples from distribution functions (with replacement), (2) introduce loops and (3) demonstrate the Law of Law Numbers and the Central Limit Theorem.

In this material, I build on the code made available by Carsey and Harden (2014: 217).

Relevant functions: rnorm(), rbernoulli(), rbinom(), rpois(), for()

1. Generating Data

1.1 Normal distribution

The rnorm() function allows us to generate a random sequence of numbers from an hypothetical normal distribution centered with a mean of 20 and a standard deviation of 4.5. Here, we chose to generate 2000 observations.

set.seed(300) # Setting the seed for replication purposes

myData <- rnorm(n=2000,mean=20,sd=4.5) # Creating a random normal distribution

To confirm that the myData vector was created correctly, we confirm that (1) there are 2000 observations (length), (2) the mean is approximately 20 (mean) and (3) the standard deviation is approximately 4.5 (sd). Note: We shouldn’t expect the mean and standard deviation to have the exact value we inputted in rnorm, as randomness is involved when using this function (rnorm generates a random normal distribution around a given mean and standard deviation).

length(myData) # How many observations?

## [1] 2000

mean(myData) # What is the mean?

## [1] 20.25773

sd(myData) # What is the standard deviation?

## [1] 4.590852

For visual purposes, here is a graph of the myData vector. The dotted red line represents the mean (20.25773) of this distribution.

1.2 Other distributions

Note that you can create a variety of distributions in R. Here is how to create those we discussed on class earlier this week.

1.2.1 Bernoulli

# Setting the seed for replication purposes
set.seed(300) 

# Loading the purrr package for the rbernoulli() function
library(purrr)

# Creating a random Bernoulli variable with n=1 (1 sample)
# and p=0.5 (probability of success of 50%)
rbernoulli(n=1, p=0.5)

## [1] TRUE

It returns a logical statement (TRUE or FALSE) indicating whether you sampled a success (i.e. TRUE) or not (i.e. FALSE).

1.2.2 Binomial

Let’s say we have 20 people who flipped a fair coin 5 times, and we define a random variable that takes a value of 1 (i.e. success) every time these people get a Head. The range of this random variable (i.e. the set of all possible values that this rv can take) is therefore \(\{0,1,2,3,4,5\}\), which will be given for each observation. Remember: a binomial distribution is a sequence of Bernoulli trials. Here, we are displaying the results to 20 different binomial trials.

# Setting the seed for replication purposes
set.seed(300) 

# Creating a random binomial variable with n=20 people,
# and p=0.5 (probability of success of 5%), for 5 trials
rbinom(n=20, size=5, prob=0.5)

##  [1] 4 3 3 3 3 0 3 2 2 4 4 2 3 3 3 5 4 3 3 3

1.2.3 Poisson

Let’s say we want to create a Poisson random variable that counts the number of emails you get in a day. The expected number of emails you get in a day is 30 (i.e. \(\lambda = 30\)).

I want to simulate the number of emails I will get in the next 14 days, according to this distribution function.

set.seed(300) # Setting the seed for replication purposes

rpois(n=14, lambda=30) # Creating a random Poisson variable (n=14, lambda=30)

##  [1] 37 34 32 33 29 38 34 32 36 31 42 29 22 36

2. Loops

Let’s introduce loops using a very simple example. We’ll need loops to perform bootstrapping in a few minutes. Here, we simply want to print 5 observations from myData using sample() and print(). That is to say, we sample and print one observation, five times in a row.

set.seed(300) # Setting the seed for replication purposes

for (i in 1:5) # Specifying the number of iterations 
{ # Opening the curly brackets
    obs <- sample(myData,size=1) # Sampling one observation from the myData vector and storing it into the "obs" object
    print(obs) # Printing the value of that "obs" object
    cat("I have finished", i,"iterations \n") # Printing a string of characters after each iteration
} # Closing the curly brackets

## [1] 13.85058
## I have finished 1 iterations 
## [1] 18.4574
## I have finished 2 iterations 
## [1] 18.57355
## I have finished 3 iterations 
## [1] 22.5185
## I have finished 4 iterations 
## [1] 22.79222
## I have finished 5 iterations

We can use loops to automatize a bunch of things. In the next subsections, we’ll use them to demonstrate two important principles in statistics: the Law of Large Numbers and the Central Limit Theorem.

2.1 Law of Large Numbers

Let’s create a die_mean() function to calculate the mean of n number of die rolls.

die <- 1:6 # Creating a "die" vector with all numbers from 1 to 6 

die_mean <- function(n) {
# Sampling with replacement "n" number of times
# from the following vector: c(1,2,3,4,5,6)
  mean(sample(die, size = n, replace = TRUE))
}

Calculating the mean of n number of die rolls using die_mean().

set.seed(500) # Setting the seed for replication purposes

for (n in c(10,100,1000,10000,100000)) # Here, we are using 5 iterations 
# with the values within this vector for "n"
{
    result <- die_mean(n) # Calculating the mean of n die rolls
    cat("The mean of", n, "number of die rolls is",  result,"\n") # Printing a string of characters after each iteration
}

## The mean of 10 number of die rolls is 3.2 
## The mean of 100 number of die rolls is 3.44 
## The mean of 1000 number of die rolls is 3.492 
## The mean of 10000 number of die rolls is 3.4994 
## The mean of 1e+05 number of die rolls is 3.50887

The sample mean gets closer to the expected value (or population parameter) as the sample size increases. This is the Law of Large Numbers.

2.2 Central Limit Theorem

Drawing i number of observations from a vector of means of 15 die rolls using replicate() and graphing the results using hist().

set.seed(100) # Setting the seed for replication purposes

for (i in c(100,1000,100000))
{
  x <- replicate(i, { # Using replicate to repeat the
    # "randomized experiment" several times without getting
    # the same answer, and storing the results within a x
    # vector (in other words, replicate reevaluates the
    # expression for each replication)
   mm <- die_mean(15) # Calculating the mean of 15 die
   # rolls i number of times
   mean(mm) # Calculating the mean of these means 
   # (this isn't as confusing as it sounds: we basically
   # calculate the mean of 15 die rolls multiple times
   # (either 100, 1000 or 100000 times) and calculate the
   # mean of these 100, 1000 or 100000 observations)
   })
  a <- round(sd(x),digits =3)
  hist(x, # Specifying which data will be plotted
       xlab="Mean of 15 die rolls", # Labeling the x axis
       xlim=c(1,6), # Delimiting the x axis
       col="#4286f4", # Check out "color picker" on Google if you want to select a custom color!
       main=paste("Histogram for ", i," random draws (sd = ", a,")", sep="")) # Using the paste command to generate the main plot title
}

You can click on these graphs to take a closer look!

The distribution of the sample mean tends toward a normal distribution as the sample size increases. This is the Central Limit Theorem.

Exercise

If I tell you that the function to create a random binomial distribution is rbinom(n, size, prob), where n is the number of observations you want to sample, size is the number of trials, and prob is the probability of success on each trial…

How would you sample a distribution of 50 people flipping a fair coin 10 times?

# Don't forget to set your seed at 123!
set.seed(123)

# You should input the correct arguments in this function:
# rbinom(n=?, size=?, prob=?)