Agenda for today

  1. Sampling distributions (20 mins)
  2. Simulation (25 mins)
  3. Next week agenda & Questions (10 mins)

Sampling distributions

Simulation

Simulation is not to provide insight into the data or the real-world problem being studied, but rather to evaluate the properties of the statistical methods being used, given an assumed generative model.

Simulation allows us to use probability distributions to understand variations.


Ex: The probability that a baby is a girl or boy is approximately 48% or 52%, respectively, and these do not vary much across the world. Suppose that 100 babies are born in a hospital in a given month. How many will be girls?

  • Binomial Distribution
    • Number of success in n independent Bernoulli trials
    • Bernoulli trial: discrete probablity distribution (EX: coin toss -> success (1) or failure (0)
    • Two parameters: number of trials (n) and the probablity of success (p)


SAS

data n_girls (keep=x);
    p = 0.48;
    x = rand("Binomial", p, 100); 
    output;
run;

R

n_girls <- rbinom(1, 100, 0.48)
print(n_girls)
## [1] 54


To get a sense of the distribution of what could happen, we simulate the process 1,000 times.

SAS

data n_sims (keep=n_girls);
    p=0.48; 
    do i= 1 to 1000; 
        x = rand("Binomial", p, 100);
        output;
    end;
    rename x=n_girls;
run;

proc sgplot data=n_sims;
    title "1000 draws from Binomial Distribution (p=0.48, n=100)";
    histogram n_girls; 
run;


R

n_sims <- 1000
n_girls <- rbinom(n_sims, 100, 0.48)
hist (n_girls, main="", xaxt="n", yaxt="n", mgp=c(1.5,.5,0))


Simulate other probability distributions.

  • Normal Distribution
    • Two parameters: mean(mu) and standard deviation(sigma)
    • Standard normal distribution: N~(0,1)


SAS

data n_sims_1 (keep=x);
    mu=0; 
    sigma=1;
    do i= 1 to 1000; 
        x = rand("Normal", mu, sigma);
        output;
    end;
run;
proc sgplot data=n_sims_1;
    title "1000 draws from standard normal dist.";
    histogram x; 
run;

  • Poisson distribution
    • Use to model count variables
    • One parameter: lambda (mean=variance)
    • A special case of Negative Binomial distribution: https://youtu.be/BPlmjp2ymxw


SAS

data n_sims_2 (keep=x);
    lambda=5; 
    do i= 1 to 1000; 
        x = rand("Poisson", lambda);
        output;
    end;
run;
proc sgplot data=n_sims_2;
    title "1000 draws from poisson dist. with mean 5";
    histogram x; 
run;


R

y1 <- rnorm(n_sims, 0, 1)
y2 <- rpois(n_sims, 5)
hist(y1, breaks=seq(floor(min(y1)),ceiling(max(y1)),0.5),main="1000 draws from the standard normal dist.")

hist(y2, breaks=seq(-0.5, max(y2) + 1, 1), main="1000 draws from Poisson dist. with mean 5")

We can use simulation to approximate the sampling distribution: bootstrapping.

  • Resampling data with replacement
  • No need for distributional assumptions
  • Basic steps
    • compute a statistic for the original data
    • resample B times from the data
    • compute the statistic on each bootstrap sample. this creates the bootstrap distribution.
    • use bootstrap distribution to obtain estimates such as SE and CI
  • Luz will introduce more details in the Mediation module


[Optional] EX: Estimate the ratio of the median earnings of women in the United States divided by the median earnings of men (dataset “earnings”).


R

# ratio calculation based on the sample
# setwd("your working directory")
earnings <- read.csv("earnings.csv")
earn <- earnings$earn
male <- earnings$male
print(median(earn[male==0]) / median(earn[male==1]))
## [1] 0.6
#### bootstrap simulations
# Create a function called Boot_ratio:
# 1. resample the data with replacement -> get a new sample called "boot"
# 2. calculate and return the ratio using dataset "boot"

Boot_ratio <- function(data){
  n <- nrow(data)
  boot <- sample(n, replace=TRUE)
  earn_boot <- data$earn[boot]
  male_boot <- data$male[boot]
  ratio_boot <- median(earn_boot[male_boot==0]) / median(earn_boot[male_boot==1])
  return(ratio_boot)
}
n_sims <- 10000
output <- replicate(n_sims, Boot_ratio(data=earnings))

# Summarize the results graphically and numerically
hist(output)

round(sd(output), 2)
## [1] 0.03

Use simulation for forecasting.

Not discussed in this course. You may hear epid/biostat friends talk about Markov chain Monte Carlo.

Next week: TBD