HB761: Week 1 Recitation

Simulation

Simulation is not to provide insight into the data or the real-world problem being studied, but rather to evaluate the properties of the statistical methods being used, given an assumed generative model.

Simulation allows us to use probability distributions to understand variations.

Ex: The probability that a baby is a girl or boy is approximately 48% or 52%, respectively, and these do not vary much across the world. Suppose that 100 babies are born in a hospital in a given month. How many will be girls?

Binomial Distribution
- Number of success in n independent Bernoulli trials
- Bernoulli trial: discrete probablity distribution (EX: coin toss -> success (1) or failure (0)
- Two parameters: number of trials (n) and the probablity of success (p)

SAS

data n_girls (keep=x);
    p = 0.48;
    x = rand("Binomial", p, 100); 
    output;
run;

n_girls <- rbinom(1, 100, 0.48)
print(n_girls)

## [1] 54

To get a sense of the distribution of what could happen, we simulate the process 1,000 times.

SAS

data n_sims (keep=n_girls);
    p=0.48; 
    do i= 1 to 1000; 
        x = rand("Binomial", p, 100);
        output;
    end;
    rename x=n_girls;
run;

proc sgplot data=n_sims;
    title "1000 draws from Binomial Distribution (p=0.48, n=100)";
    histogram n_girls; 
run;

n_sims <- 1000
n_girls <- rbinom(n_sims, 100, 0.48)
hist (n_girls, main="", xaxt="n", yaxt="n", mgp=c(1.5,.5,0))

Simulate other probability distributions.

Normal Distribution
- Two parameters: mean(mu) and standard deviation(sigma)
- Standard normal distribution: N~(0,1)

SAS

data n_sims_1 (keep=x);
    mu=0; 
    sigma=1;
    do i= 1 to 1000; 
        x = rand("Normal", mu, sigma);
        output;
    end;
run;
proc sgplot data=n_sims_1;
    title "1000 draws from standard normal dist.";
    histogram x; 
run;

Poisson distribution
- Use to model count variables
- One parameter: lambda (mean=variance)
- A special case of Negative Binomial distribution: https://youtu.be/BPlmjp2ymxw

SAS

data n_sims_2 (keep=x);
    lambda=5; 
    do i= 1 to 1000; 
        x = rand("Poisson", lambda);
        output;
    end;
run;
proc sgplot data=n_sims_2;
    title "1000 draws from poisson dist. with mean 5";
    histogram x; 
run;

y1 <- rnorm(n_sims, 0, 1)
y2 <- rpois(n_sims, 5)
hist(y1, breaks=seq(floor(min(y1)),ceiling(max(y1)),0.5),main="1000 draws from the standard normal dist.")

hist(y2, breaks=seq(-0.5, max(y2) + 1, 1), main="1000 draws from Poisson dist. with mean 5")

We can use simulation to approximate the sampling distribution: bootstrapping.

Resampling data with replacement
No need for distributional assumptions
Basic steps
- compute a statistic for the original data
- resample B times from the data
- compute the statistic on each bootstrap sample. this creates the bootstrap distribution.
- use bootstrap distribution to obtain estimates such as SE and CI
Luz will introduce more details in the Mediation module

[Optional] EX: Estimate the ratio of the median earnings of women in the United States divided by the median earnings of men (dataset “earnings”).

# ratio calculation based on the sample
# setwd("your working directory")
earnings <- read.csv("earnings.csv")
earn <- earnings$earn
male <- earnings$male
print(median(earn[male==0]) / median(earn[male==1]))

## [1] 0.6

#### bootstrap simulations
# Create a function called Boot_ratio:
# 1. resample the data with replacement -> get a new sample called "boot"
# 2. calculate and return the ratio using dataset "boot"

Boot_ratio <- function(data){
  n <- nrow(data)
  boot <- sample(n, replace=TRUE)
  earn_boot <- data$earn[boot]
  male_boot <- data$male[boot]
  ratio_boot <- median(earn_boot[male_boot==0]) / median(earn_boot[male_boot==1])
  return(ratio_boot)
}
n_sims <- 10000
output <- replicate(n_sims, Boot_ratio(data=earnings))

# Summarize the results graphically and numerically
hist(output)

round(sd(output), 2)

## [1] 0.03

Use simulation for forecasting.

Not discussed in this course. You may hear epid/biostat friends talk about Markov chain Monte Carlo.

Additional resources:

“HB Methods” YouTube Channel (I will update weekly): https://www.youtube.com/playlist?list=PLWrP3iyhWg3xfJjzr9i9ddNhNC_EpVotH
SAS Rand function: https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=lefunctionsref&docsetTarget=p0fpeei0opypg8n1b06qe4r040lv.htm&locale=en
SAS simulation: https://support.sas.com/content/dam/SAS/support/en/books/simulating-data-with-sas/65378_excerpt.pdf
SAS bootstrapping: https://blogs.sas.com/content/iml/2018/12/12/essential-guide-bootstrapping-sas.html

HB761: Week 1 Recitation

Qian, Yiqing

11/30/2020

Agenda for today

Sampling distributions

Simulation

Simulation allows us to use probability distributions to understand variations.

We can use simulation to approximate the sampling distribution: bootstrapping.

Use simulation for forecasting.

Additional resources:

Next week: TBD