Agenda for today

  1. Intro
  2. Sampling distribution
  3. Simulation


My planning on recitations


Recap: sampling distribution


Simulation: this is not a programming course but we will use some simulations later.

Simulation is not to provide insight into the data or the real-world problem being studied, but rather to evaluate the properties of the statistical methods being used, given an assumed generative model.

Simulation allows us to use probability distributions to understand variations.


Ex: The probability that a baby is a girl or boy is approximately 48% or 52%, respectively, and these do not vary much across the world. Suppose that 100 babies are born in a hospital in a given month. How many will be girls?

In this scenario, we defines an event as a baby being a girl, which has p=0.48. So, the expectation of number of girls in 100 babies would be: 0.48*100 = 48.

  • Binomial Distribution
    • Number of success in n independent Bernoulli trials
    • Bernoulli trial: discrete probability distribution (EX: coin toss -> success (1) or failure (0))
    • Two parameters: number of trials (n) and the probability of success (p)


SAS

data n_girls (keep=x);
    p = 0.48;
    x = rand("Binomial", p, 100); 
    output;
run;

R

n_girls <- rbinom(1, 100, 0.48)
print(n_girls)
## [1] 46


To get a sense of the distribution of what could happen, we simulate the process 1,000 times.

SAS

data n_sims (keep=n_girls);
    p=0.48; 
    do i= 1 to 1000; 
        x = rand("Binomial", p, 100);
        output;
    end;
    rename x=n_girls;
run;

proc sgplot data=n_sims;
    title "1000 draws from Binomial Distribution (p=0.48, n=100)";
    histogram n_girls; 
run;


R

n_sims <- 1000
n_girls <- rbinom(n_sims, 100, 0.48)
hist (n_girls, main="", xaxt="n", yaxt="n", mgp=c(1.5,.5,0))


Simulate other probability distributions.

  • Normal Distribution
    • Two parameters: mean(mu) and standard deviation(sigma)
    • Standard normal distribution: N~(0,1)


SAS

data n_sims_1 (keep=x);
    mu=0; 
    sigma=1;
    do i= 1 to 1000; 
        x = rand("Normal", mu, sigma);
        output;
    end;
run;

/* Another way - ps1.2 syntax*/
%let N=1000;
proc iml; /*iml: interactive matrix language*/
    call randseed(4321); 
    x=j(1,&N); /*generate a J=1xN matrix for x*/
    call randgen(x, "Normal",0,1);
    create n_sims_1 var {x};
    append;
    close n_sims_1;
quit;

proc sgplot data=n_sims_1;
    title "1000 draws from standard normal dist.";
    histogram x; 
run;


  • Poisson distribution
    • Use to model count variables
    • One parameter: lambda (mean=variance)
    • A special case of Negative Binomial distribution: https://youtu.be/BPlmjp2ymxw


SAS

data n_sims_2 (keep=x);
    lambda=5; 
    do i= 1 to 1000; 
        x = rand("Poisson", lambda);
        output;
    end;
run;
proc sgplot data=n_sims_2;
    title "1000 draws from poisson dist. with mean 5";
    histogram x; 
run;


R

y1 <- rnorm(n_sims, 0, 1)
y2 <- rpois(n_sims, 5)
hist(y1, breaks=seq(floor(min(y1)),ceiling(max(y1)),0.5),main="1000 draws from the standard normal dist.")

hist(y2, breaks=seq(-0.5, max(y2) + 1, 1), main="1000 draws from Poisson dist. with mean 5")

We can use simulation to approximate the sampling distribution: bootstrapping.

  • Luz will introduce more details in the Mediation module
  • Resampling data with replacement
  • No need for distributional assumptions
  • Basic steps
    • compute a statistic for the original data
    • resample B times from the data
    • compute the statistic on each bootstrap sample. this creates the bootstrap distribution.
    • use bootstrap distribution to obtain estimates such as SE and CI


[Optional] EX: Estimate the ratio of the median earnings of women in the United States divided by the median earnings of men (dataset “earnings”).


R

# ratio calculation based on the sample
# setwd("your working directory")
earnings <- read.csv("earnings.csv")
earn <- earnings$earn
male <- earnings$male
print(median(earn[male==0]) / median(earn[male==1]))
## [1] 0.6
#### bootstrap simulations
# Create a function called Boot_ratio:
# 1. resample the data with replacement -> get a new sample called "boot"
# 2. calculate and return the ratio using dataset "boot"

Boot_ratio <- function(data){
  n <- nrow(data)
  boot <- sample(n, replace=TRUE)
  earn_boot <- data$earn[boot]
  male_boot <- data$male[boot]
  ratio_boot <- median(earn_boot[male_boot==0]) / median(earn_boot[male_boot==1])
  return(ratio_boot)
}
n_sims <- 10000
output <- replicate(n_sims, Boot_ratio(data=earnings))

# Summarize the results graphically and numerically
hist(output)

round(sd(output), 2)
## [1] 0.03


Use simulation for forecasting.

Not discussed in this course.


Next week: Overview of regression and simple linear regression

  • Reminders:
    • Exercise 1.1 (upload to Dropbox by next Tuesday)
    • Readings: ch 6 & 7