Simulation is not to provide insight into the data or the real-world problem being studied, but rather to evaluate the properties of the statistical methods being used, given an assumed generative model.
Ex: The probability that a baby is a girl or boy is approximately 48% or 52%, respectively, and these do not vary much across the world. Suppose that 100 babies are born in a hospital in a given month. How many will be girls?
SAS
data n_girls (keep=x);
p = 0.48;
x = rand("Binomial", p, 100);
output;
run;
R
n_girls <- rbinom(1, 100, 0.48)
print(n_girls)
## [1] 54
To get a sense of the distribution of what could happen, we simulate the process 1,000 times.
SAS
data n_sims (keep=n_girls);
p=0.48;
do i= 1 to 1000;
x = rand("Binomial", p, 100);
output;
end;
rename x=n_girls;
run;
proc sgplot data=n_sims;
title "1000 draws from Binomial Distribution (p=0.48, n=100)";
histogram n_girls;
run;
R
n_sims <- 1000
n_girls <- rbinom(n_sims, 100, 0.48)
hist (n_girls, main="", xaxt="n", yaxt="n", mgp=c(1.5,.5,0))
Simulate other probability distributions.
SAS
data n_sims_1 (keep=x);
mu=0;
sigma=1;
do i= 1 to 1000;
x = rand("Normal", mu, sigma);
output;
end;
run;
proc sgplot data=n_sims_1;
title "1000 draws from standard normal dist.";
histogram x;
run;
SAS
data n_sims_2 (keep=x);
lambda=5;
do i= 1 to 1000;
x = rand("Poisson", lambda);
output;
end;
run;
proc sgplot data=n_sims_2;
title "1000 draws from poisson dist. with mean 5";
histogram x;
run;
R
y1 <- rnorm(n_sims, 0, 1)
y2 <- rpois(n_sims, 5)
hist(y1, breaks=seq(floor(min(y1)),ceiling(max(y1)),0.5),main="1000 draws from the standard normal dist.")
hist(y2, breaks=seq(-0.5, max(y2) + 1, 1), main="1000 draws from Poisson dist. with mean 5")
[Optional] EX: Estimate the ratio of the median earnings of women in the United States divided by the median earnings of men (dataset “earnings”).
R
# ratio calculation based on the sample
# setwd("your working directory")
earnings <- read.csv("earnings.csv")
earn <- earnings$earn
male <- earnings$male
print(median(earn[male==0]) / median(earn[male==1]))
## [1] 0.6
#### bootstrap simulations
# Create a function called Boot_ratio:
# 1. resample the data with replacement -> get a new sample called "boot"
# 2. calculate and return the ratio using dataset "boot"
Boot_ratio <- function(data){
n <- nrow(data)
boot <- sample(n, replace=TRUE)
earn_boot <- data$earn[boot]
male_boot <- data$male[boot]
ratio_boot <- median(earn_boot[male_boot==0]) / median(earn_boot[male_boot==1])
return(ratio_boot)
}
n_sims <- 10000
output <- replicate(n_sims, Boot_ratio(data=earnings))
# Summarize the results graphically and numerically
hist(output)
round(sd(output), 2)
## [1] 0.03
Not discussed in this course. You may hear epid/biostat friends talk about Markov chain Monte Carlo.