A scientist’s job is essentially modeling the real world with data. Part of that is understanding what mechanisms actually produce that data. This is the essence of the data generating process (DGP), which is the process in the real world that generates the data one is interested in. We often apply statistics in order to understand this DGP, and one way of achieving that is with simulation.
Simulating this process can serve a number of useful statistical purposes that we may not normally consider: estimating the power of a study before collecting any data, checking whether a model can recover parameters we know to be true, and building intuition for how estimators behave.
With all of that said, let us apply this thinking to simulating data in R. We will start off basic, running some univariate simulations, and procedurally build up to more complex phenomena.
Univariate simulations are quite easy to run in R. The only thing you need to know is which distribution you want to simulate, along with the parameters necessary to produce it. For example, a binomial distribution requires a probability parameter, whereas a beta distribution instead takes two shape parameters. While there are many distributions available both in base R and in add-on packages, we focus on just two for this section.
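As a quick, arbitrary sketch of my own, here is how those two distributions ask for different inputs:
#### Different Distributions, Different Parameters ####
rbinom(n=5, size=10, prob=.3) # binomial: number of trials and a probability
rbeta(n=5, shape1=2, shape2=5) # beta: two shape parameters instead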
First, we simulate a univariate normal distribution with the
rnorm function, which only requires that you supply a
sample size, mean, and standard deviation. When simulating, always
remember to start by setting the seed, which initializes a pseudo-random
number generator, enabling replication of your simulated work.
Otherwise, your data will look different every time you simulate it. In
some cases you would want this, but for most of the work here, we will
use the set.seed function to make the analysis
reproducible. The number can be set to literally anything. You can think
of it as the “file name” for your random seed, which is completely
arbitrary.
#### Set Seed ####
set.seed(123)
#### Simulate Distribution ####
x <- rnorm(n=1000, # sample size
mean=50, # mean
sd=10) # SD
#### Run Histogram ####
hist(x, # data
xlab="Simulated X", # label x axis
main="Histogram of Simulation", # title
col="steelblue") # color of bars
Data isn’t always normally distributed, so it may be useful to know how to simulate from a different distribution. Let’s say we are dealing with counts. Counts never have negative values, and with small means they tend to be right skewed. How skewed the distribution is depends on its lambda parameter, or \(\lambda\). To simulate this, we use the rpois function and supply \(\lambda = 2\). This makes the distribution average around counts of \(2\), though the most frequent values pile up at the low end. We can check this by simply adding a mean line onto the histogram with abline, using the v argument to draw a vertical line where we choose.
#### Simulate Distribution ####
x.pois <- rpois(n=1000, lambda=2) # 1000 draws with lambda = 2
#### Run Histogram ####
hist(x.pois, # data
xlab="Simulated X", # label x axis
main="Simulated Poisson Distribution", # title
col="steelblue") # color of bars
#### Add Mean Line ####
abline(v=mean(x.pois), # use this mean to draw line
col="red", # color it red
lwd=2) # make linewidth equal to 2
We can easily extend this further by simulating two distributions, purposely increasing the mean of one to build in a robust difference in effects, the kind we would test with a mean comparison (such as a t-test or categorical regression). We may also be interested in knowing the power of such a test by running a simulated power analysis. We first create the two distributions, control and treatment, with the tools we already have.
#### A Step Further ####
trt <- rnorm(n=1000,
mean=50,
sd=10) # creates treatment group
ctrl <- rnorm(n=1000,
mean=30,
sd=10) # creates control group
hist(trt,
xlim=c(10,80),
main="Group Differences Simulated",
xlab="Simulated X") # labels for first plot carried over for second plot
hist(ctrl,
col="steelblue",
xlim=c(10,80),
add=TRUE) # adds this histogram on top of the previous plot
abline(v=mean(trt),
col="red",
lwd=3) # draws treatment mean
abline(v=mean(ctrl),
col="red",
lwd=3) # draws control mean
Seeing as we have two distributions, we can immediately test the difference between them with a t-test. By default, R uses the Welch t-test; you can switch to the classic pooled-variance version with var.equal = TRUE if you prefer.
t <- t.test(trt,ctrl) # run t-test
p <- t$p.value # save p value (dollar sign pulls p value from this object)
print(t) # show results
##
## Welch Two Sample t-test
##
## data: trt and ctrl
## t = 44.94, df = 1997.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 19.09604 20.83878
## sample estimates:
## mean of x mean of y
## 49.97344 30.00603
print(p) # print p-value
## [1] 2.303736e-305
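As an aside, if you do want the classic pooled-variance Student t-test instead of Welch, it is one argument away (shown for reference only):
#### Pooled-Variance Alternative ####
t.test(trt, ctrl, var.equal=TRUE) # classic Student t-test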
Rather than setting all of this up every time we want to simulate, we can simply write a function that takes the inputs, so that simulating our two distributions only requires a single call. Below is such a function. We are doing the same thing as before, only the means, standard deviation, and sample size are now arguments to our function.
t.sim <- function(n, m1, m2, sd){ # sample size, mu1, mu2, and SD
  trt <- rnorm(n=n,     # sample size goes here
               mean=m1, # first mean here
               sd=sd)   # and the SD here
  ctrl <- rnorm(n=n,
                mean=m2, # second mean here
                sd=sd)
  t <- t.test(trt, ctrl) # run t-test
  t$p.value # return the p value
}
Then we can test it.
single.p <- t.sim(100,50,46,10)
print(single.p)
## [1] 0.0009933038
As can be expected, the test is statistically significant. But that isn’t really the goal of the exercise. The bigger concern is whether the design has enough statistical power, and a single test cannot say much about that. We can instead run the simulation \(B\) times in order to capture the uncertainty in the estimate. We can use the replicate function to achieve this, setting \(B = 1{,}000\).
#### Replicate Function 1,000x ####
multi.p <- replicate(1000, # run function 1000 times
t.sim(100,50,46,10)) # run t-sim function
#### Filter Values Below .05 ####
sig <- multi.p[multi.p < .05]
#### Estimate Power ####
power <- length(sig)/length(multi.p) # proportion sig. to total p values
#### Visualize ####
hist(multi.p,
main="Power Analysis",
breaks = 20,
col="steelblue",
xlab="P Values")
abline(v=.05,col="red",lwd=2) # show "critical" p region
Plotting the simulated p-values, we get a heavily right-skewed distribution, with most values piled up below the \(.05\) cutoff. We can check the actual power of the test by simply running print(power).
print(power)
## [1] 0.81
It seems our test surpasses the typical \(80\)% threshold. However, this assumes a fairly large difference in means, and consequently a fairly large effect size. What if we expect our effect to be weak (a minor difference in means) and simulate power under that assumption?
#### Replicate Function 1,000x ####
multi.p <- replicate(1000, # run function 1000 times
t.sim(100,50,49,10)) # run t-sim function
#### Filter Values Below .05 ####
sig <- multi.p[multi.p < .05]
#### Estimate Power ####
power <- length(sig)/length(multi.p) # proportion sig. to total p values
hist(multi.p,
main="Power Analysis",
breaks = 20,
col="steelblue",
xlab="P Values")
abline(v=.05,col="red",lwd=2) # show "critical" p region
Not surprisingly, most of our p-values now fall above the \(.05\) cutoff, and consequently our power suffers.
print(power)
## [1] 0.121
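A natural extension of this simulation (my own addition, reusing the t.sim function from above) is to ask how large a sample this weak effect would require. We can loop the same replication over several candidate sample sizes:
#### Sketch: Power Across Sample Sizes ####
ns <- c(100, 250, 500, 1000) # candidate per-group sample sizes
power.by.n <- sapply(ns, function(n){
  p <- replicate(1000, t.sim(n, 50, 49, 10)) # same weak effect
  mean(p < .05) # proportion significant = estimated power
})
data.frame(n=ns, power=power.by.n)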
All of this is relatively straightforward, but it may interest us to simulate beyond univariate and by-group comparisons. What if, for example, we wanted to run a simple regression? We could just create an \(x\) and \(y\) that are normally distributed with the function we have been using already.
#### Predictor ####
x <- rnorm(n=100,
mean=50,
sd=10)
#### Outcome ####
y <- rnorm(n=100,
mean=50,
sd=10)
#### Plot ####
plot(x,y,
main = "Simulated Bivariate Data (Weak)",
xlab = "Simulated X",
ylab = "Simulated Y") # not realistic
We can see this isn’t very realistic. Because we haven’t modeled in any relationship between the variables, the data is completely random, and as such the scatterplot shows no covariance between the variables. The fix for this, however, is quite simple if you already know how to construct a linear equation. Recall that for a simple linear regression, our equation is as follows:
\[ y = \beta_0 + \beta_1 x_1 + \epsilon \]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term. Reading this equation tells us everything we need to simulate: choose an intercept and a slope, simulate a predictor \(x\), and draw a random error \(\epsilon\).
And that’s really all there is to it. If you know your \(\beta\) and \(\epsilon\) values, you can easily simulate \(y\). Here is how. First, we set up our parameters for \(x\) and save the degrees of freedom as \(n-2\).
n <- 100 # sample size
df <- n-2 # residual degrees of freedom (check against the regression output)
m <- 50 # set mean of x
sd <- 10 # set sd of x
Then we write our linear equation by defining our beta coefficients and our error term.
b0 <- 30 # intercept
b1 <- .50 # slope
e <- rnorm(n,sd=3) # error term; its SD sets the residual standard error (RSE)
Finally, we just construct \(x\) like we normally do (with the parameters we just chose) and \(y\) as an assembly of our beta coefficients and error term.
x <- rnorm(n,m,sd) # predictor
y <- b0 + (b1*x) + e # linear equation
We can now fit our regression. Run the regression with
lm, then save it as fit. Inspect it with
summary(fit) like below.
fit <- lm(y ~ x) # fit regression
summary(fit) # summarize
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.189 -2.264 0.108 1.945 7.468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.40337 1.38874 21.17 <2e-16 ***
## x 0.51126 0.02732 18.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.934 on 98 degrees of freedom
## Multiple R-squared: 0.7813, Adjusted R-squared: 0.7791
## F-statistic: 350.1 on 1 and 98 DF, p-value: < 2.2e-16
Notice that the terms in our model are strikingly similar to what we specified, though because the data are random, there is some wiggle room around the values we asked for. Still, the intercept, slope, RSE, and degrees of freedom all approximately match. To show this is the case, we can also simply plot the regression line with abline, which draws the fitted line directly from a saved lm object.
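Before plotting, a quick numeric check of my own makes the comparison explicit:
#### Compare True and Estimated Coefficients ####
rbind(true=c(b0, b1), estimated=coef(fit)) # rows should be close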
#### Plot ####
plot(x,y, # plot variables
main = "Better Regression Simulation", # add title
xlab = "Simulated X", # add x axis label
ylab = "Simulated Y") # add y label
#### Add Regression Line ####
abline(fit, # use regression fit
col="red", # color red
lwd=2) # make it size 2
We can see that unlike our last pair of variables, \(x\) and \(y\) have a strong relationship with each other. However, we may anticipate that our DGP is a lot noisier, and we should consider this when designing our experiment, particularly with respect to statistical power. Here we draw a new simulated relationship, this time setting \(\text{RSE} = 10\). All the steps are essentially the same; only the e term is different.
#### Create Noisier Estimates ####
n <- 100 # sample size
df <- n-2 # residual degrees of freedom (check against the regression output)
m <- 50 # set mean of x
sd <- 10 # set sd of x
b0 <- 30 # intercept
b1 <- .50 # slope
e <- rnorm(n,sd=10) # error term; its larger SD raises the RSE
x <- rnorm(n,m,sd) # predictor
y <- b0 + (b1*x) + e # linear equation
fit <- lm(y ~ x) # fit regression
#### Plot Again ####
plot(x,y, # plot variables
main = "Noisy Regression", # add title
xlab = "Simulated X", # add x axis label
ylab = "Simulated Y") # add y label
#### Add Regression Line Again ####
abline(fit, # use regression fit
col="red", # color red
lwd=2) # make it size 2
We can see that our data has wider dispersion, and our \(R^2\) value is consequently not as high, given the amount of noise in the model. As such, it may be difficult to capture this DGP with a small sample.
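To make that concrete, here is a minimal sketch of my own, following the same pattern as the t-test power simulation, that estimates how often a small sample detects the slope under this noisier DGP:
#### Sketch: Detecting the Slope with a Small Sample ####
slope.sim <- function(n){
  x <- rnorm(n, 50, 10) # same predictor as before
  y <- 30 + (.50*x) + rnorm(n, sd=10) # same noisy DGP
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] # slope p value
}
mean(replicate(1000, slope.sim(20)) < .05) # estimated power at n = 20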
We don’t have to stop at simple regression. We can add as many predictors as we like, since the linear equation can include any number of additive terms:
\[ y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon \]
where \(k\) is the number of predictors. The only difference when we simulate is that we now construct two predictors, \(x_1\) and \(x_2\), and add both into the linear equation as before.
#### Predictor Parameters ####
n <- 100
m <- 50
sd <- 10
x1 <- rnorm(n,m,sd)
x2 <- rnorm(n,m,sd)
#### Linear Equation Setup ####
b0 <- 30
b1 <- .50
b2 <- .70
e <- rnorm(n,sd=1)
y <- b0 + (b1*x1) + (b2*x2)+ e
#### Save as Data Frame ####
df <- data.frame(x1,x2,y)
#### Fit and Summarize ####
fit <- lm(y ~ x1 + x2)
summary(fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.39430 -0.61228 -0.03649 0.59066 2.54611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.945649 0.665823 44.98 <2e-16 ***
## x1 0.505584 0.009845 51.36 <2e-16 ***
## x2 0.693914 0.010199 68.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9783 on 97 degrees of freedom
## Multiple R-squared: 0.9872, Adjusted R-squared: 0.987
## F-statistic: 3750 on 2 and 97 DF, p-value: < 2.2e-16
Once again, our estimates reflect what we simulated. Similar operations can be performed with interaction terms, whose coefficients we can set just like the others.
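As a brief sketch of the interaction case (reusing x1, x2, and the coefficients above; the interaction coefficient here is an arbitrary choice of mine):
#### Sketch: Simulating an Interaction ####
b3 <- .02 # interaction coefficient (arbitrary)
y.int <- b0 + (b1*x1) + (b2*x2) + (b3*x1*x2) + rnorm(n, sd=1)
fit.int <- lm(y.int ~ x1 * x2) # x1 * x2 expands to x1 + x2 + x1:x2
coef(fit.int) # estimates should land near 30, .50, .70, and .02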
Often the real world is more difficult to model, but knowing how this sort of data is generated informs us of the ways it should operate. For example, we know that many real-world relationships are nonlinear; commonly cited examples include exponential growth and exponential decay. For this example, we will simulate a logarithmic relationship and see why fitting a straight line to it is not very helpful. This time, instead of multiplying each predictor by a beta coefficient, we directly build the nonlinear function into our equation and add some uncertainty. We first use runif because we want a uniform distribution of predictor values. We then model \(y\) the same way as before, only this time we include \(\log(x)\) in the equation, adding some uncertainty with the usual rnorm error. We then plot as usual.
#### Nonlinear Data ####
x <- runif(500)
y <- log(x) + rnorm(500,sd=.5)
plot(x,y,
main="Exponential Decay Model",
xlab = "Simulated X",
ylab = "Simulated Y")
We then fit a linear relationship between the predictor and outcome.
#### Fit Regression ####
poor.fit <- lm(y ~ x)
#### Add Regression Line ####
plot(x,y,
main="Exponential Decay Model",
xlab = "Simulated X",
ylab = "Simulated Y")
abline(poor.fit,
col = "red",
lwd=2)
We can see that the OLS line does a poor job of modeling the DGP. The regression assumes a constant linear relationship between the two variables, but clearly \(y\) rises steeply at small values of \(x\) and then levels off. We can do better by fitting the data the same way we generated it. We simply add log(x) to our fit. However, because the fitted curve is no longer a straight line, we have to construct it from the model’s predictions. To do this, we use the following steps:
#### Plot Again and Run Regression ####
plot(x,y,
main="Exponential Decay Model",
xlab = "Simulated X",
ylab = "Simulated Y")
fit <- lm(y ~ log(x))
#### Create X Sequence ####
x.new <- seq(
min(x),max(x),length.out=100
)
#### Plot Line ####
new <- data.frame(x = x.new)
pred <- predict(fit, newdata = new)
lines(x.new,pred,col="red",lwd=2)
This is a nice example of a nonlinear process, but other complex relationships can be crafted in R. Our next simulated DGP comes from econometrics. Income and food expenditure are typically heavily right skewed, which makes them difficult to model with a straightforward ordinary least squares regression: the skewness presses the data “into a corner” when visualized on a scatterplot. To convey this notion, we can simulate such data below.
#### Setup Variables ####
x.log <- exp(rnorm(100,mean=10)) # heavily right-skewed (lognormal) predictor
b0 <- .01
b1 <- .5
y.log <- b0 + (b1*x.log) + exp(rnorm(100,mean=10)) # lognormal error keeps y skewed
#### Plot ####
plot(x.log,
y.log,
main = "Compressed Data",
xlab = "Simulated X",
ylab = "Simulated Y")
A common remedy for this issue is a log-log regression, which takes the natural log of both the predictor and the outcome. In practice, this does two things. First, it spreads the data out, so the extreme dispersion is less of a problem. Second, it makes the interpretation convenient: each one percent increase in \(x\) is associated with a given percent change in \(y\) (the coefficient is an elasticity).
#### Plot Logged Values ####
plot(log(x.log),
log(y.log),
main = "Log-Log Regression",
xlab = "Log of X",
ylab = "Log of Y")
#### Fit Data ####
fit.log <- lm(log(y.log) ~ log(x.log))
summary(fit.log)
##
## Call:
## lm(formula = log(y.log) ~ log(x.log))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25990 -0.40556 -0.04201 0.30516 2.19590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.07207 0.74014 10.906 < 2e-16 ***
## log(x.log) 0.25777 0.07403 3.482 0.000745 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.615 on 98 degrees of freedom
## Multiple R-squared: 0.1101, Adjusted R-squared: 0.101
## F-statistic: 12.12 on 1 and 98 DF, p-value: 0.0007454
abline(fit.log,
col="red",
lwd=2)
Sometimes it is difficult to understand exactly what Bayesian models are doing, and simulating Bayes can help build that intuition. As a reminder, Bayes’ theorem is a conditional probability statement such that:
\[ p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} \] where \(\theta\) is a parameter of interest and \(X\) is our data. One common way of visualizing how prior and posterior estimates work is with something called grid approximation. The denominator \(p(X)\) involves an integral over every possible value of \(\theta\), which we can approximate with a Riemann sum such that:
\[ \int p(X|\theta)\,p(\theta)\, d\theta \approx \sum_{i \in G} p(X|\theta_i)\,p(\theta_i) \, \Delta\theta_i \] where \(G\) is our grid and \(\Delta\theta_i = \theta_i - \theta_{i-1}\). To simplify, the grid is just a set of points at which we evaluate the function in order to approximate the integral (the area under the curve). The more points, the more precise the approximation. Picking a grid is a loaded topic (how large, uniform or not, etc.), so I simply illustrate how to do it without going into detail about which grid to pick.
Let’s say we are trying to estimate the probability of a fire occurring, and we have some idea of what that probability should normally be. We set up the grid approximation with the following steps (much of this is from the book Statistical Rethinking by Richard McElreath, with slight rewording):
#### Define Grid ####
p_grid <- seq(from=0,
to=1,
length.out=20 ) # creates a grid of 20 values from 0 to 1
#### Define Prior ####
prior <- rep(1,20) # define a constant prior
#### Compute Likelihood at Each Value in Grid ####
likelihood <- dbinom(6,           # observed fires (successes)
                     size=20,     # out of 20 trials
                     prob=p_grid) # evaluated at each grid value
#### Compute Product of Likelihood and Prior ####
unstd.posterior <- likelihood * prior
#### Standardize the Posterior (Sums to 1) ####
posterior <- unstd.posterior / sum(unstd.posterior)
#### Plot Grid Approximation ####
plot(p_grid,
     posterior,
     type="b",
     xlab="Probability of Fire",
     ylab="Posterior Probability",
     main="Probability of Fire Given Priors")
We now have a posterior probability distribution for a fire occurring, combining our prior information with the likelihood of the data.
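To summarize this posterior, one standard follow-up (borrowed from the same Statistical Rethinking workflow, not shown in the original steps) is to sample grid values in proportion to their posterior probability:
#### Sketch: Sampling from the Grid Posterior ####
samples <- sample(p_grid, size=1e4, replace=TRUE, prob=posterior)
mean(samples) # posterior mean probability of fire
quantile(samples, c(.025, .975)) # a 95% credible interval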
We can also visualize how the Bayesian process works with respect to estimating regressions. This example comes from the book Regression and Other Stories by Gelman and colleagues (with some minor adjustments of my own), where two regressions are fit, one with flat priors and one with informative priors. The sample size is quite small, which will provide some informative feedback on its influence on estimation. First, we load the rstanarm package to run a Bayesian regression model. Then we simulate some data, which models a relationship between the attractiveness of a parent and their percentage of girl babies.
#### Load Package ####
library(rstanarm) # run install.packages("rstanarm") if not installed
#### Create Data ####
x <- seq(-2,2,1) # produce sequence from -2 to 2 in 1 point increments
y <- c(50, 44, 50, 47, 56) # create this vector of 5 points
sexratio <- data.frame(x, y) # merge into data frame
We then set up the priors. I don’t go into a ton of detail here, deferring instead to the two previously mentioned books for a more in-depth exploration of Bayesian priors. The bulk of what we are doing here is stating what our estimates should look like in the real world and the uncertainty around those estimates. These will be added to the regression later.
#### Informative Priors ####
theta_hat_prior <- 0 # prior estimate of the effect
se_prior <- 0.25 # prior uncertainty
theta_hat_data <- 8 # estimate suggested by the data
se_data <- 3 # uncertainty of the data estimate
theta_hat_bayes <- (theta_hat_prior/se_prior^2 + theta_hat_data/se_data^2)/(1/se_prior^2 + 1/se_data^2) # precision-weighted average of prior and data
se_bayes <- sqrt(1/(1/se_prior^2 + 1/se_data^2)) # combined (posterior) SE
We then estimate two regressions, one with a flat prior and one with an informative prior. We use the flexible function stan_glm to achieve both. We assign refresh = 0 so it doesn’t print the sampler’s progress during estimation. Notice that while the first regression is specified much like an lm or glm call in base R, the second regression includes the priors we assigned earlier.
#### Regression with Flat Prior ####
fit_default <- stan_glm(y ~ x,
data = sexratio,
refresh = 0)
#### Regression with Informative Prior ####
fit_post <- stan_glm(y ~ x,
data = sexratio,
prior = normal(0, 0.2),
prior_intercept = normal(48.8, 0.5),
refresh = 0)
Finally, we plot our fits. There is a lot of code here since I am using base R, but I have annotated and simplified the code from the book to give you the gist of what is going on.
#### Customize Plot ####
par(mfrow=c(1,2), # row x col layout
mar=c(5,3,3,2), # margins
mgp=c(1.7,.5,0), # margin and axis lines
tck=-.01) # length of tick marks
#### Save Both Regressions as List ####
fit_bayes <- list(as.data.frame(fit_default),
as.data.frame(fit_post)) # save both regressions into list
#### Plot Simulations from Bayes ####
for (k in 1:2){ # use both regressions
sims <- fit_bayes[[k]] # get their simulations
coef_est <- apply(sims, 2, median) # get their coefs
b_se <- 1.483*median(abs(sims[,2] - median(sims[,2]))) # robust SE of the slope via the MAD
plot(x, y, # use this data
ylim=c(43, 57), # limit the y-axis to this range
xlab="Attractiveness of Parent",
ylab="Percentage of Girl Babies", # use these axis labels
main=if (k==1) "Flat Prior"
else "Informative Prior", # provide title based on which fit
pch=19,
cex=1) # pch and cex change look of dots
for (i in 1:100){
abline(sims[i,1], # intercept from posterior draw i
sims[i,2], # slope from posterior draw i
lwd=.5, # thin lines
col="gray50") # color each posterior draw's line gray
}
abline(coef_est[1], # median intercept
coef_est[2], # median slope
lwd=2) # add the overall fitted line for this model
}
The results show us two peculiar but informative insights. First, it is quite clear that the flat-prior regression produces way too much noise: our certainty in this estimate is limited, as there is too much variation among the fitted lines to trust any single one. The informative-prior regression has far greater precision, in that its band of fitted lines is narrowly concentrated in one area.
Second, even with far more precise fitting, the estimated relationship between the points is still quite poor. This is because the sample is just too small (and because the data isn’t linear enough to narrow the estimate further). It is sometimes claimed that Bayesian methods are indifferent to sample size. This illustration should show that this is far from correct, and careful choices must be made when designing even a Bayesian model.
This concludes the discussion on simulation. If you felt it was informative, you may consider buying me a cup of coffee. Feel free to also check out my other RPubs on this page if you want to explore my other R tutorials.