Simulating the Data Generating Process (DGP)

A scientist’s job is essentially modeling the real world with data. Part of that is understanding what mechanisms actually produce that data. This is the essence of the data generating process (DGP), which is the process in the real world that generates the data one is interested in. We often apply statistics in order to understand this DGP, and one way of achieving that is with simulation.

Simulating this process has a number of useful payoffs that we may not normally consider: we know the true parameters in advance, we can check whether our models actually recover them, and we can estimate properties of a planned design, such as statistical power, before collecting any data.

With all of that said, let us apply this thinking to simulating data in R. We will start off basic, running some univariate simulations, and procedurally build up to more complex phenomena.

Univariate Simulations

Univariate simulations are quite easy to run in R. The only things you need to know are which distribution you want to simulate from and the parameters necessary to produce it. For example, a binomial distribution requires a success probability, whereas a beta distribution instead takes two shape parameters. While many distributions are available in base R and in importable libraries, we focus on only two in this section.

First, we simulate a univariate normal distribution with the rnorm function, which only requires that you supply a sample size, mean, and standard deviation. When simulating, always remember to start by setting the seed, which initializes the pseudo-random number generator and makes your simulated work replicable. Otherwise, your data will look different every time you simulate it. In some cases you may want that, but for most of the work here we will use the set.seed function to make the analysis reproducible. The number can be set to literally anything; you can think of it as the “file name” for your random seed, and the choice is completely arbitrary.

#### Set Seed ####
set.seed(123)

#### Simulate Distribution ####
x <- rnorm(n=1000, # sample size
           mean=50, # mean
           sd=10) # SD

#### Run Histogram ####
hist(x, # data
     xlab="Simulated X", # label x axis
     main="Histogram of Simulation", # title
     col="steelblue") # color of bars

Data isn’t always normally distributed, so it is useful to know how to simulate from other distributions. Let’s say we are dealing with counts. Counts never have negative values, and count data is typically right skewed. For a Poisson distribution, how skewed it is depends on its rate parameter, \(\lambda\). To simulate this, we use the rpois function and supply \(\lambda = 2\). This makes the distribution center on a count of \(2\) (its mean), with the most frequent values at \(1\) and \(2\) and a right tail of larger counts. We can check the center by simply adding a mean line onto the histogram with abline, using the v argument to draw a vertical line where we choose.

#### Simulate Distribution ####
x.pois <- rpois(n=1000, # sample size
                lambda=2) # rate parameter

#### Run Histogram ####
hist(x.pois, # data
     xlab="Simulated X", # label x axis
     main="Simulated Poisson Distribution", # title
     col="steelblue") # color of bars

#### Add Mean Line ####
abline(v=mean(x.pois), # use this mean to draw line
       col="red", # color it red
       lwd=2) # make linewidth equal to 2
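
As a quick check on a defining property of the Poisson distribution (its mean and variance both equal \(\lambda\)), we can compare the two directly:

#### Check Mean-Variance Equality ####
mean(x.pois) # should be close to lambda = 2
var(x.pois) # for a Poisson, the variance should also be close to 2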

A Simple Power Analysis

We can extend this further by simulating two distributions, purposely increasing the mean of one to produce a robust difference between groups (the kind we would test with a mean comparison via a t-test or categorical regression). We may also be interested in the power of such a test, which we can estimate with a simulated power analysis. We first create the two distributions, control and treatment, with the tools we already have.

#### A Step Further ####
trt <- rnorm(n=1000,
             mean=50,
             sd=10) # creates treatment group

ctrl <- rnorm(n=1000,
              mean=30,
              sd=10) # creates control group
hist(trt,
     xlim=c(10,80),
     main="Group Differences Simulated",
     xlab="Simulated X") # labels for first plot carried over for second plot

hist(ctrl,
     col="steelblue",
     xlim=c(10,80),
     add=TRUE) # this adds the plot to the last one

abline(v=mean(trt),
       col="red",
       lwd=3) # draws treatment mean

abline(v=mean(ctrl),
       col="red",
       lwd=3) # draws control mean

Seeing as we have two distributions, we can right away test the difference between them with a t-test. By default, R uses the Welch t-test, but you can change this (via the var.equal argument) if for some reason you don’t want it.

t <- t.test(trt,ctrl) # run t-test
p <- t$p.value # save p value (dollar sign pulls p value from this object)
print(t) # show results
## 
##  Welch Two Sample t-test
## 
## data:  trt and ctrl
## t = 44.94, df = 1997.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  19.09604 20.83878
## sample estimates:
## mean of x mean of y 
##  49.97344  30.00603
print(p) # print p-value
## [1] 2.303736e-305
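
As an aside, if you prefer the classic Student t-test, which assumes equal variances, you can request it through the var.equal argument:

#### Student t-Test Instead ####
t.student <- t.test(trt,ctrl,var.equal=TRUE) # pooled-variance Student t-test
print(t.student) # compare against the Welch results above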

Rather than setting all of this up every time we want to simulate, we can write a function that takes the inputs as arguments, and then pass in whatever values we want in order to simulate our two distributions. Below is such a function. We are doing the same thing as before, only the means, standard deviation, and sample size are now supplied as inputs to our function.

t.sim <- function(n,m1,m2,sd){ # sample size, mu1, mu2, and SD
  
  trt <- rnorm(n=n, # goes here
               mean=m1, # here
               sd=sd) # and here
  
  ctrl <- rnorm(n=n,
                mean=m2, # second mean here
                sd=sd) 
  
  t <- t.test(trt,ctrl) # run t-test
  t$p.value # return the p value
}

Then we can test it.

single.p <- t.sim(100,50,46,10)
print(single.p)
## [1] 0.0009933038

As can be expected, the test is statistically significant. But that isn’t really the goal here. The bigger concern is whether the design has enough statistical power, and a single test cannot say much about that. We can instead run the simulation \(B\) times in order to capture the uncertainty in the estimate. We can use the replicate function to achieve this, setting \(B = 1,000\).

#### Replicate Function 1,000x ####
multi.p <- replicate(1000, # run function 1000 times
                     t.sim(100,50,46,10)) # run the t.sim function

#### Filter Values Below .05 ####
sig <- multi.p[multi.p < .05]

#### Estimate Power ####
power <- length(sig)/length(multi.p) # proportion sig. to total p values

#### Visualize ####
hist(multi.p,
     main="Power Analysis",
     breaks = 20,
     col="steelblue",
     xlab="P Values")

abline(v=.05,col="red",lwd=2) # show "critical" p region

Plotting the p-values gives a heavily right skewed distribution, with most values falling below the cutoff. We can check the actual power of the test by simply running print(power).

print(power)
## [1] 0.81
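
As an aside, because R treats TRUE as 1 and FALSE as 0, the same power estimate can be computed in a single line:

#### One-Line Power Estimate ####
power <- mean(multi.p < .05) # proportion of significant p values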

It seems our test surpasses the typical \(80\)% threshold. However, this assumes a fairly large difference in means, and consequently a fairly large effect size. What if we expect our effect to be pretty weak (a minor difference in means) and simulate power?

#### Replicate Function 1,000x ####
multi.p <- replicate(1000, # run function 1000 times
                     t.sim(100,50,49,10)) # run t.sim with a smaller mean difference

#### Filter Values Below .05 ####
sig <- multi.p[multi.p < .05]

#### Estimate Power ####
power <- length(sig)/length(multi.p) # proportion sig. to total p values

#### Visualize ####
hist(multi.p,
     main="Power Analysis",
     breaks = 20,
     col="steelblue",
     xlab="P Values")

abline(v=.05,col="red",lwd=2) # show "critical" p region

Not surprisingly, most of our p-values now fall above the \(.05\) cutoff, and consequently our power suffers.

print(power)
## [1] 0.121
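
A natural extension of this idea (a sketch of my own, with an assumed range of sample sizes) is to sweep over candidate sample sizes and see how large \(n\) must be before this weak effect reaches the \(80\)% threshold:

#### Sketch: Power Across Sample Sizes ####
n.values <- seq(200,2000,by=200) # candidate sample sizes per group

power.curve <- sapply(n.values, function(n){
  p <- replicate(1000, t.sim(n,50,49,10)) # B = 1,000 tests at each n
  mean(p < .05) # proportion significant = estimated power
})

plot(n.values,power.curve,
     type="b",
     xlab="Sample Size per Group",
     ylab="Estimated Power",
     main="Simulated Power Curve")

abline(h=.8,col="red",lwd=2) # typical 80% target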

Simulating Regressions

All of this is relatively straightforward, but it may interest us to simulate beyond univariate and by-group comparisons. What if, for example, we wanted to run a simple regression? We could just create an \(x\) and \(y\) that are normally distributed with the function we have been using already.

#### Predictor  ####
x <- rnorm(n=100,
           mean=50,
           sd=10)

#### Outcome ####
y <- rnorm(n=100,
           mean=50,
           sd=10)

#### Plot ####
plot(x,y,
     main = "Simulated Bivariate Data (Weak)",
     xlab = "Simulated X",
     ylab = "Simulated Y") # not realistic

We can see this isn’t very realistic. Because we haven’t built any relationship between the variables into the simulation, the data is completely random, and the scatterplot shows no covariance between the variables. The fix, however, is quite simple if you know how to construct a linear equation. Recall that for a simple linear regression, our equation is as follows:

\[ y = \beta_0 + \beta_1 x_1 + \epsilon \]

where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term. Reading this equation tells us everything we need for a simulation: if we choose values for \(\beta_0\) and \(\beta_1\), then simulate \(x\) and the error \(\epsilon\), the value of \(y\) is fully determined by the equation itself.

And that’s really all there is to it. If you know your \(\beta\) and \(\epsilon\) values, you can easily simulate \(y\). Here is how. First, we set up our parameters for \(x\) and save the degrees of freedom as \(n-2\).

n <- 100 # sample size
df <- n-2 # residual degrees of freedom; should match the regression output
m <- 50 # set mean of x
sd <- 10 # set sd of x

Then we write our linear equation by defining our beta coefficients and our error term.

b0 <- 30 # intercept
b1 <- .50 # slope
e <- rnorm(n,sd=3) # error term; its SD sets the residual standard error (RSE)

Finally, we construct \(x\) like we normally do (with the parameters we just chose) and assemble \(y\) from our beta coefficients, \(x\), and the error term.

x <- rnorm(n,m,sd) # predictor
y <- b0 + (b1*x) + e # linear equation

We can now fit our regression. Run the regression with lm, then save it as fit. Inspect it with summary(fit) like below.

fit <- lm(y ~ x) # fit regression
summary(fit) # summarize
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.189 -2.264  0.108  1.945  7.468 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.40337    1.38874   21.17   <2e-16 ***
## x            0.51126    0.02732   18.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.934 on 98 degrees of freedom
## Multiple R-squared:  0.7813, Adjusted R-squared:  0.7791 
## F-statistic: 350.1 on 1 and 98 DF,  p-value: < 2.2e-16

Notice that the estimated terms are eerily close to what we asked for; because the data is randomized, the uncertainty in the estimates provides some wiggle room, but the intercept, slope, RSE, and degrees of freedom all approximately match our inputs. To show this is the case, we can also simply plot the regression line with abline, which automatically draws the fitted line when given a saved regression.

#### Plot ####
plot(x,y, # plot variables
     main = "Better Regression Simulation", # add title
     xlab = "Simulated X", # add x axis label
     ylab = "Simulated Y")  # add y label

#### Add Regression Line ####
abline(fit, # use regression fit
       col="red", # color red
       lwd=2) # make it size 2

We can see that unlike our last pair of variables, \(x\) and \(y\) have a strong relationship with each other. However, we may anticipate that our DGP is a lot noisier, and we should consider this when designing our experiment, particularly with respect to statistical power. Here we draw a new simulated relationship, only changing \(\text{RSE} = 10\) this time. All the steps are essentially the same, only the e term is different.

#### Create Noisier Estimates
n <- 100 # sample size
df <- n-2 # residual degrees of freedom; should match the regression output
m <- 50 # set mean of x
sd <- 10 # set sd of x
b0 <- 30 # intercept
b1 <- .50 # slope
e <- rnorm(n,sd=10) # noisier error term (RSE = 10)
x <- rnorm(n,m,sd) # predictor
y <- b0 + (b1*x) + e # linear equation
fit <- lm(y ~ x) # fit regression

#### Plot Again ####
plot(x,y, # plot variables
     main = "Noisy Regression", # add title
     xlab = "Simulated X", # add x axis label
     ylab = "Simulated Y")  # add y label

#### Add Regression Line Again ####
abline(fit, # use regression fit
       col="red", # color red
       lwd=2) # make it size 2
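
To quantify the drop, we can pull the \(R^2\) directly from the fit (the exact value will vary from run to run):

#### Check R-Squared ####
summary(fit)$r.squared # much lower than the ~.78 from the quieter simulation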

We can see that our data has wider dispersion, and as the \(R^2\) confirms, the fit is consequently not as strong given the amount of noise in the model. As such, it may be difficult to capture this DGP with a small sample.

Extension to Multiple Linear Regression

We don’t have to stop with simple regression. We simply add as many predictors as we want, as the linear equation can contain any number of additive terms:

\[ y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon \]

where \(k\) is the number of predictors. The only difference when we simulate is that we construct two predictors, \(x_1\) and \(x_2\), and add both into the linear equation like before.

#### Predictor Parameters ####
n <- 100
m <- 50
sd <- 10
x1 <- rnorm(n,m,sd)
x2 <- rnorm(n,m,sd)

#### Linear Equation Setup ####
b0 <- 30
b1 <- .50
b2 <- .70
e <- rnorm(n,sd=1)
y <- b0 + (b1*x1) + (b2*x2)+ e

#### Save as Data Frame ####
df <- data.frame(x1,x2,y)

#### Fit and Summarize ####
fit <- lm(y ~ x1 + x2)
summary(fit)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.39430 -0.61228 -0.03649  0.59066  2.54611 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.945649   0.665823   44.98   <2e-16 ***
## x1           0.505584   0.009845   51.36   <2e-16 ***
## x2           0.693914   0.010199   68.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9783 on 97 degrees of freedom
## Multiple R-squared:  0.9872, Adjusted R-squared:  0.987 
## F-statistic:  3750 on 2 and 97 DF,  p-value: < 2.2e-16

Once again, our estimates reflect what we simulated. Similar operations can be performed on interaction terms, which can be tuned through the parameters we include in the model, as sketched below.
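
As a minimal sketch (with an interaction coefficient of \(.20\) that I chose purely for illustration), an interaction is just another product term added to the linear equation:

#### Sketch: Simulating an Interaction ####
b3 <- .20 # interaction coefficient (assumed for illustration)
y.int <- b0 + (b1*x1) + (b2*x2) + (b3*x1*x2) + rnorm(n,sd=1) # add product term
fit.int <- lm(y.int ~ x1 * x2) # x1 * x2 expands to x1 + x2 + x1:x2
summary(fit.int) # estimates should land near b0, b1, b2, and b3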

More Complex Models

Often the real world is more difficult to model, but knowing how this sort of data is generated informs us of the ways it should behave. For example, we know that many real-world relationships are nonlinear. Two commonly known nonlinear patterns are exponential and logarithmic curves. For this example, we will simulate a logarithmic relationship and see why fitting a straight regression line to it is not very helpful. This time, instead of multiplying each predictor by a beta coefficient, we can directly build the function into our linear equation and add some uncertainty. We first use runif to draw the predictor uniformly between \(0\) and \(1\). We then model \(y\) the same way as before, only this time we include \(\text{log}(x)\) in the equation, adding some uncertainty with the usual rnorm error. We then plot like usual.

#### Nonlinear Data ####
x <- runif(500)
y <- log(x) + rnorm(500,sd=.5)
plot(x,y,
     main="Exponential Decay Model",
     xlab = "Simulated X",
     ylab = "Simulated Y")

We then fit a linear relationship between the predictor and outcome.

#### Fit Regression ####
poor.fit <- lm(y ~ x)

#### Add Regression Line ####
plot(x,y,
     main="Exponential Decay Model",
     xlab = "Simulated X",
     ylab = "Simulated Y")

abline(poor.fit,
       col = "red",
       lwd=2)

We can see that the OLS line does a terrible job of modeling the DGP. The regression assumes a constant linear relationship between the two variables, but clearly \(y\) grows rapidly at first and then levels off. We can do better by fitting the data the same way we generated it: we simply add log(x) to our fit. However, because the fitted relationship is no longer a straight line, we have to construct the curve from the model’s predictions. To do this, we use the following steps:

  1. Create a sequence of \(x\) values spanning the range of \(x\), with enough points to draw a smooth line.
  2. Generate predictions at these values using the regression fit.
  3. Plot the sequence and its predictions as a line on top of the scatterplot.

#### Plot Again and Run Regression ####
plot(x,y,
     main="Simulated Logarithmic Data",
     xlab = "Simulated X",
     ylab = "Simulated Y")

fit <- lm(y ~ log(x))

#### Create X Sequence ####
x.new <- seq(min(x), max(x), length.out=100) # 100 evenly spaced values across x

#### Plot Line ####
new <- data.frame(x = x.new)
pred <- predict(fit, newdata = new)
lines(x.new,pred,col="red",lwd=2)

This is a nice example of a nonlinear process, but other complex relationships can be crafted in R. Our next simulated DGP comes from econometrics. Typically, income and food expenditure are heavily right skewed, and these variables become difficult to model with a straightforward ordinary least squares regression. This is because their skewness causes data to be “pressed into a corner” when visualized on a scatterplot. To convey this notion, we can simulate such data below.

#### Setup Variables ####
x.log <- exp(rnorm(100,mean=10))
b0 <- .01
b1 <- .5
y.log <- b0 + (b1*x.log) + exp(rnorm(100,mean=10))

#### Plot ####
plot(x.log,
     y.log,
     main = "Compressed Data",
     xlab = "Simulated X",
     ylab = "Simulated Y")

A common remedy to this issue is a log-log regression, which takes the natural log of both the predictor and the outcome. In practice, this does two things. First, it spreads the data out, so the dispersion is not a large problem. Second, it makes the interpretation easier: in a log-log model, each one percent increase in \(x\) is associated with an expected \(\beta_1\) percent change in \(y\) (an elasticity).

#### Plot Logged Values ####
plot(log(x.log),
     log(y.log),
     main = "Log-Log Regression",
     xlab = "Log of X",
     ylab = "Log of Y")

#### Fit Data ####
fit.log <- lm(log(y.log) ~ log(x.log))
summary(fit.log)
## 
## Call:
## lm(formula = log(y.log) ~ log(x.log))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25990 -0.40556 -0.04201  0.30516  2.19590 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.07207    0.74014  10.906  < 2e-16 ***
## log(x.log)   0.25777    0.07403   3.482 0.000745 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.615 on 98 degrees of freedom
## Multiple R-squared:  0.1101, Adjusted R-squared:  0.101 
## F-statistic: 12.12 on 1 and 98 DF,  p-value: 0.0007454
abline(fit.log,
       col="red",
       lwd=2)

Simulating Bayesian Thinking

Sometimes it is difficult to understand exactly what Bayesian models are doing, and simulating Bayes can help build that intuition. As a reminder, Bayes’ theorem is a statement about conditional probability such that:

\[ p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} \] where \(\theta\) is a given estimate and \(X\) is our data. One common way of visualizing how Bayesian prior and posterior estimates interact is something called grid approximation. The denominator \(p(X)\) involves an integral over all values of \(\theta\), which we can approximate with a Riemann sum such that:

\[ \int p(X|\theta)\,p(\theta)\, d\theta \approx \sum_{i \in G} p(X|\theta_i)\,p(\theta_i) \, \Delta\theta_i \] where \(G\) is our grid and \(\Delta\theta_i = \theta_i - \theta_{i-1}\). To simplify: the grid is just a set of points at which we evaluate the function in order to approximate the integral (the area under the curve). The more points, the more precise the approximation. Picking the grid is a loaded topic (how large, uniform or not, etc.), so I simply illustrate how to do it without going into more detail about which grid to pick.

Let’s say we are trying to estimate the probability of a fire occurring, and we have some idea of what that probability normally should be. We set up the grid approximation with the following steps (much of this is from the book Statistical Rethinking by Richard McElreath, with slight rewording):

  1. Define the grid by deciding how many points to use in estimating the posterior, and then you make a list of the parameter values on the grid.
  2. Calculate the value of the prior at each parameter value on the grid.
  3. Estimate the likelihood at each individual parameter value.
  4. Compute the unstandardized posterior at each parameter value, by multiplying the prior by the likelihood.
  5. Standardize the posterior by dividing each value by the sum of all values.

#### Define Grid #### 
p_grid <- seq(from=0, 
              to=1,
              length.out=20 ) # creates a grid of 20 values from 0 to 1

#### Define Prior #### 
prior <- rep(1,20) # define a constant prior

#### Compute Likelihood at Each Value in Grid #### 
likelihood <- dbinom(6, 
                     size=20,
                     prob=p_grid)

#### Compute Product of Likelihood and Prior ####
unstd.posterior <- likelihood * prior 

#### Standardize the Posterior (Sums to 1) #### 
posterior <- unstd.posterior / sum(unstd.posterior) 

#### Plot Grid Approximation ####
plot( p_grid , 
      posterior ,
      type="b" , 
      xlab="Probability of Fire" ,
      ylab="Posterior Probability",
      main="Probability of Fire Given Priors")

We now have a posterior probability distribution for a fire occurring, obtained by combining our prior information with the likelihood of the observed data.
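
To summarize this posterior numerically, we can follow Statistical Rethinking one step further and sample parameter values from the grid in proportion to their posterior probability:

#### Sample from the Grid Posterior ####
samples <- sample(p_grid, size=1e4, replace=TRUE, prob=posterior)
mean(samples) # posterior mean probability of fire
quantile(samples, c(.025,.975)) # 95% credible interval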

We can also visualize how the Bayesian process works with respect to estimating regressions. This example comes from the book Regression and Other Stories by Gelman and colleagues (with some of my own minor adjustments), where two regressions are fit, one with flat priors and one with informative priors. The sample size is quite small, which will provide some informative feedback on its influence on estimation. First, we load the rstanarm package to run a Bayesian regression model. Then we simulate some data, which models a relationship between the attractiveness of a parent and their percentage of girl babies.

#### Load Package ####
library(rstanarm) # run install.packages("rstanarm") if not installed

#### Create Data ####
x <- seq(-2,2,1) # produce sequence from -2 to 2 in 1 point increments
y <- c(50, 44, 50, 47, 56) # create this vector of 5 points
sexratio <- data.frame(x, y) # merge into data frame

We then set up the priors. I don’t go into a ton of detail here, but I defer to the two previously mentioned books for a more in-depth exploration of Bayesian priors. The bulk of what we are doing here is specifying what our estimates should look like in the real world and the uncertainty around those estimates; the last two lines pool the prior and data estimates into a precision-weighted average. Priors in this spirit will be added to the regression later.

#### Informative Priors ####
theta_hat_prior <- 0
se_prior <- 0.25
theta_hat_data <- 8
se_data <- 3
theta_hat_bayes <- (theta_hat_prior/se_prior^2 + theta_hat_data/se_data^2)/(1/se_prior^2 + 1/se_data^2)
se_bayes <- sqrt(1/(1/se_prior^2 + 1/se_data^2))
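
Printing the pooled values shows Bayesian updating at work as a precision-weighted average: because the prior (SE of 0.25) is far more precise than the data (SE of 3), the pooled estimate is pulled strongly toward the prior.

print(theta_hat_bayes) # pooled estimate, dragged from 8 toward the prior mean of 0
print(se_bayes) # pooled SE, slightly smaller than the prior SE alone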

We then estimate two regressions, one with a flat prior and one with an informative prior, using the flexible function stan_glm for both. We set refresh = 0 so the sampler doesn’t print its progress at each iteration. Notice that while the first regression is specified much like the lm or glm functions in base R, the second regression includes priors like the ones we constructed earlier.

#### Regression with Flat Prior ####
fit_default <- stan_glm(y ~ x, 
                        data = sexratio,
                        refresh = 0)

#### Regression with Informative Prior ####
fit_post <- stan_glm(y ~ x, 
                     data = sexratio,
                     prior = normal(0, 0.2),
                     prior_intercept = normal(48.8, 0.5),
                     refresh = 0)
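
If you want a quick numeric summary before plotting, printing either fit shows the posterior medians and MAD SDs of the coefficients:

print(fit_post, digits=2) # posterior medians and MAD_SD for each coefficient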

Finally, we plot our fits. There is a lot of code here since I am using base R graphics, but I have annotated and simplified the code from the book to give you the gist of what is going on.

#### Customize Plot ####
par(mfrow=c(1,2), # row x col layout
    mar=c(5,3,3,2), # margins
    mgp=c(1.7,.5,0), # margin and axis lines
    tck=-.01) # length of tick marks

#### Save Both Regressions as List ####
fit_bayes <- list(as.data.frame(fit_default), 
                  as.data.frame(fit_post)) # save both regressions into list

#### Plot Simulations from Bayes ####
for (k in 1:2){ # use both regressions
  sims <- fit_bayes[[k]] # get their simulations
  coef_est <- apply(sims, 2, median) # get their coefs
  b_se <- 1.483*median(abs(sims[,2] - median(sims[,2]))) # robust SE of the slope via the median absolute deviation
  
  plot(x, y, # use this data
       ylim=c(43, 57), # limit the y-axis to this range
       xlab="Attractiveness of Parent",
       ylab="Percentage of Girl Babies", # use these axis labels
       main=if (k==1) "Flat Prior"
       else "Informative Prior", # provide title based on which fit
       pch=19, 
       cex=1) # pch and cex change look of dots
  
  for (i in 1:100){
    abline(sims[i,1], # intercept from posterior draw i
           sims[i,2], # slope from posterior draw i
           lwd=.5, # thin line weight
           col="gray50") # draw each posterior draw's line in gray
  }
  abline(coef_est[1], 
         coef_est[2],
         lwd=2) # add the median coefficient line for this fit
}

The results show us two peculiar but informative insights. First, it is quite clear that the flat-prior regression produces far too much noise: there is too much variation across the fitted lines to have much certainty in the estimate. The informative-prior regression is far more precise, in that its band of fitted lines is narrowly concentrated in one area.

Second, even with far more precision in the fitting, the estimated relationship between the points is still quite poor. This is because the sample is just too small (and because the data isn’t linear enough to narrow the estimate further). It is sometimes claimed that Bayesian methods are indifferent to sample size. This illustration should show that this is far from correct, and careful choices must be made when designing even a Bayesian model.

Conclusion

This concludes the discussion on simulation. If you found it informative, you may consider buying me a cup of coffee. Feel free to also check out my other RPubs on this page if you want to explore my other R tutorials.