Simulation in Zelig: How it Works
A Brief Tutorial on Social Capital and German Renewable Energy
In a separate tutorial, I introduced simulation in Zelig as a great way to visualize your statistical models’ effects. But what is statistical simulation, and how does it work? This tutorial describes how to simulate your model’s effects in Zelig, and then goes under the hood to make our own simulation.
To do so, we will examine a negative binomial model, borrowing from my dataset on the role of social capital in renewable energy transitions. That study shows how bridging social capital can boost or block the adoption of renewable energy. In some communities, such as German cities between 2008 and 2011, residents with strong bridging social capital adopted more solar farms. This may be because these communities channeled their inter-group ties, mobilized, and petitioned local and state officials to take concrete steps on climate change.
In this tutorial, we will look at why some cities see greater solar adoption than others. If you would like to follow along, you can download the replication data on the Harvard Dataverse here.
1. Loading Packages
Let’s load our required packages.
#install.packages(c("tidyverse", "Zelig"))
# If you haven't installed them, just remove the hashtags on the function above and run it.
library(tidyverse) # for data manipulation, and the pipeline ("%>%"), which lets us "pipe" functions together
library(Zelig) # for statistical simulation
1.2 Import Data
We're going to use a helpful dataset of solar farm adoption in German cities from 2008 to 2011, drawn from a recent study.
# This is a time-series dataset of German cities.
dat <- read_csv("germany_data.csv") %>%
# Let's select our key variables of interest
select(id, # Unique ID Code
muni, # German district name
solar, # No. of Solar Farms built between 2008 and 2011
pvout, # Sunlight (Average Photovoltaic Output)
area, # Area in square kilometers
pop, # Population in 2000
land_price, # Cost of Land in 2000 in euros per square kilometer
income, # Income per taxpayer in 2001 in thousands of euros
unemployment, # Unemployment rate in 2001
crime_rate, # Crime Rate in 2007
# (a common proxy for bonding social capital
# when survey measures are unavailable)
voter_turnout, # Voter Turnout in 2005
# (proxy for bridging social capital)
winning_party, # % Votes for Ruling Party (CDU-CSU) in 2005
green) %>% # % Votes for Green Party in 2005
# Let's zoom into cities with at least 1 solar farm,
# Since the other cities are probably not very comparable to the rest
filter(solar > 0)
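If you're following along, a quick peek confirms the variables we just selected:
# Preview the structure of our selected variables
dat %>% glimpse()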
1.3 Model Data
Next, let's model our data with a negative binomial model, using the zelig() function in the Zelig package with the model = "negbin" parameter. Our outcome, solar farms, is a right-skewed count variable (0, 1, 2, 3, etc.), so to model it appropriately, we need a negative binomial model. (Why use an example like this? Because many, many outcomes in disaster studies are counts: people, injuries, houses damaged, migration, recoveries, coffee cups consumed while conducting research, and so on.)
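Before fitting, a quick optional check supports this choice: if the variance of a count outcome is much larger than its mean (overdispersion), a negative binomial model is a safer bet than a Poisson model.
# Check for overdispersion: the variance of our count outcome
# should far exceed its mean if a negative binomial model is warranted.
dat %>%
  summarize(mean_solar = mean(solar, na.rm = TRUE),
            var_solar = var(solar, na.rm = TRUE))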
# Make a negative binomial model using our dataset
z <- dat %>%
# Let's model the association between solar farm adoption
# and voter turnout in that community
zelig(formula = solar ~ voter_turnout +
# While controlling for other proxies for other types of social capital
crime_rate + winning_party +
# environmentalism and socioeconomics
green + income + unemployment +
# and technical conditions
pvout + area + pop + land_price, model = "negbin", cite = FALSE)
# Let's check out the effect of voter turnout
z
## Model:
##
## Call:
## z5$zelig(formula = formula, data = data, by = by)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6593 -0.8863 -0.3220 0.1627 3.9092
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.363e+00 4.371e+00 -1.913 0.055740
## voter_turnout 1.041e+01 4.319e+00 2.410 0.015931
## crime_rate -1.354e-02 1.849e-03 -7.322 2.44e-13
## winning_party -2.275e+00 1.086e+00 -2.095 0.036142
## green -2.132e+01 4.362e+00 -4.889 1.02e-06
## income -2.272e-02 2.095e-02 -1.085 0.278029
## unemployment -1.343e-01 3.143e-02 -4.272 1.94e-05
## pvout 3.805e+00 5.916e-01 6.430 1.27e-10
## area 1.341e-03 1.243e-04 10.793 < 2e-16
## pop -2.432e-06 6.722e-07 -3.618 0.000296
## land_price -6.194e-03 1.189e-03 -5.208 1.91e-07
##
## (Dispersion parameter for Negative Binomial(1.0994) family taken to be 1)
##
## Null deviance: 746.00 on 315 degrees of freedom
## Residual deviance: 361.47 on 305 degrees of freedom
## (47 observations deleted due to missingness)
## AIC: 4565.8
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.0994
## Std. Err.: 0.0810
##
## 2 x log-likelihood: -4541.8290
## Next step: Use 'setx' method
We can see above that voter turnout has a positive association with solar farm adoption. This is shown in the Estimate column, which reports a beta coefficient of about 10.4 (on the log-count scale, since negative binomial models use a log link), and in the Pr(>|z|) column, which reports a p-value below 0.05, meaning this association is unlikely to be the result of sampling error alone. One interpretation is that communities with strong voter turnout and resident engagement in city politics are better mobilized and more capable of pressuring local officials to act on climate change.
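Because negative binomial models use a log link, we can translate that coefficient into something more intuitive by exponentiating it. Here is a quick back-of-the-envelope calculation; note that voter turnout is recorded as a proportion (values like 0.75), so 0.01 corresponds to one percentage point.
# Exponentiating (coefficient x change) gives a rate ratio:
# how much the expected count of solar farms gets multiplied
# when voter turnout rises by 1 percentage point (0.01).
exp(10.41 * 0.01)
# ~1.11, or roughly 11% more expected solar farms, holding other covariates fixed.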
Zelig's real power comes in visualizing regression results. A few quick steps allow us to visualize the effect of voter turnout on solar adoption, accounting for estimation uncertainty and, if we wish, fundamental uncertainty too.
# Let's get the 20th, 40th, 60th, and 80th percentiles
range <- quantile(dat$voter_turnout, probs = c(0.2,0.4,0.6,0.8), na.rm = TRUE)
# And let's generate simulated effects
z_sim <- z %>%
# We can set voter turnout to our range of values here (the 20th, 40th, 60th, and 80th percentiles),
# and then all other covariates will be held at their means
setx(voter_turnout = range) %>%
# and running the simulations
sim()
# We can view them here.
# We get 1000 simulated values for each point in our range,
# both expected values (which account for estimation uncertainty only)
# and predicted values (which account for both estimation and fundamental uncertainty).
# You'll notice that predicted values have wider confidence intervals, while expected values are narrower.
# You should probably use predicted values for things like election forecasting,
# because they give you a wider margin of error,
# while expected values are great for explaining past outcomes.
z_sim
##
## [1] 0.745447
##
##
## sim range :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 400.6865 39.15241 398.5038 332.294 487.988
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 415.691 400.7108 307.5 15 1546.025
##
##
## [1] 0.7604877
##
##
## sim range :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 468.312 25.90634 466.4424 421.8838 524.1302
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 461.028 444.0572 327 14 1620.15
##
##
## [1] 0.7681372
##
##
## sim range :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 507.7821 27.85701 505.6322 458.0946 565.0828
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 525.502 485.8711 392.5 18 1843.125
##
##
## [1] 0.777131
##
##
## sim range :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 559.23 42.22306 557.5015 482.6765 645.4024
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 572.633 518.3154 427.5 23 1926.15
Let’s visualize it in ggplot, the tidyverse’s excellent package for visualization!
z_sim %>%
# Let's get these in a dataset
zelig_qi_to_df() %>%
# and shrink them down to just the median values and 95% confidence intervals
# Let's get them for just expected values
qi_slimmer(qi_type = "ev", ci = 0.95) %>%
# And we can map them here!
ggplot(mapping = aes(x = voter_turnout, y = qi_ci_median,
ymin = qi_ci_min, ymax = qi_ci_max)) +
geom_ribbon(alpha = 0.5, fill = "firebrick") +
geom_line() +
theme_minimal(base_size = 14) +
theme(plot.caption = element_text(hjust = 0.5)) +
labs(x = "(%) Voter Turnout",
y = "Expected Solar Farms\n(in an average city)",
caption = "Based on 1000 simulations made with the sim() function in Zelig.")
2. How Zelig Works
Wow! Very neat. But wait - how does Zelig work? Zelig uses simulations. A simulation refers to taking your model’s beta coefficients, varying them slightly by chance, and calculating some quantity of interest about your outcome. Generally, Zelig takes 1000 simulations, allowing you to generate confidence intervals, find the median, the mean, or any quantity of interest based on these 1000 simulations.
What kinds of quantities of interest might we care about? Here are two:
Predicted values: simulate a set of coefficients from a multivariate normal distribution, then draw the outcome itself from the model's stochastic distribution (here, a negative binomial); they account for both estimation uncertainty and fundamental uncertainty.
Expected values: simulate coefficients from the same multivariate normal distribution, but average over the outcome's random noise; they account for estimation uncertainty only.
3. Normal Prediction
3.1 Statistics Assemble!
First, let's gather all the inputs we'll need to start our simulations. We'll need our model object, our alpha coefficient (the intercept) and beta coefficients (the effects), our variance-covariance matrix, and a row of fake data we will feed to the model to get predicted outcomes.
# First, let's extract the model from its zelig format,
# so we can work more easily with it
m <- from_zelig_model(z)
# Get observed alpha and beta coefficients from your model, which we will call "gamma"
gamma <- m$coefficients
# Get variance covariance matrix from your model, called "sigma"
sigma <- vcov(m)
# Grab some average traits from our data, a vector we will call "x"
# (This is the same kind of scenario we set with setx() earlier, just built by hand from the model frame)
x <- m$model %>%
summarize(
# Set Voter turnout at a specific value (20th percentile)
voter_turnout = range[1],
# But leave everything else at their means
crime_rate = mean(crime_rate, na.rm = TRUE),
winning_party = mean(winning_party, na.rm = TRUE),
green = mean(green, na.rm = TRUE),
income = mean(income, na.rm = TRUE),
unemployment = mean(unemployment, na.rm = TRUE),
pvout = mean(pvout, na.rm = TRUE),
area = mean(area, na.rm = TRUE),
pop = mean(pop, na.rm = TRUE),
land_price = mean(land_price, na.rm = TRUE)) %>%
# Let's convert it into a matrix too.
as.matrix()
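Before we start calculating, it can help to confirm the shape of each piece with a quick, optional sanity check:
# gamma should hold 11 coefficients (1 intercept + 10 predictors),
# sigma should be an 11 x 11 variance-covariance matrix,
# and x should be a 1 x 10 matrix of covariate values (no intercept column).
length(gamma)
dim(sigma)
dim(x)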
3.2 Normal Predicted Values
First, we are going to create simple predicted values. Let’s extract our observed coefficients, feed them our fake data (x), calculate predicted outcomes, and then simulate those outcomes.
We would usually calculate (non-simulated) predicted values like so:
# Let's grab our alpha coefficient, the intercept
alpha <- gamma[1] # grab just the first, which is the intercept
# Let's grab our beta coefficients
beta <- gamma[-1] # grab all but the first (which was the intercept)
# Let's get the linear predictor, which we'll call theta
# (many textbooks write this as eta; it is distinct from the negative binomial
# dispersion parameter, also called theta, which we will use later)
# Theta here is the total of X*Beta, plus Alpha
theta <- sum(beta * x) + alpha
# Now, we apply the inverse link function to theta to get the outcome.
# In the case of ordinary least squares models,
# the link is just the identity;
# theta itself would be the predicted outcome.
# However, many other models have link functions,
# which transform the linear predictor into an outcome.
# Negative binomial models have a very simple log link
# (it looks like: log(mu) = X * Beta + Alpha).
# So all we have to do is exponentiate theta.
y_hat <- exp(theta)
# Let's take a look
y_hat
# (This prediction should land near the expected values Zelig reported above
# for the 20th percentile of turnout, roughly 400 solar farms.)
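If you would like to double-check this hand calculation, R's built-in predict() function should return the same number for the same row of data:
# Cross-check the hand-rolled prediction against predict();
# the two values should match up to rounding.
predict(m, newdata = as.data.frame(x), type = "response")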
This generates 1 prediction of the number of solar farms given average traits and low voter turnout (20th percentile), but simulation can give us a range of them, incorporating error. Let’s find out how!
4. Simulation
4.1 Simulate Outcomes
Next, we’re going to do simulation! But what are we really “simulating?” Technically, we are simulating the potential error in our model coefficients.
We will draw on fixed aspects of our data and model, like the coefficients and the variance-covariance matrix, to build a bunch of fake-but-mostly-real beta coefficients, also known as simulated coefficients. Then, we will feed our fake data to those simulated coefficients: a low level of voter turnout (the 20th percentile) and average levels of everything else. Then, we'll use our model equation with the simulated coefficients to churn out a simulated outcome. Finally, we will repeat this process 1000 times.
So, first, let's generate 1000 sets of simulated coefficients from a multivariate normal distribution. What's that? A normal distribution is a distribution of values for one variable. A multivariate normal distribution is a distribution of values for many variables, drawn in such a way that it retains the relationships between the variables. It does not matter what kind of model you use (OLS, negative binomial, etc.); a multivariate normal will still serve you well, because we're assuming that, even if there is error, the true value of each coefficient still lies somewhere within a bell curve.
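If the idea of a multivariate normal feels abstract, here is a tiny standalone illustration with made-up numbers (not our model), showing that draws from MASS::mvrnorm preserve both the means and the covariance we feed in.
# Two made-up variables, "a" and "b", with a strong covariance (0.8)
toy_mu <- c(a = 0, b = 5)
toy_sigma <- matrix(c(1.0, 0.8,
                      0.8, 1.0), nrow = 2)
# Draw 10,000 simulated pairs of values
toy_draws <- MASS::mvrnorm(n = 10000, mu = toy_mu, Sigma = toy_sigma)
colMeans(toy_draws) # close to toy_mu
cov(toy_draws) # close to toy_sigma: the relationship between a and b is preserved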
# Generate 1000 simulations,
# drawing from a multivariate normal distribution
gamma_hat <- MASS::mvrnorm(n = 1000, mu = gamma, Sigma = sigma)
# Each of these variables are related, based on the variance-covariance matrix,
# even though they were randomly drawn.
# Let's check out the first two lines
gamma_hat %>% head(2)
## (Intercept) voter_turnout crime_rate winning_party green income
## [1,] -4.857593 7.222666 -0.01584699 -3.530922 -19.61586 -0.02175370
## [2,] -7.673149 10.295234 -0.01276366 -2.903608 -21.83582 -0.05135347
## unemployment pvout area pop land_price
## [1,] -0.1605790 3.854125 0.001222727 -3.530996e-06 -0.008012631
## [2,] -0.1823735 4.074137 0.001347049 -1.789098e-06 -0.004704175
# We have 11 columns, one for each coefficient, and 1000 rows, one per simulation.
Now, let's split these simulated coefficients into alpha (the intercept) and beta (the variable effects).
# Get simulated version of alpha (meaning 1000 rows of the simulated Intercept coefficient)
alpha_hat <- gamma_hat[,1] # this little [,1] says, please keep just the first column
# Get simulated version of beta coefficients (meaning 1000 rows of the 10 beta coefficients)
beta_hat <- gamma_hat[,-1] # this little [,-1] says, please remove the first column
Next, let's calculate a simulated outcome.
# Let's multiply each simulated beta coefficient by its corresponding set x value,
# add the simulated intercept, and exponentiate the total, saving it as "sim_out"
# (This gives us a 1000 x 1 matrix: one simulated outcome per set of simulated
# coefficients, since x contains a single row of fake data)
sim_out <- exp(beta_hat %*% t(x) + alpha_hat)
Notice the following four elements that reappear above: 1) beta_hat, the simulated beta coefficients, 2) x, the fake data we supplied, 3) alpha_hat, the simulated alpha coefficient, and 4) exp(), the inverse link function.
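It is worth pausing to confirm what sim_out looks like (optional):
# sim_out is a 1000 x 1 matrix: one simulated mean count per set of simulated coefficients
dim(sim_out)
head(sim_out, 3)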
4.2 Simulate Predicted Values
Finally, we're going to simulate error by using each simulated outcome as the mean of a negative binomial distribution, with dispersion based on the original model's parameters. (Negative binomial models have a dispersion parameter called theta; for OLS, this would be a normal distribution with a variance parameter.) Then, we're going to take 1000 random draws, one from each of these distributions.
# These are our 1000 predicted values
pv <- MASS::rnegbin(n = 1000, mu = sim_out, theta = m$theta)
# Let's repeat our zelig simulation again
z_sim <- z %>%
setx(voter_turnout = range) %>%
sim()
# Let's bind both together
bind_rows(
# Grabbing our self-made predicted values,
data.frame(pv = pv,
type = "Do-it-yourself"),
# And comparing them against zelig's predicted values
# (As you can see from the code, I had to go digging into the model object to find them)
data.frame(pv = z_sim$sim.out$range[[1]]$pv %>% unlist(),
type = "Zelig")) %>%
# And mapping each to different parts of a visual
ggplot(mapping = aes(x = type, y = pv, fill = type)) +
# Displaying the points, jittered
geom_jitter(alpha = 0.5, color = "grey") +
# Using violin plots to map their distributions.
geom_violin(alpha = 0.75) +
theme_minimal(base_size = 14) +
labs(x = "Type of Simulation", y = "Predicted # of Solar Farms Adopted\ngiven low voter turnout (20th percentile)") +
guides(fill = "none")
# If you re-run this code many times, you'll find they have almost identical shapes.
4.3 Simulate Expected Values
Simulating predicted values leads to fairly wide confidence intervals. We can also calculate expected values, which average out fundamental uncertainty - random variation just due to chance, like weather, giving us more refined confidence intervals. If you’re predicting something where chance is a big factor, like when you do election forecasting, it is probably best to use predicted values. But if you are trying to explain the past, then expected values represent a big improvement.
We follow the same steps as above to generate our simulated outcomes.
# Get simulated coefficients
gamma_hat <- MASS::mvrnorm(n = 1000, mu = gamma, Sigma = sigma)
alpha_hat <- gamma_hat[,1]
beta_hat <- gamma_hat[,-1]
# Now let's take this portion of the model and add the simulated intercept,
# and then exponentiate both, to deal with the log-link function in the model
sim_out <- exp(beta_hat %*% t(x) + alpha_hat)
But now, we're going to take each prediction, approximate the distribution in which it would vary due to fundamental uncertainty, take 1000 random samples from that distribution, and then average them into a single expected value. This is what makes it "expected" - we have seen 1000 versions, and this is the average, most likely value. Let's write a short function below to do that.
# Let's write a quick function to produce 1 expected value,
# averaged over 1000 draws from a negative binomial distribution
fundamental = function(i){
  # Take the i-th simulated outcome,
  # make a negative binomial distribution with a mean of that value
  # (and the dispersion parameter theta estimated by our model),
  # and grab 1000 random draws.
  # This approximates the effect of fundamental uncertainty on your value
  rnbinom(n = 1000, size = m$theta, mu = sim_out[i]) %>%
    # Now let's take the mean, averaging over fundamental uncertainty.
    mean() %>%
    return()
}
Now, let's generate our expected values and compare them with our Zelig simulation results in a visualization.
# These are our 1000 expected values
ev <- 1:1000 %>%
map(~fundamental(.)) %>%
unlist()
# Let's repeat our zelig simulation again
z_sim <- z %>%
setx(voter_turnout = range[1]) %>%
sim()
# Let's bind both together
bind_rows(
# Grabbing our self-made expected values,
data.frame(ev = ev,
type = "Do-it-yourself"),
# And comparing them against zelig's expected values
# (As you can see from the code, I had to go digging into the model object to find them)
data.frame(ev = z_sim$sim.out[[1]]$ev %>% unlist(),
type = "Zelig")) %>%
# And mapping each to different parts of a visual
ggplot(mapping = aes(x = type, y = ev, fill = type)) +
# Displaying the points, jittered
geom_jitter(alpha = 0.5, color = "grey") +
# Using violin plots to map their distributions.
geom_violin(alpha = 0.75) +
theme_minimal(base_size = 14) +
labs(x = "Type of Simulation", y = "Expected # of Solar Farms Adopted\ngiven low voter turnout (20th percentile)") +
guides(fill = "none")
# If you re-run this code many times, you'll find they have almost identical shapes.
5. Conclusion
So why simulation? Simulation opens up some really interesting doors for us. Using simulated effects, we have 1000 data points to work with, rather than just 1 prediction. As a result, we can start offering really specialized quantities of interest. Want to know the interquartile range of the expected values? Sure! Just take the 25th to 75th percentiles of the expected values. Want to show how the distribution of expected values shifts across the levels of a categorical variable? Great! Just get the expected values for each level and plot them side by side as histograms. Another very useful option is first differences: the difference in expected values given one set of conditions versus another. The options are nearly endless.
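As a sketch of how you might compute a couple of these quantities from the objects we built in Section 4.3 (this reuses ev, beta_hat, alpha_hat, x, range, and m from above):
# Interquartile range of the expected values we simulated in Section 4.3
quantile(ev, probs = c(0.25, 0.75))
# First difference: the change in expected solar farms when moving from the
# 20th to the 80th percentile of voter turnout, using the SAME simulated
# coefficients for both scenarios so the comparison is apples to apples.
x_high <- x
x_high[, "voter_turnout"] <- range[4] # 80th percentile of turnout
sim_out_high <- exp(beta_hat %*% t(x_high) + alpha_hat)
ev_high <- 1:1000 %>%
  map(~mean(rnbinom(n = 1000, size = m$theta, mu = sim_out_high[.]))) %>%
  unlist()
fd <- ev_high - ev # 1000 simulated first differences
quantile(fd, probs = c(0.025, 0.5, 0.975)) # median first difference and 95% interval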
Questions?
Want to try this yourself? Download the source data from the Harvard Dataverse.
Have more questions? Reach out to me at timothy.fraser.1@gmail.com or on ResearchGate!
References
If you liked this tutorial, take a look at my other tutorials as well! Also, please do check out the Zelig team's vignettes here, or read the paper that inspired this technique:
Gary King, Michael Tomz, and Jason Wittenberg. (2000). Making the most of statistical analyses: Improving interpretation and presentation. American Journal of Political Science 44, 341-365. https://gking.harvard.edu/files/abs/making-abs.shtml
For more information on the dataset and techniques in this tutorial, take a look at the source below:
Fraser, Timothy. (2021). Does social capital boost or block renewable energy siting? South African solar politics in comparison. Energy Research & Social Science 71, 101845. https://doi.org/10.1016/j.erss.2020.101845
For further examples of statistical simulation, take a look at my open access article with colleagues:
Fraser, Timothy, and Daniel P. Aldrich (2021). The dual effect of social ties on COVID-19 spread in Japan. Scientific Reports 11, 1596. https://doi.org/10.1038/s41598-021-81001-4