Visualization Predictions in R

A Brief Tutorial using Solar Farm Adoption in Germany

0. Introduction

Social scientists working on climate change, environmental politics, and disaster resilience are modeling all kinds of outcomes these days, including greenhouse gas emissions, renewable energy adoption, disaster fatality rates, migration rates due to climate change, and economic recovery. However, as a field, we still rely tremendously on statistical tables, which are general dull and difficult for non-specialists to interpret. What are some other ways we can present the results from our statistical models?

Here are two main tools.

Prediction: We can directly predict an outcome as a predictor increases, holding other covariates at their mean values (also known as marginal effects at the means). This using the original model equation.
- You can use the helpful predict() function in the stats package.
- Daniel Lüdecke’s ggeffects package does this for a really wide range of models, including random effects models.
Simulation: Since 2008, Christine Choirat, Kosuke Imai, Gary King, and their colleagues from the Zelig project have worked to create an easy-to-use R package that helps us visualize our results more clearly, by leveraging randomness through simulations. This is a top-notch way to visualize your results.

Below, I introduce each using the case of modeling solar power plant adoption - an important metric of adaptation to climate change.

To do this, you’ll need R. Learning R? Been using R but looking to pick up new tricks? Check out more of my R tutorials here!

This tutorial will go over five parts:

Running your model in R.
Using the stats package’s predict() function to visualize results.
Using the ggeffects package’s ggpredict() function to visualize results.
Using the Zelig package’s sim() function to visualize results.
Tricks for visualizing simulated effects well.

1. Running your model in R

1.1 Load Packages

First, please load the following packages in R. If you haven’t installed any of these before, you can do so by removing the hashtags below from the install.packages() function.

# If you haven't yet, uncomment and run the following code to install these packages.
# install.packages(c("tidyverse", "texreg", "Zelig", "ggeffects"))

# Now let's load these packages for use.
library(tidyverse) # for data manipulation, like the "%>%" (pipeline)
library(broom) # for viewing model results nicely
library(texreg) # for making statistical tables
library(Zelig) # For visualizing with statistical simulation
library(ggeffects) # For visualizing with marginal effects

1.2 Import Data

We’re going to use this helpful dataset of solar farm adoption in Germany cities from 2008-2011 from a recent study dataset.

# This is a time-series dataset of Germany cities.

dat <- read_csv("germany_data.csv") %>%
  # Let's select our key variables of interest
  select(id, # Unique ID Code
         muni, # Name of German District Name 
         solar, # No. of Solar Farms built between 2008 and 2011
         pvout, # Sunlight (Average Photovoltaic Output)
         area, # Area in square kilometers
         pop, # Population in 2000
         land_price, # Cost of Land in 2000 in euros per square kilometer
         income, # Income per taxpayer in 2001 in thousands of euros
         unemployment, # Unemployment rate in 2001
         crime_rate, # Crime Rate in 2007
         # (a common proxy for bonding social capital
         # when survey measures are unavailable)
         voter_turnout, # Voter Turnout in 2005
         # (proxy for bridging social capital)
         winning_party, # % Votes for Ruling Paer (CDU-CSU) in 2005
         green) # % Votes for Green Party in 2005

1.3 Model Data

Next, let’s model our data with an negative binomial model using the glm.nb() function in the MASS package. Our outcome, solar farms, is a right-skewed count variable (0, 1, 2, 3, etc.), and so to model it appropriately, we need a negative binomial model. (Why use an example like this? Because many, many outcomes in disaster studies are counts (people, injuries, houses damaged, migration, recoveries, coffee cups consumed while conducting research, etc.)

# Make an OLS model using our dataset
mymodel <- dat %>%
  # Let's zoom into cities with at least 1 solar farm,
  # Since the other cities are probably not very comparable to the rest
  filter(solar > 0) %>%
  # Let's model the association between solar farm adoption 
  # and voter turnout in that community
  MASS::glm.nb(formula = solar ~ voter_turnout +
       # While controlling for other proxies for other types of social capital
                 crime_rate + winning_party +
         # environmentalism and socioeconomics
         green + income +  unemployment +
         # and technical conditions
         pvout + area + pop + land_price) 

# Now let's use the tidy() function in the broom package 
# to view our coefficient table. 
mymodel %>% 
  tidy()

## # A tibble: 11 × 5
##    term              estimate   std.error statistic  p.value
##    <chr>                <dbl>       <dbl>     <dbl>    <dbl>
##  1 (Intercept)    -8.36       4.37            -1.91 5.57e- 2
##  2 voter_turnout  10.4        4.32             2.41 1.59e- 2
##  3 crime_rate     -0.0135     0.00185         -7.32 2.44e-13
##  4 winning_party  -2.28       1.09            -2.10 3.61e- 2
##  5 green         -21.3        4.36            -4.89 1.02e- 6
##  6 income         -0.0227     0.0209          -1.08 2.78e- 1
##  7 unemployment   -0.134      0.0314          -4.27 1.94e- 5
##  8 pvout           3.80       0.592            6.43 1.27e-10
##  9 area            0.00134    0.000124        10.8  3.73e-27
## 10 pop            -0.00000243 0.000000672     -3.62 2.96e- 4
## 11 land_price     -0.00619    0.00119         -5.21 1.91e- 7

We can see above that voter turnout has a positive association with solar farm adoption. This is shown in the estimate column, which shows a beta coefficient (log-odds) of 10.4, and in the p.value column, which shows a p.value below 0.05. This means we are over 95% confident that this is not the result of sampling error. This is because communities with strong voter turnout and resident engagement in city politics are better mobilized and capable of pressuring local officials to act on climate change.

And while packages like texreg do a very nice job of displaying these results in cool statistical tables, we want to create visuals to communicate our main results clearly to all readers. So how do we do that?

2. Using the predict() function

One way to display effects is by using the predict() function in the stats package, which comes with R. In this case, we literally use the model equation to predict the number of solar farms for an average community with average traits, except that we will vary one trait. Let’s try it out below.

As our first step, we need a new version of our dataframe, which I’ll call fakedata, formatted in the same way as our model table. We’re going to feed this into our model object, and the predict() function will spit out the predicted outcome given those inputs.

# Conveniently, we can reuse the original model data, which is stored in the model object and accessible using whatevermymodelnameis$model.
fakedata <- mymodel$model %>% 
  # Now, let's create a smaller data.frame, 
  # populated with all the variables in our model,
  # given values that suit our assumptions.
  summarize(
    # Let's assume that the index for social vulnerability among racial/ethnic/linguistic minorities ranges from low to high.
    voter_turnout = seq(
      # To represent low, let's use the 20th percentile of its observed values;
      from = quantile(voter_turnout, 0.2),
      # To represent high, let's use the 80th percentile of its observed values.
      # This helps avoid unreasonably high or low outcome values in your predictions, 
      # because we're supplying  pretty ordinary values in your dataset, not high or low outliers.
      to = quantile(voter_turnout, 0.8),
      # spread evenly across 5 values
      length.out = 5),
    # Let's also assume that all the other traits are held at their mean
    # (If you've got ordinal predictors, the median makes sense;
    # (If you've got categorical predictors, the mode makes sense))
    crime_rate = mean(crime_rate, na.rm = TRUE),
    winning_party = mean(winning_party, na.rm = TRUE), 
    green = mean(green, na.rm = TRUE), 
    income = mean(income, na.rm = TRUE),
    unemployment = mean(unemployment, na.rm = TRUE),
    pvout = mean(pvout, na.rm = TRUE),
    area = mean(area, na.rm = TRUE),
    pop = mean(pop, na.rm = TRUE),
    land_price = mean(land_price, na.rm = TRUE))


  # Let's take a look at our fakedata
# We'll use the head() function to view the first six rows
fakedata %>% 
  head()

##   voter_turnout crime_rate winning_party      green   income unemployment
## 1     0.7521199   49.09386     0.4314535 0.05234046 32.64747     8.036076
## 2     0.7584936   49.09386     0.4314535 0.05234046 32.64747     8.036076
## 3     0.7648673   49.09386     0.4314535 0.05234046 32.64747     8.036076
## 4     0.7712410   49.09386     0.4314535 0.05234046 32.64747     8.036076
## 5     0.7776148   49.09386     0.4314535 0.05234046 32.64747     8.036076
##      pvout     area      pop land_price
## 1 2.872953 894.1977 176587.1   84.33652
## 2 2.872953 894.1977 176587.1   84.33652
## 3 2.872953 894.1977 176587.1   84.33652
## 4 2.872953 894.1977 176587.1   84.33652
## 5 2.872953 894.1977 176587.1   84.33652

# Cool - so voter turnout varies, while everything else is held constant at its mean

Next, let’s apply the predict() function.

# Now, let's run the predict() function!
pred <- mymodel %>% 
  predict.glm(
    # We will feed it the fakedata,
    newdata = fakedata,
    # Ask it to give us the response/outcome variable
    type = "response",
    # and also to supply standard errors for our prediction
    se.fit = TRUE) %>%
  # So pred is formatted as a list object. Lists are different from data.frames, in that each item of the list (eg. $fit, $se.fit, $df, $residual.scale) can be different lengths. We're really interested in $fit and $se.fit, so let's turn it into a data.frame.
  as.data.frame() %>%
  # But let's also join in our fakedata, so we know what the x-axis is.
  # Conveniently, it is also exactly five rows long and in the same order,
  # so we can bind them together using bind_cols()
  bind_cols(fakedata)

# Let's check it out!
pred %>%
  head()

##        fit   se.fit residual.scale voter_turnout crime_rate winning_party
## 1 428.7676 32.05574              1     0.7521199   49.09386     0.4314535
## 2 458.1807 27.07582              1     0.7584936   49.09386     0.4314535
## 3 489.6115 26.41211              1     0.7648673   49.09386     0.4314535
## 4 523.1983 32.43518              1     0.7712410   49.09386     0.4314535
## 5 559.0893 44.34712              1     0.7776148   49.09386     0.4314535
##        green   income unemployment    pvout     area      pop land_price
## 1 0.05234046 32.64747     8.036076 2.872953 894.1977 176587.1   84.33652
## 2 0.05234046 32.64747     8.036076 2.872953 894.1977 176587.1   84.33652
## 3 0.05234046 32.64747     8.036076 2.872953 894.1977 176587.1   84.33652
## 4 0.05234046 32.64747     8.036076 2.872953 894.1977 176587.1   84.33652
## 5 0.05234046 32.64747     8.036076 2.872953 894.1977 176587.1   84.33652

# $fit is our predicted value
# $se.fit is the standard error around the predicted value

Finally, let’s visualize it with the ggplot() function!

# Now let's visualize it using the amazing ggplot() function in the ggplot2 package!

# Let's draw on data from the pred dataframe...
pred %>%
  # And map specific variables to different parts of the plot
  ggplot(mapping = aes(
    # Setting voter turnout from our fakedata to be the x-axis
    x = voter_turnout, 
    # And the predicted values to be our y-axis
    y = fit, 
    # And let's make a confidence interval around it using the standard error
    # Since we know that 95% of observations fall within
    # 1.96 standard deviations of the mean of a normal distribution,
    # (Fun fact!)
    # we can estimate the 2.5th and 97.5th percentiles of the sampling distribution!
    # which reflects a zone we are 95% confident that our prediction would fall into
    # even if sampling error caused our beta coefficients to be slightly off.
    # Let's calculate the lower confidence interval...
    ymin = fit - se.fit * 1.96,
    # And the upper confidence interval...
    ymax = fit + se.fit * 1.96)) +
  # And let's visualize this using a transparent ribbon (alpha gives transparency from 0 (fuzzy) to 1 (solid)) 
  geom_ribbon(alpha = 0.5, fill = "firebrick") +
  geom_line() +
  theme_minimal(base_size = 14) +
  labs(x = "(%) Voter Turnout",
       y = "Predicted Solar Farms\n(in an average city)")

3. Using the ggpredict() function

Well, that wasn’t terrible! But it did require tediously specifying the mean value for every covariate. What if a software did this for us? Let’s use the ggeffects package, which calculates marginal effects at the mean (which is what we did above). Marginal effects at the mean hold covariates constant at their average values, reflecting the rate of change between a single covariate and their outcome.

First, we’ll need to know exactly the 5 values between the 20th and 80th percentiles again. Let’s view them here.

seq(from = quantile(dat$voter_turnout, 0.2, na.rm = TRUE),
    to = quantile(dat$voter_turnout, 0.8, na.rm = TRUE),
    length.out = 5)

## [1] 0.7455777 0.7534772 0.7613766 0.7692761 0.7771756

Second, we’ll use ggpredict() in the ggeffects package to get marginal effects.

# Drawing from our model object,
ggpred <- mymodel %>%
  # Let's run the ggpredict function..
  ggpredict(
    # Ask it to holding everything else at its means, 
    # but then supply the following values for voter_turnout,
    # sticking them in brackets
    terms = "voter_turnout [0.7455777,0.7534772,0.7613766,0.7692761,0.7771756]", 
    # And make a 95% confidence interval
    ci.lvl = 0.95) %>%
  # And then convert it into a data.frame
  as.data.frame()

# Let's view one of these marginal effects here
ggpred %>%
  slice(5)

##           x predicted  std.error conf.low conf.high group
## 1 0.7771756  556.5392 0.07793846 477.6991  648.3912     1

That was easy! Let’s visualize them! (They look almost identical, and they should!)

ggpred %>%
  ggplot(mapping = aes(x = x, y = predicted,
                       ymin = conf.low, ymax = conf.high)) +
  geom_ribbon(alpha = 0.5, fill = "firebrick") +
  geom_line() +
  theme_minimal(base_size = 14) +
  theme(plot.caption = element_text(hjust = 0.5)) +
  labs(x = "(%) Voter Turnout",
       y = "Predicted Solar Farms\n(in an average city)",
       caption = "Produced with the ggpredict() function.")

4. Using Zelig

Finally, both predict() and ggpredict() only take into account error from our model, and they rely on standard error to project confidence intervals. What if we could rely on the power of randomness to simulate what our expected outcome should look like?

The Zelig package is excellent for this. Zelig simulates the outcome 1000 times, holding all variables except one at their means, to see the effect of that variable. First, I’ll describe the code for it below, and then, second, we will build our own simulation, so it is clear how it works.

library(Zelig)

# Let's make our model again, but in Zelig
zelig_model <- dat %>%
  filter(solar > 0) %>%
  zelig(formula = solar ~ voter_turnout +
                 crime_rate + winning_party +
         green + income +  unemployment +
         pvout + area + pop + land_price,
      model = "negbin",  # specify model type
      cite = FALSE) # remove extra material

# Let's simulate the effects of our model,
# using Zelig's code
z_sim <- zelig_model %>%
  # Varying Minority Status from 0 to 1 over 5 values,
  # while holding everything else constant at their means/modes
  setx(voter_turnout = seq(
    # Get the 20th percentile of voter turnout
    from = quantile(zelig_model$data$voter_turnout, probs = 0.2, na.rm = TRUE),
    # Get the 80th percentile of voter turnout
    to = quantile(zelig_model$data$voter_turnout, probs =  0.8, na.rm = TRUE),
    length.out = 5)) %>%
  sim()

# This produces both predicted and expected values for each.
# What's the difference?
# Well, a predicted value reflects the estimation error from a model.
# An expected value reflects the estimation error from a model, but also the fundamental uncertainty from all covariates.
z_sim

## 
## [1] 0.745447
## 
## 
##  sim range :
##  -----
## ev
##          mean       sd      50%     2.5%    97.5%
## [1,] 402.3409 38.93579 402.8307 330.9518 479.8573
## pv
##         mean      sd 50% 2.5%    97.5%
## [1,] 331.555 324.989 241   11 1242.275
## 
## 
## [1] 0.753368
## 
## 
##  sim range :
##  -----
## ev
##          mean       sd      50%     2.5%    97.5%
## [1,] 435.5749 31.05486 435.4634 377.2259 495.6809
## pv
##         mean      sd 50%   2.5%  97.5%
## [1,] 356.523 353.885 249 14.975 1336.4
## 
## 
## [1] 0.761289
## 
## 
##  sim range :
##  -----
## ev
##          mean       sd      50%     2.5%   97.5%
## [1,] 472.0828 26.16372 471.4198 421.4869 523.963
## pv
##        mean       sd   50% 2.5%   97.5%
## [1,] 440.12 435.2384 313.5   12 1533.95
## 
## 
## [1] 0.76921
## 
## 
##  sim range :
##  -----
## ev
##         mean       sd      50%     2.5%    97.5%
## [1,] 512.224 29.45532 511.5478 456.8146 572.3151
## pv
##         mean      sd 50%   2.5%    97.5%
## [1,] 543.129 498.164 390 14.975 1890.575
## 
## 
## [1] 0.777131
## 
## 
##  sim range :
##  -----
## ev
##          mean       sd      50%     2.5%    97.5%
## [1,] 556.4008 42.31384 555.5984 479.5147 644.5439
## pv
##         mean       sd 50% 2.5%    97.5%
## [1,] 576.859 567.4136 388   26 2115.075

Let’s visualize them both.

# Generate expected values 
# (where we simulate all coefficients and 
# average over fundamental uncertainty - even better!)
ev <- z_sim %>%
  # Extracts quantities of interest
  zelig_qi_to_df() %>%
  # Consolidates them into just the requested number of expected values (5)
  # and their confidence intervals
  qi_slimmer(qi_type = "ev", ci = 0.95)

ev %>%
  # And we can map them here!
  ggplot(mapping = aes(x = voter_turnout, y = qi_ci_median, 
                       ymin = qi_ci_min, ymax = qi_ci_max)) +
  geom_ribbon(alpha = 0.5, fill = "firebrick") +
  geom_line() +
  theme_minimal(base_size = 14) +
  theme(plot.caption = element_text(hjust = 0.5)) +
  labs(x = "(%) Voter Turnout",
       y = "Expected Solar Farms\n(in an average city)",
       caption = "Based on 1000 simulations made with the sim() function in Zelig.")

This produces a very nice confidence interval, quite similar to the ggeffects package, but based on simulations. Expected value is particularly handy, because it averages over both estimation uncertainty and fundamental uncertainty.

5. Using First Differences

Finally, now that we are armed our toolkit of statistical simulations, how far can we go?

5.1 Simulating First Differences

First, we could generate first differences, which refers to the difference in expected values simulated with one set of conditions versus another set of conditions. Take, for example, first differences between the 20th and 80th percentiles.

# Let's calculate first differences
  # By simulating the expected number of solar farms given...
z_fd <- zelig_model %>%
  # 20th percentile voter turnout...
  sim(x = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout, 
                                        probs = 0.2, na.rm = TRUE)),
      # versus 80th percentile voter turnout...
      x1 = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout,
                                         probs = 0.8, na.rm = TRUE)))

# And let's view the results!
z_fd

## 
##  sim x :
##  -----
## ev
##          mean       sd      50%     2.5%    97.5%
## [1,] 402.0212 39.16236 401.1777 329.5343 482.2183
## pv
##         mean       sd   50%   2.5%  97.5%
## [1,] 435.672 420.3272 317.5 17.975 1481.5
## 
##  sim x1 :
##  -----
## ev
##          mean       sd      50%     2.5%    97.5%
## [1,] 557.1804 44.46759 556.3005 471.3546 645.9457
## pv
##         mean       sd   50% 2.5%  97.5%
## [1,] 562.508 521.5244 394.5   17 1929.2
## fd
##          mean       sd      50%     2.5%    97.5%
## [1,] 155.1592 65.05998 155.0682 30.24602 278.9683

Great! Looks like we expect to see more solar farms given an increase in voter turnout between these levels; the 2.5th and 97.5th percentiles of our first differences are both positive. First differences are a really handy use of simulated expected values, because even if simulated expected values show a visual increase as a predictor increases, it might not be a big enough increase that we can be confident in it 95% of the time or more. First differences gives us a simple tool for checking that.

5.2 Comparisons

How could we visualize this? Well, we could perhaps compare the level of increase in solar as we change the level of voter turnout. For example, let’s compare the increase in solar as we change from the 20th to 40th percentiles, then the 20th to 60th percentiles, and finally then the 20th to 80th percentiles!

# Let's get first differences for 20th to 40th percentiles...
fd40 <- zelig_model %>%
  sim(x = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout, 
                                        probs = 0.2, na.rm = TRUE)),
      x1 = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout,
                                         probs = 0.4, na.rm = TRUE))) %>%
  # Let's go into the simulated outputs...
  with(sim.out) %>%
  # Extract the expected/predicted/first-differences for the comparison group...
  with(x1) %>%
  # And grab the vector of first differences.
  with(fd) %>%
  # and format it as a vector
  unlist()

# Let's get first differences for 20th to 60th percentiles...
fd60 <- zelig_model %>%
  sim(x = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout, 
                                        probs = 0.2, na.rm = TRUE)),
      x1 = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout,
                                         probs = 0.6, na.rm = TRUE))) %>%
  # Same as above...
  with(sim.out) %>%
  with(x1) %>%
  with(fd) %>%
  unlist()

# And let's get first differences for 20th to 40th percentiles...
fd80 <- zelig_model %>%
  sim(x = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout, 
                                        probs = 0.2, na.rm = TRUE)),
      x1 = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout,
                                         probs = 0.8, na.rm = TRUE))) %>%
  # Same as above...
  with(sim.out) %>%
  with(x1) %>%
  with(fd) %>%
  unlist()

# Finally, let's make a tidy data.frame of the results
comparison <- bind_rows(
  data.frame(fd = fd40,
             type = "20th - 40th"),
  data.frame(fd = fd60,
             type = "20th - 60th"),
  data.frame(fd = fd80,
             type = "20th - 80th"))

# Here's a quick glimpse of the data.frame
comparison %>% 
  head()

##         fd        type
## 1 74.22596 20th - 40th
## 2 25.03479 20th - 40th
## 3 79.57220 20th - 40th
## 4 90.44067 20th - 40th
## 5 60.45771 20th - 40th
## 6 69.52536 20th - 40th

5.3 Using Density Plots

Finally, let’s visualize these using density plots. These show the distributions of expected differences in solar farms given low vs. high voter turnout. In particular, we see that the shape of the distribution changes as we extend to higher and higher levels of voter turnout, with the most likely (median) difference in expected solar farms growing larger and larger.

comparison %>%
  ggplot(mapping = aes(x = fd, fill = type)) +
  geom_density(alpha = 0.5) +
  theme_minimal(base_size = 14) +
  theme(plot.caption = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  labs(fill = "Difference in\nVoter Turnout\nby percentile",
       x = "Expected Difference in Solar Farms\n(in an average city)",
       y = "% of Simulations",
       caption = "Based on 1000 simulations made with the sim() function in Zelig.")

Now, one limitation of this visual while pretty, we can’t really convey to the reader that 97.5% of these first differences are in fact above 0, at least not without using a lot of lines. Let’s try out a few alternatives.

5.4 Alternatives

comparison %>%
  ggplot(mapping = aes(x = type, y = fd, color = type)) +
  # Let's plot a line at 0, to show how far above it each violin is
  geom_hline(yintercept = 0, color = "black", linetype = "dashed", size = 1) +
  # Let's display the first differences, jittered
  geom_jitter(alpha = 0.5) +
  # Let's visualize the same distributions, just vertically
  geom_violin(alpha = 0.75, fill = "white", 
              # And we can even mark the 0.025th and 97.5th percentiles
              draw_quantiles = c(0.025, 0.975)) +
  # This shows much better our confidence in these first differneces
  theme_minimal(base_size = 14)  +
 theme(plot.caption = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  # Get rid of the color legend; unnecessary
  guides(color = FALSE) +
  # And provide labels
  labs(x = "Difference in Voter Turnout by percentile",
       y = "Expected Difference in Solar Farms\n(in an average city)",
       caption = "Based on 1000 simulations made with the sim() function in Zelig.\nBands depict 95% confidence intervals.")

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

While it is very cool to see the variation in these distributions, it may also show your reader a little more than necessary. Alternatively, we could simplify the visual to just error bars.

comparison %>%
  group_by(type) %>%
  summarize(median = median(fd),
            lower = quantile(fd, probs = 0.025),
            upper = quantile(fd, probs = 0.975)) %>%
  ggplot(mapping = aes(x = type, y = median,
                       ymin = lower, ymax = upper, color = type)) +
  # Show the range with bars
  geom_linerange(size = 1.15) +
  # Add points for the median
  geom_point(size = 5) +
  # This is a trick to get a nice white circle
  geom_point(size = 3, color = "white", show.legend = TRUE) +
  # Flip the plot for easier reading
  coord_flip() +
  # This shows much better our confidence in these first differneces
  theme_minimal(base_size = 14)  +
  theme(plot.caption = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  # Get rid of the color legend; unnecessary
  guides(color = FALSE) +
  # And provide labels
  labs(x = "Difference in Voter Turnout\nby percentile",
       y = "Expected Difference in Solar Farms\n(in an average city)",
       caption = "Based on 1000 simulations made with the sim() function in Zelig.\nBands depict 95% confidence intervals.")

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

5.5 Getting Creative

Finally, we could even go over the top, visualizing the actual difference in expected values within the 95% confidence interval. To do so, we’ll have to resimulate our values and extract the expected values from both parts.

# Let's get first differences for 20th to 40th percentiles...
ev40 <- zelig_model %>%
  sim(x = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout, 
                                        probs = 0.2, na.rm = TRUE)),
      x1 = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout,
                                         probs = 0.4, na.rm = TRUE))) %>%
  # Let's go into the simulated outputs...
  with(sim.out)

# Let's get first differences for 20th to 60th percentiles...
ev60 <- zelig_model %>%
  sim(x = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout, 
                                        probs = 0.2, na.rm = TRUE)),
      x1 = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout,
                                         probs = 0.6, na.rm = TRUE))) %>%
  # Same as above...
  with(sim.out)

# And let's get first differences for 20th to 40th percentiles...
ev80 <- zelig_model %>%
  sim(x = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout, 
                                        probs = 0.2, na.rm = TRUE)),
      x1 = setx(., voter_turnout = quantile(zelig_model$data$voter_turnout,
                                         probs = 0.8, na.rm = TRUE))) %>%
  # Same as above...
  with(sim.out)

# Let's bind all these first differences 
# and expected values into the following data.frame
compare_range <- bind_rows(
  # first for the 20th to 40th percentiles,
  data.frame(
    fd = ev40$x1$fd %>% unlist(),
    x = ev40$x$ev %>% unlist(),
    x1 = ev40$x1$ev %>% unlist(),
    type = "20th to 40th"),
  # Then for the 20th to 60th percentiles
  data.frame(
    fd = ev60$x1$fd %>% unlist(),
    x = ev60$x$ev %>% unlist(),
    x1 = ev60$x1$ev %>% unlist(),
    type = "20th to 60th"),
  # And then for the 20th to 80th percentiles
  data.frame(
    fd = ev80$x1$fd %>% unlist(),
    x = ev80$x$ev %>% unlist(),
    x1 = ev80$x1$ev %>% unlist(),
    type = "20th to 80th"))  %>%
  # Finally,
  # Let's give each simulation an ID
  mutate(id = 1:n()) %>%
  # For each type,
  group_by(type) %>%
  # Zoom into just the most common values 
  # (those values which occur 95% of the time, 
  # excluding the outliers above the 97.5th percentile and below the 2.5th percentile)
  filter(fd > quantile(fd, probs = 0.025),
         fd < quantile(fd, probs = 0.975))

# Now let's see the range of expected first differences
compare_range %>%
  ggplot(mapping = aes(x = reorder(id, x), ymin = x, ymax = x1, color = type)) +
  geom_linerange(alpha = 0.5, size = 0.3) +
  facet_wrap(~type) +
  guides(color = FALSE) +
  # This shows much better our confidence in these first differneces
  theme_classic(base_size = 14)  +
  theme(plot.caption = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        # Add a nice border
        panel.border = element_rect(fill = NA, color = "black"),
        # Get rid of the x axis ticks and labels; too many
        axis.text.x = element_blank(), 
        axis.ticks.x = element_blank()) +
  # Get rid of the color legend; unnecessary
  guides(color = FALSE) +
  # And provide labels
  labs(subtitle = "First Differences in Solar Adoption\nGiven Change in Voter Turnout by Percentile",
       x = "Simulations\n(95% Most Common Expected Outcomes, from Lowest to Highest)",
       y = "Expected Difference in Solar Farms\n(in an average city)",
       caption = "Based on 1000 simulations made with the sim() function in Zelig.\nBands depict 95% confidence intervals.")

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

6. Conclusion

This last visualization is definitely unnecessarily complex, but it’s a great illustration of how you can use simulated expected values from Zelig in many diverse ways.

Questions?

Want to try this yourself? Download the source data from the Harvard Dataverse.

Have more questions? Reach out to me at timothy.fraser.1@gmail.com or on ResearchGate!

References

If you liked this tutorial, take a look at my other tutorials as well! If you want to understand the nuts and bolts of statistical simulation and do it yourself, check out my tutorial breaking down Zelig here. Also, please do check out the Zelig team’s vignettes here, or read the paper that inspired this technique:

Gary King, Michael Tomz, and Jason Wittenberg. (2000). Making the most of statistical analyses: Improving interpretation and presentation. American Journal of Political Science 44, 341-365. https://gking.harvard.edu/files/abs/making-abs.shtml

Similarly, Daniel Lüdecke’s ggeffects package is a lifesaver, especially when working with multilevel models. Check out his paper here:

Lüdecke, Daniel. (2018). ggeffects: Tidy Data Frames of Marginal Effects from Regression Models. Journal of Open Source Software, 3(26), 772. doi: 10.21105/joss.00772

For more information on the dataset and techniques in this tutorial, take a look at the source below:

Fraser, Timothy. (2021). Does social capital boost or block renewable energy siting? South African solar politics in comparison. Energy Research & Social Science 71, 101845. https://doi.org/10.1016/j.erss.2020.101845

For further examples of statistical simulation, take a look at my open access article with colleagues:

Fraser, Timothy, and Daniel P. Aldrich. (2021). The dual effect of social ties on COVID-19 spread in Japan. Scientific Reports 11, 1596. https://doi.org/10.1038/s41598-021-81001-4