Module 10: Visualizing Results

Workshop 10: Using Statistical Simulation to Predict Food Deserts

Click here to run the code for this workshop yourself on RStudio.Cloud!


Social scientists, policy analysts, and coders often have vitally important findings to convey to key decision-makers, but it can be very difficult to convey that information quickly, concisely, and in a visually appealing way. Instead of telling policymakers about our beta-coefficients, what if we could show them what our model predicts instead? Exactly how many more lives can we affect, dollars can we earn, emissions can we cut, if we adopt policy A instead of B, or increase our level of spending by $1000? We can answer these kinds of questions using statistical simulation.

This tutorial introduces statistical simulation using the example of the Food Environment Index from the University of Wisconsin’s County Health Rankings, which measures access to healthy foods in each county. Using just a few covariates, we estimate the relationship between key demographic traits and access to healthy foods. This workshop highlights how racial and ethnic minorities groups face systemic barriers to health, like “food deserts”, which describe when our cities’ urban planning has sited grocery stores and food infrastructure far away from certain communities - often communities of color and low-income communities. But can we show these trends with data?

0. Import Data

Load Data

In this dataset, each row is a US county!

# Load Packages
library(tidyverse) # for data manipulation
library(viridis) # for colors
library(moderndive) # for easy model summaries
library(Zelig) # for simulation

# Load Data
counties <- read_csv("food_deserts.csv")

View Data

# View first 3 rows of dataset
counties %>% head(3)
fips county state region urban food_env_index pop pop_black pop_hisplat median_income democrat_2016
01001 Autauga County AL East South Central urbanized_area 7.2 55036 19.1 2.7 55317 23.8
01003 Baldwin County AL East South Central urbanized_area 8.0 203360 9.5 4.4 52562 19.4
01005 Barbour County AL East South Central urban_cluster 5.6 26201 48.0 4.2 33368 46.5

Codebook

In this dataset, our variables mean:

fips: Unique 5 digit ID for each US county.

county: Name of each county.

state: Abbreviation for each state.

region: Census region for each county (eg. "New England")

urban: dichotomous variable showing whether county is a continuously built-up area with a population of 50,000 or more ("Urbanized Area") or not ("Rural Area"), according to the US Census.

Outcome Variable

  • food_env_index: Food Environment Index, which measures access to healthy foods and food insecurity in each county. Ranges from a scale of 0 (worst) to 10 (best). Access to healthy foods measured based on the share of low income residents who do not live close to a grocery store (in urban areas, < 1 mile away; in rural areas, <10 miles away). Food insecurity measured based on share of residents who lacked a reliable source of food during the past year. Good way to measure the prevalence of ‘food deserts’ in America.

Explanatory Variables

  • pop_black: percentage of population who are Black. (Written as a percentage, eg. 10.2%)

  • pop_hisplat: percentage of population who are Hispanic or Latino. (Written as a percentage, eg. 23.5%).

Control Variables

  • pop: population of each county.

  • `median_income: median household income in each county.

  • democrat_2016: share of residents who voted for the Democratic presidential candidate in 2016.




1. Modeling

First, let’s make a small model, predicting the food_env_index based on the variables pop_black and pop_hisplat, controlling for pop, median_income, and democrat_2016.

Learning Check 1

Question

Review: Make the model above, and save it as an object named m! - Using get_regression_table(), as the percentage of Black residents increases by 1%, holding all else constant, how much is food_env_index projected to increase?

  • As the percentage of Hispanic residents increases by 1%, how much does food_env_index increase? Are these changes statistically significant?

  • Using get_regression_summaries(), how much variation does this model explain?

Answer

m <- counties %>%
  lm(formula = food_env_index ~ pop_black + pop_hisplat +
       pop + median_income + democrat_2016) 
# Get coefficients
m %>% get_regression_table()
term estimate statistic p_value
intercept 5.657 81.480 0.000
pop_black -0.027 -19.991 0.000
pop_hisplat -0.006 -5.482 0.000
pop 0.000 0.745 0.456
median_income 0.000 31.419 0.000
democrat_2016 0.002 1.593 0.111
# Get r-squared
m %>% get_regression_summaries()
r_squared p_value nobs
0.437 0 3095

Yes! It looks like the Food Environment Index decreases by -0.027 points for every additional percent of Black residents in that county, and it decreases by -0.006 points for every additional percent of Hispanic residents in that county. Both effects are extremely statistically significant (p < 0.001). This model explains 43.7% of the variation in Food Environment Index scores, despite including only 5 variables. Not bad!

Note: Other factors certainly matter here, but for our purposes, this is a descriptive study, meant to show how much key demographic trend with this outcome.




2. Simulation

Simulation allows us to get precise estimates of our outcome using our model, based on a certain amount of our predictor, holding everything else constant. We want to know how easily can you access healthy food (food_env_index) in a county with a specific rate of Black residents (pop_black), assuming an average rate of Hispanic / Latinx residents (pop_hisplat), average population size (pop), average median income (median_income), and average support for the Democratic (vs. Republican) party.

The Zelig package will take our OLS regression model, our supplied level of the predictor, generate predictions, and then give us back 1000 projected outcomes, which represent our outcomes plus or minus our error. We can use these 1000 predictions to build pretty confidence intervals! These 1000 predictions are handy, because they tell us that the most likely value is somewhere in that zone.

She was totally talking about Confidence Intervals.

She was totally talking about Confidence Intervals.

Examine the tabs below to learn how!

to_zelig()

First, convert our model to a Zelig formatted object, named z, using to_zelig().

z <- m %>% to_zelig()

setx() and sim()

Second, tell Zelig the hypothetical value of our predictor(s), which it will plug into our regression model equation. We can put this info in the setx() function, which ‘sets’ the value of our predictor x. For example, let’s estimate the Food Environment Index in a county with 9% Black residents (the average). For all other traits, unless specified, Zelig automatically assumes a county with mean traits (or modal traits, for categorical variables). In our case, this means 9.1% Hispanic/Latino residents, 102165.6 residents overall, a median household income of 49754.2 USD, and 31.53% support for Democrats in 2016. Then, we’ll use sim() to calculate our 1000 predictions, and save it all as the object zsim.

zsim <- z %>%
  setx(pop_black = 9) %>%
  sim()

zelig_qi_to_df()

Third, we will extract our results! zelig_qi_to_df() returns a very large data.frame (1000 rows for each condition, you entered (pop_black = 9, so just 1 condition)). Let’s look at the first few rows! It contains one column for each variable in our model, containing the values they were set at. Then, it returns an expected value and predicted value for each row, totalling 1000 different numbers in each row. Each is a projected Food Environment Index score given the values of the predictors we supplied.

zsim %>%
  zelig_qi_to_df() %>%
  head(3)
setx_value pop_black pop_hisplat pop median_income democrat_2016 expected_value predicted_value
x 9 9.193376 103432.8 49606.91 31.537 7.452 6.421
x 9 9.193376 103432.8 49606.91 31.537 7.479 9.108
x 9 9.193376 103432.8 49606.91 31.537 7.481 7.959

Let’s extract just the trait we specified (pop_black) and our expected_value and predicted_value, which we can rename ev and pv for short. We will save these results as the data.frame zresults.

zresults <- zsim %>%
  zelig_qi_to_df() %>%
  select(pop_black, ev = expected_value, pv = predicted_value)

quantile()

We can find our really quickly the interval in which most of our predictions lie - the 95% confidence interval. We do this using the quantile() function, which lets us query what value in our data is at specific percentile. So, for the 2.5th percentile of the vector ev in our data.frame zev, we can use quantile(ev, probs = 0.025). probs = accepts values from 0 (0%) to 1 (100%). We’ll calculate the 2.5th percentile, 50th percentile, and 97.5th percentile (Because 97.5 - 2.5 = 95%). These represent:

  • 2.5th percentile: lower end of range where 95% of predictions lie.

  • 50th percentile: middle of range where the most common predictions cluster.

  • 97.5th percentile: upper end of range where 95% of predictions lie.

# Let's get the range where 95% of points lie
zev <- zresults %>%
  summarize(
    # Get lower 2.5th percentile
    lower = quantile(ev, probs = 0.025),
    # Get median (50th percentile)
    median = median(ev),
    # Get upper 97.5th percentile
    upper = quantile(ev, probs = 0.975))
zev
lower median upper
7.436 7.469 7.5

Cool! So 95% of all our expected values lie within this zone, so we can be 95% confident that the expected Food Environment Index for an average city would lie between 7.436 and 7.5, with its most likely value being the median at 7.469.

Note: Your values may differ ever so slightly from mine, because simulations involve random sampling. But if you rerun your simulations many times, they’re results will always be about the same.

pv vs. ev

You may notice, our predicted values are slightly different from our expected values. We can calculate the 95% confidence interval for these in much the same way, just replacing ev with pv.

# Let's get the range where 95% of points lie
zpv <- zresults %>%
  summarize(
    # Get lower 2.5th percentile
    lower = quantile(pv, probs = 0.025),
    # Get median (50th percentile)
    median = median(pv),
    # Get upper 97.5th percentile
    upper = quantile(pv, probs = 0.975))
zpv
lower median upper
5.772 7.478 9.17

Compare those against our expected values:

zev
lower median upper
7.436 7.469 7.5

You will see below that the confidence interval for our predicted values is much wider than for expected values. In other words, we predict the outcome will be somewhere in that zone, but we expect it will actually be in this more narrow range.

The short answer is, predicted values are helpful for elections or forecasting into the future, because they make a conservative projections, while expected values are great for pretty much everything else, because they give us a better, more realistic projections.

ggplot()

We can actually visualize simulations in ggplot2! Let’s see what the range of expected Food Environment Index Scores is for an average county, using the ev variable in zresults!

zresults %>%
  ggplot(mapping = aes(x = ev)) +
  geom_histogram(color = "white", fill = "dodgerblue") +
  # Let's add some vertical lines
  # One for my lower confidence interval, queried directly from mystats$lower, which equals 7.44
  geom_vline(xintercept = zev$lower, size = 1, linetype = "dashed") +
  # A line for my median, which equals 7.47
  geom_vline(xintercept = zev$median, size = 3) +
  # A line from my upper confidence interval, which equals 7.50
  geom_vline(xintercept = zev$upper, size = 1, linetype = "dashed") +
  # Labels
  labs(x = "Expected Food Environment Index Score (0 to 10) for an average city
(95% Confidence Interval from 1000 simulations in Zelig package in R)",
       y = "Frequency (# of Counties)",
       title = "Distribution of Expected Values from Simulation")



Learning Check 2

Question

Please recalculate the expected Food Environment Index scores for a city, using setx() and sim(), but this time, please simulate for an average city with 10% Black residents, not 9%, saving your results as zsim1, zresults1, and zev1. Please fill in the blanks below!

My model simulates that an average county where 9% of residents are Black will see a median expected Food Environment Index of 7.47 points, where 95% of those expected values lie between 7.44 and 7.5 points. But, an average county where 10% of residents are Black will see a median expected Food Environment Index of [MEDIAN], where 95% of those expected values lie between [LOWER] and [UPPER]. In other words, the median expected index increases from 7.47 to [MEDIAN], a difference of [MEDIAN - 7.47].

Answer

First, we simulate for an average city with 10% Black residents.

zsim1 <- z %>%
  setx(
    # changed just pop_black from 9 to to 10
    pop_black = 10) %>%
  sim()

Then, we extract the expected values into a data.frame, zresults1.

zresults1 <- zsim1 %>%
  zelig_qi_to_df() %>%
  select(pop_black, ev = expected_value, pv = predicted_value)

We summarize() them to get our median expected value and confidence interval.

zev1 <- zresults1 %>% 
  summarize(lower = quantile(ev, probs = 0.025),
            median = quantile(ev, probs = 0.50),
            upper = quantile(ev, probs = 0.975))
zev1
lower median upper
7.414 7.441 7.47

My model simulates that an average county where 9% of residents are Black will see a median expected Food Environment Index of 7.47 points, where 95% of those expected values lie between 7.44 and 7.5 points. But, an average county where 10% of residents are Black will see a median expected Food Environment Index of [7.44 ], where 95% of those expected values lie between [7.41] and [7.47`]. In other words, the median expected index increases from 7.47 to [7.44], a difference of [-0.03].

If we were to visualize our two data.frames, zev and zev1, the confidence intervals would look like this.



These results highlight some major racial disparities in access to affordable, healthy food options. This is deeply concerning! This indicates that many counties, which have become deeply stratified by race and socioeconomic status over time, have built fewer and fewer grocery stores in Black (and Latinx) communities and see more households facing food access issues. This inequality is a public health and urban planning issue, about where cities encourage companies to build their stores.



3. Scaling Up

It was easy enough to get expected values given 1 or 2 sets of values (pop_black = 9, pop_black = 10). But what if we want to examine many values at once, like comparing expected food environment indices given 10, 20, 30, 40, 50, or 60% Black residents? We can do that too!

zsim2 <- z %>%
  setx(
    # Add many values
    pop_black = c(10, 20, 30, 40, 50, 60)) %>%
  sim()

Let’s look at the output. zelig_qi_to_df() returns 1000 rows for each condition, so 6000 rows for our 6 conditions. Here are the first 3 rows:

zsim2 %>%
  zelig_qi_to_df() %>%
  head(3)
setx_value pop_black pop_hisplat pop median_income democrat_2016 expected_value predicted_value
x 10 9.193376 103432.8 49606.91 31.537 7.432 7.312
x 10 9.193376 103432.8 49606.91 31.537 7.441 5.740
x 10 9.193376 103432.8 49606.91 31.537 7.453 7.478

That’s a lot of data, so let’s just select() the variables that actually vary: pop_black and expected_value. (We’ll skip predicted_value). We’ll rename expected_value to ev and save the data.frame to zresult2.

zresults2 <- zsim2 %>%
  zelig_qi_to_df() %>%
  select(pop_black, ev = expected_value)
zresults2 %>% head(3)
pop_black ev
10 7.432
10 7.441
10 7.453

Using this data, we can make a bunch of cool visuals, using geom_jitter(), geom_violin(), geom_linerange(), geom_errorbar(), and geom_ribbon()! Let’s try it out!

First, let’s try a jitter plot, with geom_jitter(), plus violins with geom_violin().

zresults2 %>%
  ggplot(mapping = aes(x = pop_black, y = ev, group = pop_black, 
                       fill = pop_black)) +
  # Let's plot each expected value as a jittered point,
  # with some aesthetic below
  geom_jitter(alpha = 0.5, size = 3, shape = 21, color = "white") +
  # Let's overlay a violin to get the shape of its distribution,
  # using draw_quantiles to highlight 
  # the median expected value with a line.
  geom_violin(alpha = 0.5, fill = "white", draw_quantiles = 0.5) +
  # And add some labels, themes, and colors
  scale_fill_viridis(option = "mako", begin = 0.2, end = 0.8) +
  theme_bw() +
  labs(x = "% Black Residents", 
       y = "Expected\nFood Environment Index", 
       fill = "% Black",
       caption = "From 1000 simulations for average county.") +
  # Fun fact: you can turn off a legend here:
  guides(fill = "none")

Second, let’s try summarizing this data, and then using geom_ribbon().

zev2 <- zresults2 %>%
  # For each value of pop_black we supplied,
  group_by(pop_black) %>%
  # Get the median, lower, and upper confidence intervals
  summarize(lower = quantile(ev, probs = 0.025),
            median = quantile(ev, probs = 0.50),
            upper = quantile(ev, probs = 0.975))

Let’s view it!

zev2
pop_black lower median upper
10 7.413 7.441 7.470
20 7.127 7.169 7.213
30 6.833 6.898 6.961
40 6.536 6.627 6.716
50 6.236 6.356 6.470
60 5.942 6.085 6.224

Okay, there’s 1 row for each level of representation of Black residents in the community (10, 20, 30, 40, 50, and 60). So, we can plot 6 features with this, one per row. But, this time, the range of our expected values will be slightly constrained, because we’re showing just the 95% most frequently occurring values, excluding those pesky 5% unlikely high or low outcomes.

zev2 %>%
  ggplot(mapping = aes(x = pop_black, y = median, ymin = lower, ymax = upper)) +
  # Add a transparent ribbon, with a black outline and blue fill
  geom_ribbon(alpha = 0.5, color = "black", fill = "dodgerblue") +
  # Add a dashed line to plot the median
  geom_line(linetype = "dashed") +
  # Add a theme and labels
  labs(x = "% Black Residents", y = "Expected\nFood Environment Index",
       subtitle = "Change in Expected Index\nwith 95% Confidence Intervals",
       caption = "From 1000 simulations for an average county.") +
  theme_bw()

This is very concerning; our simulations projects a strong decrease in access to healthy food in counties with increasingly large shares of Black residents, the sign of a major health equity issue.

Learning Check 3

Question

Please simulate what happens to the food environment index in an average county given an increasingly large Hispanic/Latinx population (10, 20, 30, 40, 50, and 60%), using zelig_qi_to_df() and use geom_ribbon() to plot your confidence interval. Be sure to summarize() your data to get your confidence interval, using group_by(). Save your data.frame as zev3.

Answer

Generate your confidence interval with the code below:

zev3 <- z %>%
  setx(pop_hisplat = c(10, 20, 30, 40, 50, 60)) %>%
  sim() %>%
  # Extract data.frame
  zelig_qi_to_df() %>%
  # Grab just our variables of interest
  select(pop_hisplat, ev = expected_value) %>%
  # For each value of pop_hisplat we supplied,
  group_by(pop_hisplat) %>%
  # Get the median, lower, and upper confidence intervals
  summarize(lower = quantile(ev, probs = 0.025),
            median = quantile(ev, probs = 0.50),
            upper = quantile(ev, probs = 0.975))

Then visualize it as a ribbon plot below.

zev3 %>%
  ggplot(mapping = aes(x = pop_hisplat, y = median, ymin = lower, ymax = upper)) +
  # Add a transparent ribbon, with a black outline and blue fill
  geom_ribbon(alpha = 0.5, color = "black", fill = "seagreen") +
  # Add a dashed line to plot the median
  geom_line(linetype = "dashed") +
  # Add a theme and labels
  labs(x = "% Hispanic/Latinx Residents", y = "Expected\nFood Environment Index",
       subtitle = "Change in Expected Index\nwith 95% Confidence Intervals",
       caption = "From 1000 simulations for an average county.") +
  theme_bw()




4. Under the Hood

Above, we talked through the procedure for simulating expected values and making confidence intervals, but how is Zelig actually producing these projections? We can actually do it manually below. Thanks to Zelig, you won’t need to do this manually in the future, but it is helpful to understand the theory and mechanics that power our projections.

Preparations

We’re going to need to gather a few ingredients before we simulate.

  • x: a one-row data.frame full of our supplied predictor values. Akin to setx() in Zelig.

  • sigma: a single value representing the residual standard error. It’s roughly equivalent to the ‘standard deviation’ of our residuals. A good measure of how spread out our predictions should be.

  • mycoef: a vector our alpha and beta coefficients from our actual model.

  • mydist: a 1000 row data.frame representing a multivariate normal distribution. Each row contains a set of alpha and beta coefficients that are just slightly different from our model’s real beta coefficients.

Please click on the dropdown menu below to build each ingredient.

Select

mycoef

This one is really fast. We want to extract our model coefficients quickly, all in a row. We can use dplyr’s bind_rows() function for this, and rename() the intercept.

mycoef <- m$coefficients %>%
  bind_rows() %>%
  rename(intercept = 1)
mycoef
## # A tibble: 1 × 6
##   intercept pop_black pop_hisplat        pop median_income democrat_2016
##       <dbl>     <dbl>       <dbl>      <dbl>         <dbl>         <dbl>
## 1      5.66   -0.0272    -0.00637 0.00000004     0.0000412       0.00211

x

Let’s make a one-row data.frame containing the traits of our hypothetical average county, if 9% of that county’s residents were Black.

x <- counties %>%
  # Drop any rows with NAs in our dataset
  drop_na() %>%
  # Set our supplied predictor values (like setx() )
  summarize(
    label = "Average County",
    # First, let's choose a value for pop_black (9%)
    pop_black = 9,
    # Now, let's set all other values to their means
    pop_hisplat = mean(pop_hisplat),
    pop = mean(pop),
    median_income = mean(median_income),
    democrat_2016 = mean(democrat_2016))
x
label pop_black pop_hisplat pop median_income democrat_2016
Average County 9 9.193376 103432.8 49606.91 31.537

sigma

Next, let’s calculate sigma. If sigma is bigger, your model’s predicted outcomes deviate more from the observed outcomes on average; if sigma is lower, your model’s predicted outcomes deviate less (yay!). sigma allows our simulations to reflect that error.

diy_sigma <- m %>%
  get_regression_points() %>%
  # First, we calculate the sum of squared residuals,
  # to approximate how much error is in the model
  summarize(
    residual_sum_of_squares = sum(( food_env_index  - food_env_index_hat) ^2),
    # get the sample size
    n = n(),
    # get the number of variables in the model (6)
    p = length(c("food_env_index", "pop_black", "pop_hisplat", 
                 "pop", "median_income", "democrat_2016")),
    # Calculate the mean squared error, which is the average of our squared residuals, 
    # where we've adjusted for sample size and number of predictors. 
    mean_squared_error = (residual_sum_of_squares / (n - p)),
    # Finally, take the square root to get the residual standard error.
    # You could say it is like the *standard deviation of residuals*
    residual_standard_error = sqrt(mean_squared_error),
    # In fact, Let's compare this against what we'd get if 
    # we just took the standard deviation of residuals
    residual_standard_dev = sd(residual))

# Let's view our result. See how our residual standard error
# is almost exactly the same as the standard deviation of residuals?
diy_sigma %>% glimpse()
## Rows: 1
## Columns: 6
## $ residual_sum_of_squares <dbl> 2185.098
## $ n                       <int> 3095
## $ p                       <int> 6
## $ mean_squared_error      <dbl> 0.7073803
## $ residual_standard_error <dbl> 0.8410591
## $ residual_standard_dev   <dbl> 0.8403792

Good news: now that we know how to calculate it, we can get sigma really, really quickly from the moderndive package’s get_regression_summaries() function (or broom’s glance() function).

sigma <- get_regression_summaries(m)$sigma
# Let's view it!
sigma
## [1] 0.841

mydist

Last, we need to get our multivariate normal distribution. What’s that again?

Below, we use the MASS package to run the mvrnorm() function, which generates 1000 rows of coefficients, all based on our model’s coefficients (mycoef), but each slightly different.

What’s cool is that we use vcov(m) to extract the correlations and variance of all our variables; so, when mvrnorm() makes those 1000 rows of simulated coefficients, these simulated coefficients are all correlated, reflecting the traits of our data. Finally, we turn this into a data.frame(), give each row a unique id, rename() the intercept column, and save the whole thing as mydist.

mydist <- data.frame(
  id = 1:1000,  # Give each row a unique ID
  MASS::mvrnorm(
    n = 1000, # get 1000 simulations
    mu = unlist(mycoef), # get vector of our coefficients 
    Sigma = vcov(m)) # get variance/covariance of variables
  ) %>%
  rename(intercept = 2)   # Rename intercept column
# Let's view the first few rows!
mydist %>%
  head(3)
id intercept pop_black pop_hisplat pop median_income democrat_2016
1 5.542611 -0.0265461 -0.0049156 1e-07 4.32e-05 0.0021606
2 5.731225 -0.0277801 -0.0065645 0e+00 3.99e-05 0.0015924
3 5.662799 -0.0272977 -0.0061562 1e-07 3.98e-05 0.0043279

Great! So you can see, they are all very similar values - and similar to our actual model coefficients (mycoef). But, each is slightly different. If we make predictions with each of these 1000 sets of coefficients, we’ll get 1000 slightly different predictions.

However, mvrnorm() preserves the overall relationships between variables. For example, in our observed coefficients, the effect (beta) of pop_black on food_env_index is negative (beta = -0.027), while the effect of democrat_2016 on food_env_index is positive (beta = 0.002). In other words, they are opposite. When simulating our coefficients, we want each simulation (each row) to retain this general relationship - where the effects of pop_black and democrat_2016 should always be pretty opposite. Let’s test it! Are our coefficients correlated?

mydist %>%
  get_correlation(formula = pop_black ~ democrat_2016)
##          cor
## 1 -0.6053813

That’s exactly what we find: these simulated coefficients are strongly negatively correlated with each other. This means our simulations are simultaneously (1) randomly selected but also (2) still reflect the traits of our actual data.



Learning Check 4

Question

What does mydist represent, and why does it matter whether its coefficients are correlated?

Answer

Our object mydist contains not just one set of model coefficients, but 1000 slightly varied versions of our model coefficients. These values were randomly drawn from a multivariate normal distribution, and that kind of distribution is special because it makes sure that simulated coefficients (like pop_black and democrat_2016) keep the same correlations and patterns from our original model.



Types of Uncertainty

So, we’ve compiled (1) our set of predictor values (x), (2) the standard error in our model’s residuals (sigma), (3) our observed model coefficients (mycoef), and (4) our multivariate normal distribution of 1000 slightly different model coefficients (mydist). What now?

Now, we are going to use simulation in two different ways to adjust for (1) estimation uncertainty and (2) fundamental uncertainty. What does this mean?

Definitions

  • Estimation Uncertainty: the uncertainty in our predictions due to not knowing Beta and Alpha perfectly, because we had fewer than an infinite number of observations in our data. Our multivariate normal distribution helps us account for this.

  • Fundamental Uncertainty: the uncertainty in our predictions due to innumerable chance events, like weather, illness, people spilling their coffee, banana peels, etc. that may influence our outcome but are not contained within our predictors. Random samples from a normal distribution help us account for this.

Examples

  • Estimation Uncertainty: If our coefficients were Marvel superheros, estimation uncertainty is akin to the feeling that I’m pretty sure the superheroes will team up and defeat the bad guy, but I’m not exactly sure how. We know that Thor will go do something hotheaded, and we know that the Hulk will SMASH, but we’re not exactly sure whether that means a little SMASH or a lot of SMASH, for example. So, we are uncertain on exactly what will happen, but we kind of already know what is going to happen when we go to the movies.

  • Fundamental Uncertainty: If our coefficients were Marvel superheros, fundamental uncertainty is akin to the butterfly effect or the series What If, where tiny changes alter the outcome in innumerable, unpredictable ways.



We can account for estimation uncertainty and fundamental uncertainty in two different ways:

Measure Accounts for
Estimation Uncertainty by:
Accounts for
Fundamental Uncertainty by:
Confidence
Interval
Predicted Value 1000 simulations from
Multivariate Normal
Distribution
1 simulation from
Normal Distribution
Wide
Expected Value 1000 simulations from
Multivariate Normal
Distribution
Average of 1000 simulations
from Normal Distribution
Narrow



Learning Check 5

Question

In your data.frames myev and mypv, you simulated 1000 predicted values and 1000 expected values, and calculated the 95% confidence intervals for each. Which confidence interval was wider? Now compare that to the table above. Generally, does the confidence interval of predicted values tend to be wider or narrower than the confidence interval of expected values?

Answer

In our example above with zev and zpv, the confidence interval for our predicted values (5.772 to 9.17) was much wider than the confidence interval for our expected values (7.436 to 7.5). This matches the table above, which says predicted values produce wider confidence intervals than expected values.



Coding

Coding Estimation Uncertainty

Now, we’re going to make 1000 predictions from our 1000 simulated coefficients! These will be saved into the pv column in the object zsim. We will make them by plugging our predictor data x into each of our 1000 slightly different model equations using our 1000 slightly different simulated coefficients, to get our 1000 slightly different predictions. These values reflect estimation uncertainty.

Model Equation: \(Y_{simulated} = Alpha + Beta_{1} \times X_{1.observed} + Beta_{2} \times X_{2.observed} + ...\)

diysim <- mydist %>%
  mutate(
    # Use our simulated coefficients, 
    # saved in mydist under each variable name
    ysim = intercept + 
      # Multiply each by single value for variables in x
      pop_black * x$pop_black + 
      pop_hisplat * x$pop_hisplat +
      pop * x$pop + 
      median_income * x$median_income + 
      democrat_2016 * x$democrat_2016)
diysim %>% # reorder variables
  select(id, ysim, intercept:democrat_2016) %>%
  head(3)
id ysim intercept pop_black pop_hisplat pop median_income democrat_2016
1 7.475453 5.542611 -0.0265461 -0.0049156 1e-07 4.32e-05 0.0021606
2 7.454637 5.731225 -0.0277801 -0.0065645 0e+00 3.99e-05 0.0015924
3 7.483526 5.662799 -0.0272977 -0.0061562 1e-07 3.98e-05 0.0043279

For example, here are 3 of our simulated outcomes (lines), compared with all 1000 of them (histogram).

Coding Fundamental Uncertainty

Next, we can account for fundamental uncertainty using the rnorm() function.

rnorm() samples 1 or more values from a hypothetical distribution of values. To make that hypothetical distribution, we must give it a center (mean) and a standard deviation to estimate its width (sd). Our distribution will have a mean equal to our simulated value ysim, and a sd equal to sigma, the average variation of our model residuals. This distribution reflects the full range of simulated values we might get due to fundamental uncertainty. Some are larger, some are smaller, but most are still close to our simulated value. We can sample just 1 value from this distribution (n = 1) or many values, eg. 1000 (n = 1000).

  • To get the predicted value, we will randomly draw one value from our distribution approximating fundamental uncertainty, giving us our predicted value pv.

  • To get the expected value, we will randomly draw 1000 values from our distribution approximating fundamental uncertainty, and calculate the mean of those 1000 values, giving us our expected value ev.

We will save them both in a data.frame called diyresults.

diyresults <- diysim %>% 
  # For each of our 1000 simulated outcomes,
  # which already accounted for estimation uncertainty,
  group_by(id) %>%
  summarize(
    # Record the value we used to calculate everything (9%) 
    pop_black = x$pop_black,
    # Calculate the predicted value by drawing 1 value
    # from our normal distribution
    # to represent fundamental uncertainty
    pv = rnorm(n = 1, mean = ysim, sd = sigma),
    # Calculate the expected value by drawing 1000 values
    # from our normal distribution
    # and average them to eliminate fundamental uncertainty
    ev = rnorm(n = 1000, mean = ysim, sd = sigma) %>% mean())

Compare

Finally, let’s compare our version against Zelig’s version, by stacking our do-it-yourself data.frame of projections diyresults on top of our Zelig-made data.frame, and visualizing the distributions of these simulations. We see that our estimates line up nearly perfectly! Our expected values have a very thin spread, because we averaged to get rid of fundamental uncertainty, while predicted values have very wide spread, to reflect fundamental uncertainty.

Extra: Click to here to learn how we make this visual (optional)

We can do this using dplyr’s bind_rows() function, which takes two data.frames and stacks their variables sharing names on top of each other.

compare <- bind_rows(
  # Get our version
  diyresults %>% # labelling it by 'type'
    mutate(type = "Do-It-Yourself"),
  # Get Zelig's version
  zresults %>% # labelling it by 'type'
    mutate(type = "Zelig") )
compare %>%
  pivot_longer(cols = c(ev, pv), 
               names_to = "measure", values_to = "value") %>%
  ggplot(mapping = aes(x = type, y = value, color = type)) +
  geom_jitter(alpha = 0.5, shape = 21, size = 3, fill = "white") +
  geom_violin(alpha = 0.75, color = "black") +
  facet_wrap(~measure) +
  # Add labels
  labs(x = "Type", y = "Projected\nFood Environment Index",
       subtitle = "Equivalent Results",
       caption = "From 1000 simulations of average county\ngiven 9% Black Residents") +
  scale_color_viridis(option = "mako", discrete = TRUE,
                       begin = 0.2, end = 0.8) +
  theme_bw() +
  guides(color = "none")



Learning Check 6

Question

What does the variable ysim in our data.frame diysim refer to? How did we calculate it?

Answer

  • The variable ysim refers to 1000 different simulated outcomes.

  • We calculated it by taking 1000 slightly different model equations, each a row in mydist, plugging in the values of our predictors from x (eg. pop_black = 9), and calculating the output of each of these equations. This produced our 1000 different simulated outcomes, ysim.

  • Their variation reflects estimation uncertainty, or how our model coefficients could be slightly off due to chance.



Learning Check 7

Question

When we calculate a single predicted value, how many values do we draw from a normal distribution? Why? How about when we calculate a single expected value? How does this help us account for uncertainty?

Answer

  • When we calculate a single predicted value, we draw just 1 value from a normal distribution using rnorm(). This adds some randomness to our predicted values, helping spread them out to demonstrate fundamental uncertainty.

  • When we calcualte a single expected value, we draw 1000 value from a normal distribution using rnorm(), and then take their average. This approximates the fundamental uncertainty of our estimate, then lets us get rid of it by finding the most likely, most expected of our predictions, the average.



5. Applications

Now that we know how Zelig works, let’s spice things up. We can also simulate multiple conditions, like what is the effect of race, given the effect of region? First, let’s create our new Zelig model object, z2, controlling for region.

z2 <- counties %>%
  mutate(region = factor(region)) %>% # Convert to factor
  lm(formula = food_env_index ~ pop_black + pop_hisplat +
       pop + median_income + democrat_2016 + 
       region) %>% # Also control for region
  to_zelig() # convert to Zelig

Next, let’s simulate what happens given 10%, 30%, or 50% Black residents, first in New England, and then in the South Atlantic states. Usually, we need to supply the same number of value to setx() for both conditions. We gave 6 values for pop_black, so we need to give 6 values for region too. (There are some exceptions, but it is a good rule of thumb.)

final <- z2 %>%
  setx(pop_black = c(10, 30, 50,  
                     10, 30, 50,
                     10, 30, 50),
       region = c("New England", "New England", "New England", 
                  "South Atlantic", "South Atlantic", "South Atlantic",
                  "Middle Atlantic", "Middle Atlantic", "Middle Atlantic")) %>%
  sim() %>%
  zelig_qi_to_df() %>%
  select(pop_black, region, ev = expected_value) %>%
  # 
  group_by(pop_black, region) %>%
  summarize(lower = quantile(ev, probs = 0.025),
            median = quantile(ev, probs = 0.50),
            upper = quantile(ev, probs = 0.975))

Last, let’s visualize with ggplot()!

  • We can make ribbons for our 95% confidence interval with geom_ribbon().

  • Since it’s only a few estimates, we can even add geom_linerange() and geom_point() over top of it.

  • We can split it into panels by region with facet_wrap().

  • We can order the panels by the median using reorder(region, median).

  • We can write in the actual median expected values using geom_text(), setting label equal to the values.

  • We can round() those values to just 1 decimal place with round(median, 1).

  • We can nudge those text labels over just 5 points on the x-axis, with nudge_x = 5.

  • We can turn off the fill legend with guides(fill = "none").

  • We can add line-breaks in labels using " \n ".

final %>%
  ggplot(mapping = aes(x = pop_black, y = median, 
                       ymin = lower, ymax = upper, 
                       fill = region)) +
  # Add Confidence Intervals
  geom_ribbon(alpha = 0.5, color = "white") +
  geom_linerange(size = 1) +
  # Add median expected value
  geom_point(size = 3) +
  # Split up by region, ordered by median
  facet_wrap(~reorder(region, median) ) +
  # Label median expected values
  geom_text(mapping = aes(label = round(median, 1)), nudge_x = 5) +
  # Cut fill legend
  guides(fill = "none") +
  # Add labels
  labs(x = "% Black Residents",
       y = "Expected\nFood Environment Index",
       subtitle = "Effects of Racial Discrimination and Region\non Access to Healthy Food",
       caption = "From 1000 simulations of average counties\n(Points = median; ribbons = 95% confidence intervals)") +
  scale_fill_viridis(option = "mako", discrete = TRUE,
                     begin = 0.2, end = 0.8) +
  theme_bw()



Learning Check 8

Question

Reproduce the visual below! Please simulate and visualize the change in expected Food Environment Index scores as the share of Black residents increases from 10 to 30 to 50 percent in an otherwise average county with a median income of 40,000 USD (~25th percentile), 50,000 USD (~50th percentile), and 60,000 USD (~75th percentile). What trends do you find?

Answer

  • Results ($40,000): We observe that as the share of Black residents increases from 10% to 50% in an average county with a median income of $40,000, our model simulates a decrease in expected Food Environment Index scores from a median of ~7 to ~5.8 points.

  • Results ($50,000): However, as the median income increases by 10,000 USD to 50,000 USD, the expected Food Environment Index score also increases, by roughly 0.40 points.

  • Results ($60,000): Given 60,000 USD, an otherwise average county still sees a decrease in expected Food Environment Index scores as the share of Black residents increases from 10% to 50%, but it now decreases from a median of ~7.8 to ~6.6 points.

  • Interpretation: This shows that while racial discrimination’s long term effects impact access to healthy foods, income helps mediate this effect, meaning that poorer communities are especially at risk. See code below!

lc8 <- z2 %>%
  setx(pop_black = c(10, 30, 50,  
                     10, 30, 50,
                     10, 30, 50),
       median_income = c(40000, 40000, 40000,
                         50000, 50000, 50000,
                         60000, 60000, 60000)) %>%
  sim() %>%
  zelig_qi_to_df() %>%
  select(pop_black, median_income, ev = expected_value) %>%
  # 
  group_by(pop_black, median_income) %>%
  summarize(lower = quantile(ev, probs = 0.025),
            median = quantile(ev, probs = 0.50),
            upper = quantile(ev, probs = 0.975))
lc8 # View it!
pop_black median_income lower median upper
10 40000 6.908950 6.973227 7.044665
10 50000 7.326517 7.385222 7.450332
10 60000 7.731335 7.798524 7.860935
30 40000 6.296392 6.390919 6.484587
30 50000 6.706918 6.803420 6.891978
30 60000 7.110374 7.215271 7.309959
50 40000 5.657950 5.807041 5.943845
50 50000 6.065946 6.219433 6.362588
50 60000 6.473135 6.631735 6.779621
lc8 %>%
  # Recode, changing it to a categorical variable
  mutate(median_income = median_income %>% recode_factor(
    "40000" = "$40,000",
    "50000" = "$50,000",
    "60000" = "$60,000")) %>%
  ggplot(mapping = aes(x = pop_black, y = median, 
                       ymin = lower, ymax = upper, 
                       fill = median_income)) +
  # Add Confidence Intervals
  geom_ribbon(alpha = 0.5, color = "white") +
  geom_linerange(size = 1) +
  # Add median expected value
  geom_point(size = 3) +
  # Split up by region, ordered by median
  facet_wrap(~reorder(median_income, median) ) +
  # Label median expected values
  geom_text(mapping = aes(label = round(median, 1)), 
            size = 5, nudge_x = 5) +
  # Cut fill legend
  guides(fill = "none") +
  # Add labels
  labs(x = "% Black Residents",
       y = "Expected\nFood Environment Index",
       subtitle = "Effects of Racial Discrimination and Income\non Access to Healthy Food",
       caption = "From 1000 simulations of average counties\n(Points = median; ribbons = 95% confidence intervals)") +
  scale_fill_viridis(option = "mako", discrete = TRUE,
                     begin = 0.2, end = 0.8) +
  theme_bw()

Congrats! You made it! You can now use statistical simulation to make meaningful projections about your outcomes of interest, using your models and the Zelig package. In fact, if you had to, you could now do it yourself too (although that sounds arduous, so please just use the Zelig package!). Happy simulating!