Module 10: Visualizing Results
Workshop 10: Using Statistical Simulation to Predict Food Deserts
Social scientists, policy analysts, and coders often have vitally important findings to convey to key decision-makers, but it can be very difficult to convey that information quickly, concisely, and in a visually appealing way. Instead of telling policymakers about our beta-coefficients, what if we could show them what our model predicts instead? Exactly how many more lives can we affect, dollars can we earn, emissions can we cut, if we adopt policy A instead of B, or increase our level of spending by $1000? We can answer these kinds of questions using statistical simulation.
This tutorial introduces statistical simulation using the example of the Food Environment Index from the University of Wisconsin’s County Health Rankings, which measures access to healthy foods in each county. Using just a few covariates, we estimate the relationship between key demographic traits and access to healthy foods. This workshop highlights how racial and ethnic minorities groups face systemic barriers to health, like “food deserts”, which describe when our cities’ urban planning has sited grocery stores and food infrastructure far away from certain communities - often communities of color and low-income communities. But can we show these trends with data?
0. Import Data
Load Data
In this dataset, each row is a US county!
# Load Packages
library(tidyverse) # for data manipulation
library(viridis) # for colors
library(moderndive) # for easy model summaries
library(Zelig) # for simulation
# Load Data
counties <- read_csv("food_deserts.csv")View Data
# View first 3 rows of dataset
counties %>% head(3)| fips | county | state | region | urban | food_env_index | pop | pop_black | pop_hisplat | median_income | democrat_2016 |
|---|---|---|---|---|---|---|---|---|---|---|
| 01001 | Autauga County | AL | East South Central | urbanized_area | 7.2 | 55036 | 19.1 | 2.7 | 55317 | 23.8 |
| 01003 | Baldwin County | AL | East South Central | urbanized_area | 8.0 | 203360 | 9.5 | 4.4 | 52562 | 19.4 |
| 01005 | Barbour County | AL | East South Central | urban_cluster | 5.6 | 26201 | 48.0 | 4.2 | 33368 | 46.5 |
Codebook
In this dataset, our variables mean:
fips: Unique 5 digit ID for each US county.
county: Name of each county.
state: Abbreviation for each state.
region: Census region for each county (eg. "New England")
urban: dichotomous variable showing whether county is a continuously built-up area with a population of 50,000 or more ("Urbanized Area") or not ("Rural Area"), according to the US Census.
Outcome Variable
food_env_index: Food Environment Index, which measures access to healthy foods and food insecurity in each county. Ranges from a scale of 0 (worst) to 10 (best). Access to healthy foods measured based on the share of low income residents who do not live close to a grocery store (in urban areas, < 1 mile away; in rural areas, <10 miles away). Food insecurity measured based on share of residents who lacked a reliable source of food during the past year. Good way to measure the prevalence of ‘food deserts’ in America.
Explanatory Variables
pop_black: percentage of population who are Black. (Written as a percentage, eg. 10.2%)pop_hisplat: percentage of population who are Hispanic or Latino. (Written as a percentage, eg. 23.5%).
Control Variables
pop: population of each county.`
median_income: median household income in each county.democrat_2016: share of residents who voted for the Democratic presidential candidate in 2016.
1. Modeling
First, let’s make a small model, predicting the food_env_index based on the variables pop_black and pop_hisplat, controlling for pop, median_income, and democrat_2016.
Learning Check 1
Question
m! - get_regression_table(), as the percentage of Black residents increases by 1%, holding all else constant, how much is food_env_index projected to increase?
As the percentage of Hispanic residents increases by 1%, how much does food_env_indexincrease? Are these changes statistically significant?Using get_regression_summaries(), how much variation does this model explain?
Answer
m <- counties %>%
lm(formula = food_env_index ~ pop_black + pop_hisplat +
pop + median_income + democrat_2016) # Get coefficients
m %>% get_regression_table()| term | estimate | statistic | p_value |
|---|---|---|---|
| intercept | 5.657 | 81.480 | 0.000 |
| pop_black | -0.027 | -19.991 | 0.000 |
| pop_hisplat | -0.006 | -5.482 | 0.000 |
| pop | 0.000 | 0.745 | 0.456 |
| median_income | 0.000 | 31.419 | 0.000 |
| democrat_2016 | 0.002 | 1.593 | 0.111 |
# Get r-squared
m %>% get_regression_summaries()| r_squared | p_value | nobs |
|---|---|---|
| 0.437 | 0 | 3095 |
2. Simulation
Simulation allows us to get precise estimates of our outcome using our model, based on a certain amount of our predictor, holding everything else constant. We want to know how easily can you access healthy food (food_env_index) in a county with a specific rate of Black residents (pop_black), assuming an average rate of Hispanic / Latinx residents (pop_hisplat), average population size (pop), average median income (median_income), and average support for the Democratic (vs. Republican) party.
The Zelig package will take our OLS regression model, our supplied level of the predictor, generate predictions, and then give us back 1000 projected outcomes, which represent our outcomes plus or minus our error. We can use these 1000 predictions to build pretty confidence intervals! These 1000 predictions are handy, because they tell us that the most likely value is somewhere in that zone.
She was totally talking about Confidence Intervals.
Examine the tabs below to learn how!
to_zelig()
First, convert our model to a Zelig formatted object, named z, using to_zelig().
z <- m %>% to_zelig()setx() and sim()
Second, tell Zelig the hypothetical value of our predictor(s), which it will plug into our regression model equation. We can put this info in the setx() function, which ‘sets’ the value of our predictor x. For example, let’s estimate the Food Environment Index in a county with 9% Black residents (the average). For all other traits, unless specified, Zelig automatically assumes a county with mean traits (or modal traits, for categorical variables). In our case, this means 9.1% Hispanic/Latino residents, 102165.6 residents overall, a median household income of 49754.2 USD, and 31.53% support for Democrats in 2016. Then, we’ll use sim() to calculate our 1000 predictions, and save it all as the object zsim.
zsim <- z %>%
setx(pop_black = 9) %>%
sim()zelig_qi_to_df()
Third, we will extract our results! zelig_qi_to_df() returns a very large data.frame (1000 rows for each condition, you entered (pop_black = 9, so just 1 condition)). Let’s look at the first few rows! It contains one column for each variable in our model, containing the values they were set at. Then, it returns an expected value and predicted value for each row, totalling 1000 different numbers in each row. Each is a projected Food Environment Index score given the values of the predictors we supplied.
zsim %>%
zelig_qi_to_df() %>%
head(3)| setx_value | pop_black | pop_hisplat | pop | median_income | democrat_2016 | expected_value | predicted_value |
|---|---|---|---|---|---|---|---|
| x | 9 | 9.193376 | 103432.8 | 49606.91 | 31.537 | 7.452 | 6.421 |
| x | 9 | 9.193376 | 103432.8 | 49606.91 | 31.537 | 7.479 | 9.108 |
| x | 9 | 9.193376 | 103432.8 | 49606.91 | 31.537 | 7.481 | 7.959 |
Let’s extract just the trait we specified (pop_black) and our expected_value and predicted_value, which we can rename ev and pv for short. We will save these results as the data.frame zresults.
zresults <- zsim %>%
zelig_qi_to_df() %>%
select(pop_black, ev = expected_value, pv = predicted_value)quantile()
We can find our really quickly the interval in which most of our predictions lie - the 95% confidence interval. We do this using the quantile() function, which lets us query what value in our data is at specific percentile. So, for the 2.5th percentile of the vector ev in our data.frame zev, we can use quantile(ev, probs = 0.025). probs = accepts values from 0 (0%) to 1 (100%). We’ll calculate the 2.5th percentile, 50th percentile, and 97.5th percentile (Because 97.5 - 2.5 = 95%). These represent:
2.5th percentile: lower end of range where 95% of predictions lie.
50th percentile: middle of range where the most common predictions cluster.
97.5th percentile: upper end of range where 95% of predictions lie.
# Let's get the range where 95% of points lie
zev <- zresults %>%
summarize(
# Get lower 2.5th percentile
lower = quantile(ev, probs = 0.025),
# Get median (50th percentile)
median = median(ev),
# Get upper 97.5th percentile
upper = quantile(ev, probs = 0.975))zev| lower | median | upper |
|---|---|---|
| 7.436 | 7.469 | 7.5 |
Cool! So 95% of all our expected values lie within this zone, so we can be 95% confident that the expected Food Environment Index for an average city would lie between 7.436 and 7.5, with its most likely value being the median at 7.469.
Note: Your values may differ ever so slightly from mine, because simulations involve random sampling. But if you rerun your simulations many times, they’re results will always be about the same.
pv vs. ev
You may notice, our predicted values are slightly different from our expected values. We can calculate the 95% confidence interval for these in much the same way, just replacing ev with pv.
# Let's get the range where 95% of points lie
zpv <- zresults %>%
summarize(
# Get lower 2.5th percentile
lower = quantile(pv, probs = 0.025),
# Get median (50th percentile)
median = median(pv),
# Get upper 97.5th percentile
upper = quantile(pv, probs = 0.975))zpv| lower | median | upper |
|---|---|---|
| 5.772 | 7.478 | 9.17 |
Compare those against our expected values:
zev| lower | median | upper |
|---|---|---|
| 7.436 | 7.469 | 7.5 |
You will see below that the confidence interval for our predicted values is much wider than for expected values. In other words, we predict the outcome will be somewhere in that zone, but we expect it will actually be in this more narrow range.
The short answer is, predicted values are helpful for elections or forecasting into the future, because they make a conservative projections, while expected values are great for pretty much everything else, because they give us a better, more realistic projections.
ggplot()
We can actually visualize simulations in ggplot2! Let’s see what the range of expected Food Environment Index Scores is for an average county, using the ev variable in zresults!
zresults %>%
ggplot(mapping = aes(x = ev)) +
geom_histogram(color = "white", fill = "dodgerblue") +
# Let's add some vertical lines
# One for my lower confidence interval, queried directly from mystats$lower, which equals 7.44
geom_vline(xintercept = zev$lower, size = 1, linetype = "dashed") +
# A line for my median, which equals 7.47
geom_vline(xintercept = zev$median, size = 3) +
# A line from my upper confidence interval, which equals 7.50
geom_vline(xintercept = zev$upper, size = 1, linetype = "dashed") +
# Labels
labs(x = "Expected Food Environment Index Score (0 to 10) for an average city
(95% Confidence Interval from 1000 simulations in Zelig package in R)",
y = "Frequency (# of Counties)",
title = "Distribution of Expected Values from Simulation")
Learning Check 2
Question
setx() and sim(), but this time, please simulate for an average city with 10% Black residents, not 9%, saving your results as zsim1, zresults1, and zev1. Please fill in the blanks below!
Answer
zsim1 <- z %>%
setx(
# changed just pop_black from 9 to to 10
pop_black = 10) %>%
sim()zresults1.
zresults1 <- zsim1 %>%
zelig_qi_to_df() %>%
select(pop_black, ev = expected_value, pv = predicted_value)summarize() them to get our median expected value and confidence interval.
zev1 <- zresults1 %>%
summarize(lower = quantile(ev, probs = 0.025),
median = quantile(ev, probs = 0.50),
upper = quantile(ev, probs = 0.975))zev1| lower | median | upper |
|---|---|---|
| 7.414 | 7.441 | 7.47 |
zev and zev1, the confidence intervals would look like this.
These results highlight some major racial disparities in access to affordable, healthy food options. This is deeply concerning! This indicates that many counties, which have become deeply stratified by race and socioeconomic status over time, have built fewer and fewer grocery stores in Black (and Latinx) communities and see more households facing food access issues. This inequality is a public health and urban planning issue, about where cities encourage companies to build their stores.
3. Scaling Up
It was easy enough to get expected values given 1 or 2 sets of values (pop_black = 9, pop_black = 10). But what if we want to examine many values at once, like comparing expected food environment indices given 10, 20, 30, 40, 50, or 60% Black residents? We can do that too!
zsim2 <- z %>%
setx(
# Add many values
pop_black = c(10, 20, 30, 40, 50, 60)) %>%
sim()Let’s look at the output. zelig_qi_to_df() returns 1000 rows for each condition, so 6000 rows for our 6 conditions. Here are the first 3 rows:
zsim2 %>%
zelig_qi_to_df() %>%
head(3)| setx_value | pop_black | pop_hisplat | pop | median_income | democrat_2016 | expected_value | predicted_value |
|---|---|---|---|---|---|---|---|
| x | 10 | 9.193376 | 103432.8 | 49606.91 | 31.537 | 7.432 | 7.312 |
| x | 10 | 9.193376 | 103432.8 | 49606.91 | 31.537 | 7.441 | 5.740 |
| x | 10 | 9.193376 | 103432.8 | 49606.91 | 31.537 | 7.453 | 7.478 |
That’s a lot of data, so let’s just select() the variables that actually vary: pop_black and expected_value. (We’ll skip predicted_value). We’ll rename expected_value to ev and save the data.frame to zresult2.
zresults2 <- zsim2 %>%
zelig_qi_to_df() %>%
select(pop_black, ev = expected_value)zresults2 %>% head(3)| pop_black | ev |
|---|---|
| 10 | 7.432 |
| 10 | 7.441 |
| 10 | 7.453 |
Using this data, we can make a bunch of cool visuals, using geom_jitter(), geom_violin(), geom_linerange(), geom_errorbar(), and geom_ribbon()! Let’s try it out!
First, let’s try a jitter plot, with geom_jitter(), plus violins with geom_violin().
zresults2 %>%
ggplot(mapping = aes(x = pop_black, y = ev, group = pop_black,
fill = pop_black)) +
# Let's plot each expected value as a jittered point,
# with some aesthetic below
geom_jitter(alpha = 0.5, size = 3, shape = 21, color = "white") +
# Let's overlay a violin to get the shape of its distribution,
# using draw_quantiles to highlight
# the median expected value with a line.
geom_violin(alpha = 0.5, fill = "white", draw_quantiles = 0.5) +
# And add some labels, themes, and colors
scale_fill_viridis(option = "mako", begin = 0.2, end = 0.8) +
theme_bw() +
labs(x = "% Black Residents",
y = "Expected\nFood Environment Index",
fill = "% Black",
caption = "From 1000 simulations for average county.") +
# Fun fact: you can turn off a legend here:
guides(fill = "none")Second, let’s try summarizing this data, and then using geom_ribbon().
zev2 <- zresults2 %>%
# For each value of pop_black we supplied,
group_by(pop_black) %>%
# Get the median, lower, and upper confidence intervals
summarize(lower = quantile(ev, probs = 0.025),
median = quantile(ev, probs = 0.50),
upper = quantile(ev, probs = 0.975))Let’s view it!
zev2| pop_black | lower | median | upper |
|---|---|---|---|
| 10 | 7.413 | 7.441 | 7.470 |
| 20 | 7.127 | 7.169 | 7.213 |
| 30 | 6.833 | 6.898 | 6.961 |
| 40 | 6.536 | 6.627 | 6.716 |
| 50 | 6.236 | 6.356 | 6.470 |
| 60 | 5.942 | 6.085 | 6.224 |
Okay, there’s 1 row for each level of representation of Black residents in the community (10, 20, 30, 40, 50, and 60). So, we can plot 6 features with this, one per row. But, this time, the range of our expected values will be slightly constrained, because we’re showing just the 95% most frequently occurring values, excluding those pesky 5% unlikely high or low outcomes.
zev2 %>%
ggplot(mapping = aes(x = pop_black, y = median, ymin = lower, ymax = upper)) +
# Add a transparent ribbon, with a black outline and blue fill
geom_ribbon(alpha = 0.5, color = "black", fill = "dodgerblue") +
# Add a dashed line to plot the median
geom_line(linetype = "dashed") +
# Add a theme and labels
labs(x = "% Black Residents", y = "Expected\nFood Environment Index",
subtitle = "Change in Expected Index\nwith 95% Confidence Intervals",
caption = "From 1000 simulations for an average county.") +
theme_bw()This is very concerning; our simulations projects a strong decrease in access to healthy food in counties with increasingly large shares of Black residents, the sign of a major health equity issue.
Learning Check 3
Question
zelig_qi_to_df() and use geom_ribbon() to plot your confidence interval. Be sure to summarize() your data to get your confidence interval, using group_by(). Save your data.frame as zev3.
Answer
zev3 <- z %>%
setx(pop_hisplat = c(10, 20, 30, 40, 50, 60)) %>%
sim() %>%
# Extract data.frame
zelig_qi_to_df() %>%
# Grab just our variables of interest
select(pop_hisplat, ev = expected_value) %>%
# For each value of pop_hisplat we supplied,
group_by(pop_hisplat) %>%
# Get the median, lower, and upper confidence intervals
summarize(lower = quantile(ev, probs = 0.025),
median = quantile(ev, probs = 0.50),
upper = quantile(ev, probs = 0.975))zev3 %>%
ggplot(mapping = aes(x = pop_hisplat, y = median, ymin = lower, ymax = upper)) +
# Add a transparent ribbon, with a black outline and blue fill
geom_ribbon(alpha = 0.5, color = "black", fill = "seagreen") +
# Add a dashed line to plot the median
geom_line(linetype = "dashed") +
# Add a theme and labels
labs(x = "% Hispanic/Latinx Residents", y = "Expected\nFood Environment Index",
subtitle = "Change in Expected Index\nwith 95% Confidence Intervals",
caption = "From 1000 simulations for an average county.") +
theme_bw()
4. Under the Hood
Above, we talked through the procedure for simulating expected values and making confidence intervals, but how is Zelig actually producing these projections? We can actually do it manually below. Thanks to Zelig, you won’t need to do this manually in the future, but it is helpful to understand the theory and mechanics that power our projections.
Preparations
We’re going to need to gather a few ingredients before we simulate.
x: a one-row data.frame full of our supplied predictor values. Akin tosetx()inZelig.sigma: a single value representing the residual standard error. It’s roughly equivalent to the ‘standard deviation’ of our residuals. A good measure of how spread out our predictions should be.mycoef: a vector our alpha and beta coefficients from our actual model.mydist: a 1000 row data.frame representing a multivariate normal distribution. Each row contains a set of alpha and beta coefficients that are just slightly different from our model’s real beta coefficients.
Please click on the dropdown menu below to build each ingredient.
Select
mycoef
This one is really fast. We want to extract our model coefficients quickly, all in a row. We can use dplyr’s bind_rows() function for this, and rename() the intercept.
mycoef <- m$coefficients %>%
bind_rows() %>%
rename(intercept = 1)mycoef## # A tibble: 1 × 6
## intercept pop_black pop_hisplat pop median_income democrat_2016
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.66 -0.0272 -0.00637 0.00000004 0.0000412 0.00211
x
Let’s make a one-row data.frame containing the traits of our hypothetical average county, if 9% of that county’s residents were Black.
x <- counties %>%
# Drop any rows with NAs in our dataset
drop_na() %>%
# Set our supplied predictor values (like setx() )
summarize(
label = "Average County",
# First, let's choose a value for pop_black (9%)
pop_black = 9,
# Now, let's set all other values to their means
pop_hisplat = mean(pop_hisplat),
pop = mean(pop),
median_income = mean(median_income),
democrat_2016 = mean(democrat_2016))x| label | pop_black | pop_hisplat | pop | median_income | democrat_2016 |
|---|---|---|---|---|---|
| Average County | 9 | 9.193376 | 103432.8 | 49606.91 | 31.537 |
sigma
Next, let’s calculate sigma. If sigma is bigger, your model’s predicted outcomes deviate more from the observed outcomes on average; if sigma is lower, your model’s predicted outcomes deviate less (yay!). sigma allows our simulations to reflect that error.
diy_sigma <- m %>%
get_regression_points() %>%
# First, we calculate the sum of squared residuals,
# to approximate how much error is in the model
summarize(
residual_sum_of_squares = sum(( food_env_index - food_env_index_hat) ^2),
# get the sample size
n = n(),
# get the number of variables in the model (6)
p = length(c("food_env_index", "pop_black", "pop_hisplat",
"pop", "median_income", "democrat_2016")),
# Calculate the mean squared error, which is the average of our squared residuals,
# where we've adjusted for sample size and number of predictors.
mean_squared_error = (residual_sum_of_squares / (n - p)),
# Finally, take the square root to get the residual standard error.
# You could say it is like the *standard deviation of residuals*
residual_standard_error = sqrt(mean_squared_error),
# In fact, Let's compare this against what we'd get if
# we just took the standard deviation of residuals
residual_standard_dev = sd(residual))
# Let's view our result. See how our residual standard error
# is almost exactly the same as the standard deviation of residuals?
diy_sigma %>% glimpse()## Rows: 1
## Columns: 6
## $ residual_sum_of_squares <dbl> 2185.098
## $ n <int> 3095
## $ p <int> 6
## $ mean_squared_error <dbl> 0.7073803
## $ residual_standard_error <dbl> 0.8410591
## $ residual_standard_dev <dbl> 0.8403792
Good news: now that we know how to calculate it, we can get sigma really, really quickly from the moderndive package’s get_regression_summaries() function (or broom’s glance() function).
sigma <- get_regression_summaries(m)$sigma
# Let's view it!
sigma## [1] 0.841
mydist
Last, we need to get our multivariate normal distribution. What’s that again?
Below, we use the MASS package to run the mvrnorm() function, which generates 1000 rows of coefficients, all based on our model’s coefficients (mycoef), but each slightly different.
What’s cool is that we use vcov(m) to extract the correlations and variance of all our variables; so, when mvrnorm() makes those 1000 rows of simulated coefficients, these simulated coefficients are all correlated, reflecting the traits of our data. Finally, we turn this into a data.frame(), give each row a unique id, rename() the intercept column, and save the whole thing as mydist.
mydist <- data.frame(
id = 1:1000, # Give each row a unique ID
MASS::mvrnorm(
n = 1000, # get 1000 simulations
mu = unlist(mycoef), # get vector of our coefficients
Sigma = vcov(m)) # get variance/covariance of variables
) %>%
rename(intercept = 2) # Rename intercept column# Let's view the first few rows!
mydist %>%
head(3)| id | intercept | pop_black | pop_hisplat | pop | median_income | democrat_2016 |
|---|---|---|---|---|---|---|
| 1 | 5.542611 | -0.0265461 | -0.0049156 | 1e-07 | 4.32e-05 | 0.0021606 |
| 2 | 5.731225 | -0.0277801 | -0.0065645 | 0e+00 | 3.99e-05 | 0.0015924 |
| 3 | 5.662799 | -0.0272977 | -0.0061562 | 1e-07 | 3.98e-05 | 0.0043279 |
Great! So you can see, they are all very similar values - and similar to our actual model coefficients (mycoef). But, each is slightly different. If we make predictions with each of these 1000 sets of coefficients, we’ll get 1000 slightly different predictions.
However, mvrnorm() preserves the overall relationships between variables. For example, in our observed coefficients, the effect (beta) of pop_black on food_env_index is negative (beta = -0.027), while the effect of democrat_2016 on food_env_index is positive (beta = 0.002). In other words, they are opposite. When simulating our coefficients, we want each simulation (each row) to retain this general relationship - where the effects of pop_black and democrat_2016 should always be pretty opposite. Let’s test it! Are our coefficients correlated?
mydist %>%
get_correlation(formula = pop_black ~ democrat_2016)## cor
## 1 -0.6053813
That’s exactly what we find: these simulated coefficients are strongly negatively correlated with each other. This means our simulations are simultaneously (1) randomly selected but also (2) still reflect the traits of our actual data.
Learning Check 4
Question
mydist represent, and why does it matter whether its coefficients are correlated?
Answer
mydist contains not just one set of model coefficients, but 1000 slightly varied versions of our model coefficients. These values were randomly drawn from a multivariate normal distribution, and that kind of distribution is special because it makes sure that simulated coefficients (like pop_black and democrat_2016) keep the same correlations and patterns from our original model.
Types of Uncertainty
So, we’ve compiled (1) our set of predictor values (x), (2) the standard error in our model’s residuals (sigma), (3) our observed model coefficients (mycoef), and (4) our multivariate normal distribution of 1000 slightly different model coefficients (mydist). What now?
Now, we are going to use simulation in two different ways to adjust for (1) estimation uncertainty and (2) fundamental uncertainty. What does this mean?
Definitions
Estimation Uncertainty: the uncertainty in our predictions due to not knowing Beta and Alpha perfectly, because we had fewer than an infinite number of observations in our data. Our multivariate normal distribution helps us account for this.
Fundamental Uncertainty: the uncertainty in our predictions due to innumerable chance events, like weather, illness, people spilling their coffee, banana peels, etc. that may influence our outcome but are not contained within our predictors. Random samples from a normal distribution help us account for this.
Examples
- Estimation Uncertainty: If our coefficients were Marvel superheros, estimation uncertainty is akin to the feeling that I’m pretty sure the superheroes will team up and defeat the bad guy, but I’m not exactly sure how. We know that Thor will go do something hotheaded, and we know that the Hulk will SMASH, but we’re not exactly sure whether that means a little SMASH or a lot of SMASH, for example. So, we are uncertain on exactly what will happen, but we kind of already know what is going to happen when we go to the movies.
- Fundamental Uncertainty: If our coefficients were Marvel superheros, fundamental uncertainty is akin to the butterfly effect or the series What If, where tiny changes alter the outcome in innumerable, unpredictable ways.
We can account for estimation uncertainty and fundamental uncertainty in two different ways:
| Measure | Accounts for Estimation Uncertainty by: |
Accounts for Fundamental Uncertainty by: |
Confidence Interval |
|---|---|---|---|
| Predicted Value | 1000 simulations from Multivariate Normal Distribution |
1 simulation from Normal Distribution |
Wide |
| Expected Value | 1000 simulations from Multivariate Normal Distribution |
Average of 1000 simulations from Normal Distribution |
Narrow |
Learning Check 5
Question
myev and mypv, you simulated 1000 predicted values and 1000 expected values, and calculated the 95% confidence intervals for each. Which confidence interval was wider? Now compare that to the table above. Generally, does the confidence interval of predicted values tend to be wider or narrower than the confidence interval of expected values?
Answer
zev and zpv, the confidence interval for our predicted values (5.772 to 9.17) was much wider than the confidence interval for our expected values (7.436 to 7.5). This matches the table above, which says predicted values produce wider confidence intervals than expected values.
Coding
Coding Estimation Uncertainty
Now, we’re going to make 1000 predictions from our 1000 simulated coefficients! These will be saved into the pv column in the object zsim. We will make them by plugging our predictor data x into each of our 1000 slightly different model equations using our 1000 slightly different simulated coefficients, to get our 1000 slightly different predictions. These values reflect estimation uncertainty.
Model Equation: \(Y_{simulated} = Alpha + Beta_{1} \times X_{1.observed} + Beta_{2} \times X_{2.observed} + ...\)
diysim <- mydist %>%
mutate(
# Use our simulated coefficients,
# saved in mydist under each variable name
ysim = intercept +
# Multiply each by single value for variables in x
pop_black * x$pop_black +
pop_hisplat * x$pop_hisplat +
pop * x$pop +
median_income * x$median_income +
democrat_2016 * x$democrat_2016)diysim %>% # reorder variables
select(id, ysim, intercept:democrat_2016) %>%
head(3)| id | ysim | intercept | pop_black | pop_hisplat | pop | median_income | democrat_2016 |
|---|---|---|---|---|---|---|---|
| 1 | 7.475453 | 5.542611 | -0.0265461 | -0.0049156 | 1e-07 | 4.32e-05 | 0.0021606 |
| 2 | 7.454637 | 5.731225 | -0.0277801 | -0.0065645 | 0e+00 | 3.99e-05 | 0.0015924 |
| 3 | 7.483526 | 5.662799 | -0.0272977 | -0.0061562 | 1e-07 | 3.98e-05 | 0.0043279 |
For example, here are 3 of our simulated outcomes (lines), compared with all 1000 of them (histogram).
Coding Fundamental Uncertainty
Next, we can account for fundamental uncertainty using the rnorm() function.
rnorm() samples 1 or more values from a hypothetical distribution of values. To make that hypothetical distribution, we must give it a center (mean) and a standard deviation to estimate its width (sd). Our distribution will have a mean equal to our simulated value ysim, and a sd equal to sigma, the average variation of our model residuals. This distribution reflects the full range of simulated values we might get due to fundamental uncertainty. Some are larger, some are smaller, but most are still close to our simulated value. We can sample just 1 value from this distribution (n = 1) or many values, eg. 1000 (n = 1000).
To get the predicted value, we will randomly draw one value from our distribution approximating fundamental uncertainty, giving us our predicted value
pv.To get the expected value, we will randomly draw 1000 values from our distribution approximating fundamental uncertainty, and calculate the mean of those 1000 values, giving us our expected value
ev.
We will save them both in a data.frame called diyresults.
diyresults <- diysim %>%
# For each of our 1000 simulated outcomes,
# which already accounted for estimation uncertainty,
group_by(id) %>%
summarize(
# Record the value we used to calculate everything (9%)
pop_black = x$pop_black,
# Calculate the predicted value by drawing 1 value
# from our normal distribution
# to represent fundamental uncertainty
pv = rnorm(n = 1, mean = ysim, sd = sigma),
# Calculate the expected value by drawing 1000 values
# from our normal distribution
# and average them to eliminate fundamental uncertainty
ev = rnorm(n = 1000, mean = ysim, sd = sigma) %>% mean())Compare
Finally, let’s compare our version against Zelig’s version, by stacking our do-it-yourself data.frame of projections diyresults on top of our Zelig-made data.frame, and visualizing the distributions of these simulations. We see that our estimates line up nearly perfectly! Our expected values have a very thin spread, because we averaged to get rid of fundamental uncertainty, while predicted values have very wide spread, to reflect fundamental uncertainty.
Extra: Click to here to learn how we make this visual (optional)
We can do this using dplyr’s bind_rows() function, which takes two data.frames and stacks their variables sharing names on top of each other.
compare <- bind_rows(
# Get our version
diyresults %>% # labelling it by 'type'
mutate(type = "Do-It-Yourself"),
# Get Zelig's version
zresults %>% # labelling it by 'type'
mutate(type = "Zelig") )compare %>%
pivot_longer(cols = c(ev, pv),
names_to = "measure", values_to = "value") %>%
ggplot(mapping = aes(x = type, y = value, color = type)) +
geom_jitter(alpha = 0.5, shape = 21, size = 3, fill = "white") +
geom_violin(alpha = 0.75, color = "black") +
facet_wrap(~measure) +
# Add labels
labs(x = "Type", y = "Projected\nFood Environment Index",
subtitle = "Equivalent Results",
caption = "From 1000 simulations of average county\ngiven 9% Black Residents") +
scale_color_viridis(option = "mako", discrete = TRUE,
begin = 0.2, end = 0.8) +
theme_bw() +
guides(color = "none")
Learning Check 6
Question
ysim in our data.frame diysim refer to? How did we calculate it?
Answer
The variable ysimrefers to 1000 different simulated outcomes.We calculated it by taking 1000 slightly different model equations, each a row in mydist, plugging in the values of our predictors fromx(eg.pop_black = 9), and calculating the output of each of these equations. This produced our 1000 different simulated outcomes,ysim.Their variation reflects estimation uncertainty, or how our model coefficients could be slightly off due to chance.
Learning Check 7
Question
Answer
When we calculate a single predicted value, we draw just 1 value from a normal distribution using rnorm(). This adds some randomness to our predicted values, helping spread them out to demonstrate fundamental uncertainty.When we calcualte a single expected value, we draw 1000 value from a normal distribution using rnorm(), and then take their average. This approximates the fundamental uncertainty of our estimate, then lets us get rid of it by finding the most likely, most expected of our predictions, the average.
5. Applications
Now that we know how Zelig works, let’s spice things up. We can also simulate multiple conditions, like what is the effect of race, given the effect of region? First, let’s create our new Zelig model object, z2, controlling for region.
z2 <- counties %>%
mutate(region = factor(region)) %>% # Convert to factor
lm(formula = food_env_index ~ pop_black + pop_hisplat +
pop + median_income + democrat_2016 +
region) %>% # Also control for region
to_zelig() # convert to ZeligNext, let’s simulate what happens given 10%, 30%, or 50% Black residents, first in New England, and then in the South Atlantic states. Usually, we need to supply the same number of value to setx() for both conditions. We gave 6 values for pop_black, so we need to give 6 values for region too. (There are some exceptions, but it is a good rule of thumb.)
final <- z2 %>%
setx(pop_black = c(10, 30, 50,
10, 30, 50,
10, 30, 50),
region = c("New England", "New England", "New England",
"South Atlantic", "South Atlantic", "South Atlantic",
"Middle Atlantic", "Middle Atlantic", "Middle Atlantic")) %>%
sim() %>%
zelig_qi_to_df() %>%
select(pop_black, region, ev = expected_value) %>%
#
group_by(pop_black, region) %>%
summarize(lower = quantile(ev, probs = 0.025),
median = quantile(ev, probs = 0.50),
upper = quantile(ev, probs = 0.975))Last, let’s visualize with ggplot()!
We can make ribbons for our 95% confidence interval with
geom_ribbon().Since it’s only a few estimates, we can even add
geom_linerange()andgeom_point()over top of it.We can split it into panels by
regionwithfacet_wrap().We can order the panels by the
medianusingreorder(region, median).We can write in the actual median expected values using
geom_text(), settinglabelequal to the values.We can
round()those values to just 1 decimal place withround(median, 1).We can nudge those text labels over just 5 points on the x-axis, with
nudge_x = 5.We can turn off the
filllegend withguides(fill = "none").We can add line-breaks in labels using
" \n ".
final %>%
ggplot(mapping = aes(x = pop_black, y = median,
ymin = lower, ymax = upper,
fill = region)) +
# Add Confidence Intervals
geom_ribbon(alpha = 0.5, color = "white") +
geom_linerange(size = 1) +
# Add median expected value
geom_point(size = 3) +
# Split up by region, ordered by median
facet_wrap(~reorder(region, median) ) +
# Label median expected values
geom_text(mapping = aes(label = round(median, 1)), nudge_x = 5) +
# Cut fill legend
guides(fill = "none") +
# Add labels
labs(x = "% Black Residents",
y = "Expected\nFood Environment Index",
subtitle = "Effects of Racial Discrimination and Region\non Access to Healthy Food",
caption = "From 1000 simulations of average counties\n(Points = median; ribbons = 95% confidence intervals)") +
scale_fill_viridis(option = "mako", discrete = TRUE,
begin = 0.2, end = 0.8) +
theme_bw()
Learning Check 8
Question
Answer
Results ($40,000): We observe that as the share of Black residents increases from 10% to 50% in an average county with a median income of $40,000, our model simulates a decrease in expected Food Environment Index scores from a median of ~7 to ~5.8 points. Results ($50,000): However, as the median income increases by 10,000 USD to 50,000 USD, the expected Food Environment Index score also increases, by roughly 0.40 points. Results ($60,000): Given 60,000 USD, an otherwise average county still sees a decrease in expected Food Environment Index scores as the share of Black residents increases from 10% to 50%, but it now decreases from a median of ~7.8 to ~6.6 points. Interpretation: This shows that while racial discrimination’s long term effects impact access to healthy foods, income helps mediate this effect, meaning that poorer communities are especially at risk. See code below!
lc8 <- z2 %>%
setx(pop_black = c(10, 30, 50,
10, 30, 50,
10, 30, 50),
median_income = c(40000, 40000, 40000,
50000, 50000, 50000,
60000, 60000, 60000)) %>%
sim() %>%
zelig_qi_to_df() %>%
select(pop_black, median_income, ev = expected_value) %>%
#
group_by(pop_black, median_income) %>%
summarize(lower = quantile(ev, probs = 0.025),
median = quantile(ev, probs = 0.50),
upper = quantile(ev, probs = 0.975))lc8 # View it!| pop_black | median_income | lower | median | upper |
|---|---|---|---|---|
| 10 | 40000 | 6.908950 | 6.973227 | 7.044665 |
| 10 | 50000 | 7.326517 | 7.385222 | 7.450332 |
| 10 | 60000 | 7.731335 | 7.798524 | 7.860935 |
| 30 | 40000 | 6.296392 | 6.390919 | 6.484587 |
| 30 | 50000 | 6.706918 | 6.803420 | 6.891978 |
| 30 | 60000 | 7.110374 | 7.215271 | 7.309959 |
| 50 | 40000 | 5.657950 | 5.807041 | 5.943845 |
| 50 | 50000 | 6.065946 | 6.219433 | 6.362588 |
| 50 | 60000 | 6.473135 | 6.631735 | 6.779621 |
lc8 %>%
# Recode, changing it to a categorical variable
mutate(median_income = median_income %>% recode_factor(
"40000" = "$40,000",
"50000" = "$50,000",
"60000" = "$60,000")) %>%
ggplot(mapping = aes(x = pop_black, y = median,
ymin = lower, ymax = upper,
fill = median_income)) +
# Add Confidence Intervals
geom_ribbon(alpha = 0.5, color = "white") +
geom_linerange(size = 1) +
# Add median expected value
geom_point(size = 3) +
# Split up by region, ordered by median
facet_wrap(~reorder(median_income, median) ) +
# Label median expected values
geom_text(mapping = aes(label = round(median, 1)),
size = 5, nudge_x = 5) +
# Cut fill legend
guides(fill = "none") +
# Add labels
labs(x = "% Black Residents",
y = "Expected\nFood Environment Index",
subtitle = "Effects of Racial Discrimination and Income\non Access to Healthy Food",
caption = "From 1000 simulations of average counties\n(Points = median; ribbons = 95% confidence intervals)") +
scale_fill_viridis(option = "mako", discrete = TRUE,
begin = 0.2, end = 0.8) +
theme_bw()Congrats! You made it! You can now use statistical simulation to make meaningful projections about your outcomes of interest, using your models and the Zelig package. In fact, if you had to, you could now do it yourself too (although that sounds arduous, so please just use the Zelig package!). Happy simulating!