Matching Experiments for Disaster Social Science

A Brief Tutorial on COVID19 Test Positivity Rates

1. Matching

Social scientists studying disasters, environmental politics, and climate change face a tricky puzzle: almost none of their key variables of interest can be administered as an intervention. Field experiments are impossible when, for example, the treatment of interest is being struck by a disaster, terrorism, COVID-19, or long term conditions of civil society, or unethical when the treatment is a life-saving intervention, like vaccinations. Further, while natural experiments are exciting, their requirement for as-if-randomness is very difficult to find in nature. However, over the last 15 years, social scientists and statisticians have delivered several tools to help us make better natural experiments, or at least better observational models. The foremost of these is coarsened exact matching, or CEM, developed by Iacus, King, & Porro (2012).

Matching is a surprisingly simple science. It identifies a smaller-but-more-comparable subset of data within a larger set of observed cases. It involves three steps.

Select matching variables - the traits of your cases which will be as similar as possible. The more matching variables, the smaller your final matched sample will be.
Select your treatment variable. Coarsened exact matching will identify the towns as similar as possible in terms of the matching variables, except that one set experienced the treatment while the other did not.
Create a matched sample with coarsened exact matching, which breaks your variables into categories based on their distributions and exactly matches treated cases with similar untreated cases.
Check whether your treated and untreated cases are, in fact, better balanced than in the original sample.
Apply your model or statistical test on your new matched sample, using the CEM weights produced by the matching process.

Let’s try out each step using the example of CEM models on COVID19 test positivity rates.

2. Application

To do this, you’ll need R (although you can do all of this in Stata too).

# If you haven't yet, uncomment and run the following code to install these packages.
# install.packages(c("tidyverse", "MatchIt", "cem", "texreg", "Zelig"))

# Now let's load these packages for use.
library(tidyverse) # for data manipulation, like the "%>%" (pipeline)
library(MatchIt) # For matching
library(cem) # for Coarsened Exact Matching
library(cobalt) # for Matching diagnostics
library(texreg) # for making statistical tables
library(Zelig) # For visualizing with statistical simulation

2.1 Data

We’re going to use this helpful dataset of Massachusetts County Subdivision COVID19 test positivity rates, sourced from a recent study dataset. County subdivisions are city-equivalents in census-speak.

# This is a time-series dataset of county subdivision observations over time. Let's zoom into just the first observation, on September 30th, 2020.
dat <- read_csv("ma_data.csv") %>%
  filter(date == "2020-09-30") %>%
  # Let's select our key variables of interest
  select(unit, # Unique ID Code
         city, # Name of County Subdivision/City 
         positivity_rate, # COVID Test Positivity Rate 
         social_capital, # Social Capital Index 
         bonding, # Bonding Social Capital Index 
         bridging, # Bridging Social Capital Index
         linking, # Linking Social Capital Index 
         # Social Vulnerability Indicies accounting for
         svi_household_disability, # Household traits and disability,
         svi_housing_tranport, # Housing and Transportation
         svi_socioeconomic, # Socioeconomic Status
         svi_minority) # Minority Status & Language

2.2 Treatment Variable

Next, let’s make our treatment variable. In this case, the treatment variable will be the effect of vulnerability by race, ethnicity, and language (our Minority Status & Language index). COVID has led to the deaths of 1 in 1000 Black Americans and the permanent closure of 40% of Black-owned small businesses. Has race played a large role in the pandemic in Massachusetts cities as well?

# Let's create a binary version of our treatment variable, called "treat"
dat <- dat %>%
  # by splitting the data into those cities with above the median level 
  # of this index and those with below the median level of this index.
  mutate(treat = if_else(
    condition = svi_minority > median(svi_minority, na.rm = TRUE),
    # If above, give it a 1; if below, give it a zero.
    true = 1, false = 0))

2.3 Match by Matching Covariates

Now, let’s use Coarsened Exact Matching, relying on the MatchIt and cem packages. Let’s get communities as similar as possible in terms of socioeconomic status, housing and transportation access, household-level vulnerability, and overall social capital, except that some of them had higher shares of minority residents while others had lower shares.

matched <- dat %>%
  # Let's match, mentioning our treatment variable,
  # and then listing matching covariates,
  matchit(formula = treat ~ svi_socioeconomic + 
            svi_housing_tranport + svi_household_disability +
            social_capital, 
  # and specifying that we want Coarsened Exact Matching (cem)
          method = "cem")

# Let's look at our matched sample object
matched

## A matchit object
##  - method: Coarsened exact matching
##  - number of obs.: 344 (original), 53 (matched)
##  - target estimand: ATT
##  - covariates: svi_socioeconomic, svi_housing_tranport, svi_household_disability, social_capital

# Wow! That's a lot smaller of a sample (53 vs. 344)

We can extract our own matched dataset.

sample <- dat %>%
  # Let's add the CEM weight in
  mutate(weights = matched$weights) %>%
  # And eliminate any that got weights of zero (they weren't matched)
  filter(weights > 0)

2.4 Compare Matching Covariates before and After Matching

We can also visually demonstrate that this matched sample is better than the original. The following shows for each of our samples, the absolute standardized average difference in covariate values between the treatment and control groups. We see that the matched model has much better balance between covariates than the original, which had very different levels of social capital.

# Using the cobalt package, let's make this plot!
love.plot(matched, abs = TRUE,
          var.order = "unadjusted", line = TRUE) +
  geom_vline(xintercept = 0.1, linetype = "dashed")

2.5 Descriptives

And as the following scatterplot shows, our matched sample does show an even stronger bivariate trend with COVID spread.

# Let's bind our original and matched sample on top of each other,
bind_rows(dat, sample, .id = "type") %>%
  # label each
  mutate(type = type %>% recode_factor(
    "1" = "Original", "2" = "Matched")) %>%
  # Let's be sure to weight our variables of interest for the matched sample
  mutate(svi_minority = if_else(condition = type == "Matched", 
                                true = svi_minority*weights, 
                                false = svi_minority),
         positivity_rate = if_else(condition = type == "Matched", 
                                   true = positivity_rate*weights,
                                   false = positivity_rate)) %>%
  # and visualize them as two scatterplots with best fit lines
  ggplot(mapping = aes(x = svi_minority, y = positivity_rate, color = type)) + 
  # add a scatterplot
  geom_point(alpha = 0.5) +
  # with lines of best fit
  geom_smooth(method = "lm", se = FALSE) +
  # with a log-transformed y-axis
    scale_y_log10() +
  # a minimalist background
  theme_minimal() +
  # nice colors
  scale_color_manual(values = c("steelblue", "firebrick")) +
  # and labels!
  labs(x = "Social Vulnerability from Minority Status & Language",
       y = "COVID19 Test Positivity Rates (log-scaled)", 
       color = "Sample",
       title = "Bivariate Scatterplot\nComparing Before and After Coarsened Exact Matching")

2.6 Model the Matched Sample

Using our matched sample, let’s run an Ordinary Least Squares model on COVID19 test positivity rates. We’ll model the log of the positivity rate as our outcome, because it’s right-skewed. Then, we’ll model the effect our treatment variable, svi_minority, the level of vulnerable residents from racial, ethnic, and linguistic minorities in the community. As controls, let’s add our other matching covariates - although it’s not strictly necessary, since we already controlled by matching for these factors. In this case, we can even add new and different control variables, like levels of bonding, bridging, and linking social capital. Finally, we must add our CEM weights too to weight this regression model.

matched_model <- sample %>%
  lm(formula = log(positivity_rate) ~ svi_minority +
       # while controlling for matching covariates (but not necessary)
       svi_socioeconomic + svi_housing_tranport + 
       svi_household_disability + 
       # and even breaking apart social capital into its subtypes
      bonding + bridging + linking, 
     # Remember to add CEM weights
     weights = weights)

We can also compare results with our original model.

# Let's run a linear model on our original dataset
normal_model <- dat %>%
  lm(formula = log(positivity_rate) ~  svi_minority + 
       svi_socioeconomic + svi_housing_tranport + svi_household_disability +
       bonding + bridging + linking)

In fact, let’s do so in a table, using the handy texreg package.

# Please make an html table
htmlreg(
  # featuring these models
  list(normal_model, matched_model),
  # with these names
        custom.model.names = c("Original Sample", "Matched Sample"),
  # with this table title
          caption = "OLS Models of COVID19 Test Positivity Rates (log)",
  # located on top
  caption.above = TRUE,
  # With covariates in bold if p < 0.10
        bold = 0.10,
  # And stars for these 4 levels of p values
        stars = c(0.001, 0.01, 0.05, 0.10),
  # And please show the F-statistic
    include.fstat = TRUE) %>%
  
  # And use magic to make HTML code into something we can read!
  htmltools::HTML()

OLS Models of COVID19 Test Positivity Rates (log)
	Original Sample	Matched Sample
(Intercept)	-14.05^**	-57.68^***
	(5.17)	(13.81)
svi_minority	9.90^***	18.06^***
	(1.39)	(2.98)
svi_socioeconomic	0.55	-2.35
	(1.28)	(3.62)
svi_housing_tranport	-4.29^***	-5.81^·
	(1.05)	(3.25)
svi_household_disability	2.43^*	6.21^·
	(1.08)	(3.31)
bonding	21.88^**	85.69^***
	(7.74)	(20.36)
bridging	30.32^*	56.66
	(14.17)	(39.74)
linking	-14.51^***	-3.98
	(4.31)	(10.33)
R²	0.23	0.59
Adj. R²	0.21	0.53
Num. obs.	344	53
F statistic	14.42	9.30
^*p < 0.001; ^p < 0.01; ^*p < 0.05; ^·p < 0.1

We can take several findings from this.

First, our matched sample explains a much higher share of variation in COVID spread than our original sample.
Even among communities as similar as possible in terms of vulnerability, communities with higher shares of ethnic minorities suffered much greater rates of COVID spread.
Among these communities, bonding social capital also was associated with greater COVID spread (possibly due to insular social networks).

2.7 Visualization

Finally, let’s visualize these results! The Zelig package is really good for this. Zelig simulates the outcome 1000 times, holding all variables except one at their means, to see the effect of that variable. (The way they do it is very cool, but more than we can cover here.)

library(Zelig)

# Let's make our model again, but in Zelig
zelig_model <- sample %>%
  zelig(formula = log(positivity_rate) ~ svi_minority +
       svi_socioeconomic + svi_housing_tranport + 
       svi_household_disability + 
      bonding + bridging + linking, 
      model = "ls",  # specify model type
      weights = "weights",  # tell Zelig to use our CEM "weights" variable
      cite = FALSE) # remove extra material

# Let's simulate the effects of our model
matched_sim <- zelig_model %>%
  # Varying Minority Status from 0 to 1 over 10 values,
  # while holding everything else constant at their means/modes
  setx(svi_minority = seq(from = quantile(zelig_model$data$svi_minority, 0.2), 
                          to = quantile(zelig_model$data$svi_minority, 0.8), 
                          length.out = 10)) %>%
  sim() %>%
  # Then convert the simulated results to a data.frame
  zelig_qi_to_df() %>%
  # And get the median expected outcome, with 95% confidence intervals
  qi_slimmer(qi_type = "ev", ci = 0.95)

Next, let’s visualize the results of our simulations.

# Using our simulated expected outcomes,
matched_sim %>%
  # We can make a plot, telling R where to look for our data,
  ggplot(mapping = aes(
    x = svi_minority, # Where the x-axis is minority status,
    # The y-axis is the median expected outcome
    y = exp(qi_ci_median), # we have to exponeniate it, since it's a log-outcome
    # The lower 95% confidence interval
    ymin = exp(qi_ci_min), 
    # The upper 95% confidence interval
    ymax = exp(qi_ci_max) )) +
  # Plotting a line
  geom_line() +
  # And a nice blue ribbon around it
  geom_ribbon(alpha = 0.5, fill = "steelblue")  +
  # and a nice minimalist theme
  theme_minimal() +
  # and some simple labels
  labs(x = "Social Vulnerability by Minority Status & Language\n(20th - 80th Percentiles)",
       y = "Median Expected COVID19 Test Positivity Rate\nWith 95% Confidence Interval",
       title = "Minority Communities face higher Risk of COVID Spread\nin a Sample Matched by Social Vulnerability & Social Capital")

3. Future Directions

What else can we do with matched models? Below contains just a few options.

Longitudinal Matching Experiments - get a matched sample, and then follow an outcome among those cases over time. See Fraser, Cunningham, and Nasongo (2021) for an example!
Interaction effects - You can get a matched sample as similar as possible except for high densities of ethnic minorities and social capital, for example. See Fraser, Aldrich, & Page-Tan (2020) for an example.

Questions

Looking for the data? Download the source data from the Harvard Dataverse.

Have more questions? Reach out to me at timothy.fraser.1@gmail.com or on ResearchGate!

References:

Iacus, Stefano M., Gary King, and Giuseppe Porro. (2012). Causal Inference without Balance Checking. Political Analysis 20(1), 1-24. https://gking.harvard.edu/files/political_analysis-2011-iacus-pan_mpr013.pdf
Fraser, Timothy, Daniel P. Aldrich, and Courtney Page-Tan. (2020). Bowling Alone or Masking Together? The Role of Social Capital in Excess Death Rates from COVID19. Social Science Research Network. https://ssrn.com/abstract=3744251 or http://dx.doi.org/10.2139/ssrn.3744251
Fraser, Timothy, Lily Cunningham, & Amos Nasongo. (2021). Build Back Better? Effects of Crisis on Climate Change Adaptation Through Solar Power in Japan and the United States. Global Environmental Politics 21(1), 54-75.
Fraser, Timothy; Courtney Page-Tan; Daniel P. Aldrich, 2021, “Replication Data for: Social Capital Indices for county subdivisions, zipcodes, and census tracts”, https://doi.org/10.7910/DVN/OSVCRC, Harvard Dataverse, V1.