1. Introduction

This page is a quick tutorial demonstrating multiple imputation with the Amelia package in R. Multiple imputation lets us fill in missing data by drawing on latent trends among the other variables in a dataset to estimate the most likely value for each missing data point. Of course, all projections are imperfect, so we customarily repeat this process multiple times, producing at least 5 imputed datasets, each just slightly different from the others. We can run our analyses on each of these datasets, and then at the end, we can ‘meld’ the statistics from the different imputations into one single statistic. If we are missing more data, we can increase the number of imputed datasets to compensate. Multiple imputation is robust to surprisingly high levels of missing data, as long as you increase the number of imputations accordingly.

To get started, please install the following packages!

install.packages(c("Amelia", "tidyverse", "Zelig", "texreg", "lmtest"))

Next, let’s load those packages, using the library() function!

library(tidyverse) # for data wrangling
library(Amelia) # for multiple imputation
library(Zelig) # for modeling
library(texreg) # for model tables
library(lmtest) # for likelihood ratio tests

Then, we’re going to download the following dataset, which measures several different quantities of interest about communities in Massachusetts census tracts. (You can download this dataset from here.)

Let’s load in our census tract data, filtering just to the year 2018!

dat <- read_csv("ma_data.csv") %>%
  filter(year == 2018)

Let’s take a quick look at the data available for these census tracts.

  • year: year of census results
  • geoid: census tract ID, made up of the county ID with the tract ID appended.
  • county: name of county where tract is located.
  • telephone: one of many indicators scaled from 0 to 1.

For context, these variables are indices, and they have all been scaled from 0 to 1, where 0 indicates the minimum level of that concept observed in the US, and 1 indicates the maximum level observed in the US. I’m not going to go into exactly what each index measures, but if you’re interested, you can check out our paper on these indices here!
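
For intuition, one common way to produce such 0-to-1 indices is min-max rescaling. Here’s a minimal sketch (rescale01 is a hypothetical helper, not necessarily the exact method used to build these indices):

rescale01 <- function(x) {
  # Map the observed minimum to 0 and the observed maximum to 1, ignoring NAs
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
rescale01(c(2, 5, 10)) # returns 0.000 0.375 1.000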

dat %>% head(3)
year geoid county telephone education ethnicity_similarity employment_equality gender_income_similarity gini local_gov state_gov fed_gov lang non_elder race_similarity voting_age civic_orgs religious_orgs unions advocacy_orgs charitable_orgs fraternal politicalacts
2018 25001010100 Barnstable 0.9092678 0.0256 0.8645237 0.8618754 0.0308195 0.1813842 0.8417009 0.7992685 0.6321977 0.8476390 0 0.7938806 1.0000000 0.7494544 0.9822647 1.0000000 0.6402613 1.000000 1 0.8715846
2018 25001010206 Barnstable 0.9432924 0.9360 0.9861428 0.8804411 0.0000000 0.4848846 0.8093308 0.7736245 0.5476534 0.9885815 0 0.7714279 0.8571022 1.0000000 0.5253820 0.5648315 1.0000000 0.984103 1 0.8715846
2018 25001010208 Barnstable 1.0000000 0.7232 0.9649274 0.8934169 0.2811448 0.7084328 0.7766107 0.6675968 0.0000000 0.9361930 0 0.9552881 0.8865977 1.0000000 0.9128424 0.9414016 0.6402613 1.000000 1 0.8715846

All set!

2. Missing Data

So how much of our data is missing in Massachusetts census tracts? We can use summarize_at(), mutate_at(), and pivot_longer() to calculate this for all our variables, all at once!

missing <- dat %>%
  # Calculate what proportion of each variable is missing
  summarize_at(
    vars(telephone:politicalacts), 
    list(~ sum(is.na(.)) / n() )) %>%
  # Multiply each proportion by 100 and round to 2 decimal places
  mutate_at(
    vars(telephone:politicalacts),
    list(~ round(. * 100, 2)) ) %>%
  # Pivot longer, to get one row per variable!
  pivot_longer(cols = everything(), names_to = "variable", values_to = "percent")

Let’s view our data.frame, which tallies the percent of missing data points per variable!

missing
variable percent
telephone 1.29
education 1.01
ethnicity_similarity 0.95
employment_equality 1.01
gender_income_similarity 0.00
gini 1.42
local_gov 1.01
state_gov 1.01
fed_gov 1.01
lang 0.95
non_elder 0.95
race_similarity 0.95
voting_age 1.01
civic_orgs 0.00
religious_orgs 0.14
unions 1.15
advocacy_orgs 0.00
charitable_orgs 0.00
fraternal 0.00
politicalacts 0.00

Okay, it looks like we have very low amounts of missing data, just about 1 percent for most variables. That’s very easy to work with.
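
Amelia also ships a handy missmap() function, which plots a visual map of missing cells. A quick optional sketch, using the same dat as above:

dat %>%
  select(telephone:politicalacts) %>%
  # Plot which cells are observed vs. missing, variable by variable
  missmap()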

3. Multiple Imputation

To impute our data, we’re going to do a few quick steps.

First, we’re going to set a seed for reproducibility. This tells R to use the same stream of random numbers during imputation every time we rerun the code.

set.seed(98754)

Next, you can run the algorithm right away, and save the output as mi!

mi <- dat %>% 
  # convert to data.frame (must do this)
  as.data.frame() %>%
  # And run amelia!
  amelia(
    # Give Amelia the number of imputations
    m = 5, 
    # Tell it to keep, but don't impute, these three ID variables
    idvars = c("year", "geoid", "county"),
    # Tell it to give up after 1000 resamples.
    max.resample = 1000, 
    # Use parallel processing to make it run fast!
    parallel = "multicore")
## -- Imputation 1 --
## 
##   1  2  3  4
## 
## -- Imputation 2 --
## 
##   1  2  3  4
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5
## 
## -- Imputation 4 --
## 
##   1  2  3
## 
## -- Imputation 5 --
## 
##   1  2  3  4

Let’s check out the contents!

mi # looks good.
## 
## Amelia output with 5 imputed datasets.
## Return code:  1 
## Message:  Normal EM convergence. 
## 
## Chain Lengths:
## --------------
## Imputation 1:  4
## Imputation 2:  4
## Imputation 3:  5
## Imputation 4:  3
## Imputation 5:  4

Let’s find where the imputations are stored: inside mi, under $imputations.

We can peek inside each of the imputed datasets to verify. The first imputed dataset is $imp1, the second is $imp2, and so on, up to $imp5.

mi$imputations$imp1 %>% head(2)
year geoid county telephone education ethnicity_similarity employment_equality gender_income_similarity gini local_gov state_gov fed_gov lang non_elder race_similarity voting_age civic_orgs religious_orgs unions advocacy_orgs charitable_orgs fraternal politicalacts
2018 25001010100 Barnstable 0.9092678 0.0256 0.8645237 0.8618754 0.0308195 0.1813842 0.8417009 0.7992685 0.6321977 0.8476390 0 0.7938806 1.0000000 0.7494544 0.9822647 1.0000000 0.6402613 1.000000 1 0.8715846
2018 25001010206 Barnstable 0.9432924 0.9360 0.9861428 0.8804411 0.0000000 0.4848846 0.8093308 0.7736245 0.5476534 0.9885815 0 0.7714279 0.8571022 1.0000000 0.5253820 0.5648315 1.0000000 0.984103 1 0.8715846
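
If you ever want all five imputations in a single long data frame (handy for diagnostics or plotting), here’s a minimal sketch using bind_rows(), assuming the mi object from above:

mi$imputations %>%
  # Stack the five imputed datasets, tagging each row with its imputation
  bind_rows(.id = "imputation") %>%
  # Confirm we have all five, each with the full set of tracts
  count(imputation)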

What if your data has a different format? Well, you could lightly adjust your code to something like the examples below.

If categorical/ordinal data:

dat %>% 
  as.data.frame() %>%
  amelia(
    m = 5, 
    idvars = c("year", "geoid", "county"),
    # Specify categorical and ordinal variables here
    noms = c("categoricalvar1", "categoricalvar2"),
    ords = c("ordinalvar1", "ordinalvar2"),
    # (Note: this example won't run on our dataset as-is,
    # because it doesn't contain any categorical or ordinal variables.)
    max.resample = 1000, 
    parallel = "multicore")

If time-series/panel data:

# Import the whole dataset, from 2010 to 2018
read_csv("ma_data.csv") %>% 
  as.data.frame() %>%
  amelia(
    m = 5, 
    ts = "year",  # specify time-series variable (the year)
    cs = "geoid", # specify cross-sectional variable (the tract)
    idvars = c("county"),
    max.resample = 1000, 
    parallel = "multicore")

4. Modeling

The Zelig package has great integration with Amelia-imputed datasets. So, I recommend using Zelig when working with imputed data, because it automatically melds estimates across the imputations.

Let’s test whether census tracts where residents tend to be of the same racial background have higher or lower rates of civic organizations.

z1 <- mi %>%
  zelig(formula = civic_orgs ~ race_similarity, 
        # Specify model = "ls" (least squares) for OLS
        # alternatively, if your outcome is binomial, write model = 'logit'
        model = "ls")
## How to cite this model in Zelig:
##   R Core Team. 2007.
##   ls: Least Squares Regression for Continuous Dependent Variables
##   in Christine Choirat, Christopher Gandrud, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
##   "Zelig: Everyone's Statistical Software," https://zeligproject.org/
# Let's view it!
z1
## Model: Combined Imputations 
## 
##                 Estimate Std.Error z value Pr(>|z|)
## (Intercept)      0.53847   0.00892    60.4  < 2e-16
## race_similarity  0.07766   0.01411     5.5  3.7e-08
## 
## For results from individual imputed datasets, use summary(x, subset = i:j)
## Next step: Use 'setx' method
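
For the curious, here’s a minimal sketch of what that melding does under the hood, using Amelia’s mi.meld() to apply Rubin’s rules by hand. The lm() fits are just for illustration; the melded estimates should closely match Zelig’s combined table above.

# Fit the same OLS model separately on each imputed dataset
fits <- lapply(mi$imputations, function(d) lm(civic_orgs ~ race_similarity, data = d))
# Gather coefficients and standard errors, one row per imputation
b  <- do.call(rbind, lapply(fits, coef))
se <- do.call(rbind, lapply(fits, function(f) coef(summary(f))[, "Std. Error"]))
# Combine them across imputations with Rubin's rules
mi.meld(q = b, se = se)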

Now let’s see whether that loosely positive effect remains after adjusting for ethnicity similarity (Hispanic/Latino vs. non-Hispanic/Latino).

z2 <- mi %>%
  zelig(formula = civic_orgs ~ race_similarity + ethnicity_similarity, 
        # Specify model = "ls" (least squares) for OLS
        # alternatively, if your outcome is binomial, write model = 'logit'
        model = "ls")
## How to cite this model in Zelig:
##   R Core Team. 2007.
##   ls: Least Squares Regression for Continuous Dependent Variables
##   in Christine Choirat, Christopher Gandrud, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
##   "Zelig: Everyone's Statistical Software," https://zeligproject.org/
# Let's view it!
z2
## Model: Combined Imputations 
## 
##                      Estimate Std.Error z value Pr(>|z|)
## (Intercept)            0.5417    0.0107   50.60  < 2e-16
## race_similarity        0.0849    0.0189    4.50  6.7e-06
## ethnicity_similarity  -0.0108    0.0192   -0.56     0.58
## 
## For results from individual imputed datasets, use summary(x, subset = i:j)
## Next step: Use 'setx' method

Now let’s see if that effect persists after accounting for income inequality (here, the gini index has actually been flipped, so that 1 = equality and 0 = inequality).

z3 <- mi %>%
  zelig(formula = civic_orgs ~ race_similarity + ethnicity_similarity + gini,
        # Specify model = "ls" (least squares) for OLS
        # alternatively, if your outcome is binomial, write model = 'logit'
        model = "ls")
## How to cite this model in Zelig:
##   R Core Team. 2007.
##   ls: Least Squares Regression for Continuous Dependent Variables
##   in Christine Choirat, Christopher Gandrud, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
##   "Zelig: Everyone's Statistical Software," https://zeligproject.org/
# Let's view it!
z3
## Model: Combined Imputations 
## 
##                      Estimate Std.Error z value Pr(>|z|)
## (Intercept)           0.59756   0.01227   48.70   <2e-16
## race_similarity       0.11676   0.01878    6.22    5e-10
## ethnicity_similarity  0.00116   0.01868    0.06     0.95
## gini                 -0.16342   0.01898   -8.61   <2e-16
## 
## For results from individual imputed datasets, use summary(x, subset = i:j)
## Next step: Use 'setx' method

Interesting! No one should read any policy implications into these results, because this is a very simple test model, but the story here is that racially homogeneous neighborhoods in Massachusetts tend to see more civic organizations (think environmental activists in Newton, for example), while neighborhoods with less income equality also tend to have more civic organizations. This could hint at how a lot of activism might occur in neighborhoods of Boston or Worcester where income inequality is high, prompting the need for a civil society response.

We can quickly view these results in a texreg table like this, displaying the beta coefficients, standard errors, and p-values via asterisks. A great deal of customization can be done on this table; run ?texreg to find out more.

screenreg(l = list(z1,z2,z3))
## 
## ===========================================================
##                       Model 1      Model 2      Model 3    
## -----------------------------------------------------------
## (Intercept)              0.54 ***     0.54 ***     0.60 ***
##                         (0.01)       (0.01)       (0.01)   
## race_similarity          0.08 ***     0.08 ***     0.12 ***
##                         (0.01)       (0.02)       (0.02)   
## ethnicity_similarity                 -0.01         0.00    
##                                      (0.02)       (0.02)   
## gini                                              -0.16 ***
##                                                   (0.02)   
## -----------------------------------------------------------
## Num. obs.             7390         7390         7390       
## Num. imp.                5            5            5       
## ===========================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
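
For example, here’s a minimal sketch of two of those options, custom.model.names and digits (the model labels themselves are just illustrative):

screenreg(
  l = list(z1, z2, z3),
  # Give each model a descriptive label
  custom.model.names = c("Base", "+ Ethnicity", "+ Gini"),
  # Show three decimal places instead of two
  digits = 3)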

5. Likelihood Ratio Tests

Sometimes it can be helpful to find out whether adding a covariate actually improves the fit of our models at all, or whether it is better disregarded. With multiply-imputed datasets, this can be tricky, because most testing functions are not easily compatible with Amelia output. Here’s how to do it with the lmtest package in R.

We can easily run a likelihood ratio test on just one imputed dataset at a time. For example, lmtest can take a normal model object, from lm() or similar. We can extract that lm()-style model object from a Zelig/Amelia-style model like this.

# This retrieves model of the FIRST imputed dataset, for model z1
m1 <- from_zelig_model(z1)[[1]]
# This retrieves model of the FIRST imputed dataset, for model z2
m2 <- from_zelig_model(z2)[[1]]
# This retrieves model of the FIRST imputed dataset, for model z3
m3 <- from_zelig_model(z3)[[1]]

And we can test whether adding each extra term improves model fit!

lrtest(m1,m2,m3)
## Likelihood ratio test
## 
## Model 1: civic_orgs ~ race_similarity
## Model 2: civic_orgs ~ race_similarity + ethnicity_similarity
## Model 3: civic_orgs ~ race_similarity + ethnicity_similarity + gini
##   #Df LogLik Df   Chisq Pr(>Chisq)    
## 1   3 659.18                          
## 2   4 659.36  1  0.3683     0.5439    
## 3   5 697.18  1 75.6433     <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We really care about the last column, Pr(>Chisq). It looks like the log-likelihood improved A LOT when we added the inverse gini coefficient, but not when we added ethnicity_similarity. We can tell because in row 2, model 2’s new term led to an insignificant increase in log-likelihood (p ≈ 0.54 here; the exact value may vary depending on Amelia’s stream of random numbers), while model 3’s new term led to a very significant boost (p < 0.001).

So, the quickest way to deal with this, without melding, is to run the likelihood ratio test on each of the five imputed datasets and keep the one that gave the worst p-values; this is a conservative approach (see the sketch below). Alternatively, you can learn about Rubin’s rules and meld the output of these tests. I’ll cover this more in a future tutorial.
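
Here’s a minimal sketch of that conservative check, assuming the z2 and z3 models from above: from_zelig_model() returns one lm() fit per imputation, so we can run lrtest() on each pair and collect the p-values.

# Extract all five lm() fits for models z2 and z3
m2_all <- from_zelig_model(z2)
m3_all <- from_zelig_model(z3)
# Run the likelihood ratio test on each imputation; grab row 2's p-value
pvals <- map2_dbl(m2_all, m3_all, ~ lrtest(.x, .y)$`Pr(>Chisq)`[2])
pvals
# The conservative choice is the largest (worst) p-value
max(pvals)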