Missing Data

You need to know your Mechanism of Missingness (band name?)

  • MCAR: Missing Completely At Random. The missingness is totally unrelated to anything else in your dataset. For example, if your equipment just flips out sometimes for no reason and doesn’t record stuff, that will result in missing data that is MCAR.
  • MAR: Missing at Random. This is a bad name. It should probably be called “missing conditionally at random” (but that would produce a confusing acronym). This is when the missingness is unrelated to the data that’s gone, but it is related to something else in your dataset. That means that if you control for the thing(s) that predict missingness, then it will be missing at random. For example, you ask both men and women to report their marital status, and women are more likely to refuse than men, so you have more missing data from women than men. Importantly, we’re assuming that the missingness is not related to the missing data (i.e. marital status); within women, the ones who report or don’t report are totally random, and within men it’s totally random.
  • Non-Ignorable, or Missing Not at Random (MNAR): The missingness is related to the missing data itself. For example, single people are less likely to report marital status than married people. (A tiny simulation of all three mechanisms is sketched below.)
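
To make the three mechanisms concrete, here’s a tiny hypothetical simulation in R (all names and probabilities are made up for illustration):

# toy simulation of the three missingness mechanisms (hypothetical data)
set.seed(42)
n <- 1000
d <- data.frame(gender  = sample(c("man", "woman"), n, replace = TRUE),
                married = rbinom(n, 1, 0.5))

# MCAR: every value has the same 10% chance of going missing
mcar <- d$married
mcar[runif(n) < 0.10] <- NA

# MAR: women are more likely to skip the question than men,
# but within each gender the missingness is random
p.miss <- ifelse(d$gender == "woman", 0.20, 0.05)
mar <- d$married
mar[runif(n) < p.miss] <- NA

# MNAR: single people are more likely to skip the question,
# so the missingness depends on the value that's missing
p.miss <- ifelse(d$married == 0, 0.20, 0.05)
mnar <- d$married
mnar[runif(n) < p.miss] <- NA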

Other considerations

  • How much is missing?
  • Where is your missingness? All in one variable? All in one case? Spread out all over?
  • How related are the variable(s) with missingness to the other variables in your dataset? You can be missing more “information” with the same amount of missing data if the variable(s) with missingness are not very related to other variables in your data.
  • The best strategy for you will depend on what your data are like and what your missingness is like (a quick way to poke at this in base R is sketched after this list).
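
A quick way to eyeball how much is missing and where, in base R (toy data frame just for illustration):

# how much is missing, and where? (made-up data frame)
d <- data.frame(x = c(1, 2, NA, 4, 5),
                y = c(NA, 1, 0, 1, NA),
                z = c(10, 20, 30, NA, 50))
mean(is.na(d))                         # overall fraction of missing cells
colSums(is.na(d))                      # missing values per variable
table(rowSums(is.na(d)))               # how many cases have 0, 1, 2, ... missing values
cor(d, use = "pairwise.complete.obs")  # how related the variables are to each other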

Strategies for missing data

  • Listwise deletion: Analyze only cases with no missing data.
    • Easy. Oh, so easy. This is the default in most statistical software (including R).
    • Yields good results, as long as it’s MCAR (so the complete cases are a representative sample of all the cases) and you’re left with a sufficient sample size to not be underpowered.
  • Imputation: Replace the missing values with estimates, then analyze the full dataset.
    • Lots of ways to skin that cat
      • Mean imputation: Take the mean of the existing values and impute that.
      • Regression imputation: Use a regression equation based on the other variables to predict values for the missing values. This should give better estimates of the missing values, but it will shrink your SEs: if you now run a regression through your data, the imputed points all sit exactly on the line, so the data look less noisy than they really are.
      • Stochastic regression: Regression imputation + some random error (a sketch of mean, regression, and stochastic regression imputation follows this list).
      • Substitution: Find a new individual who is not already in the dataset, and use their value instead.
    • Data should be MAR
    • Works best when missingness is spread across variables
    • Can be problematic
      • May introduce bias (depending on which strategy you use to impute and what your data are like)
      • Understates uncertainty! The imputed values are treated as just as good as the measured values, even though really we’re a little uncertain about the imputed values. MI addresses this.
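
To make the single-imputation options above concrete, here’s a minimal sketch of mean, regression, and stochastic regression imputation on made-up data (illustration only; for a real analysis you’d usually want MI, below):

# toy single-imputation example (hypothetical data)
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
y[sample(100, 20)] <- NA                 # knock out 20 values at random

# mean imputation: every missing y gets the same value
y.mean <- y
y.mean[is.na(y)] <- mean(y, na.rm = TRUE)

# regression imputation: predict the missing y from x
fit <- lm(y ~ x)                         # lm() drops the missing cases for us
y.reg <- y
y.reg[is.na(y)] <- predict(fit, newdata = data.frame(x = x[is.na(y)]))

# stochastic regression imputation: prediction + random error, so the
# imputed points don't all sit exactly on the regression line
y.stoch <- y
y.stoch[is.na(y)] <- predict(fit, newdata = data.frame(x = x[is.na(y)])) +
  rnorm(sum(is.na(y)), sd = sigma(fit))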

MI (Multiple Imputation):

Use an imputation method with error (like stochastic regression), and then do it multiple times. This way, the real measured values are preserved in every dataset, but the imputed values change a little each time. Then you run your analyses on each dataset, get your results, and combine them (with magic, a.k.a. Rubin’s rules; see the sketch after the list below) in such a way that it takes into account the differences between datasets.
  • Any major stats package can do this (although perhaps not for every type of analysis). R does it like a boss.
  • Data must be MAR.
  • You can have a ton missing, as long as it’s MAR (simulations suggest as much as 70% of your data can be missing, and you’ll still end up with accurate estimates and accurate SEs!!).
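
The “magic” in the combining step is Rubin’s rules: the pooled estimate is just the average of the estimates across the imputed datasets, and the pooled SE combines the average within-imputation variance with the between-imputation variance (inflated by 1 + 1/m). mi.meld() below does this for you; here’s the arithmetic by hand, with made-up numbers:

# Rubin's rules by hand (toy numbers; mi.meld() does this for you below)
b  <- c(0.51, 0.48, 0.55, 0.50, 0.49)  # estimates of one coefficient from m = 5 imputed datasets
se <- c(0.10, 0.11, 0.10, 0.12, 0.10)  # their standard errors
m  <- length(b)

b.pooled <- mean(b)                    # pooled point estimate
w.var <- mean(se^2)                    # average within-imputation variance
b.var <- var(b)                        # between-imputation variance
se.pooled <- sqrt(w.var + (1 + 1/m) * b.var)  # pooled standard error
b.pooled
se.pooled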

# multiple imputation in R using Amelia
# install.packages("Amelia")
library(Amelia)
## Loading required package: foreign
## Loading required package: Rcpp
## Loading required package: RcppArmadillo
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.2, built: 2013-04-03)
## ## Copyright (C) 2005-2014 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
# data() # Amelia comes with some datasets
data(africa) # let's pull in the africa dataset
str(africa)
## 'data.frame':    120 obs. of  7 variables:
##  $ year      : num  1972 1973 1974 1975 1976 ...
##  $ country   : Factor w/ 6 levels "Burkina Faso",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gdp_pc    : int  377 376 393 416 435 448 445 461 457 466 ...
##  $ infl      : num  -2.92 7.6 8.72 18.76 -8.4 ...
##  $ trade     : num  29.7 31.3 35.2 40.1 37.8 ...
##  $ civlib    : num  0.5 0.5 0.333 0.333 0.5 ...
##  $ population: int  5848380 5958700 6075700 6202000 6341030 6486870 6639120 6797540 6962000 7132320 ...
?africa
# View(africa)
summary(africa)
##       year              country       gdp_pc          infl       
##  Min.   :1972   Burkina Faso:20   Min.   : 376   Min.   : -8.40  
##  1st Qu.:1977   Burundi     :20   1st Qu.: 514   1st Qu.:  4.76  
##  Median :1982   Cameroon    :20   Median :1036   Median :  8.72  
##  Mean   :1982   Congo       :20   Mean   :1058   Mean   : 12.75  
##  3rd Qu.:1986   Senegal     :20   3rd Qu.:1245   3rd Qu.: 13.56  
##  Max.   :1991   Zambia      :20   Max.   :2723   Max.   :127.89  
##                                   NA's   :2                      
##      trade           civlib        population      
##  Min.   : 24.4   Min.   :0.000   Min.   : 1332490  
##  1st Qu.: 38.5   1st Qu.:0.167   1st Qu.: 4332190  
##  Median : 59.6   Median :0.167   Median : 5853565  
##  Mean   : 62.6   Mean   :0.289   Mean   : 5765594  
##  3rd Qu.: 81.2   3rd Qu.:0.333   3rd Qu.: 7355000  
##  Max.   :134.1   Max.   :0.667   Max.   :11825390  
##  NA's   :5
summary(lm(africa$civlib ~ africa$trade)) # listwise deletion
## 
## Call:
## lm(formula = africa$civlib ~ africa$trade)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3004 -0.1233 -0.1173  0.0485  0.3814 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.278347   0.043693    6.37  4.2e-09 ***
## africa$trade 0.000184   0.000645    0.28     0.78    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 113 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.000718,   Adjusted R-squared:  -0.00813 
## F-statistic: 0.0812 on 1 and 113 DF,  p-value: 0.776
?amelia
m <- 5 # the number of datasets to create (5 is typical)
a.out <- amelia(x = africa, m = m, cs = "country", ts = "year", logs = "gdp_pc") # note that we're using all the variables, even though we won't use them all in the regression
## -- Imputation 1 --
## 
##   1  2
## 
## -- Imputation 2 --
## 
##   1  2  3
## 
## -- Imputation 3 --
## 
##   1  2  3
## 
## -- Imputation 4 --
## 
##   1  2  3
## 
## -- Imputation 5 --
## 
##   1  2
summary(a.out)
## 
## Amelia output with 5 imputed datasets.
## Return code:  1 
## Message:  Normal EM convergence. 
## 
## Chain Lengths:
## --------------
## Imputation 1:  2
## Imputation 2:  3
## Imputation 3:  3
## Imputation 4:  3
## Imputation 5:  2
## 
## Rows after Listwise Deletion:  115 
## Rows after Imputation:  120 
## Patterns of missingness in the data:  3 
## 
## Fraction Missing for original variables: 
## -----------------------------------------
## 
##            Fraction Missing
## year                0.00000
## country             0.00000
## gdp_pc              0.01667
## infl                0.00000
## trade               0.04167
## civlib              0.00000
## population          0.00000
plot(a.out)

[plot(a.out): diagnostic plots comparing observed and imputed values]

par(mfrow=c(1,1))
missmap(a.out)

[missmap(a.out): map of where the missing values are]

# run our regression on each dataset
b.out <- NULL
se.out <- NULL
for (i in 1:m) {
  ols.out <- lm(civlib ~ trade, data = a.out$imputations[[i]])
  b.out <- rbind(b.out, coef(ols.out))                   # stack the coefficients from each imputed dataset
  se.out <- rbind(se.out, coef(summary(ols.out))[, 2])   # column 2 of the coefficient table holds the SEs
}
# combine the results from all of the different regressions
combined.results <- mi.meld(q = b.out, se = se.out)
combined.results
## $q.mi
##      (Intercept)     trade
## [1,]      0.2743 0.0002349
## 
## $se.mi
##      (Intercept)     trade
## [1,]     0.04241 0.0006333
# Once you've generated the imputed datasets, you can run any analyses you like! 

# See if there are differences between countries on trade (5 missing cases)
# Get the mean for each country, and a SE around that mean. The easiest way to get means and SEs for a bunch of groups in R is with lm().
mean.out <- NULL
se.out <- NULL
for (i in 1:m) {
  lm.out <- lm(trade ~ 0 + country, data = a.out$imputations[[i]])  # no intercept, so each coefficient is a country mean
  mean.out <- rbind(mean.out, coef(lm.out))
  se.out <- rbind(se.out, coef(summary(lm.out))[, 2])
}
# combine the results from all of the different regressions
combined.results <- mi.meld(q = mean.out, se = se.out)
combined.results
## $q.mi
##      countryBurkina Faso countryBurundi countryCameroon countryCongo
## [1,]               39.18           33.8           49.51        101.1
##      countrySenegal countryZambia
## [1,]          71.84         76.03
## 
## $se.mi
##      countryBurkina Faso countryBurundi countryCameroon countryCongo
## [1,]               2.396          2.492           2.681        2.396
##      countrySenegal countryZambia
## [1,]          2.396         2.396
?AmeliaView # Sounds fun, but it didn't work for me. Meh.