Use an imputation method with error (like stochastic), and then do it mutiple times. This way, the real measured values are preserved in each dataset and get repeated across each time, but the imputed values change a little each time. Then you run your analyses on each dataset, and get your results and combine them (with magic) in such a way that it takes into account the difference between datasets.
* Any major stats package can do this (although perhaps not for any type of analysis). R does it like a boss.
* Data must be MAR
* You can have a ton missing, as long as it’s MAR (simulations suggest as much as 70% of your data can be missing, and you’ll still end up with accurate estimates and accurate SEs!!)
# multiple imputation in R using Amelia
# install.packages("Amelia")
library(Amelia)
## Loading required package: foreign
## Loading required package: Rcpp
## Loading required package: RcppArmadillo
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.2, built: 2013-04-03)
## ## Copyright (C) 2005-2014 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
# data() # Amelia comes with some datasets
data(africa) # let's pull in the africa dataset
str(africa)
## 'data.frame': 120 obs. of 7 variables:
## $ year : num 1972 1973 1974 1975 1976 ...
## $ country : Factor w/ 6 levels "Burkina Faso",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gdp_pc : int 377 376 393 416 435 448 445 461 457 466 ...
## $ infl : num -2.92 7.6 8.72 18.76 -8.4 ...
## $ trade : num 29.7 31.3 35.2 40.1 37.8 ...
## $ civlib : num 0.5 0.5 0.333 0.333 0.5 ...
## $ population: int 5848380 5958700 6075700 6202000 6341030 6486870 6639120 6797540 6962000 7132320 ...
?africa
# View(africa)
summary(africa)
## year country gdp_pc infl
## Min. :1972 Burkina Faso:20 Min. : 376 Min. : -8.40
## 1st Qu.:1977 Burundi :20 1st Qu.: 514 1st Qu.: 4.76
## Median :1982 Cameroon :20 Median :1036 Median : 8.72
## Mean :1982 Congo :20 Mean :1058 Mean : 12.75
## 3rd Qu.:1986 Senegal :20 3rd Qu.:1245 3rd Qu.: 13.56
## Max. :1991 Zambia :20 Max. :2723 Max. :127.89
## NA's :2
## trade civlib population
## Min. : 24.4 Min. :0.000 Min. : 1332490
## 1st Qu.: 38.5 1st Qu.:0.167 1st Qu.: 4332190
## Median : 59.6 Median :0.167 Median : 5853565
## Mean : 62.6 Mean :0.289 Mean : 5765594
## 3rd Qu.: 81.2 3rd Qu.:0.333 3rd Qu.: 7355000
## Max. :134.1 Max. :0.667 Max. :11825390
## NA's :5
summary(lm(africa$civlib ~ africa$trade)) # listwise deletion
##
## Call:
## lm(formula = africa$civlib ~ africa$trade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3004 -0.1233 -0.1173 0.0485 0.3814
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.278347 0.043693 6.37 4.2e-09 ***
## africa$trade 0.000184 0.000645 0.28 0.78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.179 on 113 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.000718, Adjusted R-squared: -0.00813
## F-statistic: 0.0812 on 1 and 113 DF, p-value: 0.776
?amelia
m <- 5 # the number of datasets to create (5 is typical)
a.out <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc") # note that we're using all the variables, even though we won't use them all in the regression
## -- Imputation 1 --
##
## 1 2
##
## -- Imputation 2 --
##
## 1 2 3
##
## -- Imputation 3 --
##
## 1 2 3
##
## -- Imputation 4 --
##
## 1 2 3
##
## -- Imputation 5 --
##
## 1 2
summary(a.out)
##
## Amelia output with 5 imputed datasets.
## Return code: 1
## Message: Normal EM convergence.
##
## Chain Lengths:
## --------------
## Imputation 1: 2
## Imputation 2: 3
## Imputation 3: 3
## Imputation 4: 3
## Imputation 5: 2
##
## Rows after Listwise Deletion: 115
## Rows after Imputation: 120
## Patterns of missingness in the data: 3
##
## Fraction Missing for original variables:
## -----------------------------------------
##
## Fraction Missing
## year 0.00000
## country 0.00000
## gdp_pc 0.01667
## infl 0.00000
## trade 0.04167
## civlib 0.00000
## population 0.00000
plot(a.out)
par(mfrow=c(1,1))
missmap(a.out)
# run our regression on each dataset
b.out<-NULL
se.out<-NULL
for(i in 1:m) {
ols.out <- lm(civlib ~ trade ,data = a.out$imputations[[i]])
b.out <- rbind(b.out, ols.out$coef)
se.out <- rbind(se.out, coef(summary(ols.out))[,2])
}
# combine the results from all of the different regressions
combined.results <- mi.meld(q = b.out, se = se.out)
combined.results
## $q.mi
## (Intercept) trade
## [1,] 0.2743 0.0002349
##
## $se.mi
## (Intercept) trade
## [1,] 0.04241 0.0006333
# Once you've generated the imputed datasets, you can run any analyses you like!
# See if there are differences between countries on trade (5 missing cases)
# Get the mean for each country, and a SE around that mean. The easiest way to get means and SEs for a bunch of groups in R is with lm().
mean.out<-NULL
se.out<-NULL
for(i in 1:m) {
lm.out <- lm(trade ~ 0 + country ,data = a.out$imputations[[i]])
mean.out <- rbind(mean.out, lm.out$coef)
se.out <- rbind(se.out, coef(summary(lm.out))[,2])
}
# combine the results from all of the different regressions
combined.results <- mi.meld(q = mean.out, se = se.out)
combined.results
## $q.mi
## countryBurkina Faso countryBurundi countryCameroon countryCongo
## [1,] 39.18 33.8 49.51 101.1
## countrySenegal countryZambia
## [1,] 71.84 76.03
##
## $se.mi
## countryBurkina Faso countryBurundi countryCameroon countryCongo
## [1,] 2.396 2.492 2.681 2.396
## countrySenegal countryZambia
## [1,] 2.396 2.396
?AmeliaView # Sounds fun, but it didn't work for me. Meh.