Week 12: Missing Data and Multiple Imputation

Packages used:

library(Amelia) library(Zelig) library(ZeligChoice) library(texreg)

What is missing data and why is it a problem?

  • how does it happen? respondents can refuse to answer questions, there can be interviewer error, lost records, sensitive information gets withheld, etc.

  • missing data and counterfactuals

library(Amelia)
library(Zelig)
library(ZeligChoice)
library(texreg)
library(ZeligChoice)
data(africa)
head(africa)
dim(africa)
[1] 120   7

In this case, 120 cases, 7 variables.

z.afr <- zelig(gdp_pc ~ trade + civlib, model = "ls", data = africa, cite = F)
#fix this htmlreg(z.afr, doctype=FALSE)
summary(z.afr)
Model: 

Call:
z5$zelig(formula = gdp_pc ~ trade + civlib, data = africa)

Residuals:
    Min      1Q  Median      3Q     Max 
-678.95 -225.33  -95.38  166.51  964.97 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  136.063    100.409   1.355 0.178120
trade         18.027      1.272  14.172  < 2e-16
civlib      -665.428    185.436  -3.588 0.000495

Residual standard error: 352.5 on 112 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.6534,    Adjusted R-squared:  0.6472 
F-statistic: 105.6 on 2 and 112 DF,  p-value: < 2.2e-16

Next step: Use 'setx' method
#built in Amelia dataset
data(freetrade)
dim(freetrade)
[1] 171  10

If no missing values, then all 171 cases should be present for all 10 variables

z.frt <- zelig(tariff ~ polity + pop + gdp.pc + year, model = "ls", data = freetrade, cite = F)
summary(z.frt)
Model: 

Call:
z5$zelig(formula = tariff ~ polity + pop + gdp.pc + year, data = freetrade)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.773  -9.260  -5.255   5.903  38.008 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.017e+03  6.897e+02   5.824 6.22e-08
polity       7.455e-01  3.234e-01   2.305 0.023103
pop          3.184e-08  6.011e-09   5.296 6.46e-07
gdp.pc      -2.139e-03  5.877e-04  -3.639 0.000424
year        -2.004e+00  3.468e-01  -5.777 7.71e-08

Residual standard error: 15.66 on 106 degrees of freedom
  (60 observations deleted due to missingness)
Multiple R-squared:  0.4839,    Adjusted R-squared:  0.4644 
F-statistic: 24.84 on 4 and 106 DF,  p-value: 1.591e-14

Next step: Use 'setx' method

60 observations missing – do we know that they are represented well by the others? we can’t say for sure that they are similar or not, so we don’t know if the model of the other 111 observations is any good.

What to do about it?

Traditional approaches

  1. Listwise deletion (complete case analysis)
  2. Pairwise deletion (e.g., regtools)
  3. Dummy variable + mean substitution
  4. Simple imputation (e.g., regression prediction) = predict the missed independent variable by looking at the other dependent variable values.
  • This last two were pretty common approaches 20 years ago, but recent research shows that the last two methods are often just as bad as the first two, and sometimes worse.

The common problems

If you estimate your missing values based on “best guess” your regression model will be very different than someone whose best guess was different, and the two competing models can’t really be measured against each other because we can’t know if either of the guesses are correct

Multiple Imputations

conceptual issues

The sampling analogy - Population parameter: unknown - Sample statistics: known - Repeated random samples: sampling distribution

Similarly - Impute the missing value multiple times - Estimate the selected model on each of the imputed data sets - Averaging/combining the coefficients to get the final results

If you create 20 different datasets with 20 different ‘guesses’ for the missing values (can be called complete case data or imputed data), combine the 20 different results, and average the coefficients, this is a better way of modeling missing data values.

Multiple imputations

technical issues

How to construct the imputation model? How to average multiple coefficients?

Constructing the imputation model

Many different ways, many R packages - Amelia (most used in this lecture) - mi - mice - norm - pan

Combining results I: point estimates

Simply the average of the m separate estimates

Combining results II: standard error

The variance of the multiple imputation point estimates equals the sum of: - The average of of the estimated variance from within each completed data set; and - The sample variance in the point estimates across the data set

No need to hand-calculate since we have access to packages like Amelia

# one line of code to generate the default number of imputations
# default in Amelia might be 5?
#cs is column number or variable name indicating the cross section variable.
#ts is column number or variable name indicating the variable identifying time in time series data
# gdp_pc is a large number, so taking the log is a more normalized way of looking at it but logs is optional command
data(africa)
a.out <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
-- Imputation 1 --

  1  2  3

-- Imputation 2 --

  1  2  3

-- Imputation 3 --

  1  2  3

-- Imputation 4 --

  1  2  3

-- Imputation 5 --

  1  2  3
a.out

Amelia output with 5 imputed datasets.
Return code:  1 
Message:  Normal EM convergence. 

Chain Lengths:
--------------
Imputation 1:  3
Imputation 2:  3
Imputation 3:  3
Imputation 4:  3
Imputation 5:  3
# what is in a.out? use names to see the elements
names(a.out)
 [1] "imputations" "m"           "missMatrix"  "overvalues"  "theta"       "mu"         
 [7] "covMatrices" "code"        "message"     "iterHist"    "arguments"   "orig.vars"  
View(a.out$imputations$imp1)
View(a.out$imputations$imp2)

We could do it for all five, but we get it

#amelia comes from same universe as zelig, so it's pretty easy to swap zelig and amelia commands
z.out <- zelig(gdp_pc ~ trade + civlib, model="ls", data=a.out, cite = FALSE)
summary(z.out)
Model: Combined Imputations 

            Estimate Std.Error z value Pr(>|z|)
(Intercept)   125.01    100.41    1.25  0.21312
trade          17.97      1.26   14.21  < 2e-16
civlib       -638.95    186.91   -3.42  0.00063

For results from individual imputed datasets, use summary(x, subset = i:j)
Next step: Use 'setx' method

This is our final estimated average of the multiply-imputed data regression models!

#if you're curious, you can take a look at the individual regression results
summary(z.out, subset = 1)
Imputed Dataset 1
Call:
z5$zelig(formula = gdp_pc ~ trade + civlib, data = a.out)

Residuals:
    Min      1Q  Median      3Q     Max 
-673.45 -226.04  -98.62  177.89  958.05 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  100.841    100.730   1.001  0.31885
trade         18.091      1.288  14.046  < 2e-16
civlib      -592.702    185.949  -3.187  0.00184

Residual standard error: 357.9 on 117 degrees of freedom
Multiple R-squared:  0.6364,    Adjusted R-squared:  0.6302 
F-statistic: 102.4 on 2 and 117 DF,  p-value: < 2.2e-16

Next step: Use 'setx' method
#use summary with a subset 1, 2, 3, 4, 5 to look at each individual model
#compare subset 2 
summary(z.out, subset = 2)
Imputed Dataset 2
Call:
z5$zelig(formula = gdp_pc ~ trade + civlib, data = a.out)

Residuals:
   Min     1Q Median     3Q    Max 
-679.7 -224.5 -103.5  171.4  958.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   125.15      96.74   1.294  0.19834
trade          18.02       1.24  14.529  < 2e-16
civlib       -627.63     181.30  -3.462  0.00075

Residual standard error: 348.9 on 117 degrees of freedom
Multiple R-squared:  0.653, Adjusted R-squared:  0.6471 
F-statistic: 110.1 on 2 and 117 DF,  p-value: < 2.2e-16

Next step: Use 'setx' method

….and we get it, this can be done for each imputed value set.

#and we can also just set up the new amelia data using zelig counterfactuals
z.out$setx()
z.out$sim()
plot(z.out)

#let's do this with the freetrade example
#it had a lot more missing values!
data(freetrade)
a.out <- amelia(freetrade, m = 20, ts = "year", cs = "country")
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

-- Imputation 6 --

  1  2  3  4  5  6  7  8  9 10 11 12

-- Imputation 7 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 8 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

-- Imputation 9 --

  1  2  3  4  5  6  7  8  9 10 11 12 13

-- Imputation 10 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

-- Imputation 11 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 12 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24

-- Imputation 13 --

  1  2  3  4  5  6  7  8  9 10

-- Imputation 14 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14

-- Imputation 15 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14

-- Imputation 16 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 17 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14

-- Imputation 18 --

  1  2  3  4  5  6  7  8  9 10 11

-- Imputation 19 --

  1  2  3  4  5  6  7  8  9 10 11 12

-- Imputation 20 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14
z.out <- zelig(tariff ~ polity + pop + gdp.pc + year + country, model = "ls", data = a.out, cite = FALSE)
#using the amelia dataset a.out with its multiple imputations as the input
summary(z.out)
Model: Combined Imputations 

                    Estimate Std.Error z value Pr(>|z|)
(Intercept)         2.47e+03  8.10e+02    3.04   0.0023
polity              3.90e-02  3.48e-01    0.11   0.9108
pop                -9.09e-08  5.10e-08   -1.78   0.0749
gdp.pc             -3.74e-04  1.52e-03   -0.25   0.8055
year               -1.17e+00  4.17e-01   -2.81   0.0049
countryIndonesia   -9.32e+01  3.68e+01   -2.53   0.0113
countryKorea       -1.14e+02  4.19e+01   -2.73   0.0063
countryMalaysia    -1.16e+02  4.36e+01   -2.67   0.0076
countryNepal       -1.11e+02  4.39e+01   -2.53   0.0114
countryPakistan    -6.75e+01  4.02e+01   -1.68   0.0926
countryPhilippines -1.07e+02  4.21e+01   -2.54   0.0110
countrySriLanka    -1.04e+02  4.44e+01   -2.35   0.0188
countryThailand    -1.00e+02  4.21e+01   -2.38   0.0173

For results from individual imputed datasets, use summary(x, subset = i:j)
Statistical Warning: The GIM test suggests this model is misspecified
 (based on comparisons between classical and robust SE's; see http://j.mp/GIMtest).
 We suggest you run diagnostics to ascertain the cause, respecify the model
 and run it again.

Next step: Use 'setx' method
#if you don't want to use zelig, you don't have to but it is a little more involved
#first create two empty vectors, b.out and se.out
b.out<-NULL
se.out<-NULL
#so for 1 to 5, estimate the regression model using one of the multiply imputed dataset
#create this model by filling in this vector
for(i in 1:a.out$m) {
  ols.out <- lm(tariff ~ polity + pop + gdp.pc, data = a.out$imputations[[i]])
  b.out <- rbind(b.out, ols.out$coef)
  se.out <- rbind(se.out, coef(summary(ols.out))[,2])
}
#now the two vectors are full, we can use the mi.meld command to do the actual average
combined.results <- mi.meld(q = b.out, se = se.out)
combined.results
$q.mi
     (Intercept)    polity          pop       gdp.pc
[1,]     32.8519 -0.184289 3.241249e-08 -0.002752912

$se.mi
     (Intercept)    polity          pop       gdp.pc
[1,]    2.406057 0.3046256 6.541023e-09 0.0006600312

Further considerations

The Amelia package with multiple imputation options can work with much more complicated data structures, variable types, etc, but the basic necessary steps have been illustrated.

The rest of the lecture will show ways to improve the basic procedure.

  1. What happens when variables are noncontinuous or non-normal?
  • Amelia works well even when some variables are discrete or non-normal
  • Nonetheless it helps to declare nominal and ordinal variables
  1. What about variables with large extreme variation, like GDP?
  • It can help to apply log-transformation to continuous variables that are heavily skewed
  • Including identification variables

Ordinal variable

table(a.out$imputations$imp3$polity)

               -8                -7                -6                -5                -4 
                1                22                 4                 7                 3 
-3.49025024538997                -2                -1                 2                 3 
                1                 9                 1                 7                 7 
                4  4.97767762751257                 5                 6                 7 
               15                 1                26                13                 5 
                8                 9 
               36                13 

Polity is an integer variable but imputation procedure produced some nonsensical values (you can see them, they are clear!)

#specify that polity is an ordinal variable instead of interval/ratio
a.out1 <- amelia(freetrade, m = 5, ts = "year", cs = "country", ords = "polity")
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
table(a.out1$imputations$imp3$polity)

-8 -7 -6 -5 -4 -2 -1  2  3  4  5  6  7  8  9 
 1 22  4  7  3  9  1  7  7 15 26 13  6 37 13 

Now they are all integer values, as they should be!

Nominal variables

table(a.out1$imputations$imp3$signed)

-0.208006652109979                  0  0.502541102297093   0.56307810696986                  1 
                 1                142                  1                  1                 26 
#specify that signed variable is nominal scale
a.out2 <- amelia(freetrade, m = 5, ts = "year", cs = "country", noms = "signed")
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
table(a.out2$imputations$imp3$signed)

  0   1 
144  27 

Transformation

hist(freetrade$tariff)

This is not a normal distribution, it is very positively skewed. Makes it not a great candidate for normal linear regression, but Amelia can normalize it before entering it into other models

Take the natural log!

hist(log(freetrade$tariff))

Much better!

Identification variables

The two variables, year and country remain in the data, but they do not contribute any information to the imputation

This is cross secional time series data. Datasets have it when you take cross sections data, like same variables in different years or different locations. Amelia is well suited to handling this year/country data by letting you specify cs and ts but even if it’s not country/year data, Amelia lets you put ID variables in one option to treat them as ID variables.

tmp <- amelia(freetrade, idvars = c("year", "country"))
-- Imputation 1 --

  1  2  3  4  5  6  7

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

Country and time in Amelia

Start with linear model

a.test1 <- amelia(freetrade, ts = "year", cs = "country")
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14
a.test2 <- amelia(freetrade, ts = "year", cs = "country", polytime = 2)
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10 11 12
a.test3 <- amelia(freetrade, ts = "year", cs = "country", polytime = 2,
                  intercs = TRUE)
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
 161 162 163 164 165 166 167 168 169 170 171 172 173 174

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520
 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540

Country and time linear

tscsPlot(a.test1, cs = "Malaysia", main = "Malaysia (no time settings)", 
         var = "tariff", ylim = c(-10, 60))

black dots represent original, non missing level red dots+bars represent imputed values with 95% confidence interval

(Conceptual question: You can have very narrow confidence intervals, or very wide confidence intervals. Which is preferred? Wider means that we are less certain of our estimate. If the CI is too wide, we say that the regression parameter is too vague and we can’t know if the influence of the independent variable is much at all.)

Country and time polynomial

introduce polynomial term to handle nonlinear data

tscsPlot(a.test2, cs = "Malaysia", main = "Malaysia (no time settings)", 
         var = "tariff", ylim = c(-10, 60))

tscsPlot(a.test3, cs = "Malaysia", main = "Malaysia (no time settings)", 
         var = "tariff", ylim = c(-10, 60))

This is a much tighter cluster, smaller confidence intervals; we can say this model looks a lot more useful.

This is the model that allows each country to have its own time trend (the interaction and 2nd order polynomial term) which is clearly a better model.

Cautions

There are three different kinds of missing data! - Missing completely at random (MCAR) - Missing at random (MAR) - Missing not at random (MNAR)

MI works well under MCAR and MAR but not so well under MNAR because if it’s not randomly missing, it suggests there is a reason it is missing consistently in one way and our actual observations may not easily represent the missing values.

MNAR means the missing data is dependent on something in your data, and something outside your data. Before doing any modeling, you need to figure out what that is.

MCAR is fine, but multiple inputation performs no better than a simple deletion substition solution, because the model is unaffected by the data. You can do MI but it’s not necessarily required.

MAR is random but not completely at random, and this is where MI really shines and is the best solution.

When doing MI, how many m?

---
title: "lecture 12: missing data"
output: 
  html_notebook: 
    theme: journal
---

# Week 12: Missing Data and Multiple Imputation

Packages used:

library(Amelia)
library(Zelig)
library(ZeligChoice)
library(texreg)

## What is missing data and why is it a problem?

- how does it happen?
respondents can refuse to answer questions, there can be interviewer error, lost records, sensitive information gets withheld, etc.

- missing data and counterfactuals


```{r message=FALSE, warning=FALSE}
library(Amelia)
library(Zelig)
library(ZeligChoice)
library(texreg)
library(ZeligChoice)
data(africa)
head(africa)
```

```{r message=FALSE, warning=FALSE}
#find out how many cases, how many variables
dim(africa)
```

In this case, 120 cases, 7 variables.

```{r message=FALSE, warning=FALSE}
z.afr <- zelig(gdp_pc ~ trade + civlib, model = "ls", data = africa, cite = F)
#fix this htmlreg(z.afr, doctype=FALSE)
summary(z.afr)

```


```{r}
#built in Amelia dataset
data(freetrade)
dim(freetrade)
```

If no missing values, then all 171 cases should be present for all 10 variables

```{r message=FALSE, warning=FALSE}
z.frt <- zelig(tariff ~ polity + pop + gdp.pc + year, model = "ls", data = freetrade, cite = F)
summary(z.frt)
```


60 observations missing -- do we know that they are represented well by the others? we can't say for sure that they are similar or not, so we don't know if the model of the other 111 observations is any good.

##What to do about it?

###Traditional approaches

1. Listwise deletion (complete case analysis)
2. Pairwise deletion (e.g., regtools)
3. Dummy variable + mean substitution
4. Simple imputation (e.g., regression prediction) = predict the missed independent variable by looking at the other dependent variable values. 

- This last two were pretty common approaches 20 years ago, but recent research shows that the last two methods are often just as bad as the first two, and sometimes worse.



# The common problems
- The loss of valuable information, reduced statistical power
- Leading to potentially biased sample
- If a value is missing, we will never know it for sure
- All we can do is to guess or impute the missing value
- Some guesses may be closer to the true value that is missing than others
- But we will never know

If you estimate your missing values based on "best guess" your regression model will be very different than someone whose best guess was different, and the two competing models can't really be measured against each other because we can't know if either of the guesses are correct

# Multiple Imputations
##conceptual issues

The sampling analogy
- Population parameter: unknown
- Sample statistics: known
- Repeated random samples: sampling distribution

Similarly
- Impute the missing value multiple times
- Estimate the selected model on each of the imputed data sets
- Averaging/combining the coefficients to get the final results

If you create 20 different datasets with 20 different 'guesses' for the missing values (can be called *complete case data* or *imputed data*), combine the 20 different results, and average the coefficients, this is a better way of modeling missing data values.

#Multiple imputations
##technical issues
How to construct the imputation model?
How to average multiple coefficients?

#Constructing the imputation model
Many different ways, many R packages
- Amelia (most used in this lecture)
- mi
- mice
- norm
- pan

##Combining results I: point estimates

Simply the average of the m separate estimates

##Combining results II: standard error

The variance of the multiple imputation point estimates equals the sum of:
- The average of of the estimated variance from within each completed data set; and
- The sample variance in the point estimates across the data set

No need to hand-calculate since we have access to packages like *Amelia*

```{r message=FALSE, warning=FALSE}
# one line of code to generate the default number of imputations
# default in Amelia might be 5?
#cs is column number or variable name indicating the cross section variable.
#ts is column number or variable name indicating the variable identifying time in time series data
# gdp_pc is a large number, so taking the log is a more normalized way of looking at it but logs is optional command
data(africa)
a.out <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
```

```{r message=FALSE, warning=FALSE}
a.out
```

```{r message=FALSE, warning=FALSE}
# what is in a.out? use names to see the elements
names(a.out)
```


```{r message=FALSE, warning=FALSE}
#generate new file to view (opens in new tab in Rstudio)
#note that imp1 refers to the 1st imputation
#we have 4, so we can do imp1, imp2, imp3, imp4, imp5 but imp6 is nothing and will be missing value
View(a.out$imputations$imp1)
```


```{r message=FALSE, warning=FALSE}
#example, shows different value in the missing field
View(a.out$imputations$imp2)
```

We could do it for all five, but we get it

```{r message=FALSE, warning=FALSE}
#amelia comes from same universe as zelig, so it's pretty easy to swap zelig and amelia commands
z.out <- zelig(gdp_pc ~ trade + civlib, model="ls", data=a.out, cite = FALSE)
#we have just supplied the amelia output we just created into the zelig ecosystem, no extra steps required
summary(z.out)
```

This is our final estimated average of the multiply-imputed data regression models! 

```{r message=FALSE, warning=FALSE}
#if you're curious, you can take a look at the individual regression results
summary(z.out, subset = 1)
#use summary with a subset 1, 2, 3, 4, 5 to look at each individual model
```


```{r message=FALSE, warning=FALSE}
#compare subset 2 
summary(z.out, subset = 2)
```

....and we get it, this can be done for each imputed value set.

```{r message=FALSE, warning=FALSE}
#and we can also just set up the new amelia data using zelig counterfactuals
#super easy and nice to do
z.out$setx()
z.out$sim()
plot(z.out)
```



```{r message=FALSE, warning=FALSE, paged.print=TRUE}
#let's do this with the freetrade example
#it had a lot more missing values!
data(freetrade)
a.out <- amelia(freetrade, m = 20, ts = "year", cs = "country")
z.out <- zelig(tariff ~ polity + pop + gdp.pc + year + country, model = "ls", data = a.out, cite = FALSE)
#using the amelia dataset a.out with its multiple imputations as the input
summary(z.out)
```

```{r}
#if you don't want to use zelig, you don't have to but it is a little more involved
#first create two empty vectors, b.out and se.out
b.out<-NULL
se.out<-NULL
#so for 1 to 5, estimate the regression model using one of the multiply imputed dataset
#create this model by filling in this vector
for(i in 1:a.out$m) {
  ols.out <- lm(tariff ~ polity + pop + gdp.pc, data = a.out$imputations[[i]])
  b.out <- rbind(b.out, ols.out$coef)
  se.out <- rbind(se.out, coef(summary(ols.out))[,2])
}
#now the two vectors are full, we can use the mi.meld command to do the actual average
combined.results <- mi.meld(q = b.out, se = se.out)
combined.results
```


##Further considerations

The Amelia package with multiple imputation options can work with much more complicated data structures, variable types, etc, but the basic necessary steps have been illustrated.

The rest of the lecture will show ways to improve the basic procedure.

1. What happens when variables are noncontinuous or non-normal?

- Amelia works well even when some variables are discrete or non-normal
- Nonetheless it helps to declare nominal and ordinal variables

2. What about variables with large extreme variation, like GDP?
- It can help to apply log-transformation to continuous variables that are heavily skewed
- Including identification variables


## Ordinal variable 

```{r}
table(a.out$imputations$imp3$polity)
```

Polity is an integer variable but imputation procedure produced some nonsensical values (you can see them, they are clear!)

```{r message=FALSE, warning=FALSE}
#specify that polity is an ordinal variable instead of interval/ratio
a.out1 <- amelia(freetrade, m = 5, ts = "year", cs = "country", ords = "polity")
```


```{r message=FALSE, warning=FALSE}
table(a.out1$imputations$imp3$polity)
```

Now they are all integer values, as they should be!

## Nominal variables

```{r}
table(a.out1$imputations$imp3$signed)
```

```{r}
#specify that signed variable is nominal scale
a.out2 <- amelia(freetrade, m = 5, ts = "year", cs = "country", noms = "signed")
```

```{r}
table(a.out2$imputations$imp3$signed)
```

## Transformation

```{r}
hist(freetrade$tariff)
```

This is not a normal distribution, it is very positively skewed. Makes it not a great candidate for normal linear regression, but Amelia can normalize it before entering it into other models

Take the natural log!

```{r}
hist(log(freetrade$tariff))
```

Much better!

## Identification variables

The two variables, year and country remain in the data, but they do not contribute any information to the imputation

This is *cross secional time series* data. Datasets have it when you take cross sections data, like same variables in different years or different locations.
Amelia is well suited to handling this year/country data by letting you specify *cs* and *ts* but even if it's not country/year data, Amelia lets you put ID variables in one option to treat them as ID variables.


```{r}
#idvars mean they are just ID variables
tmp <- amelia(freetrade, idvars = c("year", "country"))
```

##Country and time in *Amelia*

###Start with linear model


```{r}
#using cs and ts (not idvars) because they really are time series and location data
#the cs and ts is very powerful command and you should use it if it applies

# this assumes linear
a.test1 <- amelia(freetrade, ts = "year", cs = "country")

# introduce polynomial term to allow for nonlinearity
a.test2 <- amelia(freetrade, ts = "year", cs = "country", polytime = 2)

# third model introduces polynomial term and ALSO an interaction term between time sensitive and cross section term
# this means each country is allowed to have its unique time trend
# kind of like a no pooling model
a.test3 <- amelia(freetrade, ts = "year", cs = "country", polytime = 2,
                  intercs = TRUE)
```

## Country and time linear

```{r}
tscsPlot(a.test1, cs = "Malaysia", main = "Malaysia (no time settings)", 
         var = "tariff", ylim = c(-10, 60))
```

black dots represent original, non missing level
red dots+bars represent imputed values with 95% confidence interval

(Conceptual question: You can have very narrow confidence intervals, or very wide confidence intervals. Which is preferred? Wider means that we are less certain of our estimate. If the CI is too wide, we say that the regression parameter is too vague and we can't know if the influence of the independent variable is much at all.)


#Country and time polynomial

introduce polynomial term to handle nonlinear data

```{r}
tscsPlot(a.test2, cs = "Malaysia", main = "Malaysia (no time settings)", 
         var = "tariff", ylim = c(-10, 60))
```






```{r}
tscsPlot(a.test3, cs = "Malaysia", main = "Malaysia (no time settings)", 
         var = "tariff", ylim = c(-10, 60))
```

This is a much tighter cluster, smaller confidence intervals; we can say this model looks a lot more useful.

This is the model that allows each country to have its own time trend (the interaction and 2nd order polynomial term) which is clearly a better model.



##Cautions
There are three different kinds of missing data!
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missing not at random (MNAR)

MI works well under MCAR and MAR but not so well under MNAR because if it's not randomly missing, it suggests there is a reason it is missing consistently in one way and our actual observations may not easily represent the missing values.

MNAR means the missing data is dependent on something in your data, and something outside your data. Before doing any modeling, you need to figure out what that is.

MCAR is fine, but multiple inputation performs no better than a simple deletion substition solution, because the model is unaffected by the data. You can do MI but it's not necessarily required.

MAR is random but not *completely* at random, and this is where MI really shines and is the best solution.

# When doing MI, how many *m*?
- Amelia default is 5
- Many suggests that it is enough to have \(m = 10 \sim 20\)
- Some suggest that, for the MI method to work, \(m\) needs to be much larger
