Here we briefly review some of the theory surrounding multiple imputation (MI), drawn from here to start with. Missing data is common, and can bias results or lead to substantial losses of data if otherwise complete observations are dropped because a single variable is missing.
There are different types of missing data, which influence the approach to managing it.
- Missing completely at random, where there are no systematic differences between the observed and missing values.
- Missing at random, where any systematic difference between the observed and missing values can be explained by differences in the observed data.
- Missing not at random, where there are systematic differences between observed and missing values even when the observed data are taken into account.
There are a variety of solutions to missing data, most of which are not ideal. For example, single imputation (for instance, replacing missing values with the mean of the variable) often causes standard errors to be too small.
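A toy simulation (my own illustration, not from the paper) shows why: mean-imputed values add no spread but still count towards the sample size, so the standard error of the mean shrinks artificially.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a variable, then delete 30% of values completely at random.
x = rng.normal(loc=10, scale=2, size=1000)
missing = rng.random(1000) < 0.3

# Single (mean) imputation: replace every missing value with the observed mean.
observed = x[~missing]
imputed = x.copy()
imputed[missing] = observed.mean()

# The imputed values add no variability but inflate the apparent n, so the
# standard error of the mean is understated relative to the complete cases.
print("SE, complete cases only:  ", observed.std(ddof=1) / np.sqrt(observed.size))
print("SE, after mean imputation:", imputed.std(ddof=1) / np.sqrt(imputed.size))
```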
Multiple imputation incorporates uncertainty about missing data by creating several plausible imputed datasets and combining results from each of them. Multiple imputation generally has two steps.
- Create multiple versions of the dataset, imputing values for missing data. These imputed values are sampled from the distribution of the observed data.
- Fit the desired model to each imputed dataset. Estimates will differ between the imputed datasets, so they are averaged together to give the overall estimated association. Standard errors are then calculated using Rubin’s rules, which take into account the variability in the results due to the imputed data (see the sketch below).
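A minimal sketch of those two steps, assuming scikit-learn's `IterativeImputer` as the imputation engine and an ordinary linear model as the analysis model (neither is specified by the paper; substitute your own imputation method and model):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def mi_fit_pool(X, y, m=20):
    """Impute m times, fit a linear model each time, pool with Rubin's rules."""
    estimates, variances = [], []
    for i in range(m):
        # Step 1: one stochastic imputation, including the outcome in the
        # imputation model (omitting it is a common pitfall, noted below).
        data = np.column_stack([X, y])
        completed = IterativeImputer(sample_posterior=True,
                                     random_state=i).fit_transform(data)
        X_imp, y_imp = completed[:, :-1], completed[:, -1]
        # Step 2: fit the analysis model to the completed dataset.
        fit = sm.OLS(y_imp, sm.add_constant(X_imp)).fit()
        estimates.append(fit.params)
        variances.append(fit.bse ** 2)
    estimates, variances = np.array(estimates), np.array(variances)
    # Rubin's rules: average the estimates; total variance is the mean
    # within-imputation variance plus (1 + 1/m) times the between-imputation
    # variance of the estimates.
    pooled = estimates.mean(axis=0)
    total_var = variances.mean(axis=0) + (1 + 1 / m) * estimates.var(axis=0, ddof=1)
    return pooled, np.sqrt(total_var)
```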
Some common pitfalls include:
- Omitting the outcome variable from the imputation procedure
- Mishandling non-normally distributed variables, or binary or categorical data (although solutions for dealing with these data exist)
- Failing to meet the missing at random assumption
The above-mentioned paper suggests some reporting guidelines:
- "Report the number of missing values for each variable of interest, or the number of cases with complete data for each important component of the analysis. Give reasons for missing values if possible, and indicate how many individuals were excluded because of missing data when reporting the flow of participants through the study. If possible, describe reasons for missing data in terms of other variables (rather than just reporting a universal reason such as treatment failure).
- Clarify whether there are important differences between individuals with complete and incomplete data—for example, by providing a table comparing the distributions of key exposure and outcome variables in these different groups
- Describe the type of analysis used to account for missing data (eg, multiple imputation), and the assumptions that were made (eg, missing at random)
For analyses based on multiple imputation
- Report details of the software used and of key settings for the imputation modelling
- Report the number of imputed datasets that were created (Although five imputed datasets have been suggested to be sufficient on theoretical grounds, a larger number (at least 20) may be preferable to reduce sampling variability from the imputation process)
- What variables were included in the imputation procedure?
- How were non-normally distributed and binary/categorical variables dealt with?
- If statistical interactions were included in the final analyses, were they also included in imputation models?
- If a large fraction of the data is imputed, compare observed and imputed values
- Where possible, provide results from analyses restricted to complete cases, for comparison with results based on multiple imputation. If there are important differences between the results, suggest explanations, bearing in mind that analyses of complete cases may suffer more chance variation, and that under the missing at random assumption multiple imputation should correct biases that may arise in complete cases analyses
- Discuss whether the variables included in the imputation model make the missing at random assumption plausible
- It is also desirable to investigate the robustness of key inferences to possible departures from the missing at random assumption, by assuming a range of missing not at random mechanisms in sensitivity analyses."
In our case, we’re combining MI and bootstrapping. On the one hand, the method for calculating standard errors for models using MI is established (although I don’t understand it). Similarly, calculating CIs from bootstrapping, by drawing given percentiles from the ordered set of estimates, is straightforward. However, combining the two methods, for example when trying to calculate CIs for non-normally distributed data, is less clear. Should you do MI first and then bootstrap from the imputed data, or the other way around? And do you pool the estimates from the imputed datasets and find the CI from the pooled values, or keep every estimate from every imputed dataset and find the CI from those?
I am not sure, but from my understanding of this paper, there are a couple of acceptable approaches, partially depending on the amount of within-bootstrap variation and between-imputation variation. The consequence of using the wrong method is that the SE can be too wide – in other words, getting it wrong makes us more likely to get a false negative. I do not really know how to meaningfully compare the within-bootstrap and between-imputation variation, so I’m going to proceed with one of the suggested approaches: do the MI first, then bootstrap from each imputed dataset. You then pool the estimates from all the imputed datasets and take given percentiles (like the 2.5th and 97.5th) from the ordered estimates as your CIs. My understanding is that this corresponds to “Method 1”, which was deemed acceptable in the above paper. Part of the reason for this choice is that the model already takes about two days to run with bootstrapping, and the approach of doing MI on bootstrapped datasets would take an order of magnitude longer. Additionally, the variables with missing data are not of that much interest to us, so it’s not the end of the world if we’re overly conservative about their statistical significance; it’s not going to bias the other results of interest.
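Here is a rough sketch of that procedure (hypothetical names; it assumes the imputed datasets are already available as numpy arrays and that `estimate_fn` computes the point estimate of interest from one dataset):

```python
import numpy as np


def mi_then_bootstrap_ci(imputed_datasets, estimate_fn, n_boot=1000,
                         alpha=0.05, seed=0):
    """Percentile CI from bootstrapping within each imputed dataset.

    MI is done first; each of the m completed datasets is bootstrapped, the
    m * n_boot estimates are pooled, and the CI is read off the percentiles.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for data in imputed_datasets:
        n = len(data)
        for _ in range(n_boot):
            # Resample rows with replacement within this imputed dataset.
            resample = data[rng.integers(0, n, size=n)]
            estimates.append(estimate_fn(resample))
    # Pool all estimates across imputations and bootstrap replicates, then
    # take e.g. the 2.5th and 97.5th percentiles (alpha = 0.05) as the CI.
    lower, upper = np.percentile(estimates, [100 * alpha / 2,
                                             100 * (1 - alpha / 2)])
    return lower, upper
```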