Here we briefly review some of the theory surrounding multiple imputation (MI), drawn initially from here. Missing data are common, and can bias results or lead to substantial losses of data if otherwise complete observations are dropped because one variable is missing.

There are different types of missing data, which influence the approach to managing it.

There are a variety of solutions to missing data, most of which are not ideal. For example, single imputation (for instance, replacing missing values with the mean of the variable) often causes standard errors to be too small, because the filled-in values are treated as if they had actually been observed.
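To make the point about standard errors concrete, here is a minimal sketch in Python using only the standard library. The data, the 40% missingness rate, and the helper `standard_error` are all hypothetical and chosen purely for illustration. Mean imputation adds rows without adding any spread, so the standard error of the mean shrinks even though no new information has been gained.

```python
import random
import statistics

def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / len(values) ** 0.5

random.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical variable: 200 draws from a normal distribution.
full = [random.gauss(50, 10) for _ in range(200)]

# Delete roughly 40% of the values completely at random (illustrative only).
observed = [x for x in full if random.random() > 0.4]
n_missing = len(full) - len(observed)

# Single (mean) imputation: every missing value becomes the observed mean.
observed_mean = statistics.mean(observed)
mean_imputed = observed + [observed_mean] * n_missing

se_observed = standard_error(observed)
se_imputed = standard_error(mean_imputed)

print(f"SE from observed values only: {se_observed:.3f}")
print(f"SE after mean imputation:     {se_imputed:.3f}")
```

The second standard error is smaller than the first, which is the overconfidence that single imputation introduces.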

Multiple imputation incorporates uncertainty about the missing data by creating several plausible imputed data sets and combining the results from each of them. Multiple imputation generally has two steps.
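The combining stage is commonly done with Rubin's rules. The text above does not name them, so treat this as an assumption about what "combining" means here. The sketch below assumes a hypothetical helper `pool_rubin` and made-up estimates and standard errors from five imputed data sets. The pooled estimate is the mean of the per-dataset estimates, and the pooled variance is the average within-imputation variance plus the between-imputation variance inflated by `1 + 1/m`.

```python
import math
import statistics

def pool_rubin(estimates, standard_errors):
    """Pool per-imputation estimates and SEs using Rubin's rules.

    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.mean([se ** 2 for se in standard_errors])
    # Between-imputation variance: sample variance of the estimates.
    between = statistics.variance(estimates)
    # Total variance, with the finite-m correction on the between part.
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)

# Made-up results from m = 5 imputed data sets (illustrative values only).
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
standard_errors = [0.2, 0.2, 0.2, 0.2, 0.2]

pooled_estimate, pooled_se = pool_rubin(estimates, standard_errors)
print(f"pooled estimate: {pooled_estimate:.3f}, pooled SE: {pooled_se:.3f}")
```

The pooled SE is larger than any single data set's SE of 0.2, because disagreement between the imputed data sets is counted as extra uncertainty.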

Some common pitfalls include:

The above-mentioned paper suggests some reporting guidelines:

For analyses based on multiple imputation

In our case, we’re combining MI and bootstrapping. On one hand, the method for calculating SEs for models using MI is established (although I don’t understand it). Similarly, calculating CIs from bootstrapping, by taking given percentiles from the ordered set of estimates, is straightforward. However, combining the two methods (for example, when trying to calculate CIs from non-normally distributed data) is less clear. Should you do MI first and then bootstrap, or the other way around? And do you pool the estimates from each of the imputed datasets and find the CI from that, or do you include each estimate from each imputed dataset and find the CI from that?

I am not sure, but from my understanding of this paper, there are a couple of acceptable approaches, partially depending on the amount of within-bootstrap variation and between-imputed-dataset variation. The consequence of using the wrong method is that the CIs can be too wide. In other words, getting it wrong makes us more likely to get a false negative. I do not really know how to meaningfully compare the within-bootstrap and between-imputed-dataset variation, so I’m going to proceed with one of the suggested approaches. This is where you do the MI first, then bootstrap from each imputed dataset. You then pool the bootstrap estimates from all of the imputed datasets, and take given percentiles (like the 2.5th and 97.5th) of the ordered estimates as the limits of your CI. My understanding is that this corresponds to “Method 1”, which was deemed acceptable, in the above paper.

Part of the reason for this choice is that the model already takes about two days to run with bootstrapping, and the approach of doing MI on bootstrapped datasets would take an order of magnitude longer. Additionally, the variables with missing data are not of that much interest to us, so it’s not the end of the world if we’re overly conservative about their statistical significance; it’s not going to bias the other results of interest.
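The "MI first, then bootstrap, then pool and take percentiles" procedure can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the actual analysis: the data are simulated, the estimator is just a mean rather than our model, and `impute_once` is a deliberately crude, hypothetical stand-in for a real imputation model (it draws missing values from a normal distribution fitted to the observed values). The function names and the defaults of 5 imputations and 200 bootstrap resamples are likewise made up.

```python
import random
import statistics

def impute_once(values, rng):
    """Fill each None with a draw from N(observed mean, observed SD).

    A crude placeholder for a proper imputation model.
    """
    observed = [v for v in values if v is not None]
    mu, sd = statistics.mean(observed), statistics.stdev(observed)
    return [v if v is not None else rng.gauss(mu, sd) for v in values]

def percentile(sorted_values, p):
    """Simple nearest-index percentile of an already sorted list (p in 0-100)."""
    index = round(p / 100 * (len(sorted_values) - 1))
    return sorted_values[index]

def mi_then_bootstrap_ci(values, estimator, n_imputations=5, n_boot=200, seed=0):
    """Impute first, bootstrap each imputed dataset, pool all estimates,
    and return the 2.5th and 97.5th percentiles plus the pooled count."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_imputations):
        completed = impute_once(values, rng)
        for _ in range(n_boot):
            # Resample rows with replacement from this imputed dataset.
            resample = rng.choices(completed, k=len(completed))
            pooled.append(estimator(resample))
    pooled.sort()
    return percentile(pooled, 2.5), percentile(pooled, 97.5), len(pooled)

# Simulated variable with roughly 20% of values missing (illustrative only).
data_rng = random.Random(42)
data = [data_rng.gauss(10, 2) if data_rng.random() > 0.2 else None
        for _ in range(100)]

lower, upper, n_estimates = mi_then_bootstrap_ci(data, statistics.mean)
print(f"{n_estimates} pooled estimates, 95% CI: ({lower:.2f}, {upper:.2f})")
```

The cost scales as imputations times bootstrap resamples times the time for one model fit. The reverse ordering would instead run a full round of MI inside every bootstrap resample.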