Missing data is a ubiquitous issue across many fields of science. Because statistical procedures typically require fully observed data, there has long been a need to handle missing data in a way that allows researchers to reach valid conclusions. Today, multiple imputation is accepted as the gold standard in missing data analysis, in no small part thanks to the work of Donald Rubin (Buuren 2012). In 1977, Rubin proposed creating multiple completed versions of the dataset with missing observations, applying the complete-data procedure of interest to each, and pooling the estimates to draw valid inferences (Buuren 2012). The main advantage of multiple imputation over single imputation, which researchers had used since the early 1970s, is its ability to properly estimate the variance caused by missing observations (Efron 1994). Rubin's emphasis on variance and uncertainty was a departure from the status quo of the time, which was to fill in each missing observation with its most probable value and proceed with the usual complete-data analysis as if the observation had never been missing (Buuren 2012). This approach, however, fails to incorporate the loss of information caused by missing observations into the estimation of parameters, resulting in the underestimation of variance (Rubin 1978).
Like many revolutionary ideas, multiple imputation received harsh criticism following its conception. Perhaps the most notable objection came from Fay in 1992, who demonstrated through counterexamples that multiple imputation can produce biased covariance estimates (Bartlett and Hughes 2020). Fay added that the need for agreement between the imputation and analysis models made multiple imputation a poor general-purpose tool, particularly in instances where the imputer and analyst are different individuals (Fay 1992; Buuren 2012). Fay's arguments led to the conceptualization of congeniality[^1] between the imputation and analysis models, which was later accepted as a requirement for obtaining valid inferences from multiple imputation using Rubin's pooling rules (Buuren 2012; Meng 1994). Although Fay's work initially criticized biases introduced into the covariance matrix following multiple imputation, similarly biased estimates of the variance were later observed under uncongeniality (Fay 1992; Meng 1994; Xie and Meng 2016).
Some of the earliest work demonstrating that Rubin's variance estimator is biased under uncongeniality came from Wang and Robins in 1998, who also proposed an alternative variance estimator in the same paper (Buuren 2012). The estimator proposed by Wang and Robins requires the calculation of several quantities that are not easily accessible to the average statistical practitioner, and this challenging construction has resulted in it receiving little to no attention in applied settings (Bartlett and Hughes 2020). In an attempt to create a more user-friendly variance estimator for instances of suspected uncongeniality, researchers have proposed combining resampling methods with multiple imputation. Of the two main resampling methods, the bootstrap has received more attention from multiple imputation researchers than jackknife resampling, which has mostly been investigated under single hot-deck imputation. Although particular combinations of the bootstrap and multiple imputation have been shown to produce asymptotically unbiased estimates of variance, the associated computational cost keeps this an active area of research (Bartlett and Hughes 2020). Most recently, von Hippel has proposed a bootstrap variance estimator that addresses the issue of computational cost; however, it has been shown to produce confidence intervals that are slightly wider than those of traditional bootstrap and multiple imputation combinations (Bartlett and Hughes 2020). Given the lower computational cost associated with jackknife resampling, as well as the desirable properties it has demonstrated under single imputation, such as being unbiased in certain scenarios, the jackknife is an attractive alternative that should be considered as a variance estimator for multiply imputed data under uncongeniality (Chen and Shao 2001; Rao and Shao 1992).
1. What are the properties of the jackknife variance estimator in multiply imputed datasets, namely, datasets containing missing outcome variables imputed through predictive mean matching?
2. In what manner should the jackknife estimator be combined with multiple imputation to obtain an estimator with desirable characteristics?
3. How do the characteristics of the response variable, such as the proportion of missingness, and the number of observations influence the performance of the proposed jackknife variance estimator?
4. How does the proposed jackknife variance estimator compare to Rubin's rules and bootstrap resampling methods in estimating the variance of multiply imputed variables under uncongeniality?
5. If any bias is noted in the variance estimates obtained through the jackknife estimator, is it negligible given the decreased computational cost?
For the proposed Monte Carlo simulation, \(N = 999\) datasets will be generated, each with the following characteristics:

- a response variable, \(Y\), with a proportion \(p_{miss} \in \{0.10, 0.30, 0.50\}\) of missing observations, where the mechanism of missingness is missing at random (MAR); and
- an \(n \times q\) matrix of fully observed covariates, where \(n \in \{500, 1000, 10000\}\) is the sample size and \(q = 3\) is the number of covariates.

Moreover, two of the covariates will be simulated such that there is an interaction between them. During the imputation phase of the study, the uncongenial imputation model will ignore this interaction, whereas the congenial imputation model will accommodate it.
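A minimal sketch of this data-generating step is given below; the specific covariate distributions, regression coefficients, and logistic MAR mechanism are illustrative assumptions rather than the final simulation specification.

```r
## Illustrative data-generating function: the coefficients and the logistic
## MAR mechanism below are assumptions made only for this sketch.
set.seed(2022)

generate_dataset <- function(n, p_miss) {
  # Three fully observed covariates; X1 and X2 interact in the outcome model
  X1 <- rnorm(n)
  X2 <- rnorm(n)
  X3 <- rnorm(n)
  Y  <- 1 + 0.5 * X1 + 0.5 * X2 + 0.3 * X3 + 0.7 * X1 * X2 + rnorm(n)

  # MAR: missingness in Y depends only on the fully observed covariates;
  # the intercept is calibrated so that roughly p_miss of Y is missing
  lin_pred <- 0.5 * X1 + 0.5 * X3
  alpha <- uniroot(function(a) mean(plogis(a + lin_pred)) - p_miss,
                   interval = c(-10, 10))$root
  Y[rbinom(n, 1, plogis(alpha + lin_pred)) == 1] <- NA

  data.frame(Y, X1, X2, X3)
}

# One of the N = 999 replicates, e.g. n = 500 with 30% missingness in Y
dat <- generate_dataset(n = 500, p_miss = 0.30)
```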
Each of the \(N\) datasets generated with the foregoing characteristics will then be multiply imputed. In order to determine \(m\), the number of completed datasets to be generated per dataset, von Hippel's two-stage \(m\) calculator will be utilized (Hippel 2020). The imputation process will be visually monitored for convergence using a pilot study with a small \(N\), which will help determine the maximum number of iterations each combination of \(N\) and \(m\) is allowed to go through. Finally, the imputation method for the outcome variable will be predictive mean matching.
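The proposal does not name the imputation software, so the sketch below assumes the mice package for predictive mean matching and implements von Hippel's (2020) two-stage rule directly; the pilot size, the analysis model used to estimate the fraction of missing information, and the 5% coefficient-of-variation target are assumptions made for illustration.

```r
## Sketch of the imputation step, assuming the mice package (an assumption,
## not a choice fixed by the proposal).
library(mice)

# Stage 1: a small pilot imputation to estimate the fraction of missing
# information (FMI) for the analysis model of interest
pilot     <- mice(dat, m = 5, method = "pmm", maxit = 20, printFlag = FALSE)
pilot_fit <- with(pilot, lm(Y ~ X1 * X2 + X3))
fmi       <- max(pool(pilot_fit)$pooled$fmi)

# Stage 2: von Hippel's (2020) rule, choosing m so that the standard-error
# estimates have a coefficient of variation of at most cv
cv <- 0.05
m_required <- ceiling(1 + 0.5 * (fmi / cv)^2)

# Final run with the chosen m; convergence is monitored visually via trace
# plots of the chain means and variances across iterations
imp <- mice(dat, m = m_required, method = "pmm", maxit = 20, printFlag = FALSE)
plot(imp)
```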
Initially, the imputation model will be uncongenial to the analysis model, per the definition of congeniality set forth in Appendix B. Thereafter, the analysis model of interest for estimating \(\theta\) will be applied to each of the \(m\) completed datasets, and \(\text{var}(\hat{\theta})\) will be estimated using the jackknife estimator. The foregoing process will be repeated under a congenial imputation and analysis model pairing, which will serve as the control. All analyses will be conducted using R 4.2.0 (Vigorous Calisthenics), alongside R packages (versions 3.14.0, 1.0.0, 1.2.0, and 1.3.1).
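Continuing the sketches above, the delete-one jackknife below re-imputes and re-fits the analysis model after removing each observation in turn. How the jackknife should be combined with multiple imputation is itself the second research question, so this construction is only one hypothetical candidate, not the final estimator.

```r
## One candidate jackknife variance estimator for a multiply imputed analysis:
## delete one observation, re-impute, re-estimate, and apply the usual
## jackknife formula. Shown for the pooled coefficient of X1; the default pmm
## imputation model omits the X1:X2 interaction, so the pairing is uncongenial.
library(mice)

estimate_theta <- function(data, m) {
  imp <- mice(data, m = m, method = "pmm", printFlag = FALSE)
  fit <- with(imp, lm(Y ~ X1 * X2 + X3))   # analysis model, with interaction
  pool(fit)$pooled$estimate[2]             # pooled estimate of the X1 coefficient
}

jackknife_var <- function(data, m) {
  n <- nrow(data)
  theta_loo <- vapply(seq_len(n),
                      function(i) estimate_theta(data[-i, ], m),
                      numeric(1))
  (n - 1) / n * sum((theta_loo - mean(theta_loo))^2)
}

# Jackknife variance estimate, reusing dat and m_required from the sketches above
v_jack <- jackknife_var(dat, m = m_required)
```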
In short, uncongeniality arises when the imputer and the analyst make different assumptions regarding the data. The following two-part formal definition of congeniality was proposed by Meng in 1994 and will be utilized in our research. Meeting the conditions set forth in the following two definitions qualifies the imputation model as congenial to the analysis model, and vice versa.
Let \(E_f\) and \(V_f\) denote the posterior mean and variance with respect to \(f\), respectively. A Bayesian model \(f\) is said to be congenial to the analysis procedure \(\mathscr{P} \equiv \{\mathscr{P}_{obs}; \mathscr{P}_{com}\}\) for given \(Z_o\) if the following hold:
The posterior mean and variance of \(\theta\) under \(f\) given the incomplete data are asymptotically the same as the estimate and variance from the analyst’s incomplete-data procedure \(\mathscr{P}_{obs}\), that is, \[\begin{equation} [\hat{\theta}(Z_o), U(Z_o)] \simeq [E_f[\theta | Z_o], V_f[\theta | Z_o]] \end{equation}\]
The posterior mean and variance of \(\theta\) under \(f\) given the complete data are asymptotically the same as the estimate and variance from the analyst’s complete-data procedure \(\mathscr{P}_{com}\), that is, \[\begin{equation} [\hat{\theta}(Z_c), U(Z_c)] \simeq [E_f[\theta | Z_c], V_f[\theta | Z_c]] \end{equation}\]
for any possible \(Y_{inc} = (Y_{obs}, Y_{miss})\) with \(Y_{obs}\) conditioned upon.
If the foregoing conditions are met, \(f\) is said to be second-moment congenial to \(\mathscr{P}\).
The analysis procedure \(\mathscr{P}\) is said to be congenial to the imputation model \(g(Y_{miss}|Z_o, A)\), where \(A\) represents possible additional data the imputer has access to, if one can find an \(f\) such that (i) \(f\) is congenial to \(\mathscr{P}\), and (ii) the posterior predictive density for \(Y_{miss}\) derived under \(f\) is identical to the imputation model, that is, \(f(Y_{miss}|Z_o) = g(Y_{miss}|Z_o, A) \ \forall \ Y_{miss}\).
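To connect this definition to the simulation design above (the linear forms shown are assumed only for exposition), the uncongenial condition pairs an imputation model that is linear in the three covariates, \[\begin{equation} E[Y | X_1, X_2, X_3] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3, \end{equation}\] with an analysis model that additionally includes the interaction, \[\begin{equation} E[Y | X_1, X_2, X_3] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_1 X_2. \end{equation}\] In general, no single Bayesian model \(f\) can satisfy both (i) and (ii) for such a pairing, so it is uncongenial; the congenial control condition in the simulation includes the interaction term in the imputation model as well.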
[^1]: Please see Appendix B for a formal definition of congeniality.