Asymptotically Exact and Fast Gaussian Copula Models for Imputation of Mixed Data Types

Motivation & Method

\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]

Previous Work

Missing values in data sets with mixed data types are a common problem.

Multinomial, ordinal, binary, and continuous. Examples: surveys and medical records.

Copula models can be used for imputation and have flexible marginal distributions.

The assumptions are mostly about the dependence between variables.

Gaussian copula models have been suggested for imputation

with an approximate expectation maximization (AEM) algorithm (Zhao and Udell 2020b, 2020a).

Fast but may be biased and inefficient.

Example of a Gaussian Copula

The latent variables \(Z_1,Z_2\) are jointly normally distributed. We observe \(X_2 = f(Z_2)\) and \(X_1 \in\{A,B,C\}\). The illustration shows contours of possible conditional densities of the continuous variable on the latent and on the observed scale.
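
A minimal simulation sketch of such a copula may help make the setup concrete. It is not the mdgc implementation. The latent correlation `rho`, the equal-probability cut points that map \(Z_1\) to \(\{A,B,C\}\), and the exponential marginal chosen for \(X_2\) are all assumptions made for illustration. Cutting a single latent variable into intervals treats \(X_1\) like an ordinal variable.

```python
import math
import random
from statistics import NormalDist


def simulate_copula(n, rho, seed=0):
    """Draw n pairs (x1, x2) from a toy two-variable Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Assumed cut points: each category of X1 has probability 1/3.
    cuts = (std_normal.inv_cdf(1 / 3), std_normal.inv_cdf(2 / 3))
    rows = []
    for _ in range(n):
        # Latent bivariate normal with unit variances and correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        # X1 is the interval that Z1 falls into.
        x1 = "A" if z1 <= cuts[0] else ("B" if z1 <= cuts[1] else "C")
        # X2 = f(Z2), with f the exponential(1) quantile function composed
        # with the standard normal CDF: f(z) = -log(1 - Phi(z)) and
        # 1 - Phi(z) = Phi(-z).
        x2 = -math.log(std_normal.cdf(-z2))
        rows.append((x1, x2))
    return rows


if __name__ == "__main__":
    for x1, x2 in simulate_copula(5, rho=0.8, seed=1):
        print(x1, round(x2, 3))
```

With a positive `rho`, observations in category C tend to have larger values of \(X_2\) than observations in category A, even though \(X_2\) is not normally distributed.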

Contributions

An efficient importance sampler to estimate the model and impute the missing variables.

An extension of Genz and Bretz (2002).

Added support for multinomial variables

in addition to the existing binary, ordinal, and continuous variables.
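
The kind of sampler that Genz and Bretz (2002) study can be sketched with a sequential importance sampler for a multivariate normal rectangle probability \(\Prob(a < Z < b)\). This is only an illustration of the general idea under simplifying assumptions. It uses plain pseudo-random numbers and no variable reordering, and it is neither the authors' extension nor the mdgc code. The function names `cholesky` and `mvn_rect_prob` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cholesky(sigma):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(sigma)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                chol[i][j] = (sigma[i][j] - s) / chol[j][j]
    return chol


def mvn_rect_prob(lower, upper, sigma, n_samples=10000, seed=0):
    """Estimate P(lower < Z < upper) for Z ~ N(0, sigma)."""
    rng = random.Random(seed)
    phi = NormalDist()
    chol = cholesky(sigma)
    d = len(lower)
    total = 0.0
    for _ in range(n_samples):
        weight = 1.0
        y = []  # standard normal draws, each truncated given earlier ones
        for i in range(d):
            shift = sum(chol[i][k] * y[k] for k in range(i))
            lo = phi.cdf((lower[i] - shift) / chol[i][i])
            hi = phi.cdf((upper[i] - shift) / chol[i][i])
            # The importance weight is the product of the conditional
            # interval probabilities.
            weight *= hi - lo
            if weight == 0.0:
                break
            # Inverse-CDF draw from the truncated normal; the clamp keeps
            # inv_cdf away from its boundary values 0 and 1.
            u = lo + (hi - lo) * rng.random()
            u = min(max(u, 1e-12), 1.0 - 1e-12)
            y.append(phi.inv_cdf(u))
        total += weight
    return total / n_samples


if __name__ == "__main__":
    inf = float("inf")
    sigma = [[1.0, 0.5], [0.5, 1.0]]
    # The exact value is 1/4 + asin(0.5) / (2 * pi) = 1/3.
    print(mvn_rect_prob([-inf, -inf], [0.0, 0.0], sigma, 20000, seed=1))
```

Each sample contributes a weight between zero and one rather than a zero-or-one indicator, which is why this type of estimator usually has a much lower variance than naive rejection sampling.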

Experimental Results

Comparison with the AEM Algorithm

Simulation study comparing our method with the AEM algorithm. Left: relative error of the estimated correlation matrix versus the sample size (gray: our method; white: the AEM algorithm). Right: differences in imputation error between our method and the AEM algorithm for each data type versus the sample size.

With Multinomial Variables

Average imputation error and computation time with simulated and observational data sets. We compare with the random forest based missForest (Stekhoven and Buehlmann 2012) and the PCA-like imputeFAMD (Audigier, Husson, and Josse 2014; Josse and Husson 2016).

Thank You!

The mdgc package is on CRAN and at github.com/boennecd/mdgc.

The presentation is at rpubs.com/boennecd/mdgc-ACML.

The markdown is at github.com/boennecd/Talks.

References are on the next slide.

References

Audigier, Vincent, François Husson, and Julie Josse. 2014. “A Principal Component Method to Impute Missing Values for Mixed Data.” Advances in Data Analysis and Classification 10 (1): 5–26.
Genz, Alan, and Frank Bretz. 2002. “Comparison of Methods for the Computation of Multivariate t Probabilities.” Journal of Computational and Graphical Statistics 11 (4): 950–71.
Josse, Julie, and François Husson. 2016. “missMDA: A Package for Handling Missing Values in Multivariate Data Analysis.” Journal of Statistical Software 70 (1): 1–31.
Stekhoven, Daniel J., and Peter Buehlmann. 2012. “MissForest - Non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28 (1): 112–18.
Zhao, Yuxuan, and Madeleine Udell. 2020a. “Matrix Completion with Quantified Uncertainty Through Low Rank Gaussian Copula.” In Advances in Neural Information Processing Systems (NeurIPS). http://arxiv.org/abs/2006.10829.
———. 2020b. “Missing Value Imputation for Mixed Data via Gaussian Copula.” In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 636–46. KDD ’20. New York, NY, USA: Association for Computing Machinery.