Imputation Using Gaussian Copulas


Introduction

\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\diag{\text{diag}} \]

Introduction Example

##       X1      X2      X3    X4    X5    X6   X7   X8   X9
## 1     NA      NA -1.0154 FALSE    NA FALSE    E    E    C
## 2  0.990      NA  1.0106  TRUE  TRUE FALSE <NA>    D    B
## 3  1.213 0.71453 -1.1448    NA FALSE  TRUE <NA>    A    E
## 4  1.080 0.87564  0.0942 FALSE    NA    NA    B <NA> <NA>
## 5  0.712      NA      NA  TRUE    NA    NA    A <NA>    C
## 6  1.072 0.56038  0.3495  TRUE  TRUE  TRUE <NA> <NA> <NA>
## 7     NA 0.48548  0.4229 FALSE    NA FALSE    C    C <NA>
## 8  1.725      NA      NA    NA  TRUE  TRUE    A <NA> <NA>
## 9     NA      NA      NA FALSE FALSE    NA    A    D    D
## 10    NA 0.00989  0.8905  TRUE  TRUE    NA    A    A <NA>
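
Data like the above can be generated by drawing latent Gaussian vectors, transforming the margins, and masking entries at random; a minimal sketch with arbitrary margins and correlation matrix (not the talk's exact setup):

```r
set.seed(1)
n <- 10L
C <- cov2cor(crossprod(matrix(rnorm(25), 5)))  # a random 5 x 5 correlation matrix
Z <- matrix(rnorm(n * 5), n) %*% chol(C)       # latent draws with rows ~ N(0, C)
X <- data.frame(
  X1 = qlnorm(pnorm(Z[, 1])),                  # continuous margin F_1 = log-normal
  X2 = qexp(pnorm(Z[, 2])),                    # continuous margin F_2 = exponential
  X3 = Z[, 3] > 0,                             # binary: a single border at 1/2
  X4 = cut(pnorm(Z[, 4]), c(0, .3, .7, 1), labels = c("A", "B", "C")),
  X5 = cut(pnorm(Z[, 5]), c(0, .5, .8, 1), labels = c("A", "B", "C")))
X[matrix(runif(n * 5) < .3, n)] <- NA          # mask ~30% completely at random
X
```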

Copula Methods

  1. Estimate the \(p\) marginal cumulative distribution functions (CDFs) \(\widehat F_1, \dots, \widehat F_p\).
  2. Use the CDFs \(\widehat F_j\) to map the data to \((0, 1)^p\).
  3. Make assumptions for the joint distribution on the \((0, 1)^p\) scale.

Marginal CDFs

Marginal CDFs

Continuous variables map to a point in \((0, 1)\).

Binary/ordinal variables map to intervals in \((0, 1)\).

The latter are the result of discretizing a hidden variable.

We make assumptions on the joint distribution of the hidden variables.

Presentation Outline

Motivation.

Show the marginal likelihood.

Show CDF approximations, gradient approximation, and imputation method.

Simulation example and simulation study.

Conclusions.

Motivation

Motivation

Zhao and Udell (2019) suggest a method to estimate the model.

They use an approximate E-step in an EM algorithm and an approximate imputation method.

They show a six-fold reduction in computation time compared with the MCMC method suggested by Hoff (2007).

They show good performance compared with alternative methods.

Critique

The approximations of Zhao and Udell (2019) may fail in some cases.

Direct approximation of the log marginal likelihood is possible for a moderate number of variables (\(p < 100\)).

Marginal Likelihood

Setup

We have outcomes \(\vec x_1, \dots, \vec x_n\).

Each comes from one of the hidden vectors \(\vec z_1, \dots, \vec z_n\in\mathbb{R}^p\) with

\[\vec z_i \sim N^{(p)}(\vec 0, \mat C)\]

where \(\mat C\) is an unknown correlation matrix.

Continuous Variables

Continuous variable \(x_{ij}\) is the result of \(F_j^{-1}\circ \Phi\) applied to \(z_{ij}\).

\(\Phi\) is the standard normal distribution’s CDF and \(F_j\) is an unknown CDF.

We use the estimated inverse map \(\Phi^{-1} \circ \widehat F_j\).
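
A minimal sketch of this mapping with the empirical CDF; the \(n / (n + 1)\) rescaling, which avoids \(\Phi^{-1}(1) = \infty\), is one common choice and an assumption here:

```r
# map a continuous variable to the latent scale with the empirical CDF
x <- rlnorm(100)
F_hat <- ecdf(x)
z_hat <- qnorm(F_hat(x) * length(x) / (length(x) + 1))
```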

Ordinal/Binary Variables

Ordinal variable \(j\) with \(k_j\) levels is a discretization of \(z_{ij}\).

It has unknown borders given by \(b_{j0} = 0 < b_{j1} < b_{j2} < \cdots < b_{jk_j} = 1\).

\(x_{ij}\) is the result of \(F_j^{-1}\circ \Phi\) applied to \(z_{ij}\) where \(F_j^{-1}:\, (0, 1)\rightarrow \{1, \dots, k_j \}\) with

\[ F_j^{-1}(u) = l \Leftrightarrow b_{j,l-1} < u \leq b_{jl}. \]

Ordinal/Binary Variables

We estimate \(\widehat b_{j1}, \dots, \widehat b_{j,k_j - 1}\) and define \(\widehat F_j:\, \{1, \dots, k_j\} \rightarrow (0, 1)^2\) such that

\[\widehat F_j(x) = (\widehat b_{j,l - 1}, \widehat b_{jl}] \Leftrightarrow x = l.\]

Map from the level \(x_{ij}\) to an interval in \(\mathbb R\) using \(\Phi^{-1} \circ \widehat F_j\)

where \(\Phi^{-1}\) is applied elementwise to the interval's endpoints.
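
A sketch of estimating the borders from the marginal frequencies and mapping an observed level to its latent interval; the estimator choice is an assumption:

```r
# estimate the borders b_j and map a level to an interval on the latent scale
x <- factor(sample(c("A", "B", "C"), 100L, TRUE, prob = c(.2, .5, .3)))
b_hat <- c(0, cumsum(prop.table(table(x))))  # estimated borders in [0, 1]
lvl <- as.integer(x[1])                      # observed level of the first case
qnorm(c(b_hat[lvl], b_hat[lvl + 1]))         # latent interval; +/-Inf endpoints are fine
```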

Notation

Let

\[\vec x_{i\mathcal I} = (x_{il_1}, \dots, x_{il_m})^\top, \qquad \mathcal I = \{l_1, \dots, l_m\}.\]

For each observation \(i\):

  1. we observe the continuous variables in \(\mathcal C_i \subseteq \{1,\dots, p\}\),
  2. we observe the ordinal variables in \(\mathcal O_i \subseteq \{1,\dots, p\}\), and
  3. we miss the variables in \(\mathcal M_i \subseteq \{1,\dots, p\}\),

where \(\mathcal C_i \cap \mathcal O_i = \mathcal C_i \cap \mathcal M_i = \mathcal O_i \cap \mathcal M_i = \emptyset\) and \(\mathcal C_i \cup \mathcal O_i \cup \mathcal M_i = \{1, \dots, p\}\).

Notation

Let

\[ \begin{align*}\mat C_{\mathcal I\mathcal J} &= \begin{pmatrix} c_{l_1h_1} & \cdots & c_{l_1h_s} \\ \vdots & \ddots & \vdots \\ c_{l_mh_1} & \cdots & c_{l_mh_s}\end{pmatrix}, \qquad \mathcal J = \{h_1, \dots, h_s\} \\ \widehat{\vec z}_{\mathcal C_i} &= (\Phi^{-1}(\widehat F_j(x_{ij})))_{ j \in \mathcal C_i} \\ \mathcal D_i &= \prod_{j \in \mathcal O_i} \left(\Phi^{-1}(\widehat b_{j,x_{ij} - 1}), \Phi^{-1}(\widehat b_{jx_{ij}})\right] \end{align*} \]

where the product in the definition of \(\mathcal D_i\) is a Cartesian product of intervals.

Log Marginal Likelihood

\[ \begin{align*} L_i(\mat V) &= \log \phi^{(\lvert\mathcal C_i\rvert)} (\widehat{\vec z}_{\mathcal C_i}; \vec 0, \mat V_{\mathcal C_i\mathcal C_i}) + \log\Phi^{(\lvert\mathcal O_i\rvert)}(\\ &\hspace{20pt} \mathcal D_i; \mat V_{\mathcal O_i\mathcal C_i} \mat V_{\mathcal C_i\mathcal C_i}^{-1}\widehat{\vec z}_{\mathcal C_i}, \mat V_{\mathcal O_i\mathcal O_i} - \mat V_{\mathcal O_i\mathcal C_i}\mat V_{\mathcal C_i\mathcal C_i}^{-1} \mat V_{\mathcal C_i\mathcal O_i}) \\ L(\mat V) &= \sum_{i = 1}^n L_i(\mat V) \end{align*} \]

where \(\phi^{(s)}\) is an \(s\)-dimensional multivariate normal density and \(\Phi^{(s)}\) is an \(s\)-dimensional multivariate normal CDF over a hypercube.

Need an approximation for the CDF.
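
For concreteness, a single term \(L_i(\mat V)\) can be evaluated with the mvtnorm package, which provides such an approximation; the correlation matrix, index sets, and interval limits below are made up for illustration:

```r
library(mvtnorm)
p <- 4L; V <- diag(p); V[V == 0] <- .4     # example correlation matrix
Ci <- 1:2; Oi <- 3:4                       # observed continuous/ordinal indices
z_C <- c(.3, -1)                           # hat z_{C_i}
lw <- c(-Inf, 0); up <- c(.5, Inf)         # D_i: intervals on the latent scale
mu_O  <- drop(V[Oi, Ci] %*% solve(V[Ci, Ci], z_C))
Sig_O <- V[Oi, Oi] - V[Oi, Ci] %*% solve(V[Ci, Ci], V[Ci, Oi])
L_i <- dmvnorm(z_C, sigma = V[Ci, Ci], log = TRUE) +
  log(pmvnorm(lower = lw, upper = up, mean = mu_O, sigma = Sig_O))
```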

Approximations

CDF Approximation

Want an approximation of

\[ \Phi^{(h)}(\mathcal P; \vec\mu, \mat\Sigma) = \int_{a_1}^{b_1}\cdots\int_{a_h}^{b_h} \phi^{(h)}(\vec z; \vec\mu,\mat\Sigma)\der\vec z \]

for some \(\mathcal P = (a_1, b_1]\times \cdots \times (a_h, b_h]\).

Use method from Genz and Bretz (2002).

Closely follow Genz and Bretz (2009).
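
This is the method behind `mvtnorm::pmvnorm`'s default algorithm; a small example with an equicorrelated matrix (the tolerances are arbitrary):

```r
library(mvtnorm)
# four-dimensional example; GenzBretz() exposes the absolute error tolerance
# and the maximum number of integrand evaluations
pmvnorm(lower = rep(-1, 4), upper = rep(2, 4), sigma = diag(4) * .5 + .5,
        algorithm = GenzBretz(abseps = 1e-5, maxpts = 100000L))
```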

Separation-of-Variables

Let \(\mat S\) be the Cholesky factor of \(\mat\Sigma = \mat S\mat S^\top\). Then substitute \(\mat S\vec y + \vec \mu = \vec z\):

\[ \begin{align*} \Phi^{(h)}(\mathcal P; \vec\mu, \mat\Sigma) = \int_{\tilde a_1}^{\tilde b_1}\phi(y_1) \int_{\tilde a_2(y_1)}^{\tilde b_2(y_1)}\phi(y_2) \cdots\int_{\tilde a_h(y_1, \dots, y_{h-1})}^{ \tilde b_h(y_1, \dots, y_{h-1})} \phi(y_h)\der\vec y \end{align*} \]

\[ \tilde a_j(y_1,\dots, y_{j - 1}) = \frac{a_j - \sum_{l = 1}^{j - 1}s_{jl}y_l}{s_{jj}} \]

and \(\tilde b_j\) similarly.

Separation-of-Variables

Use \(\Phi^{-1}(u_j) = y_j\):

\[ \begin{align*} \Phi^{(h)}(\mathcal P; \vec\mu, \mat\Sigma) = \int_{\Phi(\tilde a_1)}^{\Phi(\tilde b_1)} \int_{\Phi(\tilde a_2(\Phi^{-1}(u_1)))}^{\Phi(\tilde b_2(\Phi^{-1}(u_1)))} \cdots\int_{\Phi(\tilde a_h(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_{h-1})))}^{ \Phi(\tilde b_h(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_{h-1})))} \der\vec u \end{align*} \]

Separation-of-Variables

Re-scale and relocate with:

\[ \begin{align*} \bar a_j + (\bar b_j - \bar a_j)w_j &= u_j \\ \bar a_j(w_1, \dots, w_{j -1}) &= \Phi(\tilde a_j(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_{j-1}))) \\ \bar b_j(w_1, \dots, w_{j -1}) &= \Phi(\tilde b_j(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_{j-1}))) \\ \Phi^{(h)}(\mathcal P; \vec\mu, \mat\Sigma) &= (\bar b_1 - \bar a_1) \int_0^1(\bar b_2(w_1) - \bar a_2(w_1))\cdots \\ &\hspace{20pt} \cdot\int_0^1(\bar b_h(w_1, \dots, w_{h - 1}) - \bar a_h(w_1, \dots, w_{h - 1})) \int_0^1\der\vec w \end{align*} \]
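
To make the estimator concrete, here is a minimal sketch for a bivariate zero-mean rectangle probability, using plain uniform sampling rather than the quasi-random rules discussed below; `mvtnorm::pmvnorm` serves as a reference value:

```r
# separation-of-variables estimator for a bivariate zero-mean probability
sov_prob <- function(lower, upper, Sigma, n_samples = 10000L) {
  S <- t(chol(Sigma))                 # lower triangular Cholesky factor
  # first variable: the limits are fixed
  a1 <- pnorm(lower[1] / S[1, 1])
  b1 <- pnorm(upper[1] / S[1, 1])
  y1 <- qnorm(a1 + (b1 - a1) * runif(n_samples))
  # second variable: the limits depend on y1
  a2 <- pnorm((lower[2] - S[2, 1] * y1) / S[2, 2])
  b2 <- pnorm((upper[2] - S[2, 1] * y1) / S[2, 2])
  mean((b1 - a1) * (b2 - a2))         # integrand: product of interval lengths
}

Sigma <- matrix(c(1, .5, .5, 1), 2)
sov_prob(c(-1, -1), c(1, 2), Sigma)
mvtnorm::pmvnorm(lower = c(-1, -1), upper = c(1, 2), sigma = Sigma)  # reference
```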

Example with Original Integrand

Example with Transformed Integrand

Example with Permuted Integrand


Variable Re-ordering

A heuristic variable re-ordering can reduce the variance of the integrand.

Quasi-random Numbers

Genz and Bretz (2002) use randomized Korobov rules.

This yields an error which is \(\mathcal O (s^{-1}(\log s)^h)\) instead of \(\mathcal O(s^{-1/2})\) where \(s\) is the number of samples.
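
A randomly shifted rank-1 lattice rule, a Korobov-type rule, is easy to sketch; the generating vector `z` below is a hypothetical choice, not an optimized one:

```r
# a randomly shifted rank-1 lattice rule
lattice_points <- function(n, z) {
  shift <- runif(length(z))                  # random shift for the randomization
  (outer(seq_len(n) / n, z) + rep(shift, each = n)) %% 1
}
pts <- lattice_points(1000L, z = c(1, 433))  # 1000 points in (0, 1)^2
```

These points can replace the i.i.d. uniforms in the earlier sketch; averaging over a few independent random shifts gives an error estimate.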

Derivatives and Imputation

Derivatives w.r.t. \(\vec \mu\) and \(\mat\Sigma\) and mean imputation require:

\[ \int_{a_1}^{b_1}\cdots\int_{a_h}^{b_h} \vec h(\vec z, \vec \mu, \mat\Sigma)\phi^{(h)}(\vec z; \vec\mu,\mat\Sigma)\der\vec z \]

for some function \(\vec h\).

If \(\vec g:\, (0,1)^h\rightarrow \mathbb R^h\) is the map from \(\vec w\) to \(\vec z\) then using a transformation and permutation like before is beneficial if \(\vec h(\vec g(\vec w), \vec\mu, \mat\Sigma)\) is not too variable.

The Implementation

Extends the Fortran code by Genz and Bretz (2002).

Allows for computation in parallel.

The computation is compute-bound, so this is very beneficial.

Examples of Integral Approximations

## [1] 2000
## [1] -14119

Examples of Integral Approximations

Median running times:

| Threads | Median time |
|--------:|------------:|
|       1 |      957 ms |
|       2 |      491 ms |
|       3 |      353 ms |
|       4 |      271 ms |

Simulations

Performing Estimation and Imputation

Estimation using stochastic gradient methods.

ADAM and SVRG are implemented.
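
For reference, a generic ADAM ascent update looks as follows; this is a sketch, not the package's internals, and `grad` is a hypothetical function returning a stochastic gradient of the log marginal likelihood:

```r
# generic ADAM ascent (we maximize, hence the "+" update)
adam <- function(theta, grad, n_iter = 100L, eta = .01, b1 = .9, b2 = .999,
                 eps = 1e-8) {
  m <- v <- numeric(length(theta))
  for (t in seq_len(n_iter)) {
    g <- grad(theta)                   # stochastic gradient at theta
    m <- b1 * m + (1 - b1) * g         # first moment estimate
    v <- b2 * v + (1 - b2) * g^2       # second moment estimate
    m_hat <- m / (1 - b1^t)            # bias corrections
    v_hat <- v / (1 - b2^t)
    theta <- theta + eta * m_hat / (sqrt(v_hat) + eps)
  }
  theta
}
```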

Performing Estimation and Imputation

##    user  system elapsed 
##  24.250   0.008   6.347

Imputations

##     X1    X2    X3    X4    X5    X6 X7 X8 X9
## 1 1.20 0.825 -1.02 FALSE FALSE FALSE  E  E  C
## 2 0.99 0.604  1.01  TRUE  TRUE FALSE  E  D  B
## 3 1.21 0.715 -1.14  TRUE FALSE  TRUE  A  A  E
##     X1    X2    X3    X4    X5    X6 X7 X8 X9
## 1 1.17 0.927 -1.02 FALSE FALSE FALSE  E  E  C
## 2 0.99 0.690  1.01  TRUE  TRUE FALSE  E  D  B
## 3 1.21 0.715 -1.14  TRUE FALSE  TRUE  C  A  E

Est. Correlation Matrix

Est. Correlation Matrix Zhao et al.

Simulation Study

| Method     | Time |  SE | Time (Δ) |  SE | Est. error |    SE | Est. error (Δ) |    SE |
|:-----------|-----:|----:|---------:|----:|-----------:|------:|---------------:|------:|
| mdgc       | 11.1 | 0.3 |          |     |      0.062 | 0.001 |                |       |
| mixedgc    | 15.0 | 0.2 |     -3.9 | 0.3 |      0.100 | 0.002 |         -0.037 | 0.002 |
| missForest | 22.7 | 0.4 |    -11.6 | 0.5 |            |       |                |       |

mdgc is the presented method, mixedgc is the method suggested by Zhao and Udell (2019), and missForest is the method suggested by Stekhoven and Buehlmann (2012; Stekhoven 2013). 100 samples are used in the study.

The “Est. error” is \(\lVert \widehat{\mat C} - \mat C\rVert \big/\lVert\mat C\rVert\).
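
For concreteness, the criterion as an R function, assuming \(\lVert\cdot\rVert\) is the Frobenius norm:

```r
# the "Est. error" criterion; C_hat is the estimate and C the true matrix
rel_err <- function(C_hat, C) norm(C_hat - C, type = "F") / norm(C, type = "F")
```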

Simulation Study

| Method     |  Bin. |    SE | Bin. (Δ) |    SE |  Ord. |    SE | Ord. (Δ) |    SE |
|:-----------|------:|------:|---------:|------:|------:|------:|---------:|------:|
| mdgc       | 0.246 | 0.003 |          |       | 0.588 | 0.004 |          |       |
| mixedgc    | 0.252 | 0.003 |   -0.006 | 0.004 | 0.625 | 0.004 |   -0.037 | 0.006 |
| missForest | 0.290 | 0.003 |   -0.043 | 0.004 | 0.648 | 0.003 |   -0.060 | 0.005 |

| Method     |  Con. |    SE | Con. (Δ) |    SE |
|:-----------|------:|------:|---------:|------:|
| mdgc       | 0.465 | 0.014 |          |       |
| mixedgc    | 0.469 | 0.014 |   -0.005 | 0.020 |
| missForest | 0.522 | 0.015 |   -0.057 | 0.021 |

Bin. is the binary classification error, ord. is the ordinal classification error, and con. is the continuous variables' RMSE.

More Samples

| Method  | Time |  SE | Time (Δ) |  SE | Est. error |    SE | Est. error (Δ) |    SE |
|:--------|-----:|----:|---------:|----:|-----------:|------:|---------------:|------:|
| mdgc    |  151 | 5.1 |          |     |      0.081 | 0.001 |                |       |
| mixedgc |  415 | 3.4 |     -263 | 6.4 |      0.115 | 0.001 |         -0.034 | 0.001 |

Larger problem with \(p = 60\) and \(n = 10000\). 10 samples are used.

More Samples

| Method  |  Bin. |    SE | Bin. (Δ) |    SE |  Ord. |    SE | Ord. (Δ) |    SE |
|:--------|------:|------:|---------:|------:|------:|------:|---------:|------:|
| mdgc    | 0.242 | 0.001 |          |       | 0.581 | 0.002 |          |       |
| mixedgc | 0.249 | 0.001 |   -0.008 | 0.002 | 0.612 | 0.002 |   -0.031 | 0.003 |

| Method  |  Con. |    SE | Con. (Δ) |    SE |
|:--------|------:|------:|---------:|------:|
| mdgc    | 0.410 | 0.013 |          |       |
| mixedgc | 0.416 | 0.013 |   -0.006 | 0.019 |

Conclusion

Summary

Introduced a Gaussian copula method to impute missing values.

Covered a quasi-random number method to approximate the log marginal likelihood and to perform imputations.

Showed that the method is efficient and fast even for larger problems with \(p = 60\) variables.

Extensions

Like Zhao and Udell (2020), we can use a low rank approximation:

\[ \mat C = \mat W\mat W^\top + \sigma^2\mat I \]

where \(\mat W\in\mathbb{R}^{p\times k}\) with \(k < p\).
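
A sketch of constructing such a matrix; taking \(\sigma = 1\) and rescaling with `cov2cor` are arbitrary choices for illustration:

```r
# a low rank structure rescaled to a correlation matrix
p <- 10L; k <- 2L
W <- matrix(rnorm(p * k), p, k)
C <- cov2cor(tcrossprod(W) + diag(p))  # W W^T + sigma^2 I with sigma = 1, rescaled
```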

Easy to add prediction intervals. Probabilities of each level are already provided.

Extensions

\(x_{ij}\) is multinomial with \(k_j\) categories.

Suppose

\[ \begin{align*} \vec A_{ij} &\sim N^{(k_j)}(\vec \mu_j, 2^{-1}\mat I) \\ x_{ij} &= l\Leftrightarrow A_{ijl} > A_{ijl'} \quad \forall l'\neq l \end{align*} \]

\(\vec\mu_j\) is given by the marginal distribution.
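
A single draw from this construction, with hypothetical means:

```r
# one draw of a multinomial outcome from the latent formulation
k_j <- 4L
mu_j <- c(.5, 0, -.5, 0)                        # hypothetical means
A <- rnorm(k_j, mean = mu_j, sd = sqrt(1 / 2))  # A_ij ~ N(mu_j, I / 2)
x_ij <- which.max(A)                            # observed category
```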

Identity

Let

\[ \Phi^{(k)}(\vec a, \vec b;\vec\mu,\mat\Sigma) = \int_{a_1}^{b_1}\cdots\int_{a_k}^{b_k} \phi^{(k)}(\vec x;\vec\mu,\mat\Sigma)\der\vec x \]

Then

\[ \begin{align*} \int\phi^{(k_1)}(\vec z_1; \vec\mu_1, \mat\Sigma_1) \Phi^{(k_2)}(\vec a + \mat K\vec z_1, \vec b + \mat K\vec z_1; \vec\mu_2,\mat\Sigma_2)\der\vec z_1 & \\ &\hspace{-200pt}= \Phi^{(k_2)}(\vec a, \vec b; \vec\mu_2 - \mat K\vec\mu_1, \mat\Sigma_2 + \mat K\mat\Sigma_1\mat K^\top) \end{align*} \]
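
The identity is easy to check by Monte Carlo; a sketch in a made-up case with \(k_1 = 1\) and \(k_2 = 2\) (all parameter values are arbitrary):

```r
library(mvtnorm)
set.seed(1)
K <- c(.5, -.25); mu1 <- .3; Sig1 <- .8
mu2 <- c(0, 0); Sig2 <- diag(2)
a <- c(-1, -1); b <- c(1, 2)
z1 <- rnorm(1000, mu1, sqrt(Sig1))
lhs <- mean(vapply(z1, function(z)
  pmvnorm(lower = a + K * z, upper = b + K * z, mean = mu2, sigma = Sig2),
  numeric(1)))
rhs <- pmvnorm(lower = a, upper = b, mean = mu2 - K * mu1,
               sigma = Sig2 + Sig1 * tcrossprod(K))
c(lhs, rhs)  # should agree up to Monte Carlo error
```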

Marginal Distribution

\[ \begin{align*} \Prob(X_{ij} = l) &= \int \phi(a;\mu_{jl},2^{-1}) \Phi^{(k_j - 1)}(-\vec\infty, \vec 1a;\vec\mu_{j(-l)}, 2^{-1}\mat I)\der a \\ &= \Phi^{(k_j - 1)}(-\vec\infty, \vec 1\mu_{jl} - \vec\mu_{j(-l)}; \vec 0, 2^{-1}\mat I + 2^{-1}\vec 1\vec 1 ^\top) \end{align*} \]

where \(\vec\mu_{j(-l)} = (\mu_{j1}, \dots, \mu_{j,l - 1}, \mu_{j,l + 1}, \dots, \mu_{jk_j})\).

Marginal Distribution (2D)

\[ \begin{align*} \Prob(X_{ij} = 1) &= \Phi(\mu_{j1} - \mu_{j2}; 0, 1) = \Phi(\mu_{j1} - \mu_{j2}) \\ \Prob(X_{ij} = 2) &= \Phi(\mu_{j2} - \mu_{j1}; 0, 1) = \Phi(\mu_{j2} - \mu_{j1}) \end{align*} \]

Select \(\mu_{j2} = 0\) for identifiability:

\[ \begin{align*} \Prob(X_{ij} = 1) &= \Phi( \mu_{j1}) \\ \Prob(X_{ij} = 2) &= \Phi(-\mu_{j1}) \end{align*} \]

I.e. a binary variable with a border at \(0 < \Phi(\mu_{j1}) < 1\).

Using the Identity

Suppose \(x_{ij} = l\) and write:

\[ \vec V_i = (A_{ijl}, \vec A_{ij(-l)}^\top, \vec z_{i\mathcal O_i}^\top)^\top \]

where \(\vec z_{i\mathcal O_i}\) are the remaining variables (ordinal/binary).

Denote:

\[ \vec V_i \sim N^{(h)}\left( \begin{pmatrix} \mu_{g} \\ \vec\mu_{(-g)}\end{pmatrix}, \begin{pmatrix} \sigma_{gg} & \vec\sigma_{g(-g)} \\ \vec\sigma_{(-g)g} & \mat\Sigma_{(-g)(-g)}\end{pmatrix} \right) \]

Using the Identity

The marginal likelihood factor is:

\[ \begin{align*} \int \phi(z;\mu_g,\sigma_{gg}) \Phi^{(h - 1)}\bigg(&\vec a_{(-g)} + \vec dz - \mat K(z - \mu_g), \vec b_{(-g)} + \vec dz - \mat K(z - \mu_g); \\ & \vec\mu_{(-g)}, \mat\Sigma_{(-g)(-g)} - \frac{\vec\sigma_{(-g)g}\vec\sigma_{(-g)g}^\top}{\sigma_{gg}}\bigg)\der z \end{align*} \]

with \(\mat K = \vec\sigma_{(-g)g}/\sigma_{gg}\) and

\[\vec d = (\underbrace{\,\,\,\vec 1^\top\,\,\,}_{k_j - 1\text{ times}}, \vec 0^\top)^\top\]

Using the Identity

Thus, applying the identity with \(\vec d - \mat K\) in place of \(\mat K\), we get:

\[ \Phi^{(h - 1)}\bigg(\vec a_{(-g)}, \vec b_{(-g)}; \vec\mu_{(-g)} - \vec d\mu_g, \mat\Sigma_{(-g)(-g)} + \vec d\vec d^\top\sigma_{gg} - \vec d\vec\sigma_{(-g)g}^\top - \vec\sigma_{(-g)g}\vec d^\top\bigg) \]

Easy to extend to multiple multinomial variables.

Extending

We have \(G\) categorical variables. The latent variables for the \(l\)th categorical variable, \(l = 1,\dots, G\), are at indices \(r_l + 1, \dots, r_l + k_l\).

Suppose that we observe the category for the \(l\)th categorical variable whose latent variable is at index \(c_l\) and let \(\mathcal C = \{c_1,\dots, c_G\}\).

Extending

Define a vector \(\vec d_l = (\vec 0^\top, \underbrace{\,\,\,\vec 1^\top\,\,\,}_{k_l\text{ times}}, \vec 0^\top)^\top\) which has ones at indices \(r_l + 1, \dots, r_l + k_l\).

Let \(\mat D = (\vec d_1, \dots, \vec d_G)_{(-\mathcal C)\cdot}\), the matrix whose columns are \(\vec d_1, \dots, \vec d_G\) without the rows in \(\mathcal C\).

Extending

Then the intractable integral is:

\[ \begin{align*} \int \phi^{(G)}(\vec z;\vec \mu_{\mathcal C},\mat\Sigma_{\mathcal C\mathcal C}) \Phi^{(h - G)}\bigg(\hspace{-75pt}& \\ & \vec a_{(-\mathcal C)} + \mat D\vec z - \mat K(\vec z - \vec\mu_{\mathcal C}), \vec b_{(-\mathcal C)} + \mat D\vec z - \mat K(\vec z - \vec \mu_{\mathcal C}); \\ & \vec\mu_{(-\mathcal C)}, \mat\Sigma_{(-\mathcal C)(-\mathcal C)} - \mat\Sigma_{(-\mathcal C)\mathcal C}\mat\Sigma_{\mathcal C\mathcal C}^{-1}\mat\Sigma_{\mathcal C(-\mathcal C)}\bigg) \der \vec z \end{align*} \]

with \(\mat K = \mat\Sigma_{(-\mathcal C)\mathcal C}\mat\Sigma_{\mathcal C\mathcal C}^{-1}\).

Extending

Applying the identity with \(\mat D - \mat K\) in place of \(\mat K\) yields:

\[ \Phi^{(h - G)}\bigg(\vec a_{(-\mathcal C)}, \vec b_{(-\mathcal C)}; \vec\mu_{(-\mathcal C)} - \mat D\vec\mu_{\mathcal C}, \mat\Sigma_{(-\mathcal C)(-\mathcal C)} + \mat D\mat\Sigma_{\mathcal C\mathcal C}\mat D^\top - \mat D\mat\Sigma_{\mathcal C(-\mathcal C)} - \mat\Sigma_{(-\mathcal C)\mathcal C}\mat D^\top\bigg) \]

Thank You!

The presentation is at rpubs.com/boennecd/Gaussian-copula-KTH.

The markdown is at github.com/boennecd/Talks.

Code and more examples at github.com/boennecd/mdgc.

References are on the next slide.

References

Genz, Alan, and Frank Bretz. 2002. “Comparison of Methods for the Computation of Multivariate T Probabilities.” Journal of Computational and Graphical Statistics 11 (4). Taylor & Francis: 950–71. https://doi.org/10.1198/106186002394.

———. 2009. Computation of Multivariate Normal and T Probabilities. Vol. 195. Springer Science & Business Media. https://doi.org/10.1007/978-3-642-01689-9.

Hoff, P. D. 2007. “Extending the Rank Likelihood for Semiparametric Copula Estimation.” Ann. Appl. Stat. 1 (1). The Institute of Mathematical Statistics: 265–83. https://doi.org/10.1214/07-AOAS107.

Stekhoven, Daniel J. 2013. MissForest: Nonparametric Missing Value Imputation Using Random Forest.

Stekhoven, Daniel J., and Peter Buehlmann. 2012. “MissForest - Non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28 (1). Oxford Univ Press: 112–18.

Zhao, Yuxuan, and Madeleine Udell. 2019. “Missing Value Imputation for Mixed Data via Gaussian Copula.”

———. 2020. “Matrix Completion with Quantified Uncertainty Through Low Rank Gaussian Copula.”