\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\diag{\text{diag}} \]
The first ten rows of a simulated data set with missing values; X1-X3 are continuous, X4-X6 are binary, and X7-X9 are ordinal:

```r
##       X1      X2      X3    X4    X5    X6   X7   X8   X9
## 1     NA      NA -1.0154 FALSE    NA FALSE    E    E    C
## 2  0.990      NA  1.0106  TRUE  TRUE FALSE <NA>    D    B
## 3  1.213 0.71453 -1.1448    NA FALSE  TRUE <NA>    A    E
## 4  1.080 0.87564  0.0942 FALSE    NA    NA    B <NA> <NA>
## 5  0.712      NA      NA  TRUE    NA    NA    A <NA>    C
## 6  1.072 0.56038  0.3495  TRUE  TRUE  TRUE <NA> <NA> <NA>
## 7     NA 0.48548  0.4229 FALSE    NA FALSE    C    C <NA>
## 8  1.725      NA      NA    NA  TRUE  TRUE    A <NA> <NA>
## 9     NA      NA      NA FALSE FALSE    NA    A    D    D
## 10    NA 0.00989  0.8905  TRUE  TRUE    NA    A    A <NA>
```
Continuous variables map to a point in \((0, 1)\).
Binary and ordinal variables are the result of discretizing a hidden variable.
We make assumptions on the joint distribution of the hidden variables.
Outline:

- Motivation.
- Show the marginal likelihood.
- Show CDF approximations, gradient approximation, and the imputation method.
- Simulation example and simulation study.
- Conclusions.
Zhao and Udell (2019) suggest a method to estimate the model.
They use an approximate E-step in an EM algorithm and an approximate imputation method.
They show a six-fold reduction in computation time compared with the MCMC method suggested by Hoff (2007).
They also show good performance compared with alternative methods.
However, the approximations of Zhao and Udell (2019) may fail in some cases.
A direct approximation of the log marginal likelihood is possible for a moderate number of variables (\(p < 100\)).
We have outcomes \(\vec x_1, \dots, \vec x_n\).
Each comes from a hidden vector \(\vec z_1, \dots, \vec z_n\in\mathbb{R}^p\):
\[\vec z_i \sim N^{(p)}(\vec 0, \mat C)\]
where \(\mat C\) is an unknown correlation matrix.
\(\Phi\) is the standard normal distribution’s CDF and \(F_j\) is an unknown CDF.
The inverse we will use is \(\Phi^{-1} \circ \widehat F_j\).
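As a concrete sketch (our assumption: \(\widehat F_j\) is the empirical CDF rescaled by \(n/(n + 1)\) so the values stay in \((0, 1)\)), the transformation for one observed continuous variable in the data frame `dat_obs` shown earlier could be computed as:

```r
# sketch: map an observed continuous variable to the latent scale using
# the empirical CDF rescaled by n / (n + 1) as the estimate of F_j
x <- dat_obs$X1[!is.na(dat_obs$X1)]     # observed values of one variable
n <- length(x)
F_hat <- ecdf(x)
z_hat <- qnorm(F_hat(x) * n / (n + 1))  # Phi^{-1} applied elementwise
```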
Ordinal variable \(j\) with \(k_j\) levels is a discretization of \(z_{ij}\).
It has unknown borders given by \(b_{j0} = 0 < b_{j1} < b_{j2} < \cdots < b_{jk_j} = 1\).
\(x_{ij}\) is the result of \(F_j^{-1}\circ \Phi\) applied to \(z_{ij}\) where \(F_j^{-1}:\, (0, 1)\rightarrow \{1, \dots, k_j \}\) with
\[ F_j^{-1}(u) = l \Leftrightarrow b_{j,l-1} < u \leq b_{jl}. \]
We estimate \(\widehat b_{j1}, \dots, \widehat b_{j,k_j - 1}\) and define \(\widehat F_j:\, \{1, \dots, k_j\} \rightarrow (0, 1)^2\) such that
\[\widehat F_j(x) = (\widehat b_{j,l - 1}, \widehat b_{jl}] \Leftrightarrow x = l.\]
In what follows, \(\Phi^{-1}\) is applied elementwise, e.g. to the endpoints of these intervals.
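Similarly, here is a sketch of border estimates for an ordinal variable from its marginal frequencies; this is our assumption on how the \(\widehat b_{jl}\) could be formed, motivated by \(\Prob(X = l) = b_{jl} - b_{j,l-1}\):

```r
# sketch: estimate the borders b_j0 = 0 < b_j1 < ... < b_jk = 1 of an
# ordinal variable from the observed marginal frequencies
x <- dat_obs$X7[!is.na(dat_obs$X7)]
b_hat <- c(0, cumsum(prop.table(table(x))))  # cumulative frequencies
```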
Let
\[\vec x_{i\mathcal I} = (x_{il_1}, \dots, x_{il_m})^\top, \qquad \mathcal I = \{l_1, \dots, l_m\}.\]
For each observation \(i\), we observe the continuous variables at indices \(\mathcal C_i\) and the binary/ordinal variables at indices \(\mathcal O_i\), while the variables at indices \(\mathcal M_i\) are missing, where \(\mathcal C_i \cap \mathcal O_i = \mathcal C_i \cap \mathcal M_i = \mathcal O_i \cap \mathcal M_i = \emptyset\) and \(\mathcal C_i \cup \mathcal O_i \cup \mathcal M_i = \{1, \dots, p\}\).
Let
\[ \begin{align*}\mat C_{\mathcal I\mathcal J} &= \begin{pmatrix} c_{l_1h_1} & \cdots & c_{l_1h_s} \\ \vdots & \ddots & \vdots \\ c_{l_mh_1} & \cdots & c_{l_mh_s}\end{pmatrix}, \qquad \mathcal J = \{h_1, \dots, h_s\} \\ \widehat{\vec z}_{\mathcal C_i} &= (\Phi^{-1}(\widehat F_j(x_{ij})))_{ j \in \mathcal C_i} \\ \mathcal D_i &= \left\{\vec u \in \mathbb R^{\lvert\mathcal O_i\rvert} :\, \Phi^{-1}(\widehat b_{j,x_{ij} - 1}) < u_j \leq \Phi^{-1}(\widehat b_{jx_{ij}}),\; j \in \mathcal O_i\right\} \end{align*} \]
The likelihood factor of \(\mat C\) for observation \(i\) is then (dropping factors that do not depend on \(\mat C\))

\[ L_i(\mat C) = \phi^{(\lvert\mathcal C_i\rvert)}(\widehat{\vec z}_{\mathcal C_i}; \vec 0, \mat C_{\mathcal C_i\mathcal C_i}) \Phi^{(\lvert\mathcal O_i\rvert)}\left(\mathcal D_i; \mat C_{\mathcal O_i\mathcal C_i}\mat C_{\mathcal C_i\mathcal C_i}^{-1}\widehat{\vec z}_{\mathcal C_i},\, \mat C_{\mathcal O_i\mathcal O_i} - \mat C_{\mathcal O_i\mathcal C_i}\mat C_{\mathcal C_i\mathcal C_i}^{-1}\mat C_{\mathcal C_i\mathcal O_i}\right) \]

where \(\phi^{(s)}\) is a \(s\)-dimensional multivariate normal density and \(\Phi^{(s)}\) is a \(s\)-dimensional multivariate normal CDF over a hypercube.
Need an approximation for the CDF.
Want an approximation of
\[ \Phi^{(h)}(\mathcal P; \vec\mu, \mat\Sigma) = \int_{a_1}^{b_1}\cdots\int_{a_h}^{b_h} \phi^{(h)}(\vec z; \vec\mu,\mat\Sigma)\der\vec z \]
for some \(\mathcal P = (a_1, b_1]\times \cdots \times (a_h, b_h]\).
Closely follow Genz and Bretz (2009).
Let \(\mat S\) be the Cholesky factor of \(\mat\Sigma = \mat S\mat S^\top\). Then use the substitution \(\vec z = \mat S\vec y + \vec \mu\):
\[ \begin{align*} \Phi^{(h)}(\mathcal P; \vec\mu, \mat\Sigma) = \int_{\tilde a_1}^{\tilde b_1}\phi(y_1) \int_{\tilde a_2(y_1)}^{\tilde b_2(y_1)}\phi(y_2) \cdots\int_{\tilde a_h(y_1, \dots, y_{h-1})}^{ \tilde b_h(y_1, \dots, y_{h-1})} \phi(y_h)\der\vec y \end{align*} \]
\[ \tilde a_j(y_1,\dots, y_{j - 1}) = \frac{a_j - \sum_{l = 1}^{j - 1}s_{jl}y_l}{s_{jj}} \]
and \(\tilde b_j\) similarly.
Use \(\Phi^{-1}(u_j) = y_j\):
\[ \begin{align*} \Phi^{(h)}(\mathcal P; \vec\mu, \mat\Sigma) = \int_{\Phi(\tilde a_1)}^{\Phi(\tilde b_1)} \int_{\Phi(\tilde a_2(\Phi^{-1}(u_1)))}^{\Phi(\tilde b_2(\Phi^{-1}(u_1)))} \cdots\int_{\Phi(\tilde a_h(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_{h-1})))}^{ \Phi(\tilde b_h(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_{h-1})))} \der\vec u \end{align*} \]
Re-scale and relocate with:
\[ \begin{align*} \bar a_j + (\bar b_j - \bar a_j)w_j &= u_j \\ \bar a_j(w_1, \dots, w_{j -1}) &= \Phi(\tilde a_j(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_{j-1}))) \\ \bar b_j(w_1, \dots, w_{j -1}) &= \Phi(\tilde b_j(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_{j-1}))) \\ \Phi^{(h)}(\mathcal P; \vec\mu, \mat\Sigma) &= (\bar b_1 - \bar a_1) \int_0^1(\bar b_2(w_1) - \bar a_2(w_1))\cdots \\ &\hspace{20pt} \cdot\int_0^1(\bar b_h(w_1, \dots, w_{h - 1}) - \bar a_h(w_1, \dots, w_{h - 1})) \int_0^1\der\vec w \end{align*} \]
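To make the transformation concrete, below is a minimal sketch of the resulting estimator with plain pseudo-random draws; `pmvnorm_sov` is our hypothetical helper, and the actual implementation instead uses randomized Korobov rules and a variable reordering:

```r
# a minimal sketch of the separation-of-variables estimator with plain
# pseudo-random draws instead of randomized Korobov rules
pmvnorm_sov <- function(lower, upper, mu, Sigma, n_draws = 10000L) {
  h <- length(lower)
  S <- t(chol(Sigma))                  # lower triangular Cholesky factor
  lower <- lower - mu                  # center the integration limits
  upper <- upper - mu
  out <- 0
  for (i in 1:n_draws) {
    w <- runif(h)
    y <- numeric(h)
    val <- 1
    for (j in 1:h) {
      shift <- if (j > 1) sum(S[j, 1:(j - 1)] * y[1:(j - 1)]) else 0
      a_bar <- pnorm((lower[j] - shift) / S[j, j])
      b_bar <- pnorm((upper[j] - shift) / S[j, j])
      val <- val * (b_bar - a_bar)
      # map w_j to u_j in (a_bar, b_bar] and back to the y-scale
      y[j] <- qnorm(a_bar + (b_bar - a_bar) * w[j])
    }
    out <- out + val
  }
  out / n_draws
}

# small sanity check in two dimensions
set.seed(1)
Sigma <- matrix(c(1, .5, .5, 1), 2)
pmvnorm_sov(c(-1, -1), c(1, 2), c(0, 0), Sigma)
```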
A heuristic reordering of the variables can reduce the variance of the integrand.
Genz and Bretz (2002) use randomized Korobov rules.
This yields an error which is \(\mathcal O (s^{-1}(\log s)^h)\) instead of \(\mathcal O(s^{-1/2})\), where \(s\) is the number of samples.
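For illustration, a randomized rank-1 lattice (Korobov) point set can be generated as below; `korobov_points` and the generating constant `a = 76` are our arbitrary choices for this sketch:

```r
# sketch: s points of a randomized Korobov rule in h dimensions
korobov_points <- function(s, h, a = 76) {
  z <- a^(0:(h - 1)) %% s  # generating vector (exact only for small h)
  shift <- runif(h)        # random shift; re-draw it to estimate the error
  (outer(1:s, z) / s + rep(shift, each = s)) %% 1
}
u <- korobov_points(1009L, 4L)  # s is typically a prime
```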
Derivatives w.r.t. \(\vec \mu\) and \(\mat\Sigma\) and mean imputation require:
\[ \int_{a_1}^{b_1}\cdots\int_{a_h}^{b_h} \vec h(\vec z, \vec \mu, \mat\Sigma)\phi^{(h)}(\vec z; \vec\mu,\mat\Sigma)\der\vec z \]
for some function \(\vec h\).
If \(\vec g:\, (0,1)^h\rightarrow \mathbb R^h\) is the map from \(\vec w\) to \(\vec z\) then using a transformation and permutation like before is beneficial if \(\vec h(\vec g(\vec w), \vec\mu, \mat\Sigma)\) is not too variable.
The implementation extends the Fortran code by Genz and Bretz (2002) and supports computation in parallel.
The computation is compute-bound, so parallelization is very beneficial.
The simulated data set used below has 2000 observations.
```r
# approximate the log marginal likelihood
mdgc_obj <- get_mdgc(dat_obs)
comp_ptr <- get_mdgc_log_ml(mdgc_obj)
mdgc_log_ml(comp_ptr, vcov = dat$Sigma, n_threads = 4L, rel_eps = 1e-3)
## [1] -14119
```
Median running times:
| | Median |
|---|---|
| 1 thread | 957ms |
| 2 threads | 491ms |
| 3 threads | 353ms |
| 4 threads | 271ms |
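The timings above could be produced along these lines (a sketch assuming the microbenchmark package):

```r
# time the log marginal likelihood approximation with 1-4 threads
library(microbenchmark)
microbenchmark(
  `1 thread`  = mdgc_log_ml(comp_ptr, vcov = dat$Sigma, n_threads = 1L),
  `2 threads` = mdgc_log_ml(comp_ptr, vcov = dat$Sigma, n_threads = 2L),
  `3 threads` = mdgc_log_ml(comp_ptr, vcov = dat$Sigma, n_threads = 3L),
  `4 threads` = mdgc_log_ml(comp_ptr, vcov = dat$Sigma, n_threads = 4L),
  times = 10)
```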
Estimation is done with stochastic gradient methods; ADAM and SVRG are implemented.
```r
# estimate the model and perform the imputation
set.seed(1)
system.time(
  res <- mdgc(dat_obs, lr = 1e-3, maxit = 25L, maxpts = 5000L,
              conv_crit = 1e-6, n_threads = 4L, batch_size = 100L))
##    user  system elapsed
##  24.250   0.008   6.347
```
The first rows of the imputed data:

```r
##     X1    X2    X3    X4    X5    X6 X7 X8 X9
## 1 1.20 0.825 -1.02 FALSE FALSE FALSE  E  E  C
## 2 0.99 0.604  1.01  TRUE  TRUE FALSE  E  D  B
## 3 1.21 0.715 -1.14  TRUE FALSE  TRUE  A  A  E
```

and the corresponding rows of the complete data:

```r
##     X1    X2    X3    X4    X5    X6 X7 X8 X9
## 1 1.17 0.927 -1.02 FALSE FALSE FALSE  E  E  C
## 2 0.99 0.690  1.01  TRUE  TRUE FALSE  E  D  B
## 3 1.21 0.715 -1.14  TRUE FALSE  TRUE  C  A  E
```
| | Time | SE | Time (Δ) | SE | Est. error | SE | Est. error (Δ) | SE |
|---|---|---|---|---|---|---|---|---|
| mdgc | 11.1 | 0.3 | | | 0.062 | 0.001 | | |
| mixedgc | 15.0 | 0.2 | -3.9 | 0.3 | 0.100 | 0.002 | -0.037 | 0.002 |
| missForest | 22.7 | 0.4 | -11.6 | 0.5 | | | | |
mdgc is the presented method, mixedgc is the method suggested by Zhao and Udell (2019), and missForest is the method suggested by Stekhoven and Buehlmann (2012; Stekhoven 2013). 100 samples are used in the study.
The “Est. error” is \(\lVert \widehat{\mat C} - \mat C\rVert \big/\lVert\mat C\rVert\).
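As a sketch, assuming the Frobenius norm and hypothetical objects `C_hat` and `C` for the estimated and true correlation matrices:

```r
# relative estimation error of the correlation matrix
est_error <- norm(C_hat - C, type = "F") / norm(C, type = "F")
```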
| | Bin. | SE | Bin. (Δ) | SE | Ord. | SE | Ord. (Δ) | SE |
|---|---|---|---|---|---|---|---|---|
| mdgc | 0.246 | 0.003 | | | 0.588 | 0.004 | | |
| mixedgc | 0.252 | 0.003 | -0.006 | 0.004 | 0.625 | 0.004 | -0.037 | 0.006 |
| missForest | 0.290 | 0.003 | -0.043 | 0.004 | 0.648 | 0.003 | -0.060 | 0.005 |
| | Con. | SE | Con. (Δ) | SE |
|---|---|---|---|---|
| mdgc | 0.465 | 0.014 | | |
| mixedgc | 0.469 | 0.014 | -0.005 | 0.020 |
| missForest | 0.522 | 0.015 | -0.057 | 0.021 |
Bin. is the binary classification error, Ord. is the ordinal classification error, and Con. is the RMSE for the continuous variables.
| | Time | SE | Time (Δ) | SE | Est. error | SE | Est. error (Δ) | SE |
|---|---|---|---|---|---|---|---|---|
| mdgc | 151 | 5.1 | | | 0.081 | 0.001 | | |
| mixedgc | 415 | 3.4 | -263 | 6.4 | 0.115 | 0.001 | -0.034 | 0.001 |
Larger problem with \(p = 60\) and \(n = 10000\). 10 samples are used.
| | Bin. | SE | Bin. (Δ) | SE | Ord. | SE | Ord. (Δ) | SE |
|---|---|---|---|---|---|---|---|---|
| mdgc | 0.242 | 0.001 | | | 0.581 | 0.002 | | |
| mixedgc | 0.249 | 0.001 | -0.008 | 0.002 | 0.612 | 0.002 | -0.031 | 0.003 |
| | Con. | SE | Con. (Δ) | SE |
|---|---|---|---|---|
| mdgc | 0.410 | 0.013 | | |
| mixedgc | 0.416 | 0.013 | -0.006 | 0.019 |
Introduced a Gaussian copula method to impute missing values.
Covered a quasi-random number method to approximate the log marginal likelihood and to perform imputations.
Showed that the method is efficient and fast even for larger problems with \(p = 60\) variables.
Like Zhao and Udell (2020), we can use a low rank approximation:
\[ \mat C = \mat W\mat W^\top + \sigma^2\mat I \]
where \(\mat W\in\mathbb{R}^{p\times k}\) with \(k < p\).
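A small sketch of constructing such a matrix; the rescaling to unit diagonal is our addition so that \(\mat C\) is a proper correlation matrix:

```r
# sketch: build C = W W^T + sigma^2 I with rank k < p and rescale it
p <- 10L; k <- 2L; sigma2 <- .5
W <- matrix(rnorm(p * k), p, k)
C <- cov2cor(tcrossprod(W) + sigma2 * diag(p))
```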
It is easy to add prediction intervals; the probabilities of each level are already provided.
\(x_{ij}\) is multinomial with \(k_j\) categories.
Suppose
\[ \begin{align*} \vec A_{ij} &\sim N^{(k_j)}(\vec \mu_j, 2^{-1}\mat I) \\ x_{ij} &= l\Leftrightarrow A_{ijl} > A_{ijl'} \quad \forall l'\neq l \end{align*} \]
\(\vec\mu_j\) is given by the marginal distribution.
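A draw from this latent formulation can be sketched as follows, with arbitrary illustrative means `mu_j`:

```r
# sketch: one multinomial outcome from the latent-variable formulation
mu_j <- c(.5, 0, -.5)                     # k_j = 3 assumed category means
A <- rnorm(length(mu_j), mu_j, sqrt(.5))  # A_ij ~ N(mu_j, I / 2)
x_ij <- which.max(A)                      # the observed category
```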
Let
\[ \Phi^{(k)}(\vec a, \vec b;\vec\mu,\mat\Sigma) = \int_{a_1}^{b_1}\cdots\int_{a_k}^{b_k} \phi^{(k)}(\vec x;\vec\mu,\mat\Sigma)\der\vec x \]
Then
\[ \begin{align*} \int\phi^{(k_1)}(\vec z_1; \vec\mu_1, \mat\Sigma_1) \Phi^{(k_2)}(\vec a + \mat K\vec z_1, \vec b + \mat K\vec z_1; \vec\mu_2,\mat\Sigma_2)\der\vec z_1 & \\ &\hspace{-200pt}= \Phi^{(k_2)}(\vec a, \vec b; \vec\mu_2 - \mat K\vec\mu_1, \mat\Sigma_2 + \mat K\mat\Sigma_1\mat K^\top) \end{align*} \]
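A quick one-dimensional numeric check of this identity with arbitrary parameter values (\(k_1 = k_2 = 1\) and scalar \(\mat K\)):

```r
# verify the integral identity numerically in one dimension
mu1 <- .3; v1 <- .8; mu2 <- -.2; v2 <- 1.2; K <- .5; a <- -1; b <- 1
lhs <- integrate(function(z)
  dnorm(z, mu1, sqrt(v1)) *
    (pnorm(b + K * z, mu2, sqrt(v2)) - pnorm(a + K * z, mu2, sqrt(v2))),
  -Inf, Inf)$value
rhs <- pnorm(b, mu2 - K * mu1, sqrt(v2 + K^2 * v1)) -
  pnorm(a, mu2 - K * mu1, sqrt(v2 + K^2 * v1))
all.equal(lhs, rhs, tolerance = 1e-6)
```

Applying the identity to the multinomial marginals yields: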
\[ \begin{align*} \Prob(X_{ij} = l) &= \int \phi(a;\mu_{jl},2^{-1}) \Phi^{(k_j - 1)}(-\vec\infty, \vec 1a;\vec\mu_{j(-l)}, 2^{-1}\mat I)\der a \\ &= \Phi^{(k_j - 1)}(-\vec\infty, \vec 1\mu_{jl} - \vec\mu_{j(-l)}; \vec 0, 2^{-1}\mat I + 2^{-1}\vec 1\vec 1 ^\top) \end{align*} \]
where \(\vec\mu_{j(-l)} = (\mu_{j1}, \dots, \mu_{j,l - 1}, \mu_{j,l + 1}, \dots, \mu_{jk_j})^\top\).
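For instance, with \(k_j = 3\) and \(l = 1\) the closed form can be checked against a simulation (a sketch assuming the mvtnorm package and arbitrary means):

```r
# check P(X_ij = 1) for k_j = 3 categories with the closed form
library(mvtnorm)
mu_j <- c(.5, 0, -.5)
S <- diag(2) / 2 + matrix(1, 2, 2) / 2  # I / 2 + 1 1^T / 2
pmvnorm(upper = mu_j[1] - mu_j[-1], sigma = S)
# compare with a simulation from the latent formulation
A <- matrix(rnorm(3 * 1e5, mu_j, sqrt(.5)), 3)
mean(apply(A, 2, which.max) == 1)
```

In the binary case, \(k_j = 2\):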
\[ \begin{align*} \Prob(X_{ij} = 1) &= \Phi(\mu_{j1} - \mu_{j2}; 0, 1) = \Phi(\mu_{j1} - \mu_{j2}) \\ \Prob(X_{ij} = 2) &= \Phi(\mu_{j2} - \mu_{j1}; 0, 1) = \Phi(\mu_{j2} - \mu_{j1}) \end{align*} \]
Select \(\mu_{j2} = 0\) for identifiability:
\[ \begin{align*} \Prob(X_{ij} = 1) &= \Phi( \mu_{j1}) \\ \Prob(X_{ij} = 2) &= \Phi(-\mu_{j1}) \end{align*} \]
I.e. a binary variable with a border at \(\Phi(\mu_{j1}) \in (0, 1)\).
Suppose \(x_{ij} = l\) and write:
\[ \vec V_i = (A_{ijl}, \vec A_{ij(-l)}^\top, \vec z_{i\mathcal O_i}^\top)^\top \]
where \(\vec z_{i\mathcal O_i}\) are the remaining variables (ordinal/binary).
Denote:
\[ \vec V_i \sim N^{(h)}\left( \begin{pmatrix} \mu_{g} \\ \vec\mu_{(-g)}\end{pmatrix}, \begin{pmatrix} \sigma_{gg} & \vec\sigma_{g(-g)} \\ \vec\sigma_{(-g)g} & \mat\Sigma_{(-g)(-g)}\end{pmatrix} \right) \]
The marginal likelihood factor is:
\[ \begin{align*} \int \phi(z;\mu_g,\sigma_{gg}) \Phi^{(h - 1)}\bigg(&\vec a_{(-g)} + \vec dz - \mat K(z - \mu_g), \vec b_{(-g)} + \vec dz - \mat K(z - \mu_g); \\ & \vec\mu_{(-g)}, \mat\Sigma_{(-g)(-g)} - \frac{\vec\sigma_{(-g)g}\vec\sigma_{(-g)g}^\top}{\sigma_{gg}}\bigg)\der z \end{align*} \]
with \(\mat K = \vec\sigma_{(-g)g}/\sigma_{gg}\) and
\[\vec d = (\underbrace{\,\,\,\vec 1^\top\,\,\,}_{k_j - 1\text{ times}}, \vec 0^\top)^\top\]
Thus, we get:
\[ \Phi^{(h - 1)}\bigg(\vec a_{(-g)}, \vec b_{(-g)}; \vec\mu_{(-g)} - \vec d\mu_g, \mat\Sigma_{(-g)(-g)} + \vec d\vec d^\top\sigma_{gg}\bigg) \]
Easy to extend to multiple multinomial variables.
We have \(G\) categorical variables. The latent variables for the \(l\)th categorical variable, \(l = 1,\dots, G\), are at indices \(r_l + 1, \dots, r_l + k_l\).
Suppose that we observe the category of the \(l\)th categorical variable whose latent variable is at index \(c_l\), and let \(\mathcal C = \{c_1,\dots, c_G\}\).
Define vectors \(\vec d_l = (\vec 0^\top, \underbrace{\,\,\,\vec 1^\top\,\,\,}_{k_l\text{ times}}, \vec 0^\top)^\top\) which have ones at indices \(r_l + 1, \dots, r_l + k_l\).
Let \(\mat D = (\vec d_1, \dots, \vec d_G)_{(-\mathcal C)\cdot}\), the matrix whose columns are \(\vec d_1, \dots, \vec d_G\) without the rows in \(\mathcal C\).
Then the intractable integral is:
\[ \begin{align*} \int \phi^{(G)}(\vec z;\vec \mu_{\mathcal C};\mat\Sigma_{\mathcal C\mathcal C}) \Phi^{(h - G)}\bigg(\hspace{-75pt}& \\ & \vec a_{(-\mathcal C)} + \mat D\vec z - \mat K(\vec z - \vec\mu_{\mathcal C}), \vec b_{(-\mathcal C)} + \mat D\vec z - \mat K(\vec z - \vec \mu_{\mathcal C}); \\ & \vec\mu_{(-\mathcal C)}, \mat\Sigma_{(-\mathcal C)(-\mathcal C)} - \mat\Sigma_{(-\mathcal C)\mathcal C}\mat\Sigma_{\mathcal C\mathcal C}^{-1}\mat\Sigma_{\mathcal C(-\mathcal C)}\bigg) \der \vec z \\ \end{align*} \]
with \(\mat K = \mat\Sigma_{(-\mathcal C)\mathcal C}\mat\Sigma_{\mathcal C\mathcal C}^{-1}\).
Applying the identity yields:
\[ \Phi^{(h - G)}\bigg(\vec a_{(-\mathcal C)}, \vec b_{(-\mathcal C)}; \vec\mu_{(-\mathcal C)} - \mat D\vec\mu_{\mathcal C}, \mat\Sigma_{(-\mathcal C)(-\mathcal C)} + \mat D\mat\Sigma_{\mathcal C\mathcal C}\mat D^\top\bigg) \]
The presentation is at rpubs.com/boennecd/Gaussian-copula-KTH.
The markdown is at github.com/boennecd/Talks.
Code and more examples at github.com/boennecd/mdgc.
References are on the next slide.
Genz, Alan, and Frank Bretz. 2002. “Comparison of Methods for the Computation of Multivariate T Probabilities.” Journal of Computational and Graphical Statistics 11 (4). Taylor & Francis: 950–71. https://doi.org/10.1198/106186002394.
———. 2009. Computation of Multivariate Normal and T Probabilities. Vol. 195. Springer Science & Business Media. https://doi.org/10.1007/978-3-642-01689-9.
Hoff, P. D. 2007. “Extending the Rank Likelihood for Semiparametric Copula Estimation.” Ann. Appl. Stat. 1 (1). The Institute of Mathematical Statistics: 265–83. https://doi.org/10.1214/07-AOAS107.
Stekhoven, Daniel J. 2013. MissForest: Nonparametric Missing Value Imputation Using Random Forest.
Stekhoven, Daniel J., and Peter Buehlmann. 2012. “MissForest - Non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28 (1). Oxford Univ Press: 112–18.
Zhao, Yuxuan, and Madeleine Udell. 2019. “Missing Value Imputation for Mixed Data via Gaussian Copula.”
———. 2020. “Matrix Completion with Quantified Uncertainty Through Low Rank Gaussian Copula.”