Asymmetric predictability

Information theory in causal investigations

Soumik Purkayastha and Peter X.K. Song

Department of Biostatistics, University of Michigan

5/24/23

Overview

  • Causal investigations in observational studies pose a great challenge.
  • We develop an information-theoretic causal discovery and inference framework based on “predictive asymmetry”.

For \((X, Y)\), predictive asymmetry enables assessment of whether \(X\) is more likely to cause \(Y\) or vice versa. The asymmetry between cause and effect becomes particularly simple if \(X\) and \(Y\) are deterministically related.

  • We propose a new metric called the Directed Mutual Information (\(DMI\)) and establish its key statistical properties.
  • \(DMI\) is able not only to detect complex non-linear association patterns, but also to detect and infer causal relations.

Motivating problem

Are \(X\) and \(Y\) associated?

Does \(X\) cause \(Y\)?

Does \(Y\) cause \(X\)?

Even under the simplifying assumptions of no confounding, no feedback loops, and no selection bias, directly assessing a bivariate causal relationship is a notoriously hard problem (Spirtes and Zhang (2016)).

  • Our methodology development is motivated by the Early Life Exposures in Mexico to Environmental Toxicants (ELEMENT) cohort study (Hernandez-Avila et al. (1996)).

  • One of the aims is to investigate the direction of influence between DNA methylation (DNAm) alterations and cardiovascular outcomes.

  • In a candidate gene (namely, ATP2B1) that is linked with blood pressure (BP), researchers wish to investigate whether DNAm (specifically, cytosine-phosphate-guanine (CpG) methylation) status influences change in BP or whether the converse is true (Dicorpo et al. (2018)).

The Directed Mutual Information (DMI)

Mutual information (MI)

Let \(X\) and \(Y\) be two random vectors with joint density \(f_{XY}\) with marginals \(f_X\) and \(f_Y\). Mutual information \(MI(X, Y)\) (Shannon (1948)) is: \[\begin{equation} \label{eq:mi} MI(X, Y) = E_{XY}\left\{ \log \frac{f_{XY}(X, Y)}{f_X(X) f_Y(Y)} \right\} \end{equation}\]
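As a standard closed-form example (a textbook fact, not specific to this work): for a bivariate Gaussian pair with correlation \(\rho\),

\[
MI(X, Y) = -\tfrac{1}{2}\log\left(1 - \rho^2\right),
\]

which equals zero exactly when \(\rho = 0\) and grows without bound as \(|\rho| \rightarrow 1\).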

  • \(MI \geq 0\) with equality if and only if \(X\) and \(Y\) are independent.
  • \(MI\) depends only on the underlying joint copula density \(c_{XY}\) and not on the marginal densities.
  • Consider \(U_X = F_X(X)\) and \(U_Y = F_Y(Y)\), where \(F_X\) and \(F_Y\) are the CDFs of \(X\) and \(Y\), respectively.

  • By Sklar’s theorem (Sklar (1959)), we have \[c_{X Y}\left(u_X, u_Y\right)=\frac{f_{X Y}\left\{F_X^{-1}\left(u_X\right), F_Y^{-1}\left(u_Y\right)\right\}}{f_X\left\{F_X^{-1}\left(u_X\right)\right\} f_Y\left\{F_Y^{-1}\left(u_Y\right)\right\}}, \] so we can re-write \(MI(X, Y) = MI(U_X, U_Y) = E\left\{ \log c_{XY}(U_X, U_Y) \right\}\), where \(c_{XY}\) is the unique copula density function.

  • This copula-based representation greatly reduces the computational burden of \(MI\) estimation relative to methods based on the original definition (Ma and Sun (2008)).

Purkayastha and Song (2022b) “fastMI: a fast and consistent copula-based estimator of mutual information.”
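The authors’ fastMI estimator is FFT-based; purely as an illustration of the copula route above (not their implementation, and with all names illustrative), one can rank-transform to pseudo-observations, fit a kernel estimate of the copula density, and average its log over the sample:

```python
import numpy as np
from scipy.stats import gaussian_kde, rankdata

def copula_mi(x, y):
    """Rough plug-in estimate of MI(X, Y) = E[log c_XY(U_X, U_Y)].

    Pseudo-observations rank/(n + 1) have nearly uniform margins, so a KDE
    fitted to them approximates the copula density; averaging its log at the
    sample points estimates MI. Boundary bias of the Gaussian KDE is ignored
    in this sketch.
    """
    n = len(x)
    u = rankdata(x) / (n + 1.0)
    v = rankdata(y) / (n + 1.0)
    kde = gaussian_kde(np.vstack([u, v]))    # joint density on the copula scale
    return np.mean(np.log(kde(np.vstack([u, v]))))

# Check against the closed form for a bivariate Gaussian: MI = -0.5 * log(1 - rho^2)
rng = np.random.default_rng(1)
rho = 0.6
x = rng.standard_normal(2000)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(2000)
print(copula_mi(x, y), -0.5 * np.log(1 - rho**2))
```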

The Directed Mutual Information (DMI)

Entropy and entropy ratio (ER)

  • Marginal entropy of \(X\) is defined as \(H(X) = E_X\left\{ - \log f_X(X) \right\}\).
  • \(H(X, Y) = E_{XY} \left\{- \log f_{XY}(X, Y) \right\}\) is the joint entropy of \((X, Y)\).
  • Conditional entropy of \(X\) conditioned on \(Y\) is given by \(H(X|Y) = H(X, Y) - H(Y)\).

Entropy decomposition equation:

\[\begin{equation} \label{eqn:identity} \begin{aligned} H(X,Y) &= MI(X, Y) + H(X|Y) + H(Y|X). \end{aligned} \end{equation}\]
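The identity is a one-line consequence of the definitions above, since \(MI(X, Y) = H(X) + H(Y) - H(X, Y)\):

\[
MI(X, Y) + H(X|Y) + H(Y|X) = \left\{H(X) + H(Y) - H(X,Y)\right\} + \left\{H(X,Y) - H(Y)\right\} + \left\{H(X,Y) - H(X)\right\} = H(X, Y).
\]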

The total entropy \(H(X, Y)\) may be decomposed to capture “symmetric behaviour” through \(MI(X, Y)\) and “asymmetric behaviour” through \(H(X|Y)\) and \(H(Y|X)\).

Comparing \(H(X|Y)\) with \(H(Y|X)\) reveals whether conditioning on \(X\) to predict \(Y\) yields less uncertainty than conditioning on \(Y\) to predict \(X\) (the case \(H(Y|X) < H(X|Y)\)) or more (the reverse). As a result, it highlights which of \(X\) or \(Y\) plays the more dominant predictive role in the bivariate relationship.

We define the entropy ratio of \(X\) relative to \(Y\) as

\[\begin{equation} \label{eq:relcondent} ER(X|Y) = \frac{\exp\{H(X|Y)\}}{ \left[\exp\{H(X|Y)\} + \exp\{H(Y|X)\} \right]}. \end{equation}\]

The Directed Mutual Information (DMI)

Entropy ratio

  • \(ER(X|Y) = ER(Y|X) = 1/2\) if and only if \(H(X|Y) = H(Y|X)\).
  • \(ER(X|Y) \lessgtr 1/2\) implies \(H(X|Y) \lessgtr H(Y|X)\), with \(H(Y|X) < H(X|Y)\) establishing \(X\) as the “dominant predictor variable” that exerts more “influence” on \(Y\).

\(ER(X|Y) > 1/2\) establishes dominance of \(X\), while \(ER(X|Y) < 1/2\) establishes dominance of \(Y\).
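As a purely hypothetical numerical illustration: if \(H(Y|X) = 1\) nat and \(H(X|Y) = 2\) nats, then

\[
ER(X|Y) = \frac{e^{2}}{e^{2} + e^{1}} = \frac{e}{e + 1} \approx 0.73 > \tfrac{1}{2},
\]

so \(X\) predicts \(Y\) with less residual uncertainty than the reverse and is deemed the dominant predictor.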

The \(ER\) has a nice transitive property:

  • If we assume \(X\) is more predictive than \(Y\), i.e., \(ER(X|Y) > 1/2\),
  • and assume \(Y\) is more predictive than \(Z\), i.e., \(ER(Y|Z) > 1/2\),
  • then a little algebra yields \(ER(X|Z) > 1/2\) (see the derivation below).

An ordering of \(ER\) between \((X, Y)\) and \((Y, Z)\) provides insight on asymmetric predictability in a third pair, i.e., \((X, Z)\) within our proposed framework.
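The algebra is short. From the definitions of conditional entropy,

\[
H(X|Y) - H(Y|X) = \left\{H(X,Y) - H(Y)\right\} - \left\{H(X,Y) - H(X)\right\} = H(X) - H(Y),
\]

so \(ER(X|Y) > 1/2\) is equivalent to \(H(X) > H(Y)\). The two assumptions then give \(H(X) > H(Y) > H(Z)\), hence \(H(X) > H(Z)\) and \(ER(X|Z) > 1/2\).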

The Directed Mutual Information (DMI)

DMI can detect dependence between two random variables and also test for asymmetric predictability between \(X\) and \(Y\). \[\begin{equation} \label{eq:dmi} DMI\left(X\middle| Y\right) = MI(X, Y) \times ER(X|Y). \end{equation}\]

\(DMI(X|Y)\) is a scaled asymmetric function of \(MI\), which is an effective tool to capture symmetric association. The scaling factor \(ER(X|Y)\) aims to capture asymmetric behaviour between \(X\) and \(Y\) by comparing their relative predictive performance when they are associated.

  1. \(DMI\left(X\middle| Y\right) \geq 0\) and \(DMI\left(Y\middle| X\right) \geq 0\), with equality if and only if \(X\) and \(Y\) are independent.
  2. \(DMI\left(X\middle| Y\right) \lessgtr DMI\left(Y\middle| X\right)\) implies \(ER\left(X\middle| Y\right) \lessgtr ER\left(Y\middle| X\right)\).

\[\begin{equation} \label{eq:del} \Delta = DMI\left(X\middle| Y\right) - DMI\left(Y\middle| X\right), \end{equation}\] where \(\Delta > 0\) (or \(\Delta < 0\)) establishes \(X\) as the dominant predictor variable over \(Y\) (or \(Y\) as the dominant predictor variable over \(X\)).
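Since \(ER(X|Y) + ER(Y|X) = 1\) by construction, the two directed quantities partition the mutual information:

\[
DMI(X|Y) + DMI(Y|X) = MI(X, Y), \qquad \Delta = MI(X, Y)\left\{2\,ER(X|Y) - 1\right\},
\]

so \(\Delta\) always lies in \([-MI, MI]\) and vanishes either under independence or under perfectly symmetric predictability (\(ER = 1/2\)).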

Estimating DMI

  • Estimating densities:

    • Using advances made by O’Brien et al. (2016), we estimate the underlying joint and marginal densities using fast Fourier transforms.
    • This is orders of magnitude faster than classical bandwidth-based kernel density estimation, while maintaining comparable error performance, making our method scalable.
    • Estimated densities are used to obtain estimates of \(DMI\) and \(\Delta\).
  • Estimating \(DMI\) and \(\Delta\): we use a data-splitting technique for estimation and valid inference, as shown in the flowchart below; a rough code sketch follows.
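As a rough sketch of such a data-splitting plug-in estimator (not the authors’ fastKDE-based implementation; the kernel estimator, names, and the toy data-generating model below are all illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

def dmi_split(x, y, rng):
    """Data-splitting plug-in estimates of DMI(X|Y), DMI(Y|X), and Delta."""
    n = len(x)
    idx = rng.permutation(n)
    fit, ev = idx[: n // 2], idx[n // 2 :]   # half for fitting densities, half for evaluation

    # Kernel density estimates fitted on the first half
    f_xy = gaussian_kde(np.vstack([x[fit], y[fit]]))
    f_x, f_y = gaussian_kde(x[fit]), gaussian_kde(y[fit])

    # Plug-in entropies: averages of -log density over the held-out half
    H_xy = -np.mean(np.log(f_xy(np.vstack([x[ev], y[ev]]))))
    H_x = -np.mean(np.log(f_x(x[ev])))
    H_y = -np.mean(np.log(f_y(y[ev])))

    mi = H_x + H_y - H_xy                    # MI(X, Y)
    er = np.exp(H_xy - H_y) / (np.exp(H_xy - H_y) + np.exp(H_xy - H_x))  # ER(X|Y)
    dmi_xy, dmi_yx = mi * er, mi * (1 - er)
    return dmi_xy, dmi_yx, dmi_xy - dmi_yx   # DMI(X|Y), DMI(Y|X), Delta

# Toy asymmetric pair: Y depends on X through a nonlinear map plus independent noise
rng = np.random.default_rng(7)
x = rng.standard_normal(4000)
y = np.tanh(2 * x) + 0.3 * rng.standard_normal(4000)
print(dmi_split(x, y, rng))
```

In this toy setup \(H(X) > H(Y)\), so one would expect \(\hat{\Delta} > 0\), flagging \(X\) as the dominant predictor, although this crude plug-in is noisier than the fastKDE-based estimator.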

Theoretical guarantees for valid inference with \(DMI\)

  1. Consistency: Assuming the density functions are bounded away from zero and bounded above, as \(\min(n_1, n_2) \rightarrow \infty\) (where \(n_1\) and \(n_2\) are the sizes of the two data splits) we obtain consistent estimates of \(DMI(X|Y)\) and \(DMI(Y|X)\).

  2. Asymptotic normality: Assuming the true \(MI \neq 0\), with \(\min(n_1, n_2) \rightarrow \infty\), we have \[\begin{equation} \begin{aligned} \sqrt{n_2}\left\{\widehat{DMI}(X|Y) - DMI(X|Y)\right\} &\rightarrow N(0, \sigma_1^2),\\ \sqrt{n_2}(\hat{\Delta} - \Delta) &\rightarrow N(0, \sigma_2^2), \end{aligned} \end{equation}\] where both \(\sigma_1\) and \(\sigma_2\) can be estimated by Monte Carlo tools (Robert (2010)).
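Assuming the usual Wald construction on top of these limits (a standard recipe, not necessarily the paper's exact procedure), an approximate \(100(1-\alpha)\%\) confidence interval for \(\Delta\) is

\[
\hat{\Delta} \pm z_{1-\alpha/2}\, \hat{\sigma}_2 / \sqrt{n_2},
\]

which is how interval estimates such as the one reported in the ELEMENT analysis below can be read.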

Simulation study: testing independence

  • In each pattern, the signal parameter \(a\) determines the strength of association between \(X\) and \(Y\) with \(a = 0\) denoting independence.

  • Upon increasing \(a\), we increase the signal strength of \(X\) in \(Y\) relative to the independent noise \(\epsilon\), implying a departure from independence (i.e., the null case of \(a = 0\)).


Real data application: ELEMENT study.

  • We investigate DNA methylation (DNAm) and blood pressure (BP) at 21 correlated methylation sites of a candidate gene (namely, ATP2B1) in the ELEMENT cohort.

  • Overall, DNAm of ATP2B1 is associated with DBP (p-value = 0.042) at the 5% level of significance.

    • One methylation site (17564205) is noted to be significantly associated with DBP after applying Bonferroni correction.

  • Testing \(H_0: DMI = 0\) serves as a test of independence.
    • We perform a permutation-based test of independence for all 21 methylation sites with Systolic and Diastolic BP (SBP/DBP).
    • p-values obtained from the individual methylation sites are mildly correlated.
    • So we aggregate p-values using the Cauchy combination test (Liu and Xie (2019)); a sketch of this rule is given below.
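A compact sketch of the Cauchy combination rule of Liu and Xie (2019), assuming equal weights for illustration:

```python
import numpy as np

def cauchy_combination(pvals, weights=None):
    """Combine possibly dependent p-values via the Cauchy combination test.

    Each p-value is mapped to a standard Cauchy quantile; the weighted sum is
    (approximately) standard Cauchy under the global null, which yields the
    combined p-value in closed form.
    """
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))   # Cauchy combination statistic
    return 0.5 - np.arctan(t) / np.pi           # P(standard Cauchy > t)

# e.g. aggregating p-values across methylation-site tests (toy numbers)
print(cauchy_combination([0.003, 0.20, 0.41, 0.77]))
```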

Real data application: ELEMENT study.

  • We examine \(\Delta\) and note that DBP exhibits dominance over the site 17564205, with \(\hat{\Delta} (95\% \ CI) = -2.14 \ (-3.85, -0.42)\).

With most studies on the association between DNAm and blood pressure (Han et al. (2016)) still in their infancy, these findings present evidence that DNAm alteration could be influenced by blood pressure and call for further investigation.

Purkayastha and Song (2022a) “Asymmetric predictability in causal discovery: an information theoretic approach”

In summary…

  • Asymmetry is an inherent property of bivariate associations and therefore must not be ignored.

  • We present a new causal discovery framework of asymmetric predictability between two random variables \(X\) and \(Y\) using \(DMI\).

  • The \(DMI\) is inspired by Shannon’s seminal work on information theory: it may simultaneously be used to test for association and to detect and quantify asymmetry.

  • A computationally fast and robust Fourier transformation-based method is used to estimate the \(DMI\) instead of conventional kernel-based methods.

  • Moreover, our method consistently runs faster than existing bandwidth-dependent methods, by approximately four orders of magnitude for bivariate sample sizes of around \(10^4\).

  • The \(DMI\) methodology relies on data-splitting inference and enjoys the key large-sample properties necessary for valid inference.

  • In the absence of prior knowledge, our framework may serve as either a discovery tool or a confirmatory tool.

References

Dicorpo, Daniel A., Samantha Lent, Weihua Guan, Marie-France Hivert, and James S. Pankow. 2018. “Mendelian Randomization Suggests Causal Influence of Glycemic Traits on DNA Methylation.” Diabetes 67 (Supplement_1). https://doi.org/10.2337/db18-1707-p.
Han, Liyuan, Yanfen Liu, Shiwei Duan, Benjamin Perry, Wen Li, and Yonghan He. 2016. “DNA Methylation and Hypertension: Emerging Evidence and Challenges.” Briefings in Functional Genomics, May, elw014. https://doi.org/10.1093/bfgp/elw014.
Hernandez-Avila, M, T Gonzalez-Cossio, E Palazuelos, I Romieu, A Aro, E Fishbein, K E Peterson, and H Hu. 1996. “Dietary and Environmental Determinants of Blood and Bone Lead Levels in Lactating Postpartum Women Living in Mexico City.” Environmental Health Perspectives 104 (10): 1076–82. https://doi.org/10.1289/ehp.961041076.
Liu, Yaowu, and Jun Xie. 2019. “Cauchy Combination Test: A Powerful Test with Analytic p-Value Calculation Under Arbitrary Dependency Structures.” Journal of the American Statistical Association 115 (529): 393–402. https://doi.org/10.1080/01621459.2018.1554485.
Ma, Jian, and Zengqi Sun. 2008. “Mutual Information Is Copula Entropy.” Tsinghua Science and Technology 16 (1): 51–54. https://doi.org/10.1016/s1007-0214(11)70008-6.
O’Brien, Travis A., Karthik Kashinath, Nicholas R. Cavanaugh, William D. Collins, and John P. O’Brien. 2016. “A Fast and Objective Multidimensional Kernel Density Estimation Method: fastKDE.” Computational Statistics & Data Analysis 101 (September): 148–60. https://doi.org/10.1016/j.csda.2016.02.014.
Purkayastha, Soumik, and Peter XK Song. 2022a. “Asymmetric Predictability in Causal Discovery: An Information Theoretic Approach.” arXiv Preprint arXiv:2210.14455.
———. 2022b. “fastMI: A Fast and Consistent Copula-Based Estimator of Mutual Information.” arXiv Preprint arXiv:2212.10268.
Robert, Christian P. 2010. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer.
Shannon, C. E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal 27 (3): 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
Sklar, M. 1959. “Fonctions de répartition à n dimensions et leurs marges.” Publications de l’Institut de Statistique de l’Université de Paris 8: 229–31.
Spirtes, Peter L., and Kun Zhang. 2016. “Causal Discovery and Inference: Concepts and Recent Methodological Advances.” Applied Informatics 3.