n <- 200
x <- runif(n)*10
yparab <- (x-5)^2 + 2*rnorm(n)
fitparab <- lm(yparab ~ x)
yvar <- rnorm(n, 0, x)
fitvar <- lm(yvar ~ x)
plot(x, yparab, xlab="X", ylab="Y", cex.lab=1.5)
abline(fitparab, col='red')
text(5,10, paste("rho=", round(cor(x, yparab), digits=3)))
plot(x,yvar, xlab="X", ylab="Y", cex.lab=1.5)
abline(fitvar, col="red")
text(0.5,4.5, paste("rho=", round(cor(x, yvar), digits=3)))
cor(x, yparab)
## [1] 0.02509234
cor.test(x,yparab)
##
## Pearson's product-moment correlation
##
## data: x and yparab
## t = 0.35319, df = 198, p-value = 0.7243
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1140457 0.1632650
## sample estimates:
## cor
## 0.02509234
cor.test(x,yvar)
##
## Pearson's product-moment correlation
##
## data: x and yvar
## t = 0.42392, df = 198, p-value = 0.6721
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1090839 0.1681513
## sample estimates:
## cor
## 0.03011283
Recall the correlation between random variables, \(X\) and \(Y\), is \[ \rho(X,Y) = \frac{\text{cov}(X,Y)}{\sqrt{\text{var}(X)\text{var}(Y)}} = \frac{E[XY]-E[X]E[Y]}{\sqrt{\text{var}(X)\text{var}(Y)}} \]
\(X\) and \(Y\) are uncorrelated (\(\rho(X,Y) = 0\)) if \(E[XY]=E[X]E[Y]\)
Because we can subtract the mean, \(E[X]\), from any random variable and divide by its standard deviation, we really only care about \(E[XY]\)
If \(E[X] = E[Y]=0\) and \(\text{var}(X)=\text{var}(Y)=1\), then \[ \rho(X,Y) = E[XY] \]
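A quick numerical check with the simulated x and yparab from above: after standardizing, the sample analogue of \(E[XY]\) is exactly the sample correlation.
xs <- scale(x)                    # center and scale to unit standard deviation
ys <- scale(yparab)
sum(xs * ys) / (length(x) - 1)    # sample version of E[XY] for standardized variables
cor(x, yparab)                    # same value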
Kernel methods attempt to detect dependence of any form
At a high level, we can think of this as using a property of moment generating functions (MGFs)
Recall the moment generating function of a random variable \(X\) with density \(p\) is \[ M_X(t) = E\left[e^{tX}\right] = \int e^{tx}p(x)dx \] Theorem: If \(X\) and \(Y\) are random variables with joint MGF \(M_{X,Y}(s,t) = E\left[e^{sX+tY}\right]\), then \[ X\perp Y \Leftrightarrow M_{X,Y}(s,t) = M_X(s)M_Y(t) \text{ for all } s,t \] Proof sketch: \[ \begin{align} M_{X,Y}(s,t) = M_X(s)M_Y(t) \;\forall s,t &\Leftrightarrow \int e^{sx+ty}p(x,y)dxdy = \int e^{sx}p(x)dx \int e^{ty}p(y)dy \;\forall s,t \\ &\Leftrightarrow p(x,y)=p(x)p(y) \\ &\Leftrightarrow X\perp Y \end{align} \] In particular, independence implies \(M_{X+Y}(t) = M_X(t)M_Y(t)\) (take \(s=t\))
MGFs implicitly use higher order moments of \(X\) and \(Y\)
Using Taylor expansion, recall that \[e^x = \sum_{\ell=0}^\infty \frac{x^\ell}{\ell!}\]
So, independence actually requires \[M_{X+Y}(t) =\sum_{\ell=0}^\infty \frac{t^\ell E[(X+Y)^\ell]}{\ell!} = \left(\sum_{\ell=0}^\infty \frac{t^\ell E[X^\ell]}{\ell!}\right)\left(\sum_{\ell=0}^\infty \frac{t^\ell E[Y^\ell]}{\ell!}\right) = M_X(t)M_Y(t)\]
Should we estimate the higher order moments? \(\widehat{E[X^\ell]} = \frac{1}{n}\sum_{i=1}^n x_i^\ell\)
No: there will be some estimation error for each \(\ell\), and technically we would have to match infinitely many moments
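For intuition, a quick simulation (standard normal data and arbitrary moment orders, purely illustrative) shows how noisy high order empirical moments become:
set.seed(1)
z <- rnorm(200)
sapply(1:8, function(l) mean(z^l))   # empirical estimates of E[X^l]
# true values for a standard normal: 0, 1, 0, 3, 0, 15, 0, 105
# the estimation error grows quickly with l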
High-level functional analysis background:
An inner product space is a vector space, \(V\), with a map \(\langle\cdot, \cdot\rangle: V\times V\rightarrow \mathbb{R}\) that is symmetric, linear in each argument, and satisfies \(\langle x,x\rangle\geq 0\) with \(\langle x,x\rangle = 0\) only when \(x=0\)
Example: Euclidean dot product, \([a,b,c]\cdot[x,y,z] = ax+by+cz\)
Inner product spaces have norms: \(\|x\| = \sqrt{\langle x,x \rangle}\)
A norm induces a metric, \(d(x,y)=\|x-y\|\), so every normed space is a metric space
A Cauchy sequence is a sequence of vectors, \(x_1, x_2, x_3, \ldots\), such that for every \(\epsilon>0\) there is a positive integer, \(N\), with \(\|x_n-x_m\|<\epsilon\) for all \(n,m>N\)
A metric space, \(V\), is complete if every Cauchy sequence of points in \(V\) has a limit in \(V\); a complete inner product space is called a Hilbert space
Assume \(\mathcal X\subseteq \mathbb R^d\) is the data space (or feature space) and \(\mathcal H\) is a Hilbert space
Kernel: a function \(K: \mathcal X \times \mathcal X\rightarrow \mathbb R\) such that there exists a map (called the canonical feature map) \(\phi: \mathcal X\rightarrow \mathcal H\) with, for all \(x,y\in \mathcal X\), \[K(x,y) = \langle \phi(x), \phi(y)\rangle_{\mathcal{H}}\]
Breakdown: \(K\) gives the Hilbert space inner product of the images of \(x\) and \(y\) under \(\phi\)
This is helpful because doing operations directly in \(\mathcal H\) can be difficult
Choosing the kernel fully determines \(\mathcal H\)
Mercer kernels are positive definite: for all \(f\in L^2(\mathcal X)\), \[\int_{\mathcal X}\int_{\mathcal X} K(x,y)f(x)f(y)dxdy \geq 0\]
Recall positive definite matrix: \(x^TAx\geq 0\) for all \(x\)
Example: the Gaussian kernel \[K(x,y) = e^{-\frac{\|x-y\|^2}{\gamma}}\]
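For concreteness, a minimal sketch of computing a Gaussian kernel matrix in R (the function name gauss_kernel and the bandwidth choice are illustrative; this is not the gaussKernMat function used later):
gauss_kernel <- function(x, gamma = 1) {
  d2 <- as.matrix(dist(x, diag = TRUE, upper = TRUE))^2   # pairwise squared distances
  exp(-d2 / gamma)                                        # K(x, y) = exp(-||x - y||^2 / gamma)
}
K <- gauss_kernel(matrix(rnorm(20), ncol = 2))            # 10 points in R^2
dim(K)   # 10 x 10; symmetric, ones on the diagonal, positive semi-definite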
By fixing one argument of \(K\) at \(x\in\mathcal X\), we create a function \(K_x:\mathcal X\rightarrow \mathbb R\) defined by \(K_x(y) = K(x,y)\); each \(K_x\) is an element of \(\mathcal H\)
Consider the space \[\mathcal H_0 = \left\{f : f(x)=\sum_{j=1}^k\alpha_j K_{x_j}(x)\right\}\]
If \(f(x)=\sum_{j=1}^k\alpha_j K_{x_j}(x)\) and \(g(x)=\sum_{j=1}^m\beta_j K_{y_j}(x)\), we can define an inner product as \[\langle f, g\rangle_K = \sum_{i=1}^k\sum_{j=1}^m \alpha_i\beta_j K(x_i, y_j)\] (the norm is then \(\|f\|_K = \sqrt{\langle f,f\rangle_K}\))
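A small sketch of this inner product for one-dimensional inputs, using a Gaussian kernel; all names and the particular coefficients are illustrative:
rkhs_inner <- function(alpha, xs, beta, ys, gamma = 1) {
  K <- exp(-outer(xs, ys, function(a, b) (a - b)^2) / gamma)   # K(x_i, y_j)
  as.numeric(t(alpha) %*% K %*% beta)                          # sum_ij alpha_i beta_j K(x_i, y_j)
}
# f = 1*K_0 - 0.5*K_1 and g = 2*K_0.5
rkhs_inner(c(1, -0.5), c(0, 1), c(2), c(0.5))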
A functional takes functions as input and outputs a real number
The evaluation functional, \(F_x\), maps a function \(f\in\mathcal{H}\) to \(f(x)\), that is, \(F_x f = f(x)\)
Alternative definition: A Hilbert space \(\mathcal H\) of functions is a reproducing kernel Hilbert space if evaluation functionals are bounded
Riesz representation theorem: if \(\mathbf{A}: \mathcal{H}\rightarrow \mathbb{R}\) is a bounded linear functional on a Hilbert space \(\mathcal H\), there exists some \(g_{\mathbf A}\in\mathcal H\) such that \[\mathbf{A}f = \langle f, g_{\mathbf A}\rangle_{\mathcal H}\] for all \(f\in\mathcal H\)
We use Riesz representation theorem to prove a sufficient condition for the existence of the kernel mean embedding in an RKHS
Because \(F_x\) is a bounded linear functional, Riesz gives a \(g_x\in\mathcal H\) such that \[F_xf = \langle g_x, f\rangle = f(x)\] (in fact \(g_x = K_x\); this is the reproducing property)
Proof of Riesz representation theorem:
Let \(\mathcal P\) be the set of probability densities on \(\mathcal X\)
Let \(P_X\) be the CDF (\(P_X(x) = P(X\leq x)=\int_{-\infty}^x p_X(y)dy\)) and \(p_X\) the density (pdf)
A kernel is characteristic if the mapping \(\mu:\mathcal P\rightarrow \mathcal H\) defined by \[\mu_P = \int_{\mathcal X} k(x,\cdot) p(x)dx\] is one-to-one (\(\mu_P=\mu_Q \Rightarrow p=q\))
To show that a kernel, \(k\), is characteristic, we must show that for two CDFs, \(P\) and \(Q\), \(\mu_P=\mu_Q\) implies that \(P=Q\).
The function \(\mu\) is the kernel mean embedding function
The following theorem gives a general characterization of characteristic kernels. Here we prove only the sufficient condition for a kernel to be characteristic, though the condition is also necessary.
Definition of square integrable function: \(L^2(\mathcal X, P) = \{f: \int_{\mathcal X} f(x)^2dP(x) <\infty\}\) (Recall \(\int_{\mathcal X} f(x)^2dP(x) = \int_{\mathcal X} f(x)^2p(x)dx\))
Theorem Let \(\mathcal H\) be an RKHS with corresponding measurable, bounded kernel, \(k\), on a measurable space \((\mathcal X, \mathcal B)\). If \(\mathcal H + \mathbb R\) (direct sum) is dense in \(L^2(\mathcal X,P)\) for any \(P\in\mathcal P\), then \(k\) is characteristic.
Proof:
We want to show that the Gaussian kernel, \(k(x,y)=\exp\left(-\frac{\|x-y\|}{2\sigma}\right)\), is characteristic.
It turns out there is still more work to do to show this. The theorem below helps prove that the Gaussian kernel is characteristic.
The Fourier transform of a function \(\phi\) is \[\tilde\phi(\xi) := \int_{\mathcal X} e^{-it\xi}\phi(t)dt.\]
Theorem: Assume that \(k\) is translation invariant; that is, \(k(x,y)=\phi(x-y)\) for some function \(\phi\). If for all \(\xi\in\mathbb R^m\), there exists \(\tau_0\) such that \[\begin{equation} \int \frac{\tilde\phi(\tau(u+\xi))^2}{\tilde\phi(u)}du<\infty \end{equation}\] for all \(\tau>\tau_0\) then \(\mathcal H\) is dense in \(L^2(P)\) for any Borel probability measure \(P\) on \(\mathbb R^m\).
Proof: Without loss of generality, assume that \(\phi(0)=1\). Using positive definiteness of \(k\), we have that \(|\phi(z)|^2\leq \phi(0)^2=1\). Now, the RKHS associated with \(k\) consists of the functions \[\mathcal H = \left\{f\in L^2(\mathbb R^m) : \int \frac{|\tilde f(u)|^2}{\tilde\phi(u)}du<\infty\right\},\] where \(\tilde f\) and \(\tilde\phi\) are the Fourier transforms of \(f\) and \(\phi\).
Let \(P\) be an arbitrary probability measure on \(\mathbb R^m\) and \(\xi\in\mathbb R^m\). Notice that the Fourier transform of \(e^{-i\xi^Tz}\phi(z/\tau)\) is, up to a constant factor, \(\tilde\phi(\tau(u+\xi))\). From the assumption of the theorem we must have that \(e^{-i\xi^Tz}\phi(z/\tau)\in\mathcal H\) for all \(\tau>\tau_0\). Since \(\phi(z/\tau)\rightarrow 1\) as \(\tau\rightarrow\infty\) and \(e^{-i\xi^Tz}\phi(z/\tau)\) is bounded by a constant, we can use the dominated convergence theorem to ensure that \[E_P\left[e^{-i\xi^Tz} - e^{-i\xi^Tz}\phi(z/\tau)\right]\rightarrow 0 \text{ as } \tau\rightarrow\infty.\] This shows that we only have to prove that the span of \(\mathcal A :=\{e^{-i\xi^Tz}:\xi\in\mathbb R^m\}\) is dense in \(L^2(P)\).
Now, let \(f\in L^2(P)\). We can assume that \(f\) is differentiable with compact support, since such functions are dense in \(L^2(P)\). Let \(\epsilon>0\), \(M=\sup_{z\in\mathbb R^m}|f(z)|\), and choose \(A\) such that \([-A,A]^m\) contains the support of \(f\) and \(P([-A,A]^m)>1-\epsilon/4M^2\). From a functional analysis text, we have that the series of functions \[f_N(z) = \sum_{n_1=-N}^N \dots\sum_{n_m=-N}^N c_n\exp\left\{\frac{i\pi}{A}n^Tz\right\} \text{ for } N\in\mathbb{N},\] where \[c_n = \frac{1}{(2A)^m}\int_{[-A,A]^m}f(z)\exp\left\{-\frac{i\pi}{A}n^Tz\right\}dz,\] converges uniformly to \(f(z)\) on \([-A,A]^m\) as \(N\rightarrow\infty\). Then there must be some \(N\) large enough such that \(|f(z)-f_N(z)|^2<\epsilon/2\) on \([-A,A]^m\), and from the definition of \(f_N(z)\) it must be the case that \[\sup_{z\in\mathbb R^m}|f_N(z)|^2\leq\left(M+\sqrt{\frac{\epsilon}{2}}\right)^2 < 2M^2\] for \(\epsilon\) small enough. Then we have that \(E_P|f-f_N|^2<\epsilon\), so that span\(\mathcal A\) is dense in \(L^2(P)\).
Now, to apply this theorem to the Gaussian kernel, we must first compute \(\tilde \phi\), where \(\phi(\xi)=\exp\left(-\frac{|\xi|}{2\sigma}\right)\): \[\tilde \phi(\xi) = \frac{1}{2\sigma\pi(\xi^2+(2\sigma)^{-2})}~.\]
Then show that for some \(\tau_0\), \(\int \frac{\tilde\phi(\tau(u+\xi))^2}{\tilde\phi(u)}du<\infty\) for all \(\tau>\tau_0\) \[\int \frac{\tilde\phi(\tau(u+\xi))^2}{\tilde\phi(u)}du = \int \frac{u^2+(2\sigma)^{-2}}{2\sigma\pi[(\tau(u+\xi))^2+(2\sigma)^{-2}]^2} du\] The integrand is continuous and decays like \(u^{-2}\) as \(|u|\rightarrow\infty\), so the integral is finite for every \(\tau>0\), and the condition of the theorem holds.
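A numerical sanity check of this finiteness claim (the values of sigma, tau, and xi are arbitrary illustrative choices):
sigma <- 1; tau <- 2; xi <- 0.5
phi_ft <- function(u) 1 / (2 * sigma * pi * (u^2 + (2 * sigma)^-2))   # Fourier transform above
integrand <- function(u) phi_ft(tau * (u + xi))^2 / phi_ft(u)
integrate(integrand, -Inf, Inf)    # converges to a finite value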
Let \(X,Y\) be random variables with marginal densities \(p_X\) and \(p_Y\), reproducing kernels \(k\) and \(\ell\), and corresponding RKHSs \(\mathcal H\) and \(\mathcal F\), respectively
Assume \(E[k(X,X)], E[\ell(Y,Y)]<\infty\)
Cross covariance operator (uncentered): \(\mathcal C_{YX}:\mathcal H\rightarrow\mathcal F\) defined as \[ \mathcal C_{YX} := E_{YX}[\ell(Y,\cdot)\otimes k(X,\cdot)] = \mu_{p_{YX}}\] where \(p_{YX}\) is the joint distribution of \(X\) and \(Y\) and \(\otimes\) is the tensor product
Using a tensor product works here because \(\mathcal H\otimes \mathcal F\) and \(HS(\mathcal H, \mathcal F)\) are isomorphic via the map \[\sum_{i\in I} f_i\otimes g_i \mapsto \left[h \mapsto \sum_{i\in I} \langle h, f_i\rangle_{\mathcal H} \cdot g_i\right]\]
\(\mathcal C_{YX}\) can be thought of as the expected value of a tensor product and as a mapping from \(\mathcal H\) to \(\mathcal F\)
As a mapping: \[\begin{align} \langle g, \mathcal C_{YX} f\rangle_{\mathcal F} &= \left\langle E_{YX}[\ell(Y,\cdot)\otimes k(X,\cdot)], g\otimes f \right\rangle_{\mathcal F\otimes\mathcal H} & \text{definition of }\mathcal C_{YX} \\ &= E_{YX}\left[ \left\langle \ell(Y,\cdot)\otimes k(X,\cdot), g\otimes f \right\rangle_{\mathcal F\otimes\mathcal H} \right] & \text{linearity of expectations} \\ &= E_{YX}\left[ \left\langle \ell(Y,\cdot), g \right\rangle_{\mathcal F}\cdot \left\langle k(X,\cdot), f \right\rangle_{\mathcal H} \right] & \text{definition of tensor product}\\ &= E_{YX}[g(Y)f(X)] & \text{reproducing kernel property} \end{align}\]
Covariance Operator: If \(X=Y\), then \(\mathcal C_{XX}\) maps \(f\in\mathcal H\) such that for \(g\in\mathcal H\) \(\langle g, \mathcal C_{XX} f\rangle = E_X[g(X)f(X)]\)
\(\mathcal C_{XY}\) is the adjoint of \(\mathcal C_{YX}\): for \(f\in\mathcal H, g\in\mathcal F, \langle g, \mathcal C_{YX} f\rangle_{\mathcal F} = \langle \mathcal C_{XY}g,f\rangle_{\mathcal H}\)
We can also write \(\mathcal C_{YX}\) and \(\mathcal C_{XX}\) as integral operators: \[\mathcal C_{YX}f = \int_{\mathcal X\times\mathcal Y} \ell(\cdot, y)f(x)\, p_{XY}(x,y)\,dxdy \qquad \text{and} \qquad \mathcal C_{XX}f = \int_{\mathcal X} k(\cdot, x)f(x)\, p_{X}(x)\,dx\]
Using the first equation, \[\begin{align*} \langle g, \mathcal C_{YX} f\rangle_{\mathcal F} &= \left\langle g, \int_{\mathcal X\times\mathcal Y} \ell(\cdot, y)f(x) p_{XY}(x,y)dxdy\right\rangle_{\mathcal F} &\text{first equation above} \\ &= \int_{\mathcal X\times\mathcal Y} \langle g,\ell(\cdot, y)\rangle_{\mathcal F} f(x) p_{XY}(x,y)dxdy &\text{DCT} \\ &= \int_{\mathcal X\times\mathcal Y} g(y)f(x) p_{XY}(x,y)dxdy &\text{Reproducing kernel} \\ &= E_{YX}[g(Y)f(X)] &\text{Definition of expectation} \end{align*}\]
Theorem (Fukumizu 2004): If \(E_{Y}[g(Y)|X=\cdot]\in\mathcal H\), then for \(g\in\mathcal F\), \[\mathcal C_{XX} E_{Y}[g(Y)|X=\cdot] = \mathcal C_{XY} g~.\]
Centered cross covariance operator: \(E_{YX}[(\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X)]\)
Given an IID sample, \((x_1,y_1), \dots, (x_n, y_n)\), we estimate the centered cross covariance operator as \[ \hat{\mathcal{C}}_{YX} = \frac{1}{n}\sum_{i=1}^n \left(\ell(y_i,\cdot)-\hat\mu_{p_Y}\right)\otimes \left( k(x_i,\cdot)-\hat\mu_{p_X}\right) = \frac{1}{n}\psi^T H \phi \] where \(\psi = [\ell(y_1,\cdot), \dots, \ell(y_n,\cdot)]^T\), \(\phi = [k(x_1,\cdot),\dots, k(x_n, \cdot)]^T\), and \(H = I_n-\frac{1}{n} 1_{n\times n}\)
Berlinet and Thomas-Agnan (2004) showed that \[\|\hat{\mathcal C}_{YX} - \mathcal C_{YX}\|_{HS}=O_P(1/\sqrt n)\] as \(n\rightarrow\infty\)
Theorem (Baker, 1973): There exists a unique bounded operator, \(\mathcal V_{YX}: \mathcal H\rightarrow\mathcal F\) with \(\|\mathcal V_{YX}\|_{HS}\leq 1\) such that \[\mathcal C_{YX} = \mathcal C_{YY}^{1/2}\mathcal V_{YX} \mathcal C_{XX}^{1/2}\].
\(\mathcal V_{YX}\) is called the normalized cross covariance operator
\(\mathcal V_{YX}\) captures the dependence information between \(X\) and \(Y\) with less influence from the marginal distributions
Why is this helpful for independence? If we can find \(f\) and \(g\) that make the expectation above look like a product of moment generating functions, we can use that to show independence.
The next theorem shows how the cross covariance operator and independence are related
Theorem: if the kernels are characteristic, \(\|\tilde{\mathcal C}_{YX}\|_{HS}=0\) if and only if \(X \perp Y\), where \(\tilde{\mathcal C}_{YX}\) is the centered cross covariance operator
Proof: \[\begin{align*} \|\tilde C_{YX}\|_{HS}=0 &\Leftrightarrow \tilde C_{YX}=0 \\ &\Leftrightarrow \mu_{P_{YX}}=\mu_{P_Y}\otimes\mu_{P_X} & \text{because } \mu_{P_{YX}}-\mu_{P_Y}\otimes\mu_{P_X}=\tilde C_{YX} \\ &\Leftrightarrow P_{YX} = P_XP_Y \\ &\Leftrightarrow X\perp Y \end{align*}\]
How do we evaluate \(\|\tilde{\mathcal C}_{YX}\|_{HS}\)?
Recall that for any Hilbert-Schmidt operator, \(\mathcal A\), \[\|\mathcal A\|_{HS}^2 = \sum_{i\in I} \sum_{j\in J} |\langle \mathcal A f_j, h_i\rangle|^2\] where \(\{f_j\}_{j\in J}\) and \(\{h_i\}_{i\in I}\) are orthonormal bases of the domain and codomain
We can also think of the summations like integrals
Example: If \(X\sim P\) and \(X'\) is an independent copy of \(X\), then \(\|\mu_P\|_{\mathcal H}^2 = \left\langle E[k(X,\cdot)], E[k(X',\cdot)] \right\rangle = E_{XX'}[k(X,X')] = \int_{\mathcal X}\int_{\mathcal X} k(x,x')p_X(x)p_X(x')dxdx'\)
Similarly \(\hat \mu_P =\frac{1}{n}\sum_{i=1}^n k(x_i,\cdot)\), so \(\|\hat\mu_P\|_{\mathcal H}^2 = \left\langle \frac{1}{n}\sum_{i=1}^n k(x_i,\cdot), \frac{1}{n}\sum_{i=1}^n k(x_i,\cdot) \right\rangle = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n k(x_i,x_j)\)
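A small numerical check of this identity (the kernel, bandwidth, and data below are arbitrary choices):
set.seed(1)
x_samp <- rnorm(30)
K <- exp(-as.matrix(dist(x_samp))^2 / 2)   # Gaussian kernel matrix k(x_i, x_j)
mean(K)                                    # = (1/n^2) sum_ij k(x_i, x_j) = squared norm of mu_hat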
Big Picture: We are going to use the data to estimate \(\|\hat{\mathcal C}_{YX}\|\), as a test statistic. It is very unlikely that \(\|\hat{\mathcal C}_{YX}\|=0\) but if we can figure out its distribution under the null hypothesis, we can check if it’s close enough.
How do we use this to evaluate HS norms? There are several ways!
Gretton 2007 Kernel Test of Independence
Let \(X'\) and \(Y'\) be independent copies of \(X\) and \(Y\), respectively. \[\begin{align*} &\|\tilde{\mathcal C}_{YX}\|_{HS}^2 \\ &= \|E_{YX}[(\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X)] \|_{HS}^2 \\ &= \langle E_{YX}[(\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X)], E_{Y'X'}[(\ell(Y',\cdot)-\mu_Y)\otimes(k(X',\cdot)-\mu_X)] \rangle_{HS} \\ &= E_{YY'XX'}[\langle (\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X), (\ell(Y',\cdot)-\mu_Y)\otimes(k(X',\cdot)-\mu_X) \rangle_{HS}] \\ &= E_{YY'XX'}\left[\left(\ell(Y,Y')-\langle \mu_Y, \ell(Y',\cdot)\rangle -\langle \mu_Y, \ell(Y,\cdot)\rangle+ \|\mu_Y\|^2\right)\times\left(k(X,X') -\langle k(X,\cdot), \mu_X\rangle -\langle k(X',\cdot), \mu_X\rangle+\|\mu_X\|^2\right)\right] \\ &= E_{YY'XX'}\left[\ell(Y,Y')k(X,X')\right] -2E_{YX}\left[E_{Y'}[\ell(Y,Y')]E_{X'}[k(X,X')]\right] +E_{YY'}[\ell(Y,Y')]E_{XX'}[k(X,X')] \end{align*}\] This can be estimated as a U-statistic \[HSIC = \frac{1}{\binom{n}{2}}\sum_{(i,j)} k(x_i,x_j)\ell(y_i,y_j) -\frac{2}{\binom{n}{3}}\sum_{(i,j,q)} k(x_i,x_j)\ell(y_i,y_q) +\frac{1}{\binom{n}{4}}\sum_{(i,j,q,r)} k(x_i,x_j)\ell(y_r,y_q)\] where each tuple runs over all unique value combinations
And as a V-statistic \[HSIC_b = \frac{1}{n^2}\sum_{i,j}^n k(x_i,x_j)\ell(y_i,y_j) -\frac{2}{n^3}\sum_{i,j,q}^n k(x_i,x_j)\ell(y_i,y_q) +\frac{1}{n^4}\sum_{i,j,q,r}^n k(x_i,x_j)\ell(y_r,y_q) = \frac{1}{n^2}\text{tr}(KHLH)\] where \(K_{ij} = k(x_i,x_j)\), \(L_{ij}= \ell(y_i,y_j)\), \(H=I_n -\frac{1}{n}\mathbf{1}\mathbf{1}^T\)
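As a quick illustration, here is a minimal R sketch of \(HSIC_b = \frac{1}{n^2}\text{tr}(KHLH)\) with Gaussian kernels; the function name hsic_b and the fixed bandwidth are illustrative choices, not from the paper:
hsic_b <- function(x, y, gamma = 1) {
  n <- length(x)
  K <- exp(-as.matrix(dist(x))^2 / gamma)   # K_ij = k(x_i, x_j)
  L <- exp(-as.matrix(dist(y))^2 / gamma)   # L_ij = l(y_i, y_j)
  H <- diag(n) - 1/n                        # centering matrix
  sum(diag(K %*% H %*% L %*% H)) / n^2      # (1/n^2) tr(K H L H)
}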
Distribution of HSIC: The paper shows that, under the null hypothesis (\(X\perp Y\)), \[n\cdot HSIC \xrightarrow{D} \sum_{l=1}^\infty \lambda_l z_l^2\] where the \(z_l\) are independent standard normal random variables and the \(\lambda_l\) are unknown, theoretical eigenvalues. In practice, we can either approximate its distribution under the null hypothesis with a Gamma distribution or use a permutation test.
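The permutation approach is straightforward: hold the x sample fixed, permute the y sample to break any dependence, and compare the observed statistic to the resulting permutation distribution. A sketch using the hypothetical hsic_b above:
hsic_perm_test <- function(x, y, B = 200, gamma = 1) {
  obs <- hsic_b(x, y, gamma)
  null_stats <- replicate(B, hsic_b(x, sample(y), gamma))  # permuting y enforces the null
  (1 + sum(null_stats >= obs)) / (B + 1)                   # permutation p-value
}
# Example with the parabolic relationship that correlation missed earlier:
set.seed(3)
x <- runif(200) * 10
y <- (x - 5)^2 + 2 * rnorm(200)
hsic_perm_test(x, y)   # typically a very small p-value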
Fukumizu 2007 Kernel Conditional Independence
Zhang 2012 Kernel Conditional Test
This paper builds on the previous paper by slightly changing the estimator and providing a null distribution
Big Picture: Use Gaussian process regression to estimate \(E[X|Z]\) and \(E[Y|Z]\) to determine independence
The paper estimates \(\tilde K_{\ddot X|Z}\) as \[\tilde K_{\ddot X|Z} := R_Z \tilde K_{\ddot X} R_Z\] where \(R_Z\) is calculated from the regression
\(\tilde K_{Y|Z}\) is calculated similarly
KCI test statistic: \[T_{CI} := \frac{1}{n}\text{tr}\left(\tilde K_{\ddot X|Z} \tilde K_{Y|Z}\right)\]
Under the null hypothesis (\(X\perp Y|Z\)), \[T_{CI} \sim \frac{1}{n} \sum_{k=1}^{n^2} \lambda_k Z_k^2\] where \(\lambda_k\) are eigenvalues and \(Z_k\sim N(0,1)\) (independent standard normal RVs)
The distribution above can be approximated using a gamma distribution: \[Gamma(k,\theta) = \frac{t^{k-1} \exp\left(-\frac{t}{\theta}\right)}{\theta^k \Gamma(k)}\] where \[k = \frac{(E[T_{CI}])^2}{Var[T_{CI}]} \text{ and } \theta = \frac{Var[T_{CI}]}{E[T_{CI}]}\]
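This moment matching step is what the code below does with th.mean and th.var; a hypothetical helper that turns an estimated null mean and variance into a p-value might look like:
gamma_pval <- function(stat, null_mean, null_var) {
  shape <- null_mean^2 / null_var              # k
  scale <- null_var / null_mean                # theta
  pgamma(stat, shape = shape, scale = scale, lower.tail = FALSE)
}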
Recall the eigenvalue decomposition of a symmetric square matrix, \(A\): \(A = U\Lambda U^T\), where the columns of \(U\) are orthonormal eigenvectors and \(\Lambda\) is diagonal with the corresponding eigenvalues
Note that \(K_X = [k(x_i,x_j)]_{ij}= [k(x_1,\cdot), \dots, k(x_n,\cdot)]^T \otimes [k(x_1,\cdot), \dots, k(x_n,\cdot)]^T\)
If \(\tilde K_X = HK_XH\) is the centered kernel matrix with eigendecomposition \(\tilde K_X = U_X \Lambda_X U_X^T\), we can estimate the centered kernel map, \([k(x_1,\cdot)-\hat\mu_X, \dots, k(x_n,\cdot)-\hat\mu_X]^T\), as \[U_X \Lambda_X^{1/2},\] which satisfies \(\left(U_X \Lambda_X^{1/2}\right)\left(U_X \Lambda_X^{1/2}\right)^T = \tilde K_X\)
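A self-contained sketch of this feature map estimate (small simulated data, Gaussian kernel with an arbitrary bandwidth):
set.seed(1)
xs <- rnorm(8)
K  <- exp(-as.matrix(dist(xs))^2 / 2)        # kernel matrix K_X
H  <- diag(length(xs)) - 1/length(xs)
Kt <- H %*% K %*% H                          # centered kernel matrix, tilde K_X
eg <- eigen(Kt, symmetric = TRUE)
keep <- eg$values > 1e-10                    # drop numerically zero eigenvalues
phi  <- eg$vectors[, keep] %*% diag(sqrt(eg$values[keep]), nrow = sum(keep))
max(abs(phi %*% t(phi) - Kt))                # ~ 0: phi phi^T reproduces tilde K_X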
Taking a closer look at the code
Below: threshold parameters
#The program codes the KCI statistic
#Zhang 2012 Kernel-based conditional independence test...
setwd(".")
epsilon <- 0 #10^-2
eigThresh <- 10^-10
thres <- 1e-5
# trProd computes tr(A %*% B) without forming the full matrix product
trProd <- function(A, B){
stopifnot(ncol(A) == nrow(B))
sum(sapply(1:nrow(A),
function(ii){A[ii,] %*% B[,ii]}
)
)
}
# trProdTrans computes tr(A %*% t(A)), the sum of squared entries of A
trProdTrans <- function(A){
sum(sapply(1:nrow(A),
function(ii){A[ii,] %*% A[ii,]}
)
)
}
# cholsky: Cholesky factorization, adding a small ridge if the matrix is not positive definite
cholsky <- function(origMat){
cholStatus <- try(u <- chol(origMat), silent = TRUE)
cholError <- ifelse(class(cholStatus) == "try-error", TRUE, FALSE)
if (cholError){
u <- chol(origMat + 0.01 * diag(dim(origMat)[1]))
}
return(u)
}
gaussKernMat <- function(dat, theta){
exp(as.matrix(-dist(dat, diag=TRUE, upper=TRUE)^2*theta/2))
}
ardKernel
automatic relevance determination kernel: minimizes features not associated with outcomeardGrad
gradient of ardKernel
, used for optimizationgpLogLik
log-likelihood for GP RegressiongplogLikGrad
gradient of gpLogLik
, used for optimization# ard: automatic relevance determination
ardKernel <- function(params, x, noise=TRUE){
#This function is the ARD kernel
#params = (variable precision params, signal variance, model variance)
x <- as.matrix(x)
xDim <- dim(x)
prec_param <- exp(params[1:xDim[2]])
sig_var <- exp(2*params[xDim[2] + 1])
if(noise){
mod_var <- exp(2 * params[xDim[2] + 2])
} else { mod_var <- 0}
kern <- sig_var *
exp(-as.matrix(dist(
x %*% diag(1/prec_param, nrow=length(prec_param)),
diag=TRUE, upper = TRUE)^2/2
)
) + (mod_var + epsilon) * diag(xDim[1])
return((kern+t(kern))/2)
}
ardGrad <- function(params, x, coord){
xDim <- dim(x)
prec_param <- exp(params[1:xDim[2]])
sig_var <- exp(2*params[xDim[2] + 1])
mod_var <- exp(2*(params[xDim[2] + 2]))
if(coord <= xDim[2]){
kern <- #sig_var *
exp(-as.matrix(dist(
x %*% diag(1/prec_param, nrow=length(prec_param)),
diag=TRUE, upper = TRUE)^2/2
)
)
out <- kern *
as.matrix(dist(x[,coord],
diag=TRUE, upper=TRUE)/prec_param[coord])^2
return(out)
}
if(coord == xDim[2]+1){
kern <- sig_var *
exp(-as.matrix(dist(
x %*% diag(1/prec_param, nrow=length(prec_param)),
diag=TRUE, upper = TRUE)^2/2
)
)
return(2 * kern)
}
if(coord == xDim[2]+2){
return(2 * mod_var * diag(xDim[1]))
}
}
gplogLik <- function(params, x, y){
xDim <- dim(x)
sig_var <- exp(2*params[xDim[2] + 1]) ### Check this out
x <- as.matrix(x)
y <- as.matrix(y)
xDim <- dim(x)
yDim <- dim(y)
kern <- ardKernel(params, x)
L <- chol(kern)
alpha <- solve(kern, y,
system=signature(a = "dppMatrix", b = "dsparseMatrix"))
0.5 * sum(diag(t(y) %*% alpha)) +
yDim[2]*(0.5 * log(sig_var) + sum(log(diag(L)))) +
0.5 * xDim[1] * yDim[2] * log(2*pi)
}
gplogLikGrad <- function(params, x, y){
paramNum <- length(params)
yDim <- dim(y)
out <- rep(NA, paramNum)
kern <- ardKernel(params, x)
alpha <- solve(kern, y,
system=signature(a = "dppMatrix", b = "dsparseMatrix"))
W <- yDim[2] * solve(kern, system=signature(a = "dgCMatrix")) -
tcrossprod(alpha)
for(itr in 1:paramNum){
out[itr] <- sum((W * ardGrad(params, x, itr)))/2
}
return(out)
}
kciTest <- function(x, y, z=NULL, theta=NULL){
x <- as.matrix(x)
y <- as.matrix(y)
x <- scale(x) #scale standardizes columns
y <- scale(y)
nobs <- dim(x)[1]
H <- diag(nobs) - matrix(1, nrow=nobs, ncol=nobs)/nobs
# define the bandwidth first; it is also needed below when a conditioning set z is supplied
if(nobs <= 200){
width <- 1.2
} else if(nobs > 1200){
width <- 0.4
} else {width <- 0.7}
if(is.null(theta)){
theta <- 1/width^2
}
if(is.null(z)){ # if nothing is entered for z, runs independence test
kernMatX <- gaussKernMat(x, theta)
kernMatY <- gaussKernMat(y, theta)
kx <- H %*% (kernMatX %*% H)
ky <- H %*% (kernMatY %*% H)
test.stat <- sum(diag(kx %*% ky))/nobs
th.mean <- sum(diag(kx)) * sum(diag(ky)) / nobs^2
th.var <- 2 * trProd(kx, kx) * trProd(ky, ky) / nobs^4
} else{ # if z entered, runs conditional independence test
z <- as.matrix(z)
z <- scale(z)
zDim <- dim(z)[2]
theta <- 1/(width^2 * zDim)
# centralized kernel matrix for ddot X = (X,Z), tilde K_X
kernMatXd <- gaussKernMat(cbind(x,z/2), theta)
kxd <- H %*% (kernMatXd %*% H)
# centralized kernel matrix for Y, tilde K_Y
kernMatY <- gaussKernMat(y, theta)
ky <- H %*% (kernMatY %*% H)
eigx <- eigen(kxd, symmetric = TRUE)
eigy <- eigen(ky, symmetric = TRUE)
# list of T/F for keeping eigenvalues, keep large enough values
eigx_keep <- eigx$value >= thres * eigx$value[1]
eigy_keep <- eigy$value >= thres * eigy$value[1]
# getting sqrt(K_X), sqrt(K_Y)
mapx <- eigx$vectors[,eigx_keep] %*%
diag(sqrt(eigx$values[eigx_keep]), nrow=sum(eigx_keep))
mapy <- eigy$vectors[,eigy_keep] %*%
diag(sqrt(eigy$values[eigy_keep]), nrow=sum(eigy_keep))
# getting GP parameters
init_eta <- log(width * sqrt(zDim))
init_params <- c(rep(init_eta, zDim), 0, log(sqrt(0.1)))
opt_x <- optim(init_params, gplogLik, gplogLikGrad,
x=z, y=2*sqrt(nobs) * mapx/sqrt(eigx$value[1]))
opt_y <- optim(init_params, gplogLik, gplogLikGrad,
x=z, y=2*sqrt(nobs) * mapy/sqrt(eigy$value[1]))
opt_hyp_x <- opt_x$par
opt_hyp_y <- opt_y$par
# calculating regression matrix R_Z in the notes above
Px <- (diag(nobs) - ardKernel(opt_hyp_x, z, FALSE) %*%
solve(ardKernel(opt_hyp_x, z)))
Py <- (diag(nobs) - ardKernel(opt_hyp_y, z, FALSE) %*%
solve(ardKernel(opt_hyp_y, z)))
# tilde K_{ddot X|Z}
kxdz <- Px %*% (kxd %*% t(Px))
# tilde K_{Y|Z}
kyz <- Py %*% (ky %*% t(Py))
# test statistic
test.stat <- trProd(kxdz, kyz)
# get eigenvalues/vectors
eigx <- eigen(kxdz, symmetric = TRUE)
eigy <- eigen(kyz, symmetric = TRUE)
# count how many eigenvalues are large enough to keep
thrsX <- sum(eigx$values >= eigThresh)
thrsY <- sum(eigy$values >= eigThresh)
# calculate sqrt(tilde K_X), etc
mapx <- eigx$vectors[,1:thrsX] %*%
diag(sqrt(eigx$values[1:thrsX]), nrow=thrsX)
mapy <- eigy$vectors[,1:thrsY] %*%
diag(sqrt(eigy$values[1:thrsY]), nrow=thrsY)
w <- do.call(cbind, #stacking M_t
lapply(1:nobs,
function(t) as.vector(mapx[t,] %*% t(mapy[t,]))
)
)
# get gamma distribution parameters
prod_w <- crossprod(w)
sq_prod_w <- crossprod(prod_w)
trww <- sum(diag(prod_w))
th.mean <- trww #/ nobs
sq_trww <- sum(diag(sq_prod_w))
th.var <- 2 * sq_trww #/ nobs ^ 2
}
shape = th.mean^2 / th.var
scale = th.var / th.mean
p.val <- pgamma(test.stat, shape = shape, scale = scale,
lower.tail=FALSE) #gamma approx
return(p.val)
}
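A quick sanity check of kciTest on simulated data (assuming the functions above have been sourced; the data generating process here is an arbitrary illustration):
set.seed(2)
n <- 100
Z <- rnorm(n)
X <- Z + 0.5 * rnorm(n)
Y <- Z^2 + 0.5 * rnorm(n)
kciTest(X, Y)       # unconditional test: X and Y are dependent, so expect a small p-value
kciTest(X, Y, Z)    # conditional test: X independent of Y given Z, so expect a larger p-value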
kciWrap: wrapper so that kciTest can be used with the pc algorithm
suffStat: the dataset passed through by the algorithm
kciWrap <- function(x,y,S, suffStat){
# Uses kciTest to work on data
# inputs:
# x = name of x variable
# y = name of y variable
# S = name of variables in conditioning set
# suffStat = sufficient statistic (entire dataset here)
# output: p-value from KCI test
compObs <- complete.cases(suffStat[,c(x,y,S)])
compDat <- suffStat[compObs,]
if (length(S) == 0) {
error <- try(res <- kciTest(compDat[,x], compDat[,y]))
if(class(error) == "try-error") res <- NA
} else {
error <- try(res <- kciTest(compDat[,x], compDat[,y], compDat[,S]))
if(class(error) == "try-error") res <- NA
}
res
}
neighbors <- function(outcome, data, alpha, verbose=FALSE){
nfeats <- dim(data)[2]
rfeats <- 1:nfeats
target <- nfeats + 1
dat <- cbind(data, outcome)
for(csize in 0:(nfeats-1)){
if(length(rfeats) - 1 < csize){
return(rfeats)
break
}
current <- rfeats
for(tvar in current){
condvars <- rfeats[tvar != rfeats]
if(length(condvars) != 1){
condsets <- combn(condvars, csize)
} else {
condsets <- as.matrix(condvars)
}
ncondsets <- dim(condsets)[2]
for(itr in 1:ncondsets){
if(verbose) cat("y=", tvar, ", S=",condsets[,itr],", pvalue= ")
pval <- kciWrap(target, tvar, condsets[,itr], dat)
##pval <- myCItest(target, tvar, condsets[,itr], dat)
if(verbose) cat(pval, "\n")
if(pval >= alpha & !is.na(pval)){
rfeats <- rfeats[rfeats != tvar]
break
}
}
if(verbose) cat("Remaining: ", rfeats, "\n")
}
}
return(rfeats)
}
If we change X, does Y also change?
## treatment symptoms mortality
## 687 experimental severe alive
## 239 experimental severe died
## 530 placebo mild died
## 846 placebo mild alive
## 346 experimental mild died
## 634 placebo mild alive
## 509 placebo mild alive
## 475 experimental severe alive
## 213 placebo severe died
## 431 placebo mild alive
## 632 experimental mild alive
## 630 experimental severe alive
## 318 placebo severe died
## 991 placebo mild alive
## 648 placebo severe alive
## 880 placebo mild alive
## 136 experimental severe died
## 655 experimental severe alive
## 497 placebo mild alive
## 358 placebo mild alive
##
## experimental placebo
## 487 513
##
## mild severe
## 514 486
##
## alive died
## 754 246
table(df$treatment, df$mortality)[,2]/table(df$treatment)
##
## experimental placebo
## 0.2648871 0.2280702
table(df$treatment, df$symptoms, df$mortality)[,,2]/table(df$treatment, df$symptoms)
##
## mild severe
## experimental 0.07352941 0.33903134
## placebo 0.16402116 0.40740741
table(df$treatment, df$symptoms)
##
## mild severe
## experimental 136 351
## placebo 378 135
chisq.test(df$treatment, df$symptoms)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$treatment and df$symptoms
## X-squared = 207.58, df = 1, p-value < 2.2e-16
X and Y are confounded if they are both influenced by a third variable
Example: treatment <- severity -> mortality and treatment -> mortality
The association between treatment and mortality includes the treatment <- severity -> mortality pathway as well as treatment -> mortality
A and B are confounded by C and have no direct causal relationship (simulated below)
size <- 500
C <- 10*runif(size)
X <- 2*C + rnorm(size)
Y <- 3*C + rnorm(size)
summary(lm(Y~X))
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6491 -1.2526 -0.0418 1.2540 4.1460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6102 0.1522 4.01 7.01e-05 ***
## X 1.4342 0.0131 109.52 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.74 on 498 degrees of freedom
## Multiple R-squared: 0.9601, Adjusted R-squared: 0.9601
## F-statistic: 1.199e+04 on 1 and 498 DF, p-value: < 2.2e-16
summary(lm(Y~X+C))
##
## Call:
## lm(formula = Y ~ X + C)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.81847 -0.63492 0.02683 0.65114 2.43640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03361 0.08757 0.384 0.701
## X -0.03018 0.04536 -0.665 0.506
## C 3.04750 0.09315 32.717 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.981 on 497 degrees of freedom
## Multiple R-squared: 0.9874, Adjusted R-squared: 0.9873
## F-statistic: 1.941e+04 on 2 and 497 DF, p-value: < 2.2e-16
X influences Y, but the effect is confounded by C. That is, failing to account for C in a regression will lead to an incorrect estimate of the causal parameter
Y <- 3*C + 5*X + rnorm(size)
summary(lm(Y~X))
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.261 -1.139 -0.012 1.183 5.458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.65110 0.14861 4.381 1.44e-05 ***
## X 6.43596 0.01279 503.296 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.7 on 498 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.998
## F-statistic: 2.533e+05 on 1 and 498 DF, p-value: < 2.2e-16
summary(lm(Y~X+C))
##
## Call:
## lm(formula = Y ~ X + C)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2887 -0.6351 -0.0185 0.6507 2.9811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.09275 0.08704 1.066 0.287
## X 5.01797 0.04509 111.295 <2e-16 ***
## C 2.95101 0.09258 31.875 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.975 on 497 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 3.853e+05 on 2 and 497 DF, p-value: < 2.2e-16
size <- 100
locomotor <- 10*runif(size)
respiratory <- 10*runif(size)
hospitalized <- rbinom(size, 1, (locomotor+respiratory)/20)==0
df_all <- data.frame(locomotor, respiratory, hospitalized)
library(ggplot2)
ggplot(df_all[hospitalized==TRUE,], aes(respiratory, locomotor)) + geom_point() +
geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
ggplot(df_all, aes(respiratory, locomotor)) + geom_point(aes(color=hospitalized)) +
stat_smooth(method=lm, se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
If \(A\) and \(B\) are random variables and are associated (dependent), \(A\not\perp B\), there are three ways this can happen
Direct causation: \(A \rightarrow B\) or \(A \leftarrow B\)
Unmeasured confounding: \(A \leftarrow C \rightarrow B\) for some unmeasured \(C\)
Collider bias: \(A\) and \(B\) both influence sampling
If we only have data for \(A\) and \(B\), we cannot distinguish between these cases without more assumptions
If there are more variables, we can figure out more using conditional independence
Structure | Path type | \(A,C\) independent or dependent? | \(A,C\) independent or dependent given \(B\)? |
---|---|---|---|
Chain | \(A \rightarrow B \rightarrow C\) | \(A\not\perp C\) | \(A\perp C\mid B\) |
Other Chain | \(A \leftarrow B \leftarrow C\) | \(A\not\perp C\) | \(A\perp C\mid B\) |
Fork (Confounder) | \(A \leftarrow B \rightarrow C\) | \(A\not\perp C\) | \(A\perp C\mid B\) |
Collider | \(A \rightarrow B \leftarrow C\) | \(A\perp C\) | \(A\not\perp C\mid B\) |
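A minimal simulation of the collider row: A and C are marginally independent, but conditioning on the collider B (here, by including B as a regressor) induces dependence. All values are arbitrary illustrative choices.
set.seed(1)
n <- 2000
A <- rnorm(n)
C <- rnorm(n)
B <- A + C + rnorm(n)                                   # collider: A -> B <- C
cor.test(A, C)$p.value                                  # large: A and C marginally independent
summary(lm(A ~ C + B))$coefficients["C", "Pr(>|t|)"]    # small: A and C dependent given B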