Big Picture

Two big ideas:

  • Non-parametric Independence testing
  • PC algorithm: using independence tests to build structure

Independence/dependence

  • Intuitively, \(X\) and \(Y\) are dependent if knowing \(X\) gives information about \(Y\) and vice versa
  • Similarly, \(X\) and \(Y\) are independent if knowing \(X\) gives no information about \(Y\) and vice versa
  • Mathematically, we say random variables \(X\) and \(Y\) are independent (\(X\perp Y\)) if and only if \[ \mathbb{P}(X,Y) = \mathbb{P}(X)\mathbb{P}(Y) \]
  • \(X\) and \(Y\) are dependent (\(X\not\perp Y\)) if and only if \[ \mathbb{P}(X,Y) \neq \mathbb{P}(X)\mathbb{P}(Y) \]
  • Statistical associations (or dependence) can take forms that are difficult for first-order (linear) tests to capture, as in the simulations below
n <- 200
x <- runif(n)*10
yparab <- (x-5)^2 + 2*rnorm(n)
fitparab <- lm(yparab ~ x)

yvar <- rnorm(n, 0, x)
fitvar <-lm(yvar~x)

plot(x,yparab,xlab="X", ylab="Y", cex.lab=1.5)
abline(fitparab, col='red')
text(5,10, paste("rho=", round(cor(x, yparab), digits=3)))

plot(x,yvar, xlab="X", ylab="Y", cex.lab=1.5)
abline(fitvar, col="red")
text(0.5,4.5, paste("rho=", round(cor(x, yvar), digits=3)))

cor(x, yparab)
## [1] 0.02509234
cor.test(x,yparab)
## 
##  Pearson's product-moment correlation
## 
## data:  x and yparab
## t = 0.35319, df = 198, p-value = 0.7243
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1140457  0.1632650
## sample estimates:
##        cor 
## 0.02509234
cor.test(x,yvar)
## 
##  Pearson's product-moment correlation
## 
## data:  x and yvar
## t = 0.42392, df = 198, p-value = 0.6721
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1090839  0.1681513
## sample estimates:
##        cor 
## 0.03011283
  • Recall the correlation between random variables \(X\) and \(Y\) is \[ \rho(X,Y) = \frac{\text{cov}(X,Y)}{\sqrt{\text{var}(X)\text{var}(Y)}} = \frac{E[XY]-E[X]E[Y]}{\sqrt{\text{var}(X)\text{var}(Y)}} \]

  • \(X\) and \(Y\) are uncorrelated (\(\rho(X,Y) = 0\)) if \(E[XY]=E[X]E[Y]\)

  • Because we can subtract the mean, \(E[X]\), from any random variable and divide by its standard deviation, we really only care about \(E[XY]\)

  • If \(E[X] = E[Y]=0\) and \(\text{var}(X)=\text{var}(Y)=1\), then \[ \rho(X,Y) = E[XY] \]

  • Kernel methods attempt to detect dependence of any form

  • At a high level, we can think of it as using a property of moment generating functions (MGFs) to do this

  • Recall the moment generating function of a random variable \(X\) with density \(p\) is \[ M_X(t) = E\left[e^{tX}\right] = \int e^{tx}p(x)dx \] Theorem: If \(X\) and \(Y\) are random variables then \[ X\perp Y \Leftrightarrow M_{X+Y}(t) = M_X(t)M_Y(t) \] Proof sketch: \[ \begin{align} M_{X+Y}(t) = M_X(t)M_Y(t) &\Leftrightarrow \iint e^{t(x+y)}p(x,y)dxdy = \iint e^{tx}e^{ty}p(x,y)dxdy = \int e^{tx}p(x)dx \int e^{ty}p(y)dy \\ &\Leftrightarrow p(x,y)=p(x)p(y) \\ &\Leftrightarrow X\perp Y \end{align} \]

  • MGFs implicitly use higher order moments of \(X\) and \(Y\)

  • Using Taylor expansion, recall that \[e^x = \sum_{\ell=0}^\infty \frac{x^\ell}{\ell!}\]

  • So, independence actually requires \[M_{X+Y}(t) =\sum_{\ell=0}^\infty \frac{t^\ell E[(X+Y)^\ell]}{\ell!} = \left(\sum_{\ell=0}^\infty \frac{t^\ell E[X^\ell]}{\ell!}\right)\left(\sum_{\ell=0}^\infty \frac{t^\ell E[Y^\ell]}{\ell!}\right) = M_X(t)M_Y(t)\]

  • Should we estimate the higher order moments? \(\widehat{E[X^\ell]} = \frac{1}{n}\sum_{i=1}^n x_i^\ell\)

  • No, there will be some estimation error for each \(\ell\) and, technically, we would have to evaluate infinitely many of them (see the sketch below)
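  • Below: a minimal sketch of the plug-in moment estimator \(\widehat{E[X^\ell]}\) above (the uniform example and sample size are illustrative); the relative error typically grows with \(\ell\), so matching the MGF expansion term by term is not practical
set.seed(1)
n <- 200
x <- runif(n)*10

# plug-in estimate of E[X^l]
momhat <- function(x, l) mean(x^l)

# true moments of Uniform(0,10): E[X^l] = 10^l/(l+1)
sapply(1:6, function(l) c(estimate = momhat(x, l), truth = 10^l/(l + 1)))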

Reproducing Kernel Hilbert Spaces

High level:

Background

  • Vector space

  • Inner product space is a vector space, \(V\), with a map, \(\langle\cdot, \cdot\rangle: V\times V\rightarrow \mathbb{R}\) with

    • Linearity: \(\langle ax, y\rangle = a\langle x, y\rangle\) and \(\langle x+y, z\rangle = \langle x,z\rangle+\langle y,z\rangle\) for \(a\in \mathbb{R}, x,y,z\in V\)
    • Symmetry: \(\langle x, y\rangle = \langle y,x\rangle\) for \(x,y\in V\)
    • Positive definiteness: \(\langle x,x\rangle \geq 0\) for \(x\in V\) with \(\langle x,x\rangle = 0 \Leftrightarrow x=0\)
  • Example: Euclidean dot product, \([a,b,c]\cdot[x,y,z] = ax+by+cz\)

  • Inner product spaces have norms: \(\|x\| = \sqrt{\langle x,x \rangle}\)

  • A norm induces a metric, \(d(x,y)=\|x-y\|\), so normed spaces are metric spaces

  • Cauchy sequence is a sequence of vectors, \(x_1, x_2, x_3, \ldots\), where for every \(\epsilon>0\), there is a positive integer, \(N\), such that for all \(n,m>N, \|x_n-x_m\|<\epsilon\)

  • A metric space, \(V\), is complete if each Cauchy sequence of points in \(V\) has a limit in \(V\)

    • This is a technical requirement but it ensures a lot of nice properties that we will want

Hilbert Space

  • A Hilbert space, \(\mathcal{H}\), is a complete inner product space
  • Hilbert space example: Set of square integrable functions on \([0,1]\) \[ L_2[0,1] = \left\{f:[0,1]\rightarrow\mathbb{R}: \int_0^1 f(x)^2dx < \infty\right\} \]
    • Inner product: \[ \langle f,g\rangle = \int_0^1 f(x)g(x)dx \]
    • Corresponding norm: \[ \|f\| = \sqrt{\langle f,f\rangle} = \sqrt{\int_0^1 f(x)^2dx} \]
    • A sequence, \(f_1,f_2,f_3,\ldots\) in \(L_2[0,1]\) converges if \[ \|f_n-f\|\rightarrow 0 \text{ as } n\rightarrow\infty \] We can also write \(f_n\rightarrow f\)
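  • Below: a quick numerical illustration of the \(L_2[0,1]\) inner product and norm above, using integrate from base R (the functions are arbitrary choices)
f <- function(x) sin(2*pi*x)
g <- function(x) x^2

# <f,g> = integral_0^1 f(x) g(x) dx
innerL2 <- function(f, g){
    integrate(function(x) f(x)*g(x), lower=0, upper=1)$value
}

innerL2(f, g)        # inner product of f and g
sqrt(innerL2(f, f))  # norm of f; equals sqrt(1/2) for sin(2*pi*x)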

Kernels

  • Assume \(\mathcal X\subseteq \mathbb R^d\) is the data space (or feature space) and \(\mathcal H\) is a Hilbert space

  • Kernel: a function \(K: \mathcal X \times \mathcal X\rightarrow \mathbb R\) such that there exists a map (called the canonical feature map) \(\phi: \mathcal X\rightarrow \mathcal H\) with, for \(x,y\in \mathcal X\), \[K(x,y) = \langle \phi(x), \phi(y)\rangle_{\mathcal{H}}\]

  • Breakdown: \(K\) gives the Hilbert space inner product of the projections of \(x\) and \(y\)

  • This is helpful because doing operations directly in \(\mathcal H\) can be difficult

  • Choosing the kernel fully determines \(\mathcal H\)

  • Mercer kernels are positive definite: for all \(f\in L_2(\mathcal X)\), \[\int_{\mathcal X}\int_{\mathcal X} K(x,y)f(x)f(y)dxdy \geq 0\]

  • Recall that a positive semidefinite matrix satisfies \(x^TAx\geq 0\) for all \(x\); kernel positive definiteness is the analogous condition

  • example: Gaussian kernel \[K(x,y) = e^{-\frac{\|x-y\|^2}{\gamma}}\]

  • By fixing one argument of \(K\) at \(x\in\mathcal X\), we get a function \(K_x:\mathcal X\rightarrow \mathbb R\) defined by \(K_x(y) = K(x,y)\); the map \(x\mapsto K_x\) sends \(\mathcal X\) into \(\mathcal H\)

  • Consider the space \[\mathcal H_0 = \left\{f : f(x)=\sum_{j=1}^k\alpha_j K_{x_j}(x)\right\}\]

  • If \(f(x)=\sum_{j=1}^k\alpha_j K_{x_j}(x)\) and \(g(x)=\sum_{j=1}^m\beta_j K_{y_j}(x)\), we can define an inner product as \[\langle f, g\rangle_K = \sum_{i=1}^k\sum_{j=1}^m \alpha_i\beta_j K(x_i, y_j)\]
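  • Below: a small sketch of the \(\mathcal H_0\) inner product above with a Gaussian kernel; the points, coefficients, and bandwidth are arbitrary
K <- function(x, y, gamma=1) exp(-(x - y)^2/gamma)  # Gaussian kernel

xs <- c(0.2, 1.5, 3.0); alpha <- c(1, -0.5, 2)   # f = sum_j alpha_j K_{x_j}
ys <- c(0.7, 2.2);      beta  <- c(0.8, 1.1)     # g = sum_j beta_j K_{y_j}

# <f,g>_K = sum_i sum_j alpha_i beta_j K(x_i, y_j)
sum(outer(alpha, beta) * outer(xs, ys, K))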

Evaluation Functional

  • Functionals take functions as input and output a real number

    • Example: \(\int_a^b f(x)dx\) is a functional of \(f\)
  • Evaluation functional, \(F_x\) maps a function \(f\in\mathcal{H}\) to \(f(x)\), that is \(F_x f = f(x)\)

  • Alternative definition: A Hilbert space \(\mathcal H\) of functions is a reproducing kernel Hilbert space if evaluation functionals are bounded

  • Riesz representation theorem: if \(A: \mathcal{H}\rightarrow \mathbb{R}\) is a bounded linear functional on a Hilbert space \(\mathcal H\), there exists some \(g_{A}\in\mathcal H\) such that \[Af = \langle f, g_{A}\rangle_{\mathcal H}\] for all \(f\in\mathcal H\)

  • We use Riesz representation theorem to prove a sufficient condition for the existence of the kernel mean embedding in an RKHS

  • Because \(F_x\) is a bounded linear functional, using Riesz, there must be a \(g_x\in\mathcal H\) such that \[F_xf = \langle g_x, f\rangle = f(x)\]

  • Proof of Riesz representation theorem:

  1. Assume \(A:\mathcal H\rightarrow \mathbb R\) is a bounded linear functional
  2. TP (to prove): (1) there exists an element \(g_A\in\mathcal H\) such that \(Af =\langle f, g_A\rangle\) for each \(f\in\mathcal H\) and (2) \(g_A\) is unique.
  3. We know that \(A\) is bounded if and only if \(A\) is continuous, so \(A\) must also be continuous.
  4. Edge case: if \(Af = 0\) for all \(f\in\mathcal H\), then \(g_A = 0\) because \(Af = 0 = \langle f, 0\rangle\).
  5. Other cases: Assume that \(A\neq 0\) (not the zero functional). That is, for some \(v_1\in\mathcal H, Av_1\neq 0\). Without loss of generality, assume \(Av_1=1\).
  6. Let \(\text{Null}(A) = \left\{f\in\mathcal H: Af=0\right\}\) be the null space of \(A\).
  7. \(\text{Null}(A)\) must be a closed linear subspace of \(\mathcal H\). Because \(A\) is continuous, the inverse image of closed sets are closed and \(\{0\}\) is closed.
  8. Let \(E = \text{Null}(A)+v_1 = \left\{f+v_1: f\in\text{Null}(A)\right\}\). Notice that \(0\not\in E\) because \(A(f+v_1)=Af+Av_1=0+1=1\) for all \(f+v_1\in E\) and \(A0=0\).
  9. Let \(\delta = \inf_{v\in E} \|v\|\) and \(\{u_n\}_{n=1}^\infty\subseteq E\) be a sequence in \(E\) such that \(\lim_{n\rightarrow\infty}\|u_n\|=\delta\)
  10. Let \(\epsilon>0\) and choose \(N\) large enough so that \(\|u_n\|-\delta < \epsilon\) for all \(n\geq N\).
  11. Let \(m,n>N\) and notice that \((u_n+u_m)/2\in E\) because \((u_n+u_m)/2\) can be written to be the sum of \(v_1\) and an element from \(\text{Null}(A)\).
  12. Then we must have \(\|u_n+u_m\|\geq 2\delta\) by 9 and 11.
  13. Using the parallelogram law \[\begin{align*} \|u_n-u_m\|^2 &= 2\|u_n\|^2+2\|u_m\|^2-\|u_n+u_m\|^2 &\text{Parallelogram Law}\\ &\leq 2(\epsilon+\delta)^2+2(\epsilon+\delta)^2-4\delta^2 & \text{by 10}\\ &= 4\epsilon(2\delta+\epsilon) \end{align*}\]
  14. Because \(\epsilon\) was chosen arbitrarily we have shown that \(\{u_n\}_{n=1}^\infty\) is a Cauchy sequence (we may need to change the \(\epsilon\) from 10 to satisfy the definition). Recall Cauchy sequence: For all \(\epsilon>0\), there is an \(N\in\mathbb N\) such that for all \(m,n\geq N, \|u_n-u_m\| < \epsilon\).
  15. Because \(\{u_n\}_{n=1}^\infty\subseteq E\subseteq\mathcal H\), a Hilbert space (complete so that Cauchy sequences converge), \(\{u_n\}_{n=1}^\infty\) must converge, say \(u_n\rightarrow u_0\).
  16. \(u_0\in E\) because \(E\) is closed and \(u_0\neq 0\) because \(0\not\in E\) from 8.
  17. Claim: \(g_A = u_0/\|u_0\|^2\)
  18. Let \(v\in\text{Null}(A)\). Then \(w=u_0+cv\in E\) for any \(c\in\mathbb R\) because \(u_0\in E\).
  19. If \(v\neq 0\) and we set \(c = -\langle u_0, v\rangle /\|v\|^2\), then \(\|w\|^2=\|u_0\|^2-\langle u_0,v\rangle^2/\|v\|^2=\delta^2-c^2\|v\|^2\)
  20. Because \(\delta=\inf_{v\in E}\|v\|\), we must have that \(c=0\), otherwise, \(\|w\|<\delta\), a contradiction.
  21. Then \(\langle u_0, v\rangle=0\) for all \(v\in\text{Null}(A)\).
  22. Then \(Av=0\) implies that \(\langle u_0, v\rangle=0\).
  23. Moreover, if \(Av=b\neq 0\), then because \(u_0\in E, Au_0=1\), we have \[A(v-bu_0) = Av-bAu_0 = b-b=0\]
  24. Then \(v-bu_0\in \text{Null}(A)\) and \(\langle u_0, v-bu_0\rangle=0\) from 22 so that \(\langle u_0,v\rangle=b\|u_0\|^2\).
  25. Using \(g_A = u_0/\|u_0\|^2\) from 17, \(Af = \langle g_A, f\rangle\) for all \(f\in\mathcal H\)
  26. This proves the claim in 17: \(g_A = u_0/\|u_0\|^2\)
  27. To show uniqueness, assume for some \(h\in\mathcal H\), \(Af = \langle h, f\rangle\) for all \(f\in\mathcal H\).
  28. Then for all \(f\in\mathcal H\), \[0 = Af-Af = \langle h, f\rangle - \langle g_A, f\rangle = \langle h-g_A, f\rangle\]
  29. Then \(0= \langle h-g_A, h-g_A\rangle=\|h-g_A\|^2\), so \(h=g_A\).
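  • Below: for intuition, the finite-dimensional version of the theorem checked directly in R (the functional and representer are illustrative): on \(\mathbb R^d\) with the dot product, a bounded linear functional \(A\) is represented by a vector \(g_A\)
A  <- function(f) 2*f[1] - f[2] + 3*f[3]  # linear functional on R^3
gA <- c(2, -1, 3)                         # its Riesz representer
f  <- c(0.5, 4, -2)
c(A(f), sum(f*gA))                        # the two values agree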

Reproducing Kernel Property and kernel trick

  • Reproducing Kernel Property: For \(x\in\mathcal X\) and \(f\in\mathcal H\) \[\langle f, K_x\rangle = f(x)\]
  • Jumping ahead: this means that \(K_x\) is the representer of \(F_x\), the evaluation functional
  • Then \(\phi(x):= K_x\) fits the definition of canonical map
  • It follows that for \(x,y\in\mathcal X\) \[K(x,y) = K_y(x) = \langle K_x, K_y\rangle = \langle \phi(x), \phi(y)\rangle \]
  • Theorem: For every positive definite function, \(k(\cdot,\cdot)\) on \(\mathcal X\times \mathcal X\), there exists a unique RKHS with \(k\) as its reproducing kernel. Conversely, the reproducing kernel of an RKHS is unique and positive definite.
  • We can also show that \(\mathcal{H} = \overline{\text{span}\left\{k(x,\cdot): x\in \mathcal X\right\}}\), where closure is taken with respect to the RKHS norm
  • Hilbert-Schmidt Operator: Let \(\mathcal H\) and \(\mathcal F\) be separable Hilbert spaces with orthonormal bases \((h_i)_{i\in I}\) and \((f_j)_{j\in J}\), respectively. A Hilbert-Schmidt (HS) operator is a bounded operator \(\mathcal A: \mathcal F\rightarrow \mathcal H\) whose HS norm \[\|\mathcal A\|^2 = \sum_{i\in I} \sum_{j\in J} |\langle \mathcal A f_j, h_i\rangle_{\mathcal H}|^2\] is finite.
  • The HS operators from \(\mathcal F\) to \(\mathcal H\) also form a Hilbert space \(HS(\mathcal F, \mathcal H)\) with an inner product defined as \[\langle \mathcal A, \mathcal B\rangle_{HS} = \sum_{j\in J} \langle \mathcal A f_j, \mathcal B f_j\rangle_{\mathcal H}\]
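  • Below: in finite dimensions an HS operator is just a matrix and the HS norm is the Frobenius norm; a quick check of the definition above with standard bases (the matrix is arbitrary)
set.seed(2)
A <- matrix(rnorm(12), nrow=3, ncol=4)  # operator from R^4 to R^3

# HS norm via the definition: sum over basis pairs of <A f_j, h_i>^2
hsDef <- sum(sapply(1:4, function(j){
    ej <- replace(numeric(4), j, 1)     # standard basis of R^4
    sapply(1:3, function(i){
        hi <- replace(numeric(3), i, 1) # standard basis of R^3
        sum((A %*% ej)*hi)^2
    })
}))
c(hsDef, sum(A^2))                      # matches the squared Frobenius norm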

Data to Kernel Mean

  • Assume \(X\sim p\)
  • In statistics, we frequently want to evaluate \(E[f(X)] = \int_{\mathcal X} f(x)p(x)dx\) for some function \(f\)
  • Problem: we almost never know \(p\)
  • To estimate \(E[X]\), for example, we use the empirical measure, which gives probability \(\frac{1}{n}\) to each sample point, and integrate with respect to it
  • Let \(x = \{x_1, \dots, x_n\}\) and let \(\delta_x = \sum_{i=1}^n \delta_{x_i}\) be the empirical counting measure, where \(\delta_{x_i}(A) = 1\) if \(x_i\in A\) and \(\delta_{x_i}(A)=0\) otherwise
  • We approximate the integral for \(E[X]\) as \[\int_{\mathcal X} tp(t)dt \approx \int_{\mathcal X} t \frac{\delta_x(t)}{n}dt = \frac{1}{n}\sum_{i=1}^n x_i\]
  • Similarly, we can approximate \(E[f(X)]\) with its plug-in estimator as \[\int_{\mathcal X} f(t)p(t)dt \approx \int_{\mathcal X} f(t) \frac{\delta_x(t)}{n}dt = \frac{1}{n}\sum_{i=1}^n f(x_i)\]
  • Kernel mean embedding: \[\mu_p = E[K(X,\cdot)] = \int_{\mathcal X} K(x,\cdot) p(x)dx\]
  • Estimate: \[\widehat{\mu_p} = \frac{1}{n}\sum_{i=1}^n K(x_i,\cdot)\]
  • In fact, \[\begin{align} \int_{\mathcal X} f(t) \frac{\delta_x(t)}{n}dt &= \int_{\mathcal X} \left\langle f, K(t,\cdot) \right\rangle_{\mathcal H} \frac{\delta_x(t)}{n} dt & \text{reproducing kernel property}\\ &= \left\langle f, \int_{\mathcal X} K(t,\cdot)\frac{\delta_x(t)}{n} dt \right\rangle_{\mathcal H} & \text{dominated convergence theorem} \\ &= \left\langle f, \widehat{\mu_p} \right\rangle_{\mathcal H} & \text{definition of } \widehat{\mu_p}\\ \end{align}\]
  • Lemma (Smola): If \(E[\sqrt{k(X,X)}]<\infty\), then \(\mu_p\in \mathcal H\) and \(E[f(X)] = \langle f, \mu_p\rangle_{\mathcal H}\)
  • Proof: Let \(L_P: \mathcal H\rightarrow \mathbb R\) be a linear operator (functional) defined by \(L_Pf := E_{X\sim P}[f(X)]\). Using the assumption, \[\begin{align} |L_Pf| &= |E_{X\sim P}[f(X)]| &\text{by definition}\\ &\leq E_{X\sim P}[|f(X)|] &\text{Jensen's inequality}\\ &= E_{X\sim P}\left[ \left|\langle f, k(X,\cdot) \rangle_{\mathcal H}\right|\right] &\text{reproducing kernel property}\\ &\leq E_{X\sim P}\left[ \|f\|_{\mathcal H} \sqrt{k(X,X)} \right] &\text{Cauchy-Schwartz inequality} \end{align}\] Then \(L_P\) is bounded, so we may apply the Riesz representation theorem. Then there must exist \(\lambda\in\mathcal H\) such that \(L_Pf = \langle f,\lambda \rangle_{\mathcal H}\). Choose \(f=k(x,\cdot)\) for some \(x\in\mathcal X\). Then \[\lambda(x) = L_Pk(x,\cdot) = \int_{\mathcal X} k(x,y)dP(y)\] so that \[\lambda=\int_{\mathcal X}k(\cdot, y)dP(y) =\int_{\mathcal X}k(\cdot, y) p(y)dy =\mu_P~.\]
  • Proposition: If \(k\) and \(\ell\) are reproducing kernels, then \(k\cdot \ell\) is a reproducing kernel.
  • Theme: we can take inner products in the RKHS to evaluate integrals and get parameter estimates
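  • Below: a short sketch of the empirical kernel mean embedding \(\widehat{\mu_p}\) above (Gaussian kernel, illustrative names); RKHS inner products with \(\widehat{\mu_p}\) reduce to sample averages of kernel evaluations
set.seed(3)
n <- 500
x <- rnorm(n)                       # sample from p
K <- function(a, b, gamma=1) exp(-(a - b)^2/gamma)

# muhat(t) = (1/n) sum_i K(x_i, t) = <K_t, muhat>, an estimate of E[K(X, t)]
muhat <- function(t) mean(K(x, t))
muhat(0.5)

# squared RKHS norm of muhat: (1/n^2) sum_ij K(x_i, x_j)
mean(outer(x, x, K))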

Characteristic Kernels

  • Let \(\mathcal P\) be the set of probability densities on \(\mathcal X\)

  • Let \(P_X\) be the CDF (\(P_X(x) = P(X\leq x)=\int_{-\infty}^x p_X(y)dy\)) and \(p_X\) the density (pdf)

  • A kernel is characteristic if the mapping \(\mu:\mathcal P\rightarrow \mathcal H\) defined by \[\mu_P = \int_{\mathcal X} k(x,\cdot) p(x)dx\] is one-to-one (\(\mu_P=\mu_Q \Rightarrow P=Q\))

  • To show that a kernel, \(k\), is characteristic, we must show that for two CDFs, \(P\) and \(Q\), \(\mu_P=\mu_Q\) implies that \(P=Q\).

  • The function \(\mu\) is the kernel mean embedding function

  • The following theorem gives a general characterization of characteristic kernels. Here we prove the sufficient condition for a kernel to be characteristic but it is also a necessary condition.

  • Definition of square integrable function: \(L^2(\mathcal X, P) = \{f: \int_{\mathcal X} f(x)^2dP(x) <\infty\}\) (Recall \(\int_{\mathcal X} f(x)dP(x) = \int_{\mathcal X} f(x)p(x)dx\))

  • Theorem Let \(\mathcal H\) be an RKHS with corresponding measurable, bounded kernel, \(k\), on a measurable space \((\mathcal X, \mathcal B)\). If \(\mathcal H + \mathbb R\) (direct sum) is dense in \(L^2(\mathcal X,P)\) for any \(P\in\mathcal P\), then \(k\) is characteristic.

  • Proof:

    1. To show that \(\mu\) is injective, assume \(P\) and \(Q\) are probability measures such that \(\mu_P=\mu_Q\).
    2. Proof by contradiction: assume \(P\neq Q\).
    3. Let \(R= |P-Q|\); that is, \(R\) is a new measure defined as \(R(A) = |P(A)-Q(A)|\).
    4. Let \(\epsilon>0\) and \(A\in\mathcal B\).
    5. Since \(\mathcal H+\mathbb R\) is dense in \(L^2(\mathcal X,R)\), there must be a function \(f\in\mathcal H+\mathbb R\) such that \[\int_{\mathcal X} |f(x)-I_A(x)|dR(x)<\epsilon\] (Recall that \(\int fdP = E_P[f(X)]\) and \(\int I_AdP=P(A)\))
    6. And because \(\mu_P=\mu_Q\), \[E_P[f(X)] = \langle f, \mu_P\rangle = \langle f, \mu_Q\rangle = E_Q[f(X)]\]
    7. Then \[\begin{align*} |Q(A)-P(A)| &=\left|(E_P[f(X)]-P(A))-(E_Q[f(X)]-Q(A))\right|\\ &= \left|\int_{\mathcal X} f(x)-I_A(x) dP(x)-\int_{\mathcal X} f(x)-I_A(x) dQ(x)\right|\\ &\leq \left| \int_{\mathcal X} f(x)-I_A(x) dR(x)\right|\\ &\leq \int_{\mathcal X} |f(x)-I_A(x)|dR(x)< \epsilon \end{align*}\]
    8. Since, \(A\) and \(\epsilon\) were arbitrary, we must have that \(P(A)=Q(A)\) for all \(A\in\mathcal B\) so that \(P=Q\).
    9. This shows that \(\mu\) is injective so that \(k\) is characteristic.
  • We want to show that the Gaussian kernel, \(k(x,y)=\exp\left(-\frac{\|x-y\|}{2\sigma}\right)\), is characteristic.

  • It turns out there there is still more work to do to show this. The theorem below helps prove that the Gaussian kernel is characteristic.

  • The Fourier transform of a function \(\phi\) is \[\tilde\phi(\xi) := \int_{\mathcal X} e^{-it\xi}\phi(t)dt.\]

  • Theorem: Assume that \(k\) is translation invariant; that is, \(k(x,y)=\phi(x-y)\) for some function \(\phi\). If for all \(\xi\in\mathbb R^m\), there exists \(\tau_0\) such that \[\begin{equation} \int \frac{\tilde\phi(\tau(u+\xi))^2}{\tilde\phi(u)}du<\infty \end{equation}\] for all \(\tau>\tau_0\) then \(\mathcal H\) is dense in \(L^2(P)\) for any Borel probability measure \(P\) on \(\mathbb R^m\).

  • Proof: Without loss of generality, assume that \(\phi(0)=1\). Using positive definiteness of \(k\), we have that \(|\phi(z)|^2\leq \phi(0)^2=1\). Now, the RKHS associated with \(k\) consists of the functions \[\mathcal H = \left\{f\in L^2(\mathbb R^m) : \int \frac{|\tilde f(u)|^2}{\tilde\phi(u)}du<\infty\right\},\] where \(\tilde f\) and \(\tilde\phi\) are the Fourier transforms of \(f\) and \(\phi\).

    Let \(P\) be an arbitrary probability measure on \(\mathbb R^m\) and \(\xi\in\mathbb R^m\). Notice that the Fourier transform of \(e^{-i\xi^Tz}\phi(z/\tau)\) is \(\tilde\phi(\tau(u+\xi))\). From the assumption of the theorem we must have that \(e^{-i\xi^Tz}\phi(z/\tau)\in\mathcal H\) for all \(\tau>\tau_0\). Since \(\phi(z/\tau)\rightarrow 1\) as \(\tau\rightarrow\infty\) and \(e^{-i\xi^Tz}\phi(z/\tau)\) is bounded by a constant, we can use the dominated convergence theorem to ensure that \[E_P\left[e^{-i\xi^Tz} - e^{-i\xi^Tz}\phi(z/\tau)\right]\rightarrow 0 \text{ as } \tau\rightarrow\infty.\] This shows that we only have to prove that the span of \(\mathcal A :=\{e^{-i\xi^Tz}:\xi\in\mathbb R^m\}\) is dense in \(L^2(P)\).\

    Now, let \(f\in L^2(P)\). We can assume that \(f\) is differentiable with compact support, since such functions are dense in \(L^2(P)\). Let \(\epsilon>0\) and \(M=\sup_{x\in\mathbb R}|f(x)|\), and \(A\) such that \([-A,A]^m\) contains the support of \(f\) and \(P([-A,A]^m)>1-\epsilon/4M^2\). From a functional analysis text, we have that the series of functions \[f_N(z) = \sum_{n_1=-N}^N \dots\sum_{n_m=-N}^N c_n\exp\left\{\frac{i\pi}{A}n^Tz\right\} \text{ for } N\in\mathbb{N}\] where \[c_n = \frac{1}{(2A)^m}\int_{[-A,A]^m}f(z)\exp\left\{-\frac{i\pi}{A}n^Tz\right\}dz\] converges uniformly to \(f(z)\) on \([-A,A]^m\) as \(N\rightarrow\infty\). Then there must be some \(N\) large enough such that \(|f(z)-f_N(z)|^2<\epsilon/2\) on \([-A,A]^m\) and from the definition of \(f_N(z)\) it must be the case that \[\sup_{x\in\mathbb R^m}|f_N(z)|^2<\left(M+\sqrt{\frac{\epsilon}{2}}\right)^2 < 2M^2.\] Then we have that \(E_P|f-f_N|^2<\epsilon\) so that span\(\mathcal A\) is dense in \(L^2(P)\).

  • Now, to apply this theorem to the Gaussian kernel, we must first compute \(\tilde \phi\) where \(\phi(\xi)=\exp\left(-\frac{|\xi|}{2\sigma}\right)\) \[\tilde \phi(\xi) = \frac{1}{2\sigma\pi(\xi^2+(2\sigma)^{-2})}~.\]

  • Then show that for some \(\tau_0\), \(\int \frac{\tilde\phi(\tau(u+\xi))^2}{\tilde\phi(u)}du<\infty\) for all \(\tau>\tau_0\) \[\int \frac{\tilde\phi(\tau(u+\xi))^2}{\tilde\phi(u)}du = \int \frac{u^2+(2\sigma)^{-2}}{2\sigma\pi[(\tau(u+\xi))^2+(2\sigma)^{-2}]^2} du\] The integrand behaves like \(u^{-2}\) as \(|u|\rightarrow\infty\) and is bounded elsewhere, so the integral is finite for any \(\tau>0\).

Cross Covariance Operators

  • Let \(X\) and \(Y\) be random variables with marginal densities, \(p_X\) and \(p_Y\), reproducing kernels, \(k\) and \(\ell\), and corresponding RKHSs, \(\mathcal H\) and \(\mathcal F\), respectively

  • Assume \(E[k(X,X)], E[\ell(Y,Y)]<\infty\)

  • Cross covariance operator (uncentered): \(\mathcal C_{YX}:\mathcal H\rightarrow\mathcal F\) defined as \[ \mathcal C_{YX} := E_{YX}[\ell(Y,\cdot)\otimes k(X,\cdot)] = \mu_{p_{YX}}\] where \(p_{YX}\) is the joint distribution of \(X\) and \(Y\) and \(\otimes\) is the tensor product

  • Using a tensor only works here because \(\mathcal H\otimes \mathcal F\) and \(HS(\mathcal H, \mathcal F)\) are isomorphic using this map: \[\sum_{i\in I} f_i\otimes g_i \mapsto \left[h \mapsto \sum_{i\in I} \langle h, f_i\rangle_{\mathcal H} \cdot g_i\right]\]

  • \(\mathcal C_{YX}\) can be thought of as the expected value of a tensor product and as a mapping from \(\mathcal H\) to \(\mathcal F\)

  • As a mapping: \[\begin{align} \langle g, \mathcal C_{YX} f\rangle_{\mathcal F} &= \left\langle E_{YX}[\ell(Y,\cdot)\otimes k(X,\cdot)], g\otimes f \right\rangle_{\mathcal F\otimes\mathcal H} & \text{definition of }\mathcal C_{YX} \\ &= E_{YX}\left[ \left\langle \ell(Y,\cdot)\otimes k(X,\cdot), g\otimes f \right\rangle_{\mathcal F\otimes\mathcal H} \right] & \text{linearity of expectations} \\ &= E_{YX}\left[ \left\langle \ell(Y,\cdot), g \right\rangle_{\mathcal F}\cdot \left\langle k(X,\cdot), f \right\rangle_{\mathcal H} \right] & \text{definition of tensor product}\\ &= E_{YX}[g(Y)f(X)] & \text{reproducing kernel property} \end{align}\]

  • Covariance Operator: If \(X=Y\), then \(\mathcal C_{XX}\) maps \(f\in\mathcal H\) such that for \(g\in\mathcal H\) \(\langle g, \mathcal C_{XX} f\rangle = E_X[g(X)f(X)]\)

    • Self Adjoint: Note that for \(f, g\in\mathcal H,\langle g, \mathcal C_{XX} f\rangle = \langle \mathcal C_{XX}g,f\rangle\)
  • \(\mathcal C_{XY}\) is the adjoint of \(\mathcal C_{YX}\): for \(f\in\mathcal H, g\in\mathcal F, \langle g, \mathcal C_{YX} f\rangle_{\mathcal F} = \langle \mathcal C_{XY}g,f\rangle_{\mathcal H}\)

  • We can also write \(\mathcal C_{YX}, \mathcal C_{XX}\) as integral functions:

    • \((\mathcal C_{YX} f)(\cdot) = \int_{\mathcal X\times\mathcal Y} \ell(\cdot, y)f(x) p_{XY}(x,y)dxdy\)
    • \((\mathcal C_{XX} f)(\cdot) = \int_{\mathcal X} k(\cdot, x)f(x) p_{X}(x)dx\)
  • Using the first equation, \[\begin{align*} \langle g, \mathcal C_{YX} f\rangle_{\mathcal F} &= \left\langle g, \int_{\mathcal X\times\mathcal Y} \ell(\cdot, y)f(x) p_{XY}(x,y)dxdy\right\rangle_{\mathcal F} &\text{first equation above} \\ &= \int_{\mathcal X\times\mathcal Y} \langle g,\ell(\cdot, y)\rangle_{\mathcal F} f(x) p_{XY}(x,y)dxdy &\text{DCT} \\ &= \int_{\mathcal X\times\mathcal Y} g(y)f(x) p_{XY}(x,y)dxdy &\text{Reproducing kernel} \\ &= E_{YX}[g(Y)f(X)] &\text{Definition of expectation} \end{align*}\]

  • Theorem (Fukumizu 2004): If \(E_{Y}[g(Y)|X=\cdot]\in\mathcal H\), then for \(g\in\mathcal F\), \[\mathcal C_{XX} E_{Y}[g(Y)|X=\cdot] = \mathcal C_{XY} g~.\]

  • Centered cross covariance operator: \(E_{YX}[(\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X)]\)

  • Given an IID sample, \((x_1,y_1), \dots, (x_n, y_n)\), we estimate the centered cross covariance operator as \[ \hat{\mathcal{C}}_{YX} = \frac{1}{n}\sum_{i=1}^n \left(\ell(y_i,\cdot)-\hat\mu_{p_Y}\right)\otimes \left( k(x_i,\cdot)-\hat\mu_{p_X}\right) = \frac{1}{n}\psi^T H \phi \] where \(\psi = [\ell(y_1,\cdot), \dots, \ell(y_n,\cdot)]^T\), \(\phi = [k(x_1,\cdot),\dots, k(x_n, \cdot)]^T\), and \(H = I_n-\frac{1}{n} 1_{n\times n}\)

  • Berlinet and Thomas-Agnan (2004) showed that \[\|\hat{\mathcal C}_{YX} - \mathcal C_{YX}\|_{HS}=O_P(1/\sqrt n)\] as \(n\rightarrow\infty\)

  • Theorem (Baker, 1973): There exists a unique bounded operator, \(\mathcal V_{YX}: \mathcal H\rightarrow\mathcal F\) with \(\|\mathcal V_{YX}\|_{HS}\leq 1\) such that \[\mathcal C_{YX} = \mathcal C_{YY}^{1/2}\mathcal V_{YX} \mathcal C_{XX}^{1/2}\].

  • \(\mathcal V_{YX}\) is called the normalized cross covariance operator

  • \(\mathcal V_{YX}\) captures the dependence information between \(X\) and \(Y\) with less influence from the marginal distributions

  • Why is this helpful for independence? If we find \(f\) and \(g\) to make the expectation above look like the product of moment generating functions, we can use that to show independence.

  • The next theorem shows how the cross covariance operator and independence are related

  • Theorem: \(\|\tilde{\mathcal C}_{YX}\|_{HS}=0\) if and only if \(X \perp Y\), where \(\tilde{\mathcal C}_{YX}\) is the centered cross covariance operator

  • Proof: \[\begin{align*} \|\tilde C_{YX}\|_{HS}=0 &\Leftrightarrow \tilde C_{YX}=0 \\ &\Leftrightarrow \mu_{P_{YX}}=\mu_{P_Y}\otimes\mu_{P_X} & \text{because } \mu_{P_{YX}}-\mu_{P_Y}\otimes\mu_{P_X}=\tilde C_{YX} \\ &\Leftrightarrow P_{YX} = P_XP_Y \\ &\Leftrightarrow X\perp Y \end{align*}\]

  • How do we evaluate \(\|\tilde{\mathcal C}_{YX}\|_{HS}\)?

  • Recall that for any Hilbert-Schmidt Operator, \(\mathcal A\), \[\|\mathcal A\|_{HS}^2 = \sum_{i\in I} \sum_{j\in J} |\langle \mathcal A f_j, h_i\rangle_{\mathcal H}|^2\]

  • We can also think of the summations like integrals

  • Example: If \(X\sim P\) and \(X'\) is an independent copy of \(X\), \(\|\mu_P\|_{\mathcal H}^2 = \left\langle E[k(X,\cdot)], E[k(X',\cdot)] \right\rangle = E_{XX'}[k(X,X')] = \int_{\mathcal X}\int_{\mathcal X} k(x,x')p_X(x)p_X(x')dxdx'\)

  • Similarly \(\hat \mu_P =\frac{1}{n}\sum_{i=1}^n k(x_i,\cdot)\), so \(\|\hat\mu_P\|_{\mathcal H}^2 = \left\langle \frac{1}{n}\sum_{i=1}^n k(x_i,\cdot), \frac{1}{n}\sum_{i=1}^n k(x_i,\cdot) \right\rangle = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n k(x_i,x_j)\)

  • Big Picture: We are going to use the data to estimate \(\|\hat{\mathcal C}_{YX}\|\), as a test statistic. It is very unlikely that \(\|\hat{\mathcal C}_{YX}\|=0\) but if we can figure out its distribution under the null hypothesis, we can check if it’s close enough.

  • How do we use this to evaluate HS norms? There are several ways!

  • Gretton 2007 Kernel Test of Independence

    • Let \(X'\) and \(Y'\) be an independent copy of \(X\) and \(Y\), respectively. \[\begin{align*} &\|\tilde{\mathcal C}_{YX}\|_{HS}^2 \\ &= \|E_{YX}[(\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X)] \|_{HS}^2 \\ &= \langle E_{YX}[(\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X)], E_{Y'X'}[(\ell(Y',\cdot)-\mu_Y)\otimes(k(X',\cdot)-\mu_X)] \rangle_{HS} \\ &= E_{YY'XX'}[\langle (\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X), (\ell(Y',\cdot)-\mu_Y)\otimes(k(X',\cdot)-\mu_X) \rangle_{HS}] \\ &= E_{YY'XX'}\left[\left(\ell(Y,Y')-\langle \mu_Y, \ell(Y',\cdot)\rangle -\langle \mu_Y, \ell(Y,\cdot)\rangle+ \|\mu_Y\|^2\right)\times\left(k(X,X') -\langle k(X,\cdot), \mu_X\rangle -\langle k(X',\cdot), \mu_X\rangle+\|\mu_X\|^2\right)\right] \\ &= E_{YY'XX'}\left[\ell(Y,Y')k(X,X')\right] -2E_{YX}\left[E_{Y'}[\ell(Y,Y')]E_{X'}[k(X,X')]\right] +E_{YY'}[\ell(Y,Y')]E_{XX'}[k(X,X')] \end{align*}\] This can be estimated as a U-statistic \[HSIC = \frac{1}{\binom{n}{2}}\sum_{(i,j)} k(x_i,x_j)\ell(y_i,y_j) -\frac{2}{\binom{n}{3}}\sum_{(i,j,q)} k(x_i,x_j)\ell(y_i,y_q) +\frac{1}{\binom{n}{4}}\sum_{(i,j,q,r)} k(x_i,x_j)\ell(y_r,y_q)\] where each tuple runs over all unique value combinations

    • And as a V-statistic \[HSIC_b = \frac{1}{n^2}\sum_{i,j}^n k(x_i,x_j)\ell(y_i,y_j) -\frac{2}{n^3}\sum_{i,j,q}^n k(x_i,x_j)\ell(y_i,y_q) +\frac{1}{n^4}\sum_{i,j,q,r}^n k(x_i,x_j)\ell(y_r,y_q) = \frac{1}{n^2}\text{tr}(KHLH)\] where \(K_{ij} = k(x_i,x_j), L_{ij}= \ell(y_i,y_j), H=I_n -\frac{1}{n}\mathbf{1}\mathbf{1}^T\)

    • Distribution of HSIC: The paper shows that, under the null hypothesis (\(X\perp Y\)), \[n\cdot HSIC \xrightarrow{D} \sum_{l=1}^\infty \lambda_l z_l^2\] where the \(z_l\) are independent standard normals and the \(\lambda_l\) are unknown, theoretical eigenvalues. In practice, we can either approximate its distribution under the null hypothesis with a Gamma distribution or use a permutation test; a small sketch of the permutation approach follows below.
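    • Below: a compact sketch of the V-statistic \(HSIC_b=\frac{1}{n^2}\text{tr}(KHLH)\) with a permutation null, applied to the parabola example from the start of the notes; the kernel width and number of permutations are arbitrary choices, not the paper’s recommendations
set.seed(4)
n <- 200
x <- runif(n)*10
y <- (x - 5)^2 + 2*rnorm(n)               # dependent but essentially uncorrelated

rbf <- function(v, theta=1) exp(-as.matrix(dist(v))^2*theta/2)
H   <- diag(n) - matrix(1, n, n)/n

Kc <- H %*% rbf(x) %*% H                  # H K H, computed once
# HSIC_b = tr(K H L H)/n^2 = sum_ij (HKH)_ij L_ij / n^2
hsicb <- function(L) sum(Kc*L)/n^2

stat <- hsicb(rbf(y))
perm <- replicate(500, hsicb(rbf(sample(y))))  # permutation null: shuffle y
mean(perm >= stat)                        # approximate p-value; small here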

  • Fukumizu 2007 Kernel Conditional Independence

    • Covariance of \(X\) and \(Y\): \(cov(X,Y) = E[(X-E[X])(Y-E[Y])]\); if \(X,Y\) are vectors, \(cov(X,Y)\) is a matrix with covariance pair entries
    • Conditional covariance: \(cov(X,Y|Z) = E[(X-E[X|Z])(Y-E[Y|Z])|Z]\)
    • Assume \(X, Y, Z\) are random vectors, then \(cov(X,Y|Z) = cov(X,Y)-cov(X,Z)[cov(Z,Z)]^{-1}cov(Z,Y)\) (The proof uses a 2x2 matrix inversion identity)
    • This works similarly for conditional cross covariance: \[\mathcal{C}_{YX|Z} = \mathcal{C}_{YX}-\mathcal{C}_{YZ}\mathcal{C}_{ZZ}^{-1}\mathcal{C}_{ZX}\] where \(\mathcal{C}_{VW}\) is the cross covariance as before, proof in this paper
    • Theorem: \(X\perp Y|Z \Leftrightarrow \|\mathcal{C}_{YX|Z}\|_{HS}=0\)
    • However, this paper prefers to use the normalized measure: \[\mathcal{V}_{YX|Z} = \mathcal{C}_{YY}^{-1/2}\left(\mathcal{C}_{YX}-\mathcal{C}_{YZ}\mathcal{C}_{ZZ}^{-1}\mathcal{C}_{ZX}\right)\mathcal{C}_{XX}^{-1/2}\]
    • Again, \(X\perp Y|Z \Leftrightarrow \|\mathcal{V}_{YX|Z}\|_{HS}=0\)
    • To estimate \(\|\mathcal{V}_{YX|Z}\|_{HS}\), the authors make a small change to the estimator by substituting \(\ddot X := (X,Z), \ddot Y := (Y,Z)\) in place of \(X\) and \(Y\), and instead estimate \(\|\mathcal{V}_{\ddot Y\ddot X|Z}\|_{HS}\)
    • \(\|\widehat{\mathcal{V}_{\ddot Y\ddot X|Z}}\|_{HS}^2 = \text{Tr}\left(R_\ddot Y R_\ddot X -2R_\ddot Y R_\ddot X R_Z + R_\ddot Y R_Z R_\ddot X R_Z\right)\) where \(R_W = G_W(G_W+n\epsilon_n I_n)^{-1}\) and \(G_W= \left[k(w_i,w_j)\right]_{ij} H\)
    • Theorem: If \(\epsilon_n\rightarrow 0\) and \(\epsilon_n^3n\rightarrow\infty\), then \[\left\| \widehat{\mathcal{V}_{\ddot Y\ddot X|Z}} -\mathcal{V}_{\ddot Y\ddot X|Z}\right\|\xrightarrow{P} 0\] as \(n\rightarrow\infty\).
    • This paper does not provide the null distribution of \(\|\mathcal{V}_{YX|Z}\|_{HS}\) but uses permutation testing
      • note that because this is a conditional test, we can only permute \(X\) and \(Y\) for the same values of \(Z\)
  • Zhang 2012 Kernel Conditional Test

    • This paper builds on the previous paper by slightly changing the estimator and providing a null distribution

      • Big Picture: Use Gaussian process regression to estimate \(E[X|Z]\) and \(E[Y|Z]\) to determine independence

      • Conditional correlation analogy
      • Regression residuals: \(\epsilon_X :=X-E[X|Z]\) and \(\epsilon_Y :=Y-E[Y|Z]\)
      • If \(Corr(\epsilon_X,\epsilon_Y) = 0\), we can say that \(X\) and \(Y\) are uncorrelated given \(Z\)
      • Using kernels, the same framework extends to conditional independence via the centralized kernel matrices \(\tilde K_{\ddot X|Z}\) and \(\tilde K_{Y|Z}\)
    • The paper estimates \(\tilde K_{\ddot X|Z}\) as \[\tilde K_{\ddot X|Z} := R_Z \tilde K_{\ddot X} R_Z\] where \(R_Z\) is calculated from the GP regression

    • \(\tilde K_{Y|Z}\) is calculated similarly

    • KCI test statistic: \[T_{CI} := \frac{1}{n}\text{TR}\left(\tilde K_{\ddot X|Z} \tilde K_{Y|Z}\right)\]

    • Under the null hypothesis (\(X\perp Y|Z\)), \[T_{CI} \sim \frac{1}{n} \sum_{k=1}^{n^2} \lambda_k Z_k^2\] where \(\lambda_k\) are eigenvalues and \(Z_k\sim N(0,1)\) (independent standard normal RVs)

    • The distribution above can be approximated using a gamma distribution: \[Gamma(k,\theta) = \frac{t^{k-1} \exp\left(-\frac{t}{\theta}\right)}{\theta^k \Gamma(k)}\] where \[k = \frac{(E[T_{CI}])^2}{Var[T_{CI}]} \text{ and } \theta = \frac{Var[T_{CI}]}{E[T_{CI}]}\]

    • Recall the eigenvalue decomposition of a symmetric square matrix, \(A\):

      • Definition: An eigenvalue, eigenvector pair for \(A\) is any \(\lambda\in\mathbb R\), \(v\in \mathbb R^d\) such that \(\lambda v = Av\); wlog let \(\|v\|=1\)
      • Let \(U = [v_1,\dots, v_d]\) and \(\Lambda = \text{diag}(\lambda_1,\dots, \lambda_d)\), then \(A = U\Lambda U^T\), and \(U\) is an orthogonal matrix (\(U^{-1}=U^T\))
    • Note that \(K_X = [k(x_i,x_j)]_{ij}= [k(x_1,\cdot), \dots, k(x_n,\cdot)]^T \otimes [k(x_1,\cdot), \dots, k(x_n,\cdot)]^T\)

    • If \(\tilde K_X = HK_XH\) is the centralized kernel matrix and \(\tilde K_X = U_X \Lambda_X U_X^T\), we can estimate the kernel map, \([k(x_1,\cdot)-\hat\mu_X, \dots, k(x_n,\cdot)-\hat\mu_X]^T\), as \[U_X \Lambda_X^{1/2},\] since \(\tilde K_X = \left(U_X \Lambda_X^{1/2}\right)\left(U_X \Lambda_X^{1/2}\right)^T\) (this is what mapx and mapy compute in the code below)

    • Taking a closer look at the code

    • Below: threshold parameters

#The program codes the KCI statistic
#Zhang 2012 Kernel-based conditional independence test...
setwd(".")
epsilon <- 0 #10^-2
eigThresh <- 10^-10
thres <- 1e-5
  • Below: Calculate \(\text{Tr}(AB)\) faster
    • forming the full product \(AB\) is \(O(n^3)\), but we don’t actually need all of its entries, only the diagonal
    • below is \(O(n^2)\)
# tr(AB) = sum_i (AB)[i,i] = sum_i A[i,] . B[,i]; avoids forming the full product
trProd <- function(A, B){
    stopifnot(ncol(A) == nrow(B))
    sum(sapply(1:nrow(A),
               function(ii){A[ii,] %*% B[,ii]}
               )
        )
}

# tr(A A^T) = sum_i A[i,] . A[i,] = sum of the squared entries of A
trProdTrans <- function(A){
    sum(sapply(1:nrow(A),
               function(ii){A[ii,] %*% A[ii,]}
               )
        )
}
  • Below: if Cholesky decomposition doesn’t work, we add 0.01 to the diagonal and try again
cholsky <- function(origMat){
    # try the Cholesky factor; if it fails (matrix not numerically positive
    # definite), add 0.01 to the diagonal and try again
    cholStatus <- try(u <- chol(origMat), silent = TRUE)
    cholError <- inherits(cholStatus, "try-error")
    if (cholError){
        u <- chol(origMat + 0.01 * diag(dim(origMat)[1]))
    }
    return(u)
}
  • Below: function to generate (uncentered) kernel matrix
    • \(K_{ij} = k(x_i, x_j) = \exp\left(-\frac{\theta}{2}\|x_i-x_j\|^2\right)\)
gaussKernMat <- function(dat, theta){
    exp(as.matrix(-dist(dat, diag=TRUE, upper=TRUE)^2*theta/2))
}
  • Below: Gaussian Process (GP) Regression Functions
    • ardKernel automatic relevance determination kernel: minimizes features not associated with outcome
    • ardGrad gradient of ardKernel, used for optimization
    • gplogLik negative log marginal likelihood for GP regression (optim minimizes it)
    • gplogLikGrad gradient of gplogLik, used for optimization
# ard: automatic relevance determination
ardKernel <- function(params, x, noise=TRUE){
    #This function is the ARD kernel
    #params = (variable precision params, signal var, model var)
    x <- as.matrix(x)
    xDim <- dim(x)
    prec_param <- exp(params[1:xDim[2]])
    sig_var <- exp(2*params[xDim[2] + 1])
    if(noise){
        mod_var <- exp(2 * params[xDim[2] + 2])
    } else { mod_var <- 0}
    kern <- sig_var *
        exp(-as.matrix(dist(
                 x %*% diag(1/prec_param, nrow=length(prec_param)),
                 diag=TRUE, upper = TRUE)^2/2
                 )
            ) + (mod_var + epsilon) * diag(xDim[1])
    return((kern+t(kern))/2)
}

ardGrad <- function(params, x, coord){
    xDim <- dim(x)
    prec_param <- exp(params[1:xDim[2]])
    sig_var <- exp(2*params[xDim[2] + 1])
    mod_var <- exp(2*(params[xDim[2] + 2]))
    if(coord <= xDim[2]){
    kern <-  #sig_var *
        exp(-as.matrix(dist(
                 x %*% diag(1/prec_param, nrow=length(prec_param)),
                 diag=TRUE, upper = TRUE)^2/2
                 )
            )
    out <- kern *
        as.matrix(dist(x[,coord],
                       diag=TRUE, upper=TRUE)/prec_param[coord])^2
    return(out)
    }
    if(coord == xDim[2]+1){
        kern <-  sig_var *
            exp(-as.matrix(dist(
                     x %*% diag(1/prec_param, nrow=length(prec_param)),
                     diag=TRUE, upper = TRUE)^2/2
                     )
                )
        return(2 * kern)
    }
    if(coord == xDim[2]+2){
        return(2 * mod_var * diag(xDim[1]))
    }
}


gplogLik <- function(params, x, y){
    xDim <- dim(x)
    sig_var <- exp(2*params[xDim[2] + 1])  ###  Check this out
    x <- as.matrix(x)
    y <- as.matrix(y)
    xDim <- dim(x)
    yDim <- dim(y)
    kern <- ardKernel(params, x)
    L <- chol(kern)
    alpha <- solve(kern, y,
                   system=signature(a = "dppMatrix", b = "dsparseMatrix"))
    0.5 * sum(diag(t(y) %*% alpha)) +
        yDim[2]*(0.5 * log(sig_var) + sum(log(diag(L)))) +
        0.5 * xDim[1] * yDim[2] * log(2*pi)
}

gplogLikGrad <- function(params, x, y){
    paramNum <- length(params)
    yDim <- dim(y)
    out <- rep(NA, paramNum)
    kern <- ardKernel(params, x)
    alpha <- solve(kern, y,
                   system=signature(a = "dppMatrix", b = "dsparseMatrix"))
    W <- yDim[2] * solve(kern, system=signature(a = "dgCMatrix")) -
        tcrossprod(alpha)
    for(itr in 1:paramNum){
        out[itr] <- sum((W * ardGrad(params, x, itr)))/2
    }
    return(out)
}
  • Below: Actual kciTest code
    • inputs: x,y, z (optional)
    • output: p-value
kciTest <- function(x, y, z=NULL, theta=NULL){
    x <- as.matrix(x)
    y <- as.matrix(y)
    x <- scale(x) #scale standardizes columns
    y <- scale(y)
    nobs <- dim(x)[1]
    H <- diag(nobs) - matrix(1, nrow=nobs, ncol=nobs)/nobs

    if(is.null(theta)){
        if(nobs <= 200){
            width <- 1.2
        } else if(nobs > 1200){
            width <- 0.4
        } else {width <- 0.7}
        theta <- 1/width^2
    }

    if(is.null(z)){ # if nothing is entered for z, runs independence test
        kernMatX <- gaussKernMat(x, theta)
        kernMatY <- gaussKernMat(y, theta)

        kx <- H %*% (kernMatX %*% H)
        ky <- H %*% (kernMatY %*% H)

        test.stat <- sum(diag(kx %*% ky))/nobs
        th.mean <- sum(diag(kx)) * sum(diag(ky)) / nobs^2
        th.var <- 2 * trProd(kx, kx) * trProd(ky, ky) / nobs^4
    } else{ # if z entered, runs conditional independence test
        z <- as.matrix(z)
        z <- scale(z)
        zDim <- dim(z)[2]

        theta <- 1/(width^2 * zDim) # uses width from above (assumes theta was not supplied)

        # centralized kernel matrix for ddot X = (X,Z), tilde K_X
        kernMatXd <- gaussKernMat(cbind(x,z/2), theta) 
        kxd <- H %*% (kernMatXd %*% H)

        # centralized kernel matrix for Y, tilde K_Y
        kernMatY <- gaussKernMat(y, theta)
        ky <- H %*% (kernMatY %*% H)

        eigx <- eigen(kxd, symmetric = TRUE)
        eigy <- eigen(ky, symmetric = TRUE)

        # list of T/F for keeping eigenvalues, keep large enough values
        eigx_keep <- eigx$value >= thres * eigx$value[1]
        eigy_keep <- eigy$value >= thres * eigy$value[1]

        # getting sqrt(K_X), sqrt(K_Y)
        mapx <- eigx$vectors[,eigx_keep] %*%
            diag(sqrt(eigx$values[eigx_keep]), nrow=sum(eigx_keep))
        mapy <- eigy$vectors[,eigy_keep] %*%
            diag(sqrt(eigy$values[eigy_keep]), nrow=sum(eigy_keep))

        # getting GP parameters
        init_eta <- log(width * sqrt(zDim))
        init_params <- c(rep(init_eta, zDim), 0, log(sqrt(0.1)))
        opt_x <- optim(init_params, gplogLik, gplogLikGrad,
                       x=z, y=2*sqrt(nobs) * mapx/sqrt(eigx$value[1]))
        opt_y <- optim(init_params, gplogLik, gplogLikGrad,
                       x=z, y=2*sqrt(nobs) * mapy/sqrt(eigy$value[1]))
        opt_hyp_x <- opt_x$par
        opt_hyp_y <- opt_y$par

        # calculating regression matrix R_Z in the notes above
        Px <- (diag(nobs) - ardKernel(opt_hyp_x, z, FALSE) %*%
               solve(ardKernel(opt_hyp_x, z)))
        Py <- (diag(nobs) - ardKernel(opt_hyp_y, z, FALSE) %*%
               solve(ardKernel(opt_hyp_y, z)))

        # tilde K_{ddot X|Z}
        kxdz <- Px %*% (kxd %*% t(Px))

        # tilde K_{Y|Z}
        kyz <- Py %*% (ky %*% t(Py))

        # test statistic
        test.stat <- trProd(kxdz, kyz)

        # get eigenvalues/vectors
        eigx <- eigen(kxdz, symmetric = TRUE)
        eigy <- eigen(kyz, symmetric = TRUE)

        # list T/F for which are large enough to use
        thrsX <- sum(eigx$values >= eigThresh)
        thrsY <- sum(eigy$values >= eigThresh)

        # calculate sqrt(tilde K_X), etc
        mapx <- eigx$vectors[,1:thrsX] %*%
            diag(sqrt(eigx$values[1:thrsX]), nrow=thrsX)
        mapy <- eigy$vectors[,1:thrsY] %*%
            diag(sqrt(eigy$values[1:thrsY]), nrow=thrsY)
        w <- do.call(cbind,  #stacking M_t
                     lapply(1:nobs,
                            function(t) as.vector(mapx[t,] %*% t(mapy[t,]))
                            )
                     )
        
        # get gamma distribution parameters
        prod_w <- crossprod(w)
        sq_prod_w <- crossprod(prod_w)
        trww <- sum(diag(prod_w))
        th.mean <- trww #/ nobs
        sq_trww <- sum(diag(sq_prod_w))
        th.var <- 2 * sq_trww #/ nobs ^ 2
    }
    shape = th.mean^2 / th.var
    scale = th.var / th.mean
    p.val <- pgamma(test.stat, shape = shape, scale = scale,
                    lower.tail=FALSE) #gamma approx
    return(p.val)
}
  • Below: function to connect kciTest with pc algorithm
  • pc algorithm gives names of variables (\(x,y,S\)) and suffStat, the dataset
kciWrap <- function(x,y,S, suffStat){
  # Uses kciTest to work on data
  # inputs:
  # x = name of x variable
  # y = name of y variable
  # S = name of variables in conditioning set
  # suffStat = sufficient statistic (entire dataset here)
  # output: p-value from KCI test
    compObs <- complete.cases(suffStat[,c(x,y,S)])
    compDat <- suffStat[compObs,]
    if (length(S) == 0) {
        error <- try(res <- kciTest(compDat[,x], compDat[,y]))
        if(inherits(error, "try-error")) res <- NA
    } else {
        error <- try(res <- kciTest(compDat[,x], compDat[,y], compDat[,S]))
        if(inherits(error, "try-error")) res <- NA
    }
    res
}
  • The code below returns the variables that are associated with the outcome and whose association cannot be explained away by conditioning on the other variables
neighbors <- function(outcome, data, alpha, verbose=FALSE){
    nfeats <- dim(data)[2]
    rfeats <- 1:nfeats
    target <- nfeats + 1
    dat <- cbind(data, outcome)
    for(csize in 0:(nfeats-1)){
        if(length(rfeats) - 1 < csize){
            return(rfeats)
        }
        current <- rfeats
        for(tvar in current){
            condvars <- rfeats[tvar != rfeats]
            if(length(condvars) != 1){
                condsets <- combn(condvars, csize)
            } else {
                condsets <- as.matrix(condvars)
            }
            ncondsets <- dim(condsets)[2]
            for(itr in 1:ncondsets){
                if(verbose) cat("y=", tvar, ", S=",condsets[,itr],", pvalue= ")
                pval <- kciWrap(target, tvar, condsets[,itr], dat)
                ##pval <- myCItest(target, tvar, condsets[,itr], dat)
                if(verbose) cat(pval, "\n")
                if(pval >= alpha & !is.na(pval)){
                    rfeats <- rfeats[rfeats != tvar]
                    break
                }
            }
            if(verbose) cat("Remaining: ", rfeats, "\n")
        }
    }
    return(rfeats)
}
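  • Below: a usage sketch for the functions above, assuming they have been sourced; the simulated data are illustrative, with \(X\perp Y|Z\) by construction
set.seed(5)
n <- 150
z <- rnorm(n)
x <- z + 0.5*rnorm(n)
y <- z^2 + 0.5*rnorm(n)

kciTest(x, y)      # small p-value expected: x and y are marginally dependent
kciTest(x, y, z)   # larger p-value expected: x and y are independent given z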

Association vs Causation

Simpson’s paradox

##        treatment symptoms mortality
## 687 experimental   severe     alive
## 239 experimental   severe      died
## 530      placebo     mild      died
## 846      placebo     mild     alive
## 346 experimental     mild      died
## 634      placebo     mild     alive
## 509      placebo     mild     alive
## 475 experimental   severe     alive
## 213      placebo   severe      died
## 431      placebo     mild     alive
## 632 experimental     mild     alive
## 630 experimental   severe     alive
## 318      placebo   severe      died
## 991      placebo     mild     alive
## 648      placebo   severe     alive
## 880      placebo     mild     alive
## 136 experimental   severe      died
## 655 experimental   severe     alive
## 497      placebo     mild     alive
## 358      placebo     mild     alive
## 
## experimental      placebo 
##          487          513
## 
##   mild severe 
##    514    486
## 
## alive  died 
##   754   246
table(df$treatment, df$mortality)[,2]/table(df$treatment)
## 
## experimental      placebo 
##    0.2648871    0.2280702
table(df$treatment, df$symptoms, df$mortality)[,,2]/table(df$treatment, df$symptoms)
##               
##                      mild     severe
##   experimental 0.07352941 0.33903134
##   placebo      0.16402116 0.40740741
table(df$treatment, df$symptoms)
##               
##                mild severe
##   experimental  136    351
##   placebo       378    135
chisq.test(df$treatment, df$symptoms)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df$treatment and df$symptoms
## X-squared = 207.58, df = 1, p-value < 2.2e-16

Confounding

Confounding Simulations

  • Diagram above: \(X\) and \(Y\) are confounded by \(C\) and have no direct causal relationship
  • \(X = 2C + \epsilon_X \Rightarrow C = 0.5X - 0.5\epsilon_X\), so \(Y = 3C + \epsilon_Y = 1.5X + \text{noise}\); regressing \(Y\) on \(X\) alone gives a slope near 1.5
size <- 500
C <- 10*runif(size)
X <- 2*C + rnorm(size)
Y <- 3*C + rnorm(size)

summary(lm(Y~X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6491 -1.2526 -0.0418  1.2540  4.1460 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.6102     0.1522    4.01 7.01e-05 ***
## X             1.4342     0.0131  109.52  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.74 on 498 degrees of freedom
## Multiple R-squared:  0.9601, Adjusted R-squared:  0.9601 
## F-statistic: 1.199e+04 on 1 and 498 DF,  p-value: < 2.2e-16
summary(lm(Y~X+C))
## 
## Call:
## lm(formula = Y ~ X + C)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.81847 -0.63492  0.02683  0.65114  2.43640 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03361    0.08757   0.384    0.701    
## X           -0.03018    0.04536  -0.665    0.506    
## C            3.04750    0.09315  32.717   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.981 on 497 degrees of freedom
## Multiple R-squared:  0.9874, Adjusted R-squared:  0.9873 
## F-statistic: 1.941e+04 on 2 and 497 DF,  p-value: < 2.2e-16

  • Diagram above: \(X\) influences \(Y\) but the effect is confounded by \(C\). That is, failing to account for \(C\) in a regression will lead to an incorrect causal parameter estimate
  • parameter from regressing \(Y\) on \(X\) alone = (confounding bias) + (causal effect) = 1.5 + 5 = 6.5
Y <- 3*C + 5*X + rnorm(size)

summary(lm(Y~X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.261 -1.139 -0.012  1.183  5.458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.65110    0.14861   4.381 1.44e-05 ***
## X            6.43596    0.01279 503.296  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.7 on 498 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.998 
## F-statistic: 2.533e+05 on 1 and 498 DF,  p-value: < 2.2e-16
summary(lm(Y~X+C))
## 
## Call:
## lm(formula = Y ~ X + C)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2887 -0.6351 -0.0185  0.6507  2.9811 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.09275    0.08704   1.066    0.287    
## X            5.01797    0.04509 111.295   <2e-16 ***
## C            2.95101    0.09258  31.875   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.975 on 497 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 3.853e+05 on 2 and 497 DF,  p-value: < 2.2e-16
  • Note: the covariate causal effect math only works here because the simulated data are linear. In the real world, we typically can’t assume linearity, and this math doesn’t work if the true relationships aren’t linear.
  • In most real-world datasets, there will always be the possibility of latent confounding

Sampling Bias

size <- 100
locomotor <- 10*runif(size)
respiratory <- 10*runif(size)
hospitalized <- rbinom(size, 1, (locomotor+respiratory)/20)==0
df_all <- data.frame(locomotor, respiratory, hospitalized)
library(ggplot2)
ggplot(df_all[hospitalized==TRUE,], aes(respiratory, locomotor)) + geom_point() + 
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

ggplot(df_all, aes(respiratory, locomotor)) + geom_point(aes(color=hospitalized)) +
  stat_smooth(method=lm, se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

Statistical Dependency and Causation

  • If \(A\) and \(B\) are random variables and are associated (dependent), \(A\not\perp B\), there are thought to be 3 ways this can happen

    • Direct causation: \(A\rightarrow B\) or \(B\rightarrow A\)

    • Unmeasured confounding: a common cause influences both \(A\) and \(B\)

    • Collider bias: \(A\) and \(B\) both influence sampling

    • If we only have data for \(A\) and \(B\), we cannot distinguish between these cases without more assumptions

  • If there are more variables, we can figure out more using conditional independence

    • Think about whether knowing one variable gives information about another
    • Variables: Drank too much alcohol last night, Woke up with a hangover, Woke up with shoes on

    • Hangover\(\perp\)Shoes?
    • Hangover\(\perp\)Shoes\(|\)Alcohol?
      • Given that someone drank too much alcohol last night, does knowing that they woke up with their shoes on give any information about if they have a hangover?
    • Variables: Burglary, Earthquake, Home motion detector

    • Burglary\(\perp\)Earthquake
    • Burglary\(\perp\)Earthquake\(|\)Detector?
      • Given that the home motion detector alerted, does knowing that there was no earthquake give any information about a burglary?
| Structure | Path type | \(A,C\) independent/dependent? | \(A,C\) conditionally independent/dependent given \(B\) |
|---|---|---|---|
| Chain | \(A \rightarrow B \rightarrow C\) | \(A\not\perp C\) | \(A\perp C \mid B\) |
| Other Chain | \(A \leftarrow B \leftarrow C\) | \(A\not\perp C\) | \(A\perp C \mid B\) |
| Fork (Confounder) | \(A \leftarrow B \rightarrow C\) | \(A\not\perp C\) | \(A\perp C \mid B\) |
| Collider | \(A \rightarrow B \leftarrow C\) | \(A\perp C\) | \(A\not\perp C \mid B\) |
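  • Below: a quick simulation of the collider row of the table (names and coefficients arbitrary); \(A\) and \(C\) are marginally independent, but conditioning on the collider \(B\) induces dependence
set.seed(6)
n <- 2000
A <- rnorm(n)
C <- rnorm(n)
B <- A + C + 0.5*rnorm(n)   # collider: A -> B <- C

cor.test(A, C)$p.value      # no marginal association
summary(lm(C ~ A + B))      # conditioning on B: the A coefficient is clearly nonzero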

PC Algorithm

  • Assume we have a dataset with many variables, \(A\), \(B\), \(C\), \(D\), etc
  • Assume no unmeasured confounders
  • Question: Is there a direct causal connection between \(A\) and \(B\)?
    • If we can find a set of variables, \(S\) (not including \(A,B\)), such that \(A\perp B|S\), then there is NO direct causal connection between \(A\) and \(B\)

Colliders

  • In the table above, only the collider's independence/conditional-independence pattern differs from the other structures
  • If we see, for variables \(A, B, C\), that \(A\perp C\) and \(A\not\perp C|B\), then \(A \rightarrow B \leftarrow C\) is the only option

Y-structures

  • Assuming no confounding, either \(C\rightarrow D\) or \(D \rightarrow C\)
  • If \(A-C-D\) and \(B-C-D\) are not colliders, we must have \(C\rightarrow D\)

Algorithm

  1. Create a complete, undirected graph

  2. Pairwise Conditional Independence Testing
  • Start with conditioning set size \(d = 0\):
    • For each edge in the current graph, \((x,y)\):
      • For each subset \(S\) of the neighbors of \(x\) or \(y\) with size \(d\):
        • if \(x\) and \(y\) have fewer than \(d\) neighbors, leave the edge and go to the next edge
        • if \(x\perp y|S\), remove the edge and go to the next edge
    • \(d\leftarrow d+1\) and repeat until every remaining edge has fewer than \(d\) neighbors
  3. For all non-looping triples:
  • If a collider, orient the edges
  4. Find all Y-structures in the graph:
  • Orient accordingly
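  • Below: a sketch of running the algorithm with the pcalg package (assuming it is installed); the built-in linear-Gaussian gaussCItest is shown, and kciWrap from above can be swapped in as indepTest, since pc passes (x, y, S, suffStat) exactly as described earlier
library(pcalg)

set.seed(7)
n <- 1000
A <- rnorm(n)
C <- rnorm(n)
B <- A + C + rnorm(n)   # A -> B <- C
D <- B + rnorm(n)       # B -> D
dat <- data.frame(A, B, C, D)

# linear-Gaussian CI test: suffStat is the correlation matrix and n
fit <- pc(suffStat = list(C = cor(dat), n = n),
          indepTest = gaussCItest, alpha = 0.01, labels = colnames(dat))
fit

# kernel-based version (slower): suffStat is the raw data frame
# fit_kci <- pc(suffStat = dat, indepTest = kciWrap, alpha = 0.01,
#               labels = colnames(dat))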