Big Picture

Two big ideas:

  • Non-parametric Independence testing
  • PC algorithm: using independence tests to build structure

Independence/dependence

  • Intuitively, \(X\) and \(Y\) are dependent if knowing \(X\) gives information about \(Y\) and vice versa
  • Similarly, \(X\) and \(Y\) are independent if knowing \(X\) gives no information about \(Y\) and vice versa
  • Mathematically, we say random variables \(X\) and \(Y\) are independent (\(X\perp Y\)) if and only if \[ \mathbb{P}(X,Y) = \mathbb{P}(X)\mathbb{P}(Y) \]
  • \(X\) and \(Y\) are dependent (\(X\not\perp Y\)) if and only if \[ \mathbb{P}(X,Y) \neq \mathbb{P}(X)\mathbb{P}(Y) \]
  • Statistical associations (or dependence) can take forms that are difficult for first-order (linear) tests to capture, as in the simulations below
n <- 200
x <- runif(n)*10
yparab <- (x-5)^2 + 2*rnorm(n)
fitparab <- lm(yparab ~ x)

yvar <- rnorm(n, 0, x)
fitvar <-lm(yvar~x)

plot(x,yparab,xlab="X", ylab="Y", cex.lab=1.5)
abline(fitparab, col='red')
text(5,10, paste("rho=", round(cor(x, yparab), digits=3)))

plot(x,yvar, xlab="X", ylab="Y", cex.lab=1.5)
abline(fitvar, col="red")
text(0.5,4.5, paste("rho=", round(cor(x, yvar), digits=3)))

cor(x, yparab)
## [1] 0.02509234
cor.test(x,yparab)
## 
##  Pearson's product-moment correlation
## 
## data:  x and yparab
## t = 0.35319, df = 198, p-value = 0.7243
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1140457  0.1632650
## sample estimates:
##        cor 
## 0.02509234
cor.test(x,yvar)
## 
##  Pearson's product-moment correlation
## 
## data:  x and yvar
## t = 0.42392, df = 198, p-value = 0.6721
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1090839  0.1681513
## sample estimates:
##        cor 
## 0.03011283
  • Recall the correlation between random variables \(X\) and \(Y\) is \[ \rho(X,Y) = \frac{\text{cov}(X,Y)}{\sqrt{\text{var}(X)\text{var}(Y)}} = \frac{E[XY]-E[X]E[Y]}{\sqrt{\text{var}(X)\text{var}(Y)}} \]

  • \(X\) and \(Y\) are uncorrelated (\(\rho(X,Y) = 0\)) if \(E[XY]=E[X]E[Y]\)

  • Because we can subtract the mean, \(E[X]\), from any random variable and divide by its standard deviation, we really only care about \(E[XY]\)

  • If \(E[X] = E[Y]=0\) and \(\text{var}(X)=\text{var}(Y)=1\), then \[ \rho(X,Y) = E[XY] \]

  • Kernel methods attempt to detect dependence of any form

  • At a high level, we can think of it as using a property of moment generating functions (MGFs) to do this

  • Recall the moment generating function of a random variable \(X\) with density \(p\) is \[ M_X(t) = E\left[e^{tX}\right] = \int e^{tx}p(x)dx \] Theorem: If \(X\) and \(Y\) are random variables then \[ X\perp Y \Leftrightarrow M_{X+Y}(t) = M_X(t)M_Y(t) \] Proof sketch: \[ \begin{align} M_{X+Y}(t) = M_X(t)M_Y(t) &\Leftrightarrow \iint e^{t(x+y)}p(x,y)dxdy = \iint e^{tx}e^{ty}p(x,y)dxdy = \int e^{tx}p(x)dx \int e^{ty}p(y)dy \\ &\Leftrightarrow p(x,y)=p(x)p(y) \\ &\Leftrightarrow X\perp Y \end{align} \]

  • MGFs implicitly use higher order moments of \(X\) and \(Y\)

  • Using Taylor expansion, recall that \[e^x = \sum_{\ell=0}^\infty \frac{x^\ell}{\ell!}\]

  • So, independence actually requires \[M_{X+Y}(t) =\sum_{\ell=0}^\infty \frac{t^\ell E[(X+Y)^\ell]}{\ell!} = \left(\sum_{\ell=0}^\infty \frac{t^\ell E[X^\ell]}{\ell!}\right)\left(\sum_{\ell=0}^\infty \frac{t^\ell E[Y^\ell]}{\ell!}\right) = M_X(t)M_Y(t)\]

  • Should we estimate the higher order moments? \(\widehat{E[X^\ell]} = \frac{1}{n}\sum_{i=1}^n x_i^\ell\)

  • No, there will be some estimation error for each \(\ell\) and, technically, we would have to evaluate infinitely many of them (see the sketch below)
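  • Below: a minimal sketch of the plug-in moment estimator \(\widehat{E[X^\ell]}\) above (the uniform example and sample size are illustrative); the relative error typically grows with \(\ell\), so matching the MGF expansion term by term is not practical
set.seed(1)
n <- 200
x <- runif(n)*10

# plug-in estimate of E[X^l]
momhat <- function(x, l) mean(x^l)

# true moments of Uniform(0,10): E[X^l] = 10^l/(l+1)
sapply(1:6, function(l) c(estimate = momhat(x, l), truth = 10^l/(l + 1)))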

Reproducing Kernel Hilbert Spaces

High level:

Background

  • Vector space

  • Inner product space is a vector space, \(V\), with a map, \(\langle\cdot, \cdot\rangle: V\times V\rightarrow \mathbb{R}\) with

    • Linearity: \(\langle ax, y\rangle = a\langle x, y\rangle\) and \(\langle x+y, z\rangle = \langle x,z\rangle+\langle y,z\rangle\) for \(a\in \mathbb{R}, x,y,z\in V\)
    • Symmetry: \(\langle x, y\rangle = \langle y,x\rangle\) for \(x,y\in V\)
    • Positive definiteness: \(\langle x,x\rangle \geq 0\) for \(x\in V\) with \(\langle x,x\rangle = 0 \Leftrightarrow x=0\)
  • Example: Euclidean dot product, \([a,b,c]\cdot[x,y,z] = ax+by+cz\)

  • Inner product spaces have norms: \(\|x\| = \sqrt{\langle x,x \rangle}\)

  • A norm induces a metric, \(d(x,y)=\|x-y\|\), so normed spaces are metric spaces

  • Cauchy sequence is a sequence of vectors, \(x_1, x_2, x_3, \ldots\), where for every \(\epsilon>0\), there is a positive integer, \(N\), such that for all \(n,m>N, \|x_n-x_m\|<\epsilon\)

  • A metric space, \(V\), is complete if each Cauchy sequence of points in \(V\) has a limit in \(V\)

    • This is a technical requirement but it ensures a lot of nice properties that we will want

Hilbert Space

  • A Hilbert space, \(\mathcal{H}\), is a complete inner product space
  • Hilbert space example: Set of square integrable functions on \([0,1]\) \[ L_2[0,1] = \left\{f:[0,1]\rightarrow\mathbb{R}: \int_0^1 f(x)^2dx < \infty\right\} \]
    • Inner product: \[ \langle f,g\rangle = \int_0^1 f(x)g(x)dx \]
    • Corresponding norm: \[ \|f\| = \sqrt{\langle f,f\rangle} = \sqrt{\int_0^1 f(x)^2dx} \]
    • A sequence, \(f_1,f_2,f_3,\ldots\) in \(L_2[0,1]\) converges if \[ \|f_n-f\|\rightarrow 0 \text{ as } n\rightarrow\infty \] We can also write \(f_n\rightarrow f\)
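  • Below: a quick numerical illustration of the \(L_2[0,1]\) inner product and norm above, using integrate from base R (the functions are arbitrary choices)
f <- function(x) sin(2*pi*x)
g <- function(x) x^2

# <f,g> = integral_0^1 f(x) g(x) dx
innerL2 <- function(f, g){
    integrate(function(x) f(x)*g(x), lower=0, upper=1)$value
}

innerL2(f, g)        # inner product of f and g
sqrt(innerL2(f, f))  # norm of f; equals sqrt(1/2) for sin(2*pi*x)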

Kernels

  • Assume \(\mathcal X\subseteq \mathbb R^d\) is the data space (or feature space) and \(\mathcal H\) is a Hilbert space

  • Kernel: a function \(K: \mathcal X \times \mathcal X\rightarrow \mathbb R\) such that there exists a map (called the canonical feature map) \(\phi: \mathcal X\rightarrow \mathcal H\) with, for \(x,y\in \mathcal X\), \[K(x,y) = \langle \phi(x), \phi(y)\rangle_{\mathcal{H}}\]

  • Breakdown: \(K\) gives the Hilbert space inner product of the projections of \(x\) and \(y\)

  • This is helpful because doing operations directly in \(\mathcal H\) can be difficult

  • Choosing the kernel fully determines \(\mathcal H\)

  • Mercer kernels are positive definite: for all \(f\in L_2(\mathcal X)\), \[\int_{\mathcal X}\int_{\mathcal X} K(x,y)f(x)f(y)dxdy \geq 0\]

  • Recall that a positive semidefinite matrix satisfies \(x^TAx\geq 0\) for all \(x\); kernel positive definiteness is the analogous condition

  • example: Gaussian kernel \[K(x,y) = e^{-\frac{\|x-y\|^2}{\gamma}}\]

  • By fixing one argument of \(K\) at \(x\in\mathcal X\), we get a function \(K_x:\mathcal X\rightarrow \mathbb R\) defined by \(K_x(y) = K(x,y)\); the map \(x\mapsto K_x\) sends \(\mathcal X\) into \(\mathcal H\)

  • Consider the space \[\mathcal H_0 = \left\{f : f(x)=\sum_{j=1}^k\alpha_j K_{x_j}(x)\right\}\]

  • If \(f(x)=\sum_{j=1}^k\alpha_j K_{x_j}(x)\) and \(g(x)=\sum_{j=1}^m\beta_j K_{y_j}(x)\), we can define an inner product as \[\langle f, g\rangle_K = \sum_{i=1}^k\sum_{j=1}^m \alpha_i\beta_j K(x_i, y_j)\]
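  • Below: a small sketch of the \(\mathcal H_0\) inner product above with a Gaussian kernel; the points, coefficients, and bandwidth are arbitrary
K <- function(x, y, gamma=1) exp(-(x - y)^2/gamma)  # Gaussian kernel

xs <- c(0.2, 1.5, 3.0); alpha <- c(1, -0.5, 2)   # f = sum_j alpha_j K_{x_j}
ys <- c(0.7, 2.2);      beta  <- c(0.8, 1.1)     # g = sum_j beta_j K_{y_j}

# <f,g>_K = sum_i sum_j alpha_i beta_j K(x_i, y_j)
sum(outer(alpha, beta) * outer(xs, ys, K))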

Evaluation Functional

  • Functionals take functions as input and output a real number

    • Example: \(\int_a^b f(x)dx\) is a functional of \(f\)
  • Evaluation functional, \(F_x\) maps a function \(f\in\mathcal{H}\) to \(f(x)\), that is \(F_x f = f(x)\)

  • Alternative definition: A Hilbert space \(\mathcal H\) of functions is a reproducing kernel Hilbert space if evaluation functionals are bounded

  • Riesz representation theorem: if \(A: \mathcal{H}\rightarrow \mathbb{R}\) is a bounded linear functional on a Hilbert space \(\mathcal H\), there exists some \(g_{A}\in\mathcal H\) such that \[Af = \langle f, g_{A}\rangle_{\mathcal H}\] for all \(f\in\mathcal H\)

  • We use Riesz representation theorem to prove a sufficient condition for the existence of the kernel mean embedding in an RKHS

  • Because \(F_x\) is a bounded linear functional, using Riesz, there must be a \(g_x\in\mathcal H\) such that \[F_xf = \langle g_x, f\rangle = f(x)\]

  • Proof of Riesz representation theorem:

  1. Assume \(A:\mathcal H\rightarrow \mathbb R\) is a bounded linear functional
  2. TP (to prove): (1) there exists an element \(g_A\in\mathcal H\) such that \(Af =\langle f, g_A\rangle\) for each \(f\in\mathcal H\) and (2) \(g_A\) is unique.
  3. We know that \(A\) is bounded if and only if \(A\) is continuous, so \(A\) must also be continuous.
  4. Edge case: if \(Af = 0\) for all \(f\in\mathcal H\), then \(g_A = 0\) because \(Af = 0 = \langle f, 0\rangle\).
  5. Other cases: Assume that \(A\neq 0\) (not the zero functional). That is, for some \(v_1\in\mathcal H, Av_1\neq 0\). Without loss of generality, assume \(Av_1=1\).
  6. Let \(\text{Null}(A) = \left\{f\in\mathcal H: Af=0\right\}\) be the null space of \(A\).
  7. \(\text{Null}(A)\) must be a closed linear subspace of \(\mathcal H\). Because \(A\) is continuous, the inverse image of closed sets are closed and \(\{0\}\) is closed.
  8. Let \(E = \text{Null}(A)+v_1 = \left\{f+v_1: f\in\text{Null}(A)\right\}\). Notice that \(0\not\in E\) because \(A(f+v_1)=Af+Av_1=0+1=1\) for all \(f+v_1\in E\) and \(A0=0\).
  9. Let \(\delta = \inf_{v\in E} \|v\|\) and \(\{u_n\}_{n=1}^\infty\subseteq E\) be a sequence in \(E\) such that \(\lim_{n\rightarrow\infty}\|u_n\|=\delta\)
  10. Let \(\epsilon>0\) and choose \(N\) large enough so that \(\|u_n\|-\delta < \epsilon\) for all \(n\geq N\).
  11. Let \(m,n>N\) and notice that \((u_n+u_m)/2\in E\) because \((u_n+u_m)/2\) can be written to be the sum of \(v_1\) and an element from \(\text{Null}(A)\).
  12. Then we must have \(\|u_n+u_m\|\geq 2\delta\) by 9 and 11.
  13. Using the parallelogram law \[\begin{align*} \|u_n-u_m\|^2 &= 2\|u_n\|^2+2\|u_m\|^2-\|u_n+u_m\|^2 &\text{Parallelogram Law}\\ &\leq 2(\epsilon+\delta)^2+2(\epsilon+\delta)^2-4\delta^2 & \text{by 10}\\ &= 4\epsilon(2\delta+\epsilon) \end{align*}\]
  14. Because \(\epsilon\) was chosen arbitrarily we have shown that \(\{u_n\}_{n=1}^\infty\) is a Cauchy sequence (we may need to change the \(\epsilon\) from 10 to satisfy the definition). Recall Cauchy sequence: For all \(\epsilon>0\), there is an \(N\in\mathbb N\) such that for all \(m,n\geq N, \|u_n-u_m\| < \epsilon\).
  15. Because \(\{u_n\}_{n=1}^\infty\subseteq E\subseteq\mathcal H\), a Hilbert space (complete so that Cauchy sequences converge), \(\{u_n\}_{n=1}^\infty\) must converge, say \(u_n\rightarrow u_0\).
  16. \(u_0\in E\) because \(E\) is closed and \(u_0\neq 0\) because \(0\not\in E\) from 8.
  17. Claim: \(g_A = u_0/\|u_0\|^2\)
  18. Let \(v\in\text{Null}(A)\). Then \(w=u_0+cv\in E\) for any \(c\in\mathbb R\) because \(u_0\in E\).
  19. If \(v\neq 0\) and we set \(c = -\langle u_0, v\rangle /\|v\|^2\), then \(\|w\|^2=\|u_0\|^2-\langle u_0,v\rangle^2/\|v\|^2=\delta^2-c^2\|v\|^2\)
  20. Because \(\delta=\inf_{v\in E}\|v\|\), we must have that \(c=0\), otherwise, \(\|w\|<\delta\), a contradiction.
  21. Then \(\langle u_0, v\rangle=0\) for all \(v\in\text{Null}(A)\).
  22. Then \(Av=0\) implies that \(\langle u_0, v\rangle=0\).
  23. Moreover, if \(Av=b\neq 0\), then because \(u_0\in E, Au_0=1\), we have \[A(v-bu_0) = Av-bAu_0 = b-b=0\]
  24. Then \(v-bu_0\in \text{Null}(A)\) and \(\langle u_0, v-bu_0\rangle=0\) from 22 so that \(\langle u_0,v\rangle=b\|u_0\|^2\).
  25. Using \(g_A = u_0/\|u_0\|^2\) from 17, \(Af = \langle g_A, f\rangle\) for all \(f\in\mathcal H\)
  26. This proves the claim in 17: \(g_A = u_0/\|u_0\|^2\)
  27. To show uniqueness, assume for some \(h\in\mathcal H\), \(Af = \langle h, f\rangle\) for all \(f\in\mathcal H\).
  28. Then for all \(f\in\mathcal H\), \[0 = Af-Af = \langle h, f\rangle - \langle g_A, f\rangle = \langle h-g_A, f\rangle\]
  29. Then \(0= \langle h-g_A, h-g_A\rangle=\|h-g_A\|^2\), so \(h=g_A\).
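  • Below: for intuition, the finite-dimensional version of the theorem checked directly in R (the functional and representer are illustrative): on \(\mathbb R^d\) with the dot product, a bounded linear functional \(A\) is represented by a vector \(g_A\)
A  <- function(f) 2*f[1] - f[2] + 3*f[3]  # linear functional on R^3
gA <- c(2, -1, 3)                         # its Riesz representer
f  <- c(0.5, 4, -2)
c(A(f), sum(f*gA))                        # the two values agree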

Reproducing Kernel Property and kernel trick

  • Reproducing Kernel Property: For \(x\in\mathcal X\) and \(f\in\mathcal H\) \[\langle f, K_x\rangle = f(x)\]
  • Jumping ahead: this means that \(K_x\) is the representer of \(F_x\), the evaluation functional
  • Then \(\phi(x):= K_x\) fits the definition of canonical map
  • It follows that for \(x,y\in\mathcal X\) \[K(x,y) = K_y(x) = \langle K_x, K_y\rangle = \langle \phi(x), \phi(y)\rangle \]
  • Theorem: For every positive definite function, \(k(\cdot,\cdot)\) on \(\mathcal X\times \mathcal X\), there exists a unique RKHS with \(k\) as its reproducing kernel. Conversely, the reproducing kernel of an RKHS is unique and positive definite.
  • We can also show that \(\mathcal{H} = \overline{\text{span}\left\{k(x,\cdot): x\in \mathcal X\right\}}\), where closure is taken with respect to the RKHS norm
  • Hilbert-Schmidt Operator: Let \(\mathcal H\) and \(\mathcal F\) be separable Hilbert spaces with orthonormal bases \((h_i)_{i\in I}\) and \((f_j)_{j\in J}\), respectively. A Hilbert-Schmidt (HS) operator is a bounded operator \(\mathcal A: \mathcal F\rightarrow \mathcal H\) whose HS norm \[\|\mathcal A\|^2 = \sum_{i\in I} \sum_{j\in J} |\langle \mathcal A f_j, h_i\rangle_{\mathcal H}|^2\] is finite.
  • The HS operators from \(\mathcal F\) to \(\mathcal H\) also form a Hilbert space \(HS(\mathcal F, \mathcal H)\) with an inner product defined as \[\langle \mathcal A, \mathcal B\rangle_{HS} = \sum_{j\in J} \langle \mathcal A f_j, \mathcal B f_j\rangle_{\mathcal H}\]
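  • Below: in finite dimensions an HS operator is just a matrix and the HS norm is the Frobenius norm; a quick check of the definition above with standard bases (the matrix is arbitrary)
set.seed(2)
A <- matrix(rnorm(12), nrow=3, ncol=4)  # operator from R^4 to R^3

# HS norm via the definition: sum over basis pairs of <A f_j, h_i>^2
hsDef <- sum(sapply(1:4, function(j){
    ej <- replace(numeric(4), j, 1)     # standard basis of R^4
    sapply(1:3, function(i){
        hi <- replace(numeric(3), i, 1) # standard basis of R^3
        sum((A %*% ej)*hi)^2
    })
}))
c(hsDef, sum(A^2))                      # matches the squared Frobenius norm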

Data to Kernel Mean

  • Assume \(X\sim p\)
  • In statistics, we frequently want to evaluate \(E[f(X)] = \int_{\mathcal X} f(x)p(x)dx\) for some function \(f\)
  • Problem: we almost never know \(p\)
  • To estimate \(E[X]\), for example, we use the empirical measure, which gives probability \(\frac{1}{n}\) to each sample point, and integrate with respect to it
  • Let \(x = \{x_1, \dots, x_n\}\) and let \(\delta_x = \sum_{i=1}^n \delta_{x_i}\) be the empirical counting measure, where \(\delta_{x_i}(A) = 1\) if \(x_i\in A\) and \(\delta_{x_i}(A)=0\) otherwise
  • We approximate the integral for \(E[X]\) as \[\int_{\mathcal X} tp(t)dt \approx \int_{\mathcal X} t \frac{\delta_x(t)}{n}dt = \frac{1}{n}\sum_{i=1}^n x_i\]
  • Similarly, we can approximate \(E[f(X)]\) with its plug-in estimator as \[\int_{\mathcal X} f(t)p(t)dt \approx \int_{\mathcal X} f(t) \frac{\delta_x(t)}{n}dt = \frac{1}{n}\sum_{i=1}^n f(x_i)\]
  • Kernel mean embedding: \[\mu_p = E[K(X,\cdot)] = \int_{\mathcal X} K(x,\cdot) p(x)dx\]
  • Estimate: \[\widehat{\mu_p} = \frac{1}{n}\sum_{i=1}^n K(x_i,\cdot)\]
  • In fact, \[\begin{align} \int_{\mathcal X} f(t) \frac{\delta_x(t)}{n}dt &= \int_{\mathcal X} \left\langle f, K(t,\cdot) \right\rangle_{\mathcal H} \frac{\delta_x(t)}{n} dt & \text{reproducing kernel property}\\ &= \left\langle f, \int_{\mathcal X} K(t,\cdot)\frac{\delta_x(t)}{n} dt \right\rangle_{\mathcal H} & \text{dominated convergence theorem} \\ &= \left\langle f, \widehat{\mu_p} \right\rangle_{\mathcal H} & \text{definition of } \widehat{\mu_p}\\ \end{align}\]
  • Lemma (Smola): If \(E[\sqrt{k(X,X)}]<\infty\), then \(\mu_p\in \mathcal H\) and \(E[f(X)] = \langle f, \mu_p\rangle_{\mathcal H}\)
  • Proof: Let \(L_P: \mathcal H\rightarrow \mathbb R\) be a linear operator (functional) defined by \(L_Pf := E_{X\sim P}[f(X)]\). Using the assumption, \[\begin{align} |L_Pf| &= |E_{X\sim P}[f(X)]| &\text{by definition}\\ &\leq E_{X\sim P}[|f(X)|] &\text{Jensen's inequality}\\ &= E_{X\sim P}\left[ \left|\langle f, k(X,\cdot) \rangle_{\mathcal H}\right|\right] &\text{reproducing kernel property}\\ &\leq E_{X\sim P}\left[ \|f\|_{\mathcal H} \sqrt{k(X,X)} \right] &\text{Cauchy-Schwartz inequality} \end{align}\] Then \(L_P\) is bounded, so we may apply the Riesz representation theorem. Then there must exist \(\lambda\in\mathcal H\) such that \(L_Pf = \langle f,\lambda \rangle_{\mathcal H}\). Choose \(f=k(x,\cdot)\) for some \(x\in\mathcal X\). Then \[\lambda(x) = L_Pk(x,\cdot) = \int_{\mathcal X} k(x,y)dP(y)\] so that \[\lambda=\int_{\mathcal X}k(\cdot, y)dP(y) =\int_{\mathcal X}k(\cdot, y) p(y)dy =\mu_P~.\]
  • Proposition: If \(k\) and \(\ell\) are reproducing kernels, then \(k\cdot \ell\) is a reproducing kernel.
  • Theme: we can take inner products in the RKHS to evaluate integrals and get parameter estimates
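  • Below: a short sketch of the empirical kernel mean embedding \(\widehat{\mu_p}\) above (Gaussian kernel, illustrative names); RKHS inner products with \(\widehat{\mu_p}\) reduce to sample averages of kernel evaluations
set.seed(3)
n <- 500
x <- rnorm(n)                       # sample from p
K <- function(a, b, gamma=1) exp(-(a - b)^2/gamma)

# muhat(t) = (1/n) sum_i K(x_i, t) = <K_t, muhat>, an estimate of E[K(X, t)]
muhat <- function(t) mean(K(x, t))
muhat(0.5)

# squared RKHS norm of muhat: (1/n^2) sum_ij K(x_i, x_j)
mean(outer(x, x, K))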

Characteristic Kernels

  • Let \(\mathcal P\) be the set of probability densities on \(\mathcal X\)

  • Let \(P_X\) be the CDF (\(P_X(x) = P(X\leq x)=\int_{-\infty}^x p_X(y)dy\)) and \(p_X\) the density (pdf)

  • A kernel is characteristic if the mapping \(\mu:\mathcal P\rightarrow \mathcal H\) defined by \[\mu_P = \int_{\mathcal X} k(x,\cdot) p(x)dx\] is one-to-one (\(\mu_P=\mu_Q \Rightarrow P=Q\))

  • To show that a kernel, \(k\), is characteristic, we must show that for two CDFs, \(P\) and \(Q\), \(\mu_P=\mu_Q\) implies that \(P=Q\).

  • The function \(\mu\) is the kernel mean embedding function

  • The following theorem gives a general characterization of characteristic kernels. Here we prove the sufficient condition for a kernel to be characteristic but it is also a necessary condition.

  • Definition of square integrable function: \(L^2(\mathcal X, P) = \{f: \int_{\mathcal X} f(x)^2dP(x) <\infty\}\) (Recall \(\int_{\mathcal X} f(x)dP(x) = \int_{\mathcal X} f(x)p(x)dx\))

  • Theorem Let \(\mathcal H\) be an RKHS with corresponding measurable, bounded kernel, \(k\), on a measurable space \((\mathcal X, \mathcal B)\). If \(\mathcal H + \mathbb R\) (direct sum) is dense in \(L^2(\mathcal X,P)\) for any \(P\in\mathcal P\), then \(k\) is characteristic.

  • Proof:

    1. To show that \(\mu\) is injective, assume \(P\) and \(Q\) are probability measures such that \(\mu_P=\mu_Q\).
    2. Proof by contradiction: assume \(P\neq Q\).
    3. Let \(R= |P-Q|\); that is, \(R\) is a new measure defined as \(R(A) = |P(A)-Q(A)|\).
    4. Let \(\epsilon>0\) and \(A\in\mathcal B\).
    5. Since \(\mathcal H+\mathbb R\) is dense in \(L^2(\mathcal X,R)\), there must be a function \(f\in\mathcal H+\mathbb R\) such that \[\int_{\mathcal X} |f(x)-I_A(x)|dR(x)<\epsilon\] (Recall that \(\int fdP = E_P[f(X)]\) and \(\int I_AdP=P(A)\))
    6. And because \(\mu_P=\mu_Q\), \[E_P[f(X)] = \langle f, \mu_P\rangle = \langle f, \mu_Q\rangle = E_Q[f(X)]\]
    7. Then \[\begin{align*} |Q(A)-P(A)| &=\left|(E_P[f(X)]-P(A))-(E_Q[f(X)]-Q(A))\right|\\ &= \left|\int_{\mathcal X} f(x)-I_A(x) dP(x)-\int_{\mathcal X} f(x)-I_A(x) dQ(x)\right|\\ &\leq \left| \int_{\mathcal X} f(x)-I_A(x) dR(x)\right|\\ &\leq \int_{\mathcal X} |f(x)-I_A(x)|dR(x)< \epsilon \end{align*}\]
    8. Since, \(A\) and \(\epsilon\) were arbitrary, we must have that \(P(A)=Q(A)\) for all \(A\in\mathcal B\) so that \(P=Q\).
    9. This shows that \(\mu\) is injective so that \(k\) is characteristic.
  • We want to show that the Gaussian kernel, \(k(x,y)=\exp\left(-\frac{\|x-y\|}{2\sigma}\right)\), is characteristic.

  • It turns out there there is still more work to do to show this. The theorem below helps prove that the Gaussian kernel is characteristic.

  • The Fourier transform of a function \(\phi\) is \[\tilde\phi(\xi) := \int_{\mathcal X} e^{-it\xi}\phi(t)dt.\]

  • Theorem: Assume that \(k\) is translation invariant; that is, \(k(x,y)=\phi(x-y)\) for some function \(\phi\). If for all \(\xi\in\mathbb R^m\), there exists \(\tau_0\) such that \[\begin{equation} \int \frac{\tilde\phi(\tau(u+\xi))^2}{\tilde\phi(u)}du<\infty \end{equation}\] for all \(\tau>\tau_0\) then \(\mathcal H\) is dense in \(L^2(P)\) for any Borel probability measure \(P\) on \(\mathbb R^m\).

  • Proof: Without loss of generality, assume that \(\phi(0)=1\). Using positive definiteness of \(k\), we have that \(|\phi(z)|^2\leq \phi(0)^2=1\). Now, the RKHS associated with \(k\) consists of the functions \[\mathcal H = \left\{f\in L^2(\mathbb R^m) : \int \frac{|\tilde f(u)|^2}{\tilde\phi(u)}du<\infty\right\},\] where \(\tilde f\) and \(\tilde\phi\) are the Fourier transforms of \(f\) and \(\phi\).

    Let \(P\) be an arbitrary probability measure on \(\mathbb R^m\) and \(\xi\in\mathbb R^m\). Notice that the Fourier transform of \(e^{-i\xi^Tz}\phi(z/\tau)\) is \(\tilde\phi(\tau(u+\xi))\). From the assumption of the theorem we must have that \(e^{-i\xi^Tz}\phi(z/\tau)\in\mathcal H\) for all \(\tau>\tau_0\). Since \(\phi(z/\tau)\rightarrow 1\) as \(\tau\rightarrow\infty\) and \(e^{-i\xi^Tz}\phi(z/\tau)\) is bounded by a constant, we can use the dominated convergence theorem to ensure that \[E_P\left[e^{-i\xi^Tz} - e^{-i\xi^Tz}\phi(z/\tau)\right]\rightarrow 0 \text{ as } \tau\rightarrow\infty.\] This shows that we only have to prove that the span of \(\mathcal A :=\{e^{-i\xi^Tz}:\xi\in\mathbb R^m\}\) is dense in \(L^2(P)\).\

    Now, let \(f\in L^2(P)\). We can assume that \(f\) is differentiable with compact support, since such functions are dense in \(L^2(P)\). Let \(\epsilon>0\) and \(M=\sup_{x\in\mathbb R}|f(x)|\), and \(A\) such that \([-A,A]^m\) contains the support of \(f\) and \(P([-A,A]^m)>1-\epsilon/4M^2\). From a functional analysis text, we have that the series of functions \[f_N(z) = \sum_{n_1=-N}^N \dots\sum_{n_m=-N}^N c_n\exp\left\{\frac{i\pi}{A}n^Tz\right\} \text{ for } N\in\mathbb{N}\] where \[c_n = \frac{1}{(2A)^m}\int_{[-A,A]^m}f(z)\exp\left\{-\frac{i\pi}{A}n^Tz\right\}dz\] converges uniformly to \(f(z)\) on \([-A,A]^m\) as \(N\rightarrow\infty\). Then there must be some \(N\) large enough such that \(|f(z)-f_N(z)|^2<\epsilon/2\) on \([-A,A]^m\) and from the definition of \(f_N(z)\) it must be the case that \[\sup_{x\in\mathbb R^m}|f_N(z)|^2<\left(M+\sqrt{\frac{\epsilon}{2}}\right)^2 < 2M^2.\] Then we have that \(E_P|f-f_N|^2<\epsilon\) so that span\(\mathcal A\) is dense in \(L^2(P)\).

  • Now, to apply this theorem to the Gaussian kernel, we must first compute \(\tilde \phi\) where \(\phi(\xi)=\exp\left(-\frac{|\xi|}{2\sigma}\right)\) \[\tilde \phi(\xi) = \frac{1}{2\sigma\pi(\xi^2+(2\sigma)^{-2})}~.\]

  • Then show that for some \(\tau_0\), \(\int \frac{\tilde\phi(\tau(u+\xi))^2}{\tilde\phi(u)}du<\infty\) for all \(\tau>\tau_0\) \[\int \frac{\tilde\phi(\tau(u+\xi))^2}{\tilde\phi(u)}du = \int \frac{u^2+(2\sigma)^{-2}}{2\sigma\pi[(\tau(u+\xi))^2+(2\sigma)^{-2}]^2} du\] The integrand behaves like \(u^{-2}\) as \(|u|\rightarrow\infty\) and is bounded elsewhere, so the integral is finite for any \(\tau>0\).

Cross Covariance Operators

  • Let \(X\) and \(Y\) be random variables with marginal densities, \(p_X\) and \(p_Y\), reproducing kernels, \(k\) and \(\ell\), and corresponding RKHSs, \(\mathcal H\) and \(\mathcal F\), respectively

  • Assume \(E[k(X,X)], E[\ell(Y,Y)]<\infty\)

  • Cross covariance operator (uncentered): \(\mathcal C_{YX}:\mathcal H\rightarrow\mathcal F\) defined as \[ \mathcal C_{YX} := E_{YX}[\ell(Y,\cdot)\otimes k(X,\cdot)] = \mu_{p_{YX}}\] where \(p_{YX}\) is the joint distribution of \(X\) and \(Y\) and \(\otimes\) is the tensor product

  • Using a tensor only works here because \(\mathcal H\otimes \mathcal F\) and \(HS(\mathcal H, \mathcal F)\) are isomorphic using this map: \[\sum_{i\in I} f_i\otimes g_i \mapsto \left[h \mapsto \sum_{i\in I} \langle h, f_i\rangle_{\mathcal H} \cdot g_i\right]\]

  • \(\mathcal C_{YX}\) can be thought of as the expected value of a tensor product and as a mapping from \(\mathcal H\) to \(\mathcal F\)

  • As a mapping: \[\begin{align} \langle g, \mathcal C_{YX} f\rangle_{\mathcal F} &= \left\langle E_{YX}[\ell(Y,\cdot)\otimes k(X,\cdot)], g\otimes f \right\rangle_{\mathcal F\otimes\mathcal H} & \text{definition of }\mathcal C_{YX} \\ &= E_{YX}\left[ \left\langle \ell(Y,\cdot)\otimes k(X,\cdot), g\otimes f \right\rangle_{\mathcal F\otimes\mathcal H} \right] & \text{linearity of expectations} \\ &= E_{YX}\left[ \left\langle \ell(Y,\cdot), g \right\rangle_{\mathcal F}\cdot \left\langle k(X,\cdot), f \right\rangle_{\mathcal H} \right] & \text{definition of tensor product}\\ &= E_{YX}[g(Y)f(X)] & \text{reproducing kernel property} \end{align}\]

  • Covariance Operator: If \(X=Y\), then \(\mathcal C_{XX}\) maps \(f\in\mathcal H\) such that for \(g\in\mathcal H\) \(\langle g, \mathcal C_{XX} f\rangle = E_X[g(X)f(X)]\)

    • Self Adjoint: Note that for \(f, g\in\mathcal H,\langle g, \mathcal C_{XX} f\rangle = \langle \mathcal C_{XX}g,f\rangle\)
  • \(\mathcal C_{XY}\) is the adjoint of \(\mathcal C_{YX}\): for \(f\in\mathcal H, g\in\mathcal F, \langle g, \mathcal C_{YX} f\rangle_{\mathcal F} = \langle \mathcal C_{XY}g,f\rangle_{\mathcal H}\)

  • We can also write \(\mathcal C_{YX}, \mathcal C_{XX}\) as integral functions:

    • \((\mathcal C_{YX} f)(\cdot) = \int_{\mathcal X\times\mathcal Y} \ell(\cdot, y)f(x) p_{XY}(x,y)dxdy\)
    • \((\mathcal C_{XX} f)(\cdot) = \int_{\mathcal X} k(\cdot, x)f(x) p_{X}(x)dx\)
  • Using the first equation, \[\begin{align*} \langle g, \mathcal C_{YX} f\rangle_{\mathcal F} &= \left\langle g, \int_{\mathcal X\times\mathcal Y} \ell(\cdot, y)f(x) p_{XY}(x,y)dxdy\right\rangle_{\mathcal F} &\text{first equation above} \\ &= \int_{\mathcal X\times\mathcal Y} \langle g,\ell(\cdot, y)\rangle_{\mathcal F} f(x) p_{XY}(x,y)dxdy &\text{DCT} \\ &= \int_{\mathcal X\times\mathcal Y} g(y)f(x) p_{XY}(x,y)dxdy &\text{Reproducing kernel} \\ &= E_{YX}[g(Y)f(X)] &\text{Definition of expectation} \end{align*}\]

  • Theorem (Fukumizu 2004): If \(E_{Y}[g(Y)|X=\cdot]\in\mathcal H\), then for \(g\in\mathcal F\), \[\mathcal C_{XX} E_{Y}[g(Y)|X=\cdot] = \mathcal C_{XY} g~.\]

  • Centered cross covariance operator: \(E_{YX}[(\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X)]\)

  • Given an IID sample, \((x_1,y_1), \dots, (x_n, y_n)\), we estimate the centered cross covariance operator as \[ \hat{\mathcal{C}}_{YX} = \frac{1}{n}\sum_{i=1}^n \left(\ell(y_i,\cdot)-\hat\mu_{p_Y}\right)\otimes \left( k(x_i,\cdot)-\hat\mu_{p_X}\right) = \frac{1}{n}\psi^T H \phi \] where \(\psi = [\ell(y_1,\cdot), \dots, \ell(y_n,\cdot)]^T\), \(\phi = [k(x_1,\cdot),\dots, k(x_n, \cdot)]^T\), and \(H = I_n-\frac{1}{n} 1_{n\times n}\)

  • Berlinet and Thomas-Agnan (2004) showed that \[\|\hat{\mathcal C}_{YX} - \mathcal C_{YX}\|_{HS}=O_P(1/\sqrt n)\] as \(n\rightarrow\infty\)

  • Theorem (Baker, 1973): There exists a unique bounded operator, \(\mathcal V_{YX}: \mathcal H\rightarrow\mathcal F\) with \(\|\mathcal V_{YX}\|_{HS}\leq 1\) such that \[\mathcal C_{YX} = \mathcal C_{YY}^{1/2}\mathcal V_{YX} \mathcal C_{XX}^{1/2}\].

  • \(\mathcal V_{YX}\) is called the normalized cross covariance operator

  • \(\mathcal V_{YX}\) captures the dependence information between \(X\) and \(Y\) with less influence from the marginal distributions

  • Why is this helpful for independence? If we find \(f\) and \(g\) to make the expectation above look like the product of moment generating functions, we can use that to show independence.

  • The next theorem shows how the cross covariance operator and independence are related

  • Theorem: \(\|\tilde{\mathcal C}_{YX}\|_{HS}=0\) if and only if \(X \perp Y\), where \(\tilde{\mathcal C}_{YX}\) is the centered cross covariance operator

  • Proof: \[\begin{align*} \|\tilde C_{YX}\|_{HS}=0 &\Leftrightarrow \tilde C_{YX}=0 \\ &\Leftrightarrow \mu_{P_{YX}}=\mu_{P_Y}\otimes\mu_{P_X} & \text{because } \mu_{P_{YX}}-\mu_{P_Y}\otimes\mu_{P_X}=\tilde C_{YX} \\ &\Leftrightarrow P_{YX} = P_XP_Y \\ &\Leftrightarrow X\perp Y \end{align*}\]

  • How do we evaluate \(\|\tilde{\mathcal C}_{YX}\|_{HS}\)?

  • Recall that for any Hilbert-Schmidt Operator, \(\mathcal A\), \[\|\mathcal A\|_{HS}^2 = \sum_{i\in I} \sum_{j\in J} |\langle \mathcal A f_j, h_i\rangle_{\mathcal H}|^2\]

  • We can also think of the summations like integrals

  • Example: If \(X\sim P\) and \(X'\) is an independent copy of \(X\), \(\|\mu_P\|_{\mathcal H}^2 = \left\langle E[k(X,\cdot)], E[k(X',\cdot)] \right\rangle = E_{XX'}[k(X,X')] = \int_{\mathcal X}\int_{\mathcal X} k(x,x')p_X(x)p_X(x')dxdx'\)

  • Similarly \(\hat \mu_P =\frac{1}{n}\sum_{i=1}^n k(x_i,\cdot)\), so \(\|\hat\mu_P\|_{\mathcal H}^2 = \left\langle \frac{1}{n}\sum_{i=1}^n k(x_i,\cdot), \frac{1}{n}\sum_{i=1}^n k(x_i,\cdot) \right\rangle = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n k(x_i,x_j)\)

  • Big Picture: We are going to use the data to estimate \(\|\hat{\mathcal C}_{YX}\|\), as a test statistic. It is very unlikely that \(\|\hat{\mathcal C}_{YX}\|=0\) but if we can figure out its distribution under the null hypothesis, we can check if it’s close enough.

  • How do we use this to evaluate HS norms? There are several ways!

  • Gretton 2007 Kernel Test of Independence

    • Let \(X'\) and \(Y'\) be an independent copy of \(X\) and \(Y\), respectively. \[\begin{align*} &\|\tilde{\mathcal C}_{YX}\|_{HS}^2 \\ &= \|E_{YX}[(\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X)] \|_{HS}^2 \\ &= \langle E_{YX}[(\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X)], E_{Y'X'}[(\ell(Y',\cdot)-\mu_Y)\otimes(k(X',\cdot)-\mu_X)] \rangle_{HS} \\ &= E_{YY'XX'}[\langle (\ell(Y,\cdot)-\mu_Y)\otimes(k(X,\cdot)-\mu_X), (\ell(Y',\cdot)-\mu_Y)\otimes(k(X',\cdot)-\mu_X) \rangle_{HS}] \\ &= E_{YY'XX'}\left[\left(\ell(Y,Y')-\langle \mu_Y, \ell(Y',\cdot)\rangle -\langle \mu_Y, \ell(Y,\cdot)\rangle+ \|\mu_Y\|^2\right)\times\left(k(X,X') -\langle k(X,\cdot), \mu_X\rangle -\langle k(X',\cdot), \mu_X\rangle+\|\mu_X\|^2\right)\right] \\ &= E_{YY'XX'}\left[\ell(Y,Y')k(X,X')\right] -2E_{YX}\left[E_{Y'}[\ell(Y,Y')]E_{X'}[k(X,X')]\right] +E_{YY'}[\ell(Y,Y')]E_{XX'}[k(X,X')] \end{align*}\] This can be estimated as a U-statistic \[HSIC = \frac{1}{\binom{n}{2}}\sum_{(i,j)} k(x_i,x_j)\ell(y_i,y_j) -\frac{2}{\binom{n}{3}}\sum_{(i,j,q)} k(x_i,x_j)\ell(y_i,y_q) +\frac{1}{\binom{n}{4}}\sum_{(i,j,q,r)} k(x_i,x_j)\ell(y_r,y_q)\] where each tuple runs over all unique value combinations

    • And as a V-statistic \[HSIC_b = \frac{1}{n^2}\sum_{i,j}^n k(x_i,x_j)\ell(y_i,y_j) -\frac{2}{n^3}\sum_{i,j,q}^n k(x_i,x_j)\ell(y_i,y_q) +\frac{1}{n^4}\sum_{i,j,q,r}^n k(x_i,x_j)\ell(y_r,y_q) = \frac{1}{n^2}\text{tr}(KHLH)\] where \(K_{ij} = k(x_i,x_j), L_{ij}= \ell(y_i,y_j), H=I_n -\frac{1}{n}\mathbf{1}\mathbf{1}^T\)

    • Distribution of HSIC: The paper shows that, under the null hypothesis (\(X\perp Y\)), \[n\cdot HSIC \xrightarrow{D} \sum_{l=1}^\infty \lambda_l z_l^2\] where the \(z_l\) are independent standard normals and the \(\lambda_l\) are unknown, theoretical eigenvalues. In practice, we can either approximate its distribution under the null hypothesis with a Gamma distribution or use a permutation test; a small sketch of the permutation approach follows below.
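    • Below: a compact sketch of the V-statistic \(HSIC_b=\frac{1}{n^2}\text{tr}(KHLH)\) with a permutation null, applied to the parabola example from the start of the notes; the kernel width and number of permutations are arbitrary choices, not the paper’s recommendations
set.seed(4)
n <- 200
x <- runif(n)*10
y <- (x - 5)^2 + 2*rnorm(n)               # dependent but essentially uncorrelated

rbf <- function(v, theta=1) exp(-as.matrix(dist(v))^2*theta/2)
H   <- diag(n) - matrix(1, n, n)/n

Kc <- H %*% rbf(x) %*% H                  # H K H, computed once
# HSIC_b = tr(K H L H)/n^2 = sum_ij (HKH)_ij L_ij / n^2
hsicb <- function(L) sum(Kc*L)/n^2

stat <- hsicb(rbf(y))
perm <- replicate(500, hsicb(rbf(sample(y))))  # permutation null: shuffle y
mean(perm >= stat)                        # approximate p-value; small here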

  • Fukumizu 2007 Kernel Conditional Independence

    • Covariance of \(X\) and \(Y\): \(cov(X,Y) = E[(X-E[X])(Y-E[Y])]\); if \(X,Y\) are vectors, \(cov(X,Y)\) is a matrix with covariance pair entries
    • Conditional covariance: \(cov(X,Y|Z) = E[(X-E[X|Z])(Y-E[Y|Z])|Z]\)
    • Assume \(X, Y, Z\) are random vectors, then \(cov(X,Y|Z) = cov(X,Y)-cov(X,Z)[cov(Z,Z)]^{-1}cov(Z,Y)\) (The proof uses a 2x2 matrix inversion identity)
    • This works similarly for conditional cross covariance: \[\mathcal{C}_{YX|Z} = \mathcal{C}_{YX}-\mathcal{C}_{YZ}\mathcal{C}_{ZZ}^{-1}\mathcal{C}_{ZX}\] where \(\mathcal{C}_{VW}\) is the cross covariance as before, proof in this paper
    • Theorem: \(X\perp Y|Z \Leftrightarrow \|\mathcal{C}_{YX|Z}\|_{HS}=0\)
    • However, this paper prefers to use the normalized measure: \[\mathcal{V}_{YX|Z} = \mathcal{C}_{YY}^{-1/2}\left(\mathcal{C}_{YX}-\mathcal{C}_{YZ}\mathcal{C}_{ZZ}^{-1}\mathcal{C}_{ZX}\right)\mathcal{C}_{XX}^{-1/2}\]
    • Again, \(X\perp Y|Z \Leftrightarrow \|\mathcal{V}_{YX|Z}\|_{HS}=0\)
    • To estimate \(\|\mathcal{V}_{YX|Z}\|_{HS}\), the authors make a small change to the estimator by substituting \(\ddot X := (X,Z), \ddot Y := (Y,Z)\) in place of \(X\) and \(Y\), and instead estimate \(\|\mathcal{V}_{\ddot Y\ddot X|Z}\|_{HS}\)
    • \(\|\widehat{\mathcal{V}_{\ddot Y\ddot X|Z}}\|_{HS}^2 = \text{Tr}\left(R_\ddot Y R_\ddot X -2R_\ddot Y R_\ddot X R_Z + R_\ddot Y R_Z R_\ddot X R_Z\right)\) where \(R_W = G_W(G_W+n\epsilon_n I_n)^{-1}\) and \(G_W= \left[k(w_i,w_j)\right]_{ij} H\)
    • Theorem: If \(\epsilon_n\rightarrow 0\) and \(\epsilon_n^3n\rightarrow\infty\), then \[\left\| \widehat{\mathcal{V}_{\ddot Y\ddot X|Z}} -\mathcal{V}_{\ddot Y\ddot X|Z}\right\|\xrightarrow{P} 0\] as \(n\rightarrow\infty\).
    • This paper does not provide the null distribution of \(\|\mathcal{V}_{YX|Z}\|_{HS}\) but uses permutation testing
      • note that because this is a conditional test, we can only permute \(X\) and \(Y\) for the same values of \(Z\)
  • Zhang 2012 Kernel Conditional Test

    • This paper builds on the previous paper by slightly changing the estimator and providing a null distribution

      • Big Picture: Use Gaussian process regression to estimate \(E[X|Z]\) and \(E[Y|Z]\) to determine independence

      • Conditional correlation analogy
      • Regression residuals: \(\epsilon_X :=X-E[X|Z]\) and \(\epsilon_Y :=Y-E[Y|Z]\)
      • If \(Corr(\epsilon_X,\epsilon_Y) = 0\), we can say that \(X\) and \(Y\) are uncorrelated given \(Z\)
      • Using kernels, the same framework extends to conditional independence via the centralized kernel matrices \(\tilde K_{\ddot X|Z}\) and \(\tilde K_{Y|Z}\)
    • The paper estimates \(\tilde K_{\ddot X|Z}\) as \[\tilde K_{\ddot X|Z} := R_Z \tilde K_{\ddot X} R_Z\] where \(R_Z\) is calculated from the GP regression

    • \(\tilde K_{Y|Z}\) is calculated similarly

    • KCI test statistic: \[T_{CI} := \frac{1}{n}\text{TR}\left(\tilde K_{\ddot X|Z} \tilde K_{Y|Z}\right)\]

    • Under the null hypothesis (\(X\perp Y|Z\)), \[T_{CI} \sim \frac{1}{n} \sum_{k=1}^{n^2} \lambda_k Z_k^2\] where \(\lambda_k\) are eigenvalues and \(Z_k\sim N(0,1)\) (independent standard normal RVs)

    • The distribution above can be approximated using a gamma distribution: \[Gamma(k,\theta) = \frac{t^{k-1} \exp\left(-\frac{t}{\theta}\right)}{\theta^k \Gamma(k)}\] where \[k = \frac{(E[T_{CI}])^2}{Var[T_{CI}]} \text{ and } \theta = \frac{Var[T_{CI}]}{E[T_{CI}]}\]

    • Recall the eigenvalue decomposition of a symmetric square matrix, \(A\):

      • Definition: An eigenvalue, eigenvector pair for \(A\) is any \(\lambda\in\mathbb R\), \(v\in \mathbb R^d\) such that \(\lambda v = Av\); wlog let \(\|v\|=1\)
      • Let \(U = [v_1,\dots, v_d]\) and \(\Lambda = \text{diag}(\lambda_1,\dots, \lambda_d)\), then \(A = U\Lambda U^T\), and \(U\) is an orthogonal matrix (\(U^{-1}=U^T\))
    • Note that \(K_X = [k(x_i,x_j)]_{ij}= [k(x_1,\cdot), \dots, k(x_n,\cdot)]^T \otimes [k(x_1,\cdot), \dots, k(x_n,\cdot)]^T\)

    • If \(\tilde K_X = HK_XH\) is the centralized kernel matrix and \(\tilde K_X = U_X \Lambda_X U_X^T\), we can estimate the kernel map, \([k(x_1,\cdot)-\hat\mu_X, \dots, k(x_n,\cdot)-\hat\mu_X]^T\), as \[U_X \Lambda_X^{1/2},\] since \(\tilde K_X = \left(U_X \Lambda_X^{1/2}\right)\left(U_X \Lambda_X^{1/2}\right)^T\) (this is what mapx and mapy compute in the code below)

    • Taking a closer look at the code

    • Below: threshold parameters

#The program codes the KCI statistic
#Zhang 2012 Kernel-based conditional independence test...
setwd(".")
epsilon <- 0 #10^-2
eigThresh <- 10^-10
thres <- 1e-5
  • Below: Calculate \(\text{Tr}(AB)\) faster
    • forming the full product \(AB\) is \(O(n^3)\), but we don’t actually need all of its entries, only the diagonal
    • below is \(O(n^2)\)
# tr(AB) = sum_i (AB)[i,i] = sum_i A[i,] . B[,i]; avoids forming the full product
trProd <- function(A, B){
    stopifnot(ncol(A) == nrow(B))
    sum(sapply(1:nrow(A),
               function(ii){A[ii,] %*% B[,ii]}
               )
        )
}

# tr(A A^T) = sum_i A[i,] . A[i,] = sum of the squared entries of A
trProdTrans <- function(A){
    sum(sapply(1:nrow(A),
               function(ii){A[ii,] %*% A[ii,]}
               )
        )
}
  • Below: if Cholesky decomposition doesn’t work, we add 0.01 to the diagonal and try again
cholsky <- function(origMat){
    # try the Cholesky factor; if it fails (matrix not numerically positive
    # definite), add 0.01 to the diagonal and try again
    cholStatus <- try(u <- chol(origMat), silent = TRUE)
    cholError <- inherits(cholStatus, "try-error")
    if (cholError){
        u <- chol(origMat + 0.01 * diag(dim(origMat)[1]))
    }
    return(u)
}
  • Below: function to generate (uncentered) kernel matrix
    • \(K_{ij} = k(x_i, x_j) = \exp\left(-\frac{\theta}{2}\|x_i-x_j\|^2\right)\)
gaussKernMat <- function(dat, theta){
    exp(as.matrix(-dist(dat, diag=TRUE, upper=TRUE)^2*theta/2))
}
  • Below: Gaussian Process (GP) Regression Functions
    • ardKernel automatic relevance determination kernel: minimizes features not associated with outcome
    • ardGrad gradient of ardKernel, used for optimization
    • gplogLik negative log marginal likelihood for GP regression (optim minimizes it)
    • gplogLikGrad gradient of gplogLik, used for optimization
# ard: automatic relevance determination
ardKernel <- function(params, x, noise=TRUE){
    #This function is the ARD kernel
    #params = (variable precision params, signal var, model var)
    x <- as.matrix(x)
    xDim <- dim(x)
    prec_param <- exp(params[1:xDim[2]])
    sig_var <- exp(2*params[xDim[2] + 1])
    if(noise){
        mod_var <- exp(2 * params[xDim[2] + 2])
    } else { mod_var <- 0}
    kern <- sig_var *
        exp(-as.matrix(dist(
                 x %*% diag(1/prec_param, nrow=length(prec_param)),
                 diag=TRUE, upper = TRUE)^2/2
                 )
            ) + (mod_var + epsilon) * diag(xDim[1])
    return((kern+t(kern))/2)
}

ardGrad <- function(params, x, coord){
    xDim <- dim(x)
    prec_param <- exp(params[1:xDim[2]])
    sig_var <- exp(2*params[xDim[2] + 1])
    mod_var <- exp(2*(params[xDim[2] + 2]))
    if(coord <= xDim[2]){
    kern <-  #sig_var *
        exp(-as.matrix(dist(
                 x %*% diag(1/prec_param, nrow=length(prec_param)),
                 diag=TRUE, upper = TRUE)^2/2
                 )
            )
    out <- kern *
        as.matrix(dist(x[,coord],
                       diag=TRUE, upper=TRUE)/prec_param[coord])^2
    return(out)
    }
    if(coord == xDim[2]+1){
        kern <-  sig_var *
            exp(-as.matrix(dist(
                     x %*% diag(1/prec_param, nrow=length(prec_param)),
                     diag=TRUE, upper = TRUE)^2/2
                     )
                )
        return(2 * kern)
    }
    if(coord == xDim[2]+2){
        return(2 * mod_var * diag(xDim[1]))
    }
}


gplogLik <- function(params, x, y){
    xDim <- dim(x)
    sig_var <- exp(2*params[xDim[2] + 1])  ###  Check this out
    x <- as.matrix(x)
    y <- as.matrix(y)
    xDim <- dim(x)
    yDim <- dim(y)
    kern <- ardKernel(params, x)
    L <- chol(kern)
    alpha <- solve(kern, y,
                   system=signature(a = "dppMatrix", b = "dsparseMatrix"))
    0.5 * sum(diag(t(y) %*% alpha)) +
        yDim[2]*(0.5 * log(sig_var) + sum(log(diag(L)))) +
        0.5 * xDim[1] * yDim[2] * log(2*pi)
}

gplogLikGrad <- function(params, x, y){
    paramNum <- length(params)
    yDim <- dim(y)
    out <- rep(NA, paramNum)
    kern <- ardKernel(params, x)
    alpha <- solve(kern, y,
                   system=signature(a = "dppMatrix", b = "dsparseMatrix"))
    W <- yDim[2] * solve(kern, system=signature(a = "dgCMatrix")) -
        tcrossprod(alpha)
    for(itr in 1:paramNum){
        out[itr] <- sum((W * ardGrad(params, x, itr)))/2
    }
    return(out)
}
  • Below: Actual kciTest code
    • inputs: x,y, z (optional)
    • output: p-value
kciTest <- function(x, y, z=NULL, theta=NULL){
    x <- as.matrix(x)
    y <- as.matrix(y)
    x <- scale(x) #scale standardizes columns
    y <- scale(y)
    nobs <- dim(x)[1]
    H <- diag(nobs) - matrix(1, nrow=nobs, ncol=nobs)/nobs

    if(is.null(theta)){
        if(nobs <= 200){
            width <- 1.2
        } else if(nobs > 1200){
            width <- 0.4
        } else {width <- 0.7}
        theta <- 1/width^2
    }

    if(is.null(z)){ # if nothing is entered for z, runs independence test
        kernMatX <- gaussKernMat(x, theta)
        kernMatY <- gaussKernMat(y, theta)

        kx <- H %*% (kernMatX %*% H)
        ky <- H %*% (kernMatY %*% H)

        test.stat <- sum(diag(kx %*% ky))/nobs
        th.mean <- sum(diag(kx)) * sum(diag(ky)) / nobs^2
        th.var <- 2 * trProd(kx, kx) * trProd(ky, ky) / nobs^4
    } else{ # if z entered, runs conditional independence test
        z <- as.matrix(z)
        z <- scale(z)
        zDim <- dim(z)[2]

        theta <- 1/(width^2 * zDim) # uses width from above (assumes theta was not supplied)

        # centralized kernel matrix for ddot X = (X,Z), tilde K_X
        kernMatXd <- gaussKernMat(cbind(x,z/2), theta) 
        kxd <- H %*% (kernMatXd %*% H)

        # centralized kernel matrix for Y, tilde K_Y
        kernMatY <- gaussKernMat(y, theta)
        ky <- H %*% (kernMatY %*% H)

        eigx <- eigen(kxd, symmetric = TRUE)
        eigy <- eigen(ky, symmetric = TRUE)

        # list of T/F for keeping eigenvalues, keep large enough values
        eigx_keep <- eigx$value >= thres * eigx$value[1]
        eigy_keep <- eigy$value >= thres * eigy$value[1]

        # getting sqrt(K_X), sqrt(K_Y)
        mapx <- eigx$vectors[,eigx_keep] %*%
            diag(sqrt(eigx$values[eigx_keep]), nrow=sum(eigx_keep))
        mapy <- eigy$vectors[,eigy_keep] %*%
            diag(sqrt(eigy$values[eigy_keep]), nrow=sum(eigy_keep))

        # getting GP parameters
        init_eta <- log(width * sqrt(zDim))
        init_params <- c(rep(init_eta, zDim), 0, log(sqrt(0.1)))
        opt_x <- optim(init_params, gplogLik, gplogLikGrad,
                       x=z, y=2*sqrt(nobs) * mapx/sqrt(eigx$value[1]))
        opt_y <- optim(init_params, gplogLik, gplogLikGrad,
                       x=z, y=2*sqrt(nobs) * mapy/sqrt(eigy$value[1]))
        opt_hyp_x <- opt_x$par
        opt_hyp_y <- opt_y$par

        # calculating regression matrix R_Z in the notes above
        Px <- (diag(nobs) - ardKernel(opt_hyp_x, z, FALSE) %*%
               solve(ardKernel(opt_hyp_x, z)))
        Py <- (diag(nobs) - ardKernel(opt_hyp_y, z, FALSE) %*%
               solve(ardKernel(opt_hyp_y, z)))

        # tilde K_{ddot X|Z}
        kxdz <- Px %*% (kxd %*% t(Px))

        # tilde K_{Y|Z}
        kyz <- Py %*% (ky %*% t(Py))

        # test statistic
        test.stat <- trProd(kxdz, kyz)

        # get eigenvalues/vectors
        eigx <- eigen(kxdz, symmetric = TRUE)
        eigy <- eigen(kyz, symmetric = TRUE)

        # list T/F for which are large enough to use
        thrsX <- sum(eigx$values >= eigThresh)
        thrsY <- sum(eigy$values >= eigThresh)

        # calculate sqrt(tilde K_X), etc
        mapx <- eigx$vectors[,1:thrsX] %*%
            diag(sqrt(eigx$values[1:thrsX]), nrow=thrsX)
        mapy <- eigy$vectors[,1:thrsY] %*%
            diag(sqrt(eigy$values[1:thrsY]), nrow=thrsY)
        w <- do.call(cbind,  #stacking M_t
                     lapply(1:nobs,
                            function(t) as.vector(mapx[t,] %*% t(mapy[t,]))
                            )
                     )
        
        # get gamma distribution parameters
        prod_w <- crossprod(w)
        sq_prod_w <- crossprod(prod_w)
        trww <- sum(diag(prod_w))
        th.mean <- trww #/ nobs
        sq_trww <- sum(diag(sq_prod_w))
        th.var <- 2 * sq_trww #/ nobs ^ 2
    }
    shape = th.mean^2 / th.var
    scale = th.var / th.mean
    p.val <- pgamma(test.stat, shape = shape, scale = scale,
                    lower.tail=FALSE) #gamma approx
    return(p.val)
}
  • Below: function to connect kciTest with pc algorithm
  • pc algorithm gives names of variables (\(x,y,S\)) and suffStat, the dataset
kciWrap <- function(x,y,S, suffStat){
  # Uses kciTest to work on data
  # inputs:
  # x = name of x variable
  # y = name of y variable
  # S = name of variables in conditioning set
  # suffStat = sufficient statistic (entire dataset here)
  # output: p-value from KCI test
    compObs <- complete.cases(suffStat[,c(x,y,S)])
    compDat <- suffStat[compObs,]
    if (length(S) == 0) {
        error <- try(res <- kciTest(compDat[,x], compDat[,y]))
        if(inherits(error, "try-error")) res <- NA
    } else {
        error <- try(res <- kciTest(compDat[,x], compDat[,y], compDat[,S]))
        if(inherits(error, "try-error")) res <- NA
    }
    res
}
  • The code below returns the variables that are associated with the outcome and whose association cannot be explained away by conditioning on the other variables
neighbors <- function(outcome, data, alpha, verbose=FALSE){
    nfeats <- dim(data)[2]
    rfeats <- 1:nfeats
    target <- nfeats + 1
    dat <- cbind(data, outcome)
    for(csize in 0:(nfeats-1)){
        if(length(rfeats) - 1 < csize){
            return(rfeats)
        }
        current <- rfeats
        for(tvar in current){
            condvars <- rfeats[tvar != rfeats]
            if(length(condvars) != 1){
                condsets <- combn(condvars, csize)
            } else {
                condsets <- as.matrix(condvars)
            }
            ncondsets <- dim(condsets)[2]
            for(itr in 1:ncondsets){
                if(verbose) cat("y=", tvar, ", S=",condsets[,itr],", pvalue= ")
                pval <- kciWrap(target, tvar, condsets[,itr], dat)
                ##pval <- myCItest(target, tvar, condsets[,itr], dat)
                if(verbose) cat(pval, "\n")
                if(pval >= alpha & !is.na(pval)){
                    rfeats <- rfeats[rfeats != tvar]
                    break
                }
            }
            if(verbose) cat("Remaining: ", rfeats, "\n")
        }
    }
    return(rfeats)
}
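  • Below: a usage sketch for the functions above, assuming they have been sourced; the simulated data are illustrative, with \(X\perp Y|Z\) by construction
set.seed(5)
n <- 150
z <- rnorm(n)
x <- z + 0.5*rnorm(n)
y <- z^2 + 0.5*rnorm(n)

kciTest(x, y)      # small p-value expected: x and y are marginally dependent
kciTest(x, y, z)   # larger p-value expected: x and y are independent given z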

Association vs Causation

Simpson’s paradox

##        treatment symptoms mortality
## 687 experimental   severe     alive
## 239 experimental   severe      died
## 530      placebo     mild      died
## 846      placebo     mild     alive
## 346 experimental     mild      died
## 634      placebo     mild     alive
## 509      placebo     mild     alive
## 475 experimental   severe     alive
## 213      placebo   severe      died
## 431      placebo     mild     alive
## 632 experimental     mild     alive
## 630 experimental   severe     alive
## 318      placebo   severe      died
## 991      placebo     mild     alive
## 648      placebo   severe     alive
## 880      placebo     mild     alive
## 136 experimental   severe      died
## 655 experimental   severe     alive
## 497      placebo     mild     alive
## 358      placebo     mild     alive
## 
## experimental      placebo 
##          487          513
## 
##   mild severe 
##    514    486
## 
## alive  died 
##   754   246
table(df$treatment, df$mortality)[,2]/table(df$treatment)
## 
## experimental      placebo 
##    0.2648871    0.2280702
table(df$treatment, df$symptoms, df$mortality)[,,2]/table(df$treatment, df$symptoms)
##               
##                      mild     severe
##   experimental 0.07352941 0.33903134
##   placebo      0.16402116 0.40740741
table(df$treatment, df$symptoms)
##               
##                mild severe
##   experimental  136    351
##   placebo       378    135
chisq.test(df$treatment, df$symptoms)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df$treatment and df$symptoms
## X-squared = 207.58, df = 1, p-value < 2.2e-16

Confounding

Confounding Simulations

  • Diagram above: \(X\) and \(Y\) are confounded by \(C\) and have no direct causal relationship
  • \(X = 2C + \epsilon_X \Rightarrow C = 0.5X - 0.5\epsilon_X\), so \(Y = 3C + \epsilon_Y = 1.5X + \text{noise}\); regressing \(Y\) on \(X\) alone gives a slope near 1.5
size <- 500
C <- 10*runif(size)
X <- 2*C + rnorm(size)
Y <- 3*C + rnorm(size)

summary(lm(Y~X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6491 -1.2526 -0.0418  1.2540  4.1460 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.6102     0.1522    4.01 7.01e-05 ***
## X             1.4342     0.0131  109.52  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.74 on 498 degrees of freedom
## Multiple R-squared:  0.9601, Adjusted R-squared:  0.9601 
## F-statistic: 1.199e+04 on 1 and 498 DF,  p-value: < 2.2e-16
summary(lm(Y~X+C))
## 
## Call:
## lm(formula = Y ~ X + C)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.81847 -0.63492  0.02683  0.65114  2.43640 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03361    0.08757   0.384    0.701    
## X           -0.03018    0.04536  -0.665    0.506    
## C            3.04750    0.09315  32.717   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.981 on 497 degrees of freedom
## Multiple R-squared:  0.9874, Adjusted R-squared:  0.9873 
## F-statistic: 1.941e+04 on 2 and 497 DF,  p-value: < 2.2e-16

  • Diagram above: \(X\) influences \(Y\) but the effect is confounded by \(C\). That is, failing to account for \(C\) in a regression will lead to an incorrect causal parameter estimate
  • parameter from regressing \(Y\) on \(X\) alone = (confounding bias) + (causal effect) = 1.5 + 5 = 6.5
Y <- 3*C + 5*X + rnorm(size)

summary(lm(Y~X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.261 -1.139 -0.012  1.183  5.458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.65110    0.14861   4.381 1.44e-05 ***
## X            6.43596    0.01279 503.296  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.7 on 498 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.998 
## F-statistic: 2.533e+05 on 1 and 498 DF,  p-value: < 2.2e-16
summary(lm(Y~X+C))
## 
## Call:
## lm(formula = Y ~ X + C)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2887 -0.6351 -0.0185  0.6507  2.9811 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.09275    0.08704   1.066    0.287    
## X            5.01797    0.04509 111.295   <2e-16 ***
## C            2.95101    0.09258  31.875   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.975 on 497 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 3.853e+05 on 2 and 497 DF,  p-value: < 2.2e-16
  • Note: the covariate causal effect math only works here because the simulated data are linear. In the real world, we typically can’t assume linearity, and this math doesn’t work if the true relationships aren’t linear.
  • In most real-world datasets, there will always be the possibility of latent confounding

Sampling Bias

size <- 100
locomotor <- 10*runif(size)
respiratory <- 10*runif(size)
hospitalized <- rbinom(size, 1, (locomotor+respiratory)/20)==0
df_all <- data.frame(locomotor, respiratory, hospitalized)
library(ggplot2)
ggplot(df_all[hospitalized==TRUE,], aes(respiratory, locomotor)) + geom_point() + 
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

ggplot(df_all, aes(respiratory, locomotor)) + geom_point(aes(color=hospitalized)) +
  stat_smooth(method=lm, se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

Statistical Dependency and Causation

  • If \(A\) and \(B\) are random variables and are associated (dependent), \(A\not\perp B\), there are thought to be 3 ways this can happen

    • Direct causation: \(A\rightarrow B\) or \(B\rightarrow A\)

    • Unmeasured confounding: a common cause influences both \(A\) and \(B\)

    • Collider bias: \(A\) and \(B\) both influence sampling

    • If we only have data for \(A\) and \(B\), we cannot distinguish between these cases without more assumptions

  • If there are more variables, we can figure out more using conditional independence

    • Think about whether knowing one variable gives information about another
    • Variables: Drank too much alcohol last night, Woke up with a hangover, Woke up with shoes on

    • Hangover\(\perp\)Shoes?
    • Hangover\(\perp\)Shoes\(|\)Alcohol?
      • Given that someone drank too much alcohol last night, does knowing that they woke up with their shoes on give any information about if they have a hangover?
    • Variables: Burglary, Earthquake, Home motion detector

    • Burglary\(\perp\)Earthquake
    • Burglary\(\perp\)Earthquake\(|\)Detector?
      • Given that the home motion detector alerted, does knowing that there was no earthquake give any information about a burglary?
| Structure | Path type | \(A,C\) independent/dependent? | \(A,C\) conditionally independent/dependent given \(B\) |
|---|---|---|---|
| Chain | \(A \rightarrow B \rightarrow C\) | \(A\not\perp C\) | \(A\perp C \mid B\) |
| Other Chain | \(A \leftarrow B \leftarrow C\) | \(A\not\perp C\) | \(A\perp C \mid B\) |
| Fork (Confounder) | \(A \leftarrow B \rightarrow C\) | \(A\not\perp C\) | \(A\perp C \mid B\) |
| Collider | \(A \rightarrow B \leftarrow C\) | \(A\perp C\) | \(A\not\perp C \mid B\) |
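  • Below: a quick simulation of the collider row of the table (names and coefficients arbitrary); \(A\) and \(C\) are marginally independent, but conditioning on the collider \(B\) induces dependence
set.seed(6)
n <- 2000
A <- rnorm(n)
C <- rnorm(n)
B <- A + C + 0.5*rnorm(n)   # collider: A -> B <- C

cor.test(A, C)$p.value      # no marginal association
summary(lm(C ~ A + B))      # conditioning on B: the A coefficient is clearly nonzero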

PC Algorithm

  • Assume we have a dataset with many variables, \(A\), \(B\), \(C\), \(D\), etc
  • Assume no unmeasured confounders
  • Question: Is there a direct causal connection between \(A\) and \(B\)?
    • If we can find a set of variables, \(S\) (not including \(A,B\)), such that \(A\perp B|S\), then there is NO direct causal connection between \(A\) and \(B\)

Colliders

  • In the table above, only the collider's independence/conditional-independence pattern differs from the other structures
  • If we see, for variables \(A, B, C\), that \(A\perp C\) and \(A\not\perp C|B\), then \(A \rightarrow B \leftarrow C\) is the only option

Y-structures

  • Assuming no confounding, either \(C\rightarrow D\) or \(D \rightarrow C\)
  • If \(A-C-D\) and \(B-C-D\) are not colliders, we must have \(C\rightarrow D\)

Algorithm

  1. Create a complete, undirected graph

  2. Pairwise Conditional Independence Testing
  • Start with conditioning set size \(d = 0\):
    • For each edge in the current graph, \((x,y)\):
      • For each subset \(S\) of the neighbors of \(x\) or \(y\) with size \(d\):
        • if \(x\) and \(y\) have fewer than \(d\) neighbors, leave the edge and go to the next edge
        • if \(x\perp y|S\), remove the edge and go to the next edge
    • \(d\leftarrow d+1\) and repeat until every remaining edge has fewer than \(d\) neighbors
  3. For all non-looping triples:
  • If a collider, orient the edges
  4. Find all Y-structures in the graph:
  • Orient accordingly
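  • Below: a sketch of running the algorithm with the pcalg package (assuming it is installed); the built-in linear-Gaussian gaussCItest is shown, and kciWrap from above can be swapped in as indepTest, since pc passes (x, y, S, suffStat) exactly as described earlier
library(pcalg)

set.seed(7)
n <- 1000
A <- rnorm(n)
C <- rnorm(n)
B <- A + C + rnorm(n)   # A -> B <- C
D <- B + rnorm(n)       # B -> D
dat <- data.frame(A, B, C, D)

# linear-Gaussian CI test: suffStat is the correlation matrix and n
fit <- pc(suffStat = list(C = cor(dat), n = n),
          indepTest = gaussCItest, alpha = 0.01, labels = colnames(dat))
fit

# kernel-based version (slower): suffStat is the raw data frame
# fit_kci <- pc(suffStat = dat, indepTest = kciWrap, alpha = 0.01,
#               labels = colnames(dat))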