Test For Independence

Arpan Dutta, Debanjan Bhattacharjee, Soumyajit Roy

Library

Libraries used

require(car) # for qqPlot()
require(kableExtra) # for printing tables
require(MASS) # for generating bivariate normal samples
require(fMultivar) # for bivariate t samples
require(copula) # for generating samples from copulas
require(extraDistr) # for the Laplace distribution

Problem Statement

Given paired observations $(X_i, Y_i)$, our main aim is to test whether the two underlying variables are independent.

Example:

Say we have data on a shop’s popcorn sales and cold-drink sales during a movie. We want to check whether there is a relationship between the two, i.e. do people who buy popcorn also tend to buy cold drinks?

Kendall’s $\tau$

Kendall’s $\tau$

$(X_1,Y_1),(X_2,Y_2),\ldots,(X_n,Y_n)\overset{iid}{\sim} F(x,y)$ (continuous)

Kendall’s $\tau$ is defined as

\[\tau = \dfrac{1}{\binom n2}\sum_{i = 1}^{n-1}\sum_{j = i+1}^{n}\operatorname{sign}(X_i - X_j)\operatorname{sign}(Y_i - Y_j) \] where \[ \operatorname{sign}(u)= \begin{cases} 0 & \text{if } u = 0\\ \dfrac{\left|u\right|}{u} & \text{if } u \neq 0 \end{cases} \]

Another form

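A pair $(i, j)$ is concordant if $(X_i - X_j)(Y_i - Y_j) > 0$ and discordant if $(X_i - X_j)(Y_i - Y_j) < 0$. Let $A$ := number of concordant pairs and $B$ := number of discordant pairs.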

\[ \tau = \dfrac{A - B}{\binom n2} \] In general:

\[ -1 \leq \tau \leq 1\]
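Both forms of the definition can be checked numerically. The following is a minimal sketch, assuming continuous data with no ties; the helper names `kendall_tau` and `kendall_tau_counts` are hypothetical and introduced only for illustration.

```r
# Kendall's tau computed directly from the definition (no ties assumed).
kendall_tau <- function(x, y) {
  n <- length(x)
  stopifnot(length(y) == n, n >= 2)
  # sign(x_i - x_j) * sign(y_i - y_j) for every ordered pair (i, j)
  s <- sign(outer(x, x, "-")) * sign(outer(y, y, "-"))
  # keep each unordered pair once (i < j) and divide by choose(n, 2)
  sum(s[upper.tri(s)]) / choose(n, 2)
}

# Equivalent form via the counts of concordant (A) and discordant (B) pairs.
kendall_tau_counts <- function(x, y) {
  s <- sign(outer(x, x, "-")) * sign(outer(y, y, "-"))
  s <- s[upper.tri(s)]
  A <- sum(s > 0)  # concordant pairs
  B <- sum(s < 0)  # discordant pairs
  (A - B) / choose(length(x), 2)
}

set.seed(1)  # arbitrary seed
x <- rnorm(10)
y <- runif(10)
tau_def    <- kendall_tau(x, y)
tau_counts <- kendall_tau_counts(x, y)
tau_base   <- cor(x, y, method = "kendall")  # base R; agrees when there are no ties
c(definition = tau_def, counts = tau_counts, base_R = tau_base)
```

The three values coincide for tie-free data; a perfectly increasing relationship gives $\tau = 1$ and a perfectly decreasing one gives $\tau = -1$.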

Visualisation

We present a visual explanation of why this function is a reasonable measure of dependence.

Mean and variance under $H_0$

$\mathbb{E}_{H_0} (\tau) = 0$
$\mathbb{V}_{H_0} (\tau) = \dfrac{2(2n+5)}{9n(n-1)}$

Here we draw 1000 samples of size 10 each, compute Kendall’s $\tau$ for every sample, and calculate the mean and variance of these 1000 values.

X ~ N(0,1), Y ~ U(0,1), n = 10
	Theoretical	Observed
mean	0.0000000	-0.0005778
variance	0.0617284	0.0618136
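The simulation behind such a table can be sketched as follows. This is an illustrative reconstruction, not the authors' original code: the seed is arbitrary, so the observed values will differ slightly from those reported in the table.

```r
# Monte Carlo check of E(tau) = 0 and Var(tau) = 2(2n+5) / (9n(n-1)) under H0.
set.seed(123)          # arbitrary seed, for reproducibility only
n    <- 10             # sample size
reps <- 1000           # number of simulated samples

tau_hat <- replicate(reps, {
  x <- rnorm(n)        # X ~ N(0, 1)
  y <- runif(n)        # Y ~ U(0, 1), generated independently of X
  cor(x, y, method = "kendall")
})

theoretical <- c(mean = 0, variance = 2 * (2 * n + 5) / (9 * n * (n - 1)))
observed    <- c(mean = mean(tau_hat), variance = var(tau_hat))
round(cbind(Theoretical = theoretical, Observed = observed), 7)
```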

Distribution free: EDA

We know that Kendall’s $\tau$ test statistic is distribution free under $H_0$. We verify this visually: we generate independent data from various distributions and plot the histogram of Kendall’s $\tau$ for each.

Large sample test:

Under $H_0$, $\sqrt{\dfrac{9n(n-1)}{2(2n+5)}}\tau \overset{d}{\Longrightarrow} N(0,1)$ as $n\rightarrow \infty$

Comment: Good fit

QQplot

Large sample Power curve for $\mathcal{N}_2(0,0,1,1,\rho)$

We take $n$ i.i.d. samples from $\mathcal{N}_2(0,0,1,1,\rho)$.

Test for: \[H_0: \rho = 0 \hspace{0.3cm}vs\hspace{0.3cm}H_a: \rho\not=0\]
Test statistic: \[\sqrt{\dfrac{9n(n-1)}{2(2n+5)}}\hat\tau \]
Test rule: reject $H_0$ at size 0.05 if $\left|\sqrt{\dfrac{9n(n-1)}{2(2n+5)}}\hat\tau\right|>z_{0.025}$
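A power curve for this test rule can be estimated by simulation. The sketch below is one way to do it under assumed settings: the sample size `n = 30`, the number of replications, the grid of $\rho$ values and the helper names `kendall_z_reject` and `power_at` are all illustrative choices, not taken from the original analysis.

```r
library(MASS)  # for mvrnorm()

# Two-sided large-sample Kendall test: TRUE means "reject H0".
kendall_z_reject <- function(x, y, alpha = 0.05) {
  n <- length(x)
  z <- sqrt(9 * n * (n - 1) / (2 * (2 * n + 5))) * cor(x, y, method = "kendall")
  abs(z) > qnorm(1 - alpha / 2)
}

# Monte Carlo estimate of the power at a given rho.
power_at <- function(rho, n = 30, reps = 500) {
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  mean(replicate(reps, {
    xy <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
    kendall_z_reject(xy[, 1], xy[, 2])
  }))
}

set.seed(2024)  # arbitrary seed
rho_grid <- seq(-0.9, 0.9, by = 0.3)
power    <- sapply(rho_grid, power_at)
data.frame(rho = rho_grid, power = power)
```

Calling `plot(rho_grid, power, type = "b")` on the result draws the estimated curve; at $\rho = 0$ the estimate should sit near the nominal size 0.05.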

Visualisation

Spearman’s $\rho$ vs Kendall’s $\tau$

For a fixed sample size ($n = 30$), we plot the power functions of the tests based on Spearman’s $\rho$ and Kendall’s $\tau$ against the population correlation $\rho$.

From this plot we can also conclude that the tests based on these two statistics are nearly equivalent in power.
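The comparison can be reproduced roughly as below. This sketch assumes bivariate normal data and uses the p-values reported by base R's `cor.test()` for both rank tests (exact for small tie-free samples), which may differ from the large-sample rule used for the plot; the helper name `rank_test_power`, the seed, and the values of $\rho$ are illustrative.

```r
library(MASS)  # for mvrnorm()

# Estimated power of the two rank tests on the same simulated samples.
rank_test_power <- function(rho, n = 30, reps = 300, alpha = 0.05) {
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  rejections <- replicate(reps, {
    xy <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
    c(kendall  = cor.test(xy[, 1], xy[, 2], method = "kendall")$p.value  < alpha,
      spearman = cor.test(xy[, 1], xy[, 2], method = "spearman")$p.value < alpha)
  })
  rowMeans(rejections)  # proportion of rejections for each test
}

set.seed(7)  # arbitrary seed
pw <- sapply(c(0, 0.3, 0.6), rank_test_power)
colnames(pw) <- c("rho=0", "rho=0.3", "rho=0.6")
pw
```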

Comparison with parametric test

Kendall’s $\tau$ and t-test for BVN

$(X_1,Y_1),(X_2,Y_2),\ldots(X_n,Y_n)\overset{iid}{\sim}\mathcal{N}_2(0,0,1,1,\rho)$

Test for: \[H_0: \rho = 0 \hspace{0.3cm}vs\hspace{0.3cm}H_a: \rho\not=0\] Test statistic:

\[T = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}\overset{H_0}\sim t_{n-2} \] where $r$ is the sample (Pearson) correlation coefficient. Rejection rule: reject $H_0$ at size $\alpha = 0.05$ if $\left|T_{obs}\right|>t_{0.025,n-2}$
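As a quick check, the statistic can be computed by hand in its standard form $r\sqrt{n-2}/\sqrt{1-r^2}$ and compared with what base R's `cor.test()` reports. The helper name `cor_t_stat`, the seed and the sample size are illustrative assumptions.

```r
# t statistic for H0: rho = 0, computed from the sample correlation r.
cor_t_stat <- function(x, y) {
  n <- length(x)
  r <- cor(x, y)                     # Pearson correlation
  r * sqrt(n - 2) / sqrt(1 - r^2)    # t with n - 2 df under H0 (bivariate normal)
}

set.seed(42)  # arbitrary seed
n <- 20
x <- rnorm(n)
y <- rnorm(n)                        # independent of x, so H0 holds
t_obs  <- cor_t_stat(x, y)
reject <- abs(t_obs) > qt(0.975, df = n - 2)  # two-sided test at size 0.05
ct     <- cor.test(x, y)             # base R reference (Pearson by default)
c(manual = t_obs, cor.test = unname(ct$statistic))
```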

Visualisation

Samples are drawn from the bivariate normal distribution, and the power functions of the t-test and the Kendall’s $\tau$ test are plotted for comparison.

$\textbf{Comment}$: t-test performs better

When the sample does not come from a normal distribution

Now we generate a bivariate sample of size 20 with marginals (F(7,4), U(0,1)) using a Gaussian copula.

$\textbf{Comment}$: Kendall’s $\tau$ performs better

Copula

Gaussian Copula

Algorithm: generating data from a Gaussian copula (to generate $(Z_1, Z_2)$ with marginal CDFs $G_1$ and $G_2$, where $\rho$ is the correlation coefficient of the underlying normal pair)

Generate $(X_1, X_2)$ from the bivariate normal distribution with mean vector $\textbf{0}$ and variance-covariance matrix $\Sigma$ such that $\sigma_{ii} = 1$ and $\sigma_{ij} = \rho$ for $i \neq j$
$U_i = \Phi(X_i)$, for $i = 1, 2$, where $\Phi$ is the standard normal CDF
$Z_i = G_i^{-1}(U_i)$, for $i = 1, 2$
$(Z_1, Z_2)$ is the required data point
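The algorithm can be sketched in a few lines of R, using the probability integral transform $U_i = \Phi(X_i)$ followed by the marginal quantile functions $G_i^{-1}$. The function name `r_gauss_copula` and its arguments are hypothetical; the F(7,4) and U(0,1) marginals are used only as an example.

```r
library(MASS)  # for mvrnorm()

# Draw n points with a Gaussian copula (latent correlation rho) and
# marginal quantile functions q1, q2 (inverse CDFs).
r_gauss_copula <- function(n, rho, q1, q2) {
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  x <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)  # step 1: bivariate normal
  u <- pnorm(x)                                  # step 2: U_i = Phi(X_i) ~ U(0, 1)
  cbind(z1 = q1(u[, 1]), z2 = q2(u[, 2]))        # step 3: Z_i = G_i^{-1}(U_i)
}

set.seed(10)  # arbitrary seed
z <- r_gauss_copula(
  n = 20, rho = 0.5,
  q1 = function(p) qf(p, df1 = 7, df2 = 4),  # F(7, 4) marginal
  q2 = function(p) qunif(p, 0, 1)            # U(0, 1) marginal
)
cor(z[, 1], z[, 2], method = "kendall")
```

Because Kendall's $\tau$ depends only on ranks, it is unchanged by the monotone marginal transforms; for a Gaussian copula its population value is $\frac{2}{\pi}\arcsin(\rho)$ whatever the marginals are.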

FGM Copula

$C(u_1, u_2;\theta) = u_1u_2\left(1+\theta(1-u_1)(1-u_2)\right)$

where $u_i \in(0,1)$ for $i=1,2$, and $\theta\in[-1,1]$. Here, $\theta$ is the dependence parameter: $\theta = 0$ gives independence, and the FGM copula can only model weak dependence (it has no tail dependence).
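One way to sample from the FGM copula without extra packages is conditional inversion: draw $u$, then solve $\partial C/\partial u = w$ for $v$, where $\partial C/\partial u = v\left(1 + \theta(1-2u)(1-v)\right)$. The sketch below is a base R illustration under that approach; the function name `r_fgm` is hypothetical and the `copula` package offers its own samplers.

```r
# Sample n points (u, v) from the bivariate FGM copula by conditional inversion.
r_fgm <- function(n, theta) {
  stopifnot(abs(theta) <= 1)
  u <- runif(n)
  w <- runif(n)                       # w plays the role of C(v | u)
  a <- theta * (1 - 2 * u)
  # Solve v * (1 + a * (1 - v)) = w for v in [0, 1] (numerically stable root).
  v <- 2 * w / ((1 + a) + sqrt((1 + a)^2 - 4 * a * w))
  cbind(u = u, v = v)
}

set.seed(99)  # arbitrary seed
uv <- r_fgm(2000, theta = 1)
cor(uv[, "u"], uv[, "v"], method = "kendall")  # population value is 2 * theta / 9
```

Since the population Kendall's $\tau$ of the FGM copula is $2\theta/9$, it never exceeds $2/9$ in absolute value, which is why this family only covers weak dependence.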

Gumbel copula

$C(u_1, u_2, \ldots, u_d;\theta) = \exp\left(-\left(\sum_{i=1}^{d} (-\ln u_i)^{\theta}\right)^{1/\theta}\right)$

where $u_i \in(0,1)$ for $i=1,2,\ldots,d$, and $\theta\geq 1$. Here $\theta$ is the dependence parameter: $\theta = 1$ gives independence, and larger $\theta$ gives stronger upper-tail dependence.
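The formula is easy to evaluate directly, which helps to see the role of $\theta$. A minimal sketch follows; the function name `gumbel_cdf` is hypothetical, and the boundary case $\theta = 1$ is included to show that it reduces to the independence copula.

```r
# Gumbel copula CDF at a point u = (u_1, ..., u_d) with entries in (0, 1].
gumbel_cdf <- function(u, theta) {
  stopifnot(theta >= 1, all(u > 0 & u <= 1))
  exp(-(sum((-log(u))^theta)^(1 / theta)))
}

c_indep  <- gumbel_cdf(c(0.3, 0.7), theta = 1)   # theta = 1: product 0.3 * 0.7
c_strong <- gumbel_cdf(c(0.3, 0.7), theta = 50)  # large theta: close to min(0.3, 0.7)
lambda_u <- 2 - 2^(1 / 2)                        # upper tail dependence 2 - 2^(1/theta) at theta = 2
c(independent = c_indep, strong = c_strong, lambda_upper = lambda_u)
```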

Distribution free Statistics: Sample from copula with n = 5

Gaussian Copula

FGM Copula

Gumbel Copula

Distribution free Statistics: Sample from copula with n = 10

Gaussian Copula

FGM Copula

Gumbel copula

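Large sample distribution:

Under $H_0$, \[\sqrt{\dfrac{9n}{4}}\tau\overset{d}{\Longrightarrow}N(0,1)\hspace{0.4cm} as\hspace{0.4cm} n \rightarrow\infty\] This is the earlier normalisation with $\dfrac{9n(n-1)}{2(2n+5)}$ replaced by its leading term $\dfrac{9n}{4}$.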

qqPlot

Power comparison

Gaussian Copula (U(-5,10), N(0,1)) for different n

FGM Copula

Gumbel Copula

FGM and Gumbel Copula (with one-sided test)

Generating C(0,1) and Laplace(0,1) marginals

$H_0:\rho=0$ vs $H_1:\rho>0$

Violation of assumptions

Our very first assumption was that the underlying distribution is continuous. We now examine what happens when that assumption is violated.
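A small simulation shows the effect of ties on the null distribution. This is a sketch under assumed settings: the discrete marginals (Binomial(2, 0.5) and Poisson(1)), the seed and the helper name `tau_a` are illustrative choices, not the ones used for the plots in this section.

```r
# tau computed from the definition, with sign(0) = 0 for tied pairs.
tau_a <- function(x, y) {
  s <- sign(outer(x, x, "-")) * sign(outer(y, y, "-"))
  sum(s[upper.tri(s)]) / choose(length(x), 2)
}

set.seed(2023)  # arbitrary seed
n    <- 10
reps <- 2000
tau_cont <- replicate(reps, tau_a(rnorm(n), runif(n)))              # continuous: no ties
tau_disc <- replicate(reps, tau_a(rbinom(n, 2, 0.5), rpois(n, 1)))  # discrete: many ties

null_var <- 2 * (2 * n + 5) / (9 * n * (n - 1))  # variance under H0 without ties
c(theoretical = null_var, continuous = var(tau_cont), discrete = var(tau_disc))
```

With ties, many pairs contribute zero to the sum, so the null variance falls below $\dfrac{2(2n+5)}{9n(n-1)}$ and now depends on the marginal distributions.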

Distribution free?

Comment

If we can assume normality, the t-test performs better.
In non-normal cases, Kendall’s $\tau$ works better than the conventional t-test, and under normality it is not much worse. So when we have no information about the underlying population, it is safe to use Kendall’s $\tau$ statistic.

Thought experiment

$x_i\overset{iid}{\sim} U(-1,1)\hspace{1cm}i = 1,2,\ldots,30$


    Kendall's rank correlation tau

data:  x and y
T = 170, p-value = 0.09369
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau 
-0.2183908

Comment: $H_0$ is not rejected at the 5% level (p-value = 0.09369), so the sample is treated as independent.

$|x| + |y| = 1$


    Kendall's rank correlation tau

data:  c(x1, x2) and c(1 - abs(x1), abs(x2) - 1)
z = 0.30614, p-value = 0.7595
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau 
0.02711864

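Comment: $H_0$ is not rejected (p-value = 0.7595), so the test treats the sample as independent, even though every point satisfies $|x| + |y| = 1$ and $y$ is therefore determined by $x$ up to sign.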
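The construction behind this output can be sketched as follows, mirroring the `data:` line of the printed test. The number of points per half (`m = 30`) and the seed are assumptions, so the numbers will not match the output shown.

```r
set.seed(5)                      # arbitrary seed
m  <- 30                         # points per half; an assumed value
x1 <- runif(m, -1, 1)            # x-coordinates for the upper half (y >= 0)
x2 <- runif(m, -1, 1)            # x-coordinates for the lower half (y <= 0)
x  <- c(x1, x2)
y  <- c(1 - abs(x1), abs(x2) - 1)

on_diamond <- isTRUE(all.equal(abs(x) + abs(y), rep(1, 2 * m)))  # every point lies on |x| + |y| = 1

res <- cor.test(x, y, method = "kendall")
res$estimate   # sample tau
res$p.value    # two-sided p-value for H0: tau = 0
```

The distribution of $(x, y)$ is unchanged when $x$ is replaced by $-x$, so the population $\tau$ is 0 even though the variables are dependent. Kendall's $\tau$ only detects monotone association, which is why this kind of dependence goes unnoticed.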

Dependent sample

$x_i\overset{iid}{\sim} U(-\pi,\pi)\hspace{1cm}i = 1,2,\ldots,40$


    Kendall's rank correlation tau

data:  x and y
T = 517, p-value = 0.002807
alternative hypothesis: true tau is not equal to 0
sample estimates:
     tau 
0.325641

Comment: $H_0$ is rejected at the 5% level (p-value = 0.002807), so the sample is judged dependent.

Acknowledgement

We want to say a big thank you to everyone who helped make this project possible. Special thanks to Subhrangsu, Sourav, Subhendu and to Isha Dewan ma'am for helping us.
