Joel Correa da Rosa
March 1st, 2017
Consider a random experiment in which \( n \) subjects are sampled and, for each subject, two categorical variables \( X \) and \( Y \) are observed.
\( X \in \{A_1,A_2,...,A_R\} \)
\( Y \in \{B_1,B_2,...,B_C\} \)
\( R: \) Number of categories (levels) of \( X \)
\( C: \) Number of categories (levels) of \( Y \)
As an example:
\( X \): Blood Type
\( Y \): Race
# Simulate 60 subjects with randomly assigned blood type and race
race.levels <- c('AA','EA','Other')
blood.type.levels <- c('A','B','AB','O')
set.seed(102)  # fixed seed for reproducibility
sample.races <- sample(race.levels, 60, replace = TRUE)
sample.blood <- sample(blood.type.levels, 60, replace = TRUE)
mydata <- cbind.data.frame(blood = sample.blood, race = sample.races)
head(mydata, n = 10)
blood race
1 O EA
2 A EA
3 B Other
4 AB EA
5 A AA
6 AB EA
7 B Other
8 A AA
9 O Other
10 AB EA
X\Y | 1 | 2 | ... | C |
---|---|---|---|---|
1 | \( O_{11} \) | \( O_{12} \) | ... | \( O_{1C} \) |
2 | \( O_{21} \) | \( O_{22} \) | ... | \( O_{2C} \) |
... | ... | ... | ... | ... |
R | \( O_{R1} \) | \( O_{R2} \) | ... | \( O_{RC} \) |
Labelling the categories as
\( A_1 = 1, A_2 = 2,..., A_{R} = R \)
and
\( B_1 = 1, B_2 = 2,...,B_C = C \)
\( O_{ij} \): number of events where \( X=i \) and \( Y=j \)
\( O_{i.} = \sum_{j=1}^C O_{ij} \) (total of row \( i \))
\( O_{.j} = \sum_{i=1}^R O_{ij} \) (total of column \( j \))
\( E_{ij} \): Expected frequency of observations when \( X=i \) and \( Y=j \), assuming these two variables to be statistically independent, so that
\( P(X=i \cap Y=j) = P(X=i)P(Y=j) \)
Under the assumption of independence, the expected proportion of observations in the \( (i,j) \) entry of the table can be estimated as:
\( \hat{p}_{ij} = \hat{p}_i \times \hat{p}_j \), where
\( \hat{p}_i = \frac{O_{i.}}{n} \)
\( \hat{p}_j = \frac{O_{.j}}{n} \)
Consequently,
\( E_{ij}=n\times\hat{p}_i\times\hat{p}_j=\frac{O_{i.}O_{.j}}{n} \)
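As a quick check in R, the expected frequencies can be computed directly from the marginal totals of the simulated data above (a minimal sketch; tab, n and E are names introduced here):

# Expected frequencies under independence: E_ij = O_i. * O_.j / n
tab <- table(mydata$blood, mydata$race)
n <- sum(tab)
E <- outer(rowSums(tab), colSums(tab)) / n
E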
The differences between observed and expected frequencies are the key elements of the chi-square test statistic. These differences are called residuals:
\( O_{ij}-E_{ij} \)
Each entry \( (i,j) \) of the table is a count (a number of events). The probability distribution of counts is often described by the Poisson law, under which the mean (the expected frequency) and the variance are equal.
Under this assumption, the standardized Poisson count is similar to the Z-score (standardized normal):
\( \frac{O_{ij}-E_{ij}}{\sqrt{E_{ij}}} \)
The test statistic for the \( \chi^2 \) test can be seen as a sum of squared standardized residuals.
\( \chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \)
Under the null hypothesis of independence, this statistic has an approximate \( \chi^2 \) distribution with \( (R-1)\times(C-1) \) degrees of freedom.
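Putting the pieces together, the statistic and its p-value can be computed by hand (a sketch reusing tab and E from the snippet above) and compared with the output of chisq.test later in these notes:

# Chi-square statistic computed from its definition
chi2 <- sum((tab - E)^2 / E)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
chi2
pchisq(chi2, df, lower.tail = FALSE)  # upper-tail p-value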
\( \chi^2 \) distribution with 1 degree of freedom
plot(1:30,dchisq(1:30,1),type='l')
\( \chi^2 \) distribution with 5 degrees of freedom
plot(1:30,dchisq(1:30,5),type='l')
\( \chi^2 \) distribution with 10 degrees of freedom
plot(1:50,dchisq(1:50,10),type='l')
Note that the \( \chi^2 \) distribution approaches the normal distribution as the number of degrees of freedom increases. This is a consequence of the Central Limit Theorem: a \( \chi^2 \) variable with \( k \) degrees of freedom is a sum of \( k \) independent squared standard normal variables.
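A quick visual check of this claim (a sketch; the choice of 50 degrees of freedom is arbitrary): a \( \chi^2 \) variable with \( k \) degrees of freedom has mean \( k \) and variance \( 2k \), and its density gets close to the matching normal density.

# Chi-square density (df = 50) against the normal with the same mean and variance
x <- seq(0, 100, length.out = 200)
plot(x, dchisq(x, df = 50), type = 'l')
lines(x, dnorm(x, mean = 50, sd = sqrt(2 * 50)), lty = 2)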
tab<-table(mydata$blood,mydata$race)
tab
AA EA Other
A 4 8 2
AB 1 7 5
B 4 8 9
O 1 6 5
Q<-chisq.test(tab)
Q
Pearson's Chi-squared test
data: tab
X-squared = 5.4426, df = 6, p-value = 0.4884
Q$observed
AA EA Other
A 4 8 2
AB 1 7 5
B 4 8 9
O 1 6 5
Q$expected
AA EA Other
A 2.333333 6.766667 4.90
AB 2.166667 6.283333 4.55
B 3.500000 10.150000 7.35
O 2.000000 5.800000 4.20
Q$residuals
AA EA Other
A 1.09108945 0.47412524 -1.31008646
AB -0.79259392 0.28590527 0.21096325
B 0.26726124 -0.67484718 0.60861167
O -0.70710678 0.08304548 0.39036003
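These are exactly the standardized residuals \( (O_{ij}-E_{ij})/\sqrt{E_{ij}} \) defined above, which can be reproduced directly from the components returned by chisq.test:

# Reproduces Q$residuals
(Q$observed - Q$expected) / sqrt(Q$expected)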
Assumptions (Fisher's exact test for a 2×2 table): the marginal totals are fixed, and the cell frequencies vary across repetitions of the random experiment.
X\Y | 0 | 1 |
---|---|---|
0 | a | b |
1 | c | d |
Probability of a table (hypergeometric distribution):
\( p = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}} \)
or, equivalently,
\( p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!} \)
Lady Tasting Tea example: observed table
X\Y | + | - |
---|---|---|
+ | 4 | 0 |
- | 0 | 4 |
dhyper(4,4,4,4)
[1] 0.01428571
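The same value follows from the factorial formula given earlier (a sketch; the variable names are mine, and cc stands for the cell \( c \) to avoid masking R's c()):

# Probability of the observed table via the factorial formula
a <- 4; b <- 0; cc <- 0; d <- 4
n <- a + b + cc + d
factorial(a + b) * factorial(cc + d) * factorial(a + cc) * factorial(b + d) /
  (factorial(a) * factorial(b) * factorial(cc) * factorial(d) * factorial(n))
# 1/70 = 0.01428571, agreeing with dhyper(4, 4, 4, 4)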
Table I
X\Y | + | - |
---|---|---|
+ | 0 | 4 |
- | 4 | 0 |
dhyper(0,4,4,4)
[1] 0.01428571
Table II
X\Y | + | - |
---|---|---|
+ | 1 | 3 |
- | 3 | 1 |
dhyper(1,4,4,4)
[1] 0.2285714
Table III
X\Y | + | - |
---|---|---|
+ | 2 | 2 |
- | 2 | 2 |
dhyper(2,4,4,4)
[1] 0.5142857
Table IV
X\Y | + | - |
---|---|---|
+ | 3 | 1 |
- | 1 | 3 |
dhyper(3,4,4,4)
[1] 0.2285714
Table V
X\Y | + | - |
---|---|---|
+ | 4 | 0 |
- | 0 | 4 |
dhyper(4,4,4,4)
[1] 0.01428571
barplot(dhyper(0:4,4,4,4))
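Summing the probabilities of the tables at least as extreme as the observed one gives Fisher's exact test, which R implements as fisher.test. A minimal sketch for the observed table:

# Fisher's exact test for the observed 4/0/0/4 table
obs <- matrix(c(4, 0, 0, 4), nrow = 2)
fisher.test(obs)  # two-sided p-value: 2/70 = 0.02857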
Imagine that, instead of 4 cups with tea infusion and 4 cups with milk infusion, we had 48 cups of each type:
barplot(dhyper(0:48,48,48,48))
Note that the Normal approximation could be used in this situation as well.
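A sketch of that approximation (the mean and variance below are the standard hypergeometric moments for \( m = n = k = 48 \)):

# Hypergeometric probabilities against the approximating normal density
x <- 0:48
plot(x, dhyper(x, 48, 48, 48), type = 'h')
mu <- 48 * 48 / 96                       # mean: k*m/N = 24
sigma <- sqrt(48 * 0.5 * 0.5 * 48 / 95)  # variance: k*(m/N)*(1-m/N)*(N-k)/(N-1)
lines(x, dnorm(x, mu, sigma), lty = 2)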
\( H_0 \): \( p_b = p_c \)
\( H_1 \): \( p_b \neq p_c \)
X\Y | 1 | 0 |
---|---|---|
1 | a | b |
0 | c | d |
The McNemar test statistic is:
\( \chi^2 = \frac{(b-c)^2}{b+c} \)
In a case-control study, consider \( b \) to be the number of pairs in which the case was exposed to the risk factor but the control was not; \( c \) is then the number of pairs in which the control was exposed but the case was not. The hypothesis of no association between the risk factor and the disease is the null hypothesis of the McNemar test.
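A minimal sketch with made-up pair counts (the matrix entries are hypothetical); setting correct = FALSE reproduces the statistic above, while the default applies a continuity correction:

# McNemar test on a hypothetical paired 2x2 table (b = 15, c = 5)
paired <- matrix(c(20, 5, 15, 60), nrow = 2)
mcnemar.test(paired, correct = FALSE)  # chi-square = (15 - 5)^2 / 20 = 5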
Let \( X \) be the number of events of interest in a window of time or in a spatial region.
If probabilities for \( X \) follow the Poisson distribution, then
\( P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!} \)
To evaluate probabilities with the Poisson distribution we need to know only one parameter, \( \lambda \), which is often called the intensity.
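In R these probabilities are given by dpois (a sketch; \( \lambda = 3 \) is an arbitrary choice):

# Poisson probabilities for k = 0,...,5 with intensity lambda = 3
lambda <- 3
dpois(0:5, lambda)
exp(-lambda) * lambda^(0:5) / factorial(0:5)  # the same values from the formula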