Joel Correa da Rosa
March 1st, 2017
Consider a random experiment in which \( n \) subjects are sampled and, for each subject, two categorical variables \( X \) and \( Y \) are observed.
\( X \in \{A_1,A_2,...,A_R\} \)
\( Y \in \{B_1,B_2,...,B_C\} \)
\( R: \) Number of categories (levels) of \( X \)
\( C: \) Number of categories (levels) of \( Y \)
As an example:
\( X \): Blood Type
\( Y \): Race
# Simulate 60 subjects with randomly assigned blood type and race
race.levels <- c('AA','EA','Other')
blood.type.levels <- c('A','B','AB','O')
set.seed(102)  # fixed seed for reproducibility
sample.races <- sample(race.levels, 60, replace = TRUE)
sample.blood <- sample(blood.type.levels, 60, replace = TRUE)
mydata <- cbind.data.frame(blood = sample.blood, race = sample.races)
head(mydata, n = 10)
blood race
1 O EA
2 A EA
3 B Other
4 AB EA
5 A AA
6 AB EA
7 B Other
8 A AA
9 O Other
10 AB EA
X\Y | 1 | 2 | ... | C |
---|---|---|---|---|
1 | \( O_{11} \) | \( O_{12} \) | ... | \( O_{1C} \) |
2 | \( O_{21} \) | \( O_{22} \) | ... | \( O_{2C} \) |
... | ... | ... | ... | ... |
R | \( O_{R1} \) | \( O_{R2} \) | ... | \( O_{RC} \) |
Labelling the categories as
\( A_1 = 1, A_2 = 2,..., A_{R} = R \)
and
\( B_1 = 1, B_2 = 2,...,B_C = C \)
\( O_{ij} \): number of events where \( X=i \) and \( Y=j \)
\( O_{i.} = \sum_{j=1}^C O_{ij} \) (total of row \( i \))
\( O_{.j} = \sum_{i=1}^R O_{ij} \) (total of column \( j \))
\( E_{ij} \): Expected frequency of observations when \( X=i \) and \( Y=j \), assuming these two variables to be statistically independent, so that
\( P(X=i \cap Y=j) = P(X=i)P(Y=j) \)
Under the assumption of independence, the expected proportion of observations in the \( (i,j) \) entry of the table can be estimated as:
\( \hat{p}_{ij} = \hat{p}_i \times \hat{p}_j \), where
\( \hat{p}_i = \frac{O_{i.}}{n} \)
\( \hat{p}_j = \frac{O_{.j}}{n} \)
Consequently,
\( E_{ij}=n\times\hat{p}_i\times\hat{p}_j=\frac{O_{i.}O_{.j}}{n} \)
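As a quick check in R, the expected frequencies can be computed directly from the marginal totals of the simulated data above (a minimal sketch; tab, n and E are names introduced here):

# Expected frequencies under independence: E_ij = O_i. * O_.j / n
tab <- table(mydata$blood, mydata$race)
n <- sum(tab)
E <- outer(rowSums(tab), colSums(tab)) / n
E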
The differences between observed and expected frequencies are the key elements of the chi-square test statistic. These differences are called residuals:
\( O_{ij}-E_{ij} \)
Each entry \( (i,j) \) of the table is a count (a number of events). The probability distribution of counts is often described by the Poisson law, under which the mean (the expected frequency) and the variance are equal.
Under this assumption, the standardized Poisson count is similar to the Z-score (standardized normal):
\( \frac{O_{ij}-E_{ij}}{\sqrt{E_{ij}}} \)
The test statistic for the \( \chi^2 \) test can be seen as a sum of squared standardized residuals.
\( \chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \)
Under the null hypothesis of independence, this statistic has an approximate \( \chi^2 \) distribution with \( (R-1)\times(C-1) \) degrees of freedom.
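Putting the pieces together, the statistic and its p-value can be computed by hand (a sketch reusing tab and E from the snippet above) and compared with the output of chisq.test later in these notes:

# Chi-square statistic computed from its definition
chi2 <- sum((tab - E)^2 / E)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
chi2
pchisq(chi2, df, lower.tail = FALSE)  # upper-tail p-value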
\( \chi^2 \) distribution with 1 degree of freedom
plot(1:30,dchisq(1:30,1),type='l')
\( \chi^2 \) distribution with 5 degrees of freedom
plot(1:30,dchisq(1:30,5),type='l')
\( \chi^2 \) distribution with 10 degrees of freedom
plot(1:50,dchisq(1:50,10),type='l')
Note that the \( \chi^2 \) distribution approaches the normal distribution as the number of degrees of freedom increases. This is a consequence of the Central Limit Theorem: a \( \chi^2 \) variable with \( k \) degrees of freedom is a sum of \( k \) independent squared standard normal variables.
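A quick visual check of this claim (a sketch; the choice of 50 degrees of freedom is arbitrary): a \( \chi^2 \) variable with \( k \) degrees of freedom has mean \( k \) and variance \( 2k \), and its density gets close to the matching normal density.

# Chi-square density (df = 50) against the normal with the same mean and variance
x <- seq(0, 100, length.out = 200)
plot(x, dchisq(x, df = 50), type = 'l')
lines(x, dnorm(x, mean = 50, sd = sqrt(2 * 50)), lty = 2)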
tab<-table(mydata$blood,mydata$race)
tab
AA EA Other
A 4 8 2
AB 1 7 5
B 4 8 9
O 1 6 5
Q<-chisq.test(tab)
Q
Pearson's Chi-squared test
data: tab
X-squared = 5.4426, df = 6, p-value = 0.4884
Q$observed
AA EA Other
A 4 8 2
AB 1 7 5
B 4 8 9
O 1 6 5
Q$expected
AA EA Other
A 2.333333 6.766667 4.90
AB 2.166667 6.283333 4.55
B 3.500000 10.150000 7.35
O 2.000000 5.800000 4.20
Q$residuals
AA EA Other
A 1.09108945 0.47412524 -1.31008646
AB -0.79259392 0.28590527 0.21096325
B 0.26726124 -0.67484718 0.60861167
O -0.70710678 0.08304548 0.39036003
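These are exactly the standardized residuals \( (O_{ij}-E_{ij})/\sqrt{E_{ij}} \) defined above, which can be reproduced directly from the components returned by chisq.test:

# Reproduces Q$residuals
(Q$observed - Q$expected) / sqrt(Q$expected)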
Assumptions (Fisher's exact test for a 2×2 table): the marginal totals are fixed, and the cell frequencies vary across repetitions of the random experiment.
X\Y | 0 | 1 |
---|---|---|
0 | a | b |
1 | c | d |
Probability of a table (hypergeometric distribution):
\( p = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}} \)
or, equivalently,
\( p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!} \)
Lady Tasting Tea example: observed table
X\Y | + | - |
---|---|---|
+ | 4 | 0 |
- | 0 | 4 |
dhyper(4,4,4,4)
[1] 0.01428571
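The same value follows from the factorial formula given earlier (a sketch; the variable names are mine, and cc stands for the cell \( c \) to avoid masking R's c()):

# Probability of the observed table via the factorial formula
a <- 4; b <- 0; cc <- 0; d <- 4
n <- a + b + cc + d
factorial(a + b) * factorial(cc + d) * factorial(a + cc) * factorial(b + d) /
  (factorial(a) * factorial(b) * factorial(cc) * factorial(d) * factorial(n))
# 1/70 = 0.01428571, agreeing with dhyper(4, 4, 4, 4)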
Table I
X\Y | + | - |
---|---|---|
+ | 0 | 4 |
- | 4 | 0 |
dhyper(0,4,4,4)
[1] 0.01428571
Table II
X\Y | + | - |
---|---|---|
+ | 1 | 3 |
- | 3 | 1 |
dhyper(1,4,4,4)
[1] 0.2285714
Table III
X\Y | + | - |
---|---|---|
+ | 2 | 2 |
- | 2 | 2 |
dhyper(2,4,4,4)
[1] 0.5142857
Table IV
X\Y | + | - |
---|---|---|
+ | 3 | 1 |
- | 1 | 3 |
dhyper(3,4,4,4)
[1] 0.2285714
Table V
X\Y | + | - |
---|---|---|
+ | 4 | 0 |
- | 0 | 4 |
dhyper(4,4,4,4)
[1] 0.01428571
barplot(dhyper(0:4,4,4,4))
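Summing the probabilities of the tables at least as extreme as the observed one gives Fisher's exact test, which R implements as fisher.test. A minimal sketch for the observed table:

# Fisher's exact test for the observed 4/0/0/4 table
obs <- matrix(c(4, 0, 0, 4), nrow = 2)
fisher.test(obs)  # two-sided p-value: 2/70 = 0.02857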
Imagine that, instead of 4 cups with tea infusion and 4 cups with milk infusion, we had 48 cups of each type:
barplot(dhyper(0:48,48,48,48))
Note that the Normal approximation could be used in this situation as well.
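A sketch of that approximation (the mean and variance below are the standard hypergeometric moments for \( m = n = k = 48 \)):

# Hypergeometric probabilities against the approximating normal density
x <- 0:48
plot(x, dhyper(x, 48, 48, 48), type = 'h')
mu <- 48 * 48 / 96                       # mean: k*m/N = 24
sigma <- sqrt(48 * 0.5 * 0.5 * 48 / 95)  # variance: k*(m/N)*(1-m/N)*(N-k)/(N-1)
lines(x, dnorm(x, mu, sigma), lty = 2)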
\( H_0 \): \( p_b = p_c \)
\( H_1 \): \( p_b \neq p_c \)
X\Y | 1 | 0 |
---|---|---|
1 | a | b |
0 | c | d |
The McNemar test statistic is:
\( \chi^2 = \frac{(b-c)^2}{b+c} \)
In a case-control study, consider \( b \) to be the number of pairs in which the case was exposed to the risk factor but the control was not; \( c \) is then the number of pairs in which the control was exposed but the case was not. The hypothesis of no association between the risk factor and the disease is the null hypothesis of the McNemar test.
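A minimal sketch with made-up pair counts (the matrix entries are hypothetical); setting correct = FALSE reproduces the statistic above, while the default applies a continuity correction:

# McNemar test on a hypothetical paired 2x2 table (b = 15, c = 5)
paired <- matrix(c(20, 5, 15, 60), nrow = 2)
mcnemar.test(paired, correct = FALSE)  # chi-square = (15 - 5)^2 / 20 = 5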
Let \( X \) be the number of events of interest in a window of time or in a spatial region.
If probabilities for \( X \) follow the Poisson distribution, then
\( P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!} \)
To evaluate probabilities with the Poisson distribution we need to know only one parameter, \( \lambda \), which is often called the intensity.
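In R these probabilities are given by dpois (a sketch; \( \lambda = 3 \) is an arbitrary choice):

# Poisson probabilities for k = 0,...,5 with intensity lambda = 3
lambda <- 3
dpois(0:5, lambda)
exp(-lambda) * lambda^(0:5) / factorial(0:5)  # the same values from the formula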