M. Drew LaMar
October 19, 2018
Our data in this chapter consists of two categorical variables.
We are interested in:
Estimation of association for 2 x 2 contingency tables: odds ratios and relative risks
What about hypothesis testing?
Remember: A hypothesis test gives you a yes/no answer, but it does NOT give you the magnitude of the effect if there is one. Always report confidence intervals as well, if possible.
\( \chi^2 \) contingency test
Example 9.4
Many parasites have more than one species of host, so the individual parasite must get from one host to another to complete its life cycle. Trematodes of the species Euhaplorchis californiensis use three hosts during their life cycle.
Lafferty and Morris (1996) tested the hypothesis that infection influences risk of predation by birds. A large outdoor tank was stocked with three kinds of killifish: unparasitized, lightly infected, and heavily infected. This tank was left open to foraging by birds… The numbers of fish eaten, according to their level of parasitism, are given as follows:
Fate | Uninfected | Lightly | Highly | Sum |
---|---|---|---|---|
Eaten | 1 | 10 | 37 | 48 |
Not eaten | 49 | 35 | 9 | 93 |
Sum | 50 | 45 | 46 | 141 |
If we call our two categorical variables “fate” and “infection status”, then what we want to know is whether these two variables are independent.
How do we compute the frequency table we would expect if the variables were independent? If two variables are independent, then for any outcome A of the first and any outcome B of the second we have
\[ \mathrm{Pr[A \ and \ B]} = \mathrm{Pr[A]}\times\mathrm{Pr[B]} \]
The observed table (in relative frequencies) is given by:
Fate | Uninfected | Lightly | Highly | Sum |
---|---|---|---|---|
Eaten | Pr[Eaten and Uninfected] | * | * | Pr[Eaten] |
Not eaten | * | * | * | * |
Sum | Pr[Uninfected] | * | * | * |
Assuming independence, the expected (relative frequencies) table should have:
Fate | Uninfected | Lightly | Highly | Sum |
---|---|---|---|---|
Eaten | Pr[Eaten]\( \cdot \) Pr[Uninfected] | * | * | Pr[Eaten] |
Not eaten | * | * | * | * |
Sum | Pr[Uninfected] | * | * | * |
Fate | Uninfected | Lightly | Highly | Sum |
---|---|---|---|---|
Eaten | \( 141\cdot\frac{48}{141}\cdot\frac{50}{141} \) | * | * | 48 |
Not eaten | * | * | * | 93 |
Sum | 50 | 45 | 46 | 141 |
Fate | Uninfected | Lightly | Highly | Sum |
---|---|---|---|---|
Eaten | 17.0 | \( 141\cdot\frac{48}{141}\cdot\frac{45}{141} \) | * | 48 |
Not eaten | * | * | * | 93 |
Sum | 50 | 45 | 46 | 141 |
Fate | Uninfected | Lightly | Highly | Sum |
---|---|---|---|---|
Eaten | 17.0 | 15.3 | \( 141\cdot\frac{48}{141}\cdot\frac{46}{141} \) | 48 |
Not eaten | * | * | * | 93 |
Sum | 50 | 45 | 46 | 141 |
Expected frequency table
Fate | Uninfected | Lightly | Highly | Sum |
---|---|---|---|---|
Eaten | 17.0 | 15.3 | 15.7 | 48 |
Not eaten | 33.0 | 29.7 | 30.3 | 93 |
Sum | 50 | 45 | 46 | 141 |
Observed frequency table
Fate | Uninfected | Lightly | Highly | Sum |
---|---|---|---|---|
Eaten | 1 | 10 | 37 | 48 |
Not eaten | 49 | 35 | 9 | 93 |
Sum | 50 | 45 | 46 | 141 |
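The expected table can be reproduced from the margins of the observed table. The following is a minimal sketch in base R; the matrix is named `parTable` only to match the R code used later in these notes, and `expected` is an illustrative name.

```r
# Observed counts from Example 9.4 (rows: fate, columns: infection status)
parTable <- matrix(c(1, 10, 37,
                     49, 35, 9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

n <- sum(parTable)  # grand total (141)

# Under independence, Pr[row and column] = Pr[row] * Pr[column], so the
# expected count is n * (row total / n) * (column total / n),
# which simplifies to (row total * column total) / n.
expected <- outer(rowSums(parTable), colSums(parTable)) / n

round(expected, 1)
```

Rounded to one decimal place, the result should agree with the expected frequency table above.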
The \( \chi^2 \) contingency test is just a special case of the \( \chi^2 \) goodness-of-fit test, so the test statistic is the same.
\[ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(\mathrm{Observed}_{ij}-\mathrm{Expected}_{ij})^2}{\mathrm{Expected}_{ij}}, \]
where \( r \) and \( c \) are the numbers of rows and columns, respectively. The number of degrees of freedom is given by
\[ \begin{aligned} df & = rc - 1 - (r-1) - (c - 1) \\ & = rc - r - c + 1 \\ & = (r-1)\times(c-1) \end{aligned} \]
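To make the formulas concrete, here is a sketch that computes the test statistic, the degrees of freedom, and the \( P \)-value by hand in base R for the killifish data. The variable names (`chi2`, `df`, `pValue`) are illustrative choices, not part of the original analysis.

```r
# Observed counts from Example 9.4
parTable <- matrix(c(1, 10, 37,
                     49, 35, 9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(parTable), colSums(parTable)) / sum(parTable)

# Test statistic: sum over all cells of (Observed - Expected)^2 / Expected
chi2 <- sum((parTable - expected)^2 / expected)

# Degrees of freedom: (r - 1) * (c - 1)
df <- (nrow(parTable) - 1) * (ncol(parTable) - 1)

# P-value: upper-tail area of the chi-squared distribution with df degrees of freedom
pValue <- pchisq(chi2, df = df, lower.tail = FALSE)

c(chi2 = chi2, df = df, pValue = pValue)
```

For this 2 x 3 table, `df` is (2 - 1) x (3 - 1) = 2.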
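Assumptions of the \( \chi^2 \) contingency test are the same as those of the \( \chi^2 \) goodness-of-fit test: the data are a random sample, no cell has an expected frequency less than 1, and no more than 20% of cells have an expected frequency less than 5.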
What do you do if the assumptions are violated?
Expected frequency table
Fate | Uninfected | Lightly | Highly | Sum |
---|---|---|---|---|
Eaten | 17.0 | 15.3 | 15.7 | 48 |
Not eaten | 33.0 | 29.7 | 30.3 | 93 |
Sum | 50 | 45 | 46 | 141 |
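The check on expected frequencies can also be done programmatically. This sketch assumes the usual rule of thumb for \( \chi^2 \) tests (no expected count below 1, and no more than 20% of cells with an expected count below 5); the names `noneBelowOne` and `fractionBelowFive` are illustrative.

```r
# Observed counts from Example 9.4
parTable <- matrix(c(1, 10, 37,
                     49, 35, 9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

# chisq.test() stores the expected counts in the `expected` component
expected <- chisq.test(parTable, correct = FALSE)$expected

# Assumed rule of thumb: no cell below 1, at most 20% of cells below 5
noneBelowOne      <- all(expected >= 1)
fractionBelowFive <- mean(expected < 5)

noneBelowOne && fractionBelowFive <= 0.20
```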
No expected frequency is below 5, so the assumptions are met. Let's run the test in R.
First, let's see how the table was created in R:
(parTable <- matrix(c(1, 10, 37, 49, 35, 9),
                    nrow = 2,
                    byrow = TRUE,
                    dimnames = list(c("Eaten", "Not eaten"),
                                    c("Uninfected", "Lightly", "Highly"))))
          Uninfected Lightly Highly
Eaten              1      10     37
Not eaten         49      35      9
chisq.test(parTable, correct = FALSE)
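Note: correct = FALSE means no Yates continuity correction (see p. 251). R applies this correction only to 2 x 2 tables, so the setting does not change the result for this 2 x 3 table.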
Pearson's Chi-squared test
data: parTable
X-squared = 69.756, df = 2, p-value = 7.124e-16
Conclusion?
Conclusion: Since the \( P \)-value is less than 0.05 (in fact, far less than 0.001), we reject the null hypothesis of independence. Fate and infection status are associated.
Fisher’s exact test: (2 x 2 tables only) Examines the independence of two categorical variables, even with small expected values
\( G \)-test: (any table) Derived from principles of likelihood.
Pro: Great with complicated experimental designs with multiple explanatory variables.
Con: Can be less accurate for small sample sizes.
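Both alternatives can be tried on the killifish data. The sketch below computes the \( G \) statistic by hand, assuming the standard likelihood-ratio form \( G = 2\sum \mathrm{Observed}\cdot\ln(\mathrm{Observed}/\mathrm{Expected}) \) with the same degrees of freedom as the \( \chi^2 \) contingency test, and then runs base R's `fisher.test()` on a 2 x 2 sub-table. The names `G`, `pG`, `subTable`, and `fisherResult` are illustrative.

```r
# Observed counts from Example 9.4
parTable <- matrix(c(1, 10, 37,
                     49, 35, 9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

# Expected counts under independence
expected <- outer(rowSums(parTable), colSums(parTable)) / sum(parTable)

# G-test by hand (assumed form: G = 2 * sum(O * log(O / E))).
# No observed count is 0 here, so log(O / E) is defined in every cell.
G  <- 2 * sum(parTable * log(parTable / expected))
df <- (nrow(parTable) - 1) * (ncol(parTable) - 1)
pG <- pchisq(G, df = df, lower.tail = FALSE)

# Fisher's exact test on a 2 x 2 sub-table (uninfected vs. highly infected)
subTable     <- parTable[, c("Uninfected", "Highly")]
fisherResult <- fisher.test(subTable)

c(G = G, df = df, pG = pG, pFisher = fisherResult$p.value)
```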
Situation: You've found a statistically significant association between two categorical variables using the \( \chi^2 \) contingency test.
Question: Where is the association? In other words, in which levels of the categories is the association present and how large is the association?
We need to estimate the magnitude of the association, which the \( P \)-value does not give us!
We can estimate odds ratios or relative risks for 2 \( \times \) 2 sub-tables within the contingency table by either subsetting or collapsing.
          Uninfected Lightly Highly
Eaten              1      10     37
Not eaten         49      35      9

          Uninfected Highly
Eaten              1     37
Not eaten         49      9
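Subsetting and collapsing are both short operations on the matrix. A minimal sketch in base R follows; the names `subTable` and `collapsed`, and the pooled column label `Infected`, are hypothetical choices for illustration.

```r
# Observed counts from Example 9.4
parTable <- matrix(c(1, 10, 37,
                     49, 35, 9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

# Subsetting: keep only two of the three infection levels
subTable <- parTable[, c("Uninfected", "Highly")]

# Collapsing: pool the two infected levels into a single column
collapsed <- cbind(Uninfected = parTable[, "Uninfected"],
                   Infected   = parTable[, "Lightly"] + parTable[, "Highly"])

subTable
collapsed
```

Subsetting compares two specific levels directly, while collapsing asks a coarser question (infected versus not) using all 141 fish.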
$data
          Uninfected Highly Total
Eaten              1     37    38
Not eaten         49      9    58
Total             50     46    96

$measure
                        NA
odds ratio with 95% C.I.    estimate        lower      upper
               Eaten     1.000000000           NA         NA
               Not eaten 0.004964148 0.0006020703 0.04093004

$p.value
           NA
two-sided     midp.exact fisher.exact   chi.square
  Eaten               NA           NA           NA
  Not eaten 1.110223e-16 6.861412e-17 4.140762e-15

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
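The call that produced this output is not shown. Its layout resembles the output of `oddsratio()` from the epitools package with the Wald method, but that attribution is an assumption. The estimate and interval can be reproduced by hand in base R, as sketched below, using the cross-product odds ratio and the normal-approximation (Wald) interval on the log scale. The names `n11`, `n12`, `n21`, `n22`, `oddsRatio`, `seLogOR`, and `ci` are illustrative.

```r
# 2 x 2 sub-table: uninfected vs. highly infected fish
subTable <- matrix(c(1, 37,
                     49, 9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Highly")))

n11 <- subTable["Eaten", "Uninfected"]
n12 <- subTable["Eaten", "Highly"]
n21 <- subTable["Not eaten", "Uninfected"]
n22 <- subTable["Not eaten", "Highly"]

# Cross-product odds ratio: odds of being eaten for uninfected fish
# divided by odds of being eaten for highly infected fish
oddsRatio <- (n11 * n22) / (n12 * n21)

# Wald interval: standard error of log(OR), then back-transform
seLogOR <- sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
z       <- qnorm(0.975)  # two-sided 95% interval
ci      <- exp(log(oddsRatio) + c(-1, 1) * z * seLogOR)

c(estimate = oddsRatio, lower = ci[1], upper = ci[2])
```

The estimate of about 0.005 means the odds of being eaten are roughly 200 times lower for uninfected fish than for highly infected fish, and the whole 95% interval lies well below 1.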