Alban Guillaumet, Troy University
Our data in this chapter consists of two categorical variables.
We may be interested in:
Figure (mosaic plots). Left: death of adult passengers following the Titanic shipwreck. Right: the mosaic plot expected if death and sex were independent.
\( \chi^2 \) contingency test
Example 9.4
Lafferty and Morris (1996) tested the hypothesis that infection influences the risk of predation by birds. A large outdoor tank was stocked with three kinds of killifish: unparasitized, lightly infected, and heavily infected. This tank was left open to foraging by birds. The numbers of fish eaten, according to their level of parasitism, are given below:
| | Uninfected | Lightly | Highly | Sum |
|---|---|---|---|---|
| Eaten | 1 | 10 | 37 | 48 |
| Not eaten | 49 | 35 | 9 | 93 |
| Sum | 50 | 45 | 46 | 141 |
Calling our two categorical variables “fate” and “infection status”, what we want to know is whether these two variables are independent.
How do we compute the frequency table we would expect if the variables were independent? If two variables are independent, we have:
\[ \mathrm{Pr[A \ and \ B]} = \mathrm{Pr[A]}\times\mathrm{Pr[B]} \]
The observed table (in relative frequencies) is given by:
| | Uninfected | Lightly | Highly | Sum |
|---|---|---|---|---|
| Eaten | Pr[Eaten and Uninfected] | * | * | * |
| Not eaten | * | * | * | * |
| Sum | * | * | * | * |
Assuming independence, the table of expected relative frequencies should have:
| | Uninfected | Lightly | Highly | Sum |
|---|---|---|---|---|
| Eaten | Pr[Eaten] \( \cdot \) Pr[Uninfected] | * | * | * |
| Not eaten | * | * | * | * |
| Sum | * | * | * | * |
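Multiplying each expected relative frequency by the total count (141) gives the expected frequencies, filled in here one cell at a time:
| | Uninfected | Lightly | Highly | Sum |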
|---|---|---|---|---|
| Eaten | \( 141\cdot\frac{48}{141}\cdot\frac{50}{141} \) | * | * | 48 |
| Not eaten | * | * | * | 93 |
| Sum | 50 | 45 | 46 | 141 |
| | Uninfected | Lightly | Highly | Sum |
|---|---|---|---|---|
| Eaten | 17.0 | \( 141\cdot\frac{48}{141}\cdot\frac{45}{141} \) | * | 48 |
| Not eaten | * | * | * | 93 |
| Sum | 50 | 45 | 46 | 141 |
| | Uninfected | Lightly | Highly | Sum |
|---|---|---|---|---|
| Eaten | 17.0 | 15.3 | \( 141\cdot\frac{48}{141}\cdot\frac{46}{141} \) | 48 |
| Not eaten | * | * | * | 93 |
| Sum | 50 | 45 | 46 | 141 |
Expected frequency table
| | Uninfected | Lightly | Highly | Sum |
|---|---|---|---|---|
| Eaten | 17.0 | 15.3 | 15.7 | 48 |
| Not eaten | 33.0 | 29.7 | 30.3 | 93 |
| Sum | 50 | 45 | 46 | 141 |
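The expected frequencies above can be reproduced in a few lines of R. This is a minimal sketch, not necessarily how the table was originally produced; the names `obs`, `row_p`, `col_p`, and `expected` are chosen here for illustration only.

```r
# Observed counts from Example 9.4
obs <- matrix(c(1, 10, 37,
                49, 35, 9),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Eaten", "Not eaten"),
                              c("Uninfected", "Lightly", "Highly")))

n     <- sum(obs)            # grand total
row_p <- rowSums(obs) / n    # Pr[Eaten], Pr[Not eaten]
col_p <- colSums(obs) / n    # Pr[Uninfected], Pr[Lightly], Pr[Highly]

# Under independence: expected count = n * Pr[row] * Pr[column]
expected <- n * outer(row_p, col_p)
round(expected, 1)
```

`outer()` forms every row-by-column product at once, so each cell follows the same rule as the cell-by-cell calculation shown earlier.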
Observed frequency table
| | Uninfected | Lightly | Highly | Sum |
|---|---|---|---|---|
| Eaten | 1 | 10 | 37 | 48 |
| Not eaten | 49 | 35 | 9 | 93 |
| Sum | 50 | 45 | 46 | 141 |
The \( \chi^2 \) contingency test is just a special case of the \( \chi^2 \) goodness-of-fit test, so the test statistic is the same.
\[ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(Observed_{ij}-Expected_{ij})^2}{Expected_{ij}}, \]
where \( r \) and \( c \) are the numbers of rows and columns, respectively.
\[ \chi^2 = \frac{(1-17.0)^2}{17.0}+\frac{(10-15.3)^2}{15.3}+\cdots+\frac{(9-30.3)^2}{30.3} \]
The number of degrees of freedom is given by
\[ df = (r-1)\times(c-1) = (2-1)\times(3-1) = 2 \]
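The test statistic, degrees of freedom, and \( p \)-value can also be computed by hand in R, which makes the formula concrete. This is a sketch under the assumption that unrounded expected frequencies are used; the variable names (`obs`, `expected`, `chi2`, `df`, `p_value`) are illustrative.

```r
# Observed counts from Example 9.4
obs <- matrix(c(1, 10, 37,
                49, 35, 9),
              nrow = 2, byrow = TRUE)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Sum of (Observed - Expected)^2 / Expected over all r x c cells
chi2 <- sum((obs - expected)^2 / expected)

# Degrees of freedom: (r - 1) * (c - 1)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

chi2
df
p_value
```

Because the expected frequencies are not rounded to one decimal place here, the result can differ slightly from a calculation based on the rounded table.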
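Assumptions of the \( \chi^2 \) contingency test are the same as those of the \( \chi^2 \) goodness-of-fit test: a random sample, no expected frequency less than 1, and no more than 20% of the expected frequencies less than 5.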
What do you do if the assumptions are violated?
Expected frequency table
| | Uninfected | Lightly | Highly | Sum |
|---|---|---|---|---|
| Eaten | 17.0 | 15.3 | 15.7 | 48 |
| Not eaten | 33.0 | 29.7 | 30.3 | 93 |
| Sum | 50 | 45 | 46 | 141 |
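The expected frequencies can be checked against the usual rule of thumb programmatically. The sketch below assumes the rule is "no expected frequency below 1, and at most 20% of cells below 5"; the variable names are illustrative.

```r
# Observed counts from Example 9.4
obs <- matrix(c(1, 10, 37,
                49, 35, 9),
              nrow = 2, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

any_below_1  <- any(expected < 1)    # should be FALSE
prop_below_5 <- mean(expected < 5)   # should be at most 0.20

any_below_1
prop_below_5
min(expected)                        # smallest expected frequency
```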
The assumptions are met, so let's use R.
First, let's see how the table was created in R:
(parTable <- matrix(c(1, 10, 37, 49, 35, 9),
                    nrow = 2,
                    byrow = TRUE,
                    dimnames = list(c("Eaten", "Not eaten"),
                                    c("Uninfected", "Lightly", "Highly"))))
          Uninfected Lightly Highly
Eaten              1      10     37
Not eaten         49      35      9
chisq.test(parTable, correct = FALSE)
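Note: correct = FALSE means no Yates correction (see W&S p. 251: Correction for continuity). R applies this correction only to 2 × 2 tables, so the argument has no effect on this 2 × 3 table.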
Pearson's Chi-squared test
data: parTable
X-squared = 69.756, df = 2, p-value = 7.124e-16
Conclusion?
Conclusion: Since the \( p \)-value is less than 0.05 (in fact, less than 0.001), we reject the null hypothesis that fate and infection status are independent.
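To see where the departure from independence comes from, the object returned by `chisq.test()` can be inspected. This is an optional sketch; the name `res` is illustrative, while `expected`, `statistic`, `parameter`, `p.value`, and `residuals` are components of the object that `chisq.test()` returns.

```r
parTable <- matrix(c(1, 10, 37, 49, 35, 9),
                   nrow = 2,
                   byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

res <- chisq.test(parTable, correct = FALSE)

res$expected    # expected frequencies under independence
res$statistic   # chi-squared test statistic
res$parameter   # degrees of freedom
res$p.value     # p-value

# Pearson residuals: (Observed - Expected) / sqrt(Expected).
# Large positive values mark cells with more fish than expected.
round(res$residuals, 2)
```

The squared Pearson residuals sum to the test statistic, so the cells with the largest absolute residuals contribute most to rejecting independence.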