Contingency Analysis

Alban Guillaumet, Troy University

Contingency analysis

Our data in this chapter consists of two categorical variables.

We may be interested in:

Estimating the magnitude of association between two categorical variables:
- odds ratio and relative risk (2 x 2 tables)
- NOT covered in class
Testing if there is an association (or dependence) between two categorical variables:
- \( \chi^2 \) contingency test

Quick example

Left: Death of adult passengers following Titanic shipwreck
Right: Mosaic plot if death and sex were independent
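A minimal sketch of how the two tables behind these plots could be built. It assumes that R's built-in `Titanic` dataset is a fair stand-in for the data on the slide (the slide does not state its source); the object names `adults`, `observed`, and `expected` are illustrative only.

```r
# Assumption: the built-in 'Titanic' array (Class x Sex x Age x Survived)
# stands in for the data shown on the slide.
adults <- Titanic[, , "Adult", ]          # keep adults only: Class x Sex x Survived
observed <- apply(adults, c(2, 3), sum)   # collapse over Class: Sex x Survived

# Counts expected if sex and survival were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

observed
round(expected, 1)
```

Passing either table to `mosaicplot()` should draw the corresponding mosaic plot.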


Hypothesis testing

\( \chi^2 \) contingency test

\( H_{0} \): There is no association between two categorical variables
\( H_{A} \): There is an association between two categorical variables

Example of contingency test

Example 9.4

Lafferty and Morris (1996) tested the hypothesis that infection influences risk of predation by birds. A large outdoor tank was stocked with three kinds of killifish: unparasitized, lightly infected, and heavily infected. This tank was left open to foraging by birds. The numbers of fish eaten, according to their level of parasitism, are given below:

          Uninfected Lightly Highly Sum
Eaten              1      10     37  48
Not eaten         49      35      9  93
Sum               50      45     46 141


Example of contingency test

Calling our two categorical variables “fate” and “infection status”, what we want to know is whether these two variables are independent.

How do we compute the frequency table we would expect if the variables were independent? If two variables are independent, we have:

\[ \mathrm{Pr[A \ and \ B]} = \mathrm{Pr[A]}\times\mathrm{Pr[B]} \]

Example of contingency test

The observed table (in relative frequencies) is given by:

	Uninfected	Lightly	Highly	Sum
Eaten	Pr[Eaten and Uninfected]	*	*	*
Not eaten	*	*	*	*
Sum	*	*	*	*

Assuming independence, the expected (relative frequencies) table should have:

	Uninfected	Lightly	Highly	Sum
Eaten	Pr[Eaten] \( \cdot \) Pr[Uninfected]	*	*	*
Not eaten	*	*	*	*
Sum	*	*	*	*

Expected frequencies

	Uninfected	Lightly	Highly	Sum
Eaten	\( 141\cdot\frac{48}{141}\cdot\frac{50}{141} \)	*	*	48
Not eaten	*	*	*	93
Sum	50	45	46	141

	Uninfected	Lightly	Highly	Sum
Eaten	17.0	\( 141\cdot\frac{48}{141}\cdot\frac{45}{141} \)	*	48
Not eaten	*	*	*	93
Sum	50	45	46	141

	Uninfected	Lightly	Highly	Sum
Eaten	17.0	15.3	\( 141\cdot\frac{48}{141}\cdot\frac{46}{141} \)	48
Not eaten	*	*	*	93
Sum	50	45	46	141

Example of contingency test

Expected frequency table

	Uninfected	Lightly	Highly	Sum
Eaten	17.0	15.3	15.7	48
Not eaten	33.0	29.7	30.3	93
Sum	50	45	46	141

Observed frequency table

	Uninfected	Lightly	Highly	Sum
Eaten	1	10	37	48
Not eaten	49	35	9	93
Sum	50	45	46	141
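The expected table above can be reproduced from the margins of the observed table. The sketch below is one way to do it in base R; the names `observed`, `row_totals`, `col_totals`, and `expected` are chosen here for illustration.

```r
# Observed counts from Example 9.4 (rows: fate, columns: infection status)
observed <- matrix(c(1, 10, 37,
                     49, 35, 9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

n <- sum(observed)                # grand total
row_totals <- rowSums(observed)   # totals for Eaten / Not eaten
col_totals <- colSums(observed)   # totals for each infection level

# Expected count = n * Pr[row] * Pr[column] = (row total * column total) / n
expected <- outer(row_totals, col_totals) / n
round(expected, 1)
```

`outer()` forms every row-total by column-total product at once, so the same two lines work for a table of any size.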

Example of contingency test

The \( \chi^2 \) contingency test is just a special case of the \( \chi^2 \) goodness-of-fit test, so the test statistic is the same.

\[ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(\mathrm{Observed}_{ij}-\mathrm{Expected}_{ij})^2}{\mathrm{Expected}_{ij}}, \]

where \( r \) and \( c \) are the numbers of rows and columns, respectively.

\[ \chi^2 = \frac{(1-17.0)^2}{17.0}+\frac{(10-15.3)^2}{15.3}+...+\frac{(9-30.3)^2}{30.3} \]
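To see where the sum comes from, the statistic can be computed cell by cell in R. This is a sketch using unrounded expected frequencies; the names `contributions` and `chi2` are illustrative.

```r
# Observed counts from Example 9.4
observed <- matrix(c(1, 10, 37,
                     49, 35, 9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Contribution of each cell: (Observed - Expected)^2 / Expected
contributions <- (observed - expected)^2 / expected
round(contributions, 2)

# The test statistic is the sum over all r x c cells
chi2 <- sum(contributions)
chi2
```

The table of contributions also shows which cells depart most from independence; here the eaten, highly infected fish contribute the most.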

Example of contingency test

The number of degrees of freedom is given by

\[ df = (r-1)\times(c-1) = (2-1)\times(3-1) = 2 \]
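With the degrees of freedom in hand, the statistic can be compared with the \( \chi^2 \) distribution. The sketch below uses `qchisq()` and `pchisq()` from base R; the 0.05 significance level and the variable names are assumptions made for illustration.

```r
# Observed counts from Example 9.4
observed <- matrix(c(1, 10, 37,
                     49, 35, 9),
                   nrow = 2, byrow = TRUE)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
df

# Critical value for a significance level of 0.05 (assumed here)
critical <- qchisq(0.95, df = df)
critical

# Test statistic, computed from the expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
chi2 <- sum((observed - expected)^2 / expected)

# P-value: probability of a statistic at least this large under H0
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)
p_value
```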

Example of contingency test

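Assumptions of the \( \chi^2 \) contingency test are the same as those of the \( \chi^2 \) goodness-of-fit test: no cell should have an expected frequency below 1, and no more than 20% of cells should have an expected frequency below 5.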

What do you do if the assumptions are violated?

If the table is larger than 2 x 2, you can sometimes combine rows and/or columns.
If the table is 2 x 2, you can use Fisher's exact test.
You can use a permutation test (not discussed in class; see Chapter 13).
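The three options above can be sketched in base R. The killifish table is reused only to show the calls (its assumptions are in fact met). The simulated p-value from `chisq.test()` is offered here as a close relative of a permutation test, not necessarily the procedure of Chapter 13. The names `combined`, `fisher_res`, and `sim_res` are illustrative.

```r
# Observed counts from Example 9.4
observed <- matrix(c(1, 10, 37,
                     49, 35, 9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

# Option 1: combine columns (Lightly + Highly -> Infected) to get a 2 x 2 table
combined <- cbind(Uninfected = observed[, "Uninfected"],
                  Infected   = rowSums(observed[, c("Lightly", "Highly")]))
combined

# Option 2: Fisher's exact test on the 2 x 2 table
fisher_res <- fisher.test(combined)
fisher_res$p.value

# Option 3: Monte Carlo p-value; tables are simulated with the observed
# row and column totals held fixed
set.seed(1)  # arbitrary seed, for reproducibility only
sim_res <- chisq.test(observed, simulate.p.value = TRUE, B = 9999)
sim_res$p.value
```

Combining categories changes the question being asked (here, infected versus uninfected), so it is only appropriate when the merged category is biologically meaningful.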

Example of contingency test

Expected frequency table

	Uninfected	Lightly	Highly	Sum
Eaten	17.0	15.3	15.7	48
Not eaten	33.0	29.7	30.3	93
Sum	50	45	46	141

All expected frequencies are well above 5, so the assumptions are met; let's use R.

Example of contingency test

First, let's see how the table was created in R:

(parTable <- matrix(c(1, 10, 37, 49, 35, 9), 
                    nrow = 2, 
                    byrow = TRUE, 
                    dimnames = list(c("Eaten", "Not eaten"), 
                                    c("Uninfected", "Lightly", "Highly"))))

          Uninfected Lightly Highly
Eaten              1      10     37
Not eaten         49      35      9

Example of contingency test

chisq.test(parTable, correct = FALSE)

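Note: correct = FALSE turns off the Yates continuity correction (see W&S p. 251: Correction for continuity). R applies this correction only to 2 x 2 tables, so the argument has no effect on this 2 x 3 table.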

Example of contingency test


    Pearson's Chi-squared test

data:  parTable
X-squared = 69.756, df = 2, p-value = 7.124e-16

Conclusion?

Conclusion: Since the \( p \)-value is less than 0.05 (in fact, far less than 0.001), we reject the null hypothesis of independence: the fate of the fish is associated with their infection status.
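As a final check, the quantities worked out by hand can be recovered from the object returned by `chisq.test()`. The sketch below also checks the usual rule of thumb for expected frequencies (none below 1, at most 20% below 5), which is assumed here to be the rule intended by the goodness-of-fit assumptions; `res` is an illustrative name.

```r
# Same table as above
parTable <- matrix(c(1, 10, 37, 49, 35, 9),
                   nrow = 2,
                   byrow = TRUE,
                   dimnames = list(c("Eaten", "Not eaten"),
                                   c("Uninfected", "Lightly", "Highly")))

res <- chisq.test(parTable, correct = FALSE)

res$statistic            # chi-squared test statistic
res$parameter            # degrees of freedom
res$p.value              # p-value
round(res$expected, 1)   # expected frequencies under independence

# Assumption check (rule of thumb, assumed here):
all(res$expected >= 1)           # no expected frequency below 1
mean(res$expected < 5) <= 0.20   # at most 20% of cells below 5
```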