Contingency analysis

M. Drew LaMar
March 2, 2016

Class Announcements

Homework will be posted soon (Chaps 8 and 9) - ~~Due date: March 16, 11 am~~

Wrapping up Chapter 8

Assumptions of \( \chi^2 \) goodness-of-fit test

The sampling distribution for the \( \chi^2 \) test statistic follows a \( \chi^2 \) distribution only approximately! When is the approximation “good”?

No categories have expected frequencies less than 1
No more than 20% of categories have expected frequencies less than 5

What to do if violated?

Combine categories with small expected frequencies (if new categories make biological sense)
Use another technique (computer-intensive methods)
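A minimal sketch of how these two rules of thumb might be checked in R. The helper `check_expected` and the expected frequencies are made up for illustration; they are not from the lecture or the textbook.

```r
# Hypothetical helper: checks both rules of thumb for a vector of
# expected frequencies (one entry per category).
check_expected <- function(expected) {
  c(
    none_below_1          = all(expected >= 1),
    # "no more than 20% below 5", written with integers to avoid rounding issues
    at_most_20pct_below_5 = 5 * sum(expected < 5) <= length(expected)
  )
}

# Made-up expected frequencies for a test with 6 categories
expected <- c(0.8, 4.6, 10.5, 20.1, 12.4, 6.0)
check_expected(expected)

# One remedy from the slide: combine the two smallest categories, then re-check
combined <- c(sum(expected[1:2]), expected[3:6])
check_expected(combined)
```

Combining categories is only appropriate when the merged category still makes biological sense, which code cannot decide.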

Random in space or time: Poisson

Example of another common probability model

1980 explosion of Mount St. Helens - spiders recolonized landscape by dropping out of the airstream (~~!!!!~~)

What does it mean for spiders to land on landscape “randomly” in space?

Probability that spider will land on given point is:

same everywhere
independent of other landings

Random in space or time: Poisson

To count spiders, place large equal-sized grid across landscape and count number of spiders in each block.

Definition: The Poisson distribution describes number of successes in blocks of time or space, when successes happen with equal probability and independently across time or space.

\[ \mathrm{Pr}[X \ \mathrm{successes}] = \frac{e^{-\mu}\mu^{X}}{X!} \]

where \( \mu \) is the mean number of successes per block of time or space.
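The formula can be checked against R's built-in `dpois()`. This is a small sketch; the function name `poisson_prob` and the choice \( \mu = 2 \) are assumptions made for illustration.

```r
# Poisson probability computed directly from the formula
# Pr[X successes] = exp(-mu) * mu^X / X!
poisson_prob <- function(x, mu) {
  exp(-mu) * mu^x / factorial(x)
}

mu <- 2       # assumed mean number of successes per block (illustrative)
x  <- 0:10    # numbers of successes to evaluate

manual  <- poisson_prob(x, mu)
builtin <- dpois(x, lambda = mu)

# TRUE if the hand-coded formula agrees with dpois() up to rounding error
all.equal(manual, builtin)
```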

Random in space or time: Poisson

Definition: The index of dispersion (or variance-to-mean ratio VMR) is
\[ VMR = \frac{\sigma^{2}}{\mu} \]
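For a Poisson distribution the variance equals the mean, so the VMR is 1; clumped counts give a VMR above 1. The simulation below is only a sketch: the seed, the number of blocks, the mean of 3, and the use of a negative binomial distribution to mimic clumping are all assumptions, not part of the lecture.

```r
set.seed(42)         # arbitrary seed, only so the run is reproducible
n_blocks <- 10000    # assumed number of grid blocks

# Counts per block when landings are random in space (Poisson, mean 3)
random_counts <- rpois(n_blocks, lambda = 3)
vmr_random <- var(random_counts) / mean(random_counts)

# Counts per block when landings are clumped
# (negative binomial with the same mean but a larger variance)
clumped_counts <- rnbinom(n_blocks, size = 1, mu = 3)
vmr_clumped <- var(clumped_counts) / mean(clumped_counts)

c(random = vmr_random, clumped = vmr_clumped)
```

The first value should be close to 1 and the second clearly larger than 1.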

Random in space or time: Poisson

For a Poisson distribution with \( \mu = 1 \):

barplot(dpois(0:10, lambda=1), xlab="Number of successes (X)", ylab="Probability", col="firebrick", names.arg=as.character(0:10), cex.axis=1.5, cex.lab=1.5, cex.names=1.5, ylim=c(0,0.4))

[Figure: bar plot of Poisson probabilities for X = 0 to 10, \( \mu = 1 \)]

Random in space or time: Poisson

For a Poisson distribution with \( \mu = 2 \):

barplot(dpois(0:10, lambda=2), xlab="Number of successes (X)", ylab="Probability", col="firebrick", names.arg=as.character(0:10), cex.axis=1.5, cex.lab=1.5, cex.names=1.5, ylim=c(0,0.4))

[Figure: bar plot of Poisson probabilities for X = 0 to 10, \( \mu = 2 \)]

Random in space or time: Poisson

For a Poisson distribution with \( \mu = 3 \):

barplot(dpois(0:10, lambda=3), xlab="Number of successes (X)", ylab="Probability", col="firebrick", names.arg=as.character(0:10), cex.axis=1.5, cex.lab=1.5, cex.names=1.5, ylim=c(0,0.4))

[Figure: bar plot of Poisson probabilities for X = 0 to 10, \( \mu = 3 \)]

Example: Mass extinctions

Do extinctions occur randomly through the long fossil record of Earth's history, or are there periods in which extinction rates are unusually high (“mass extinctions”) compared with background rates? The best record of extinctions through Earth's history comes from fossil marine invertebrates because they have hard shells and therefore tend to preserve well.

R Pubs

Contingency analysis

Our data in this chapter consist of two categorical variables.

We are interested in:

Estimating association in 2 x 2 tables (i.e., the special case of two categorical variables with only 2 levels each)
Testing if there is an association (or dependence) between two categorical variables [\( \chi^2 \! \) contingency test]

Quick example

So far, we've said things like “Yeah, that looks like those variables are associated.” It's time to quantify the evidence.

[Figure: two plots shown side by side]

Left: Death of adult passengers following Titanic shipwreck
Right: Mosaic plot if death and sex were independent
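The "if independent" comparison can be sketched with R's built-in `Titanic` table. This is an illustration only: it is an assumption that this dataset resembles the data behind the figure, and the built-in table includes crew as well as passengers, so the counts may differ from the slide.

```r
# Built-in 4-way table with dimensions Class x Sex x Age x Survived
adults   <- Titanic[, , "Adult", ]          # keep adults only
observed <- apply(adults, c(2, 3), sum)     # collapse over Class: Sex x Survived

# Expected counts if sex and survival were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# chisq.test() computes the same expected counts internally
test <- chisq.test(observed, correct = FALSE)
all.equal(unname(expected), unname(test$expected))

observed
round(expected, 1)
test$p.value
```

A large gap between `observed` and `expected` is the evidence of association that the \( \chi^2 \) contingency test quantifies.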

Coffee consumption and healthy prostate

Practice Problem #1

Wilson et al. (2011) followed a set of male health professionals for 20 years. Of all the men in the study, 7890 drank no coffee and 2492 drank on average more than 6 cups per day. In the “no coffee” group, 122 developed advanced prostate cancer during the course of the study, and 19 in the “high coffee” group did.

(dataTable <- matrix(c(19, 122, 2473, 7768), nrow = 2, byrow = TRUE, dimnames = list(c("Cancer", "No cancer"), c("Coffee", "No coffee"))))

          Coffee No coffee
Cancer        19       122
No cancer   2473      7768

Coffee consumption and healthy prostate

Definition: The odds of success are the probability of success divided by the probability of failure.
\[ O = \frac{p}{1-p} \]

          Coffee No coffee
Cancer        19       122
No cancer   2473      7768
Sum         2492      7890

Discuss: What are the estimated odds of developing cancer in the coffee and no-coffee groups?

Answer: \[ \begin{eqnarray*} \hat{p}_{c} & = & \frac{19}{2492} = 0.0076244 \\ \hat{O}_{c} & = & \frac{0.0076244}{1 - 0.0076244} = 0.007683 \end{eqnarray*} \]

Answer: \[ \begin{eqnarray*} \hat{p}_{nc} & = & \frac{122}{7890} = 0.0154626 \\ \hat{O}_{nc} & = & \frac{0.0154626}{1 - 0.0154626} = 0.0157055 \end{eqnarray*} \]
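The two answers can be reproduced from the data table in a few lines of R. The variable names `p_hat` and `odds_hat` are chosen here for illustration.

```r
dataTable <- matrix(c(19, 122, 2473, 7768), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Cancer", "No cancer"),
                                    c("Coffee", "No coffee")))

# Estimated probability of cancer in each group (cancer count / column total)
p_hat <- dataTable["Cancer", ] / colSums(dataTable)

# Estimated odds of cancer in each group: p / (1 - p)
odds_hat <- p_hat / (1 - p_hat)

round(rbind(p_hat, odds_hat), 7)
```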

Coffee consumption and healthy prostate

          Coffee No coffee
Cancer        19       122
No cancer   2473      7768

Definition: The odds ratio is the odds of success in one group divided by the odds of success in a second group.

\[ \begin{eqnarray*} \hat{O}_{c} & = & 0.007683 \\ \hat{O}_{nc} & = & 0.0157055 \\ \hat{OR} & = & \frac{\hat{O}_{c}}{\hat{O}_{nc}} = 0.4891915 \end{eqnarray*} \]

Coffee consumption and healthy prostate

        Treatment Control
Success         a       b
Failure         c       d

Notes:

The odds \( O \) and odds ratio \( OR \) are population parameters; \( \hat{O} \) and \( \hat{OR} \) are their estimates from the sample.
The odds of success is a ratio of conditional probabilities, \[ \hat{O}_{t} = \frac{\frac{a}{a+c}}{1-\frac{a}{a+c}} = \frac{a}{c} \]
The odds ratio can be calculated as follows: \[ \hat{OR} = \frac{ad}{bc} \]
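As a quick check of the shortcut, the sketch below computes the odds ratio for the coffee data both ways. The function name `odds_ratio` is hypothetical, and the table layout (success in row 1, treatment in column 1) follows the generic table above.

```r
# Odds ratio from a 2 x 2 table laid out as
# rows = (success, failure), columns = (treatment, control)
odds_ratio <- function(tab) {
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])   # ad / bc
}

dataTable <- matrix(c(19, 122, 2473, 7768), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Cancer", "No cancer"),
                                    c("Coffee", "No coffee")))

or_shortcut <- odds_ratio(dataTable)

# Same quantity as a ratio of the two odds, a/c divided by b/d
or_from_odds <- (19 / 2473) / (122 / 7768)

all.equal(or_shortcut, or_from_odds)
or_shortcut
```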

Coffee consumption and healthy prostate

Definition: The standard error for the log-odds ratio is given by

\[ \mathrm{SE}[\ln(\hat{OR})] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]

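You can use this with the “1.96 rule of thumb” to calculate an approximate 95% confidence interval for \( \ln(OR) \); exponentiate its endpoints to obtain an interval for the odds ratio itself.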
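A minimal sketch of that calculation for the coffee data, assuming the interval is built on the log scale and then back-transformed. Variable names are illustrative.

```r
dataTable <- matrix(c(19, 122, 2473, 7768), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Cancer", "No cancer"),
                                    c("Coffee", "No coffee")))

# Log of the estimated odds ratio, ln(ad / bc)
log_or <- log((dataTable[1, 1] * dataTable[2, 2]) /
              (dataTable[1, 2] * dataTable[2, 1]))

# Standard error of the log-odds ratio: sqrt(1/a + 1/b + 1/c + 1/d)
se_log_or <- sqrt(sum(1 / dataTable))

# Approximate 95% confidence interval on the log scale, then back-transformed
ci_log_or <- log_or + c(-1, 1) * 1.96 * se_log_or
ci_or     <- exp(ci_log_or)

c(estimate = exp(log_or), lower = ci_or[1], upper = ci_or[2])
```

The interval should run from roughly 0.30 to 0.79, so it excludes an odds ratio of 1.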