M. Drew LaMar
March 2, 2016
Assumptions of \( \chi^2 \) goodness-of-fit test
The sampling distribution for the \( \chi^2 \) test statistic follows a \( \chi^2 \) distribution only approximately! When is the approximation “good”?
What to do if violated?
Example of another common probability model
1980 explosion of Mount St. Helens - spiders recolonized landscape by dropping out of the airstream (!!!!)
What does it mean for spiders to land on landscape “randomly” in space?
Probability that spider will land on given point is:
To count spiders, place large equal-sized grid across landscape and count number of spiders in each block.
Definition: The Poisson distribution describes the number of successes in blocks of time or space, when successes happen with equal probability and independently across time or space.
\[ \mathrm{Pr[}X \ \mathrm{successes]} = \frac{e^{-\mu}\mu^{X}}{X!} \]
where \( \mu \) is the mean number of successes per block of time or space.
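As a quick check of the formula, the sketch below evaluates it directly in R and compares the result with the built-in `dpois()`. The helper name `poisson_prob` and the choice of \( \mu = 2 \) are illustrative assumptions, not part of the original example.

```r
# Poisson probability computed directly from the formula.
# poisson_prob is a hypothetical helper name used only for illustration.
poisson_prob <- function(x, mu) {
  exp(-mu) * mu^x / factorial(x)
}

mu <- 2      # assumed mean number of successes per block
x  <- 0:10   # numbers of successes to evaluate

by_hand  <- poisson_prob(x, mu)
built_in <- dpois(x, lambda = mu)

# The two probability columns should agree up to floating-point error.
print(cbind(x, by_hand, built_in))
print(all.equal(by_hand, built_in))
```

In practice `dpois()` is the safer choice, since `factorial(x)` overflows for large `x`.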
Definition: The index of dispersion (or variance-to-mean ratio, VMR) is
\[ VMR = \frac{\sigma^{2}}{\mu} \]
For a Poisson distribution, the variance equals the mean, so VMR = 1.
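A simulation of the spider example gives a feel for this ratio. The sketch below drops spiders uniformly and independently on a unit square, counts them in equal-sized blocks, and computes the VMR of the counts, which should come out close to 1 if the counts behave like a Poisson variable. The number of spiders, the grid size, and the seed are arbitrary assumptions chosen for illustration.

```r
set.seed(1)          # arbitrary seed so the run is reproducible

n_spiders <- 2000    # assumed total number of spiders
n_side    <- 20      # assumed grid of 20 x 20 equal-sized blocks

# Each spider lands at a uniformly random point, independently of the others.
x <- runif(n_spiders)
y <- runif(n_spiders)

# Assign each spider to a block, then count spiders per block.
# Using factors keeps empty blocks in the table as zero counts.
edges  <- seq(0, 1, length.out = n_side + 1)
col_id <- cut(x, breaks = edges, include.lowest = TRUE)
row_id <- cut(y, breaks = edges, include.lowest = TRUE)
counts <- as.vector(table(row_id, col_id))

mean_count <- mean(counts)             # n_spiders / n_side^2
vmr <- var(counts) / mean_count        # index of dispersion
print(c(mean = mean_count, VMR = vmr))
```

Because the total number of spiders is fixed, the block counts are only approximately Poisson, but with many blocks the approximation is close. A VMR well above 1 would suggest clumping, and a VMR well below 1 would suggest that spiders are more evenly spaced than random.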
barplot(dpois(0:10, lambda=1), xlab="Number of successes(X)", ylab="Probability", col="firebrick", names.arg=as.character(0:10), cex.axis=1.5, cex.lab=1.5, cex.names=1.5, ylim=c(0,0.4))
barplot(dpois(0:10, lambda=2), xlab="Number of successes(X)", ylab="Probability", col="firebrick", names.arg=as.character(0:10), cex.axis=1.5, cex.lab=1.5, cex.names=1.5, ylim=c(0,0.4))
barplot(dpois(0:10, lambda=3), xlab="Number of successes(X)", ylab="Probability", col="firebrick", names.arg=as.character(0:10), cex.axis=1.5, cex.lab=1.5, cex.names=1.5, ylim=c(0,0.4))
Do extinctions occur randomly through the long fossil record of Earth's history, or are there periods in which extinction rates are unusually high (“mass extinctions”) compared with background rates? The best record of extinctions through Earth's history comes from fossil marine invertebrates because they have hard shells and therefore tend to preserve well.
Our data in this chapter consists of two categorical variables.
We are interested in:
So far, we've said things like “Yeah, that looks like those variables are associated.” It's time to quantify the evidence.
Left: Death of adult passengers following Titanic shipwreck
Right: Mosaic plot if death and sex were independent
Practice Problem #1
Wilson et al. (2011) followed a set of male health professionals for 20 years. Of all the men in the study, 7890 drank no coffee and 2492 drank on average more than 6 cups per day. In the “no coffee” group, 122 developed advanced prostate cancer during the course of the study, and 19 in the “high coffee” group did.
(dataTable <- matrix(c(19, 122, 2473, 7768), nrow = 2, byrow = TRUE, dimnames = list(c("Cancer", "No cancer"), c("Coffee", "No coffee"))))
Coffee No coffee
Cancer 19 122
No cancer 2473 7768
Definition: The odds of success are the probability of success divided by the probability of failure.
\[ O = \frac{p}{1-p} \]
Coffee No coffee
Cancer 19 122
No cancer 2473 7768
Sum 2492 7890
Discuss: What is the estimated odds of developing cancer in the coffee and no coffee groups?
Answer: \[ \begin{eqnarray*} \hat{p}_{c} & = & \frac{19}{2492} = 0.0076244 \\ \hat{O}_{c} & = & \frac{0.0076244}{1 - 0.0076244} = 0.007683 \end{eqnarray*} \]
Answer: \[ \begin{eqnarray*} \hat{p}_{nc} & = & \frac{122}{7890} = 0.0154626 \\ \hat{O}_{nc} & = & \frac{0.0154626}{1 - 0.0154626} = 0.0157055 \end{eqnarray*} \]
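The two hand calculations can be reproduced from the table in a few lines. This is a minimal sketch that rebuilds `dataTable` as defined earlier; the names `p_hat` and `odds_hat` are illustrative.

```r
dataTable <- matrix(c(19, 122, 2473, 7768), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Cancer", "No cancer"),
                                    c("Coffee", "No coffee")))

# Estimated probability of cancer in each group: cancer cases / column total.
p_hat <- dataTable["Cancer", ] / colSums(dataTable)

# Estimated odds: p / (1 - p), which equals cases / non-cases.
odds_hat <- p_hat / (1 - p_hat)

print(round(p_hat, 7))
print(round(odds_hat, 7))
```

Because the probabilities are small, the odds are only slightly larger than the corresponding proportions.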
Definition: The odds ratio is the odds of success in one group divided by the odds of success in a second group.
\[
\begin{eqnarray*}
\hat{O}_{c} & = & 0.007683 \\
\hat{O}_{nc} & = & 0.0157055 \\
\hat{OR} & = & \frac{\hat{O}_{c}}{\hat{O}_{nc}} = 0.4891915
\end{eqnarray*}
\]
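The same odds ratio can be computed straight from the counts. A minimal sketch, again rebuilding `dataTable` so the block runs on its own; `odds` and `odds_ratio` are illustrative names.

```r
dataTable <- matrix(c(19, 122, 2473, 7768), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Cancer", "No cancer"),
                                    c("Coffee", "No coffee")))

# Odds of cancer in each group: cases divided by non-cases.
odds <- dataTable["Cancer", ] / dataTable["No cancer", ]

# Odds ratio: coffee group relative to no-coffee group.
odds_ratio <- odds[["Coffee"]] / odds[["No coffee"]]
print(odds_ratio)
```

An odds ratio below 1 means the odds of cancer are lower in the first group (here, the high-coffee group) than in the second.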
Treatment Control
Success "a" "b"
Failure "c" "d"
Notes:
Definition: The standard error for the log-odds ratio is given by
\[ \mathrm{SE}[\ln(\hat{OR})] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]
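You can use this with the “1.96 rule of thumb” to calculate an approximate 95% confidence interval for the log-odds ratio: \( \ln(\hat{OR}) \pm 1.96 \, \mathrm{SE}[\ln(\hat{OR})] \). Exponentiating both endpoints gives the interval for the odds ratio itself.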
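Applied to the coffee data, the calculation might look like the sketch below. The cell labels follow the a, b, c, d layout shown earlier, with the coffee group treated as "Treatment" and cancer as "Success"; that mapping and the variable names are assumptions made for illustration.

```r
# Cell counts in the a, b, c, d layout (assumed mapping):
#   a = cancer, coffee        b = cancer, no coffee
#   c = no cancer, coffee     d = no cancer, no coffee
cells <- c(a = 19, b = 122, c = 2473, d = 7768)

# Estimated odds ratio.
or_hat <- (cells[["a"]] / cells[["c"]]) / (cells[["b"]] / cells[["d"]])

# Standard error of the log-odds ratio: sqrt(1/a + 1/b + 1/c + 1/d).
se_log_or <- sqrt(sum(1 / cells))

# Approximate 95% interval on the log scale, then back-transformed.
ci_log <- log(or_hat) + c(-1, 1) * 1.96 * se_log_or
ci_or  <- exp(ci_log)

print(round(c(OR = or_hat, SE_log = se_log_or,
              lower = ci_or[1], upper = ci_or[2]), 3))
```

The interval is built on the log scale because the sampling distribution of \( \ln(\hat{OR}) \) is closer to normal than that of \( \hat{OR} \); as a result, the back-transformed interval is not symmetric around the estimate.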