M. Drew LaMar
March 4, 2016
“Standard normal deviate: Not to be confused with 'everyday ordinary pervert.' You don't often find a jargon term that seems to be both redundant and self-contradictory.”
- Whitlock & Schluter
Practice Problem #1
Wilson et al. (2011) followed a set of male health professionals for 20 years. Of all the men in the study, 7890 drank no coffee and 2492 drank on average more than 6 cups per day. In the “no coffee” group, 122 developed advanced prostate cancer during the course of the study, and 19 in the “high coffee” group did.
Treatment Control
Success "a" "b"
Failure "c" "d"
Notes:
Definition: The
standard error for the log-odds ratio is given by
\[ \mathrm{SE}[\ln(\hat{OR})] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]
You can use this with the “1.96 rule of thumb” to calculate a 95% confidence interval. We use log-odds ratios as the sampling distribution for the odds ratio is highly skewed (more on why this matters in Chapter 13).
\[ \begin{array}{c} \ln(\hat{OR}) - 1.96\times\mathrm{SE}[\ln(\hat{OR})] < \ln(OR) < \ln(\hat{OR}) + 1.96\times\mathrm{SE}[\ln(\hat{OR})] \\ c_{L} < \ln(OR) < c_{R} \\ e^{c_{L}} < OR < e^{c_{R}} \end{array} \]
Coffee No coffee
Cancer 19 122
No cancer 2473 7768
(SE_logOR <- sqrt(sum(1/dataTable)))
[1] 0.2477123
(logOR <- log(dataTable[1,1]*dataTable[2,2] / (dataTable[1,2]*dataTable[2,1])))
[1] -0.7150013
Coffee No coffee
Cancer 19 122
No cancer 2473 7768
(CI_logOR <- c(logOR - 1.96*SE_logOR, logOR + 1.96*SE_logOR))
[1] -1.2005175 -0.2294851
(CI <- exp(CI_logOR))
[1] 0.3010384 0.7949428
Discuss: Conclusion?
Using epitools
package
Coffee No coffee
Cancer 19 122
No cancer 2473 7768
Assumes table is in the form:
treatment | control | |
---|---|---|
sick | a | b |
healthy | c | d |
Using epitools
package
library(epitools)
oddsratio(dataTable, method = "wald")
$data
Coffee No coffee Total
Cancer 19 122 141
No cancer 2473 7768 10241
Total 2492 7890 10382
$measure
NA
odds ratio with 95% C.I. estimate lower upper
Cancer 1.0000000 NA NA
No cancer 0.4891915 0.3010411 0.7949357
$p.value
NA
two-sided midp.exact fisher.exact chi.square
Cancer NA NA NA
No cancer 0.001956364 0.002722787 0.003208084
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
oddsratio(dataTable, method = "wald")$measure[-1,]
estimate lower upper
0.4891915 0.3010411 0.7949357
CI
[1] 0.3010384 0.7949428
Can also get hypothesis testing information for association…
oddsratio(dataTable, method = "wald")$p.value
NA
two-sided midp.exact fisher.exact chi.square
Cancer NA NA NA
No cancer 0.001956364 0.002722787 0.003208084
Before moving on to hypothesis testing, know that there is another measure of association: relative risk. More specific context…
Definition:
Relative risk is the probability of an undesired outcome in the treatment group divided by the probability of the same outcome in a control group.
treatment | control | |
---|---|---|
undesired | a | b |
desired | c | d |
\[ \hat{RR} = \frac{\mathrm{Pr[undesired \ | \ treatment]}}{\mathrm{Pr[undesired \ | \ control]}} = \frac{\frac{a}{a+c}}{\frac{b}{b+d}} \]
dTmargins <- addmargins(dataTable, margin=1)
(RR <- (dTmargins[1,1]/dTmargins[3,1]) / (dTmargins[1,2]/dTmargins[3,2]))
[1] 0.4930861
Now using epitools
package. Strangely enough, we need the table to be both transposed and levels reversed for the riskratio
function to work.
riskratio(t(dataTable), rev = "both", method = "wald")
$data
No cancer Cancer Total
No coffee 7768 122 7890
Coffee 2473 19 2492
Total 10241 141 10382
$measure
NA
risk ratio with 95% C.I. estimate lower upper
No coffee 1.0000000 NA NA
Coffee 0.4930861 0.3047198 0.7978932
$p.value
NA
two-sided midp.exact fisher.exact chi.square
No coffee NA NA NA
Coffee 0.001956364 0.002722787 0.003208084
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
riskratio(t(dataTable), rev = "both", method = "wald")$measure[-1,]
estimate lower upper
0.4930861 0.3047198 0.7978932
Remember, the odds ratio is 0.4891915. Hmmm, so close!!!
The relative risk and odds ratio will be very close when the undesired outcome is rare in the population.
Which one do we use to measure association?
Definition: A
case-control study is a method of observational study in which a sample of individuals having a disease or other focal condition (“cases”) is compared to a second sample of individuals who do not have the condition (“controls”).
Total number of cases and controls are chosen by researcher and not sampled randomly from the population. Thus, we cannot estimate relative risk.
Thus begins our venture into parasitology…
Toxoplasma gondii is a protozoan parasite that can infect the brains of many birds and mammals, including humans… In humans, toxoplasmosis may be associated with some mental illnesses, and it may be associated with risky behavior. Yereli et al. (2006) compared the prevalence of Toxoplasma gondii in a sample of 185 drivers between 21 and 40 years old who had been involved in a driving accident (cases) with a sample of 185 drivers of similar age and sex who had not had accidents (control). The researchers were interested in whether Toxoplasma infection may cause a change in the probability of an accident.
infected uninfected
accidents 61 124
no accidents 16 169
par(cex=1.5)
par(mar=c(3,2,1,1))
mosaicplot(toxTable, col=c("forestgreen", "goldenrod1"), cex.axis = 1.2, sub = "Condition", dir = c("h","v"), ylab = "Relative frequency", main="")
infected uninfected
accidents 61 124
no accidents 16 169
Question: Why can’t we estimate relative risk?
Answer: Because we can’t estimate the necessary probabilities!!!
\[
\mathrm{Pr[accidents \ | \ infected]} = \frac{\mathrm{Pr[accidents \ AND \ infected]}}{\mathrm{Pr[infected]}}
\]
infected | uninfected | |
---|---|---|
accidents | 61 | 124 |
no accidents | 16x | 169x |
There exists a scaling factor \( x \) that scales the “no accidents” counts that will make the proportions representative (i.e. with no bias) of the population. The odds ratio thus becomes
\[ OR = \frac{61\times 169x}{124\times 16x} = \frac{61\times 169}{124\times 16} = 5.1960685. \]
Ah hah! So, \( x \) cancels out, and the odds ratio can be computed without worrying about under- or over-representation of the levels of the “cases” in the population.
Discuss: Turn the result \( OR = 5.1960685 \) into a statement.
Answer: The odds of having an accident when infected with toxoplasma is about 5 times larger than having an accident when not infected.
Let's check if this has statistical significance or not by computing a confidence interval on the odds ratio:
oddsratio(toxTable, method="wald")$measure[-1,]
estimate lower upper
5.196069 2.859352 9.442394
estimate lower upper
5.196069 2.859352 9.442394
Discuss: Given this 95% confidence interval, what do you conclude?
Conclusion: Since the 95% confidence interval does not contain 1, an odds ratio of 1 is not a plausible parameter, and thus there is evidence that there is increased odds of having an accident when infected with toxoplasma.
Estimation of association for 2 x 2 contingency tables: odds ratios and relative risks
What about hypothesis testing?
Remember: Hypothesis tests will give you a yes/no answer, but will NOT give you the magnitude of the effect if there is one (hence, always use confidence intervals as well, if possible)
\( \chi^2 \) contingency test
Example 9.4
Many parasites have more than one species of host, so the individual parasite must get from one host to another to complete its life cycle. Trematodes of the species Euhaplorchis californiensis use three hosts during their life cycle.
Example 9.4
Lafferty and Morris (1996) tested the hypothesis that infection influences risk of predation by birds. A large outdoor tank was stocked with three kinds of killfish: unparasitized, lightly infected, and heavily infected. This tank was left open to foraging by birds… The numbers of fish eaten according to their levels of parasitism is given as follows:
Uninfected Lightly Highly Sum
Eaten 1 10 37 48
Not eaten 49 35 9 93
Sum 50 45 46 141
Uninfected Lightly Highly Sum
Eaten 1 10 37 48
Not eaten 49 35 9 93
Sum 50 45 46 141
If we call our two categorical variable “fate” and “infection status”, then what we want to know is are these two variables independent.
How do we compute the frequency table we would expect if the variables were independent? If two variables are independent, we have
\[ \mathrm{Pr[A \ and \ B]} = \mathrm{Pr[A]}\times\mathrm{Pr[B]} \]
Uninfected | Lightly | Highly | Sum | |
---|---|---|---|---|
Eaten | Pr[Eaten and Uninfected] | * | * | Pr[Eaten] |
Not eaten | * | * | * | * |
Sum | Pr[Uninfected] | * | * | * |
For expected and actual table, we have:
Uninfected | Lightly | Highly | Sum | |
---|---|---|---|---|
Eaten | Pr[Eaten and Uninfected] | * | * | Pr[Eaten] |
Not eaten | * | * | * | * |
Sum | Pr[Uninfected] | * | * | * |
Assuming independence, the expected table should have:
Uninfected | Lightly | Highly | Sum | |
---|---|---|---|---|
Eaten | Pr[Eaten]\( \cdot \) Pr[Uninfected] | * | * | Pr[Eaten] |
Not eaten | * | * | * | * |
Sum | Pr[Uninfected] | * | * | * |
Uninfected | Lightly | Highly | Sum | |
---|---|---|---|---|
Eaten | \( 141\cdot\frac{48}{141}\cdot\frac{50}{141} \) | * | * | 48 |
Not eaten | * | * | * | 93 |
Sum | 50 | 45 | 46 | 141 |
Uninfected | Lightly | Highly | Sum | |
---|---|---|---|---|
Eaten | 17.0 | \( 141\cdot\frac{48}{141}\cdot\frac{45}{141} \) | * | 48 |
Not eaten | * | * | * | 93 |
Sum | 50 | 45 | 46 | 141 |
Uninfected | Lightly | Highly | Sum | |
---|---|---|---|---|
Eaten | 17.0 | 15.3 | \( 141\cdot\frac{48}{141}\cdot\frac{46}{141} \) | 48 |
Not eaten | * | * | * | 93 |
Sum | 50 | 45 | 46 | 141 |
Expected frequency table
Uninfected | Lightly | Highly | Sum | |
---|---|---|---|---|
Eaten | 17.0 | 15.3 | 15.7 | 48 |
Not eaten | 33.0 | 29.7 | 30.3 | 93 |
Sum | 50 | 45 | 46 | 141 |
Observed frequency table
Uninfected | Lightly | Highly | Sum | |
---|---|---|---|---|
Eaten | 1 | 10 | 37 | 48 |
Not eaten | 49 | 35 | 9 | 93 |
Sum | 50 | 45 | 46 | 141 |
The \( \chi^2 \) contingency test is just a special case of the \( \chi^2 \) goodness-of-fit test, so the test statistic is the same.
\[ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(Observed_{ij}-Expected_{ij})^2}{Expected_{ij}}, \]
where \( r \) and \( c \) are number of rows and columns, respectively. The number of degrees of freedom is given by
\[ df = rc - r - (c - 1) = (r-1)\times(c-1) \]
Effectively, to get the same marginals as the observed, you set the row sums (\( r \) of them), which then imply the total sum. Once you have the total sum, you only have \( c-1 \) number of columns to specify.
Assumptions of the \( \chi^2 \) contingency test are the same as the \( \chi^2 \) goodness-of-fit test.
What do you do if the assumptions are violated?
Expected frequency table
Uninfected | Lightly | Highly | Sum | |
---|---|---|---|---|
Eaten | 17.0 | 15.3 | 15.7 | 48 |
Not eaten | 33.0 | 29.7 | 30.3 | 93 |
Sum | 50 | 45 | 46 | 141 |
Assumptions seem to be met, so let's use R.
chisq.test(parTable, correct = FALSE)
Pearson's Chi-squared test
data: parTable
X-squared = 69.756, df = 2, p-value = 7.124e-16
Note: correct = FALSE
means no Yates correction (see p. 251)
Conclusion?
Conclusion: Since \( p \)-value is less than 0.05 (actually less than 0.001), then we can reject the null hypothesis of independence.
Fisher’s exact test: (2 x 2 tables only) Examines the independence of two categorical variables, even with small expected values
\( G \)-test: (any table) Derived from principles of likelihood.
Pro : Great with complicated experimental designs with multiple explanatory variables.
Con : Can be less accurate for small sample sizes.