Contingency Analysis

October 16, 2023

Meme-of-the-day

Class Project!

Class Announcements

  • Reading Assignment #7 (WITH QUIZ) due Wednesday (Chapter 9)
  • Homework #5 posted (due Monday, October 23 at 11:59 pm)
  • Lab #4 posted (due Monday, October 23 at 11:59 pm)

Contingency analysis

Our data in this chapter consists of two categorical variables.

We are interested in:

  • Estimating association in 2 x 2 tables (i.e. special case of two categoricals with only 2 levels each)
  • Testing if there is an association (or dependence) between two categorical variables [\(\chi^2 \!\) contingency test]

Quick example

So far, we’ve said things like “Yeah, that looks like those variables are associated.” It’s time to quantify the evidence.

Left: Death of adult passengers following Titanic shipwreck
Right: Mosaic plot if death and sex were independent

Coffee consumption and healthy prostate

Practice Problem #1

Wilson et al. (2011) followed a set of male health professionals for 20 years. Of all the men in the study, 7890 drank no coffee and 2492 drank on average more than 6 cups per day. In the “no coffee” group, 122 developed advanced prostate cancer during the course of the study, and 19 in the “high coffee” group did.

(dataTable <- matrix(c(19, 122, 2473, 7768), nrow = 2, byrow = TRUE, dimnames = list(c("Cancer", "No cancer"), c("Coffee", "No coffee"))))
          Coffee No coffee
Cancer        19       122
No cancer   2473      7768

Coffee consumption and healthy prostate

dataTable
          Coffee No coffee
Cancer        19       122
No cancer   2473      7768

Coffee consumption and healthy prostate

Definition: The odds of success are the probability of success divided by the probability of failure. <br > \[ O = \frac{p}{1-p} \]

dataTable
          Coffee No coffee
Cancer        19       122
No cancer   2473      7768

Discuss: What is the estimated odds of developing cancer in the coffee and no coffee groups?

Coffee consumption and healthy prostate

addmargins(dataTable, margin=1)
          Coffee No coffee
Cancer        19       122
No cancer   2473      7768
Sum         2492      7890

Discuss: What is the estimated odds of developing cancer in the coffee and no coffee groups?

Answer: \[ \begin{eqnarray*} \hat{p}_{c} & = & \frac{19}{2492} = 0.0076244 \\ \hat{O}_{c} & = & \frac{0.0076244}{1 - 0.0076244} = 0.007683 \end{eqnarray*} \]

Coffee consumption and healthy prostate

addmargins(dataTable, margin=1)
          Coffee No coffee
Cancer        19       122
No cancer   2473      7768
Sum         2492      7890

Discuss: What is the estimated odds of developing cancer in the coffee and no coffee groups?

Answer: \[ \begin{eqnarray*} \hat{p}_{nc} & = & \frac{122}{7890} = 0.0154626 \\ \hat{O}_{nc} & = & \frac{0.0154626}{1 - 0.0154626} = 0.0157055 \end{eqnarray*} \]

Coffee consumption and healthy prostate

dataTable
          Coffee No coffee
Cancer        19       122
No cancer   2473      7768
c_odds <- (19/2492)/(1 - (19/2492))
nc_odds <- (122/7890)/(1 - (122/7890))

Definition: The odds ratio is the odds of success in one group divided by the odds of success in a second group.

\[ \begin{eqnarray*} \hat{O}_{c} & = & 0.007683 \\ \hat{O}_{nc} & = & 0.0157055 \\ \hat{OR} & = & \frac{\hat{O}_{c}}{\hat{O}_{nc}} = 0.4891915 \end{eqnarray*} \]

Coffee consumption and healthy prostate

        Treatment Control
Success "a"       "b"    
Failure "c"       "d"    

Notes:

  • The odds and odds ratio are population parameters.
  • The odds of success is a ratio of conditional probabilities, \[\hat{O}_{t} = \frac{\frac{a}{a+c}}{1-\frac{a}{a+c}} = \frac{a}{c}\]
  • The odds ratio can be calculated as follows: \[\hat{OR} = \frac{ad}{bc}\]

Coffee consumption and healthy prostate

Definition: The standard error for the log-odds ratio is given by \[ \mathrm{SE}[\ln(\hat{OR})] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]

You can use this with the “1.96 rule of thumb” to calculate a 95% confidence interval. We use log-odds ratios as the sampling distribution for the odds ratio is highly skewed (more on why this matters in Chapter 13).

\[ \begin{array}{c} \ln(\hat{OR}) - 1.96\times\mathrm{SE}[\ln(\hat{OR})] < \ln(OR) < \ln(\hat{OR}) + 1.96\times\mathrm{SE}[\ln(\hat{OR})] \\ c_{L} < \ln(OR) < c_{R} \\ e^{c_{L}} < OR < e^{c_{R}} \end{array} \]

Coffee consumption and healthy prostate

(dataTable <- matrix(c(19, 122, 2473, 7768), nrow = 2, byrow = TRUE, dimnames = list(c("Cancer", "No cancer"), c("Coffee", "No coffee"))))
          Coffee No coffee
Cancer        19       122
No cancer   2473      7768
SE_logOR <- sqrt(sum(1/dataTable))
logOR <- log(dataTable[1,1]*dataTable[2,2] / (dataTable[1,2]*dataTable[2,1]))

\[ \begin{align} \mathrm{SE}[\ln(\hat{OR})] & = \sqrt{\frac{1}{19} + \frac{1}{122} + \frac{1}{2473} + \frac{1}{7768}} = 0.248 \\ \ln(\hat{OR}) & = \ln\bigl(\frac{ad}{bc}\bigr) = \ln\bigl(\frac{19\cdot 7768}{122\cdot 2473}\bigr) = -0.715 \end{align} \]

Coffee consumption and healthy prostate

          Coffee No coffee
Cancer        19       122
No cancer   2473      7768
(CI_logOR <- c(logOR - 1.96*SE_logOR, logOR + 1.96*SE_logOR))
[1] -1.2005175 -0.2294851
(CI <- exp(CI_logOR))
[1] 0.3010384 0.7949428

Discuss: Conclusion?

Coffee consumption and healthy prostate

Using epitools package

          Coffee No coffee
Cancer        19       122
No cancer   2473      7768

Assumes table is in the form:

treatment control
sick a b
healthy c d

Coffee consumption and healthy prostate

Using epitools package

(OR_result <- oddsratio(dataTable, method = "wald"))

Coffee consumption and healthy prostate

$data
          Coffee No coffee Total
Cancer        19       122   141
No cancer   2473      7768 10241
Total       2492      7890 10382

$measure
                        NA
odds ratio with 95% C.I.  estimate     lower     upper
               Cancer    1.0000000        NA        NA
               No cancer 0.4891915 0.3010411 0.7949357

$p.value
           NA
two-sided    midp.exact fisher.exact  chi.square
  Cancer             NA           NA          NA
  No cancer 0.001956364  0.002722787 0.003208084

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

Coffee consumption and healthy prostate

TMI!!

OR_result$measure[-1,]
 estimate     lower     upper 
0.4891915 0.3010411 0.7949357 
CI
[1] 0.3010384 0.7949428

Coffee consumption and healthy prostate

Can also get hypothesis testing information for association…

OR_result$p.value
           NA
two-sided    midp.exact fisher.exact  chi.square
  Cancer             NA           NA          NA
  No cancer 0.001956364  0.002722787 0.003208084

Coffee consumption and healthy prostate

Before moving on to hypothesis testing, know that there is another measure of association: relative risk. More specific context…

Definition: Relative risk is the probability of an undesired outcome in the treatment group divided by the probability of the same outcome in a control group.

treatment control
undesired a b
desired c d

\[ \hat{RR} = \frac{\mathrm{Pr[undesired \ | \ treatment]}}{\mathrm{Pr[undesired \ | \ control]}} = \frac{\frac{a}{a+c}}{\frac{b}{b+d}} \]

Coffee consumption and healthy prostate

(dTmargins <- addmargins(dataTable, margin=1))
          Coffee No coffee
Cancer        19       122
No cancer   2473      7768
Sum         2492      7890
(RR <- (dTmargins[1,1]/dTmargins[3,1]) / (dTmargins[1,2]/dTmargins[3,2]))
[1] 0.4930861

Coffee consumption and healthy prostate

Now using epitools package. Strangely enough, we need the table to be both transposed and levels reversed for the riskratio function to work.

(RR_result <- riskratio(t(dataTable), rev = "both", method = "wald"))

Coffee consumption and healthy prostate

$data
          No cancer Cancer Total
No coffee      7768    122  7890
Coffee         2473     19  2492
Total         10241    141 10382

$measure
                        NA
risk ratio with 95% C.I.  estimate     lower     upper
               No coffee 1.0000000        NA        NA
               Coffee    0.4930861 0.3047198 0.7978932

$p.value
           NA
two-sided    midp.exact fisher.exact  chi.square
  No coffee          NA           NA          NA
  Coffee    0.001956364  0.002722787 0.003208084

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

Coffee consumption and healthy prostate

RR_result$measure[-1,]
 estimate     lower     upper 
0.4930861 0.3047198 0.7978932 

Remember, the odds ratio is 0.4891915. Hmmm, so close!!!

The relative risk and odds ratio will be very close when the undesired outcome is rare in the population.

Which one do we use to measure association?

Case-control studies

Definition: A case-control study is a method of observational study in which a sample of individuals having a disease or other focal condition (“cases”) is compared to a second sample of individuals who do not have the condition (“controls”).

Total number of cases and controls are chosen by researcher and not sampled randomly from the population. This means we can’t estimate probability of undesired result in control and treatment. Thus, we cannot estimate relative risk.

Thus begins our venture into parasitology…

A few parasitology examples…

A few parasitology examples…

A few parasitology examples…

A few parasitology examples…

…starting with Toxoplasma gondii

Example of case-control study

Toxoplasma gondii is a protozoan parasite that can infect the brains of many birds and mammals, including humans… In humans, toxoplasmosis may be associated with some mental illnesses, and it may be associated with risky behavior. Yereli et al. (2006) compared the prevalence of Toxoplasma gondii in a sample of 185 drivers between 21 and 40 years old who had been involved in a driving accident (cases) with a sample of 185 drivers of similar age and sex who had not had accidents (control). The researchers were interested in whether Toxoplasma infection may cause a change in the probability of an accident.

Example of case-control study

(toxTable <- matrix(c(61, 124, 16, 169), 
                    nrow = 2, 
                    byrow = TRUE, 
                    dimnames = list(c("accidents", 
                                      "no accidents"), 
                                    c("infected", 
                                      "uninfected"))))
             infected uninfected
accidents          61        124
no accidents       16        169

Example of case-control study

par(cex=1.5)
par(mar=c(3,2,1,1))
mosaicplot(toxTable, col=c("forestgreen", "goldenrod1"), cex.axis = 1.2, sub = "Condition", dir = c("h","v"), ylab = "Relative frequency", main="")

Example of case-control study

             infected uninfected
accidents          61        124
no accidents       16        169

Question: Why can’t we estimate relative risk?

Answer: Because we can’t estimate the necessary probabilities!!!

\[ \mathrm{Pr[accidents \ | \ infected]} = \frac{\mathrm{Pr[accidents \ AND \ infected]}}{\mathrm{Pr[infected]}} \]

Example of case-control study

infected uninfected
accidents 61 124
no accidents 16x 169x

There exists a scaling factor \(x\) that scales the “no accidents” counts that will make the proportions representative (i.e. with no bias) of the population.

The odds ratio thus becomes

\[ OR = \frac{61\times 169x}{124\times 16x} = \frac{61\times 169}{124\times 16} = 5.1960685. \]

Ah hah! So, \(x\) cancels out, and the odds ratio can be computed without worrying about under- or over-representation of the levels of the “cases” in the population.

Example of case-control study

Discuss: Turn the result \(OR = 5.1960685\) into a statement.

Answer: The odds of having an accident when infected with toxoplasma is about 5 times larger than having an accident when not infected.

Let’s check if this has statistical significance or not by computing a confidence interval on the odds ratio:

oddsratio(toxTable, method="wald")$measure[-1,]
estimate    lower    upper 
5.196069 2.859352 9.442394 

Example of case-control study

estimate    lower    upper 
5.196069 2.859352 9.442394 

Discuss: Given this 95% confidence interval, what do you conclude?

Conclusion: Since the 95% confidence interval does not contain 1, an odds ratio of 1 is not a plausible parameter, and thus there is evidence that there is increased odds of having an accident when infected with toxoplasma.

Flipside of the parasitic coin

Estimation of association for 2 x 2 contingency tables: odds ratios and relative risks

What about hypothesis testing?

  • \(H_{0}\): There is no association between two categorical variables
  • \(H_{A}\): There is an association between two categorical variables

Remember: Hypothesis tests will give you a yes/no answer, but will NOT give you the magnitude of the effect if there is one (hence, always use confidence intervals as well, if possible)

\(\chi^2\) contingency test

Example of contingency test

Example 9.4

Many parasites have more than one species of host, so the individual parasite must get from one host to another to complete its life cycle. Trematodes of the species Euhaplorchis californiensis use three hosts during their life cycle.

Example of contingency test

Example 9.4

Lafferty and Morris (1996) tested the hypothesis that infection influences risk of predation by birds. A large outdoor tank was stocked with three kinds of killfish: unparasitized, lightly infected, and heavily infected. This tank was left open to foraging by birds… The numbers of fish eaten according to their levels of parasitism is given as follows:

          Uninfected Lightly Highly Sum
Eaten              1      10     37  48
Not eaten         49      35      9  93
Sum               50      45     46 141

Example of contingency test

          Uninfected Lightly Highly Sum
Eaten              1      10     37  48
Not eaten         49      35      9  93
Sum               50      45     46 141

Example of contingency test

If we call our two categorical variable “fate” and “infection status”, then what we want to know is are these two variables independent.

How do we compute the frequency table we would expect if the variables were independent? If two variables are independent, we have

\[\mathrm{Pr[A \ and \ B]} = \mathrm{Pr[A]}\times\mathrm{Pr[B]}\]

Uninfected Lightly Highly Sum
Eaten Pr[Eaten and Uninfected] * * Pr[Eaten]
Not eaten * * * *
Sum Pr[Uninfected] * * *

Example of contingency test

The observed table (in relative frequencies) is given by:

Uninfected Lightly Highly Sum
Eaten Pr[Eaten and Uninfected] * * Pr[Eaten]
Not eaten * * * *
Sum Pr[Uninfected] * * *

Assuming independence, the expected (relative frequencies) table should have:

Uninfected Lightly Highly Sum
Eaten Pr[Eaten]\(\cdot\) Pr[Uninfected] * * Pr[Eaten]
Not eaten * * * *
Sum Pr[Uninfected] * * *

Example of contingency test

Uninfected Lightly Highly Sum
Eaten \(141\cdot\frac{48}{141}\cdot\frac{50}{141}\) * * 48
Not eaten * * * 93
Sum 50 45 46 141
Uninfected Lightly Highly Sum
Eaten 17.0 \(141\cdot\frac{48}{141}\cdot\frac{45}{141}\) * 48
Not eaten * * * 93
Sum 50 45 46 141
Uninfected Lightly Highly Sum
Eaten 17.0 15.3 \(141\cdot\frac{48}{141}\cdot\frac{46}{141}\) 48
Not eaten * * * 93
Sum 50 45 46 141

Example of contingency test

Expected frequency table

Uninfected Lightly Highly Sum
Eaten 17.0 15.3 15.7 48
Not eaten 33.0 29.7 30.3 93
Sum 50 45 46 141

Observed frequency table

Uninfected Lightly Highly Sum
Eaten 1 10 37 48
Not eaten 49 35 9 93
Sum 50 45 46 141

Example of contingency test

The \(\chi^2\) contingency test is just a special case of the \(\chi^2\) goodness-of-fit test, so the test statistic is the same.

\[ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(Observed_{ij}-Expected_{ij})^2}{Expected_{ij}}, \]

where \(r\) and \(c\) are number of rows and columns, respectively. The number of degrees of freedom is given by

\[ \begin{align} df & = rc - 1 - (r-1) - (c - 1) \\ & = rc - r - c + 1 \\ & = (r-1)\times(c-1) \end{align} \]

Example of contingency test

Assumptions of the \(\chi^2\) contingency test are the same as the \(\chi^2\) goodness-of-fit test.

What do you do if the assumptions are violated?

  • If table larger than 2 x 2, you can combine row and/or columns.
  • If table is 2 x 2, you can use Fisher’s exact test.
  • You can use a permutation test (see Chapter 13).

Example of contingency test

Expected frequency table

Uninfected Lightly Highly Sum
Eaten 17.0 15.3 15.7 48
Not eaten 33.0 29.7 30.3 93
Sum 50 45 46 141

Assumptions seem to be met, so let’s use R.

Example of contingency test

First, let’s see how the table was created in R:

(parTable <- matrix(c(1, 10, 37, 49, 35, 9), 
                    nrow = 2, 
                    byrow = TRUE, 
                    dimnames = list(c("Eaten", "Not eaten"), 
                                    c("Uninfected", "Lightly", "Highly"))))
          Uninfected Lightly Highly
Eaten              1      10     37
Not eaten         49      35      9

Example of contingency test

chisq.test(parTable, correct = FALSE)

Note: correct = FALSE means no Yates correction (see p. 251)

Example of contingency test

chisq.test(parTable, correct = FALSE)

    Pearson's Chi-squared test

data:  parTable
X-squared = 69.756, df = 2, p-value = 7.124e-16

Conclusion?

Conclusion: Since \(p\)-value is less than 0.05 (actually less than 0.001), then we can reject the null hypothesis of independence.

Other type of contingency tests

Fisher’s exact test: (2 x 2 tables only) Examines the independence of two categorical variables, even with small expected values

\(G\)-test: (any table) Derived from principles of likelihood.

     Pro: Great with complicated experimental designs with multiple explanatory variables.
     Con: Can be less accurate for small sample sizes.

Magnitudes of association

Situation: You’ve found a statistically significant association between two categorical variables using the \(\chi^2\) contingency test.

Question: Where is the association? In other words, in which levels of the categories is the association present and how large is the association?

We need to estimate the magnitude of the association, which the \(P\)-value does not give us!

We can estimate odds ratios or relative risks for 2 \(\times\) 2 sub-tables within the contingency table by either subsetting or collapsing.

Magnitudes of association

          Uninfected Lightly Highly
Eaten              1      10     37
Not eaten         49      35      9

Magnitudes of association

          Uninfected Highly
Eaten              1     37
Not eaten         49      9

Magnitudes of association

oddsratio(parTable[,c(1,3)], method = "wald")
$data
          Uninfected Highly Total
Eaten              1     37    38
Not eaten         49      9    58
Total             50     46    96

$measure
                        NA
odds ratio with 95% C.I.    estimate        lower      upper
               Eaten     1.000000000           NA         NA
               Not eaten 0.004964148 0.0006020703 0.04093004

$p.value
           NA
two-sided     midp.exact fisher.exact   chi.square
  Eaten               NA           NA           NA
  Not eaten 1.110223e-16 6.861412e-17 4.140762e-15

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"