Categorical Data Analysis 4

Data:

These data involve a case-control study with ovarian cancer cases (Cancer = Yes) and non-cancer controls (Cancer = No), with covariates given by OralCont (oral contraceptive use with two levels, Ever or Never) and Smoking (two levels, Yes or No).

Objective:

The goal of this study is to find the Mutual, Joint and Marginal independences using Pearson’s test Statistics.

Load Libraries

library(gmodels)
library(RColorBrewer)

#PREPARING WORK SPAcE
# Clear the workspace: 
rm(list = ls())

Creating a Contingency Table for Data

# Using matrix function to create 2x2 contingency table
ovarian<- array(c(15, 9, 50, 8, 4, 32, 12, 28),
             dim = c(2, 2, 2),
             dimnames = list(
               OralCont = c("Ever", "Never"),
               Cancer = c("Yes", "No"),
               Smoking = c("Yes", "No")))

ovarian

## , , Smoking = Yes
## 
##         Cancer
## OralCont Yes No
##    Ever   15 50
##    Never   9  8
## 
## , , Smoking = No
## 
##         Cancer
## OralCont Yes No
##    Ever    4 12
##    Never  32 28

Marginal probabilities are: \[ \pi_{i++}= P(A=i), i=1,2,3,....I\]

\[ \pi_{+j+}= P(B=j), j=1,2,3,....J \] \[ \pi_{++k}= P(C=k), k=1,2,3,....K \] so that \[ \pi_{ijk}= \pi_{i++}\pi_{+j+}\pi_{++k} \]

Mutual Independence

Compute the Pearson test statistic of the null hypothesis of mutual independence, \[ H_0:\; \pi_{ijk}=\pi_{i++}\pi_{+j+}\pi_{++k} \] “by hand”, without using any non-trivial R functions. The corresponding estimates of the expected counts under \(H_0\) are \[ \widehat{\mbox{expected}}_{ijk}=n_{+++}\hat\pi_{i++}\hat\pi_{+j+}\hat\pi_{++k}=n_{+++}\frac{n_{i++}}{n_{+++}}\frac{n_{+j+}}{n_{+++}}\frac{n_{++k}}{n_{+++}} \] and the Pearson statistic is \[ X^2=\sum_{i=1}^I\sum_{j=1}^J\sum_{k=1}^K\frac{\left(\mbox{observed}_{ijk}-\widehat{\mbox{expected}}_{ijk}\right)^2}{\widehat{\mbox{expected}}_{ijk}}, \] which we compute as:

# I returns NON-Oral contraceptive patients
n <- ovarian 
for (i in 1:2) {ii=(n[i, , ])}
ii

##       Smoking
## Cancer Yes No
##    Yes   9 32
##    No    8 28

# J returns NON-cancer patients
for (j in 1:2) {jj=  (n[, j, ]) }
jj

##         Smoking
## OralCont Yes No
##    Ever   50 12
##    Never   8 28

# K returns NON-smoking patients
for (k in 1:2) {kk=  (n[, , k])}
kk

##         Cancer
## OralCont Yes No
##    Ever    4 12
##    Never  32 28

X2 <- 0
for (i in 1:2) {
  for (j in 1:2) {
    for (k in 1:2) {
      expected <- sum(n[i, , ]) * sum(n[, j, ]) * sum(n[, , k]) / sum(n) ^ 2
      X2 <- X2 + (n[i, j, k] - expected) ^ 2 / expected
}}}
X2

## [1] 73.86607

More simply,

we can use the loglm function in the MASS library to test for mutual independence:

library(MASS)
# Test for mutual independence
loglm(formula = ~Cancer + Smoking + OralCont, data = ovarian)

## Call:
## loglm(formula = ~Cancer + Smoking + OralCont, data = ovarian)
## 
## Statistics:
##                       X^2 df     P(> X^2)
## Likelihood Ratio 72.16844  4 7.882583e-15
## Pearson          73.86607  4 3.441691e-15

This is the same Pearson statistic.

H0: They are mutually independent
HA: They are NOT mutually independent

Since the P-Value is 0, we reject the null Hypothesis. It means there is some relationship among Cancer, Smoking and Oral contraceptive.

Joint Independence

Now test for joint independence of \((X,Z)=\) (OralCont, Smoking) from \(Y=\) Cancer. The null hypothesis is that \[ H_0:\;\pi_{ijk}=\pi_{i+k}\pi_{+j+} \] with estimated expected counts \[ \widehat{\mbox{expected}}_{ijk}=n_{+++}\hat\pi_{i+k}\hat\pi_{+j+}=n_{+++}\frac{n_{i+k}}{n_{+++}}\frac{n_{+j+}}{n_{+++}}. \]

First, compute Pearson by hand:

X2 <- 0
for(i in 1:2) {
  for (j in 1:2) {
    for (k in 1:2) {
      expected <- sum(n[i, , k]) * sum(n[, j, ]) / sum(n)
      X2 <- X2 + (n[i, j, k] - expected) ^ 2 / expected
    }}}
X2

## [1] 14.89392

More simply, we can use the loglm function in the MASS library to test for joint independence:

 loglm(formula = ~Cancer + Smoking * OralCont, data = ovarian)

## Call:
## loglm(formula = ~Cancer + Smoking * OralCont, data = ovarian)
## 
## Statistics:
##                       X^2 df    P(> X^2)
## Likelihood Ratio 15.16485  3 0.001681057
## Pearson          14.89392  3 0.001909585

H0: Cancer is INDEPENDENT from Oral contraceptive and Smoking.
HA: Cancer is DEPENDENT on Oral contraceptive and Smoking.
HA: Cancer is DEPENDENT on Oral contraceptive.

Since Confidence Interval does not cover Zero, we REJECT the Null hypothesis. It means that cancer is dependent on Oral contraceptive.

chisq.test(n_marginal)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  n_marginal
## X-squared = 13.635, df = 1, p-value = 0.000222

Conditional Independence

Now look at conditional independence of \(X\) and \(Y\), given \(Z\). First estimate conditional odds ratios by using the OR function on sub-arrays (partial tables), one for each level of smoking:

OR(n[, , 1]) # odds ratio given Smoking = Yes

## $thetaHat
## [1] 0.2666667
## 
## $CI
## [1] 0.08757316 0.81201943

OR(n[, , 2]) # odds ratio given Smoking = No

## $thetaHat
## [1] 0.2916667
## 
## $CI
## [1] 0.08439941 1.00793880

Next, conduct (Cochran) Mantel-Haenszel (CMH) test, first using the R function mantelhaen.test:

#Note: R defaults to the last entry in array as conditioning variable Z.
cmh <- mantelhaen.test(ovarian)
cmh

## 
##  Mantel-Haenszel chi-squared test with continuity correction
## 
## data:  ovarian
## Mantel-Haenszel X-squared = 8.3882, df = 1, p-value = 0.003777
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1211236 0.6410490
## sample estimates:
## common odds ratio 
##         0.2786506

Now try CMH via direct computations:

num <- 0
den <- 0
for (k in 1:2){
  E_n11k <- sum(n[1, , k]) * sum(n[, 1, k]) / sum(n[, , k])
  Var_n11k <- sum(n[1, , k]) * sum(n[2, , k]) * sum(n[, 1, k]) * sum(n[, 2, k]) /
    (sum(n[, , k]) ^ 2 * (sum(n[, , k]) - 1))
  num <- num + (n[1, 1, k] - E_n11k)
  den <- den + Var_n11k
}
num ^ 2 / den            # CMH without continuity correction

## [1] 9.610629

(abs(num) - 0.5) ^ 2 / den # CMH with continuity correction

## [1] 8.388188

Finally, let’s compute CMH estimator of the common odds ratio “by hand” and compare to the result from the mantelhaen.test function:

num <- 0
den <- 0
for (k in 1:2) {
  num <- num + n[1, 1, k] * n[2, 2, k] / sum(n[, , k])
  den <- den + n[1, 2, k] * n[2, 1, k] / sum(n[, , k])
}
num / den

## [1] 0.2786506

More formal presentation of the conditional independence test

To summarize the test:

Research question: are the odds of ovarian cancer (\(Y\)) affected by oral contraceptive use (\(X\)), controlling for the potential confounder of smoking (\(Z\))?

Alternative hypothesis (corresponding to the research question): \[ H_a:X\,\mbox{and}\, Y\,\mbox{are not conditionally independent given}\, Z \] or \[ H_a:\theta_{XY(1)}\ne 1\,\mbox{or}\,\theta_{XY(2)}\ne 1\,\mbox{(or both)}. \]

Null hypothesis: \[ H_0:\mbox{conditional independence of}\, X\,\mbox{and}\,Y\,\mbox{given}\,Z \] or \[ H_0:\theta_{XY(1)}=\theta_{XY(2)}=1. \]

Test statistic: CMH

Distribution of the test statistic, under the null hypothesis: \[ \mbox{CMH is approximately}\sim\chi^2_1. \]

Results: With continuity correction,
\[ X_{CMH}^2=8.3881882, \] with corresponding \(p\)-value of \[ \mbox{P}(\chi^2_1>X_{CMH}^2)=0.0037767. \] We reject the null hypothesis and conclude that \(X\) and \(Y\) are not conditionally independent given \(Z\).

Conclusions in the scientific context: We have evidence that the odds of ovarian cancer (\(Y\)) are affected by oral contraceptive use (\(X\)), controlling for the potential confounder of smoking (\(Z\)). The size of this effect is large. The odds ratios are quite similar for both smokers and nonsmokers (homogeneous association). The estimated common odds ratio is 0.2786506, so that oral contraceptive users have about 0.2786506 times the odds of ovarian cancer as non-users, whether or not they smoke.

References:
1. Colorado state Lesson Notes.(Generalized Liner models)
2. https://online.stat.psu.edu/stat504/lesson/3/3.1/3.1.1

***********************

Categorical Data Analysis 4

Mustafa Arslan

10/27/2021