Tuyns et al. (1977) carried out a case-control study of esophageal cancer in the Ille-et-Vilaine region of Brittany, France. The corresponding data set is oesoph_new.dta. Use logistic regression models to answer each of the following questions. For each question, carefully state the appropriate logistic regression model and the relevant hypothesis, both in the context of the problem and in terms of model parameters. Use both the Wald and likelihood ratio methods to carry out any hypothesis tests, and provide relevant estimated odds ratios (with 95% confidence intervals) where appropriate.
Read and preprocess the data set
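# haven provides read_dta(); dplyr provides %>%, mutate() and case_when();
# gtsummary provides tbl_regression()
library(haven)
library(dplyr)
library(gtsummary)
oesoph <- read_dta("oesoph_new.dta")
# Recode the numeric group codes as labelled factors
oesoph$agegp <- oesoph$agegp %>% as.factor() %>%
  factor(levels = c("0","1","2","3","4","5"),
         labels = c("25-34","35-44","45-54",
                    "55-64","65-74","75+"))
oesoph$alcgp <- oesoph$alcgp %>% as.factor() %>%
  factor(levels = c("0","1","2","3"),
         labels = c("0-39","40-79","80-119","120+"))
oesoph$tobgp <- oesoph$tobgp %>% as.factor() %>%
  factor(levels = c("0","1","2","3"),
         labels = c("0-9","10-19","20-29","30+"))
# "control" is the first level, so glm() models the odds of being a case
oesoph$casestatus <- oesoph$casestatus %>% as.factor() %>%
  factor(levels = c("0","1"),
         labels = c("control","case"))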
a. Investigate the relationship between alcohol consumption and incidence of esophageal cancer. Treat alcohol consumption as a dichotomous variable (≥ 80 g/day vs. < 80 g/day), ignoring age.
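The question asks for the model and hypotheses to be stated explicitly, so they can be written out before fitting. The notation below ($x$, $\beta_0$, $\beta_1$, $G^2$, $\ell$) is introduced here for illustration and is not taken from the assignment.

```latex
\[
\operatorname{logit}\,\Pr(\text{case} \mid x) = \beta_0 + \beta_1 x,
\qquad
x =
\begin{cases}
1 & \text{alcohol} \ge 80\ \text{g/day} \\
0 & \text{alcohol} < 80\ \text{g/day}
\end{cases}
\]
\[
H_0 : \beta_1 = 0 \ (\mathrm{OR} = 1)
\quad \text{vs.} \quad
H_1 : \beta_1 \ne 0,
\qquad
\mathrm{OR} = e^{\beta_1}
\]
\[
\text{Wald: } z = \frac{\hat\beta_1}{\widehat{\mathrm{SE}}(\hat\beta_1)} \sim N(0,1),
\qquad
\text{LR: } G^2 = -2\,\bigl(\ell_{\text{null}} - \ell_{\text{full}}\bigr) \sim \chi^2_1
\quad \text{under } H_0
\]
```

In context, $H_0$ says that the odds of esophageal cancer are the same for heavy and light drinkers. Because this is a case-control study, $\beta_0$ has no direct interpretation, but $e^{\beta_1}$ still estimates the exposure odds ratio.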
oesoph <- oesoph %>%
  mutate(alc80 = case_when(
    alcgp %in% c("0-39","40-79") ~ FALSE,
    alcgp %in% c("80-119","120+") ~ TRUE))
summary(oesoph$alc80)
## Mode FALSE TRUE
## logical 770 205
eso.alc80 <- glm(casestatus ~ alc80, data = oesoph, family = binomial)
eso.alc80 %>% tbl_regression(exponentiate = TRUE)
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| alc80 | | | |
| alc80TRUE | 5.64 | 4.00, 7.96 | <0.001 |

OR = Odds Ratio, CI = Confidence Interval
=> Alcohol consumption of >= 80 g/day is significantly associated with esophageal cancer (OR = 5.6, 95% C.I. = 4.0 - 8.0, p < 0.001).
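The table reports a single p-value, while the question asks for both the Wald and the likelihood ratio test. A minimal sketch of both follows, using base R only. It assumes the 2x2 counts from the published tabulation of this study (96 and 104 cases, 109 and 666 controls). These counts are not printed above and are an assumption, although they agree with the group sizes of 205 and 770 and reproduce OR = 5.64. With the real data, the same steps apply to `casestatus ~ alc80` fitted on `oesoph`.

```r
# ASSUMPTION: 2x2 counts taken from the published tabulation of the study,
# not read from oesoph_new.dta. They match the margins 770 / 205 shown above.
tab <- data.frame(
  alc80    = c(FALSE, TRUE),
  cases    = c(104, 96),
  controls = c(666, 109)
)

# Grouped-binomial fits: same coefficients and the same deviance difference
# as the individual-level models
fit0 <- glm(cbind(cases, controls) ~ 1,     data = tab, family = binomial)
fit1 <- glm(cbind(cases, controls) ~ alc80, data = tab, family = binomial)

# Wald test of H0: beta1 = 0, with a Wald 95% CI on the OR scale
est     <- coef(summary(fit1))["alc80TRUE", ]
beta1   <- unname(est["Estimate"])
se1     <- unname(est["Std. Error"])
wald_z  <- beta1 / se1
wald_p  <- 2 * pnorm(-abs(wald_z))
or      <- exp(beta1)
wald_ci <- exp(beta1 + c(-1, 1) * qnorm(0.975) * se1)

# Likelihood ratio test: drop in deviance from the null model, 1 df
lr_stat <- deviance(fit0) - deviance(fit1)
lr_p    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(c(OR = or, lower = wald_ci[1], upper = wald_ci[2]))
print(c(wald_z = wald_z, wald_p = wald_p, lr_stat = lr_stat, lr_p = lr_p))
```

The Wald interval computed here may differ in the last digit from the interval in the table if that one is based on the profile likelihood.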
b. Investigate the relationship between alcohol consumption and incidence of esophageal cancer, controlling for the potential confounding effects of age. Treat alcohol consumption as a dichotomous variable (≥ 80 g/day vs. < 80 g/day), and age as a dichotomous variable (25 to 54 years old vs. 55 to 75+ years old). Give your assessment of the extent of confounding by age using the models fit in (a) and (b).
oesoph <- oesoph %>%
  mutate(age55 = case_when(
    agegp %in% c("25-34","35-44","45-54") ~ FALSE,
    agegp %in% c("55-64","65-74","75+") ~ TRUE))
summary(oesoph$age55)
## Mode FALSE TRUE
## logical 528 447
eso.age55.alc80 <- glm(casestatus ~ age55 + alc80, data = oesoph, family = binomial)
eso.age55.alc80 %>% tbl_regression(exponentiate = TRUE)
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| age55 | | | |
| age55TRUE | 4.04 | 2.83, 5.83 | <0.001 |
| alc80 | | | |
| alc80TRUE | 5.68 | 3.96, 8.18 | <0.001 |

OR = Odds Ratio, CI = Confidence Interval
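=> After adjusting for age (dichotomous, 25-54 vs. 55-75+) in a multiple logistic regression, alcohol consumption >= 80 g/day remains a significant predictor of esophageal cancer (adjusted OR = 5.7, 95% C.I. = 4.0 - 8.2). The adjusted OR (5.68) is almost identical to the crude OR from (a) (5.64), so there is little evidence of confounding by dichotomized age.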
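The extent of confounding can be quantified as the relative change in the alcohol OR after adjustment. The sketch below uses only the two estimates reported in the tables for (a) and (b). The 10% cut-off is a conventional rule of thumb and is an assumption here, not something specified by the assignment.

```r
# Estimates copied from the tables for models (a) and (b)
or_crude    <- 5.64  # casestatus ~ alc80
or_adjusted <- 5.68  # casestatus ~ age55 + alc80

# Relative change in the OR after adjusting for age, in percent
pct_change <- 100 * (or_adjusted - or_crude) / or_crude

# ASSUMPTION: a 10% change-in-estimate threshold (a common convention)
threshold  <- 10
confounded <- abs(pct_change) > threshold

cat(sprintf("Change in OR after adjustment: %.2f%%\n", pct_change))
cat("Exceeds threshold:", confounded, "\n")
```

A change well below the threshold supports reporting the crude and adjusted estimates as essentially the same.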
c. Investigate the evidence of interaction between age and alcohol consumption in relation to incidence of esophageal cancer. Treat alcohol consumption and age as dichotomous variables as in (b).
glm(casestatus ~ age55*alc80, data = oesoph, family = binomial) %>%
tbl_regression(exponentiate = TRUE)
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| age55 | | | |
| age55TRUE | 4.74 | 3.00, 7.72 | <0.001 |
| alc80 | | | |
| alc80TRUE | 7.36 | 4.10, 13.3 | <0.001 |
| age55 * alc80 | | | |
| age55TRUE * alc80TRUE | 0.66 | 0.31, 1.39 | 0.3 |

OR = Odds Ratio, CI = Confidence Interval
=> There is no significant interaction between age and alcohol consumption (interaction OR = 0.66, 95% C.I. = 0.31 - 1.39, p = 0.3; age dichotomized at 55 years, alcohol consumption dichotomized at 80 g/day).
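The p-value in the table is a Wald test of the interaction coefficient. A likelihood ratio version, together with the age-specific alcohol ORs implied by the interaction model, can be sketched as follows. The four-cell counts are an assumption taken from the published tabulation of the study, not from `oesoph_new.dta`; they agree with the margins shown earlier (528 / 447 and 770 / 205).

```r
# ASSUMPTION: cell counts from the published tabulation of the study,
# collapsed to age55 x alc80. Not read from oesoph_new.dta.
tab <- data.frame(
  age55    = c(FALSE, FALSE, TRUE, TRUE),
  alc80    = c(FALSE, TRUE,  FALSE, TRUE),
  cases    = c(26,  30, 78,  66),
  controls = c(408, 64, 258, 45)
)

fit_main <- glm(cbind(cases, controls) ~ age55 + alc80,
                data = tab, family = binomial)
fit_int  <- glm(cbind(cases, controls) ~ age55 * alc80,
                data = tab, family = binomial)

# LR test of H0: interaction coefficient = 0 (1 df)
lr      <- anova(fit_main, fit_int, test = "LRT")
lr_stat <- lr$Deviance[2]
lr_p    <- lr[["Pr(>Chi)"]][2]

# Alcohol OR within each age stratum, from the interaction model
b        <- coef(fit_int)
or_young <- unname(exp(b["alc80TRUE"]))
or_old   <- unname(exp(b["alc80TRUE"] + b["age55TRUE:alc80TRUE"]))
or_adj   <- unname(exp(coef(fit_main)["alc80TRUE"]))

print(c(lr_stat = lr_stat, lr_p = lr_p))
print(c(OR_25_54 = or_young, OR_55plus = or_old, OR_adjusted = or_adj))
```

The stratum-specific ORs make the interaction term concrete: the alcohol OR in the older group is the OR in the younger group multiplied by the interaction OR (7.36 x 0.66, about 4.9).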
d. Investigate the relationship between alcohol consumption and incidence of esophageal cancer. First, treat alcohol consumption as a categorical variable with four categories (0 to 39 g/day, 40 to 79 g/day, 80 to 119 g/day, and ≥ 120 g/day), by using indicator variables for the various categories (select 0 to 39 g/day as the reference group); second, treat alcohol consumption as an ordered variable by appropriately coding the four categories of consumption. Compare the two analyses and discuss whether an increasing trend in risk, as alcohol consumption increases, adequately fits the pattern of risks for the four categories.
summary(oesoph$alcgp)
## 0-39 40-79 80-119 120+
## 415 355 138 67
glm(casestatus ~ alcgp, data = oesoph, family = binomial) %>%
tbl_regression(exponentiate = TRUE)
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| alcgp | | | |
| 0-39 | reference | | |
| 40-79 | 3.57 | 2.29, 5.70 | <0.001 |
| 80-119 | 7.80 | 4.71, 13.2 | <0.001 |
| 120+ | 27.2 | 14.7, 52.3 | <0.001 |

OR = Odds Ratio, CI = Confidence Interval
oesoph <- oesoph %>%
mutate(alcgp2 = as.numeric(alcgp))
summary(oesoph$alcgp2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.853 2.000 4.000
glm(casestatus ~ alcgp2, data = oesoph, family = binomial) %>%
tbl_regression(exponentiate = TRUE)
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| alcgp2 | 2.85 | 2.38, 3.43 | <0.001 |

OR = Odds Ratio, CI = Confidence Interval
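To compare the two analyses, note that the trend model is nested in the indicator model, so a likelihood ratio test with 3 - 1 = 2 degrees of freedom checks whether a single per-category OR is adequate. The per-category OR of 2.85 implies ORs of about 2.85, 8.1 and 23.1 relative to 0-39 g/day, compared with 3.57, 7.80 and 27.2 from the indicator model. The sketch below carries out the comparison. The case and control counts per category are an assumption taken from the published tabulation of the study; they agree with the group sizes 415, 355, 138 and 67 printed above.

```r
# ASSUMPTION: counts per alcohol category from the published tabulation
# of the study, not read from oesoph_new.dta.
lv  <- c("0-39", "40-79", "80-119", "120+")
tab <- data.frame(
  alcgp    = factor(lv, levels = lv),
  cases    = c(29,  75,  51, 45),
  controls = c(386, 280, 87, 22)
)
tab$alcgp2 <- as.numeric(tab$alcgp)  # scores 1 to 4, as in the text

fit_cat   <- glm(cbind(cases, controls) ~ alcgp,
                 data = tab, family = binomial)
fit_trend <- glm(cbind(cases, controls) ~ alcgp2,
                 data = tab, family = binomial)

# LR test of the trend model against the indicator model
lr_stat <- deviance(fit_trend) - deviance(fit_cat)
lr_df   <- df.residual(fit_trend) - df.residual(fit_cat)
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# ORs vs. the 0-39 g/day group: implied by the trend model and observed
or_step     <- unname(exp(coef(fit_trend)["alcgp2"]))
or_implied  <- or_step^(1:3)
or_observed <- unname(exp(coef(fit_cat)[-1]))

print(rbind(implied = or_implied, observed = or_observed))
print(c(lr_stat = lr_stat, df = lr_df, p = lr_p))
```

A large p-value from this test would indicate that the linear trend on the log-odds scale fits the four categories adequately; a small one would favor keeping the indicator variables.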