Tuyns et al. (1977) carried out a case-control study of esophageal cancer in the Ille-et-Vilaine region of Brittany, France. The corresponding data set is oesoph_new.dta. Use logistic regression models to answer each of the following questions. For each question, carefully state the appropriate logistic regression model and the relevant hypothesis, both in the context of the problem and in terms of model parameters. Use both the Wald and likelihood ratio methods to carry out any hypothesis tests, and provide relevant estimated odds ratios (with 95% confidence intervals) where appropriate.

Read and preprocess the dataset

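library(haven)      # read_dta()
library(dplyr)      # %>%, mutate(), case_when()
library(gtsummary)  # tbl_regression()

oesoph <- read_dta("oesoph_new.dta")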

oesoph$agegp <- oesoph$agegp %>% as.factor() %>% 
  factor(levels = c("0","1","2","3","4","5"),
         labels = c("25-34","35-44","45-54",
                    "55-64","65-74","75+"))
oesoph$alcgp <- oesoph$alcgp %>% as.factor() %>%
  factor(levels = c("0","1","2","3"),
         labels = c("0-39","40-79","80-119","120+"))
oesoph$tobgp <- oesoph$tobgp %>% as.factor() %>%
  factor(levels = c("0","1","2","3"),
         labels = c("0-9","10-19","20-29","30+"))
oesoph$casestatus <- oesoph$casestatus %>% as.factor() %>%
  factor(levels = c("0","1"),
         labels = c("control","case"))

a. Investigate the relationship between alcohol consumption and incidence of esophageal cancer. Treat alcohol consumption as a dichotomous variable (>= 80 g/day vs. < 80 g/day), ignoring age.
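Since each question asks for the model and hypothesis in terms of parameters, one way to write them for part (a) is sketched below. The symbols \beta_0 and \beta_1 are notation introduced here for illustration; alc80 is the 0/1 indicator for consumption of at least 80 g/day constructed in the code for this part.

```latex
\[
\operatorname{logit} P(\text{case} \mid \text{alc80})
  = \beta_0 + \beta_1\,\text{alc80},
\qquad
H_0:\ \beta_1 = 0 \quad \text{vs.} \quad H_1:\ \beta_1 \neq 0,
\qquad
\text{OR} = e^{\beta_1}.
\]
```

Because the data come from a case-control design, \beta_0 does not estimate a baseline risk, but e^{\beta_1} is still a valid estimate of the odds ratio for heavy versus light drinkers.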

oesoph <- oesoph %>% 
  mutate(alc80 = case_when(
    alcgp %in% c("0-39","40-79") ~ FALSE, 
    alcgp %in% c("80-119","120+") ~ TRUE))
summary(oesoph$alc80)
##    Mode   FALSE    TRUE 
## logical     770     205
eso.alc80 <- glm(casestatus ~ alc80, data = oesoph, family = binomial)
eso.alc80 %>% tbl_regression(exponentiate = TRUE)
Characteristic    OR      95% CI        p-value
alc80
  alc80TRUE       5.64    4.00, 7.96    <0.001

OR = Odds Ratio, CI = Confidence Interval

=> Alcohol consumption >= 80 g/day is significantly associated with esophageal cancer (crude OR = 5.6, 95% C.I. = 4.0 - 8.0, Wald p < 0.001).
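The question asks for both Wald and likelihood ratio tests, while the table above reports only the Wald p-value. A minimal sketch of how both could be obtained from a fitted binomial glm follows. The helper name wald_and_lr is hypothetical and not part of the original analysis; it assumes the model was fitted with a data frame that is still available in the workspace.

```r
# Hypothetical helper: Wald and likelihood ratio tests for one term
# of a fitted binomial glm (for example eso.alc80).
wald_and_lr <- function(fit, term, coef_name) {
  # Wald test: z = estimate / standard error, read from the coefficient table
  ct <- coef(summary(fit))
  wald_z <- unname(ct[coef_name, "z value"])
  wald_p <- unname(ct[coef_name, "Pr(>|z|)"])

  # Likelihood ratio test: refit without the term and compare deviances
  reduced <- update(fit, as.formula(paste(". ~ . -", term)))
  lr_stat <- deviance(reduced) - deviance(fit)
  lr_df   <- df.residual(reduced) - df.residual(fit)
  lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

  c(wald_z = wald_z, wald_p = wald_p,
    lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p)
}
```

With the objects defined above, the call would be wald_and_lr(eso.alc80, "alc80", "alc80TRUE"). The same helper should work for the alc80 term in the age-adjusted model.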

b. Investigate the relationship between alcohol consumption and incidence of esophageal cancer, controlling for the potential confounding effects of age. Treat alcohol consumption as a dichotomous variable (>= 80 g/day vs. < 80 g/day), and age as a dichotomous variable (25 to 54 years old or 55 to 75+ years old). Give your assessment of the extent of confounding by age using the models fit in (a) and (b).

oesoph <- oesoph %>%
  mutate(age55 = case_when(
    agegp %in% c("25-34","35-44","45-54") ~ FALSE,
    agegp %in% c("55-64","65-74","75+") ~ TRUE))
summary(oesoph$age55)
##    Mode   FALSE    TRUE 
## logical     528     447
eso.age55.alc80 <- glm(casestatus ~ age55 + alc80, data = oesoph, family = binomial)
eso.age55.alc80 %>% tbl_regression(exponentiate = TRUE)
Characteristic    OR      95% CI        p-value
age55
  age55TRUE       4.04    2.83, 5.83    <0.001
alc80
  alc80TRUE       5.68    3.96, 8.18    <0.001

OR = Odds Ratio, CI = Confidence Interval

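=> After adjusting for age (dichotomized as 25-54 vs. 55+ years) in a multiple logistic regression, alcohol consumption >= 80 g/day remains a significant predictive factor for esophageal cancer (adjusted OR = 5.7, 95% C.I. = 4.0 - 8.2). The adjusted OR (5.68) is almost identical to the crude OR from (a) (5.64), so there is little evidence of confounding by age with this coding.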
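The extent of confounding can be summarized as the relative change in the alcohol OR between the crude model in (a) and the age-adjusted model in (b). The sketch below uses only the two odds ratios reported in the tables above. The 10% cutoff mentioned in the comment is a common rule of thumb, not something specified in the assignment.

```r
# Reported odds ratios for alc80 from the two fitted models
or_crude    <- 5.64  # model (a): casestatus ~ alc80
or_adjusted <- 5.68  # model (b): casestatus ~ age55 + alc80

# Relative change in the OR after adjusting for age, in percent.
# A change above roughly 10% is often read as meaningful confounding.
pct_change <- 100 * (or_adjusted - or_crude) / or_crude
round(pct_change, 1)  # 0.7
```

A change of under 1% suggests that age, dichotomized at 55 years, barely confounds the alcohol association in these data.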

c. Investigate the evidence of interaction between age and alcohol consumption in relation to incidence of esophageal cancer. Treat alcohol consumption and age as dichotomous variables as in (b).

glm(casestatus ~ age55*alc80, data = oesoph, family = binomial) %>%
  tbl_regression(exponentiate = TRUE)
Characteristic             OR      95% CI        p-value
age55
  age55TRUE                4.74    3.00, 7.72    <0.001
alc80
  alc80TRUE                7.36    4.10, 13.3    <0.001
age55 * alc80
  age55TRUE * alc80TRUE    0.66    0.31, 1.39    0.3

OR = Odds Ratio, CI = Confidence Interval

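=> The interaction term between age and alcohol consumption is not statistically significant (ratio of ORs = 0.66, 95% C.I. = 0.31 - 1.39, Wald p = 0.3), so there is no strong evidence that the effect of alcohol differs between the two age groups (age dichotomized at 55 years, alcohol consumption dichotomized at 80 g/day).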
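In the interaction model, the alc80 row is the alcohol OR in the reference age group (25-54), and the interaction row is the factor by which that OR is multiplied in the older group. A short sketch of the implied stratum-specific ORs, using only the rounded estimates from the table above (so the products are approximate):

```r
# Rounded estimates reported for casestatus ~ age55 * alc80
or_alc_younger <- 7.36  # alcohol OR when age55 is FALSE (25-54 years)
or_interaction <- 0.66  # ratio of alcohol ORs, older vs. younger

# Alcohol OR in the older group = main-effect OR times the interaction ratio
or_alc_older <- or_alc_younger * or_interaction
round(c(younger = or_alc_younger, older = or_alc_older), 2)
```

The point estimates differ (about 7.4 vs. 4.9), but the wide interval for the interaction ratio (0.31 to 1.39) includes 1, which is consistent with the non-significant test.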

d. Investigate the relationship between alcohol consumption and incidence of esophageal cancer. First, treat alcohol consumption as a categorical variable with four categories (0 to 39 g/day, 40 to 79 g/day, 80 to 119 g/day, and >= 120 g/day), by using indicator variables for the various categories (select 0 to 39 g/day as the reference group); second, treat alcohol consumption as an ordered variable by appropriately coding the four categories of consumption. Compare the two analyses and discuss whether an increasing trend in risk, as alcohol consumption increases, adequately fits the pattern of risks for the four categories.

summary(oesoph$alcgp)
##   0-39  40-79 80-119   120+ 
##    415    355    138     67
glm(casestatus ~ alcgp, data = oesoph, family = binomial) %>%
  tbl_regression(exponentiate = TRUE)
Characteristic        OR      95% CI        p-value
alcgp
  0-39 (reference)
  40-79               3.57    2.29, 5.70    <0.001
  80-119              7.80    4.71, 13.2    <0.001
  120+                27.2    14.7, 52.3    <0.001

OR = Odds Ratio, CI = Confidence Interval

oesoph <- oesoph %>%
  mutate(alcgp2 = as.numeric(alcgp))
summary(oesoph$alcgp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.853   2.000   4.000
glm(casestatus ~ alcgp2, data = oesoph, family = binomial) %>%
  tbl_regression(exponentiate = TRUE)
Characteristic    OR      95% CI        p-value
alcgp2            2.85    2.38, 3.43    <0.001

OR = Odds Ratio, CI = Confidence Interval
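To compare the two analyses, one option is to check what the trend model implies for each category and then test formally whether the categorical model fits better. The sketch below first raises the per-step OR of 2.85 to the powers 1 to 3 and sets the result beside the categorical ORs reported above. It then defines a hypothetical helper, lr_trend_vs_categorical, that fits both models and runs a likelihood ratio test. The helper assumes a data frame with the columns casestatus, alcgp and alcgp2 as constructed above. No test result is shown because it was not part of the original output.

```r
# ORs vs. the 0-39 reference implied by the linear trend model (OR per step = 2.85)
or_per_step <- 2.85
implied_or  <- or_per_step^(1:3)

# ORs estimated with indicator variables (categorical model)
categorical_or <- c(3.57, 7.80, 27.2)

data.frame(
  category    = c("40-79", "80-119", "120+"),
  trend       = round(implied_or, 2),
  categorical = categorical_or
)

# Hypothetical helper: likelihood ratio test of the trend model against the
# categorical model. The trend model is nested in the categorical one, so the
# test has 2 degrees of freedom (3 indicator terms vs. 1 slope).
lr_trend_vs_categorical <- function(data) {
  trend <- glm(casestatus ~ alcgp2, data = data, family = binomial)
  categ <- glm(casestatus ~ alcgp,  data = data, family = binomial)
  anova(trend, categ, test = "Chisq")
}
```

Calling lr_trend_vs_categorical(oesoph) would give the deviance difference and its p-value. A non-significant result would indicate that the single trend parameter describes the four categories adequately.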