Homework: Regression standardization and time-to-event outcomes

PHTH 6800

Author

Zaiba Amla

This homework covers regression standardization for binary and time-to-event outcomes, building on material from weeks 6 and 7. You will work with the cancer dataset from the survival package throughout. To load it, run:

library(survival)
library(tidyverse)
# add any additional packages you need here

data("cancer")

The dataset contains survival data on lung cancer patients. The exposure of interest is ph.ecog, a measure of physical functioning (ECOG performance status), where 0 = asymptomatic and higher scores (1, 2, and 3) indicate greater impairment. The time variable is time (days) and the event indicator is status (1 = censored, 2 = dead) (though we will convert that). We will assume age (age) and sex (sex) are confounders (so we have exchangeability conditional on those variables). You may treat all variables as linear terms with no interactions or special transformations.

Before doing anything else, run the following to create dead as a 0/1 indicator and restrict to complete cases on the variables of interest. We’ll also create a binary exposure to use in some problems.

cancer <- cancer |>
  filter(!is.na(ph.ecog), !is.na(age), !is.na(sex)) |>
  mutate(
    dead = as.integer(status == 2),
    ecog_bin = as.integer(ph.ecog >= 1)
  )

Question 1

One causal estimand we care about is the average treatment effect (ATE). For a binary outcome \(Y\) and a binary exposure \(A\), the ATE is \(E(Y^1 - Y^0)\).

For a time-to-event outcome, however, we need to anchor our estimand to a specific point in time, because the probability of experiencing the event changes over time.

Let \(T\) be the time to death and let \(A \in \{0, 1\}\) represent whether a patient has any ECOG impairment (\(A = 1\) if ph.ecog \(\geq 1\), \(A = 0\) if ph.ecog \(= 0\), which is what we now have in ecog_bin). Write down the causal estimand for the risk difference at 365 days in potential outcomes notation.

\(RD_{365} = P(T^1 \leq 365) - P(T^0 \leq 365)\)

Now write down the corresponding causal estimand for the survival difference (rather than risk difference) at 365 days. Explain in words how the risk difference and survival difference at 365 days relate to each other algebraically. Do they give equivalent information?

\(SD_{365} = P(T^1 > 365) - P(T^0 > 365)\) - Since \(P(T > 365) = 1 - p(T \leq 365)\) then we get \(SD_{365} = -RD_{365}\). - This gives us the equivalent information: same magnitude and opposite sign.

Let’s define \(Y = \mathbf{1}(T \leq 365)\)—a binary indicator of dying within one year. Write down the ATE for this derived binary outcome and show that it equals the risk difference you wrote in part (a).

\(ATE = E(Y^1 - Y^0) = E(Y^1) - E(Y^0) = P(Y^1 = 1) - P(Y^0 = 1) = P(T^1 \leq 365) - P(T^0 \leq 365)\)
Same as the risk difference in part a as by the definition of expectation for a binary indicator is \(E[1(T^a \leq 365)] = P(T^a \leq 365)\). Converting the survival outcome to a binary variable and ATE is the same as the casual risk difference.
When there is a time horizon, time to event estimates become outcome estimates.

Question 2

In the absence of censoring, the simplest way to estimate the causal risk difference at a specific time point is to convert the time-to-event outcome into a binary variable and then apply regression standardization.

Create a binary variable dead_365 that equals 1 if the patient died within 365 days and 0 otherwise. Note that some patients are censored before 365 days—for now, treat them as if they did not die (i.e., set dead_365 = 0 if time < 365 and dead = 0). How many events do you have?

cancer <- cancer |> mutate(dead_365 = as.integer(dead == 1 & time <= 365))

sum(cancer$dead_365)

[1] 120

Using standardization of a logistic regression model of dead_365, estimate the risks at 365 days as well as the risk difference and ratio comparing any ECOG impairment (\(A = 1\)) to no impairment (\(A = 0\)), adjusting for age and sex.

fit <- glm(dead_365 ~ ecog_bin + age + sex, data = cancer, family = binomial)

std <- standardize_glm(fit, values = list(ecog_bin = c(0, 1)), data = cancer,
                       contrast = "difference", reference = 0)

Error in standardize_glm(fit, values = list(ecog_bin = c(0, 1)), data = cancer, : could not find function "standardize_glm"

print(std)

Error: object 'std' not found

Draw a DAG that shows how treating censored observations as non-events introduces measurement error. The DAG should include the true outcome (death within 365 days), the measured outcome, the exposure (ECOG), and confounders (age, sex) and any other variables you need. Describe/justify your DAG in a sentence or two.

Question 3

Another option would be to exclude the early censored observations and then apply regression standardization to the remaining cohort. This is what people are doing when they require a minimum amount of follow-up time in a cohort study.

Create a restricted cohort that includes only patients who were either dead by or followed for at least 365 days. How many observations do you have, and how many events?

cohort_restricted <- cancer |> filter(time >= 365 | dead == 1)

nrow(cohort_restricted)

[1] 185

sum(cohort_restricted$dead_365)

[1] 120

Next, fit a logistic regression model and use standardize_glm() to estimate the risks, risk difference, and risk ratio at 365 days.

fit2 <- glm(dead_365 ~ ecog_bin + age + sex, data = cohort_restricted, family = binomial)

std2 <- standardize_glm(fit2, values = list(ecog_bin = c(0, 1)), data = cohort_restricted, contrast = "difference", reference = 0)

Error in standardize_glm(fit2, values = list(ecog_bin = c(0, 1)), data = cohort_restricted, : could not find function "standardize_glm"

print(std2)

Error: object 'std2' not found

Draw a DAG that represents the bias introduced by restricting the cohort to those with adequate follow-up. This should show how restricting on follow-up time creates selection bias. Include the restriction variable \(S\) (indicator for being included in analysis), the exposure, confounders, and outcome. Describe/justify your DAG in a sentence or two.

Question 4

Now we will use the time-to-event outcome properly, accounting for censoring through the Cox model.

Fit a Cox proportional hazards model for the time-to-event outcome (Surv(time, dead)) as a function of ecog_bin, age, and sex. What is the estimated hazard ratio for ecog_bin? Interpret it.

cox_fit <- coxph(Surv(time, dead) ~ ecog_bin + age + sex, data = cancer)
summary(cox_fit)

Call:
coxph(formula = Surv(time, dead) ~ ecog_bin + age + sex, data = cancer)

  n= 227, number of events= 164 

              coef exp(coef)  se(coef)      z Pr(>|z|)   
ecog_bin  0.553606  1.739515  0.189132  2.927  0.00342 **
age       0.014609  1.014716  0.009154  1.596  0.11052   
sex      -0.552300  0.575624  0.168322 -3.281  0.00103 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

         exp(coef) exp(-coef) lower .95 upper .95
ecog_bin    1.7395     0.5749    1.2007    2.5201
age         1.0147     0.9855    0.9967    1.0331
sex         0.5756     1.7372    0.4139    0.8006

Concordance= 0.623  (se = 0.024 )
Likelihood ratio test= 23.24  on 3 df,   p=4e-05
Wald test            = 21.71  on 3 df,   p=7e-05
Score (logrank) test = 22.1  on 3 df,   p=6e-05

Use standardize_coxph() from stdReg2 to estimate the standardized survival for ecog_bin = 0 and ecog_bin = 1, evaluated at times = 365. Report the standardized survival probabilities and 95% confidence intervals.

std_cox <- standardize_coxph(formula = Surv(time, dead) ~ ecog_bin + age + sex, data = cancer, values               = list(ecog_bin = c(0, 1)), times = 365, contrasts = "difference", reference = 0)

Error in standardize_coxph(formula = Surv(time, dead) ~ ecog_bin + age + : could not find function "standardize_coxph"

print(std_cox)

Error: object 'std_cox' not found

Compute the risks (cumulative incidence), risk difference and risk ratio at 365 days and define the quantities using potential outcomes notation. How do these compare to your estimates from the other approaches?

std_cox_ratio <- standardize_coxph(Surv(time, dead) ~ ecog_bin + age + sex, data = cancer, values = list(ecog_bin = c(0, 1)), times = 365, contrasts = "ratio", reference = 0)

Error in standardize_coxph(Surv(time, dead) ~ ecog_bin + age + sex, data = cancer, : could not find function "standardize_coxph"

print(std_cox_ratio)

Error: object 'std_cox_ratio' not found

Question 5

So far we have dichotomized ph.ecog into a binary variable. The original variable takes values 0, 1, 2, and 3.

Fit a Cox proportional hazards model for time to death on the ph.ecog variable, adjusting for age and sex. (You can just use ph.ecog as a linear term.) Interpret the hazard ratio for ph.ecog.

cox_fit2 <- coxph(Surv(time, dead) ~ ph.ecog + age + sex, data = cancer, x = TRUE)
summary(cox_fit2)

Call:
coxph(formula = Surv(time, dead) ~ ph.ecog + age + sex, data = cancer, 
    x = TRUE)

  n= 227, number of events= 164 

             coef exp(coef)  se(coef)      z Pr(>|z|)    
ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05 ***
age      0.011067  1.011128  0.009267  1.194 0.232416    
sex     -0.552612  0.575445  0.167739 -3.294 0.000986 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
ph.ecog    1.5900     0.6289    1.2727    1.9864
age        1.0111     0.9890    0.9929    1.0297
sex        0.5754     1.7378    0.4142    0.7994

Concordance= 0.637  (se = 0.025 )
Likelihood ratio test= 30.5  on 3 df,   p=1e-06
Wald test            = 29.93  on 3 df,   p=1e-06
Score (logrank) test = 30.5  on 3 df,   p=1e-06

A one-unit increase in ECOG score is associates with X times the instantaneous hazard of death when we adjust for age and sex.

Use standardize_coxph() to estimate the standardized survival at 365 days for each level of ph.ecog (0, 1, 2, 3), treating the original four-level variable as the exposure (and properly accounting for censoring). Report the risk ratio at 365 days for each level compared to ph.ecog = 0. (You can write out the calculations manually if you want instead of writing code to calculate them.)

std_cox2 <- standardize_coxph(Surv(time, dead) ~ ph.ecog + age + sex, data = cancer, values = list(ph.ecog = c(0,1,2,3)), times = 365)

Error in standardize_coxph(Surv(time, dead) ~ ph.ecog + age + sex, data = cancer, : could not find function "standardize_coxph"

print(std_cox2)

Error: object 'std_cox2' not found

print(standardize_coxph(Surv(time, dead) ~ ph.ecog + age + sex, data = cancer, values = list(ph.ecog = c(0,1,2,3)), times = 365, contrasts = "ratio", reference = 0))

Error in standardize_coxph(Surv(time, dead) ~ ph.ecog + age + sex, data = cancer, : could not find function "standardize_coxph"

Write the estimand from part (b) for ph.ecog = 2 vs. ph.ecog = 0 in potential outcomes notation.

\(RR = \frac {P(T^{a=2} \leq 365)}{P(T^{a=0} \leq 365)}\)

A colleague says: “Just report the hazard ratio from the Cox model, it’s a lot easier.” Write a brief response to your colleague explaining why you prefer to report your estimates from part (b).

There are two main issues with the hazard ratio: - The ratio is an instantaneous rate at every moment in time which is not an interpretative probability. - The proportional hazards assumption means that its an average over time and doesn’t correspond to any single moment.

We archive a standardized risk at 365 days as we get an absolute and this is clinically meaningful estimated that is anchored to a specific time.