library(readxl)
library(epitools)
data <- read_xls("/Users/nirajanbudhathoki/Downloads/wcgs.xls")
Survival Analysis
Contingency Table Methods for Binary Outcomes
We discuss contingency table methods for assessing associations between binary outcomes and categorical predictors.
Measures of Risk and Association for Binary Outcomes
The example comes from the Western Collaborative Group Study (WCGS) of coronary heart disease (CHD), a prospective cohort study. An association of interest to the original investigators was the relationship between CHD and the presence or absence of corneal arcus senilis among participants upon entry into the study.
Compute various measures of risk and association:
contingency_table <- table(data$arcus, data$chd69)
# Display the contingency table and perform risk analysis
risk_analysis <- riskratio(contingency_table)
print(risk_analysis)
$data
        No  Yes Total
0     2058  153  2211
1      839  102   941
Total 2897  255  3152
$measure
risk ratio with 95% C.I.
estimate lower upper
0 1.000000 NA NA
1 1.566419 1.233865 1.988603
$p.value
two-sided
midp.exact fisher.exact chi.square
0 NA NA NA
1 0.0003159313 0.0003407814 0.0002216321
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
odds_analysis <- oddsratio(contingency_table)
print(odds_analysis)
$data
        No  Yes Total
0     2058  153  2211
1      839  102   941
Total 2897  255  3152
$measure
odds ratio with 95% C.I.
estimate lower upper
0 1.000000 NA NA
1 1.635794 1.254341 2.125278
$p.value
two-sided
midp.exact fisher.exact chi.square
0 NA NA NA
1 0.0003159313 0.0003407814 0.0002216321
$correction
[1] FALSE
attr(,"method")
[1] "median-unbiased estimate & mid-p exact CI"
# Rows of the table are arcus (exposure); mosaicplot puts rows on the x-axis
mosaicplot(contingency_table, color = c("darkred", "gold"),
           xlab = "Arcus (exposure)", ylab = "CHD")
Relation between CHD and Age Categories
c_table <- table(data$agec, data$chd69)
odds_ana <- oddsratio(c_table)
print(odds_ana)
$data
        No Yes Total
35-40  512  31   543
41-45 1036  55  1091
46-50  680  70   750
51-55  463  65   528
56-60  206  36   242
Total 2897 257  3154
$measure
odds ratio with 95% C.I.
estimate lower upper
35-40 1.000000 NA NA
41-45 0.874988 0.5594913 1.392239
46-50 1.694848 1.1025702 2.662486
51-55 2.310984 1.4905110 3.656761
56-60 2.879861 1.7331311 4.812690
$p.value
two-sided
midp.exact fisher.exact chi.square
35-40 NA NA NA
41-45 5.665956e-01 5.587536e-01 5.690688e-01
46-50 1.573609e-02 2.049340e-02 1.653286e-02
51-55 1.499986e-04 1.631541e-04 1.561175e-04
56-60 5.019449e-05 4.717556e-05 2.186623e-05
$correction
[1] FALSE
attr(,"method")
[1] "median-unbiased estimate & mid-p exact CI"
# Set seed for reproducibility
set.seed(123)
# Generate 100 random BMI categories
BMI <- factor(sample(c("Underweight", "Normal weight", "Overweight", "Obese"),
size = 100,
replace = TRUE))
# Generate 100 binary readmissions (0 or 1)
# Assuming some randomness; you can adjust probabilities for categories if needed
Readmissions <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.6, 0.4))
# Combine into a data frame
hypothetical_data <- data.frame(BMI = BMI, Readmissions = Readmissions)
# Display the first few rows of the dataset
head(hypothetical_data)
            BMI Readmissions
1    Overweight            0
2    Overweight            0
3    Overweight            0
4 Normal weight            1
5    Overweight            0
6 Normal weight            1
library(ggplot2)
ggplot(hypothetical_data, aes(x = BMI, fill = factor(Readmissions))) +
geom_bar(position = "fill") +
ylab("Proportion") +
xlab("BMI Category") +
ggtitle("Proportion of Readmissions by BMI Category") +
scale_fill_discrete(name = "Readmissions", labels = c("No", "Yes"))
# Convert BMI to a factor with appropriate reference level
hypothetical_data$BMI <- relevel(hypothetical_data$BMI, ref = "Normal weight")
# Logistic regression model
logit_model <- glm(Readmissions ~ BMI, data = hypothetical_data, family = binomial)
# Summary of the model
summary(logit_model)
Call:
glm(formula = Readmissions ~ BMI, family = binomial, data = hypothetical_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1542 0.3934 -0.392 0.695
BMIObese -0.2025 0.6306 -0.321 0.748
BMIOverweight -0.6444 0.5620 -1.146 0.252
BMIUnderweight -0.4336 0.5571 -0.778 0.436
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 132.81 on 99 degrees of freedom
Residual deviance: 131.35 on 96 degrees of freedom
AIC: 139.35
Number of Fisher Scoring iterations: 4
# Odds ratios and confidence intervals
exp(cbind(Odds_Ratio = coef(logit_model), confint(logit_model)))
Waiting for profiling to be done...
Odds_Ratio 2.5 % 97.5 %
(Intercept) 0.8571429 0.3896217 1.856692
BMIObese 0.8166667 0.2311126 2.803878
BMIOverweight 0.5250000 0.1702348 1.566756
BMIUnderweight 0.6481481 0.2136164 1.925463
If the chi-squared test or logistic regression shows a significant relationship, conduct post-hoc pairwise comparisons between BMI categories.
library(multcomp)
# Perform pairwise comparisons
pairwise <- glht(logit_model, linfct = mcp(BMI = "Tukey"))
summary(pairwise)
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: glm(formula = Readmissions ~ BMI, family = binomial, data = hypothetical_data)
Linear Hypotheses:
Estimate Std. Error z value Pr(>|z|)
Obese - Normal weight == 0 -0.2025 0.6306 -0.321 0.989
Overweight - Normal weight == 0 -0.6444 0.5620 -1.146 0.660
Underweight - Normal weight == 0 -0.4336 0.5571 -0.778 0.864
Overweight - Obese == 0 -0.4418 0.6356 -0.695 0.899
Underweight - Obese == 0 -0.2311 0.6312 -0.366 0.983
Underweight - Overweight == 0 0.2107 0.5627 0.374 0.982
(Adjusted p values reported -- single-step method)
library(ResourceSelection)
# Hosmer-Lemeshow test
hoslem.test(hypothetical_data$Readmissions, fitted(logit_model), g = 3)
Hosmer and Lemeshow goodness of fit (GOF) test
data: hypothetical_data$Readmissions, fitted(logit_model)
X-squared = 1.8412e-24, df = 1, p-value = 1
# ROC curve and AUC
library(pROC)
roc_curve <- roc(hypothetical_data$Readmissions, fitted(logit_model))
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(roc_curve)
auc(roc_curve)
Area under the curve: 0.5692
Use decision tree algorithms to explore the relationship in a non-linear way.
library(rpart)
library(rpart.plot)
tree_model <- rpart(Readmissions ~ BMI, data = hypothetical_data, method = "class")
rpart.plot(tree_model)
# Convert BMI to an ordered factor
hypothetical_data$BMI <- ordered(hypothetical_data$BMI,
levels = c("Underweight", "Normal weight", "Overweight", "Obese"))
# Fit the logistic regression model with ordered BMI
logit_model_ordered <- glm(Readmissions ~ BMI, data = hypothetical_data, family = binomial)
# Summary of the model
summary(logit_model_ordered)
Call:
glm(formula = Readmissions ~ BMI, family = binomial, data = hypothetical_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.474280 0.211289 -2.245 0.0248 *
BMI.L 0.010952 0.441678 0.025 0.9802
BMI.Q 0.004098 0.422577 0.010 0.9923
BMI.C 0.483926 0.402571 1.202 0.2293
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 132.81 on 99 degrees of freedom
Residual deviance: 131.35 on 96 degrees of freedom
AIC: 139.35
Number of Fisher Scoring iterations: 4
# Odds ratios and confidence intervals
exp(cbind(Odds_Ratio = coef(logit_model_ordered), confint(logit_model_ordered)))
Waiting for profiling to be done...
Odds_Ratio 2.5 % 97.5 %
(Intercept) 0.622333 0.4079083 0.9371228
BMI.L 1.011012 0.4203971 2.4111499
BMI.Q 1.004107 0.4357372 2.3016944
BMI.C 1.622432 0.7406156 3.6226329
library(ggplot2)
ggplot(hypothetical_data, aes(x = BMI, y = Readmissions)) +
stat_summary(fun = mean, geom = "point") + # Mean readmissions per BMI category
stat_summary(fun = mean, geom = "line", group = 1) +
xlab("BMI Category") +
ylab("Mean Readmissions") +
ggtitle("Relationship Between BMI and Readmissions")The absence of “last follow-up” data in your study introduces a challenge because survival analysis requires accurate information on whether a patient was still alive (censored) or had died (event occurred) at the time of data extraction (September–October 2024). Without this, there is uncertainty about the survival status of some patients. Key issues:
- Accurate classification of censoring:
In survival analysis, patients who have not experienced the event (death, in your study) by the end of the study are considered censored. Accurate censoring requires the last known follow-up date: the most recent date on which the patient was confirmed to be alive. If you don’t have this, you might incorrectly assume that the patient was alive until the date of data extraction, which can:
- Overestimate survival time for some censored patients.
- Introduce bias in your survival estimates and hazard ratios.
- Misclassification of events:
If you cannot confirm follow-up for patients who are not recorded as deceased:
- Some patients might have died but are incorrectly classified as alive (censored) at the time of data extraction.
- This misclassification affects the calculation of survival probabilities and distorts the relationship between predictors (e.g., smoking) and mortality.
- Survival bias:
The lack of last follow-up data assumes that all censored patients survived until the time of data extraction. This assumption may not hold, especially for patients who dropped out of follow-up earlier:
- If these patients had worse health outcomes (e.g., death shortly after transplant), the model might underestimate mortality.
Example:
Imagine a patient underwent a liver transplant in 2015 and has no recorded follow-up after 2019. Without knowing their last follow-up date:
You might assume the patient was alive in September 2024 (censored), but they could have died anytime between 2019 and 2024.
This leads to overestimating the survival time for this patient, which could skew the survival curve and hazard ratios.
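To see the size of the gap in this example, a quick date calculation helps (all dates below are hypothetical stand-ins for the scenario above):
transplant <- as.Date("2015-06-01")  # hypothetical transplant date
last_seen  <- as.Date("2019-06-01")  # hypothetical last recorded follow-up
extraction <- as.Date("2024-09-30")  # data extraction window
as.numeric(last_seen - transplant)   # follow-up actually observed: 1461 days (~4 years)
as.numeric(extraction - transplant)  # survival time if assumed alive: 3409 days (~9.3 years)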
How This Affects Analysis
Kaplan-Meier Survival Curves:
Kaplan-Meier analysis relies on precise classification of events and censoring.
If censoring dates are inaccurate, survival estimates will be biased.
Cox Proportional Hazards Model:
- Misclassification of events (censoring vs. death) will result in incorrect hazard ratio estimates for smoking and other predictors.
Strategies to Address the Issue
Sensitivity Analysis:
Test how results change under different assumptions about patients with unknown last follow-up (see the sketch after this list):
- Best-case scenario: Assume all censored patients were alive at the time of data extraction.
- Worst-case scenario: Assume all censored patients died at the midpoint between their last known transplant date and data extraction.
Impute Last Follow-Up Dates:
- If feasible, estimate follow-up times based on available data (e.g., average follow-up duration for similar patients).
Collaborate with Clinicians or Data Collectors:
- Attempt to retrieve follow-up data for as many patients as possible, especially those with incomplete records.
Transparent Reporting:
- Clearly state the limitation in your study’s methodology section and discuss its potential impact on your findings.
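A minimal sketch of the worst-case sensitivity analysis described in the list above, assuming a data frame with time_to_event (days) and mortality (1 = death) columns like the simulated dataset used later in this section; the reclassification rule is illustrative, not prescriptive.
library(survival)
worst_case <- hypothetical_data                  # copy of the analysis dataset
censored   <- worst_case$mortality == 0
# Worst case: assume censored patients died at the midpoint of their observed time
worst_case$mortality[censored]     <- 1
worst_case$time_to_event[censored] <- worst_case$time_to_event[censored] / 2
km_worst <- survfit(Surv(time_to_event, mortality) ~ smoking, data = worst_case)
# Compare km_worst with the best-case fit (all censored assumed alive) to gauge
# how sensitive the survival curves are to the unknown follow-up assumption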
Abstract Example:
Background: The impact of smoking on post-liver transplant mortality remains poorly understood. This study investigates the association between smoking status (non-smoker, former smoker, current smoker) and mortality in liver transplant recipients from January 2013 to January 2023.
Methods: This retrospective cohort study included all adult liver transplant recipients during the study period. Smoking status was categorized into three levels. Mortality was the primary outcome, analyzed using Kaplan-Meier survival curves and Cox proportional hazards models. Patients without recorded deaths by the time of data extraction (September–October 2024) were assumed to be alive, with follow-up censored at the date of extraction.
Results: Among X patients included in the study, Y% were non-smokers, Z% were former smokers, and W% were current smokers. Mortality was observed in P% of the cohort. Kaplan-Meier survival analysis demonstrated [specific trends]. In Cox regression, current smoking was associated with a [hazard ratio] compared to non-smokers (95% CI: [X-Y], p = [value]).
Conclusion: Current smoking may increase post-liver transplant mortality. However, the assumption that all censored patients were alive at the time of data extraction could bias survival estimates, warranting caution in interpreting results. Future studies should incorporate complete follow-up data to validate these findings.
Key Features of Transparent Reporting:
State Assumptions Clearly: Explicitly mention the assumption about censored patients being alive at the time of data extraction.
Acknowledge Potential Bias: Highlight the limitation and its implications for interpreting results.
Emphasize Need for Future Work: Suggest ways to address the limitation in future research (e.g., complete follow-up data).
# Load necessary libraries
library(survival)
library(survminer)
# Hypothetical dataset
set.seed(42)
n <- 150
hypothetical_data <- data.frame(
smoking = factor(sample(c("Non-smoker", "Former Smoker", "Current Smoker"), n, replace = TRUE)),
time_to_event = rexp(n, rate = 1/2000), # Simulate survival times (in days)
mortality = sample(c(0, 1), n, replace = TRUE, prob = c(0.7, 0.3)) # Simulate event status
)
# Attach so columns can be referenced directly in Surv() below
attach(hypothetical_data)
# Create a survival object
surv_object <- Surv(time_to_event, mortality)
# Fit Kaplan-Meier model stratified by smoking status
km_fit <- survfit(surv_object ~ smoking, data = hypothetical_data)
# Summary of the Kaplan-Meier model
summary(km_fit)
Call: survfit(formula = surv_object ~ smoking, data = hypothetical_data)
smoking=Current Smoker
time n.risk n.event survival std.err lower 95% CI upper 95% CI
150 34 1 0.971 0.0290 0.9154 1.000
412 31 1 0.939 0.0417 0.8611 1.000
960 26 1 0.903 0.0535 0.8042 1.000
1594 25 1 0.867 0.0624 0.7530 0.998
1842 23 1 0.829 0.0701 0.7027 0.979
1858 22 1 0.792 0.0764 0.6552 0.956
1896 21 1 0.754 0.0815 0.6099 0.932
2757 12 1 0.691 0.0959 0.5265 0.907
3218 8 1 0.605 0.1165 0.4145 0.882
6095 3 1 0.403 0.1820 0.1664 0.977
6639 2 1 0.202 0.1691 0.0389 1.000
smoking=Former Smoker
time n.risk n.event survival std.err lower 95% CI upper 95% CI
3.3 61 1 0.984 0.0163 0.952 1.000
56.6 59 1 0.967 0.0230 0.923 1.000
119.3 58 1 0.950 0.0280 0.897 1.000
150.4 57 1 0.934 0.0321 0.873 0.999
171.7 55 1 0.917 0.0357 0.849 0.989
237.1 53 1 0.899 0.0390 0.826 0.979
398.9 51 1 0.882 0.0420 0.803 0.968
434.9 50 1 0.864 0.0447 0.781 0.956
502.7 48 1 0.846 0.0473 0.758 0.944
599.5 45 1 0.827 0.0498 0.735 0.931
1195.0 36 1 0.804 0.0535 0.706 0.916
1469.5 34 1 0.781 0.0569 0.677 0.901
1724.1 29 1 0.754 0.0610 0.643 0.883
2077.6 21 1 0.718 0.0678 0.596 0.864
2080.0 20 1 0.682 0.0733 0.552 0.842
2580.5 17 1 0.642 0.0792 0.504 0.817
3099.0 14 1 0.596 0.0858 0.449 0.790
5626.3 6 1 0.497 0.1155 0.315 0.783
6136.4 5 1 0.397 0.1282 0.211 0.748
smoking=Non-smoker
time n.risk n.event survival std.err lower 95% CI upper 95% CI
189 50 1 0.980 0.0198 0.942 1.000
238 49 1 0.960 0.0277 0.907 1.000
399 46 1 0.939 0.0341 0.875 1.000
597 44 1 0.918 0.0394 0.844 0.998
808 38 1 0.894 0.0452 0.809 0.987
1314 31 1 0.865 0.0521 0.768 0.973
1460 30 1 0.836 0.0578 0.730 0.957
1605 27 1 0.805 0.0634 0.690 0.939
1774 25 1 0.773 0.0686 0.649 0.920
3184 20 1 0.734 0.0752 0.601 0.898
4155 16 1 0.688 0.0834 0.543 0.873
4491 13 1 0.635 0.0922 0.478 0.844
7178 1 1 0.000 NaN NA NA
# Plot Kaplan-Meier curves
ggsurvplot(
km_fit,
data = hypothetical_data,
pval = TRUE, # Add log-rank test p-value
conf.int = TRUE, # Confidence intervals
risk.table = TRUE, # Add risk table below the plot
xlab = "Time (days)",
ylab = "Survival Probability",
legend.labs = c("Non-smoker", "Former Smoker", "Current Smoker"),
title = "Kaplan-Meier Survival Analysis by Smoking Status"
)
# Summary of Kaplan-Meier fit
km_summary <- summary(km_fit)
# Median survival times for each smoking group
median_survival <- summary(km_fit)$table[, "median"]
# Print median survival times
print(median_survival)
smoking=Current Smoker  smoking=Former Smoker     smoking=Non-smoker
              6095.335               5626.296               7177.657
The log-rank test (survdiff) only indicates whether there is a statistically significant difference in survival among the groups overall (a global test); it does not specify which groups differ. A sketch of the global test follows; to determine which groups differ significantly, perform pairwise comparisons between groups.
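For reference, the global test itself is a one-liner on the simulated data above:
# Global log-rank test across all three smoking groups
survdiff(Surv(time_to_event, mortality) ~ smoking, data = hypothetical_data)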
pairwise_results <- pairwise_survdiff(
Surv(time_to_event, mortality) ~ smoking,
data = hypothetical_data,
p.adjust.method = "bonferroni" # Options: "holm", "BH", "fdr", etc.
)
# Display adjusted p-values
print(pairwise_results)
Pairwise comparisons using Log-Rank test
data: hypothetical_data and smoking
Current Smoker Former Smoker
Former Smoker 1 -
Non-smoker 1 1
P value adjustment method: bonferroni
If the overall log-rank test is significant (e.g., p = 0.02), report the specific pairwise comparisons where significant differences are observed (e.g., “Current smokers had significantly worse survival compared to non-smokers and former smokers, with adjusted p-values of 0.01 and 0.02, respectively”). Emphasize the clinical relevance and magnitude of differences, such as hazard ratios or median survival times.
Code to Test the Proportional Hazards Assumption
- Fit a Cox Proportional Hazards Model: Even if the focus is on the log-rank test, you need a Cox model to test the PH assumption.
# Fit the Cox proportional hazards model
cox_model <- coxph(Surv(time_to_event, mortality) ~ smoking, data = hypothetical_data)
# Test proportional hazards assumption
ph_test <- cox.zph(cox_model)
# Display results
print(ph_test)
        chisq df    p
smoking 3.06 2 0.22
GLOBAL 3.06 2 0.22
# Plot scaled Schoenfeld residuals for visual inspection
plot(ph_test)
Output and Interpretation
smoking: Tests the PH assumption for the smoking variable.
- p > 0.05: No evidence of violation for smoking levels.
- p < 0.05: Evidence of a violation, suggesting non-proportional hazards for smoking.
GLOBAL: Tests the overall PH assumption for the entire model.
- p > 0.05: No evidence of a global violation.
- p < 0.05: Indicates at least one variable violates the PH assumption.
Visual Inspection:
The plot from plot(ph_test) shows scaled Schoenfeld residuals over time. If the residuals fluctuate randomly around 0, the PH assumption holds; systematic trends or patterns suggest a violation.
What to Do If the PH Assumption is Violated?
- Stratify the Model: Stratify the Cox model by the variable violating the PH assumption (e.g., smoking) if the goal is to adjust for it without estimating its effect:
cox_model_strat <- coxph(Surv(time_to_event, mortality) ~ strata(smoking), data = hypothetical_data)
- Include Time-Dependent Covariates: Model non-proportional hazards by including an interaction term between the variable and time, as in the sketch below:
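A minimal sketch of one way to do this with coxph()'s time-transform facility; the log(t) transform is an assumption chosen for illustration, not the only option:
# Interaction of smoking with log(time) via a time-transform (tt) term
cox_model_tt <- coxph(
  Surv(time_to_event, mortality) ~ smoking + tt(smoking),
  data = hypothetical_data,
  tt = function(x, t, ...) model.matrix(~ x)[, -1] * log(t)
)
summary(cox_model_tt)  # tt() rows show how each smoking effect changes with time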
- Switch to Time-Varying Analysis: If the assumption fails globally, consider flexible models such as Aalen’s additive model (sketched below) or stratified log-rank tests.
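A sketch of Aalen’s additive model using survival::aareg() on the same simulated data:
# Aalen's additive hazards model: covariate effects are allowed to vary over time
aalen_fit <- aareg(Surv(time_to_event, mortality) ~ smoking, data = hypothetical_data)
summary(aalen_fit)  # test statistics for each term
plot(aalen_fit)     # cumulative regression functions over time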
Summary
If PH Holds: Proceed with the log-rank test and Cox regression.
If PH Fails: Use stratified Cox models, time-varying covariates, or alternative survival analysis methods to adjust for non-proportional hazards.