library(readxl)
library(epitools)
data <- read_xls("/Users/nirajanbudhathoki/Downloads/wcgs.xls")
Survival Analysis
Contingency Table Methods for Binary Outcomes
We discuss contingency table methods for assessing associations between binary outcomes and categorical predictors.
Measures of Risk and Association for Binary Outcomes
The example comes from the Western Collaborative Group Study (WCGS) of coronary heart disease (CHD), a prospective cohort study. An association of interest to the original investigators was the relationship between CHD and the presence or absence of corneal arcus senilis among participants upon entry into the study.
Compute various measures of risk and association:
contingency_table <- table(data$arcus, data$chd69)
# Display the contingency table and perform risk analysis
risk_analysis <- riskratio(contingency_table)
print(risk_analysis)
$data
        No  Yes Total
0     2058  153  2211
1      839  102   941
Total 2897  255  3152
$measure
risk ratio with 95% C.I.
estimate lower upper
0 1.000000 NA NA
1 1.566419 1.233865 1.988603
$p.value
two-sided
midp.exact fisher.exact chi.square
0 NA NA NA
1 0.0003159313 0.0003407814 0.0002216321
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
odds_analysis <- oddsratio(contingency_table)
print(odds_analysis)
$data
        No  Yes Total
0     2058  153  2211
1      839  102   941
Total 2897  255  3152
$measure
odds ratio with 95% C.I.
estimate lower upper
0 1.000000 NA NA
1 1.635794 1.254341 2.125278
$p.value
two-sided
midp.exact fisher.exact chi.square
0 NA NA NA
1 0.0003159313 0.0003407814 0.0002216321
$correction
[1] FALSE
attr(,"method")
[1] "median-unbiased estimate & mid-p exact CI"
# Rows of the table are arcus (exposure); mosaicplot puts rows on the x-axis
mosaicplot(contingency_table, color = c("darkred", "gold"),
           xlab = "Arcus (exposure)", ylab = "CHD")
Relation between CHD and Age Categories
c_table <- table(data$agec, data$chd69)
odds_ana <- oddsratio(c_table)
print(odds_ana)
$data
        No Yes Total
35-40  512  31   543
41-45 1036  55  1091
46-50  680  70   750
51-55  463  65   528
56-60  206  36   242
Total 2897 257  3154
$measure
odds ratio with 95% C.I.
estimate lower upper
35-40 1.000000 NA NA
41-45 0.874988 0.5594913 1.392239
46-50 1.694848 1.1025702 2.662486
51-55 2.310984 1.4905110 3.656761
56-60 2.879861 1.7331311 4.812690
$p.value
two-sided
midp.exact fisher.exact chi.square
35-40 NA NA NA
41-45 5.665956e-01 5.587536e-01 5.690688e-01
46-50 1.573609e-02 2.049340e-02 1.653286e-02
51-55 1.499986e-04 1.631541e-04 1.561175e-04
56-60 5.019449e-05 4.717556e-05 2.186623e-05
$correction
[1] FALSE
attr(,"method")
[1] "median-unbiased estimate & mid-p exact CI"
# Set seed for reproducibility
set.seed(123)
# Generate 100 random BMI categories
BMI <- factor(sample(c("Underweight", "Normal weight", "Overweight", "Obese"),
size = 100,
replace = TRUE))
# Generate 100 binary readmissions (0 or 1)
# Assuming some randomness; you can adjust probabilities for categories if needed
Readmissions <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.6, 0.4))
# Combine into a data frame
hypothetical_data <- data.frame(BMI = BMI, Readmissions = Readmissions)
# Display the first few rows of the dataset
head(hypothetical_data)
            BMI Readmissions
1    Overweight            0
2    Overweight            0
3    Overweight            0
4 Normal weight            1
5    Overweight            0
6 Normal weight            1
library(ggplot2)
ggplot(hypothetical_data, aes(x = BMI, fill = factor(Readmissions))) +
geom_bar(position = "fill") +
ylab("Proportion") +
xlab("BMI Category") +
ggtitle("Proportion of Readmissions by BMI Category") +
scale_fill_discrete(name = "Readmissions", labels = c("No", "Yes"))
# Convert BMI to a factor with appropriate reference level
hypothetical_data$BMI <- relevel(hypothetical_data$BMI, ref = "Normal weight")
# Logistic regression model
logit_model <- glm(Readmissions ~ BMI, data = hypothetical_data, family = binomial)
# Summary of the model
summary(logit_model)
Call:
glm(formula = Readmissions ~ BMI, family = binomial, data = hypothetical_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1542 0.3934 -0.392 0.695
BMIObese -0.2025 0.6306 -0.321 0.748
BMIOverweight -0.6444 0.5620 -1.146 0.252
BMIUnderweight -0.4336 0.5571 -0.778 0.436
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 132.81 on 99 degrees of freedom
Residual deviance: 131.35 on 96 degrees of freedom
AIC: 139.35
Number of Fisher Scoring iterations: 4
# Odds ratios and confidence intervals
exp(cbind(Odds_Ratio = coef(logit_model), confint(logit_model)))
Waiting for profiling to be done...
Odds_Ratio 2.5 % 97.5 %
(Intercept) 0.8571429 0.3896217 1.856692
BMIObese 0.8166667 0.2311126 2.803878
BMIOverweight 0.5250000 0.1702348 1.566756
BMIUnderweight 0.6481481 0.2136164 1.925463
If the chi-squared test or logistic regression shows a significant relationship, conduct post-hoc pairwise comparisons between BMI categories.
library(multcomp)
# Perform pairwise comparisons
pairwise <- glht(logit_model, linfct = mcp(BMI = "Tukey"))
summary(pairwise)
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: glm(formula = Readmissions ~ BMI, family = binomial, data = hypothetical_data)
Linear Hypotheses:
Estimate Std. Error z value Pr(>|z|)
Obese - Normal weight == 0 -0.2025 0.6306 -0.321 0.989
Overweight - Normal weight == 0 -0.6444 0.5620 -1.146 0.660
Underweight - Normal weight == 0 -0.4336 0.5571 -0.778 0.864
Overweight - Obese == 0 -0.4418 0.6356 -0.695 0.899
Underweight - Obese == 0 -0.2311 0.6312 -0.366 0.983
Underweight - Overweight == 0 0.2107 0.5627 0.374 0.982
(Adjusted p values reported -- single-step method)
library(ResourceSelection)
# Hosmer-Lemeshow test
hoslem.test(hypothetical_data$Readmissions, fitted(logit_model), g = 3)
Hosmer and Lemeshow goodness of fit (GOF) test
data: hypothetical_data$Readmissions, fitted(logit_model)
X-squared = 1.8412e-24, df = 1, p-value = 1
# ROC curve and AUC
library(pROC)
roc_curve <- roc(hypothetical_data$Readmissions, fitted(logit_model))
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(roc_curve)
auc(roc_curve)
Area under the curve: 0.5692
Use decision tree algorithms to explore the relationship in a non-linear way.
library(rpart)
library(rpart.plot)
tree_model <- rpart(Readmissions ~ BMI, data = hypothetical_data, method = "class")
rpart.plot(tree_model)
# Convert BMI to an ordered factor
hypothetical_data$BMI <- ordered(hypothetical_data$BMI,
levels = c("Underweight", "Normal weight", "Overweight", "Obese"))
# Fit the logistic regression model with ordered BMI
logit_model_ordered <- glm(Readmissions ~ BMI, data = hypothetical_data, family = binomial)
# Summary of the model
summary(logit_model_ordered)
Call:
glm(formula = Readmissions ~ BMI, family = binomial, data = hypothetical_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.474280 0.211289 -2.245 0.0248 *
BMI.L 0.010952 0.441678 0.025 0.9802
BMI.Q 0.004098 0.422577 0.010 0.9923
BMI.C 0.483926 0.402571 1.202 0.2293
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 132.81 on 99 degrees of freedom
Residual deviance: 131.35 on 96 degrees of freedom
AIC: 139.35
Number of Fisher Scoring iterations: 4
# Odds ratios and confidence intervals
exp(cbind(Odds_Ratio = coef(logit_model_ordered), confint(logit_model_ordered)))
Waiting for profiling to be done...
Odds_Ratio 2.5 % 97.5 %
(Intercept) 0.622333 0.4079083 0.9371228
BMI.L 1.011012 0.4203971 2.4111499
BMI.Q 1.004107 0.4357372 2.3016944
BMI.C 1.622432 0.7406156 3.6226329
library(ggplot2)
ggplot(hypothetical_data, aes(x = BMI, y = Readmissions)) +
stat_summary(fun = mean, geom = "point") + # Mean readmissions per BMI category
stat_summary(fun = mean, geom = "line", group = 1) +
xlab("BMI Category") +
ylab("Mean Readmissions") +
ggtitle("Relationship Between BMI and Readmissions")The absence of “last follow-up” data in your study introduces a challenge because survival analysis requires accurate information on whether a patient was still alive (censored) or had died (event occurred) at the time of data extraction (September–October 2024). Without this, there is uncertainty about the survival status of some patients. Key issues:
- Accurate classification of censoring:
In survival analysis, patients who have not experienced the event (death, in your study) by the end of the study are considered censored. Accurate censoring requires the last known follow-up date: the most recent date on which the patient was confirmed to be alive. If you don’t have this, you might incorrectly assume that the patient was alive until the date of data extraction, which can:
- Overestimate survival time for some censored patients.
- Introduce bias in your survival estimates and hazard ratios.
- Misclassification of events:
If you cannot confirm follow-up for patients who are not recorded as deceased:
- Some patients might have died but are incorrectly classified as alive (censored) at the time of data extraction.
- This misclassification affects the calculation of survival probabilities and distorts the relationship between predictors (e.g., smoking) and mortality.
- Survival bias:
The lack of last follow-up data assumes that all censored patients survived until the time of data extraction. This assumption may not hold, especially for patients who dropped out of follow-up earlier:
- If these patients had worse health outcomes (e.g., death shortly after transplant), the model might underestimate mortality.
Example:
Imagine a patient underwent a liver transplant in 2015 and has no recorded follow-up after 2019. Without knowing their last follow-up date:
You might assume the patient was alive in September 2024 (censored), but they could have died anytime between 2019 and 2024.
This leads to overestimating the survival time for this patient, which could skew the survival curve and hazard ratios.
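To see the size of the gap in this example, a quick date calculation helps (all dates below are hypothetical stand-ins for the scenario above):
transplant <- as.Date("2015-06-01")  # hypothetical transplant date
last_seen  <- as.Date("2019-06-01")  # hypothetical last recorded follow-up
extraction <- as.Date("2024-09-30")  # data extraction window
as.numeric(last_seen - transplant)   # follow-up actually observed: 1461 days (~4 years)
as.numeric(extraction - transplant)  # survival time if assumed alive: 3409 days (~9.3 years)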
How This Affects Analysis
Kaplan-Meier Survival Curves:
Kaplan-Meier analysis relies on precise classification of events and censoring.
If censoring dates are inaccurate, survival estimates will be biased.
Cox Proportional Hazards Model:
- Misclassification of events (censoring vs. death) will result in incorrect hazard ratio estimates for smoking and other predictors.
Strategies to Address the Issue
Sensitivity Analysis:
Test how results change under different assumptions about patients with unknown last follow-up (see the sketch after this list):
- Best-case scenario: Assume all censored patients were alive at the time of data extraction.
- Worst-case scenario: Assume all censored patients died at the midpoint between their last known transplant date and data extraction.
Impute Last Follow-Up Dates:
- If feasible, estimate follow-up times based on available data (e.g., average follow-up duration for similar patients).
Collaborate with Clinicians or Data Collectors:
- Attempt to retrieve follow-up data for as many patients as possible, especially those with incomplete records.
Transparent Reporting:
- Clearly state the limitation in your study’s methodology section and discuss its potential impact on your findings.
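A minimal sketch of the worst-case sensitivity analysis described in the list above, assuming a data frame with time_to_event (days) and mortality (1 = death) columns like the simulated dataset used later in this section; the reclassification rule is illustrative, not prescriptive.
library(survival)
worst_case <- hypothetical_data                  # copy of the analysis dataset
censored   <- worst_case$mortality == 0
# Worst case: assume censored patients died at the midpoint of their observed time
worst_case$mortality[censored]     <- 1
worst_case$time_to_event[censored] <- worst_case$time_to_event[censored] / 2
km_worst <- survfit(Surv(time_to_event, mortality) ~ smoking, data = worst_case)
# Compare km_worst with the best-case fit (all censored assumed alive) to gauge
# how sensitive the survival curves are to the unknown follow-up assumption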
Abstract Example:
Background: The impact of smoking on post-liver transplant mortality remains poorly understood. This study investigates the association between smoking status (non-smoker, former smoker, current smoker) and mortality in liver transplant recipients from January 2013 to January 2023.
Methods: This retrospective cohort study included all adult liver transplant recipients during the study period. Smoking status was categorized into three levels. Mortality was the primary outcome, analyzed using Kaplan-Meier survival curves and Cox proportional hazards models. Patients without recorded deaths by the time of data extraction (September–October 2024) were assumed to be alive, with follow-up censored at the date of extraction.
Results: Among X patients included in the study, Y% were non-smokers, Z% were former smokers, and W% were current smokers. Mortality was observed in P% of the cohort. Kaplan-Meier survival analysis demonstrated [specific trends]. In Cox regression, current smoking was associated with a [hazard ratio] compared to non-smokers (95% CI: [X-Y], p = [value]).
Conclusion: Current smoking may increase post-liver transplant mortality. However, the assumption that all censored patients were alive at the time of data extraction could bias survival estimates, warranting caution in interpreting results. Future studies should incorporate complete follow-up data to validate these findings.
Key Features of Transparent Reporting:
State Assumptions Clearly: Explicitly mention the assumption about censored patients being alive at the time of data extraction.
Acknowledge Potential Bias: Highlight the limitation and its implications for interpreting results.
Emphasize Need for Future Work: Suggest ways to address the limitation in future research (e.g., complete follow-up data).
# Load necessary libraries
library(survival)
library(survminer)
# Hypothetical dataset
set.seed(42)
n <- 150
hypothetical_data <- data.frame(
smoking = factor(sample(c("Non-smoker", "Former Smoker", "Current Smoker"), n, replace = TRUE)),
time_to_event = rexp(n, rate = 1/2000), # Simulate survival times (in days)
mortality = sample(c(0, 1), n, replace = TRUE, prob = c(0.7, 0.3)) # Simulate event status
)
# Attach so columns can be referenced directly in Surv() below
attach(hypothetical_data)
# Create a survival object
surv_object <- Surv(time_to_event, mortality)
# Fit Kaplan-Meier model stratified by smoking status
km_fit <- survfit(surv_object ~ smoking, data = hypothetical_data)
# Summary of the Kaplan-Meier model
summary(km_fit)
Call: survfit(formula = surv_object ~ smoking, data = hypothetical_data)
smoking=Current Smoker
time n.risk n.event survival std.err lower 95% CI upper 95% CI
150 34 1 0.971 0.0290 0.9154 1.000
412 31 1 0.939 0.0417 0.8611 1.000
960 26 1 0.903 0.0535 0.8042 1.000
1594 25 1 0.867 0.0624 0.7530 0.998
1842 23 1 0.829 0.0701 0.7027 0.979
1858 22 1 0.792 0.0764 0.6552 0.956
1896 21 1 0.754 0.0815 0.6099 0.932
2757 12 1 0.691 0.0959 0.5265 0.907
3218 8 1 0.605 0.1165 0.4145 0.882
6095 3 1 0.403 0.1820 0.1664 0.977
6639 2 1 0.202 0.1691 0.0389 1.000
smoking=Former Smoker
time n.risk n.event survival std.err lower 95% CI upper 95% CI
3.3 61 1 0.984 0.0163 0.952 1.000
56.6 59 1 0.967 0.0230 0.923 1.000
119.3 58 1 0.950 0.0280 0.897 1.000
150.4 57 1 0.934 0.0321 0.873 0.999
171.7 55 1 0.917 0.0357 0.849 0.989
237.1 53 1 0.899 0.0390 0.826 0.979
398.9 51 1 0.882 0.0420 0.803 0.968
434.9 50 1 0.864 0.0447 0.781 0.956
502.7 48 1 0.846 0.0473 0.758 0.944
599.5 45 1 0.827 0.0498 0.735 0.931
1195.0 36 1 0.804 0.0535 0.706 0.916
1469.5 34 1 0.781 0.0569 0.677 0.901
1724.1 29 1 0.754 0.0610 0.643 0.883
2077.6 21 1 0.718 0.0678 0.596 0.864
2080.0 20 1 0.682 0.0733 0.552 0.842
2580.5 17 1 0.642 0.0792 0.504 0.817
3099.0 14 1 0.596 0.0858 0.449 0.790
5626.3 6 1 0.497 0.1155 0.315 0.783
6136.4 5 1 0.397 0.1282 0.211 0.748
smoking=Non-smoker
time n.risk n.event survival std.err lower 95% CI upper 95% CI
189 50 1 0.980 0.0198 0.942 1.000
238 49 1 0.960 0.0277 0.907 1.000
399 46 1 0.939 0.0341 0.875 1.000
597 44 1 0.918 0.0394 0.844 0.998
808 38 1 0.894 0.0452 0.809 0.987
1314 31 1 0.865 0.0521 0.768 0.973
1460 30 1 0.836 0.0578 0.730 0.957
1605 27 1 0.805 0.0634 0.690 0.939
1774 25 1 0.773 0.0686 0.649 0.920
3184 20 1 0.734 0.0752 0.601 0.898
4155 16 1 0.688 0.0834 0.543 0.873
4491 13 1 0.635 0.0922 0.478 0.844
7178 1 1 0.000 NaN NA NA
# Plot Kaplan-Meier curves
ggsurvplot(
km_fit,
data = hypothetical_data,
pval = TRUE, # Add log-rank test p-value
conf.int = TRUE, # Confidence intervals
risk.table = TRUE, # Add risk table below the plot
xlab = "Time (days)",
ylab = "Survival Probability",
legend.labs = c("Non-smoker", "Former Smoker", "Current Smoker"),
title = "Kaplan-Meier Survival Analysis by Smoking Status"
)
# Summary of Kaplan-Meier fit
km_summary <- summary(km_fit)
# Median survival times for each smoking group
median_survival <- summary(km_fit)$table[, "median"]
# Print median survival times
print(median_survival)
smoking=Current Smoker  smoking=Former Smoker     smoking=Non-smoker
              6095.335               5626.296               7177.657
The log-rank test (survdiff) only indicates whether there is a statistically significant difference in survival among the groups overall (a global test); it does not specify which groups differ. A sketch of the global test follows; to determine which groups differ significantly, perform pairwise comparisons between groups.
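For reference, the global test itself is a one-liner on the simulated data above:
# Global log-rank test across all three smoking groups
survdiff(Surv(time_to_event, mortality) ~ smoking, data = hypothetical_data)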
pairwise_results <- pairwise_survdiff(
Surv(time_to_event, mortality) ~ smoking,
data = hypothetical_data,
p.adjust.method = "bonferroni" # Options: "holm", "BH", "fdr", etc.
)
# Display adjusted p-values
print(pairwise_results)
Pairwise comparisons using Log-Rank test
data: hypothetical_data and smoking
Current Smoker Former Smoker
Former Smoker 1 -
Non-smoker 1 1
P value adjustment method: bonferroni
If the overall log-rank test is significant (e.g., p = 0.02), report the specific pairwise comparisons where significant differences are observed (e.g., “Current smokers had significantly worse survival compared to non-smokers and former smokers, with adjusted p-values of 0.01 and 0.02, respectively”). Emphasize the clinical relevance and magnitude of differences, such as hazard ratios or median survival times.
Code to Test the Proportional Hazards Assumption
- Fit a Cox Proportional Hazards Model: Even if the focus is on the log-rank test, you need a Cox model to test the PH assumption.
# Fit the Cox proportional hazards model
cox_model <- coxph(Surv(time_to_event, mortality) ~ smoking, data = hypothetical_data)
# Test proportional hazards assumption
ph_test <- cox.zph(cox_model)
# Display results
print(ph_test)
        chisq df    p
smoking 3.06 2 0.22
GLOBAL 3.06 2 0.22
# Plot scaled Schoenfeld residuals for visual inspection
plot(ph_test)
Output and Interpretation
smoking: Tests the PH assumption for the smoking variable.
- p > 0.05: No evidence of violation for smoking levels.
- p < 0.05: Evidence of a violation, suggesting non-proportional hazards for smoking.
GLOBAL: Tests the overall PH assumption for the entire model.
- p > 0.05: No evidence of a global violation.
- p < 0.05: Indicates at least one variable violates the PH assumption.
Visual Inspection:
The plot from plot(ph_test) shows scaled Schoenfeld residuals over time. If the residuals fluctuate randomly around 0, the PH assumption holds; systematic trends or patterns suggest a violation.
What to Do If the PH Assumption is Violated?
- Stratify the Model: Stratify the Cox model by the variable violating the PH assumption (e.g., smoking) if the goal is to adjust for it without estimating its effect:
cox_model_strat <- coxph(Surv(time_to_event, mortality) ~ strata(smoking), data = hypothetical_data)
- Include Time-Dependent Covariates: Model non-proportional hazards by including an interaction term between the variable and time, as in the sketch below:
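A minimal sketch of one way to do this with coxph()'s time-transform facility; the log(t) transform is an assumption chosen for illustration, not the only option:
# Interaction of smoking with log(time) via a time-transform (tt) term
cox_model_tt <- coxph(
  Surv(time_to_event, mortality) ~ smoking + tt(smoking),
  data = hypothetical_data,
  tt = function(x, t, ...) model.matrix(~ x)[, -1] * log(t)
)
summary(cox_model_tt)  # tt() rows show how each smoking effect changes with time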
- Switch to Time-Varying Analysis: If the assumption fails globally, consider flexible models such as Aalen’s additive model (sketched below) or stratified log-rank tests.
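A sketch of Aalen’s additive model using survival::aareg() on the same simulated data:
# Aalen's additive hazards model: covariate effects are allowed to vary over time
aalen_fit <- aareg(Surv(time_to_event, mortality) ~ smoking, data = hypothetical_data)
summary(aalen_fit)  # test statistics for each term
plot(aalen_fit)     # cumulative regression functions over time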
Summary
If PH Holds: Proceed with the log-rank test and Cox regression.
If PH Fails: Use stratified Cox models, time-varying covariates, or alternative survival analysis methods to adjust for non-proportional hazards.