library(gtsummary)
library(datasets)
library(tidyverse)
library(survival)
library(survminer)
library(car)
library(haven)
library(rms)
library(Hmisc)
library(splines)
library(purrr)
library(tibble)
library(plotly)
Evaluating the Effects of D-Penicillamine on Survival of Patients with Primary Biliary Cirrhosis: A Survival Analysis
Biostatistics for Health Researchers III Course Code: COMH7066A
Aim of the study
To evaluate the Effect of DPCA on survival in Patients with Primary Biliary Cirrhosis.
Motivation for use of the technique
Since the aim of this study is to assess the effect of DPCA treatment on overall survival while adjusting for relevant covariates, the Cox proportional hazards (PH) model is a suitable choice. Unlike non-parametric methods that primarily compare survival between groups (e.g., log-rank tests), the Cox model allows us to quantify the effect of DPCA treatment on survival while controlling for potential confounding variables.
In our cohort of patients with primary biliary cirrhosis (PBC), approximately 60% were still alive at the end of the follow-up period. The Cox model accommodates this right-censored data naturally, ensuring that information from individuals who have not yet experienced the event (death) by the study’s end is still appropriately included in the analysis.
Furthermore, the Cox PH model provides insight beyond simple group comparisons. It allows us to examine how the hazard (risk of death) evolves over time for individuals, giving a more nuanced understanding of disease progression and how DPCA treatment may influence mortality risk at different follow-up intervals.
Finally, the Cox model is especially useful in real-world clinical studies like ours, where not all patients begin treatment or follow-up at the same time. It can include patients recruited at different points during the study period, allowing for a comprehensive and slightly limited (compared to parametric models) analysis of treatment effects despite staggered entry.
Theoretical Motivation for Using the Cox Proportional Hazards Model
The Cox proportional hazards (PH) model is a semi-parametric model used to investigate the association between the survival time of subjects and one or more predictor variables. It is especially useful in clinical studies because of its flexibility and ability to handle censored data.
Why the Cox regression?
Semi-parametric nature: The Cox model does not assume a specific distribution for the baseline hazard function \(h_0(t)\), which makes it less prone to model misspecification.
Model formulation: \(h(t \mid \mathbf{X}) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\)where:
- \(h(t|\mathbf{X})\) is the hazard at time \(t\) for an individual with covariate vector \(\mathbf{X}\),
- \(h_0(t)\) is the unspecified baseline hazard,
- \(\beta_1, \dots, \beta_p\) are coefficients estimating the log-relative hazard.
Hazard ratios: The model estimates the effect of covariates via hazard ratios, where \(\exp(\beta_j)\) represents the multiplicative effect on the hazard for a one-unit increase in \(X_j\), holding other variables constant.
Handles censoring: The Cox model accommodates right-censored observations using partial likelihood estimation, which leverages the relative order of events without needing exact survival times for all individuals.
Partial Likelihood
Let \(t_1 < t_2 < \dots < t_k\) be the ordered distinct event times, and let \(R(t_j)\) denote the risk set at time \(t_j\), i.e., all individuals still under observation just before time \(t_j\). For each event time \(t_j\) with corresponding covariate vector \(\mathbf{X}_j\), the partial likelihood is:
\((\boldsymbol{\beta}) = \prod_{j=1}^{k} \frac{\exp(\boldsymbol{\beta}^\top \mathbf{X}_j)}{\sum_{l \in R(t_j)} \exp(\boldsymbol{\beta}^\top \mathbf{X}_l)}\)
The log-partial likelihood is maximized to estimate \(\boldsymbol{\beta}\).
Application to Our Study
In our clinical study of primary biliary cirrhosis (PBC), the goal is to assess how treatment with DPCA affects survival, while adjusting for covariates such as sex, edema, age, and biomarkers like copper and bilirubin levels. The Cox PH model is particularly suitable to this dataset because:
- It allows for the inclusion of time-varying covariates (e.g., edema and cholesterol),
- It accommodates right-censored data, which is relevant since most patients were still alive at the end of follow-up,
- It enables fair inclusion of patients recruited at different times.
LIST OF REFERENCES
1. Ahmed FE, Vos PW, Holbert D. Modeling survival in colon cancer: a methodological review. Mol Cancer. 2007;6(1):15.
2. Başar E. Aalen’s Additive, Cox Proportional Hazards and the Cox-Aalen Model: Application to Kidney Transplant Data. Sains Malays. 2017 Mar 31;46(3):469–76.
3. Mandal S, Wang S, Sinha S. Analysis of linear transformation models with covariate measurement error and interval censoring. Stat Med. 2019 Oct 15;38(23):4642–55.
4. Abd ElHafeez S, D’Arrigo G, Leonardis D, Fusaro M, Tripepi G, Roumeliotis S. Methods to Analyze Time‐to‐Event Data: The Cox Regression Analysis. Georgakilas A, editor. Oxid Med Cell Longev [Internet]. 2021 Jan [cited 2025 Jul 30];2021(1). Available from: https://onlinelibrary.wiley.com/doi/10.1155/2021/1302811
5. Dickson RE, Grambsch PM, Fleming TR, Fisher LD, Langworthy A. Prognosis in primary biliary cirrhosis: Model for decision making. Hepatology. 1989 Jul;10(1):1–7.
6. Shao Y, Ye Z, Zhang Z. Exact test and exact confidence interval for the Cox model. Stat Med. 2024 Oct 15;43(23):4499–518.
7. Ratnaningsih DJ, Saefuddin A, Kurnia A, Mangku IW. Stratified-extended cox model in survival modeling of non-proportional hazard. IOP Conf Ser Earth Environ Sci. 2019 Jul 1;299(1):012023.
8. Faruk A, Syahputra E, Cahyono ES, Aisyah SM. A non-proportional hazards model with time-dependent variables for under-five mortality in Indonesia. J Phys Conf Ser. 2019 Jul 1;1282(1):012011.
9. Zhang Z, Reinikainen J, Adeleke KA, Pieterse ME, Groothuis-Oudshoorn CGM. Time-varying covariates and coefficients in Cox regression models. Ann Transl Med. 2018 Apr;6(7):121–121.
Analyses of the study data applying the Semi-Parametric Cox regression
setwd("C:/Users/VUSI/Downloads")
<- read_dta("pbc_stata_group_bio3.dta")) (Vusi
<- Vusi %>%
bio3 mutate(
status = as_factor(status),
rx = as_factor(rx),
sex = as_factor(sex), # For 'sex' if 0/1 are categories but no labels
asictes = as_factor(asictes),
spiders = as_factor(spiders),
hepatom = as_factor(hepatom),
edema = as_factor(edema)
)
bio3
<- bio3 %>%
bio3_1 ::mutate(
dplyrsex = dplyr::case_when(
== 1 ~ "Male",
sex == 0 ~ "Female",
sex TRUE ~ as.factor(sex)
),edema = dplyr::case_when(
== 1 ~ "Yes",
edema == 0 ~ "No",
edema TRUE ~ as.factor(edema)))
bio3_1
# A tibble: 312 × 23
number status rx sex asictes hepatom spiders edema bilirubin cholest
<dbl> <fct> <fct> <chr> <fct> <fct> <fct> <chr> <dbl> <dbl>
1 1 Dead Placebo Male Yes Yes Yes Yes 14.5 261
2 2 Censored Placebo Male No Yes Yes No 1.10 302
3 3 Dead Placebo Fema… No No No Yes 1.40 176
4 4 Dead Placebo Male No Yes Yes Yes 1.80 244
5 5 Censored DPCA Male No Yes Yes No 3.40 279
6 6 Dead DPCA Male No Yes No No 0.800 248
7 7 Censored DPCA Male No Yes No No 1 322
8 8 Dead DPCA Male No No No No 0.300 280
9 9 Dead Placebo Male No No Yes No 3.20 562
10 10 Dead DPCA Male Yes No Yes Yes 12.6 200
# ℹ 302 more rows
# ℹ 13 more variables: albumin <dbl>, copper <dbl>, alkphos <dbl>, sgot <dbl>,
# trigli <dbl>, platel <dbl>, prothrom <dbl>, histol <dbl>, age <dbl>,
# years <dbl>, logbili <dbl>, logalbu <dbl>, logprot <dbl>
saveRDS(bio3_1, "bio3_2.rds")
%>%
bio3_1 ::select(status, rx, sex, asictes, hepatom, spiders, edema,bilirubin, cholest, albumin, copper, alkphos, sgot, trigli, platel, prothrom, age, years, logbili, logalbu, logprot) %>%
dplyrtbl_summary(
by = rx,
missing = "always",
missing_text = "Missing" ,
statistic = list(
all_of(c("bilirubin", "cholest", "albumin", "copper", "alkphos", "sgot", "trigli", "platel", "prothrom", "age", "years", "logbili", "logalbu", "logprot")) ~ "{mean} ({sd})",
all_of(c("status", "sex", "asictes", "hepatom", "spiders", "edema")) ~ "{n} / {N} ({p}%)"
)%>%
) add_stat_label() %>% add_overall() %>%
bold_labels() %>%
modify_header(
~ "**Characteristics**",
label all_stat_cols() ~ "**{level}**"
%>%
) modify_spanning_header(
all_stat_cols()~ "**Treatment group**"
%>%
) as_gt() %>%
::tab_header(
gttitle = gt::md("**Table 1. Demographic Characteristics**"),
subtitle = gt::md("")
%>%
) ::tab_source_note(source_note = "Data updated August by Vusi, 2025") gt
Table 1. Demographic Characteristics | |||
Characteristics |
Treatment group
|
||
---|---|---|---|
Overall1 | Placebo | DPCA | |
Alive/Dead, n / N (%) | |||
Censored | 187 / 312 (60%) | 93 / 158 (59%) | 94 / 154 (61%) |
Dead | 125 / 312 (40%) | 65 / 158 (41%) | 60 / 154 (39%) |
Missing | 0 | 0 | 0 |
sex, n / N (%) | |||
Female | 36 / 312 (12%) | 21 / 158 (13%) | 15 / 154 (9.7%) |
Male | 276 / 312 (88%) | 137 / 158 (87%) | 139 / 154 (90%) |
Missing | 0 | 0 | 0 |
Presence of Asictes, n / N (%) | 24 / 312 (7.7%) | 14 / 158 (8.9%) | 10 / 154 (6.5%) |
Missing | 0 | 0 | 0 |
Hepatomegaly, n / N (%) | 160 / 312 (51%) | 73 / 158 (46%) | 87 / 154 (56%) |
Missing | 0 | 0 | 0 |
Presence of Spiders, n / N (%) | 90 / 312 (29%) | 45 / 158 (28%) | 45 / 154 (29%) |
Missing | 0 | 0 | 0 |
edema, n / N (%) | 49 / 312 (16%) | 26 / 158 (16%) | 23 / 154 (15%) |
Missing | 0 | 0 | 0 |
Serum Bilirubin in mg/dl, Mean (SD) | 3.3 (4.5) | 2.9 (3.6) | 3.6 (5.3) |
Missing | 0 | 0 | 0 |
Serum Cholesterol in mg/dl, Mean (SD) | 370 (232) | 365 (210) | 374 (252) |
Missing | 28 | 18 | 10 |
Albumin in gm/dl, Mean (SD) | 3.52 (0.42) | 3.52 (0.44) | 3.52 (0.40) |
Missing | 0 | 0 | 0 |
Urine Copper in ug/day, Mean (SD) | 98 (86) | 98 (91) | 98 (80) |
Missing | 2 | 1 | 1 |
Alkaline Phosphatase in U/liter, Mean (SD) | 1,983 (2,140) | 2,021 (2,183) | 1,943 (2,102) |
Missing | 0 | 0 | 0 |
SGOT in U/ml, Mean (SD) | 123 (57) | 120 (55) | 125 (59) |
Missing | 0 | 0 | 0 |
Triglicerides in mg/dl, Mean (SD) | 125 (65) | 124 (72) | 125 (59) |
Missing | 30 | 19 | 11 |
Platelets per cubic ml / 1000, Mean (SD) | 262 (96) | 259 (100) | 265 (91) |
Missing | 4 | 2 | 2 |
Prothrombin time in seconds, Mean (SD) | 10.73 (1.00) | 10.65 (0.85) | 10.80 (1.14) |
Missing | 0 | 0 | 0 |
Age (years), Mean (SD) | 50 (11) | 51 (11) | 49 (10) |
Missing | 0 | 0 | 0 |
Time to Death (in Years), Mean (SD) | 5.5 (3.1) | 5.5 (3.0) | 5.5 (3.2) |
Missing | 0 | 0 | 0 |
Bilirubin (log), Mean (SD) | 0.58 (1.03) | 0.54 (0.97) | 0.61 (1.10) |
Missing | 0 | 0 | 0 |
Albumin (log), Mean (SD) | 1.25 (0.13) | 1.25 (0.13) | 1.25 (0.12) |
Missing | 0 | 0 | 0 |
Prothom time (log), Mean (SD) | 2.37 (0.09) | 2.36 (0.08) | 2.37 (0.10) |
Missing | 0 | 0 | 0 |
Data updated August by Vusi, 2025 | |||
1 n / N (%); Mean (SD) |
Full multivariable Cox regression model
colnames(bio3_1)
[1] "number" "status" "rx" "sex" "asictes" "hepatom"
[7] "spiders" "edema" "bilirubin" "cholest" "albumin" "copper"
[13] "alkphos" "sgot" "trigli" "platel" "prothrom" "histol"
[19] "age" "years" "logbili" "logalbu" "logprot"
$event <- ifelse(bio3_1$status == "Dead", 1, 0)
bio3_1$years <- as.numeric(bio3_1$years)
bio3_1
<- coxph(
model_survival Surv(time = years, event = event) ~ . - number - status - event - years,
data = bio3_1
)summary(model_survival)
Call:
coxph(formula = Surv(time = years, event = event) ~ . - number -
status - event - years, data = bio3_1)
n= 276, number of events= 111
(36 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
rxDPCA -2.058e-02 9.796e-01 2.207e-01 -0.093 0.925725
sexMale -2.177e-01 8.043e-01 3.281e-01 -0.664 0.506971
asictesYes 2.505e-01 1.285e+00 3.842e-01 0.652 0.514471
hepatomYes -1.296e-01 8.784e-01 2.574e-01 -0.504 0.614551
spidersYes 1.021e-01 1.107e+00 2.395e-01 0.426 0.669975
edemaYes 6.272e-01 1.872e+00 2.856e-01 2.196 0.028110 *
bilirubin -1.252e-02 9.876e-01 3.999e-02 -0.313 0.754229
cholest 1.611e-04 1.000e+00 4.522e-04 0.356 0.721680
albumin 1.083e+00 2.955e+00 1.898e+00 0.571 0.568062
copper 1.828e-03 1.002e+00 1.284e-03 1.424 0.154427
alkphos 2.722e-06 1.000e+00 4.087e-05 0.067 0.946899
sgot 2.597e-03 1.003e+00 2.062e-03 1.260 0.207767
trigli -2.096e-03 9.979e-01 1.379e-03 -1.520 0.128614
platel 8.200e-04 1.001e+00 1.209e-03 0.678 0.497647
prothrom 5.540e-01 1.740e+00 1.231e+00 0.450 0.652744
histol 3.876e-01 1.473e+00 1.841e-01 2.105 0.035254 *
age 2.449e-02 1.025e+00 1.129e-02 2.168 0.030148 *
logbili 8.385e-01 2.313e+00 2.414e-01 3.474 0.000513 ***
logalbu -5.663e+00 3.470e-03 5.962e+00 -0.950 0.342176
logprot -4.082e+00 1.687e-02 1.450e+01 -0.281 0.778343
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
rxDPCA 0.97963 1.0208 6.356e-01 1.510e+00
sexMale 0.80433 1.2433 4.228e-01 1.530e+00
asictesYes 1.28464 0.7784 6.049e-01 2.728e+00
hepatomYes 0.87843 1.1384 5.304e-01 1.455e+00
spidersYes 1.10745 0.9030 6.926e-01 1.771e+00
edemaYes 1.87239 0.5341 1.070e+00 3.277e+00
bilirubin 0.98756 1.0126 9.131e-01 1.068e+00
cholest 1.00016 0.9998 9.993e-01 1.001e+00
albumin 2.95454 0.3385 7.166e-02 1.218e+02
copper 1.00183 0.9982 9.993e-01 1.004e+00
alkphos 1.00000 1.0000 9.999e-01 1.000e+00
sgot 1.00260 0.9974 9.986e-01 1.007e+00
trigli 0.99791 1.0021 9.952e-01 1.001e+00
platel 1.00082 0.9992 9.985e-01 1.003e+00
prothrom 1.74025 0.5746 1.558e-01 1.944e+01
histol 1.47347 0.6787 1.027e+00 2.114e+00
age 1.02479 0.9758 1.002e+00 1.048e+00
logbili 2.31280 0.4324 1.441e+00 3.712e+00
logalbu 0.00347 288.1518 2.919e-08 4.126e+02
logprot 0.01687 59.2610 7.643e-15 3.726e+10
Concordance= 0.852 (se = 0.018 )
Likelihood ratio test= 178.6 on 20 df, p=<2e-16
Wald test = 169.5 on 20 df, p=<2e-16
Score (logrank) test = 285.4 on 20 df, p=<2e-16
names(bio3_1)
[1] "number" "status" "rx" "sex" "asictes" "hepatom"
[7] "spiders" "edema" "bilirubin" "cholest" "albumin" "copper"
[13] "alkphos" "sgot" "trigli" "platel" "prothrom" "histol"
[19] "age" "years" "logbili" "logalbu" "logprot" "event"
Multi-collinearity for the Cox model
vif(model_survival)
rxDPCA sexMale asictesYes hepatomYes spidersYes edemaYes bilirubin
1.325506 1.801613 1.947398 1.517800 1.401001 1.673546 4.877026
cholest albumin copper alkphos sgot trigli platel
1.822286 71.688146 1.946742 1.448359 1.782706 1.741475 1.684452
prothrom histol age logbili logalbu logprot
174.404706 1.955595 1.662151 5.099236 70.090287 179.384959
In the final multivariable Cox proportional hazards model, the variables bilirubin, prothrombin, and albumin were excluded due to their high collinearity with their respective log-transformed counterparts. Instead, the log-transformed versions, namely, logbili, logprot, and logalbu, were retained in the model to address skewness and better meet model assumptions.
<- bio3_1 %>% dplyr::select(-prothrom, -albumin, -bilirubin ,-number)) (bio_model
# A tibble: 312 × 20
status rx sex asictes hepatom spiders edema cholest copper alkphos sgot
<fct> <fct> <chr> <fct> <fct> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 Dead Plac… Male Yes Yes Yes Yes 261 156 1718 138.
2 Censo… Plac… Male No Yes Yes No 302 54 7395. 114.
3 Dead Plac… Fema… No No No Yes 176 210 516 96.1
4 Dead Plac… Male No Yes Yes Yes 244 64 6122. 60.6
5 Censo… DPCA Male No Yes Yes No 279 143 671 113.
6 Dead DPCA Male No Yes No No 248 50 944 93
7 Censo… DPCA Male No Yes No No 322 52 824 60.5
8 Dead DPCA Male No No No No 280 52 4651. 28.4
9 Dead Plac… Male No No Yes No 562 79 2276 144.
10 Dead DPCA Male Yes No Yes Yes 200 140 918 147.
# ℹ 302 more rows
# ℹ 9 more variables: trigli <dbl>, platel <dbl>, histol <dbl>, age <dbl>,
# years <dbl>, logbili <dbl>, logalbu <dbl>, logprot <dbl>, event <dbl>
$event <- as.numeric(bio_model$event)
bio_model$histol <- as.factor(bio_model$histol)
bio_model$edema <- as.factor(bio_model$edema) bio_model
names(bio_model)
[1] "status" "rx" "sex" "asictes" "hepatom" "spiders" "edema"
[8] "cholest" "copper" "alkphos" "sgot" "trigli" "platel" "histol"
[15] "age" "years" "logbili" "logalbu" "logprot" "event"
SECOND MODEL AFTER VIF
<- coxph(Surv(years, event) ~ rx + sex + asictes + hepatom + spiders + edema + histol+
model_survival1 + copper + alkphos + sgot + trigli + platel + age + logbili + logalbu + logprot, data = bio_model)
cholest
summary(model_survival1)
Call:
coxph(formula = Surv(years, event) ~ rx + sex + asictes + hepatom +
spiders + edema + histol + cholest + copper + alkphos + sgot +
trigli + platel + age + logbili + logalbu + logprot, data = bio_model)
n= 276, number of events= 111
(36 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
rxDPCA -2.749e-02 9.729e-01 2.127e-01 -0.129 0.8972
sexMale -2.588e-01 7.719e-01 3.230e-01 -0.801 0.4230
asictesYes 2.199e-01 1.246e+00 3.884e-01 0.566 0.5713
hepatomYes -1.392e-01 8.701e-01 2.586e-01 -0.538 0.5904
spidersYes 1.007e-01 1.106e+00 2.385e-01 0.422 0.6727
edemaYes 5.691e-01 1.767e+00 2.808e-01 2.027 0.0427 *
histol2 1.262e+00 3.532e+00 1.058e+00 1.193 0.2328
histol3 1.503e+00 4.497e+00 1.034e+00 1.454 0.1459
histol4 1.852e+00 6.373e+00 1.057e+00 1.752 0.0797 .
cholest 1.028e-04 1.000e+00 4.524e-04 0.227 0.8202
copper 1.776e-03 1.002e+00 1.283e-03 1.384 0.1664
alkphos -4.378e-06 1.000e+00 4.165e-05 -0.105 0.9163
sgot 2.621e-03 1.003e+00 2.057e-03 1.274 0.2026
trigli -2.012e-03 9.980e-01 1.340e-03 -1.501 0.1334
platel 9.215e-04 1.001e+00 1.201e-03 0.767 0.4428
age 2.402e-02 1.024e+00 1.122e-02 2.141 0.0323 *
logbili 7.718e-01 2.164e+00 1.589e-01 4.857 1.19e-06 ***
logalbu -2.372e+00 9.326e-02 9.586e-01 -2.475 0.0133 *
logprot 2.658e+00 1.427e+01 1.374e+00 1.935 0.0530 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
rxDPCA 0.97288 1.02787 0.64118 1.4762
sexMale 0.77194 1.29544 0.40984 1.4540
asictesYes 1.24596 0.80260 0.58199 2.6674
hepatomYes 0.87008 1.14932 0.52414 1.4443
spidersYes 1.10599 0.90417 0.69305 1.7650
edemaYes 1.76673 0.56602 1.01889 3.0635
histol2 3.53227 0.28310 0.44441 28.0755
histol3 4.49720 0.22236 0.59272 34.1222
histol4 6.37340 0.15690 0.80295 50.5890
cholest 1.00010 0.99990 0.99922 1.0010
copper 1.00178 0.99823 0.99926 1.0043
alkphos 1.00000 1.00000 0.99991 1.0001
sgot 1.00262 0.99738 0.99859 1.0067
trigli 0.99799 1.00201 0.99537 1.0006
platel 1.00092 0.99908 0.99857 1.0033
age 1.02431 0.97627 1.00203 1.0471
logbili 2.16374 0.46216 1.58470 2.9544
logalbu 0.09326 10.72314 0.01425 0.6104
logprot 14.26768 0.07009 0.96596 210.7411
Concordance= 0.852 (se = 0.018 )
Likelihood ratio test= 178.9 on 19 df, p=<2e-16
Wald test = 169.8 on 19 df, p=<2e-16
Score (logrank) test = 257.2 on 19 df, p=<2e-16
Checking Assumptions
Testing the proportional hazard assumption
<- cox.zph(model_survival1)
test_ph
test_ph
chisq df p
rx 0.4480 1 0.503
sex 0.0898 1 0.764
asictes 3.1194 1 0.077
hepatom 0.1496 1 0.699
spiders 0.1618 1 0.687
edema 6.5612 1 0.010
histol 3.5058 3 0.320
cholest 8.8267 1 0.003
copper 0.2416 1 0.623
alkphos 0.4429 1 0.506
sgot 0.7737 1 0.379
trigli 2.1733 1 0.140
platel 1.4567 1 0.227
age 1.3418 1 0.247
logbili 1.1767 1 0.278
logalbu 0.3060 1 0.580
logprot 6.2798 1 0.012
GLOBAL 24.0981 19 0.192
plot(test_ph,
resid = TRUE,
se = TRUE,
col = 4:6,
lwd = 2)
The Schoenfeld residual tests were used to evaluate the proportional hazards assumption for each covariate in the Cox model. Most covariates showed no evidence of violating the PH assumption, with non-significant p-values (e.g., rx, sex, hepatom, spiders, age, logbili). However, edema (p = 0.010), cholest (p = 0.003), and logprot (p = 0.012) showed significant departures, indicating potential violations of the PH assumption. The global test (χ² = 24.10, df = 19, p = 0.192) suggests that, overall, the PH assumption holds reasonably well for the model.
Linearity assumption testing using Restricted Cubic Splines (RCS)
# Define continuous variables
<- c("cholest", "copper", "alkphos", "sgot",
cont_vars "trigli", "platel", "age",
"logbili", "logalbu", "logprot")
# Create survival object
<- Surv(time = bio_model$years, event = bio_model$event)
surv_obj
# Set up data distribution object (needed by rms)
<- datadist(bio_model)
dd options(datadist = 'dd')
# Loop through variables
for (var in cont_vars) {
<- as.formula(paste0("surv_obj ~ rcs(", var, ", 3)"))
formula
cat("\nLinearity test for:", var, "\n")
tryCatch({
<- cph(
model
formula,data = bio_model,
x = TRUE, y = TRUE,
surv = TRUE,
maxiter = 100000, # Increase iterations
eps = 1e-6 # Convergence tolerance
)print(anova(model)) # Wald test for non-linearity
error = function(e) {
}, message("⚠️ Model for ", var, " did not converge: ", e$message)
}) }
Linearity test for: cholest
Wald Statistics Response: surv_obj
Factor Chi-Square d.f. P
cholest 13.34 2 0.0013
Nonlinear 0.02 1 0.8769
TOTAL 13.34 2 0.0013
Linearity test for: copper
Wald Statistics Response: surv_obj
Factor Chi-Square d.f. P
copper 67.98 2 <.0001
Nonlinear 15.01 1 1e-04
TOTAL 67.98 2 <.0001
Linearity test for: alkphos
Wald Statistics Response: surv_obj
Factor Chi-Square d.f. P
alkphos 12.65 2 0.0018
Nonlinear 10.67 1 0.0011
TOTAL 12.65 2 0.0018
Linearity test for: sgot
Wald Statistics Response: surv_obj
Factor Chi-Square d.f. P
sgot 28.89 2 <.0001
Nonlinear 8.80 1 0.003
TOTAL 28.89 2 <.0001
Linearity test for: trigli
Wald Statistics Response: surv_obj
Factor Chi-Square d.f. P
trigli 14.13 2 0.0009
Nonlinear 1.21 1 0.2713
TOTAL 14.13 2 0.0009
Linearity test for: platel
Wald Statistics Response: surv_obj
Factor Chi-Square d.f. P
platel 20.81 2 <.0001
Nonlinear 5.96 1 0.0147
TOTAL 20.81 2 <.0001
Linearity test for: age
Wald Statistics Response: surv_obj
Factor Chi-Square d.f. P
age 20.43 2 <.0001
Nonlinear 0.01 1 0.9352
TOTAL 20.43 2 <.0001
Linearity test for: logbili
Wald Statistics Response: surv_obj
Factor Chi-Square d.f. P
logbili 127.72 2 <.0001
Nonlinear 0.60 1 0.4367
TOTAL 127.72 2 <.0001
Linearity test for: logalbu
Wald Statistics Response: surv_obj
Factor Chi-Square d.f. P
logalbu 86.24 2 <.0001
Nonlinear 0.05 1 0.8232
TOTAL 86.24 2 <.0001
Linearity test for: logprot
Wald Statistics Response: surv_obj
Factor Chi-Square d.f. P
logprot 45.86 2 <.0001
Nonlinear 7.36 1 0.0067
TOTAL 45.86 2 <.0001
Martingale residuals to support the Restricted Cubic Spline results
# Define your model formula
<- Surv(years, event) ~ rx + sex + asictes + hepatom + spiders + edema + histol +
cox_formula + copper + alkphos + sgot + trigli + platel + age + logbili + logalbu + logprot
cholest
# Extract complete cases used in the model
<- all.vars(cox_formula)
model_vars <- bio_model[complete.cases(bio_model[, model_vars]), ]
bio_model_cc
# Fit model on complete cases
<- coxph(cox_formula, data = bio_model_cc)
model_survival1
# Get martingale residuals
<- residuals(model_survival1, type = "martingale")
martingale_resid
# List of continuous variables
<- c("cholest", "copper", "alkphos", "sgot", "trigli", "platel", "age", "logbili", "logalbu", "logprot")
cont_vars
# Plot
par(mfrow = c(3, 4)) # Adjust as needed
for (var in cont_vars) {
<- bio_model_cc[[var]]
x_vals plot(x_vals, martingale_resid,
xlab = var,
ylab = "Martingale Residuals",
main = paste("Residuals vs", var),
pch = 20)
lines(lowess(x_vals, martingale_resid), col = "blue", lwd = 2)
}
To assess whether the continuous covariates in our Cox proportional hazards model satisfied the linearity assumption, we conducted tests using restricted cubic splines and Wald statistics for each variable. The null hypothesis in this context is that the relationship between the continuous predictor and the log hazard is linear. The test output reports both the overall association (factor p-value) and the nonlinearity component. A p-value < 0.05 for the nonlinearity test suggests evidence of departure from linearity.
Significant evidence of nonlinearity (p < 0.05) was found for copper, alkphos, sgot, platelets,, and logprot. We believe that these variables may benefit from transformation or modeling using spline terms to better capture their relationships with the hazard function. Variables such as cholest, trigli, age, and logbili did not show evidence of nonlinearity and were retained as linear terms.
In addition to statistical tests, we examined Martingale residual plots (see figure above) for each continuous covariate. These plots visually assess the functional form of the covariates in the model. Deviations from a horizontal band centered around zero further supported evidence of nonlinearity for variables such as copper, alkphos, sgot, and logprot. These visual assessments reinforced our decision to use spline functions or transformations for those variables.
Analysis of deviance to assess predictor variables that improve the Cox regression model - Variable selection
anova(coxph(Surv(years, event) ~ rx + sex + asictes + hepatom + spiders + edema + histol+
(+ copper + alkphos + sgot + trigli + platel + age + logbili + logalbu + logprot, data = bio_model))) cholest
Analysis of Deviance Table
Cox model: response is Surv(years, event)
Terms added sequentially (first to last)
loglik Chisq Df Pr(>|Chi|)
NULL -550.19
rx -549.99 0.4059 1 0.5240359
sex -548.04 3.8957 1 0.0484086 *
asictes -525.46 45.1622 1 1.814e-11 ***
hepatom -515.32 20.2704 1 6.723e-06 ***
spiders -511.93 6.7793 1 0.0092219 **
edema -506.27 11.3336 1 0.0007612 ***
histol -502.20 8.1297 3 0.0434061 *
cholest -495.54 13.3284 1 0.0002614 ***
copper -488.37 14.3382 1 0.0001527 ***
alkphos -488.21 0.3247 1 0.5687736
sgot -485.80 4.8068 1 0.0283479 *
trigli -485.66 0.2929 1 0.5883700
platel -485.62 0.0755 1 0.7835382
age -479.36 12.5139 1 0.0004039 ***
logbili -464.97 28.7746 1 8.131e-08 ***
logalbu -462.58 4.7944 1 0.0285528 *
logprot -460.75 3.6545 1 0.0559169 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We used the Analysis of Deviance Table as a guide for variable selection in building our Cox proportional hazards model. By adding covariates sequentially and examining their contributions to model fit through changes in log-likelihood and Chi-squared statistics, we identified which variables significantly improved the model. Variables such as ascites, hepatomegaly, edema, log-bilirubin, age, and copper showed strong evidence of association with survival (p < 0.001), supporting their inclusion in the final model. Additional variables like sex, spiders, SGOT, cholesterol, and log-albumin were also retained based on their statistical significance (p < 0.05). While log-protein showed borderline significance (p = 0.056), it was considered for inclusion due to its potential clinical relevance. Variables such as alkaline phosphatase, triglycerides, and platelet count did not meaningfully contribute to model fit and were excluded. This approach helped ensure a parsimonious model by retaining only covariates with substantial explanatory power.
Constructing the final model that retains all variables that improve the model (Parsimonious model).
$edema_tve <- ifelse(bio_model$edema == "Yes", 1, 0) * log(pmax(bio_model$years, 0.01))
bio_model# Generate RCS terms for copper and sgot (3 knots each)
$rcs_copper <- rcspline.eval(bio_model$copper, nk = 3, inclx = TRUE)
bio_model$rcs_sgot <- rcspline.eval(bio_model$sgot, nk = 3, inclx = TRUE)
bio_model
<- coxph(
tbl1 Surv(years, event) ~
+ sex + asictes + hepatom + spiders + histol +
rx + rcs_sgot +
rcs_copper + logbili + logalbu +
age + tt(cholest),
edema_tve data = bio_model,
tt = function(x, t, ...) x * log(pmax(t, 0.01)), # Avoid log(0)
ties = "efron",
control = coxph.control(
iter.max = 5000,
eps = 1e-09,
toler.chol = 1e-10
)%>%
)tbl_regression(exponentiate = TRUE) %>%
bold_p() %>%
add_global_p()
<- coxph(
model_output Surv(years, event) ~
+ sex + asictes + hepatom + spiders + histol +
rx + rcs_sgot +
rcs_copper + logbili + logalbu +
age + tt(cholest),
edema_tve data = bio_model,
tt = function(x, t, ...) x * log(pmax(t, 0.01)), # Avoid log(0)
ties = "efron",
control = coxph.control(
iter.max = 5000,
eps = 1e-09,
toler.chol = 1e-10
)
)summary(model_output)
Call:
coxph(formula = Surv(years, event) ~ rx + sex + asictes + hepatom +
spiders + histol + rcs_copper + rcs_sgot + age + logbili +
logalbu + edema_tve + tt(cholest), data = bio_model, control = coxph.control(iter.max = 5000,
eps = 1e-09, toler.chol = 1e-10), ties = "efron", tt = function(x,
t, ...) x * log(pmax(t, 0.01)))
n= 282, number of events= 113
(30 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
rxDPCA -0.0125861 0.9874928 0.2077890 -0.061 0.95170
sexMale 0.0755612 1.0784893 0.3084180 0.245 0.80646
asictesYes 0.5907990 1.8054303 0.3727170 1.585 0.11294
hepatomYes 0.0130064 1.0130914 0.2528276 0.051 0.95897
spidersYes 0.2262123 1.2538418 0.2301370 0.983 0.32563
histol2 0.9467543 2.5773308 1.0519693 0.900 0.36813
histol3 1.4029502 4.0671812 1.0338774 1.357 0.17479
histol4 1.6680314 5.3017206 1.0595319 1.574 0.11542
rcs_copperx 0.0112154 1.0112785 0.0056931 1.970 0.04884 *
rcs_copper -0.0142585 0.9858427 0.0081700 -1.745 0.08095 .
rcs_sgotx 0.0056794 1.0056955 0.0063734 0.891 0.37287
rcs_sgot -0.0062334 0.9937860 0.0067233 -0.927 0.35386
age 0.0261604 1.0265056 0.0108025 2.422 0.01545 *
logbili 0.6962297 2.0061745 0.1452381 4.794 1.64e-06 ***
logalbu -2.4120892 0.0896278 0.8633178 -2.794 0.00521 **
edema_tve -0.8181390 0.4412520 0.2654395 -3.082 0.00205 **
tt(cholest) 0.0006124 1.0006126 0.0003131 1.956 0.05046 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
rxDPCA 0.98749 1.0127 0.6571 1.4839
sexMale 1.07849 0.9272 0.5892 1.9740
asictesYes 1.80543 0.5539 0.8696 3.7483
hepatomYes 1.01309 0.9871 0.6172 1.6629
spidersYes 1.25384 0.7975 0.7986 1.9685
histol2 2.57733 0.3880 0.3279 20.2585
histol3 4.06718 0.2459 0.5361 30.8554
histol4 5.30172 0.1886 0.6646 42.2953
rcs_copperx 1.01128 0.9888 1.0001 1.0226
rcs_copper 0.98584 1.0144 0.9702 1.0018
rcs_sgotx 1.00570 0.9943 0.9932 1.0183
rcs_sgot 0.99379 1.0063 0.9808 1.0070
age 1.02651 0.9742 1.0050 1.0485
logbili 2.00617 0.4985 1.5092 2.6668
logalbu 0.08963 11.1572 0.0165 0.4867
edema_tve 0.44125 2.2663 0.2623 0.7424
tt(cholest) 1.00061 0.9994 1.0000 1.0012
Concordance= 0.854 (se = 0.018 )
Likelihood ratio test= 194.2 on 17 df, p=<2e-16
Wald test = 184.6 on 17 df, p=<2e-16
Score (logrank) test = 276.4 on 17 df, p=<2e-16
# List of all predictors
<- c("rx", "sex", "asictes", "hepatom", "spiders", "edema_tve", "histol","cholest", "rcs_copper", "rcs_sgot", "age", "logbili", "logalbu")
predictors
# Specify continuous predictors for spline transformation
<- c("cholest", "rcs_copper", "rcs_sgot")
continuous_vars
# Fit unadjusted Cox models
<- lapply(predictors, function(var) {
unadjusted_models <- if (var %in% continuous_vars) {
formula # For RCS variables, use all components of the spline
as.formula(paste0("Surv(years, event) ~ ", var))
else {
} # For non-RCS variables (including binary/categorical)
as.formula(paste0("Surv(years, event) ~ ", var))
}coxph(formula, data = bio_model)
})
# Name each model
names(unadjusted_models) <- gsub("^rcs_", "", predictors) # Clean names by removing 'rcs_' prefix
# Generate regression tables
<- lapply(unadjusted_models, function(model) {
tbl2 tbl_regression(model,
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3)) %>%
bold_p() %>%
add_global_p()
})
<- tbl_stack(tbl2) %>%
tbl2_combined modify_header(label = "**Variable**") %>%
modify_caption("**Adjusted and Unadjusted Cox Regression Models**")
<- tbl_merge(
tbl_merge_1 tbls = list(tbl1, tbl2_combined), # use combined table here
tab_spanner = c("**Adjusted**", "**Unadjusted**")
%>%
) bold_labels() %>%
modify_spanning_header(all_stat_cols() ~ "**Treatment group**") %>%
as_gt() %>%
::tab_header(
gttitle = gt::md("**Table 2. Treatment Effects of DPCA on Primiry Biliary Cirrhosis (PBC)**"),
subtitle = gt::md("BioGee4 Analysis")
%>%
) ::tab_source_note(
gt::md(
gt"Concordance= 0.854 (se = 0.018 )
Likelihood ratio test= 194.2 on 17 df, p=<2e-16
Wald test = 184.6 on 17 df, p=<2e-16
Score (logrank) test = 276.4 on 17 df, p=<2e-16"
)) tbl_merge_1
Table 2. Treatment Effects of DPCA on Primiry Biliary Cirrhosis (PBC) | ||||||
---|---|---|---|---|---|---|
BioGee4 Analysis | ||||||
Characteristic |
Adjusted
|
Unadjusted
|
||||
HR | 95% CI | p-value | HR | 95% CI | p-value | |
treatment | >0.9 | 0.749 | ||||
Placebo | — | — | — | — | ||
DPCA | 0.99 | 0.66, 1.48 | 0.94 | 0.66, 1.34 | ||
sex | 0.8 | 0.052 | ||||
Female | — | — | — | — | ||
Male | 1.08 | 0.59, 1.97 | 0.62 | 0.39, 0.98 | ||
Presence of Asictes | 0.11 | <0.001 | ||||
No | — | — | — | — | ||
Yes | 1.81 | 0.87, 3.75 | 7.79 | 4.89, 12.4 | ||
Hepatomegaly | >0.9 | <0.001 | ||||
No | — | — | — | — | ||
Yes | 1.01 | 0.62, 1.66 | 3.27 | 2.22, 4.82 | ||
Presence of Spiders | 0.3 | <0.001 | ||||
No | — | — | — | — | ||
Yes | 1.25 | 0.80, 1.97 | 2.65 | 1.85, 3.80 | ||
histol | 0.2 | <0.001 | ||||
1 | — | — | — | — | ||
2 | 2.58 | 0.33, 20.3 | 4.99 | 0.66, 37.6 | ||
3 | 4.07 | 0.54, 30.9 | 8.58 | 1.18, 62.4 | ||
4 | 5.30 | 0.66, 42.3 | 21.4 | 2.96, 154 | ||
rcs_copper | 0.11 | <0.001 | ||||
rcs_copperx | 1.01 | 1.00, 1.02 | 1.02 | 1.02, 1.03 | ||
rcs_copper | 0.99 | 0.97, 1.00 | 0.97 | 0.96, 0.99 | ||
rcs_sgot | 0.7 | <0.001 | ||||
rcs_sgotx | 1.01 | 0.99, 1.02 | 1.02 | 1.01, 1.03 | ||
rcs_sgot | 0.99 | 0.98, 1.01 | 0.98 | 0.97, 0.99 | ||
Age (years) | 1.03 | 1.01, 1.05 | 0.015 | 1.04 | 1.02, 1.06 | <0.001 |
Bilirubin (log) | 2.01 | 1.51, 2.67 | <0.001 | 2.96 | 2.47, 3.55 | <0.001 |
Albumin (log) | 0.09 | 0.02, 0.49 | 0.005 | 0.00 | 0.00, 0.01 | <0.001 |
edema_tve | 0.44 | 0.26, 0.74 | 0.002 | 0.54 | 0.31, 0.95 | 0.010 |
Serum Cholesterol in mg/dl | 1.00 | 1.00, 1.00 | 0.050 | |||
Serum Cholesterol in mg/dl | 1.00 | 1.00, 1.00 | 0.001 | |||
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | ||||||
Concordance= 0.854 (se = 0.018 ) Likelihood ratio test= 194.2 on 17 df, p=<2e-16 Wald test = 184.6 on 17 df, p=<2e-16 Score (logrank) test = 276.4 on 17 df, p=<2e-16 |
Table 2 presents the effects of DPCA on PBC survival outcomes based on both unadjusted and adjusted Cox proportional hazards models. After adjusting for covariates, DPCA treatment was not significantly associated with overall survival (HR = 0.99, 95% CI: 0.66–1.48, p > 0.9), suggesting no treatment effect. Similarly, variables such as sex, hepatomegaly, presence of spiders, histological subtype, and the time-varying effect of serum cholesterol were not significant in the adjusted model despite being significant in the unadjusted analysis. In contrast, several predictors remained significantly associated with survival after adjustment. Age was positively associated with hazard (HR = 1.03, 95% CI: 1.01–1.05, p = 0.015), while elevated bilirubin (log-transformed) was strongly linked to increased mortality (HR = 2.01, 95% CI: 1.51–2.67, p < 0.001). Higher log-transformed albumin levels were significantly protective (HR = 0.09, 95% CI: 0.02–0.49, p = 0.005), and the presence of edema (time-varying) was associated with a lower hazard (HR = 0.44, 95% CI: 0.26–0.74, p = 0.002). The restricted cubic spline terms for copper approached significance, while those for SGOT were not significant in the adjusted model. Although the unadjusted results suggested strong associations for ascites, hepatomegaly, and histological subtype (all p < 0.001), these were attenuated after adjustment, indicating potential confounding. The final model demonstrated good predictive performance with a concordance index of 0.854 (SE = 0.018), and all global model tests (likelihood ratio, Wald, and score/logrank tests) were highly significant (p < 2e-16), supporting the model’s robustness.
Forestplot
library(survival)
library(ggplot2)
library(broom) # for tidy()
library(dplyr)
library(forcats) # for fct_reorder
# Fit the Cox model
<- coxph(
ggmodel Surv(years, event) ~
+ sex + asictes + hepatom + spiders + histol +
rx + rcs_sgot +
rcs_copper + logbili + logalbu +
age + tt(cholest),
edema_tve data = bio_model,
tt = function(x, t, ...) x * log(pmax(t, 0.01)), # Avoid log(0)
ties = "efron",
control = coxph.control(
iter.max = 5000,
eps = 1e-09,
toler.chol = 1e-10
)
)
# Tidy model output
<- tidy(ggmodel, exponentiate = TRUE, conf.int = TRUE)
cox_df
# Optional: make labels more readable
<- cox_df %>%
cox_df mutate(
term = gsub("`", "", term),
label = fct_reorder(term, estimate),
sig = ifelse(conf.low > 1 | conf.high < 1, "Yes", "No")
)
# Plot
ggplot(cox_df, aes(x = label, y = estimate)) +
geom_point(aes(color = sig), size = 3) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high, color = sig), width = 0.2) +
geom_hline(yintercept = 1, linetype = "dashed", color = "black") +
coord_flip() +
scale_y_log10() +
scale_color_manual(values = c("Yes" = "red", "No" = "turquoise")) +
labs(
title = "Treatment Effects of DPCA on Primiry Biliary Cirrhosis",
x = "Variables",
y = "Hazard Ratio (95% CI)",
color = "Statistically Significant"
+
) theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
Kaplan Meier plots
library(survival)
library(survminer)
library(dplyr)
# Define your survival object
<- Surv(time = bio_model$years, event = bio_model$event)
surv_obj
# Variables to summarize
<- c("edema_tve", "age", "logbili", "logalbu", "copper")
km_vars
# Create list to store KM summaries
<- lapply(km_vars, function(var) {
km_summary_list cat("Processing KM summary for:", var, "\n")
<- bio_model %>%
data_km select(years, event, all_of(var)) %>%
filter(!is.na(.data[[var]]))
# Stratify variable
if (is.numeric(data_km[[var]])) {
<- median(data_km[[var]], na.rm = TRUE)
median_val $strata <- ifelse(data_km[[var]] > median_val,
data_kmpaste(var, "> median"),
paste(var, "<= median"))
else {
} $strata <- as.character(data_km[[var]])
data_km
}
# Check that we have at least two strata
if (length(unique(data_km$strata)) < 2) {
warning(paste("Skipped", var, ": only one group present"))
return(NULL)
}
# Fit Cox model with strata
<- coxph(Surv(years, event) ~ strata, data = data_km)
fit return(summary(fit))
})
Processing KM summary for: edema_tve
Processing KM summary for: age
Processing KM summary for: logbili
Processing KM summary for: logalbu
Processing KM summary for: copper
# Name the list by the variables
names(km_summary_list) <- km_vars
# Use the variable of interest
<- "rx"
var
# Prepare data
<- bio_model %>%
km_data ::select(years, event, all_of(var)) %>%
dplyrfilter(!is.na(.data[[var]]))
# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
<- km_data %>%
km_data mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
"Placebo", "DPCA"))
else {
} <- km_data %>%
km_data mutate(group = as.character(.data[[var]]))
}
# Fit KM model
<- survfit(Surv(years, event) ~ group, data = km_data)
fit
# Plot KM curve
<- ggsurvplot(
ggsurv
fit,data = km_data,
size = 1,
table.height = 0.2,
risk.table = T,
pval = T,
conf.int = T,
surv.median.line = "hv",
title = "Kaplan meier for Treatment",
legend.labs = sort(unique(km_data$group)),
xlab = "Time (years)",
ylab = "Survival Probability",
palette = c("#E69", "#56B4E9"),
ggtheme = theme_gray()
)
Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
ggsurv
Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
library(survival)
library(survminer)
library(dplyr)
# Use the variable of interest
<- "edema_tve"
var
# Prepare data
<- bio_model %>%
km_data ::select(years, event, all_of(var)) %>%
dplyrfilter(!is.na(.data[[var]]))
# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
<- km_data %>%
km_data mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
"Yes", "No"))
else {
} <- km_data %>%
km_data mutate(group = as.character(.data[[var]]))
}
# Fit KM model
<- survfit(Surv(years, event) ~ group, data = km_data)
fit1
# Plot KM curve
<- ggsurvplot(
ggsurv1
fit1,data = km_data,
size = 1,
table.height = 0.2,
risk.table = T,
pval = T,
conf.int = T,
surv.median.line = "hv",
title = "Kaplan meier for time varying Edema",
legend.labs = sort(unique(km_data$group)),
xlab = "Time (years)",
ylab = "Survival Probability",
palette = c("#E69", "#56B4E9"),
ggtheme = theme_gray()
)
Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
ggsurv1
Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
# Use the variable of interest
<- "age"
var
# Prepare data
<- bio_model %>%
km_data ::select(years, event, all_of(var)) %>%
dplyrfilter(!is.na(.data[[var]]))
# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
<- km_data %>%
km_data mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
"old", "young"))
else {
} <- km_data %>%
km_data mutate(group = as.character(.data[[var]]))
}
# Fit KM model
<- survfit(Surv(years, event) ~ group, data = km_data)
fit2
# Plot KM curve
<- ggsurvplot(
ggsurv2
fit2,data = km_data,
size = 1,
table.height = 0.2,
risk.table = T,
pval = T,
conf.int = T,
surv.median.line = "hv",
title = "Kaplan meier for Age",
legend.labs = sort(unique(km_data$group)),
xlab = "Time (years)",
ylab = "Survival Probability",
palette = c("#E69", "#56B4E9"),
ggtheme = theme_gray()
)
Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
ggsurv2
Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
# Use the variable of interest
<- "logbili"
var
# Prepare data
<- bio_model %>%
km_data ::select(years, event, all_of(var)) %>%
dplyrfilter(!is.na(.data[[var]]))
# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
<- km_data %>%
km_data mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
"High", "Low"))
else {
} <- km_data %>%
km_data mutate(group = as.character(.data[[var]]))
}
# Fit KM model
<- survfit(Surv(years, event) ~ group, data = km_data)
fit3
# Plot KM curve
<- ggsurvplot(
ggsurv3
fit3,data = km_data,
size = 1,
table.height = 0.2,
risk.table = T,
pval = T,
conf.int = T,
surv.median.line = "hv",
title = "Kaplan meier for Bilirubin(log)",
legend.labs = sort(unique(km_data$group)),
xlab = "Time (years)",
ylab = "Survival Probability",
palette = c("#E69", "#56B4E9"),
ggtheme = theme_gray()
)
ggsurv3
# Use the variable of interest
<- "logalbu"
var
# Prepare data
<- bio_model %>%
km_data ::select(years, event, all_of(var)) %>%
dplyrfilter(!is.na(.data[[var]]))
# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
<- km_data %>%
km_data mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
"High", "Low"))
else {
} <- km_data %>%
km_data mutate(group = as.character(.data[[var]]))
}
# Fit KM model
<- survfit(Surv(years, event) ~ group, data = km_data)
fit4
# Plot KM curve
<- ggsurvplot(
ggsurv4
fit4,data = km_data,
size = 1,
table.height = 0.2,
risk.table = T,
pval = T,
conf.int = T,
surv.median.line = "hv",
title = "Kaplan meier for Albumin(log)",
legend.labs = sort(unique(km_data$group)),
xlab = "Time (years)",
ylab = "Survival Probability",
palette = c("#E69", "#56B4E9"),
ggtheme = theme_gray()
)
Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
ggsurv4
Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
<- "copper"
var
# Prepare data
<- bio_model %>%
km_data ::select(years, event, all_of(var)) %>%
dplyrfilter(!is.na(.data[[var]]))
# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
<- km_data %>%
km_data mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
"High", "Low"))
else {
} <- km_data %>%
km_data mutate(group = as.character(.data[[var]]))
}
# Fit KM model
<- survfit(Surv(years, event) ~ group, data = km_data)
fit5
# Plot KM curve
<- ggsurvplot(
ggsurv5
fit5,data = km_data,
size = 1,
table.height = 0.2,
risk.table = T,
pval = T,
conf.int = T,
surv.median.line = "hv",
title = "Kaplan meier for Copper",
legend.labs = sort(unique(km_data$group)),
xlab = "Time (years)",
ylab = "Survival Probability",
palette = c("#E69", "#56B4E9"),
ggtheme = theme_gray()
)
ggsurv5
The six Kaplan-Meier plots reveal a consistent pattern regarding prognostic factors for survival. The comparison between DPCA treatment and placebo shows no significant survival difference (p = 0.75), indicating that the treatment itself does not influence survival outcomes. However, the remaining plots, covering age, bilirubin, albumin, edema, and copper levels, demonstrate strong and statistically significant associations with survival, with p-values mostly below 0.0001. Patients who are younger, have lower bilirubin and copper levels, higher albumin levels, and no edema tend to have significantly longer survival times and more favorable prognoses.
Different and available approaches
In survival analysis, a variety of modeling approaches are available, each with specific assumptions and limitations.
Parametric models
Parametric models include the Gompertz, exponential, and Weibull distributions among many others. The Gompertz model assumes that the hazard function changes exponentially over time, either increasing or decreasing depending on the sign of the shape parameter. This makes it particularly suitable for modeling adult mortality or aging-related processes. It assumes independent observations, non-informative censoring, and a log-linear relationship between covariates and the hazard function. However, its limitations include poor fit for non-monotonic hazard patterns and reduced interpretability compared to simpler models like the exponential or Weibull. Additionally, it is less commonly implemented, so diagnostic tools may be more limited.
The exponential model assumes a constant hazard rate over time and is most appropriate for data where the risk of the event remains unchanged throughout the follow-up. It also assumes independent observations, non-informative censoring, and a log-linear relationship between covariates and hazard. While mathematically simple, this model often poorly fits real-world data where hazard rates change over time, making its assumptions overly restrictive.
The Weibull model is more flexible and generalizes the exponential distribution. Its shape parameter determines whether the hazard increases \((p>1p > 1p>1)\) or decreases \((p<1p < 1p<1)\) over time, while \(p=1p = 1p=1\) reduces it to the exponential case. Like the other parametric models, it assumes non-informative censoring, independent observations, and a log-linear effect of covariates on hazard. However, it still cannot accommodate non-monotonic hazard functions and can lead to poor model fit if the true hazard pattern is not well captured.
Although not a survival model per se, Poisson regression is sometimes used in event count data where the outcome is the number of events over time. It assumes that counts follow a Poisson distribution with the mean equal to the variance (equidispersion), events occur randomly and independently, and covariates have a log-linear relationship with the expected count. However, it is not appropriate for binary time-to-event outcomes like death or survival without careful transformation. It also performs poorly in the presence of overdispersion (variance greater than the mean) or excessive zeros, in which case models like the negative binomial or zero-inflated Poisson are preferred.
Non-parametric models
Among non-parametric methods, the log-rank test is commonly used to compare survival curves between two or more groups. It assumes independent survival times, non-informative censoring, and proportional hazards across time. While it is a robust method, it does not adjust for covariates, cannot estimate effect sizes like hazard ratios, and has reduced sensitivity when survival curves cross.
Lastly, actuarial life table methods are used when data are grouped into fixed time intervals. These methods assume censoring occurs uniformly across intervals and that survival in one interval is conditional only on surviving the previous one. They are best suited for large samples and regularly spaced follow-up. However, they lack the precision of Kaplan–Meier methods, especially when exact event times are available, and may obscure hazard patterns due to interval grouping.
Unanswered question
Does the hazard ratio from a Cox model accurately reflect the overall treatment effect, or are there limitations to its interpretability—even when the model assumptions are met?
Even when the proportional hazards (PH) assumption is satisfied, the hazard ratio (HR) from a Cox model is not a fully reliable estimand for quantifying treatment effect. This is because the HR is a local, instantaneous measure that reflects the relative risk of the event at any given moment in time but does not capture the cumulative experience of subjects over the follow-up period. Specifically, the HR lacks a direct interpretation as a population-level summary of the treatment effect; it does not account for the absolute timing or probability of events, nor does it integrate over the full distribution of survival times. In contrast to estimands like the difference in survival probabilities or the restricted mean survival time (RMST), the HR does not summarize what happens to individuals after a particular timepoint, making it potentially misleading in settings where understanding long-term outcomes or average survival is important. Therefore, even under ideal model conditions, reliance on the hazard ratio alone can obscure the practical magnitude and relevance of the treatment effect.