Evaluating the Effects of D-Penicillamine on Survival of Patients with Primary Biliary Cirrhosis: A Survival Analysis

Biostatistics for Health Researchers III Course Code: COMH7066A

Authors

Affiliation

Vusumuzi Mabasa

Witwatersrand University

Bohlokwa Ntsoane

Witwatersrand University

Amanda Xaba

Witwatersrand University

Tshegofatso Kgobane

Witwatersrand University

Unathi Noraqa

Witwatersrand University

Moyahabo Rampya

Witwatersrand University

Mojaki Nyabela

Witwatersrand University

Aim of the study

To evaluate the Effect of DPCA on survival in Patients with Primary Biliary Cirrhosis.

Motivation for use of the technique

Since the aim of this study is to assess the effect of DPCA treatment on overall survival while adjusting for relevant covariates, the Cox proportional hazards (PH) model is a suitable choice. Unlike non-parametric methods that primarily compare survival between groups (e.g., log-rank tests), the Cox model allows us to quantify the effect of DPCA treatment on survival while controlling for potential confounding variables.

In our cohort of patients with primary biliary cirrhosis (PBC), approximately 60% were still alive at the end of the follow-up period. The Cox model accommodates this right-censored data naturally, ensuring that information from individuals who have not yet experienced the event (death) by the study’s end is still appropriately included in the analysis.

Furthermore, the Cox PH model provides insight beyond simple group comparisons. It allows us to examine how the hazard (risk of death) evolves over time for individuals, giving a more nuanced understanding of disease progression and how DPCA treatment may influence mortality risk at different follow-up intervals.

Finally, the Cox model is especially useful in real-world clinical studies like ours, where not all patients begin treatment or follow-up at the same time. It can include patients recruited at different points during the study period, allowing for a comprehensive and slightly limited (compared to parametric models) analysis of treatment effects despite staggered entry.

Theoretical Motivation for Using the Cox Proportional Hazards Model

The Cox proportional hazards (PH) model is a semi-parametric model used to investigate the association between the survival time of subjects and one or more predictor variables. It is especially useful in clinical studies because of its flexibility and ability to handle censored data.

Why the Cox regression?

Semi-parametric nature: The Cox model does not assume a specific distribution for the baseline hazard function \(h_0(t)\), which makes it less prone to model misspecification.
Model formulation: \(h(t \mid \mathbf{X}) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\)where:

\(h(t|\mathbf{X})\) is the hazard at time \(t\) for an individual with covariate vector \(\mathbf{X}\),
\(h_0(t)\) is the unspecified baseline hazard,
\(\beta_1, \dots, \beta_p\) are coefficients estimating the log-relative hazard.

Hazard ratios: The model estimates the effect of covariates via hazard ratios, where \(\exp(\beta_j)\) represents the multiplicative effect on the hazard for a one-unit increase in \(X_j\), holding other variables constant.
Handles censoring: The Cox model accommodates right-censored observations using partial likelihood estimation, which leverages the relative order of events without needing exact survival times for all individuals.

Partial Likelihood

Let \(t_1 < t_2 < \dots < t_k\) be the ordered distinct event times, and let \(R(t_j)\) denote the risk set at time \(t_j\), i.e., all individuals still under observation just before time \(t_j\). For each event time \(t_j\) with corresponding covariate vector \(\mathbf{X}_j\), the partial likelihood is:

\((\boldsymbol{\beta}) = \prod_{j=1}^{k} \frac{\exp(\boldsymbol{\beta}^\top \mathbf{X}_j)}{\sum_{l \in R(t_j)} \exp(\boldsymbol{\beta}^\top \mathbf{X}_l)}\)

The log-partial likelihood is maximized to estimate \(\boldsymbol{\beta}\).

Application to Our Study

In our clinical study of primary biliary cirrhosis (PBC), the goal is to assess how treatment with DPCA affects survival, while adjusting for covariates such as sex, edema, age, and biomarkers like copper and bilirubin levels. The Cox PH model is particularly suitable to this dataset because:

It allows for the inclusion of time-varying covariates (e.g., edema and cholesterol),
It accommodates right-censored data, which is relevant since most patients were still alive at the end of follow-up,
It enables fair inclusion of patients recruited at different times.

LIST OF REFERENCES

1. Ahmed FE, Vos PW, Holbert D. Modeling survival in colon cancer: a methodological review. Mol Cancer. 2007;6(1):15.

2. Başar E. Aalen’s Additive, Cox Proportional Hazards and the Cox-Aalen Model: Application to Kidney Transplant Data. Sains Malays. 2017 Mar 31;46(3):469–76.

3. Mandal S, Wang S, Sinha S. Analysis of linear transformation models with covariate measurement error and interval censoring. Stat Med. 2019 Oct 15;38(23):4642–55.

4. Abd ElHafeez S, D’Arrigo G, Leonardis D, Fusaro M, Tripepi G, Roumeliotis S. Methods to Analyze Time‐to‐Event Data: The Cox Regression Analysis. Georgakilas A, editor. Oxid Med Cell Longev [Internet]. 2021 Jan [cited 2025 Jul 30];2021(1). Available from: https://onlinelibrary.wiley.com/doi/10.1155/2021/1302811

5. Dickson RE, Grambsch PM, Fleming TR, Fisher LD, Langworthy A. Prognosis in primary biliary cirrhosis: Model for decision making. Hepatology. 1989 Jul;10(1):1–7.

6. Shao Y, Ye Z, Zhang Z. Exact test and exact confidence interval for the Cox model. Stat Med. 2024 Oct 15;43(23):4499–518.

7. Ratnaningsih DJ, Saefuddin A, Kurnia A, Mangku IW. Stratified-extended cox model in survival modeling of non-proportional hazard. IOP Conf Ser Earth Environ Sci. 2019 Jul 1;299(1):012023.

8. Faruk A, Syahputra E, Cahyono ES, Aisyah SM. A non-proportional hazards model with time-dependent variables for under-five mortality in Indonesia. J Phys Conf Ser. 2019 Jul 1;1282(1):012011.

9. Zhang Z, Reinikainen J, Adeleke KA, Pieterse ME, Groothuis-Oudshoorn CGM. Time-varying covariates and coefficients in Cox regression models. Ann Transl Med. 2018 Apr;6(7):121–121.

Analyses of the study data applying the Semi-Parametric Cox regression

library(gtsummary)
library(datasets)
library(tidyverse)
library(survival)
library(survminer)
library(car) 
library(haven)
library(rms)
library(Hmisc)
library(splines)
library(purrr)
library(tibble)
library(plotly)

setwd("C:/Users/VUSI/Downloads")
(Vusi <- read_dta("pbc_stata_group_bio3.dta"))

bio3 <- Vusi %>%
  mutate(
    status = as_factor(status),
    rx = as_factor(rx),
    sex = as_factor(sex), # For 'sex' if 0/1 are categories but no labels
    asictes = as_factor(asictes),
    spiders = as_factor(spiders),
    hepatom = as_factor(hepatom),
    edema = as_factor(edema)
  )

bio3

bio3_1 <- bio3 %>%
  dplyr::mutate(
    sex = dplyr::case_when(
      sex == 1 ~ "Male",
      sex == 0 ~ "Female",
      TRUE ~ as.factor(sex)
    ),
    edema = dplyr::case_when(
      edema == 1 ~ "Yes",
      edema == 0 ~ "No",
      TRUE ~ as.factor(edema)))
bio3_1

# A tibble: 312 × 23
   number status   rx      sex   asictes hepatom spiders edema bilirubin cholest
    <dbl> <fct>    <fct>   <chr> <fct>   <fct>   <fct>   <chr>     <dbl>   <dbl>
 1      1 Dead     Placebo Male  Yes     Yes     Yes     Yes      14.5       261
 2      2 Censored Placebo Male  No      Yes     Yes     No        1.10      302
 3      3 Dead     Placebo Fema… No      No      No      Yes       1.40      176
 4      4 Dead     Placebo Male  No      Yes     Yes     Yes       1.80      244
 5      5 Censored DPCA    Male  No      Yes     Yes     No        3.40      279
 6      6 Dead     DPCA    Male  No      Yes     No      No        0.800     248
 7      7 Censored DPCA    Male  No      Yes     No      No        1         322
 8      8 Dead     DPCA    Male  No      No      No      No        0.300     280
 9      9 Dead     Placebo Male  No      No      Yes     No        3.20      562
10     10 Dead     DPCA    Male  Yes     No      Yes     Yes      12.6       200
# ℹ 302 more rows
# ℹ 13 more variables: albumin <dbl>, copper <dbl>, alkphos <dbl>, sgot <dbl>,
#   trigli <dbl>, platel <dbl>, prothrom <dbl>, histol <dbl>, age <dbl>,
#   years <dbl>, logbili <dbl>, logalbu <dbl>, logprot <dbl>

saveRDS(bio3_1, "bio3_2.rds")

bio3_1 %>%
  dplyr::select(status, rx, sex, asictes, hepatom, spiders, edema,bilirubin, cholest, albumin, copper, alkphos, sgot, trigli, platel, prothrom, age, years, logbili, logalbu, logprot) %>%
  tbl_summary(
    by = rx,
    missing = "always",
    missing_text = "Missing" ,
    statistic = list(
      all_of(c("bilirubin", "cholest", "albumin", "copper", "alkphos", "sgot", "trigli", "platel", "prothrom", "age", "years", "logbili", "logalbu", "logprot")) ~ "{mean} ({sd})",
      all_of(c("status", "sex", "asictes", "hepatom", "spiders", "edema")) ~ "{n} / {N} ({p}%)"
    )
  ) %>%
  add_stat_label() %>% add_overall() %>% 
  bold_labels() %>%
  modify_header(
    label ~ "**Characteristics**",
    all_stat_cols() ~ "**{level}**"
  ) %>%
  modify_spanning_header(
    all_stat_cols()~ "**Treatment group**"
  ) %>%
  as_gt() %>%
  gt::tab_header(
    title = gt::md("**Table 1. Demographic Characteristics**"),
    subtitle = gt::md("")
  ) %>%
  gt::tab_source_note(source_note = "Data updated August by Vusi, 2025")

Characteristics	Treatment group
Table 1. Demographic Characteristics

Characteristics	Overall¹	Placebo	DPCA
Alive/Dead, n / N (%)
Censored	187 / 312 (60%)	93 / 158 (59%)	94 / 154 (61%)
Dead	125 / 312 (40%)	65 / 158 (41%)	60 / 154 (39%)
Missing	0	0	0
sex, n / N (%)
Female	36 / 312 (12%)	21 / 158 (13%)	15 / 154 (9.7%)
Male	276 / 312 (88%)	137 / 158 (87%)	139 / 154 (90%)
Missing	0	0	0
Presence of Asictes, n / N (%)	24 / 312 (7.7%)	14 / 158 (8.9%)	10 / 154 (6.5%)
Missing	0	0	0
Hepatomegaly, n / N (%)	160 / 312 (51%)	73 / 158 (46%)	87 / 154 (56%)
Missing	0	0	0
Presence of Spiders, n / N (%)	90 / 312 (29%)	45 / 158 (28%)	45 / 154 (29%)
Missing	0	0	0
edema, n / N (%)	49 / 312 (16%)	26 / 158 (16%)	23 / 154 (15%)
Missing	0	0	0
Serum Bilirubin in mg/dl, Mean (SD)	3.3 (4.5)	2.9 (3.6)	3.6 (5.3)
Missing	0	0	0
Serum Cholesterol in mg/dl, Mean (SD)	370 (232)	365 (210)	374 (252)
Missing	28	18	10
Albumin in gm/dl, Mean (SD)	3.52 (0.42)	3.52 (0.44)	3.52 (0.40)
Missing	0	0	0
Urine Copper in ug/day, Mean (SD)	98 (86)	98 (91)	98 (80)
Missing	2	1	1
Alkaline Phosphatase in U/liter, Mean (SD)	1,983 (2,140)	2,021 (2,183)	1,943 (2,102)
Missing	0	0	0
SGOT in U/ml, Mean (SD)	123 (57)	120 (55)	125 (59)
Missing	0	0	0
Triglicerides in mg/dl, Mean (SD)	125 (65)	124 (72)	125 (59)
Missing	30	19	11
Platelets per cubic ml / 1000, Mean (SD)	262 (96)	259 (100)	265 (91)
Missing	4	2	2
Prothrombin time in seconds, Mean (SD)	10.73 (1.00)	10.65 (0.85)	10.80 (1.14)
Missing	0	0	0
Age (years), Mean (SD)	50 (11)	51 (11)	49 (10)
Missing	0	0	0
Time to Death (in Years), Mean (SD)	5.5 (3.1)	5.5 (3.0)	5.5 (3.2)
Missing	0	0	0
Bilirubin (log), Mean (SD)	0.58 (1.03)	0.54 (0.97)	0.61 (1.10)
Missing	0	0	0
Albumin (log), Mean (SD)	1.25 (0.13)	1.25 (0.13)	1.25 (0.12)
Missing	0	0	0
Prothom time (log), Mean (SD)	2.37 (0.09)	2.36 (0.08)	2.37 (0.10)
Missing	0	0	0
Data updated August by Vusi, 2025
¹ n / N (%); Mean (SD)

Full multivariable Cox regression model

colnames(bio3_1)

 [1] "number"    "status"    "rx"        "sex"       "asictes"   "hepatom"  
 [7] "spiders"   "edema"     "bilirubin" "cholest"   "albumin"   "copper"   
[13] "alkphos"   "sgot"      "trigli"    "platel"    "prothrom"  "histol"   
[19] "age"       "years"     "logbili"   "logalbu"   "logprot"

bio3_1$event <- ifelse(bio3_1$status == "Dead", 1, 0)
bio3_1$years <- as.numeric(bio3_1$years)

model_survival <- coxph(
  Surv(time = years, event = event) ~ . - number - status - event - years,
  data = bio3_1
)
summary(model_survival)

Call:
coxph(formula = Surv(time = years, event = event) ~ . - number - 
    status - event - years, data = bio3_1)

  n= 276, number of events= 111 
   (36 observations deleted due to missingness)

                 coef  exp(coef)   se(coef)      z Pr(>|z|)    
rxDPCA     -2.058e-02  9.796e-01  2.207e-01 -0.093 0.925725    
sexMale    -2.177e-01  8.043e-01  3.281e-01 -0.664 0.506971    
asictesYes  2.505e-01  1.285e+00  3.842e-01  0.652 0.514471    
hepatomYes -1.296e-01  8.784e-01  2.574e-01 -0.504 0.614551    
spidersYes  1.021e-01  1.107e+00  2.395e-01  0.426 0.669975    
edemaYes    6.272e-01  1.872e+00  2.856e-01  2.196 0.028110 *  
bilirubin  -1.252e-02  9.876e-01  3.999e-02 -0.313 0.754229    
cholest     1.611e-04  1.000e+00  4.522e-04  0.356 0.721680    
albumin     1.083e+00  2.955e+00  1.898e+00  0.571 0.568062    
copper      1.828e-03  1.002e+00  1.284e-03  1.424 0.154427    
alkphos     2.722e-06  1.000e+00  4.087e-05  0.067 0.946899    
sgot        2.597e-03  1.003e+00  2.062e-03  1.260 0.207767    
trigli     -2.096e-03  9.979e-01  1.379e-03 -1.520 0.128614    
platel      8.200e-04  1.001e+00  1.209e-03  0.678 0.497647    
prothrom    5.540e-01  1.740e+00  1.231e+00  0.450 0.652744    
histol      3.876e-01  1.473e+00  1.841e-01  2.105 0.035254 *  
age         2.449e-02  1.025e+00  1.129e-02  2.168 0.030148 *  
logbili     8.385e-01  2.313e+00  2.414e-01  3.474 0.000513 ***
logalbu    -5.663e+00  3.470e-03  5.962e+00 -0.950 0.342176    
logprot    -4.082e+00  1.687e-02  1.450e+01 -0.281 0.778343    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

           exp(coef) exp(-coef) lower .95 upper .95
rxDPCA       0.97963     1.0208 6.356e-01 1.510e+00
sexMale      0.80433     1.2433 4.228e-01 1.530e+00
asictesYes   1.28464     0.7784 6.049e-01 2.728e+00
hepatomYes   0.87843     1.1384 5.304e-01 1.455e+00
spidersYes   1.10745     0.9030 6.926e-01 1.771e+00
edemaYes     1.87239     0.5341 1.070e+00 3.277e+00
bilirubin    0.98756     1.0126 9.131e-01 1.068e+00
cholest      1.00016     0.9998 9.993e-01 1.001e+00
albumin      2.95454     0.3385 7.166e-02 1.218e+02
copper       1.00183     0.9982 9.993e-01 1.004e+00
alkphos      1.00000     1.0000 9.999e-01 1.000e+00
sgot         1.00260     0.9974 9.986e-01 1.007e+00
trigli       0.99791     1.0021 9.952e-01 1.001e+00
platel       1.00082     0.9992 9.985e-01 1.003e+00
prothrom     1.74025     0.5746 1.558e-01 1.944e+01
histol       1.47347     0.6787 1.027e+00 2.114e+00
age          1.02479     0.9758 1.002e+00 1.048e+00
logbili      2.31280     0.4324 1.441e+00 3.712e+00
logalbu      0.00347   288.1518 2.919e-08 4.126e+02
logprot      0.01687    59.2610 7.643e-15 3.726e+10

Concordance= 0.852  (se = 0.018 )
Likelihood ratio test= 178.6  on 20 df,   p=<2e-16
Wald test            = 169.5  on 20 df,   p=<2e-16
Score (logrank) test = 285.4  on 20 df,   p=<2e-16

names(bio3_1)

 [1] "number"    "status"    "rx"        "sex"       "asictes"   "hepatom"  
 [7] "spiders"   "edema"     "bilirubin" "cholest"   "albumin"   "copper"   
[13] "alkphos"   "sgot"      "trigli"    "platel"    "prothrom"  "histol"   
[19] "age"       "years"     "logbili"   "logalbu"   "logprot"   "event"

Multi-collinearity for the Cox model

vif(model_survival)

    rxDPCA    sexMale asictesYes hepatomYes spidersYes   edemaYes  bilirubin 
  1.325506   1.801613   1.947398   1.517800   1.401001   1.673546   4.877026 
   cholest    albumin     copper    alkphos       sgot     trigli     platel 
  1.822286  71.688146   1.946742   1.448359   1.782706   1.741475   1.684452 
  prothrom     histol        age    logbili    logalbu    logprot 
174.404706   1.955595   1.662151   5.099236  70.090287 179.384959

In the final multivariable Cox proportional hazards model, the variables bilirubin, prothrombin, and albumin were excluded due to their high collinearity with their respective log-transformed counterparts. Instead, the log-transformed versions, namely, logbili, logprot, and logalbu, were retained in the model to address skewness and better meet model assumptions.

(bio_model <- bio3_1 %>% dplyr::select(-prothrom, -albumin, -bilirubin ,-number))

# A tibble: 312 × 20
   status rx    sex   asictes hepatom spiders edema cholest copper alkphos  sgot
   <fct>  <fct> <chr> <fct>   <fct>   <fct>   <chr>   <dbl>  <dbl>   <dbl> <dbl>
 1 Dead   Plac… Male  Yes     Yes     Yes     Yes       261    156   1718  138. 
 2 Censo… Plac… Male  No      Yes     Yes     No        302     54   7395. 114. 
 3 Dead   Plac… Fema… No      No      No      Yes       176    210    516   96.1
 4 Dead   Plac… Male  No      Yes     Yes     Yes       244     64   6122.  60.6
 5 Censo… DPCA  Male  No      Yes     Yes     No        279    143    671  113. 
 6 Dead   DPCA  Male  No      Yes     No      No        248     50    944   93  
 7 Censo… DPCA  Male  No      Yes     No      No        322     52    824   60.5
 8 Dead   DPCA  Male  No      No      No      No        280     52   4651.  28.4
 9 Dead   Plac… Male  No      No      Yes     No        562     79   2276  144. 
10 Dead   DPCA  Male  Yes     No      Yes     Yes       200    140    918  147. 
# ℹ 302 more rows
# ℹ 9 more variables: trigli <dbl>, platel <dbl>, histol <dbl>, age <dbl>,
#   years <dbl>, logbili <dbl>, logalbu <dbl>, logprot <dbl>, event <dbl>

bio_model$event <- as.numeric(bio_model$event)
bio_model$histol <- as.factor(bio_model$histol)
bio_model$edema <- as.factor(bio_model$edema)

names(bio_model)

 [1] "status"  "rx"      "sex"     "asictes" "hepatom" "spiders" "edema"  
 [8] "cholest" "copper"  "alkphos" "sgot"    "trigli"  "platel"  "histol" 
[15] "age"     "years"   "logbili" "logalbu" "logprot" "event"

SECOND MODEL AFTER VIF

model_survival1 <- coxph(Surv(years, event) ~ rx + sex + asictes + hepatom + spiders + edema + histol+
      cholest + copper + alkphos + sgot + trigli + platel + age + logbili + logalbu + logprot, data = bio_model)


summary(model_survival1)

Call:
coxph(formula = Surv(years, event) ~ rx + sex + asictes + hepatom + 
    spiders + edema + histol + cholest + copper + alkphos + sgot + 
    trigli + platel + age + logbili + logalbu + logprot, data = bio_model)

  n= 276, number of events= 111 
   (36 observations deleted due to missingness)

                 coef  exp(coef)   se(coef)      z Pr(>|z|)    
rxDPCA     -2.749e-02  9.729e-01  2.127e-01 -0.129   0.8972    
sexMale    -2.588e-01  7.719e-01  3.230e-01 -0.801   0.4230    
asictesYes  2.199e-01  1.246e+00  3.884e-01  0.566   0.5713    
hepatomYes -1.392e-01  8.701e-01  2.586e-01 -0.538   0.5904    
spidersYes  1.007e-01  1.106e+00  2.385e-01  0.422   0.6727    
edemaYes    5.691e-01  1.767e+00  2.808e-01  2.027   0.0427 *  
histol2     1.262e+00  3.532e+00  1.058e+00  1.193   0.2328    
histol3     1.503e+00  4.497e+00  1.034e+00  1.454   0.1459    
histol4     1.852e+00  6.373e+00  1.057e+00  1.752   0.0797 .  
cholest     1.028e-04  1.000e+00  4.524e-04  0.227   0.8202    
copper      1.776e-03  1.002e+00  1.283e-03  1.384   0.1664    
alkphos    -4.378e-06  1.000e+00  4.165e-05 -0.105   0.9163    
sgot        2.621e-03  1.003e+00  2.057e-03  1.274   0.2026    
trigli     -2.012e-03  9.980e-01  1.340e-03 -1.501   0.1334    
platel      9.215e-04  1.001e+00  1.201e-03  0.767   0.4428    
age         2.402e-02  1.024e+00  1.122e-02  2.141   0.0323 *  
logbili     7.718e-01  2.164e+00  1.589e-01  4.857 1.19e-06 ***
logalbu    -2.372e+00  9.326e-02  9.586e-01 -2.475   0.0133 *  
logprot     2.658e+00  1.427e+01  1.374e+00  1.935   0.0530 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

           exp(coef) exp(-coef) lower .95 upper .95
rxDPCA       0.97288    1.02787   0.64118    1.4762
sexMale      0.77194    1.29544   0.40984    1.4540
asictesYes   1.24596    0.80260   0.58199    2.6674
hepatomYes   0.87008    1.14932   0.52414    1.4443
spidersYes   1.10599    0.90417   0.69305    1.7650
edemaYes     1.76673    0.56602   1.01889    3.0635
histol2      3.53227    0.28310   0.44441   28.0755
histol3      4.49720    0.22236   0.59272   34.1222
histol4      6.37340    0.15690   0.80295   50.5890
cholest      1.00010    0.99990   0.99922    1.0010
copper       1.00178    0.99823   0.99926    1.0043
alkphos      1.00000    1.00000   0.99991    1.0001
sgot         1.00262    0.99738   0.99859    1.0067
trigli       0.99799    1.00201   0.99537    1.0006
platel       1.00092    0.99908   0.99857    1.0033
age          1.02431    0.97627   1.00203    1.0471
logbili      2.16374    0.46216   1.58470    2.9544
logalbu      0.09326   10.72314   0.01425    0.6104
logprot     14.26768    0.07009   0.96596  210.7411

Concordance= 0.852  (se = 0.018 )
Likelihood ratio test= 178.9  on 19 df,   p=<2e-16
Wald test            = 169.8  on 19 df,   p=<2e-16
Score (logrank) test = 257.2  on 19 df,   p=<2e-16

Checking Assumptions

Testing the proportional hazard assumption

test_ph <- cox.zph(model_survival1) 

test_ph

          chisq df     p
rx       0.4480  1 0.503
sex      0.0898  1 0.764
asictes  3.1194  1 0.077
hepatom  0.1496  1 0.699
spiders  0.1618  1 0.687
edema    6.5612  1 0.010
histol   3.5058  3 0.320
cholest  8.8267  1 0.003
copper   0.2416  1 0.623
alkphos  0.4429  1 0.506
sgot     0.7737  1 0.379
trigli   2.1733  1 0.140
platel   1.4567  1 0.227
age      1.3418  1 0.247
logbili  1.1767  1 0.278
logalbu  0.3060  1 0.580
logprot  6.2798  1 0.012
GLOBAL  24.0981 19 0.192

plot(test_ph, 
     resid = TRUE,
     se = TRUE,
     col = 4:6,
     lwd = 2)

The Schoenfeld residual tests were used to evaluate the proportional hazards assumption for each covariate in the Cox model. Most covariates showed no evidence of violating the PH assumption, with non-significant p-values (e.g., rx, sex, hepatom, spiders, age, logbili). However, edema (p = 0.010), cholest (p = 0.003), and logprot (p = 0.012) showed significant departures, indicating potential violations of the PH assumption. The global test (χ² = 24.10, df = 19, p = 0.192) suggests that, overall, the PH assumption holds reasonably well for the model.

Linearity assumption testing using Restricted Cubic Splines (RCS)

# Define continuous variables
cont_vars <- c("cholest", "copper", "alkphos", "sgot", 
               "trigli", "platel", "age", 
               "logbili", "logalbu", "logprot")

# Create survival object
surv_obj <- Surv(time = bio_model$years, event = bio_model$event)

# Set up data distribution object (needed by rms)
dd <- datadist(bio_model)
options(datadist = 'dd')


# Loop through variables
for (var in cont_vars) {
  formula <- as.formula(paste0("surv_obj ~ rcs(", var, ", 3)"))
  
  cat("\nLinearity test for:", var, "\n")
  
  tryCatch({
    model <- cph(
      formula,
      data = bio_model,
      x = TRUE, y = TRUE,
      surv = TRUE,
      maxiter = 100000,          # Increase iterations
      eps = 1e-6             # Convergence tolerance
    )
    print(anova(model))      # Wald test for non-linearity
  }, error = function(e) {
    message("⚠️  Model for ", var, " did not converge: ", e$message)
  })
}


Linearity test for: cholest 
                Wald Statistics          Response: surv_obj 

 Factor     Chi-Square d.f. P     
 cholest    13.34      2    0.0013
  Nonlinear  0.02      1    0.8769
 TOTAL      13.34      2    0.0013

Linearity test for: copper 
                Wald Statistics          Response: surv_obj 

 Factor     Chi-Square d.f. P     
 copper     67.98      2    <.0001
  Nonlinear 15.01      1    1e-04 
 TOTAL      67.98      2    <.0001

Linearity test for: alkphos 
                Wald Statistics          Response: surv_obj 

 Factor     Chi-Square d.f. P     
 alkphos    12.65      2    0.0018
  Nonlinear 10.67      1    0.0011
 TOTAL      12.65      2    0.0018

Linearity test for: sgot 
                Wald Statistics          Response: surv_obj 

 Factor     Chi-Square d.f. P     
 sgot       28.89      2    <.0001
  Nonlinear  8.80      1    0.003 
 TOTAL      28.89      2    <.0001

Linearity test for: trigli 
                Wald Statistics          Response: surv_obj 

 Factor     Chi-Square d.f. P     
 trigli     14.13      2    0.0009
  Nonlinear  1.21      1    0.2713
 TOTAL      14.13      2    0.0009

Linearity test for: platel 
                Wald Statistics          Response: surv_obj 

 Factor     Chi-Square d.f. P     
 platel     20.81      2    <.0001
  Nonlinear  5.96      1    0.0147
 TOTAL      20.81      2    <.0001

Linearity test for: age 
                Wald Statistics          Response: surv_obj 

 Factor     Chi-Square d.f. P     
 age        20.43      2    <.0001
  Nonlinear  0.01      1    0.9352
 TOTAL      20.43      2    <.0001

Linearity test for: logbili 
                Wald Statistics          Response: surv_obj 

 Factor     Chi-Square d.f. P     
 logbili    127.72     2    <.0001
  Nonlinear   0.60     1    0.4367
 TOTAL      127.72     2    <.0001

Linearity test for: logalbu 
                Wald Statistics          Response: surv_obj 

 Factor     Chi-Square d.f. P     
 logalbu    86.24      2    <.0001
  Nonlinear  0.05      1    0.8232
 TOTAL      86.24      2    <.0001

Linearity test for: logprot 
                Wald Statistics          Response: surv_obj 

 Factor     Chi-Square d.f. P     
 logprot    45.86      2    <.0001
  Nonlinear  7.36      1    0.0067
 TOTAL      45.86      2    <.0001

Martingale residuals to support the Restricted Cubic Spline results

# Define your model formula
cox_formula <- Surv(years, event) ~ rx + sex + asictes + hepatom + spiders + edema + histol +
  cholest + copper + alkphos + sgot + trigli + platel + age + logbili + logalbu + logprot

# Extract complete cases used in the model
model_vars <- all.vars(cox_formula)
bio_model_cc <- bio_model[complete.cases(bio_model[, model_vars]), ]

# Fit model on complete cases
model_survival1 <- coxph(cox_formula, data = bio_model_cc)

# Get martingale residuals
martingale_resid <- residuals(model_survival1, type = "martingale")

# List of continuous variables
cont_vars <- c("cholest", "copper", "alkphos", "sgot", "trigli", "platel", "age", "logbili", "logalbu", "logprot")

# Plot
par(mfrow = c(3, 4))  # Adjust as needed

for (var in cont_vars) {
  x_vals <- bio_model_cc[[var]]
  plot(x_vals, martingale_resid,
       xlab = var,
       ylab = "Martingale Residuals",
       main = paste("Residuals vs", var),
       pch = 20)
  lines(lowess(x_vals, martingale_resid), col = "blue", lwd = 2)
}

To assess whether the continuous covariates in our Cox proportional hazards model satisfied the linearity assumption, we conducted tests using restricted cubic splines and Wald statistics for each variable. The null hypothesis in this context is that the relationship between the continuous predictor and the log hazard is linear. The test output reports both the overall association (factor p-value) and the nonlinearity component. A p-value < 0.05 for the nonlinearity test suggests evidence of departure from linearity.

Significant evidence of nonlinearity (p < 0.05) was found for copper, alkphos, sgot, platelets,, and logprot. We believe that these variables may benefit from transformation or modeling using spline terms to better capture their relationships with the hazard function. Variables such as cholest, trigli, age, and logbili did not show evidence of nonlinearity and were retained as linear terms.

In addition to statistical tests, we examined Martingale residual plots (see figure above) for each continuous covariate. These plots visually assess the functional form of the covariates in the model. Deviations from a horizontal band centered around zero further supported evidence of nonlinearity for variables such as copper, alkphos, sgot, and logprot. These visual assessments reinforced our decision to use spline functions or transformations for those variables.

Analysis of deviance to assess predictor variables that improve the Cox regression model - Variable selection

(anova(coxph(Surv(years, event) ~ rx + sex + asictes + hepatom + spiders + edema + histol+
      cholest + copper + alkphos + sgot + trigli + platel + age + logbili + logalbu + logprot, data = bio_model)))

Analysis of Deviance Table
 Cox model: response is Surv(years, event)
Terms added sequentially (first to last)

         loglik   Chisq Df Pr(>|Chi|)    
NULL    -550.19                          
rx      -549.99  0.4059  1  0.5240359    
sex     -548.04  3.8957  1  0.0484086 *  
asictes -525.46 45.1622  1  1.814e-11 ***
hepatom -515.32 20.2704  1  6.723e-06 ***
spiders -511.93  6.7793  1  0.0092219 ** 
edema   -506.27 11.3336  1  0.0007612 ***
histol  -502.20  8.1297  3  0.0434061 *  
cholest -495.54 13.3284  1  0.0002614 ***
copper  -488.37 14.3382  1  0.0001527 ***
alkphos -488.21  0.3247  1  0.5687736    
sgot    -485.80  4.8068  1  0.0283479 *  
trigli  -485.66  0.2929  1  0.5883700    
platel  -485.62  0.0755  1  0.7835382    
age     -479.36 12.5139  1  0.0004039 ***
logbili -464.97 28.7746  1  8.131e-08 ***
logalbu -462.58  4.7944  1  0.0285528 *  
logprot -460.75  3.6545  1  0.0559169 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We used the Analysis of Deviance Table as a guide for variable selection in building our Cox proportional hazards model. By adding covariates sequentially and examining their contributions to model fit through changes in log-likelihood and Chi-squared statistics, we identified which variables significantly improved the model. Variables such as ascites, hepatomegaly, edema, log-bilirubin, age, and copper showed strong evidence of association with survival (p < 0.001), supporting their inclusion in the final model. Additional variables like sex, spiders, SGOT, cholesterol, and log-albumin were also retained based on their statistical significance (p < 0.05). While log-protein showed borderline significance (p = 0.056), it was considered for inclusion due to its potential clinical relevance. Variables such as alkaline phosphatase, triglycerides, and platelet count did not meaningfully contribute to model fit and were excluded. This approach helped ensure a parsimonious model by retaining only covariates with substantial explanatory power.

Constructing the final model that retains all variables that improve the model (Parsimonious model).

bio_model$edema_tve <- ifelse(bio_model$edema == "Yes", 1, 0) * log(pmax(bio_model$years, 0.01))
# Generate RCS terms for copper and sgot (3 knots each)
bio_model$rcs_copper <- rcspline.eval(bio_model$copper, nk = 3, inclx = TRUE)
bio_model$rcs_sgot   <- rcspline.eval(bio_model$sgot, nk = 3, inclx = TRUE)


tbl1 <- coxph(
  Surv(years, event) ~ 
    rx + sex + asictes + hepatom + spiders + histol + 
    rcs_copper + rcs_sgot +
    age + logbili + logalbu +
    edema_tve + tt(cholest),
  data = bio_model,
  tt = function(x, t, ...) x * log(pmax(t, 0.01)),  # Avoid log(0)
  ties = "efron",
  control = coxph.control(
    iter.max = 5000,
    eps = 1e-09,
    toler.chol = 1e-10
  )
)%>%
  tbl_regression(exponentiate = TRUE) %>%
  bold_p() %>%
  add_global_p()

model_output <- coxph(
  Surv(years, event) ~ 
    rx + sex + asictes + hepatom + spiders + histol + 
    rcs_copper + rcs_sgot +
    age + logbili + logalbu +
    edema_tve + tt(cholest),
  data = bio_model,
  tt = function(x, t, ...) x * log(pmax(t, 0.01)),  # Avoid log(0)
  ties = "efron",
  control = coxph.control(
    iter.max = 5000,
    eps = 1e-09,
    toler.chol = 1e-10
  )
)
summary(model_output)

Call:
coxph(formula = Surv(years, event) ~ rx + sex + asictes + hepatom + 
    spiders + histol + rcs_copper + rcs_sgot + age + logbili + 
    logalbu + edema_tve + tt(cholest), data = bio_model, control = coxph.control(iter.max = 5000, 
    eps = 1e-09, toler.chol = 1e-10), ties = "efron", tt = function(x, 
    t, ...) x * log(pmax(t, 0.01)))

  n= 282, number of events= 113 
   (30 observations deleted due to missingness)

                  coef  exp(coef)   se(coef)      z Pr(>|z|)    
rxDPCA      -0.0125861  0.9874928  0.2077890 -0.061  0.95170    
sexMale      0.0755612  1.0784893  0.3084180  0.245  0.80646    
asictesYes   0.5907990  1.8054303  0.3727170  1.585  0.11294    
hepatomYes   0.0130064  1.0130914  0.2528276  0.051  0.95897    
spidersYes   0.2262123  1.2538418  0.2301370  0.983  0.32563    
histol2      0.9467543  2.5773308  1.0519693  0.900  0.36813    
histol3      1.4029502  4.0671812  1.0338774  1.357  0.17479    
histol4      1.6680314  5.3017206  1.0595319  1.574  0.11542    
rcs_copperx  0.0112154  1.0112785  0.0056931  1.970  0.04884 *  
rcs_copper  -0.0142585  0.9858427  0.0081700 -1.745  0.08095 .  
rcs_sgotx    0.0056794  1.0056955  0.0063734  0.891  0.37287    
rcs_sgot    -0.0062334  0.9937860  0.0067233 -0.927  0.35386    
age          0.0261604  1.0265056  0.0108025  2.422  0.01545 *  
logbili      0.6962297  2.0061745  0.1452381  4.794 1.64e-06 ***
logalbu     -2.4120892  0.0896278  0.8633178 -2.794  0.00521 ** 
edema_tve   -0.8181390  0.4412520  0.2654395 -3.082  0.00205 ** 
tt(cholest)  0.0006124  1.0006126  0.0003131  1.956  0.05046 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

            exp(coef) exp(-coef) lower .95 upper .95
rxDPCA        0.98749     1.0127    0.6571    1.4839
sexMale       1.07849     0.9272    0.5892    1.9740
asictesYes    1.80543     0.5539    0.8696    3.7483
hepatomYes    1.01309     0.9871    0.6172    1.6629
spidersYes    1.25384     0.7975    0.7986    1.9685
histol2       2.57733     0.3880    0.3279   20.2585
histol3       4.06718     0.2459    0.5361   30.8554
histol4       5.30172     0.1886    0.6646   42.2953
rcs_copperx   1.01128     0.9888    1.0001    1.0226
rcs_copper    0.98584     1.0144    0.9702    1.0018
rcs_sgotx     1.00570     0.9943    0.9932    1.0183
rcs_sgot      0.99379     1.0063    0.9808    1.0070
age           1.02651     0.9742    1.0050    1.0485
logbili       2.00617     0.4985    1.5092    2.6668
logalbu       0.08963    11.1572    0.0165    0.4867
edema_tve     0.44125     2.2663    0.2623    0.7424
tt(cholest)   1.00061     0.9994    1.0000    1.0012

Concordance= 0.854  (se = 0.018 )
Likelihood ratio test= 194.2  on 17 df,   p=<2e-16
Wald test            = 184.6  on 17 df,   p=<2e-16
Score (logrank) test = 276.4  on 17 df,   p=<2e-16

# List of all predictors
predictors <- c("rx", "sex", "asictes", "hepatom", "spiders", "edema_tve", "histol","cholest", "rcs_copper", "rcs_sgot", "age", "logbili", "logalbu")

# Specify continuous predictors for spline transformation
continuous_vars <- c("cholest", "rcs_copper", "rcs_sgot")

# Fit unadjusted Cox models
unadjusted_models <- lapply(predictors, function(var) {
  formula <- if (var %in% continuous_vars) {
    # For RCS variables, use all components of the spline
    as.formula(paste0("Surv(years, event) ~ ", var))
  } else {
    # For non-RCS variables (including binary/categorical)
    as.formula(paste0("Surv(years, event) ~ ", var))
  }
  coxph(formula, data = bio_model)
})

# Name each model
names(unadjusted_models) <- gsub("^rcs_", "", predictors)  # Clean names by removing 'rcs_' prefix

# Generate regression tables
tbl2 <- lapply(unadjusted_models, function(model) {
  tbl_regression(model,
                exponentiate = TRUE,
                pvalue_fun = ~style_pvalue(.x, digits = 3)) %>%
    bold_p() %>%
    add_global_p()
})

tbl2_combined <- tbl_stack(tbl2) %>%
  modify_header(label = "**Variable**") %>%
  modify_caption("**Adjusted and Unadjusted Cox Regression Models**")

tbl_merge_1 <- tbl_merge(
  tbls = list(tbl1, tbl2_combined),  # use combined table here
  tab_spanner = c("**Adjusted**", "**Unadjusted**")
) %>%
  bold_labels() %>%
  modify_spanning_header(all_stat_cols() ~ "**Treatment group**") %>%
  as_gt() %>%
  gt::tab_header(
    title = gt::md("**Table 2. Treatment Effects of DPCA on Primiry Biliary Cirrhosis (PBC)**"),
    subtitle = gt::md("BioGee4 Analysis")
  ) %>%
  gt::tab_source_note(
    gt::md(
      "Concordance= 0.854  (se = 0.018 )
Likelihood ratio test= 194.2  on 17 df,   p=<2e-16
Wald test            = 184.6  on 17 df,   p=<2e-16
Score (logrank) test = 276.4  on 17 df,   p=<2e-16"
))
tbl_merge_1

**Adjusted and Unadjusted Cox Regression Models**
Table 2. Treatment Effects of DPCA on Primiry Biliary Cirrhosis (PBC)
BioGee4 Analysis
Characteristic	Adjusted			Unadjusted
Characteristic	HR	95% CI	p-value	HR	95% CI	p-value
treatment			>0.9			0.749
Placebo	—	—		—	—
DPCA	0.99	0.66, 1.48		0.94	0.66, 1.34
sex			0.8			0.052
Female	—	—		—	—
Male	1.08	0.59, 1.97		0.62	0.39, 0.98
Presence of Asictes			0.11			<0.001
No	—	—		—	—
Yes	1.81	0.87, 3.75		7.79	4.89, 12.4
Hepatomegaly			>0.9			<0.001
No	—	—		—	—
Yes	1.01	0.62, 1.66		3.27	2.22, 4.82
Presence of Spiders			0.3			<0.001
No	—	—		—	—
Yes	1.25	0.80, 1.97		2.65	1.85, 3.80
histol			0.2			<0.001
1	—	—		—	—
2	2.58	0.33, 20.3		4.99	0.66, 37.6
3	4.07	0.54, 30.9		8.58	1.18, 62.4
4	5.30	0.66, 42.3		21.4	2.96, 154
rcs_copper			0.11			<0.001
rcs_copperx	1.01	1.00, 1.02		1.02	1.02, 1.03
rcs_copper	0.99	0.97, 1.00		0.97	0.96, 0.99
rcs_sgot			0.7			<0.001
rcs_sgotx	1.01	0.99, 1.02		1.02	1.01, 1.03
rcs_sgot	0.99	0.98, 1.01		0.98	0.97, 0.99
Age (years)	1.03	1.01, 1.05	0.015	1.04	1.02, 1.06	<0.001
Bilirubin (log)	2.01	1.51, 2.67	<0.001	2.96	2.47, 3.55	<0.001
Albumin (log)	0.09	0.02, 0.49	0.005	0.00	0.00, 0.01	<0.001
edema_tve	0.44	0.26, 0.74	0.002	0.54	0.31, 0.95	0.010
Serum Cholesterol in mg/dl	1.00	1.00, 1.00	0.050
Serum Cholesterol in mg/dl				1.00	1.00, 1.00	0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
Concordance= 0.854 (se = 0.018 ) Likelihood ratio test= 194.2 on 17 df, p=<2e-16 Wald test = 184.6 on 17 df, p=<2e-16 Score (logrank) test = 276.4 on 17 df, p=<2e-16

Table 2 presents the effects of DPCA on PBC survival outcomes based on both unadjusted and adjusted Cox proportional hazards models. After adjusting for covariates, DPCA treatment was not significantly associated with overall survival (HR = 0.99, 95% CI: 0.66–1.48, p > 0.9), suggesting no treatment effect. Similarly, variables such as sex, hepatomegaly, presence of spiders, histological subtype, and the time-varying effect of serum cholesterol were not significant in the adjusted model despite being significant in the unadjusted analysis. In contrast, several predictors remained significantly associated with survival after adjustment. Age was positively associated with hazard (HR = 1.03, 95% CI: 1.01–1.05, p = 0.015), while elevated bilirubin (log-transformed) was strongly linked to increased mortality (HR = 2.01, 95% CI: 1.51–2.67, p < 0.001). Higher log-transformed albumin levels were significantly protective (HR = 0.09, 95% CI: 0.02–0.49, p = 0.005), and the presence of edema (time-varying) was associated with a lower hazard (HR = 0.44, 95% CI: 0.26–0.74, p = 0.002). The restricted cubic spline terms for copper approached significance, while those for SGOT were not significant in the adjusted model. Although the unadjusted results suggested strong associations for ascites, hepatomegaly, and histological subtype (all p < 0.001), these were attenuated after adjustment, indicating potential confounding. The final model demonstrated good predictive performance with a concordance index of 0.854 (SE = 0.018), and all global model tests (likelihood ratio, Wald, and score/logrank tests) were highly significant (p < 2e-16), supporting the model’s robustness.

Forestplot

library(survival)
library(ggplot2)
library(broom)     # for tidy()
library(dplyr)
library(forcats)   # for fct_reorder

# Fit the Cox model
ggmodel <- coxph(
  Surv(years, event) ~ 
    rx + sex + asictes + hepatom + spiders + histol + 
    rcs_copper + rcs_sgot +
    age + logbili + logalbu +
    edema_tve + tt(cholest),
  data = bio_model,
  tt = function(x, t, ...) x * log(pmax(t, 0.01)),  # Avoid log(0)
  ties = "efron",
  control = coxph.control(
    iter.max = 5000,
    eps = 1e-09,
    toler.chol = 1e-10
  )
)

# Tidy model output
cox_df <- tidy(ggmodel, exponentiate = TRUE, conf.int = TRUE)

# Optional: make labels more readable
cox_df <- cox_df %>%
  mutate(
    term = gsub("`", "", term),
    label = fct_reorder(term, estimate),
    sig = ifelse(conf.low > 1 | conf.high < 1, "Yes", "No")
  )

# Plot
ggplot(cox_df, aes(x = label, y = estimate)) +
  geom_point(aes(color = sig), size = 3) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high, color = sig), width = 0.2) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "black") +
  coord_flip() +
  scale_y_log10() +
  scale_color_manual(values = c("Yes" = "red", "No" = "turquoise")) +
  labs(
    title = "Treatment Effects of DPCA on Primiry Biliary Cirrhosis",
    x = "Variables",
    y = "Hazard Ratio (95% CI)",
    color = "Statistically Significant"
  ) +
  theme_minimal(base_size = 14) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Kaplan Meier plots

library(survival)
library(survminer)
library(dplyr)

# Define your survival object
surv_obj <- Surv(time = bio_model$years, event = bio_model$event)

# Variables to summarize
km_vars <- c("edema_tve", "age", "logbili", "logalbu", "copper")

# Create list to store KM summaries
km_summary_list <- lapply(km_vars, function(var) {
  cat("Processing KM summary for:", var, "\n")
  
  data_km <- bio_model %>%
    select(years, event, all_of(var)) %>%
    filter(!is.na(.data[[var]]))
  
  # Stratify variable
  if (is.numeric(data_km[[var]])) {
    median_val <- median(data_km[[var]], na.rm = TRUE)
    data_km$strata <- ifelse(data_km[[var]] > median_val,
                             paste(var, "> median"),
                             paste(var, "<= median"))
  } else {
    data_km$strata <- as.character(data_km[[var]])
  }
  
  # Check that we have at least two strata
  if (length(unique(data_km$strata)) < 2) {
    warning(paste("Skipped", var, ": only one group present"))
    return(NULL)
  }
  
  # Fit Cox model with strata
  fit <- coxph(Surv(years, event) ~ strata, data = data_km)
  return(summary(fit))
})

Processing KM summary for: edema_tve 
Processing KM summary for: age 
Processing KM summary for: logbili 
Processing KM summary for: logalbu 
Processing KM summary for: copper

# Name the list by the variables
names(km_summary_list) <- km_vars

# Use the variable of interest
var <- "rx"

# Prepare data
km_data <- bio_model %>%
  dplyr::select(years, event, all_of(var)) %>%
  filter(!is.na(.data[[var]]))

# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
  km_data <- km_data %>%
    mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
                          "Placebo", "DPCA"))
} else {
  km_data <- km_data %>%
    mutate(group = as.character(.data[[var]]))
}

# Fit KM model
fit <- survfit(Surv(years, event) ~ group, data = km_data)

# Plot KM curve
ggsurv <- ggsurvplot(
  fit,
  data = km_data,
  size = 1,
  table.height = 0.2,
  risk.table = T,
  pval = T,
  conf.int = T,
  surv.median.line = "hv",
  title = "Kaplan meier for Treatment",
  legend.labs = sort(unique(km_data$group)),
  xlab = "Time (years)",
  ylab = "Survival Probability",
  palette = c("#E69", "#56B4E9"),
  ggtheme = theme_gray()
)

Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

ggsurv

Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

library(survival)
library(survminer)
library(dplyr)

# Use the variable of interest
var <- "edema_tve"

# Prepare data
km_data <- bio_model %>%
  dplyr::select(years, event, all_of(var)) %>%
  filter(!is.na(.data[[var]]))

# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
  km_data <- km_data %>%
    mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
                          "Yes", "No"))
} else {
  km_data <- km_data %>%
    mutate(group = as.character(.data[[var]]))
}

# Fit KM model
fit1 <- survfit(Surv(years, event) ~ group, data = km_data)

# Plot KM curve
ggsurv1 <- ggsurvplot(
  fit1,
  data = km_data,
  size = 1,
  table.height = 0.2,
  risk.table = T,
  pval = T,
  conf.int = T,
  surv.median.line = "hv",
  title = "Kaplan meier for time varying Edema",
  legend.labs = sort(unique(km_data$group)),
  xlab = "Time (years)",
  ylab = "Survival Probability",
  palette = c("#E69", "#56B4E9"),
  ggtheme = theme_gray()
)

Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

ggsurv1

Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

# Use the variable of interest
var <- "age"

# Prepare data
km_data <- bio_model %>%
  dplyr::select(years, event, all_of(var)) %>%
  filter(!is.na(.data[[var]]))

# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
  km_data <- km_data %>%
    mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
                          "old", "young"))
} else {
  km_data <- km_data %>%
    mutate(group = as.character(.data[[var]]))
}

# Fit KM model
fit2 <- survfit(Surv(years, event) ~ group, data = km_data)

# Plot KM curve
ggsurv2 <- ggsurvplot(
  fit2,
  data = km_data,
  size = 1,
  table.height = 0.2,
  risk.table = T,
  pval = T,
  conf.int = T,
  surv.median.line = "hv",
  title = "Kaplan meier for Age",
  legend.labs = sort(unique(km_data$group)),
  xlab = "Time (years)",
  ylab = "Survival Probability",
  palette = c("#E69", "#56B4E9"),
  ggtheme = theme_gray()
)

Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

ggsurv2

Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

# Use the variable of interest
var <- "logbili"

# Prepare data
km_data <- bio_model %>%
  dplyr::select(years, event, all_of(var)) %>%
  filter(!is.na(.data[[var]]))

# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
  km_data <- km_data %>%
    mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
                          "High", "Low"))
} else {
  km_data <- km_data %>%
    mutate(group = as.character(.data[[var]]))
}

# Fit KM model
fit3 <- survfit(Surv(years, event) ~ group, data = km_data)

# Plot KM curve
ggsurv3 <- ggsurvplot(
  fit3,
  data = km_data,
  size = 1,
  table.height = 0.2,
  risk.table = T,
  pval = T,
  conf.int = T,
  surv.median.line = "hv",
  title = "Kaplan meier for Bilirubin(log)",
  legend.labs = sort(unique(km_data$group)),
  xlab = "Time (years)",
  ylab = "Survival Probability",
  palette = c("#E69", "#56B4E9"),
  ggtheme = theme_gray()
)


ggsurv3

# Use the variable of interest
var <- "logalbu"

# Prepare data
km_data <- bio_model %>%
  dplyr::select(years, event, all_of(var)) %>%
  filter(!is.na(.data[[var]]))

# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
  km_data <- km_data %>%
    mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
                          "High", "Low"))
} else {
  km_data <- km_data %>%
    mutate(group = as.character(.data[[var]]))
}

# Fit KM model
fit4 <- survfit(Surv(years, event) ~ group, data = km_data)

# Plot KM curve
ggsurv4 <- ggsurvplot(
  fit4,
  data = km_data,
  size = 1,
  table.height = 0.2,
  risk.table = T,
  pval = T,
  conf.int = T,
  surv.median.line = "hv",
  title = "Kaplan meier for Albumin(log)",
  legend.labs = sort(unique(km_data$group)),
  xlab = "Time (years)",
  ylab = "Survival Probability",
  palette = c("#E69", "#56B4E9"),
  ggtheme = theme_gray()
)

Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

ggsurv4

Warning in geom_segment(aes(x = 0, y = max(y2), xend = max(x1), yend = max(y2)), : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

var <- "copper"

# Prepare data
km_data <- bio_model %>%
  dplyr::select(years, event, all_of(var)) %>%
  filter(!is.na(.data[[var]]))

# Stratify by median if numeric, or use as is if binary/categorical
if (is.numeric(km_data[[var]])) {
  km_data <- km_data %>%
    mutate(group = ifelse(.data[[var]] > median(.data[[var]], na.rm = TRUE),
                          "High", "Low"))
} else {
  km_data <- km_data %>%
    mutate(group = as.character(.data[[var]]))
}

# Fit KM model
fit5 <- survfit(Surv(years, event) ~ group, data = km_data)

# Plot KM curve
ggsurv5 <- ggsurvplot(
  fit5,
  data = km_data,
  size = 1,
  table.height = 0.2,
  risk.table = T,
  pval = T,
  conf.int = T,
  surv.median.line = "hv",
  title = "Kaplan meier for Copper",
  legend.labs = sort(unique(km_data$group)),
  xlab = "Time (years)",
  ylab = "Survival Probability",
  palette = c("#E69", "#56B4E9"),
  ggtheme = theme_gray()
)


ggsurv5

The six Kaplan-Meier plots reveal a consistent pattern regarding prognostic factors for survival. The comparison between DPCA treatment and placebo shows no significant survival difference (p = 0.75), indicating that the treatment itself does not influence survival outcomes. However, the remaining plots, covering age, bilirubin, albumin, edema, and copper levels, demonstrate strong and statistically significant associations with survival, with p-values mostly below 0.0001. Patients who are younger, have lower bilirubin and copper levels, higher albumin levels, and no edema tend to have significantly longer survival times and more favorable prognoses.

Different and available approaches

In survival analysis, a variety of modeling approaches are available, each with specific assumptions and limitations.

Parametric models

Parametric models include the Gompertz, exponential, and Weibull distributions among many others. The Gompertz model assumes that the hazard function changes exponentially over time, either increasing or decreasing depending on the sign of the shape parameter. This makes it particularly suitable for modeling adult mortality or aging-related processes. It assumes independent observations, non-informative censoring, and a log-linear relationship between covariates and the hazard function. However, its limitations include poor fit for non-monotonic hazard patterns and reduced interpretability compared to simpler models like the exponential or Weibull. Additionally, it is less commonly implemented, so diagnostic tools may be more limited.

The exponential model assumes a constant hazard rate over time and is most appropriate for data where the risk of the event remains unchanged throughout the follow-up. It also assumes independent observations, non-informative censoring, and a log-linear relationship between covariates and hazard. While mathematically simple, this model often poorly fits real-world data where hazard rates change over time, making its assumptions overly restrictive.

The Weibull model is more flexible and generalizes the exponential distribution. Its shape parameter determines whether the hazard increases \((p>1p > 1p>1)\) or decreases \((p<1p < 1p<1)\) over time, while \(p=1p = 1p=1\) reduces it to the exponential case. Like the other parametric models, it assumes non-informative censoring, independent observations, and a log-linear effect of covariates on hazard. However, it still cannot accommodate non-monotonic hazard functions and can lead to poor model fit if the true hazard pattern is not well captured.

Although not a survival model per se, Poisson regression is sometimes used in event count data where the outcome is the number of events over time. It assumes that counts follow a Poisson distribution with the mean equal to the variance (equidispersion), events occur randomly and independently, and covariates have a log-linear relationship with the expected count. However, it is not appropriate for binary time-to-event outcomes like death or survival without careful transformation. It also performs poorly in the presence of overdispersion (variance greater than the mean) or excessive zeros, in which case models like the negative binomial or zero-inflated Poisson are preferred.

Non-parametric models

Among non-parametric methods, the log-rank test is commonly used to compare survival curves between two or more groups. It assumes independent survival times, non-informative censoring, and proportional hazards across time. While it is a robust method, it does not adjust for covariates, cannot estimate effect sizes like hazard ratios, and has reduced sensitivity when survival curves cross.

Lastly, actuarial life table methods are used when data are grouped into fixed time intervals. These methods assume censoring occurs uniformly across intervals and that survival in one interval is conditional only on surviving the previous one. They are best suited for large samples and regularly spaced follow-up. However, they lack the precision of Kaplan–Meier methods, especially when exact event times are available, and may obscure hazard patterns due to interval grouping.

Unanswered question

Does the hazard ratio from a Cox model accurately reflect the overall treatment effect, or are there limitations to its interpretability—even when the model assumptions are met?

Even when the proportional hazards (PH) assumption is satisfied, the hazard ratio (HR) from a Cox model is not a fully reliable estimand for quantifying treatment effect. This is because the HR is a local, instantaneous measure that reflects the relative risk of the event at any given moment in time but does not capture the cumulative experience of subjects over the follow-up period. Specifically, the HR lacks a direct interpretation as a population-level summary of the treatment effect; it does not account for the absolute timing or probability of events, nor does it integrate over the full distribution of survival times. In contrast to estimands like the difference in survival probabilities or the restricted mean survival time (RMST), the HR does not summarize what happens to individuals after a particular timepoint, making it potentially misleading in settings where understanding long-term outcomes or average survival is important. Therefore, even under ideal model conditions, reliance on the hazard ratio alone can obscure the practical magnitude and relevance of the treatment effect.