Loading Library

#load survival package
library(survival)
library(ggsurvfit)
## Loading required package: ggplot2
#load nafld dataset
data(nafld, package = 'survival')
head(nafld1)
##       id age male weight height      bmi case.id futime status
## 3631   1  57    0   60.0    163 22.69094   10630   6261      0
## 8458   2  67    0   70.4    168 24.88403   14817    624      0
## 6298   3  53    1  105.8    186 30.45354       3   1783      0
## 15398  4  56    1  109.3    170 37.83010    6628   3143      0
## 13261  5  68    1     NA     NA       NA    1871   1836      1
## 14423  6  39    0   63.9    155 26.61559    7144   1581      0
#perform analysis with nafld1
#for example, survival object would be
surv_obj <- Surv(nafld1$futime, nafld1$status)

Plot the overall survival curve

survfit(Surv(futime, status) ~ 1, data = nafld1) |>
  ggsurvfit() +
  labs(
    x = 'Days', 
    y = 'Overall survival probability'
  ) + 
  add_confidence_interval() + 
  add_risktable()

Estimate median survival time and at 1.5 years

summary(surv_obj)
##       time          status       
##  Min.   :   7   Min.   :0.00000  
##  1st Qu.:1132   1st Qu.:0.00000  
##  Median :2148   Median :0.00000  
##  Mean   :2411   Mean   :0.07773  
##  3rd Qu.:3353   3rd Qu.:0.00000  
##  Max.   :7268   Max.   :1.00000

Median survival time = 2148 days.

#1.5 years = 365 * 1.5
summary(survfit(Surv(futime, status) ~ 1, data = nafld1), times = 547.5)
## Call: survfit(formula = Surv(futime, status) ~ 1, data = nafld1)
## 
##  time n.risk n.event survival  std.err lower 95% CI upper 95% CI
##   548  16305     246    0.986 0.000909        0.984        0.987

Survival probability at 1.5 years = 98.6%

Plot the survival curves of males and females

survfit(Surv(futime, status) ~ as.factor(male), data = nafld1) |>
  ggsurvfit() +
  labs(
    x = 'Days', 
    y = 'Overall survival probability'
  ) + 
  add_confidence_interval() + 
  add_risktable()

Seems like up until about 6500 days, males had a higher survival probability, but after that they end up with a lower survival probability.

Test the difference between the curves

null hypothesis: The survival curve for males and females are the same

survdiff(Surv(futime, status) ~ as.factor(male), data = nafld1)
## Call:
## survdiff(formula = Surv(futime, status) ~ as.factor(male), data = nafld1)
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## as.factor(male)=0 9348      674      736      5.20      11.3
## as.factor(male)=1 8201      690      628      6.09      11.3
## 
##  Chisq= 11.3  on 1 degrees of freedom, p= 8e-04

p-value is lower than 0.05. We reject null hypothesis. There is a significant difference between males and females survival curve.

Examine each of other covariates (age, weight, height, bmi) and check if any covariate significantly affect survival

library(gtsummary)
coxph(Surv(futime, status) ~ as.factor(male), data = nafld1) |>
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
as.factor(male)


    0
    1 1.20 1.08, 1.33 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
coxph(Surv(futime, status) ~ age, data = nafld1) |>
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
age 1.10 1.10, 1.11 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
coxph(Surv(futime, status) ~ weight, data = nafld1) |>
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
weight 1.0 0.99, 1.00 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
coxph(Surv(futime, status) ~ height, data = nafld1) |>
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
height 0.98 0.97, 0.98 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
coxph(Surv(futime, status) ~ bmi, data = nafld1) |>
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
bmi 1.00 0.99, 1.01 >0.9
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Based on the univariate Cox Proportional Hazard models, male, age, weight and height all do significantly affect survival on their own. The only one that doesn’t is bmi.

Now we will include all variables into one so we can see their multivariate effects. This would resemble a real-world scenario where variables are often correlated.

coxph(Surv(futime, status) ~ as.factor(male) + age + weight + height + bmi, data = nafld1) |>
  tbl_regression(exp = TRUE)
Characteristic HR 95% CI p-value
as.factor(male)


    0
    1 1.73 1.45, 2.07 <0.001
age 1.10 1.10, 1.11 <0.001
weight 1.01 0.99, 1.03 0.4
height 0.98 0.95, 1.00 0.10
bmi 0.99 0.93, 1.06 0.7
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

When accounting for all the variables, only male and age have a significant impact in survival. Being a male or having a higher age show a higher probability of experiencing the event.

Report and provide rationale for the best multivariate Cox model

Since we are about to run step wise selection for our variables, we will ensure that missing values are omitted.

full_cox_model <- coxph(Surv(futime, status) ~ as.factor(male) + age + weight + 
                          height + bmi, 
                        data = nafld1)
summary(full_cox_model)
## Call:
## coxph(formula = Surv(futime, status) ~ as.factor(male) + age + 
##     weight + height + bmi, data = nafld1)
## 
##   n= 12588, number of events= 1018 
##    (4961 observations deleted due to missingness)
## 
##                       coef exp(coef)  se(coef)      z Pr(>|z|)    
## as.factor(male)1  0.550169  1.733547  0.091513  6.012 1.83e-09 ***
## age               0.098330  1.103327  0.002771 35.483  < 2e-16 ***
## weight            0.009734  1.009781  0.012206  0.797    0.425    
## height           -0.022267  0.977979  0.013629 -1.634    0.102    
## bmi              -0.011134  0.988927  0.033234 -0.335    0.738    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## as.factor(male)1    1.7335     0.5769    1.4489     2.074
## age                 1.1033     0.9063    1.0974     1.109
## weight              1.0098     0.9903    0.9859     1.034
## height              0.9780     1.0225    0.9522     1.004
## bmi                 0.9889     1.0112    0.9266     1.055
## 
## Concordance= 0.828  (se = 0.007 )
## Likelihood ratio test= 1758  on 5 df,   p=<2e-16
## Wald test            = 1489  on 5 df,   p=<2e-16
## Score (logrank) test = 1729  on 5 df,   p=<2e-16

As shown earlier in the univariate and multivariate Cox proportional hazards models, only sex and age had a statistically significant effect on survival.

One modeling approach is to retain only those significant variables from the outset. Alternatively, we can iteratively remove non-significant variables, starting with the one with the highest p-value. In our case, that was bmi.

cox_model_1 <- coxph(Surv(futime, status) ~ as.factor(male) + age + weight + 
                          height, 
                        data = nafld1)
summary(cox_model_1)
## Call:
## coxph(formula = Surv(futime, status) ~ as.factor(male) + age + 
##     weight + height, data = nafld1)
## 
##   n= 12588, number of events= 1018 
##    (4961 observations deleted due to missingness)
## 
##                       coef exp(coef)  se(coef)      z Pr(>|z|)    
## as.factor(male)1  0.545900  1.726162  0.090392  6.039 1.55e-09 ***
## age               0.098361  1.103361  0.002769 35.521  < 2e-16 ***
## weight            0.005686  1.005702  0.001832  3.104  0.00191 ** 
## height           -0.017951  0.982209  0.004597 -3.905 9.43e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## as.factor(male)1    1.7262     0.5793    1.4459    2.0607
## age                 1.1034     0.9063    1.0974    1.1094
## weight              1.0057     0.9943    1.0021    1.0093
## height              0.9822     1.0181    0.9734    0.9911
## 
## Concordance= 0.828  (se = 0.007 )
## Likelihood ratio test= 1758  on 4 df,   p=<2e-16
## Wald test            = 1490  on 4 df,   p=<2e-16
## Score (logrank) test = 1729  on 4 df,   p=<2e-16

After removing bmi from the model, the remaining covariates are all statistically significant. This result is also conceptually consistent, as bmi is a derived measure based on height and weight, so its predictive value may be diminished when those components are already included in the model.

With this rationale, we can report the above model as our best multivariate Cox model.