Loading Library
#load survival package
library(survival)
library(ggsurvfit)
## Loading required package: ggplot2
#load nafld dataset
data(nafld, package = 'survival')
head(nafld1)
## id age male weight height bmi case.id futime status
## 3631 1 57 0 60.0 163 22.69094 10630 6261 0
## 8458 2 67 0 70.4 168 24.88403 14817 624 0
## 6298 3 53 1 105.8 186 30.45354 3 1783 0
## 15398 4 56 1 109.3 170 37.83010 6628 3143 0
## 13261 5 68 1 NA NA NA 1871 1836 1
## 14423 6 39 0 63.9 155 26.61559 7144 1581 0
#perform analysis with nafld1
#for example, survival object would be
surv_obj <- Surv(nafld1$futime, nafld1$status)
survfit(Surv(futime, status) ~ 1, data = nafld1) |>
ggsurvfit() +
labs(
x = 'Days',
y = 'Overall survival probability'
) +
add_confidence_interval() +
add_risktable()
summary(surv_obj)
## time status
## Min. : 7 Min. :0.00000
## 1st Qu.:1132 1st Qu.:0.00000
## Median :2148 Median :0.00000
## Mean :2411 Mean :0.07773
## 3rd Qu.:3353 3rd Qu.:0.00000
## Max. :7268 Max. :1.00000
Median survival time = 2148 days.
#1.5 years = 365 * 1.5
summary(survfit(Surv(futime, status) ~ 1, data = nafld1), times = 547.5)
## Call: survfit(formula = Surv(futime, status) ~ 1, data = nafld1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 548 16305 246 0.986 0.000909 0.984 0.987
Survival probability at 1.5 years = 98.6%
survfit(Surv(futime, status) ~ as.factor(male), data = nafld1) |>
ggsurvfit() +
labs(
x = 'Days',
y = 'Overall survival probability'
) +
add_confidence_interval() +
add_risktable()
Seems like up until about 6500 days, males had a higher survival probability, but after that they end up with a lower survival probability.
null hypothesis: The survival curve for males and females are the same
survdiff(Surv(futime, status) ~ as.factor(male), data = nafld1)
## Call:
## survdiff(formula = Surv(futime, status) ~ as.factor(male), data = nafld1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## as.factor(male)=0 9348 674 736 5.20 11.3
## as.factor(male)=1 8201 690 628 6.09 11.3
##
## Chisq= 11.3 on 1 degrees of freedom, p= 8e-04
p-value is lower than 0.05. We reject null hypothesis. There is a significant difference between males and females survival curve.
library(gtsummary)
coxph(Surv(futime, status) ~ as.factor(male), data = nafld1) |>
tbl_regression(exp = TRUE)
| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| as.factor(male) | |||
| 0 | — | — | |
| 1 | 1.20 | 1.08, 1.33 | <0.001 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
coxph(Surv(futime, status) ~ age, data = nafld1) |>
tbl_regression(exp = TRUE)
| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| age | 1.10 | 1.10, 1.11 | <0.001 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
coxph(Surv(futime, status) ~ weight, data = nafld1) |>
tbl_regression(exp = TRUE)
| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| weight | 1.0 | 0.99, 1.00 | <0.001 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
coxph(Surv(futime, status) ~ height, data = nafld1) |>
tbl_regression(exp = TRUE)
| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| height | 0.98 | 0.97, 0.98 | <0.001 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
coxph(Surv(futime, status) ~ bmi, data = nafld1) |>
tbl_regression(exp = TRUE)
| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| bmi | 1.00 | 0.99, 1.01 | >0.9 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
Based on the univariate Cox Proportional Hazard models,
male, age, weight and
height all do significantly affect survival on their own.
The only one that doesn’t is bmi.
Now we will include all variables into one so we can see their multivariate effects. This would resemble a real-world scenario where variables are often correlated.
coxph(Surv(futime, status) ~ as.factor(male) + age + weight + height + bmi, data = nafld1) |>
tbl_regression(exp = TRUE)
| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| as.factor(male) | |||
| 0 | — | — | |
| 1 | 1.73 | 1.45, 2.07 | <0.001 |
| age | 1.10 | 1.10, 1.11 | <0.001 |
| weight | 1.01 | 0.99, 1.03 | 0.4 |
| height | 0.98 | 0.95, 1.00 | 0.10 |
| bmi | 0.99 | 0.93, 1.06 | 0.7 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
When accounting for all the variables, only male and
age have a significant impact in survival. Being a male or
having a higher age show a higher probability of experiencing the
event.
Since we are about to run step wise selection for our variables, we will ensure that missing values are omitted.
full_cox_model <- coxph(Surv(futime, status) ~ as.factor(male) + age + weight +
height + bmi,
data = nafld1)
summary(full_cox_model)
## Call:
## coxph(formula = Surv(futime, status) ~ as.factor(male) + age +
## weight + height + bmi, data = nafld1)
##
## n= 12588, number of events= 1018
## (4961 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## as.factor(male)1 0.550169 1.733547 0.091513 6.012 1.83e-09 ***
## age 0.098330 1.103327 0.002771 35.483 < 2e-16 ***
## weight 0.009734 1.009781 0.012206 0.797 0.425
## height -0.022267 0.977979 0.013629 -1.634 0.102
## bmi -0.011134 0.988927 0.033234 -0.335 0.738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## as.factor(male)1 1.7335 0.5769 1.4489 2.074
## age 1.1033 0.9063 1.0974 1.109
## weight 1.0098 0.9903 0.9859 1.034
## height 0.9780 1.0225 0.9522 1.004
## bmi 0.9889 1.0112 0.9266 1.055
##
## Concordance= 0.828 (se = 0.007 )
## Likelihood ratio test= 1758 on 5 df, p=<2e-16
## Wald test = 1489 on 5 df, p=<2e-16
## Score (logrank) test = 1729 on 5 df, p=<2e-16
As shown earlier in the univariate and multivariate Cox proportional
hazards models, only sex and
age had a statistically significant effect
on survival.
One modeling approach is to retain only those significant variables
from the outset. Alternatively, we can iteratively remove
non-significant variables, starting with the one with the highest
p-value. In our case, that was bmi.
cox_model_1 <- coxph(Surv(futime, status) ~ as.factor(male) + age + weight +
height,
data = nafld1)
summary(cox_model_1)
## Call:
## coxph(formula = Surv(futime, status) ~ as.factor(male) + age +
## weight + height, data = nafld1)
##
## n= 12588, number of events= 1018
## (4961 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## as.factor(male)1 0.545900 1.726162 0.090392 6.039 1.55e-09 ***
## age 0.098361 1.103361 0.002769 35.521 < 2e-16 ***
## weight 0.005686 1.005702 0.001832 3.104 0.00191 **
## height -0.017951 0.982209 0.004597 -3.905 9.43e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## as.factor(male)1 1.7262 0.5793 1.4459 2.0607
## age 1.1034 0.9063 1.0974 1.1094
## weight 1.0057 0.9943 1.0021 1.0093
## height 0.9822 1.0181 0.9734 0.9911
##
## Concordance= 0.828 (se = 0.007 )
## Likelihood ratio test= 1758 on 4 df, p=<2e-16
## Wald test = 1490 on 4 df, p=<2e-16
## Score (logrank) test = 1729 on 4 df, p=<2e-16
After removing bmi from the model, the remaining
covariates are all statistically significant. This result is also
conceptually consistent, as bmi is a derived measure based
on height and weight, so its predictive value may be diminished when
those components are already included in the model.
With this rationale, we can report the above model as our best multivariate Cox model.