Using the survival package PBC data from the Mayo Clinic’s Primary Biliary Cirrhosis data to demonstrate Cox multivariate regression; r-bloggers original example using the lung dataset
The ‘survival’ R package has multiple sample health data sets. I’ll use the pbc set, which is from the Mayo Clinic Primary Biliary Cirrhosis data. The study was conducted from 1974 to 1984 on 418 patients as a randomized placebo controlled trial of the drug D-penicillamine.
# A tibble: 6 x 20
id time status trt age sex ascites hepato spiders edema
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 400 2 1 58.8 f 1 1 1 1
2 2 4500 0 1 56.4 f 0 1 1 0
3 3 1012 2 1 70.1 m 0 0 0 0.5
4 4 1925 2 1 54.7 f 0 1 1 0.5
5 5 1504 1 2 38.1 f 0 1 1 0
6 6 2503 2 2 66.3 f 0 1 0 0
# ... with 10 more variables: bili <dbl>, chol <dbl>, albumin <dbl>,
# copper <dbl>, alk.phos <dbl>, ast <dbl>, trig <dbl>,
# platelet <dbl>, protime <dbl>, stage <dbl>
# i Use `colnames()` to see all variable names
I’ll perform a Cox regression of time to death on the covariates age, sex, and treatment.
First, I need to recode some variables and remove the non-randomized participants.
# remove NA from trt variable (non-randomized participants)
pbc_data <- pbc_data %>% na.omit(trt)
# sex 1 = female, 2 = male
pbc_data$sex[pbc_data$sex == "f"] <- "1"
pbc_data$sex[pbc_data$sex == "m"] <- "2"
# remove transplant (1) status
pbc_data <- pbc_data %>% filter(!str_detect(status, "1"))
# recode censor (0) status to 1
pbc_data$status[pbc_data$status == "0"] <- "1"
pbc_data$status <- as.numeric(pbc_data$status)
head(pbc_data, 10)
# A tibble: 10 x 20
id time status trt age sex ascites hepato spiders edema
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 400 2 1 58.8 1 1 1 1 1
2 2 4500 1 1 56.4 1 0 1 1 0
3 3 1012 2 1 70.1 2 0 0 0 0.5
4 4 1925 2 1 54.7 1 0 1 1 0.5
5 7 1832 1 2 55.5 1 0 1 0 0
6 8 2466 2 2 53.1 1 0 0 0 0
7 9 2400 2 1 42.5 1 0 0 1 0
8 10 51 2 2 70.6 1 1 0 1 1
9 11 3762 2 2 53.7 1 0 1 1 0
10 12 304 2 2 59.1 1 0 0 1 0
# ... with 10 more variables: bili <dbl>, chol <dbl>, albumin <dbl>,
# copper <dbl>, alk.phos <dbl>, ast <dbl>, trig <dbl>,
# platelet <dbl>, protime <dbl>, stage <dbl>
# i Use `colnames()` to see all variable names
sex:
1 = female
2 = male
trt:
1 = D-penicillmain
2 = placebo
status:
1 = censored
2 = deceased
Call:
coxph(formula = Surv(time, status) ~ age + sex + trt, data = pbc_data)
n= 258, number of events= 111
coef exp(coef) se(coef) z Pr(>|z|)
age 0.038871 1.039637 0.009892 3.930 8.51e-05 ***
sex2 0.330437 1.391576 0.248463 1.330 0.184
trt -0.014984 0.985128 0.192715 -0.078 0.938
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0396 0.9619 1.0197 1.060
sex2 1.3916 0.7186 0.8551 2.265
trt 0.9851 1.0151 0.6752 1.437
Concordance= 0.628 (se = 0.029 )
Likelihood ratio test= 19.88 on 3 df, p=2e-04
Wald test = 20.58 on 3 df, p=1e-04
Score (logrank) test = 20.71 on 3 df, p=1e-04
The p-value for all 3 overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. The omnibus null hypothesis (that all betas (β) are 0) is rejected.
In the multivariate Cox analysis, the covariate age is significant (p < 0.01). However, the covariates sex and trt fail to be significant (p = 0.0.18 and p = 0.94, respectively).
The hazard ratio for age is HR = exp(coef) = 1.039, with a significant p-value, which indicates a strong relationship between age and increased risk of death. Holding the other covariates constant, a higher value for age is associated with poorer prognosis for survival.
# Plot the baseline survival function
ggsurvplot(survfit(res_cox, data = pbc_data), color = "#2E9FDF",
ggtheme = theme_minimal())
Although sex was not determined to have a significant effect on survival when controlling for age and treatment, I’ll demostrate plotting impact of sex on the estimated survival probability.
# Create the new data
sex_df <- with(pbc_data,
data.frame(sex = c(1, 2),
age = rep(mean(age, na.rm = TRUE), 2),
trt = c(1, 1)
))
# Survival curves
fit2 <- survfit(res_cox, newdata = sex_df)
ggsurvplot(fit2, conf_int = TRUE, legend.labs=c("Sex=1", "Sex=2"),
ggtheme = theme_minimal(), data = sex_df)
The above was my adaptation of the following example with the ‘lung’ dataset.
We’ll include the 3 factors (sex, age and ph.ecog) into the multivariate model.
This is direct coding and explanation from r-bloggers; not my own work
A Cox regression of time to death on the time-constant covariates is specified as follow:
head(lung, 10)
inst time status age sex ph.ecog ph.karno pat.karno meal.cal
1 3 306 2 74 1 1 90 100 1175
2 3 455 2 68 1 0 90 90 1225
3 3 1010 1 56 1 0 90 90 NA
4 5 210 2 57 1 1 90 60 1150
5 1 883 2 60 1 0 100 90 NA
6 12 1022 1 74 1 1 50 80 513
7 7 310 2 68 2 2 70 60 384
8 11 361 2 71 2 2 60 80 538
9 1 218 2 53 1 1 70 80 825
10 7 166 2 61 1 2 70 70 271
wt.loss
1 NA
2 15
3 15
4 11
5 0
6 0
7 10
8 1
9 16
10 34
Call:
coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung)
n= 227, number of events= 164
(1 observation deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.011067 1.011128 0.009267 1.194 0.232416
sex -0.552612 0.575445 0.167739 -3.294 0.000986 ***
ph.ecog 0.463728 1.589991 0.113577 4.083 4.45e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0111 0.9890 0.9929 1.0297
sex 0.5754 1.7378 0.4142 0.7994
ph.ecog 1.5900 0.6289 1.2727 1.9864
Concordance= 0.637 (se = 0.025 )
Likelihood ratio test= 30.5 on 3 df, p=1e-06
Wald test = 29.93 on 3 df, p=1e-06
Score (logrank) test = 30.5 on 3 df, p=1e-06
“The p-value for all 3 overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. These tests evaluate the omnibus null hypothesis that all of the betas (β) are 0. In the above example, the test statistics are in close agreement, and the omnibus null hypothesis is soundly rejected.”
“In the multivariate Cox analysis, the covariates sex and ph.ecog are significant (p < 0.05). However, the covariate age fails to be significant (p = 0.23, which is grater than 0.05).”
“The p-value for sex is 0.000986, with a hazard ratio HR = exp(coef) = 0.58, indicating a strong relationship between the patients’ sex and decreased risk of death. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 0.58, or 42%. We conclude that, being female is associated with good prognostic.”
“Similarly, the p-value for ph.ecog is 4.45e-05, with a hazard ratio HR = 1.59, indicating a strong relationship between the ph.ecog value and increased risk of death. Holding the other covariates constant, a higher value of ph.ecog is associated with a poor survival.”
“By contrast, the p-value for age is now p=0.23. The hazard ratio HR = exp(coef) = 1.01, with a 95% confidence interval of 0.99 to 1.03. Because the confidence interval for HR includes 1, these results indicate that age makes a smaller contribution to the difference in the HR after adjusting for the ph.ecog values and patient’s sex, and only trend toward significance. For example, holding the other covariates constant, an additional year of age induce daily hazard of death by a factor of exp(beta) = 1.01, or 1%, which is not a significant contribution.”
# Plot the baseline survival function
ggsurvplot(survfit(res.cox, data = lung), color = "#2E9FDF",
ggtheme = theme_minimal())
To assess the impact of the sex on the estimated survival probability:
# Create new dataset with 2 rows (each sex)
# other covariates are fixed to their average values
sex_df <- with(lung,
data.frame(sex = c(1, 2),
age = rep(mean(age, na.rm = TRUE), 2),
ph.ecog = c(1, 1)
))
sex_df
sex age ph.ecog
1 1 62.44737 1
2 2 62.44737 1
# Survival curves
fit2 <- survfit(res.cox, newdata = sex_df)
ggsurvplot(fit2, conf.int = TRUE, legend.labs=c("Sex=1", "Sex=2"),
ggtheme = theme_minimal(), data = sex_df)
SOURCE: r-bloggers