Sample: Cox Proportional Hazard Model

Using the survival package PBC data from the Mayo Clinic’s Primary Biliary Cirrhosis data to demonstrate Cox multivariate regression; r-bloggers original example using the lung dataset

Megan Georges
2022-09-09

Load Sample Data

The ‘survival’ R package has multiple sample health data sets. I’ll use the pbc set, which is from the Mayo Clinic Primary Biliary Cirrhosis data. The study was conducted from 1974 to 1984 on 418 patients as a randomized placebo controlled trial of the drug D-penicillamine.

pbc_data <- read_csv("../Samples/pbc_data.csv")
head(pbc_data)
# A tibble: 6 x 20
     id  time status   trt   age sex   ascites hepato spiders edema
  <dbl> <dbl>  <dbl> <dbl> <dbl> <chr>   <dbl>  <dbl>   <dbl> <dbl>
1     1   400      2     1  58.8 f           1      1       1   1  
2     2  4500      0     1  56.4 f           0      1       1   0  
3     3  1012      2     1  70.1 m           0      0       0   0.5
4     4  1925      2     1  54.7 f           0      1       1   0.5
5     5  1504      1     2  38.1 f           0      1       1   0  
6     6  2503      2     2  66.3 f           0      1       0   0  
# ... with 10 more variables: bili <dbl>, chol <dbl>, albumin <dbl>,
#   copper <dbl>, alk.phos <dbl>, ast <dbl>, trig <dbl>,
#   platelet <dbl>, protime <dbl>, stage <dbl>
# i Use `colnames()` to see all variable names

Multivariate Cox Regression Analysis

I’ll perform a Cox regression of time to death on the covariates age, sex, and treatment.

First, I need to recode some variables and remove the non-randomized participants.

# remove NA from trt variable (non-randomized participants)
pbc_data <- pbc_data %>% na.omit(trt)

# sex 1 = female, 2 = male
pbc_data$sex[pbc_data$sex == "f"] <- "1"
pbc_data$sex[pbc_data$sex == "m"] <- "2"

# remove transplant (1) status
pbc_data <- pbc_data %>% filter(!str_detect(status, "1"))

# recode censor (0) status to 1
pbc_data$status[pbc_data$status == "0"] <- "1"
pbc_data$status <- as.numeric(pbc_data$status)
head(pbc_data, 10)
# A tibble: 10 x 20
      id  time status   trt   age sex   ascites hepato spiders edema
   <dbl> <dbl>  <dbl> <dbl> <dbl> <chr>   <dbl>  <dbl>   <dbl> <dbl>
 1     1   400      2     1  58.8 1           1      1       1   1  
 2     2  4500      1     1  56.4 1           0      1       1   0  
 3     3  1012      2     1  70.1 2           0      0       0   0.5
 4     4  1925      2     1  54.7 1           0      1       1   0.5
 5     7  1832      1     2  55.5 1           0      1       0   0  
 6     8  2466      2     2  53.1 1           0      0       0   0  
 7     9  2400      2     1  42.5 1           0      0       1   0  
 8    10    51      2     2  70.6 1           1      0       1   1  
 9    11  3762      2     2  53.7 1           0      1       1   0  
10    12   304      2     2  59.1 1           0      0       1   0  
# ... with 10 more variables: bili <dbl>, chol <dbl>, albumin <dbl>,
#   copper <dbl>, alk.phos <dbl>, ast <dbl>, trig <dbl>,
#   platelet <dbl>, protime <dbl>, stage <dbl>
# i Use `colnames()` to see all variable names

sex:

trt:

status:

res_cox <- coxph(Surv(time, status) ~ age + sex + trt, data =  pbc_data)
summary(res_cox)
Call:
coxph(formula = Surv(time, status) ~ age + sex + trt, data = pbc_data)

  n= 258, number of events= 111 

          coef exp(coef)  se(coef)      z Pr(>|z|)    
age   0.038871  1.039637  0.009892  3.930 8.51e-05 ***
sex2  0.330437  1.391576  0.248463  1.330    0.184    
trt  -0.014984  0.985128  0.192715 -0.078    0.938    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     exp(coef) exp(-coef) lower .95 upper .95
age     1.0396     0.9619    1.0197     1.060
sex2    1.3916     0.7186    0.8551     2.265
trt     0.9851     1.0151    0.6752     1.437

Concordance= 0.628  (se = 0.029 )
Likelihood ratio test= 19.88  on 3 df,   p=2e-04
Wald test            = 20.58  on 3 df,   p=1e-04
Score (logrank) test = 20.71  on 3 df,   p=1e-04

The p-value for all 3 overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. The omnibus null hypothesis (that all betas (β) are 0) is rejected.

In the multivariate Cox analysis, the covariate age is significant (p < 0.01). However, the covariates sex and trt fail to be significant (p = 0.0.18 and p = 0.94, respectively).

The hazard ratio for age is HR = exp(coef) = 1.039, with a significant p-value, which indicates a strong relationship between age and increased risk of death. Holding the other covariates constant, a higher value for age is associated with poorer prognosis for survival.

Visualizing the Estimated Distribution of Survival Times

# Plot the baseline survival function
ggsurvplot(survfit(res_cox, data = pbc_data), color = "#2E9FDF",
           ggtheme = theme_minimal())

Although sex was not determined to have a significant effect on survival when controlling for age and treatment, I’ll demostrate plotting impact of sex on the estimated survival probability.

# Create the new data  
sex_df <- with(pbc_data,
            data.frame(sex = c(1, 2), 
            age = rep(mean(age, na.rm = TRUE), 2),
            trt = c(1, 1)
))

# Survival curves
fit2 <- survfit(res_cox, newdata = sex_df)
ggsurvplot(fit2, conf_int = TRUE, legend.labs=c("Sex=1", "Sex=2"),
           ggtheme = theme_minimal(), data = sex_df)

Original Example with Multivariate Cox Regression Analysis

The above was my adaptation of the following example with the ‘lung’ dataset.

We’ll include the 3 factors (sex, age and ph.ecog) into the multivariate model.

This is direct coding and explanation from r-bloggers; not my own work

A Cox regression of time to death on the time-constant covariates is specified as follow:

head(lung, 10)
   inst time status age sex ph.ecog ph.karno pat.karno meal.cal
1     3  306      2  74   1       1       90       100     1175
2     3  455      2  68   1       0       90        90     1225
3     3 1010      1  56   1       0       90        90       NA
4     5  210      2  57   1       1       90        60     1150
5     1  883      2  60   1       0      100        90       NA
6    12 1022      1  74   1       1       50        80      513
7     7  310      2  68   2       2       70        60      384
8    11  361      2  71   2       2       60        80      538
9     1  218      2  53   1       1       70        80      825
10    7  166      2  61   1       2       70        70      271
   wt.loss
1       NA
2       15
3       15
4       11
5        0
6        0
7       10
8        1
9       16
10      34
res.cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data =  lung)
summary(res.cox)
Call:
coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung)

  n= 227, number of events= 164 
   (1 observation deleted due to missingness)

             coef exp(coef)  se(coef)      z Pr(>|z|)    
age      0.011067  1.011128  0.009267  1.194 0.232416    
sex     -0.552612  0.575445  0.167739 -3.294 0.000986 ***
ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0111     0.9890    0.9929    1.0297
sex        0.5754     1.7378    0.4142    0.7994
ph.ecog    1.5900     0.6289    1.2727    1.9864

Concordance= 0.637  (se = 0.025 )
Likelihood ratio test= 30.5  on 3 df,   p=1e-06
Wald test            = 29.93  on 3 df,   p=1e-06
Score (logrank) test = 30.5  on 3 df,   p=1e-06

Visualizing the Estimated Distribution of Survival Times

# Plot the baseline survival function
ggsurvplot(survfit(res.cox, data = lung), color = "#2E9FDF",
           ggtheme = theme_minimal())

To assess the impact of the sex on the estimated survival probability:

# Create new dataset with 2 rows (each sex)
# other covariates are fixed to their average values
sex_df <- with(lung,
            data.frame(sex = c(1, 2), 
            age = rep(mean(age, na.rm = TRUE), 2),
            ph.ecog = c(1, 1)
))
sex_df
  sex      age ph.ecog
1   1 62.44737       1
2   2 62.44737       1
# Survival curves
fit2 <- survfit(res.cox, newdata = sex_df)
ggsurvplot(fit2, conf.int = TRUE, legend.labs=c("Sex=1", "Sex=2"),
           ggtheme = theme_minimal(), data = sex_df)

SOURCE: r-bloggers