Introduction

In my role my as a research data analyst, I manage healthcare data and create insights to help physicians predict patient outcomes. Survival analysis is one of the tools I use to quantify oncology data. Survival analysis estimates the probability of survival over time. This is important in healthcare for understanding disease progression and treatment effectiveness. Survival analysis plays a crucial role in oncology where time-to-time outcomes, determining sample size, defining endpoints, and treatment efficacy influences care initiatives. The purpose of this project is to demonstrate important features of survival analysis, understand data output, and communicate insights from the mgus2 data set.

Packages

knitr::opts_chunk$set(echo = TRUE)
require(formattable)
## Loading required package: formattable
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'formattable'
require(RPostgres)
## Loading required package: RPostgres
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'RPostgres'
require(DT)
## Loading required package: DT
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'DT'
require(lubridate)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
require(knitr)
## Loading required package: knitr
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
require(ggVennDiagram)
## Loading required package: ggVennDiagram
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'ggVennDiagram'
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr   1.1.2     ✔ stringr 1.5.0
## ✔ forcats 1.0.0     ✔ tibble  3.2.1
## ✔ purrr   1.0.1     ✔ tidyr   1.3.0
## ✔ readr   2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(survival)
library(survminer)
## Loading required package: ggpubr
## 
## Attaching package: 'survminer'
## 
## The following object is masked from 'package:survival':
## 
##     myeloma
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.2
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggsurvfit)
## Warning: package 'ggsurvfit' was built under R version 4.3.3
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.3.3
library(tidycmprsk)
## Warning: package 'tidycmprsk' was built under R version 4.3.3
## 
## Attaching package: 'tidycmprsk'
## 
## The following object is masked from 'package:gtsummary':
## 
##     trial

Data description

The data() function will pull a list of built in R data sets. The data used in this project is from the survival package. The (Monoclonal gammopathy) data set depicts the natural history of 1341 sequential patients with monoclonal gammopathy of undetermined significance (MGUS).

data(cancer, package="survival")

view(mgus2)

The mgus2 data set contains the following variables:

id: subject identifier

age: age at diagnosis, in years

sex: a factor with levels F M

dxyr: year of diagnosis

hgb: hemoglobin

creat: creatinine

mspike: size of the monoclonal serum splike

ptime: time until progression to a plasma cell malignancy (PCM) or last contact, in months

pstat: occurrence of PCM: 0=no, 1=yes

futime: time until death or last contact, in months

death: occurrence of death: 0=no, 1=yes

Objectives

Data Overview

head(mgus2)
##   id age sex dxyr  hgb creat mspike ptime pstat futime death
## 1  1  88   F 1981 13.1   1.3    0.5    30     0     30     1
## 2  2  78   F 1968 11.5   1.2    2.0    25     0     25     1
## 3  3  94   M 1980 10.5   1.5    2.6    46     0     46     1
## 4  4  68   M 1977 15.2   1.2    1.2    92     0     92     1
## 5  5  90   F 1973 10.7   0.8    1.0     8     0      8     1
## 6  6  90   M 1990 12.9   1.0    0.5     4     0      4     1
summary(mgus2)
##        id              age        sex          dxyr           hgb      
##  Min.   :   1.0   Min.   :24.00   F:631   Min.   :1960   Min.   : 5.7  
##  1st Qu.: 346.8   1st Qu.:63.00   M:753   1st Qu.:1980   1st Qu.:12.2  
##  Median : 692.5   Median :72.00           Median :1984   Median :13.5  
##  Mean   : 692.5   Mean   :70.42           Mean   :1983   Mean   :13.3  
##  3rd Qu.:1038.2   3rd Qu.:79.00           3rd Qu.:1988   3rd Qu.:14.7  
##  Max.   :1384.0   Max.   :96.00           Max.   :1994   Max.   :18.9  
##                                                          NA's   :13    
##      creat            mspike          ptime            pstat        
##  Min.   : 0.400   Min.   :0.000   Min.   :  1.00   Min.   :0.00000  
##  1st Qu.: 0.900   1st Qu.:0.600   1st Qu.: 37.00   1st Qu.:0.00000  
##  Median : 1.100   Median :1.200   Median : 81.00   Median :0.00000  
##  Mean   : 1.292   Mean   :1.164   Mean   : 93.54   Mean   :0.08309  
##  3rd Qu.: 1.300   3rd Qu.:1.500   3rd Qu.:136.25   3rd Qu.:0.00000  
##  Max.   :22.000   Max.   :3.000   Max.   :424.00   Max.   :1.00000  
##  NA's   :30       NA's   :11                                        
##      futime          death       
##  Min.   :  1.0   Min.   :0.0000  
##  1st Qu.: 40.0   1st Qu.:0.0000  
##  Median : 84.0   Median :1.0000  
##  Mean   : 95.8   Mean   :0.6958  
##  3rd Qu.:139.0   3rd Qu.:1.0000  
##  Max.   :424.0   Max.   :1.0000  
## 
str(mgus2)
## 'data.frame':    1384 obs. of  11 variables:
##  $ id    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ age   : num  88 78 94 68 90 90 89 87 86 79 ...
##   ..- attr(*, "label")= chr "AGE AT mgus_sp"
##  $ sex   : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Sex"
##   ..- attr(*, "format")= chr "$desex"
##  $ dxyr  : num  1981 1968 1980 1977 1973 ...
##  $ hgb   : num  13.1 11.5 10.5 15.2 10.7 12.9 10.5 12.3 14.5 9.4 ...
##  $ creat : num  1.3 1.2 1.5 1.2 0.8 1 0.9 1.2 0.9 1.1 ...
##  $ mspike: num  0.5 2 2.6 1.2 1 0.5 1.3 1.6 2.4 2.3 ...
##   ..- attr(*, "label")= chr "SEP SPIKE AT mgus_sp"
##  $ ptime : num  30 25 46 92 8 4 151 2 57 136 ...
##  $ pstat : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ futime: num  30 25 46 92 8 4 151 2 57 136 ...
##  $ death : num  1 1 1 1 1 1 1 1 0 1 ...

VIMpackage: missing values and imputation tools.

library("VIM")
## Warning: package 'VIM' was built under R version 4.3.3
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
#show where there are missing values
df <- mgus2
sapply(df, function(x) sum(is.na(x)))
##     id    age    sex   dxyr    hgb  creat mspike  ptime  pstat futime  death 
##      0      0      0      0     13     30     11      0      0      0      0
#plot the missing values
df_miss = aggr(df, numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=3, ylab=c("Proportion of missingness","Missingness Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable       Count
##     creat 0.021676301
##       hgb 0.009393064
##    mspike 0.007947977
##        id 0.000000000
##       age 0.000000000
##       sex 0.000000000
##      dxyr 0.000000000
##     ptime 0.000000000
##     pstat 0.000000000
##    futime 0.000000000
##     death 0.000000000

This project will not be addressing the missing values of this data set beyond the visualizations above.

Survival Curve

Overall Survival

To begin, we need to turn the dxyr (diagnosis year) into a full ymd date format, and convert the ptime and futime from months to years.

#format dates
df <- mgus2 |>
  mutate(dxyr = make_date(year = dxyr, month = 1, day = 1))

head(df)
##   id age sex       dxyr  hgb creat mspike ptime pstat futime death
## 1  1  88   F 1981-01-01 13.1   1.3    0.5    30     0     30     1
## 2  2  78   F 1968-01-01 11.5   1.2    2.0    25     0     25     1
## 3  3  94   M 1980-01-01 10.5   1.5    2.6    46     0     46     1
## 4  4  68   M 1977-01-01 15.2   1.2    1.2    92     0     92     1
## 5  5  90   F 1973-01-01 10.7   0.8    1.0     8     0      8     1
## 6  6  90   M 1990-01-01 12.9   1.0    0.5     4     0      4     1
# change the df layout: as_tibble(df)

df <- as_tibble(df)

df
## # A tibble: 1,384 × 11
##       id   age sex   dxyr         hgb creat mspike ptime pstat futime death
##    <dbl> <dbl> <fct> <date>     <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>
##  1     1    88 F     1981-01-01  13.1   1.3    0.5    30     0     30     1
##  2     2    78 F     1968-01-01  11.5   1.2    2      25     0     25     1
##  3     3    94 M     1980-01-01  10.5   1.5    2.6    46     0     46     1
##  4     4    68 M     1977-01-01  15.2   1.2    1.2    92     0     92     1
##  5     5    90 F     1973-01-01  10.7   0.8    1       8     0      8     1
##  6     6    90 M     1990-01-01  12.9   1      0.5     4     0      4     1
##  7     7    89 F     1974-01-01  10.5   0.9    1.3   151     0    151     1
##  8     8    87 F     1974-01-01  12.3   1.2    1.6     2     0      2     1
##  9     9    86 F     1994-01-01  14.5   0.9    2.4    57     0     57     0
## 10    10    79 F     1981-01-01   9.4   1.1    2.3   136     0    136     1
## # ℹ 1,374 more rows
#convert `ptime` and `futime` into years
df <- df |>
  mutate(ptime = ptime/12) |>
  mutate(futime = futime/12)

#find overall survival data
df <-
  df %>% 
  mutate(
    os_yrs = as.duration(ptime %--% futime) / dyears(1)
  )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `os_yrs = as.duration(ptime %--% futime)/dyears(1)`.
## Caused by warning:
## ! tz(): Don't know how to compute timezone for object of class numeric; returning "UTC".
df
## # A tibble: 1,384 × 12
##       id   age sex   dxyr         hgb creat mspike  ptime pstat futime death
##    <dbl> <dbl> <fct> <date>     <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1     1    88 F     1981-01-01  13.1   1.3    0.5  2.5       0  2.5       1
##  2     2    78 F     1968-01-01  11.5   1.2    2    2.08      0  2.08      1
##  3     3    94 M     1980-01-01  10.5   1.5    2.6  3.83      0  3.83      1
##  4     4    68 M     1977-01-01  15.2   1.2    1.2  7.67      0  7.67      1
##  5     5    90 F     1973-01-01  10.7   0.8    1    0.667     0  0.667     1
##  6     6    90 M     1990-01-01  12.9   1      0.5  0.333     0  0.333     1
##  7     7    89 F     1974-01-01  10.5   0.9    1.3 12.6       0 12.6       1
##  8     8    87 F     1974-01-01  12.3   1.2    1.6  0.167     0  0.167     1
##  9     9    86 F     1994-01-01  14.5   0.9    2.4  4.75      0  4.75      0
## 10    10    79 F     1981-01-01   9.4   1.1    2.3 11.3       0 11.3       1
## # ℹ 1,374 more rows
## # ℹ 1 more variable: os_yrs <dbl>
#create the survival object (fist 10 observations)

Surv(df$ptime, df$pstat)[1:10]
##  [1]  2.5000000+  2.0833333+  3.8333333+  7.6666667+  0.6666667+  0.3333333+
##  [7] 12.5833333+  0.1666667+  4.7500000+ 11.3333333+

Understanding our output: These are observations of time until progression to a plasma cell malignancy (PCM) or last contact, in years. We see that subject 1 had an event at time 2.50 years, subject 2 had an event at time 2.08 years, subject 3 was censored at time 3.83 years, etc.

#calculate overall survival

s1 <- survfit(Surv(futime, death) ~ 1, data = df)

str(s1)
## List of 16
##  $ n        : int 1384
##  $ time     : num [1:272] 0.0833 0.1667 0.25 0.3333 0.4167 ...
##  $ n.risk   : num [1:272] 1384 1341 1313 1298 1282 ...
##  $ n.event  : num [1:272] 42 28 15 16 10 9 6 13 9 10 ...
##  $ n.censor : num [1:272] 1 0 0 0 0 0 0 0 0 1 ...
##  $ surv     : num [1:272] 0.97 0.949 0.939 0.927 0.92 ...
##  $ std.err  : num [1:272] 0.00476 0.00621 0.00688 0.00755 0.00794 ...
##  $ cumhaz   : num [1:272] 0.0303 0.0512 0.0627 0.075 0.0828 ...
##  $ std.chaz : num [1:272] 0.00468 0.00612 0.0068 0.00746 0.00786 ...
##  $ type     : chr "right"
##  $ logse    : logi TRUE
##  $ conf.int : num 0.95
##  $ conf.type: chr "log"
##  $ lower    : num [1:272] 0.961 0.938 0.926 0.913 0.906 ...
##  $ upper    : num [1:272] 0.979 0.961 0.951 0.941 0.934 ...
##  $ call     : language survfit(formula = Surv(futime, death) ~ 1, data = df)
##  - attr(*, "class")= chr "survfit"

Understanding our output: When reading the output it is important to understand that time: the time-points at which the curve has a step, i.e. at least one event occurred surv: the estimate of survival at the corresponding time.

Kaplan-Meier plots

survfit2(Surv(futime, death) ~ 1, data = df) %>% 
  ggsurvfit() +
  labs(
    x = "Years",
    y = "Overall survival probability",
    title = "Overall Survival: mgus2",
           ) + 
  add_confidence_interval() +
  add_risktable()

Estimating x-year survival

Estimating the probability of survival in a specific amount of time is important for clinical trials to asses major events.

#estimate survival in 1 year

summary(survfit(Surv(futime, death) ~ 1, data = df), times = 1)
## Call: survfit(formula = Surv(futime, death) ~ 1, data = df)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1   1215     173    0.875  0.0089        0.858        0.893

1-year probability of survival in this study is 87.5%. The associated lower and upper and lower bounds of the 95% confidence interval are also displayed. The 1-year survival probability is the point on the y-axis that corresponds to 1 year on the x-axis for the survival curve.

#produce a table

survfit(Surv(futime, death) ~ 1, data = df) %>% 
  tbl_survfit(
    times = 1,
    label_header = "**1-year survival (95% CI)**"
  )
Characteristic 1-year survival (95% CI)
Overall 87% (86%, 89%)

Estimating median survival time

survfit(Surv(futime, death) ~ 1, data = df)
## Call: survfit(formula = Surv(futime, death) ~ 1, data = df)
## 
##         n events median 0.95LCL 0.95UCL
## [1,] 1384    963   8.17    7.67    8.58
#produce table

survfit(Surv(futime, death) ~ 1, data = df) %>% 
  tbl_survfit(
    probs = 0.5,
    label_header = "**Median survival (95% CI)**"
  )
Characteristic Median survival (95% CI)
Overall 8.2 (7.7, 8.6)

Understanding our output: We see the median survival time is 8.2 years.

Comparing survival times between groups

survdiff tests if there is a difference between two or more survival curves using the G-rho family of tests, or for a single curve against a known alternative.

Sex

survdiff(Surv(futime, death) ~ sex, data = df)
## Call:
## survdiff(formula = Surv(futime, death) ~ sex, data = df)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=F 631      423      471      4.88      9.67
## sex=M 753      540      492      4.67      9.67
## 
##  Chisq= 9.7  on 1 degrees of freedom, p= 0.002
#plot the comparison 

survfit2(Surv(futime, death) ~ sex, data = df) %>% 
  ggsurvfit() +
  labs(
    x = "Years",
    y = "Overall survival probability",
    title = "Male vs. Female: mgus2",
           ) + 
  add_confidence_interval() +
  add_risktable()

Female patients do slightly better than male patients.

Age

#Show age distribution

summary(df$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.00   63.00   72.00   70.42   79.00   96.00
#Create age bins

df <- df %>%
  mutate(
    age_group = dplyr::case_when(
    age <= 63            ~ "24-63",
    age > 63 & age <= 72 ~ "63-72",
    age > 72 & age <= 79 ~ "72-79",
    age > 79             ~ "79-96"
  ),
    # Convert to factor
    age_group = factor(
      age_group,
      level = c("24-63", "63-72", "72-79", "79-96" )
    )
  )

df
## # A tibble: 1,384 × 13
##       id   age sex   dxyr         hgb creat mspike  ptime pstat futime death
##    <dbl> <dbl> <fct> <date>     <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1     1    88 F     1981-01-01  13.1   1.3    0.5  2.5       0  2.5       1
##  2     2    78 F     1968-01-01  11.5   1.2    2    2.08      0  2.08      1
##  3     3    94 M     1980-01-01  10.5   1.5    2.6  3.83      0  3.83      1
##  4     4    68 M     1977-01-01  15.2   1.2    1.2  7.67      0  7.67      1
##  5     5    90 F     1973-01-01  10.7   0.8    1    0.667     0  0.667     1
##  6     6    90 M     1990-01-01  12.9   1      0.5  0.333     0  0.333     1
##  7     7    89 F     1974-01-01  10.5   0.9    1.3 12.6       0 12.6       1
##  8     8    87 F     1974-01-01  12.3   1.2    1.6  0.167     0  0.167     1
##  9     9    86 F     1994-01-01  14.5   0.9    2.4  4.75      0  4.75      0
## 10    10    79 F     1981-01-01   9.4   1.1    2.3 11.3       0 11.3       1
## # ℹ 1,374 more rows
## # ℹ 2 more variables: os_yrs <dbl>, age_group <fct>
survdiff(Surv(futime, death) ~ age_group, data = df)
## Call:
## survdiff(formula = Surv(futime, death) ~ age_group, data = df)
## 
##                   N Observed Expected (O-E)^2/E (O-E)^2/V
## age_group=24-63 357      163      345     95.69     159.3
## age_group=63-72 354      227      276      8.65      12.3
## age_group=72-79 342      277      201     28.87      37.6
## age_group=79-96 331      296      142    168.05     206.8
## 
##  Chisq= 330  on 3 degrees of freedom, p= <2e-16
#plot the comparison 

survfit2(Surv(futime, death) ~ age_group, data = df) %>% 
  ggsurvfit() +
  labs(
    x = "Years",
    y = "Overall survival probability",
    title = "Age Group: mgus2",
           ) + 
  add_confidence_interval() +
  add_risktable()

This visualization shows that as age increases, survival decreases.

The Cox regression model

The Cox regression model quantifies the survival between patient groups by modeling the effect of multiple variables at once. This model can adjust for confounding effects of other variables using the coxph() function which uses the same syntax as lm(), glm(), etc. The response variable you create with Surv() goes on the left hand side of the formula, specified with a ~.

cox <- coxph(Surv(futime, death) ~ age_group + sex + mspike, data =  df)

summary(cox)
## Call:
## coxph(formula = Surv(futime, death) ~ age_group + sex + mspike, 
##     data = df)
## 
##   n= 1373, number of events= 957 
##    (11 observations deleted due to missingness)
## 
##                   coef exp(coef) se(coef)      z Pr(>|z|)    
## age_group63-72 0.65987   1.93454  0.10485  6.293 3.10e-10 ***
## age_group72-79 1.28162   3.60247  0.10435 12.282  < 2e-16 ***
## age_group79-96 1.75523   5.78478  0.10587 16.579  < 2e-16 ***
## sexM           0.33454   1.39729  0.06573  5.089 3.59e-07 ***
## mspike         0.04446   1.04547  0.06000  0.741    0.459    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## age_group63-72     1.935     0.5169    1.5752     2.376
## age_group72-79     3.602     0.2776    2.9362     4.420
## age_group79-96     5.785     0.1729    4.7008     7.119
## sexM               1.397     0.7157    1.2284     1.589
## mspike             1.045     0.9565    0.9295     1.176
## 
## Concordance= 0.656  (se = 0.01 )
## Likelihood ratio test= 346  on 5 df,   p=<2e-16
## Wald test            = 322.3  on 5 df,   p=<2e-16
## Score (logrank) test = 359.2  on 5 df,   p=<2e-16

Statistical significance: The column marked z gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates, whether the beta (β) coefficient of a given variable is statistically significantly different from 0. We can conclude that sex and age_group are highly statistically significant. However, the size of the monoclonal serum splike is not significant.

The regression coefficients: The second feature to note in the Cox model results is the the sign of the regression coefficients (coef). A positive sign means that the hazard (risk of death) is higher, and thus the prognosis worse, for subjects with higher values of that variable. The R summary for the Cox model gives the hazard ratio (HR) for the second group relative to the first group, that is, female versus male. This indicates that females have lower risk of death (higher survival rates) than males.

Hazard ratios: The exponentiated coefficients of sexM (exp(coef) = exp(0.33) = 1.40), also known as hazard ratios, give the effect size of covariates.

Confidence intervals of the hazard ratios: For example, (mspike) also gives upper and lower 95% confidence intervals for the hazard ratio (exp(coef)) = 1.05, lower 95% bound = 0.93, upper 95% bound = 1.18.

Global statistical significance of the model: The output gives p-values for three alternative tests for overall significance of the model: The likelihood-ratio test, Wald test, and score log rank statistics. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The Likelihood ratio test has better behavior for small sample sizes, so it is generally preferred.

Likelihood-ratio test: A likelihood ratio test compares the goodness of fit of two nested regression models based on the predictor variables in the overall regression model. Thelikelihood-ratio test = 346, p = <2e-16.

coxph(Surv(futime, death) ~ age_group + sex + mspike, data =  df) |>
  tbl_regression(exp = TRUE) 
Characteristic HR1 95% CI1 p-value
age_group


    24-63
    63-72 1.93 1.58, 2.38 <0.001
    72-79 3.60 2.94, 4.42 <0.001
    79-96 5.78 4.70, 7.12 <0.001
Sex


    F
    M 1.40 1.23, 1.59 <0.001
SEP SPIKE AT mgus_sp 1.05 0.93, 1.18 0.5
1 HR = Hazard Ratio, CI = Confidence Interval

The quantity of interest from a Cox regression model is a hazard ratio (HR). The HR represents the ratio of hazards between two groups at any particular point in time.

sexM: HR = 1.42 implies that males have a 1.42 times higher risk of death in relation to females.

age_group63-72: 1.93 times higher risk of death in relation to age_group24-63.

age_group72-79: 3.60 times higher risk of death in relation to age_group24-63.

age_group79-96: 5.78 times higher risk of death in relation to age_group24-63.

Assessing proportional hazards

cox.zph(): Tests the proportional hazards assumption for the Cox regression model. This is done by testing for an interaction effect between the covariate and log(time).

Interpretation: A significant p-value indicates that the proportional hazards assumption is violated.

Plots of the Schoenfeld residuals: Deviation from a zero-slope line is evidence that the proportional hazards assumption is violated.

mv_fit <- coxph(Surv(futime, death) ~ sex + age_group + mspike, data = df)
cz <- cox.zph(mv_fit)
print(cz)
##           chisq df       p
## sex        1.01  1    0.32
## age_group 30.86  3 9.1e-07
## mspike     1.13  1    0.29
## GLOBAL    33.53  5 3.0e-06
plot(cz)

Covariates with p-values >0.05, do not reject the null hypothesis. Therefore, the proportional hazards assumption is satisfied for each individual covariate, and also for the model overall.

Smooth survival plot

In statistics, a continuous variable is a quantitative variable that may be continuous or discrete if they are typically obtained by measuring or counting, respectively (ex. age). To visualize a survival estimate according to a continuous variable. The sm.survival function from the sm package allows you to do this for a quantile of the distribution of survival data. The default quantile is p = 0.5 for median survival.

#install.packages("sm")
library(sm)
## Warning: package 'sm' was built under R version 4.3.3
## Package 'sm', version 2.2-6.0: type help(sm) for summary information
sm.options(
  list(
    xlab = "Age (years)",
    ylab = "Median time to death (years)")
  )

sm.survival(
  x = df$age,
  y = df$futime,
  status = df$death,
  h = (1/6) * sd(df$age) / nrow(df)^(-1/4)
)

#play with the `h` ratio to get a median line that is not too smooth (start with 1/2)

Progression Analysis

Survival analysis tools can also be used when understanding disease progression. This is especially important for tumor treatment assesment.

Overall progression

survfit2(Surv(ptime, pstat) ~ 1, data = df) %>% 
  ggsurvfit() +
  labs(
    x = "Years",
    y = "Overall progression probability",
    title = "Overall Progession: mgus2",
           ) + 
  add_confidence_interval() +
  add_risktable()

#estimate median time to progression in years
survfit(Surv(ptime, pstat) ~ 1, data = df) %>% 
  tbl_survfit(
    probs = 0.5,
    label_header = "**Median progression (95% CI)**"
  )
Characteristic Median progression (95% CI)
Overall 31 (28, —)

Quantifying variables

hgb: hemoglobin

creat: creatinine

mspike: size of the monoclonal serum splike

cox_prog <- coxph(Surv(ptime, pstat) ~  hgb + creat + mspike, data =  df)

summary(cox_prog)
## Call:
## coxph(formula = Surv(ptime, pstat) ~ hgb + creat + mspike, data = df)
## 
##   n= 1338, number of events= 112 
##    (46 observations deleted due to missingness)
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)    
## hgb    -0.14090   0.86857  0.05109 -2.758  0.00582 ** 
## creat  -0.11778   0.88889  0.15144 -0.778  0.43670    
## mspike  0.90718   2.47733  0.16557  5.479 4.28e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## hgb       0.8686     1.1513    0.7858     0.960
## creat     0.8889     1.1250    0.6606     1.196
## mspike    2.4773     0.4037    1.7908     3.427
## 
## Concordance= 0.665  (se = 0.031 )
## Likelihood ratio test= 37.01  on 3 df,   p=5e-08
## Wald test            = 38.78  on 3 df,   p=2e-08
## Score (logrank) test = 39.43  on 3 df,   p=1e-08

We will explore hgb and mspike because creatine is not a significant factor when determining progression.

#check distribution
summary(df$hgb)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     5.7    12.2    13.5    13.3    14.7    18.9      13
summary(df$mspike)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.600   1.200   1.164   1.500   3.000      11
#Create `hgb` bins

df_prog <- df %>%
  mutate(
    hgb_group = dplyr::case_when(
    hgb <= 12.2            ~ "5.7-12.2",
    hgb > 12.2 & hgb <= 13.5 ~ "12.2-13.5",
    hgb > 13.5 & hgb <= 14.7 ~ "13.5-14.7",
    hgb > 14.7             ~ "14.7-18.9"
  ),
    # Convert to factor
    hgb_group = factor(
      hgb_group,
      level = c("5.7-12.2", "12.2-13.5", "13.5-14.7", "14.7-18.9" )
    )
  )

df_prog
## # A tibble: 1,384 × 14
##       id   age sex   dxyr         hgb creat mspike  ptime pstat futime death
##    <dbl> <dbl> <fct> <date>     <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1     1    88 F     1981-01-01  13.1   1.3    0.5  2.5       0  2.5       1
##  2     2    78 F     1968-01-01  11.5   1.2    2    2.08      0  2.08      1
##  3     3    94 M     1980-01-01  10.5   1.5    2.6  3.83      0  3.83      1
##  4     4    68 M     1977-01-01  15.2   1.2    1.2  7.67      0  7.67      1
##  5     5    90 F     1973-01-01  10.7   0.8    1    0.667     0  0.667     1
##  6     6    90 M     1990-01-01  12.9   1      0.5  0.333     0  0.333     1
##  7     7    89 F     1974-01-01  10.5   0.9    1.3 12.6       0 12.6       1
##  8     8    87 F     1974-01-01  12.3   1.2    1.6  0.167     0  0.167     1
##  9     9    86 F     1994-01-01  14.5   0.9    2.4  4.75      0  4.75      0
## 10    10    79 F     1981-01-01   9.4   1.1    2.3 11.3       0 11.3       1
## # ℹ 1,374 more rows
## # ℹ 3 more variables: os_yrs <dbl>, age_group <fct>, hgb_group <fct>
#Create `mspike` bins

df_prog <- df_prog %>%
  mutate(
    mspike_group = dplyr::case_when(
    mspike <= 0.6            ~ "0-0.6",
    mspike > 0.6 & mspike <= 1.2 ~ "0.6-1.2",
    mspike > 1.2 & mspike <= 1.5 ~ "1.2-1.5",
    mspike > 1.5             ~ "1.5-3.0"
  ),
    # Convert to factor
    mspike_group = factor(
      mspike_group,
      level = c("0-0.6", "0.6-1.2", "1.2-1.5", "1.5-3.0" )
    )
  )

df_prog
## # A tibble: 1,384 × 15
##       id   age sex   dxyr         hgb creat mspike  ptime pstat futime death
##    <dbl> <dbl> <fct> <date>     <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1     1    88 F     1981-01-01  13.1   1.3    0.5  2.5       0  2.5       1
##  2     2    78 F     1968-01-01  11.5   1.2    2    2.08      0  2.08      1
##  3     3    94 M     1980-01-01  10.5   1.5    2.6  3.83      0  3.83      1
##  4     4    68 M     1977-01-01  15.2   1.2    1.2  7.67      0  7.67      1
##  5     5    90 F     1973-01-01  10.7   0.8    1    0.667     0  0.667     1
##  6     6    90 M     1990-01-01  12.9   1      0.5  0.333     0  0.333     1
##  7     7    89 F     1974-01-01  10.5   0.9    1.3 12.6       0 12.6       1
##  8     8    87 F     1974-01-01  12.3   1.2    1.6  0.167     0  0.167     1
##  9     9    86 F     1994-01-01  14.5   0.9    2.4  4.75      0  4.75      0
## 10    10    79 F     1981-01-01   9.4   1.1    2.3 11.3       0 11.3       1
## # ℹ 1,374 more rows
## # ℹ 4 more variables: os_yrs <dbl>, age_group <fct>, hgb_group <fct>,
## #   mspike_group <fct>
#plot the comparison 

survfit2(Surv(ptime, pstat) ~ hgb_group, data = df_prog) %>% 
  ggsurvfit() +
  labs(
    x = "Years",
    y = "Overall progression probability",
    title = "Hemoglobin: mgus2",
           ) + 
  add_confidence_interval() +
  add_risktable()

survfit2(Surv(ptime, pstat) ~ mspike_group, data = df_prog) %>% 
  ggsurvfit() +
  labs(
    x = "Years",
    y = "Overall progression probability",
    title = "size of the monoclonal serum splike: mgus2",
           ) + 
  add_confidence_interval() +
  add_risktable()

coxph(Surv(ptime, pstat) ~ hgb_group + mspike_group, data =  df_prog) |>
  tbl_regression(exp = TRUE) 
Characteristic HR1 95% CI1 p-value
hgb_group


    5.7-12.2
    12.2-13.5 0.79 0.48, 1.29 0.3
    13.5-14.7 0.45 0.25, 0.78 0.005
    14.7-18.9 0.51 0.30, 0.86 0.011
mspike_group


    0-0.6
    0.6-1.2 1.24 0.63, 2.46 0.5
    1.2-1.5 1.90 0.97, 3.71 0.061
    1.5-3.0 3.47 1.88, 6.40 <0.001
1 HR = Hazard Ratio, CI = Confidence Interval
#evaluate the cox model
mv_fit <- coxph(Surv(ptime, pstat) ~ hgb_group + mspike_group, data = df_prog)
cz_prog <- cox.zph(mv_fit)
print(cz_prog)
##              chisq df    p
## hgb_group     4.39  3 0.22
## mspike_group  1.19  3 0.75
## GLOBAL        5.63  6 0.47
plot(cz_prog)

There is a significant relationship between the size of the monoclonal serum splike and probability of progression.

Conclusion

According to the insights produced by the mgus2 data set, the 1-year probability of MGUS survival was 87.5% and the median survival time was 8.2 years. There was evidence that older patients and male patients had a significantly lower chance of survival. Lastly, there was a significant relationship between the size of the monoclonal serum splike and probability of progression

Resources

Survival Analysis in R - Emily C. Zabor

A Solution to Missing Data: Imputation Using R

Survival analysis in clinical trials: Basics and must know areas

Censoring in Clinical Trials: Review of Survival Analysis Techniques.

Survival Analysis II: Cox Regression

Survival analysis in R

Cox Proportional-Hazards Model