In my role my as a research data analyst, I manage healthcare data
and create insights to help physicians predict patient outcomes.
Survival analysis is one of the tools I use to quantify oncology data.
Survival analysis estimates the probability of survival over time. This
is important in healthcare for understanding disease progression and
treatment effectiveness. Survival analysis plays a crucial role in
oncology where time-to-time outcomes, determining sample size, defining
endpoints, and treatment efficacy influences care initiatives. The
purpose of this project is to demonstrate important features of survival
analysis, understand data output, and communicate insights from the
mgus2 data set.
knitr::opts_chunk$set(echo = TRUE)
require(formattable)
## Loading required package: formattable
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'formattable'
require(RPostgres)
## Loading required package: RPostgres
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'RPostgres'
require(DT)
## Loading required package: DT
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'DT'
require(lubridate)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
require(knitr)
## Loading required package: knitr
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
require(ggVennDiagram)
## Loading required package: ggVennDiagram
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'ggVennDiagram'
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ stringr 1.5.0
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(survival)
library(survminer)
## Loading required package: ggpubr
##
## Attaching package: 'survminer'
##
## The following object is masked from 'package:survival':
##
## myeloma
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.2
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggsurvfit)
## Warning: package 'ggsurvfit' was built under R version 4.3.3
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.3.3
library(tidycmprsk)
## Warning: package 'tidycmprsk' was built under R version 4.3.3
##
## Attaching package: 'tidycmprsk'
##
## The following object is masked from 'package:gtsummary':
##
## trial
The data() function will pull a list of built in R data
sets. The data used in this project is from the survival
package. The (Monoclonal
gammopathy) data set depicts the natural history of 1341 sequential
patients with monoclonal gammopathy of undetermined significance
(MGUS).
data(cancer, package="survival")
view(mgus2)
The mgus2 data set contains the following variables:
id: subject identifier
age: age at diagnosis, in years
sex: a factor with levels F M
dxyr: year of diagnosis
hgb: hemoglobin
creat: creatinine
mspike: size of the monoclonal serum splike
ptime: time until progression to a plasma cell
malignancy (PCM) or last contact, in months
pstat: occurrence of PCM: 0=no, 1=yes
futime: time until death or last contact, in months
death: occurrence of death: 0=no, 1=yes
Understand the tools that produced the results of the (A long-term study of prognosis in monoclonal gammopathy of undetermined significance) study which were a cumulative probability of progression was 12 percent at 10 years, 25 percent at 20 years, and 30 percent at 25 years, and the initial concentration of serum monoclonal protein was a significant predictor of progression at 20 years.
Visualize survival trends and compare patient demographics with Kaplan-Meier plots and smooth survival plots.
Identify significant covariets with the Cox regression model.
Test the proportional hazards assumption for the Cox regression model.
head(mgus2)
## id age sex dxyr hgb creat mspike ptime pstat futime death
## 1 1 88 F 1981 13.1 1.3 0.5 30 0 30 1
## 2 2 78 F 1968 11.5 1.2 2.0 25 0 25 1
## 3 3 94 M 1980 10.5 1.5 2.6 46 0 46 1
## 4 4 68 M 1977 15.2 1.2 1.2 92 0 92 1
## 5 5 90 F 1973 10.7 0.8 1.0 8 0 8 1
## 6 6 90 M 1990 12.9 1.0 0.5 4 0 4 1
summary(mgus2)
## id age sex dxyr hgb
## Min. : 1.0 Min. :24.00 F:631 Min. :1960 Min. : 5.7
## 1st Qu.: 346.8 1st Qu.:63.00 M:753 1st Qu.:1980 1st Qu.:12.2
## Median : 692.5 Median :72.00 Median :1984 Median :13.5
## Mean : 692.5 Mean :70.42 Mean :1983 Mean :13.3
## 3rd Qu.:1038.2 3rd Qu.:79.00 3rd Qu.:1988 3rd Qu.:14.7
## Max. :1384.0 Max. :96.00 Max. :1994 Max. :18.9
## NA's :13
## creat mspike ptime pstat
## Min. : 0.400 Min. :0.000 Min. : 1.00 Min. :0.00000
## 1st Qu.: 0.900 1st Qu.:0.600 1st Qu.: 37.00 1st Qu.:0.00000
## Median : 1.100 Median :1.200 Median : 81.00 Median :0.00000
## Mean : 1.292 Mean :1.164 Mean : 93.54 Mean :0.08309
## 3rd Qu.: 1.300 3rd Qu.:1.500 3rd Qu.:136.25 3rd Qu.:0.00000
## Max. :22.000 Max. :3.000 Max. :424.00 Max. :1.00000
## NA's :30 NA's :11
## futime death
## Min. : 1.0 Min. :0.0000
## 1st Qu.: 40.0 1st Qu.:0.0000
## Median : 84.0 Median :1.0000
## Mean : 95.8 Mean :0.6958
## 3rd Qu.:139.0 3rd Qu.:1.0000
## Max. :424.0 Max. :1.0000
##
str(mgus2)
## 'data.frame': 1384 obs. of 11 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ age : num 88 78 94 68 90 90 89 87 86 79 ...
## ..- attr(*, "label")= chr "AGE AT mgus_sp"
## $ sex : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 1 1 1 1 ...
## ..- attr(*, "label")= chr "Sex"
## ..- attr(*, "format")= chr "$desex"
## $ dxyr : num 1981 1968 1980 1977 1973 ...
## $ hgb : num 13.1 11.5 10.5 15.2 10.7 12.9 10.5 12.3 14.5 9.4 ...
## $ creat : num 1.3 1.2 1.5 1.2 0.8 1 0.9 1.2 0.9 1.1 ...
## $ mspike: num 0.5 2 2.6 1.2 1 0.5 1.3 1.6 2.4 2.3 ...
## ..- attr(*, "label")= chr "SEP SPIKE AT mgus_sp"
## $ ptime : num 30 25 46 92 8 4 151 2 57 136 ...
## $ pstat : num 0 0 0 0 0 0 0 0 0 0 ...
## $ futime: num 30 25 46 92 8 4 151 2 57 136 ...
## $ death : num 1 1 1 1 1 1 1 1 0 1 ...
VIMpackage: missing values and imputation tools.
library("VIM")
## Warning: package 'VIM' was built under R version 4.3.3
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
#show where there are missing values
df <- mgus2
sapply(df, function(x) sum(is.na(x)))
## id age sex dxyr hgb creat mspike ptime pstat futime death
## 0 0 0 0 13 30 11 0 0 0 0
#plot the missing values
df_miss = aggr(df, numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=3, ylab=c("Proportion of missingness","Missingness Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## creat 0.021676301
## hgb 0.009393064
## mspike 0.007947977
## id 0.000000000
## age 0.000000000
## sex 0.000000000
## dxyr 0.000000000
## ptime 0.000000000
## pstat 0.000000000
## futime 0.000000000
## death 0.000000000
This project will not be addressing the missing values of this data set beyond the visualizations above.
To begin, we need to turn the dxyr (diagnosis year) into
a full ymd date format, and convert the ptime and
futime from months to years.
#format dates
df <- mgus2 |>
mutate(dxyr = make_date(year = dxyr, month = 1, day = 1))
head(df)
## id age sex dxyr hgb creat mspike ptime pstat futime death
## 1 1 88 F 1981-01-01 13.1 1.3 0.5 30 0 30 1
## 2 2 78 F 1968-01-01 11.5 1.2 2.0 25 0 25 1
## 3 3 94 M 1980-01-01 10.5 1.5 2.6 46 0 46 1
## 4 4 68 M 1977-01-01 15.2 1.2 1.2 92 0 92 1
## 5 5 90 F 1973-01-01 10.7 0.8 1.0 8 0 8 1
## 6 6 90 M 1990-01-01 12.9 1.0 0.5 4 0 4 1
# change the df layout: as_tibble(df)
df <- as_tibble(df)
df
## # A tibble: 1,384 × 11
## id age sex dxyr hgb creat mspike ptime pstat futime death
## <dbl> <dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 88 F 1981-01-01 13.1 1.3 0.5 30 0 30 1
## 2 2 78 F 1968-01-01 11.5 1.2 2 25 0 25 1
## 3 3 94 M 1980-01-01 10.5 1.5 2.6 46 0 46 1
## 4 4 68 M 1977-01-01 15.2 1.2 1.2 92 0 92 1
## 5 5 90 F 1973-01-01 10.7 0.8 1 8 0 8 1
## 6 6 90 M 1990-01-01 12.9 1 0.5 4 0 4 1
## 7 7 89 F 1974-01-01 10.5 0.9 1.3 151 0 151 1
## 8 8 87 F 1974-01-01 12.3 1.2 1.6 2 0 2 1
## 9 9 86 F 1994-01-01 14.5 0.9 2.4 57 0 57 0
## 10 10 79 F 1981-01-01 9.4 1.1 2.3 136 0 136 1
## # ℹ 1,374 more rows
#convert `ptime` and `futime` into years
df <- df |>
mutate(ptime = ptime/12) |>
mutate(futime = futime/12)
#find overall survival data
df <-
df %>%
mutate(
os_yrs = as.duration(ptime %--% futime) / dyears(1)
)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `os_yrs = as.duration(ptime %--% futime)/dyears(1)`.
## Caused by warning:
## ! tz(): Don't know how to compute timezone for object of class numeric; returning "UTC".
df
## # A tibble: 1,384 × 12
## id age sex dxyr hgb creat mspike ptime pstat futime death
## <dbl> <dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 88 F 1981-01-01 13.1 1.3 0.5 2.5 0 2.5 1
## 2 2 78 F 1968-01-01 11.5 1.2 2 2.08 0 2.08 1
## 3 3 94 M 1980-01-01 10.5 1.5 2.6 3.83 0 3.83 1
## 4 4 68 M 1977-01-01 15.2 1.2 1.2 7.67 0 7.67 1
## 5 5 90 F 1973-01-01 10.7 0.8 1 0.667 0 0.667 1
## 6 6 90 M 1990-01-01 12.9 1 0.5 0.333 0 0.333 1
## 7 7 89 F 1974-01-01 10.5 0.9 1.3 12.6 0 12.6 1
## 8 8 87 F 1974-01-01 12.3 1.2 1.6 0.167 0 0.167 1
## 9 9 86 F 1994-01-01 14.5 0.9 2.4 4.75 0 4.75 0
## 10 10 79 F 1981-01-01 9.4 1.1 2.3 11.3 0 11.3 1
## # ℹ 1,374 more rows
## # ℹ 1 more variable: os_yrs <dbl>
#create the survival object (fist 10 observations)
Surv(df$ptime, df$pstat)[1:10]
## [1] 2.5000000+ 2.0833333+ 3.8333333+ 7.6666667+ 0.6666667+ 0.3333333+
## [7] 12.5833333+ 0.1666667+ 4.7500000+ 11.3333333+
Understanding our output: These are observations of time until progression to a plasma cell malignancy (PCM) or last contact, in years. We see that subject 1 had an event at time 2.50 years, subject 2 had an event at time 2.08 years, subject 3 was censored at time 3.83 years, etc.
#calculate overall survival
s1 <- survfit(Surv(futime, death) ~ 1, data = df)
str(s1)
## List of 16
## $ n : int 1384
## $ time : num [1:272] 0.0833 0.1667 0.25 0.3333 0.4167 ...
## $ n.risk : num [1:272] 1384 1341 1313 1298 1282 ...
## $ n.event : num [1:272] 42 28 15 16 10 9 6 13 9 10 ...
## $ n.censor : num [1:272] 1 0 0 0 0 0 0 0 0 1 ...
## $ surv : num [1:272] 0.97 0.949 0.939 0.927 0.92 ...
## $ std.err : num [1:272] 0.00476 0.00621 0.00688 0.00755 0.00794 ...
## $ cumhaz : num [1:272] 0.0303 0.0512 0.0627 0.075 0.0828 ...
## $ std.chaz : num [1:272] 0.00468 0.00612 0.0068 0.00746 0.00786 ...
## $ type : chr "right"
## $ logse : logi TRUE
## $ conf.int : num 0.95
## $ conf.type: chr "log"
## $ lower : num [1:272] 0.961 0.938 0.926 0.913 0.906 ...
## $ upper : num [1:272] 0.979 0.961 0.951 0.941 0.934 ...
## $ call : language survfit(formula = Surv(futime, death) ~ 1, data = df)
## - attr(*, "class")= chr "survfit"
Understanding our output: When reading the output it
is important to understand that time: the time-points at
which the curve has a step, i.e. at least one event occurred
surv: the estimate of survival at the corresponding
time.
survfit2(Surv(futime, death) ~ 1, data = df) %>%
ggsurvfit() +
labs(
x = "Years",
y = "Overall survival probability",
title = "Overall Survival: mgus2",
) +
add_confidence_interval() +
add_risktable()
Estimating the probability of survival in a specific amount of time is important for clinical trials to asses major events.
#estimate survival in 1 year
summary(survfit(Surv(futime, death) ~ 1, data = df), times = 1)
## Call: survfit(formula = Surv(futime, death) ~ 1, data = df)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 1215 173 0.875 0.0089 0.858 0.893
1-year probability of survival in this study is 87.5%. The associated lower and upper and lower bounds of the 95% confidence interval are also displayed. The 1-year survival probability is the point on the y-axis that corresponds to 1 year on the x-axis for the survival curve.
#produce a table
survfit(Surv(futime, death) ~ 1, data = df) %>%
tbl_survfit(
times = 1,
label_header = "**1-year survival (95% CI)**"
)
| Characteristic | 1-year survival (95% CI) |
|---|---|
| Overall | 87% (86%, 89%) |
survfit(Surv(futime, death) ~ 1, data = df)
## Call: survfit(formula = Surv(futime, death) ~ 1, data = df)
##
## n events median 0.95LCL 0.95UCL
## [1,] 1384 963 8.17 7.67 8.58
#produce table
survfit(Surv(futime, death) ~ 1, data = df) %>%
tbl_survfit(
probs = 0.5,
label_header = "**Median survival (95% CI)**"
)
| Characteristic | Median survival (95% CI) |
|---|---|
| Overall | 8.2 (7.7, 8.6) |
Understanding our output: We see the median survival time is 8.2 years.
survdiff tests if there is a difference between two or
more survival curves using the G-rho family of tests, or for a single
curve against a known alternative.
survdiff(Surv(futime, death) ~ sex, data = df)
## Call:
## survdiff(formula = Surv(futime, death) ~ sex, data = df)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=F 631 423 471 4.88 9.67
## sex=M 753 540 492 4.67 9.67
##
## Chisq= 9.7 on 1 degrees of freedom, p= 0.002
#plot the comparison
survfit2(Surv(futime, death) ~ sex, data = df) %>%
ggsurvfit() +
labs(
x = "Years",
y = "Overall survival probability",
title = "Male vs. Female: mgus2",
) +
add_confidence_interval() +
add_risktable()
Female patients do slightly better than male patients.
#Show age distribution
summary(df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 63.00 72.00 70.42 79.00 96.00
#Create age bins
df <- df %>%
mutate(
age_group = dplyr::case_when(
age <= 63 ~ "24-63",
age > 63 & age <= 72 ~ "63-72",
age > 72 & age <= 79 ~ "72-79",
age > 79 ~ "79-96"
),
# Convert to factor
age_group = factor(
age_group,
level = c("24-63", "63-72", "72-79", "79-96" )
)
)
df
## # A tibble: 1,384 × 13
## id age sex dxyr hgb creat mspike ptime pstat futime death
## <dbl> <dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 88 F 1981-01-01 13.1 1.3 0.5 2.5 0 2.5 1
## 2 2 78 F 1968-01-01 11.5 1.2 2 2.08 0 2.08 1
## 3 3 94 M 1980-01-01 10.5 1.5 2.6 3.83 0 3.83 1
## 4 4 68 M 1977-01-01 15.2 1.2 1.2 7.67 0 7.67 1
## 5 5 90 F 1973-01-01 10.7 0.8 1 0.667 0 0.667 1
## 6 6 90 M 1990-01-01 12.9 1 0.5 0.333 0 0.333 1
## 7 7 89 F 1974-01-01 10.5 0.9 1.3 12.6 0 12.6 1
## 8 8 87 F 1974-01-01 12.3 1.2 1.6 0.167 0 0.167 1
## 9 9 86 F 1994-01-01 14.5 0.9 2.4 4.75 0 4.75 0
## 10 10 79 F 1981-01-01 9.4 1.1 2.3 11.3 0 11.3 1
## # ℹ 1,374 more rows
## # ℹ 2 more variables: os_yrs <dbl>, age_group <fct>
survdiff(Surv(futime, death) ~ age_group, data = df)
## Call:
## survdiff(formula = Surv(futime, death) ~ age_group, data = df)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## age_group=24-63 357 163 345 95.69 159.3
## age_group=63-72 354 227 276 8.65 12.3
## age_group=72-79 342 277 201 28.87 37.6
## age_group=79-96 331 296 142 168.05 206.8
##
## Chisq= 330 on 3 degrees of freedom, p= <2e-16
#plot the comparison
survfit2(Surv(futime, death) ~ age_group, data = df) %>%
ggsurvfit() +
labs(
x = "Years",
y = "Overall survival probability",
title = "Age Group: mgus2",
) +
add_confidence_interval() +
add_risktable()
This visualization shows that as age increases, survival decreases.
The Cox regression model quantifies the survival between patient
groups by modeling the effect of multiple variables at once. This model
can adjust for confounding effects of other variables using the
coxph() function which uses the same syntax as
lm(), glm(), etc. The response variable you
create with Surv() goes on the left hand side of the
formula, specified with a ~.
cox <- coxph(Surv(futime, death) ~ age_group + sex + mspike, data = df)
summary(cox)
## Call:
## coxph(formula = Surv(futime, death) ~ age_group + sex + mspike,
## data = df)
##
## n= 1373, number of events= 957
## (11 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age_group63-72 0.65987 1.93454 0.10485 6.293 3.10e-10 ***
## age_group72-79 1.28162 3.60247 0.10435 12.282 < 2e-16 ***
## age_group79-96 1.75523 5.78478 0.10587 16.579 < 2e-16 ***
## sexM 0.33454 1.39729 0.06573 5.089 3.59e-07 ***
## mspike 0.04446 1.04547 0.06000 0.741 0.459
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age_group63-72 1.935 0.5169 1.5752 2.376
## age_group72-79 3.602 0.2776 2.9362 4.420
## age_group79-96 5.785 0.1729 4.7008 7.119
## sexM 1.397 0.7157 1.2284 1.589
## mspike 1.045 0.9565 0.9295 1.176
##
## Concordance= 0.656 (se = 0.01 )
## Likelihood ratio test= 346 on 5 df, p=<2e-16
## Wald test = 322.3 on 5 df, p=<2e-16
## Score (logrank) test = 359.2 on 5 df, p=<2e-16
Statistical significance: The column marked
z gives the Wald statistic value. It corresponds to the
ratio of each regression coefficient to its standard error
(z = coef/se(coef)). The Wald statistic evaluates, whether
the beta (β) coefficient of a given variable is statistically
significantly different from 0. We can conclude that sex
and age_group are highly statistically significant.
However, the size of the monoclonal serum splike is not significant.
The regression coefficients: The second feature to note in the Cox model results is the the sign of the regression coefficients (coef). A positive sign means that the hazard (risk of death) is higher, and thus the prognosis worse, for subjects with higher values of that variable. The R summary for the Cox model gives the hazard ratio (HR) for the second group relative to the first group, that is, female versus male. This indicates that females have lower risk of death (higher survival rates) than males.
Hazard ratios: The exponentiated coefficients of
sexM (exp(coef) = exp(0.33) = 1.40), also known as hazard
ratios, give the effect size of covariates.
Confidence intervals of the hazard ratios: For
example, (mspike) also gives upper and lower 95% confidence
intervals for the hazard ratio (exp(coef)) = 1.05, lower 95% bound =
0.93, upper 95% bound = 1.18.
Global statistical significance of the model: The output gives p-values for three alternative tests for overall significance of the model: The likelihood-ratio test, Wald test, and score log rank statistics. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The Likelihood ratio test has better behavior for small sample sizes, so it is generally preferred.
Likelihood-ratio test: A likelihood ratio test
compares the goodness of fit of two nested regression models based on
the predictor variables in the overall regression model.
Thelikelihood-ratio test = 346, p =
<2e-16.
coxph(Surv(futime, death) ~ age_group + sex + mspike, data = df) |>
tbl_regression(exp = TRUE)
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| age_group | |||
| 24-63 | — | — | |
| 63-72 | 1.93 | 1.58, 2.38 | <0.001 |
| 72-79 | 3.60 | 2.94, 4.42 | <0.001 |
| 79-96 | 5.78 | 4.70, 7.12 | <0.001 |
| Sex | |||
| F | — | — | |
| M | 1.40 | 1.23, 1.59 | <0.001 |
| SEP SPIKE AT mgus_sp | 1.05 | 0.93, 1.18 | 0.5 |
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
The quantity of interest from a Cox regression model is a hazard ratio (HR). The HR represents the ratio of hazards between two groups at any particular point in time.
HR = 1: No effect
HR < 1: Reduction in the hazard
HR > 1: Increase in Hazard
sexM: HR = 1.42 implies that males have a 1.42 times
higher risk of death in relation to females.
age_group63-72: 1.93 times higher risk of death in
relation to age_group24-63.
age_group72-79: 3.60 times higher risk of death in
relation to age_group24-63.
age_group79-96: 5.78 times higher risk of death in
relation to age_group24-63.
cox.zph(): Tests the proportional hazards assumption for
the Cox regression model. This is done by testing for an interaction
effect between the covariate and log(time).
Interpretation: A significant p-value indicates that the proportional hazards assumption is violated.
Plots of the Schoenfeld residuals: Deviation from a zero-slope line is evidence that the proportional hazards assumption is violated.
mv_fit <- coxph(Surv(futime, death) ~ sex + age_group + mspike, data = df)
cz <- cox.zph(mv_fit)
print(cz)
## chisq df p
## sex 1.01 1 0.32
## age_group 30.86 3 9.1e-07
## mspike 1.13 1 0.29
## GLOBAL 33.53 5 3.0e-06
plot(cz)
Covariates with p-values >0.05, do not reject the null hypothesis. Therefore, the proportional hazards assumption is satisfied for each individual covariate, and also for the model overall.
In statistics, a continuous variable is a
quantitative variable that may be continuous or discrete if they are
typically obtained by measuring or counting, respectively (ex.
age). To visualize a survival estimate according to a
continuous variable. The sm.survival function from the
sm package allows you to do this for a quantile of the
distribution of survival data. The default quantile is p = 0.5 for
median survival.
#install.packages("sm")
library(sm)
## Warning: package 'sm' was built under R version 4.3.3
## Package 'sm', version 2.2-6.0: type help(sm) for summary information
sm.options(
list(
xlab = "Age (years)",
ylab = "Median time to death (years)")
)
sm.survival(
x = df$age,
y = df$futime,
status = df$death,
h = (1/6) * sd(df$age) / nrow(df)^(-1/4)
)
#play with the `h` ratio to get a median line that is not too smooth (start with 1/2)
x = events
o = censoring
The line is a smoothed estimate of median survival according to
age.
As age increases, median time to death
decreases.
Survival analysis tools can also be used when understanding disease progression. This is especially important for tumor treatment assesment.
survfit2(Surv(ptime, pstat) ~ 1, data = df) %>%
ggsurvfit() +
labs(
x = "Years",
y = "Overall progression probability",
title = "Overall Progession: mgus2",
) +
add_confidence_interval() +
add_risktable()
#estimate median time to progression in years
survfit(Surv(ptime, pstat) ~ 1, data = df) %>%
tbl_survfit(
probs = 0.5,
label_header = "**Median progression (95% CI)**"
)
| Characteristic | Median progression (95% CI) |
|---|---|
| Overall | 31 (28, —) |
hgb: hemoglobin
creat: creatinine
mspike: size of the monoclonal serum splike
cox_prog <- coxph(Surv(ptime, pstat) ~ hgb + creat + mspike, data = df)
summary(cox_prog)
## Call:
## coxph(formula = Surv(ptime, pstat) ~ hgb + creat + mspike, data = df)
##
## n= 1338, number of events= 112
## (46 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## hgb -0.14090 0.86857 0.05109 -2.758 0.00582 **
## creat -0.11778 0.88889 0.15144 -0.778 0.43670
## mspike 0.90718 2.47733 0.16557 5.479 4.28e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## hgb 0.8686 1.1513 0.7858 0.960
## creat 0.8889 1.1250 0.6606 1.196
## mspike 2.4773 0.4037 1.7908 3.427
##
## Concordance= 0.665 (se = 0.031 )
## Likelihood ratio test= 37.01 on 3 df, p=5e-08
## Wald test = 38.78 on 3 df, p=2e-08
## Score (logrank) test = 39.43 on 3 df, p=1e-08
We will explore hgb and mspike because
creatine is not a significant factor when determining
progression.
#check distribution
summary(df$hgb)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.7 12.2 13.5 13.3 14.7 18.9 13
summary(df$mspike)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.600 1.200 1.164 1.500 3.000 11
#Create `hgb` bins
df_prog <- df %>%
mutate(
hgb_group = dplyr::case_when(
hgb <= 12.2 ~ "5.7-12.2",
hgb > 12.2 & hgb <= 13.5 ~ "12.2-13.5",
hgb > 13.5 & hgb <= 14.7 ~ "13.5-14.7",
hgb > 14.7 ~ "14.7-18.9"
),
# Convert to factor
hgb_group = factor(
hgb_group,
level = c("5.7-12.2", "12.2-13.5", "13.5-14.7", "14.7-18.9" )
)
)
df_prog
## # A tibble: 1,384 × 14
## id age sex dxyr hgb creat mspike ptime pstat futime death
## <dbl> <dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 88 F 1981-01-01 13.1 1.3 0.5 2.5 0 2.5 1
## 2 2 78 F 1968-01-01 11.5 1.2 2 2.08 0 2.08 1
## 3 3 94 M 1980-01-01 10.5 1.5 2.6 3.83 0 3.83 1
## 4 4 68 M 1977-01-01 15.2 1.2 1.2 7.67 0 7.67 1
## 5 5 90 F 1973-01-01 10.7 0.8 1 0.667 0 0.667 1
## 6 6 90 M 1990-01-01 12.9 1 0.5 0.333 0 0.333 1
## 7 7 89 F 1974-01-01 10.5 0.9 1.3 12.6 0 12.6 1
## 8 8 87 F 1974-01-01 12.3 1.2 1.6 0.167 0 0.167 1
## 9 9 86 F 1994-01-01 14.5 0.9 2.4 4.75 0 4.75 0
## 10 10 79 F 1981-01-01 9.4 1.1 2.3 11.3 0 11.3 1
## # ℹ 1,374 more rows
## # ℹ 3 more variables: os_yrs <dbl>, age_group <fct>, hgb_group <fct>
#Create `mspike` bins
df_prog <- df_prog %>%
mutate(
mspike_group = dplyr::case_when(
mspike <= 0.6 ~ "0-0.6",
mspike > 0.6 & mspike <= 1.2 ~ "0.6-1.2",
mspike > 1.2 & mspike <= 1.5 ~ "1.2-1.5",
mspike > 1.5 ~ "1.5-3.0"
),
# Convert to factor
mspike_group = factor(
mspike_group,
level = c("0-0.6", "0.6-1.2", "1.2-1.5", "1.5-3.0" )
)
)
df_prog
## # A tibble: 1,384 × 15
## id age sex dxyr hgb creat mspike ptime pstat futime death
## <dbl> <dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 88 F 1981-01-01 13.1 1.3 0.5 2.5 0 2.5 1
## 2 2 78 F 1968-01-01 11.5 1.2 2 2.08 0 2.08 1
## 3 3 94 M 1980-01-01 10.5 1.5 2.6 3.83 0 3.83 1
## 4 4 68 M 1977-01-01 15.2 1.2 1.2 7.67 0 7.67 1
## 5 5 90 F 1973-01-01 10.7 0.8 1 0.667 0 0.667 1
## 6 6 90 M 1990-01-01 12.9 1 0.5 0.333 0 0.333 1
## 7 7 89 F 1974-01-01 10.5 0.9 1.3 12.6 0 12.6 1
## 8 8 87 F 1974-01-01 12.3 1.2 1.6 0.167 0 0.167 1
## 9 9 86 F 1994-01-01 14.5 0.9 2.4 4.75 0 4.75 0
## 10 10 79 F 1981-01-01 9.4 1.1 2.3 11.3 0 11.3 1
## # ℹ 1,374 more rows
## # ℹ 4 more variables: os_yrs <dbl>, age_group <fct>, hgb_group <fct>,
## # mspike_group <fct>
#plot the comparison
survfit2(Surv(ptime, pstat) ~ hgb_group, data = df_prog) %>%
ggsurvfit() +
labs(
x = "Years",
y = "Overall progression probability",
title = "Hemoglobin: mgus2",
) +
add_confidence_interval() +
add_risktable()
survfit2(Surv(ptime, pstat) ~ mspike_group, data = df_prog) %>%
ggsurvfit() +
labs(
x = "Years",
y = "Overall progression probability",
title = "size of the monoclonal serum splike: mgus2",
) +
add_confidence_interval() +
add_risktable()
coxph(Surv(ptime, pstat) ~ hgb_group + mspike_group, data = df_prog) |>
tbl_regression(exp = TRUE)
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| hgb_group | |||
| 5.7-12.2 | — | — | |
| 12.2-13.5 | 0.79 | 0.48, 1.29 | 0.3 |
| 13.5-14.7 | 0.45 | 0.25, 0.78 | 0.005 |
| 14.7-18.9 | 0.51 | 0.30, 0.86 | 0.011 |
| mspike_group | |||
| 0-0.6 | — | — | |
| 0.6-1.2 | 1.24 | 0.63, 2.46 | 0.5 |
| 1.2-1.5 | 1.90 | 0.97, 3.71 | 0.061 |
| 1.5-3.0 | 3.47 | 1.88, 6.40 | <0.001 |
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
#evaluate the cox model
mv_fit <- coxph(Surv(ptime, pstat) ~ hgb_group + mspike_group, data = df_prog)
cz_prog <- cox.zph(mv_fit)
print(cz_prog)
## chisq df p
## hgb_group 4.39 3 0.22
## mspike_group 1.19 3 0.75
## GLOBAL 5.63 6 0.47
plot(cz_prog)
There is a significant relationship between the size of the monoclonal serum splike and probability of progression.
According to the insights produced by the mgus2 data
set, the 1-year probability of MGUS survival was 87.5% and the median
survival time was 8.2 years. There was evidence that older patients and
male patients had a significantly lower chance of survival. Lastly,
there was a significant relationship between the size of the monoclonal
serum splike and probability of progression