The “lung” dataset contains data on patients with advanced lung cancer from the North Central Cancer Treatment Group.
Key Variables:
time: Survival time in days.
status: Censoring status (1 = censored, 2 = dead).
age: Age in years.
sex: Male (1) or Female (2).
ph.ecog: ECOG performance score (0 = good to 5 = dead).
Other variables include Karnofsky performance scores and calorie intake.
This dataset is a standard example for survival analysis techniques, including Kaplan-Meier survival curves and Cox proportional hazards models, and for studying factors that influence survival among lung cancer patients.
The outcome is defined by two variables: time, the survival duration in days, and status, which indicates whether the event of interest (death) occurred or the observation was censored. Performance scores (ph.ecog, ph.karno, and pat.karno) are important predictors of survival, reflecting the functional status of patients.
Key Notes:
Censoring (status): Observations with a status of “1” are censored, meaning that the patient was still alive at their last follow-up.
ECOG and Karnofsky Scores: These scores assess a patient’s ability to perform daily activities and are commonly used as prognostic indicators in cancer studies.
Dietary Intake (meal.cal) and Weight Loss (wt.loss): These variables may reflect nutritional status, which can influence survival outcomes.
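A minimal sketch of how this status coding is handled in R, assuming the `lung` data frame that ships with the survival package. `Surv()` accepts a 1/2 status coding and treats 2 as the event, so censored times print with a trailing `+`. The object names `surv_obj`, `n_events`, and `n_censored` are illustrative only.

```r
library(survival)

# lung$status is coded 1 = censored, 2 = dead; with a 1/2 coding, Surv() treats 2 as the event
surv_obj <- Surv(time = lung$time, event = lung$status)

# Counts of deaths and censored observations
n_events   <- sum(lung$status == 2)
n_censored <- sum(lung$status == 1)

head(surv_obj)   # censored times are printed with a trailing "+"
c(events = n_events, censored = n_censored)
```

Recoding status to 0/1 first, as done later with `ifelse()`, gives the same survival object. The explicit recode mainly makes the coding visible to the reader.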
#install.packages("survival")
library(survival)
library(eha)
library(tidyverse) # For data manipulation
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(summarytools) # Optional: For detailed summary tables
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
## system might not have X11 capabilities; in case of errors when using dfSummary(), set st_options(use.x11 = FALSE)
##
## Attaching package: 'summarytools'
##
## The following object is masked from 'package:tibble':
##
## view
library(tinytex)
library(flexsurv)
##
## Attaching package: 'flexsurv'
##
## The following objects are masked from 'package:eha':
##
## dgompertz, dllogis, hgompertz, Hgompertz, hllogis, Hllogis, hlnorm,
## Hlnorm, hweibull, Hweibull, pgompertz, pllogis, qgompertz, qllogis,
## rgompertz, rllogis
lung <- lung
descr(lung) # summarytools
## Descriptive Statistics
## lung
## N: 228
##
## age inst meal.cal pat.karno ph.ecog ph.karno sex status
## ----------------- -------- -------- ---------- ----------- --------- ---------- -------- --------
## Mean 62.45 11.09 928.78 79.96 0.95 81.94 1.39 1.72
## Std.Dev 9.07 8.30 402.17 14.62 0.72 12.33 0.49 0.45
## Min 39.00 1.00 96.00 30.00 0.00 50.00 1.00 1.00
## Q1 56.00 3.00 635.00 70.00 0.00 70.00 1.00 1.00
## Median 63.00 11.00 975.00 80.00 1.00 80.00 1.00 2.00
## Q3 69.00 16.00 1150.00 90.00 1.00 90.00 2.00 2.00
## Max 82.00 33.00 2600.00 100.00 3.00 100.00 2.00 2.00
## MAD 9.64 8.90 296.52 14.83 1.48 14.83 0.00 0.00
## IQR 13.00 13.00 515.00 20.00 1.00 15.00 1.00 1.00
## CV 0.15 0.75 0.43 0.18 0.75 0.15 0.35 0.26
## Skewness -0.37 0.66 1.00 -0.60 0.14 -0.57 0.43 -0.99
## SE.Skewness 0.16 0.16 0.18 0.16 0.16 0.16 0.16 0.16
## Kurtosis -0.40 -0.22 3.35 0.13 -0.85 -0.20 -1.82 -1.02
## N.Valid 228.00 227.00 181.00 225.00 227.00 227.00 228.00 228.00
## N 228.00 228.00 228.00 228.00 228.00 228.00 228.00 228.00
## Pct.Valid 100.00 99.56 79.39 98.68 99.56 99.56 100.00 100.00
##
## Table: Table continues below
##
##
##
## time wt.loss
## ----------------- --------- ---------
## Mean 305.23 9.83
## Std.Dev 210.65 13.14
## Min 5.00 -24.00
## Q1 166.50 0.00
## Median 255.50 7.00
## Q3 399.00 16.00
## Max 1022.00 68.00
## MAD 160.86 10.38
## IQR 229.75 15.75
## CV 0.69 1.34
## Skewness 1.08 1.17
## SE.Skewness 0.16 0.17
## Kurtosis 0.86 2.33
## N.Valid 228.00 214.00
## N 228.00 228.00
## Pct.Valid 100.00 93.86
# Perform OLS regression with survival time as the dependent variable
ols_model <- lm(time ~ age , data = lung)
# Summary of the OLS model
summary(ols_model)
##
## Call:
## lm(formula = time ~ age, data = lung)
##
## Residuals:
## Min 1Q Median 3Q Max
## -295.61 -143.37 -57.01 107.12 737.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 418.375 97.147 4.307 2.47e-05 ***
## age -1.812 1.540 -1.177 0.241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 210.5 on 226 degrees of freedom
## Multiple R-squared: 0.006091, Adjusted R-squared: 0.001693
## F-statistic: 1.385 on 1 and 226 DF, p-value: 0.2405
# Recode the 'status' variable: 1 = censored (0), 2 = dead (1)
lung$death <- ifelse(lung$status == 2, 1, 0)
# Perform logistic regression with survival status as the dependent variable
logistic_model <- glm(death ~ age + sex , data = lung, family = binomial)
# Summary of the logistic regression model
summary(logistic_model)
##
## Call:
## glm(formula = death ~ age + sex, family = binomial, data = lung)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.51507 1.17422 0.439 0.660916
## age 0.03189 0.01701 1.875 0.060854 .
## sex -1.04839 0.30844 -3.399 0.000676 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268.78 on 227 degrees of freedom
## Residual deviance: 251.93 on 225 degrees of freedom
## AIC: 257.93
##
## Number of Fisher Scoring iterations: 4
# Optional: Calculate odds ratios and confidence intervals
exp(cbind(Odds_Ratio = coef(logistic_model), confint(logistic_model)))
## Waiting for profiling to be done...
## Odds_Ratio 2.5 % 97.5 %
## (Intercept) 1.6737524 0.1701245 17.3552243
## age 1.0324022 0.9986430 1.0678140
## sex 0.3505024 0.1899317 0.6385775
# Load the required package
library(survival)
# Load the lung dataset
data(lung)
## Warning in data(lung): data set 'lung' not found
# Convert status to a factor; glm() treats the first level (1 = censored) as failure and models the probability of the second level (2 = dead)
lung$status <- as.factor(lung$status)
# Fit the logistic regression model
logistic_model <- glm(status ~ age + time, data = lung, family = binomial)
# Display the summary of the model
summary(logistic_model)
##
## Call:
## glm(formula = status ~ age + time, family = binomial, data = lung)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6772741 1.0629153 -0.637 0.5240
## age 0.0352921 0.0167466 2.107 0.0351 *
## time -0.0016807 0.0006951 -2.418 0.0156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268.78 on 227 degrees of freedom
## Residual deviance: 257.87 on 225 degrees of freedom
## AIC: 263.87
##
## Number of Fisher Scoring iterations: 4
# Extract the coefficients
coefficients <- coef(logistic_model)
# Convert the coefficients to odds ratios
odds_ratios <- exp(coefficients)
# Display the odds ratios
odds_ratios
## (Intercept) age time
## 0.5079999 1.0359223 0.9983207
table(lung$death)
##
## 0 1
## 63 165
# Load the required package
library(survival)
# Load the lung dataset
data(lung)
## Warning in data(lung): data set 'lung' not found
# Create a dummy variable for status (1 = dead, 0 = censored)
lung$status_dummy <- ifelse(lung$status == 2, 1, 0)
# Fit the Cox proportional hazards model
cox_model <- coxph(Surv(time, status_dummy) ~ age + sex, data = lung)
# Display the summary of the model
summary(cox_model)
## Call:
## coxph(formula = Surv(time, status_dummy) ~ age + sex, data = lung)
##
## n= 228, number of events= 165
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.017045 1.017191 0.009223 1.848 0.06459 .
## sex -0.513219 0.598566 0.167458 -3.065 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0172 0.9831 0.9990 1.0357
## sex 0.5986 1.6707 0.4311 0.8311
##
## Concordance= 0.603 (se = 0.025 )
## Likelihood ratio test= 14.12 on 2 df, p=9e-04
## Wald test = 13.47 on 2 df, p=0.001
## Score (logrank) test = 13.72 on 2 df, p=0.001
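The hazard ratio and its confidence limits in the table follow directly from the coefficient and its standard error. As a worked check for sex, using the values printed above and the usual Wald interval with 1.96 as the normal quantile:

\[ \widehat{HR}_{sex} = \exp(-0.5132) \approx 0.599, \qquad 95\%\ CI = \exp(-0.5132 \pm 1.96 \times 0.1675) \approx (0.431,\ 0.831) \]

Because sex is coded 1 = male and 2 = female, this ratio compares females with males of the same age.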
library(survival)
tt <- c(7,6,6,5,2,4)
cens <- c(0,1,0,0,1,1)
Surv(tt,cens)
## [1] 7+ 6 6+ 5+ 2 4
result.km <- survfit(Surv(tt,cens)~1, conf.type="log-log")
result.km
## Call: survfit(formula = Surv(tt, cens) ~ 1, conf.type = "log-log")
##
## n events median 0.95LCL 0.95UCL
## [1,] 6 3 6 2 NA
summary(result.km)
## Call: survfit(formula = Surv(tt, cens) ~ 1, conf.type = "log-log")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 2 6 1 0.833 0.152 0.2731 0.975
## 4 5 1 0.667 0.192 0.1946 0.904
## 6 3 1 0.444 0.222 0.0662 0.785
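The survival column above can be reproduced by hand. The following sketch recomputes the product-limit estimate for the same six observations without calling `survfit()`. The helper names (`event_times`, `n_risk`, `n_event`, `km_by_hand`) are illustrative and not part of the survival package.

```r
# Toy data from the example above: 1 = event, 0 = censored
tt   <- c(7, 6, 6, 5, 2, 4)
cens <- c(0, 1, 0, 0, 1, 1)

# Distinct event times in increasing order
event_times <- sort(unique(tt[cens == 1]))

# Number at risk just before each event time, and number of events at that time
n_risk  <- sapply(event_times, function(t) sum(tt >= t))
n_event <- sapply(event_times, function(t) sum(tt == t & cens == 1))

# Product-limit estimate: cumulative product of (1 - d_i / n_i)
km_by_hand <- cumprod(1 - n_event / n_risk)

data.frame(time = event_times, n.risk = n_risk, n.event = n_event,
           survival = round(km_by_hand, 3))
```

Note that the subject censored at time 5 reduces the risk set from 5 to 3 at time 6 (together with the event at time 4) without contributing a factor to the product.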
plot(result.km,
ylab = "Survival probability",
xlab = "Time",
mark.time = T,
main="KM survival curve")
abline(h = 0.5, col = "sienna", lty = 3)
plot(result.km,
ylab = "Cumulative hazard",
xlab = "Time",
mark.time = T,
fun="cumhaz",
main="KM cumulative hazard curve")
abline(h = log(2), col = "sienna", lty = 3) # S(t) = 0.5 corresponds to H(t) = log(2) on the cumulative hazard scale
result.fh <- survfit(Surv(tt,cens)~1, conf.type="log-log", type="fh")
result.fh
## Call: survfit(formula = Surv(tt, cens) ~ 1, conf.type = "log-log",
## type = "fh")
##
## n events median 0.95LCL 0.95UCL
## [1,] 6 3 6 2 NA
summary(result.fh)
## Call: survfit(formula = Surv(tt, cens) ~ 1, conf.type = "log-log",
## type = "fh")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 2 6 1 0.846 0.141 0.306 0.977
## 4 5 1 0.693 0.180 0.229 0.913
## 6 3 1 0.497 0.210 0.101 0.807
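The `type="fh"` estimate differs from Kaplan-Meier because it exponentiates the Nelson-Aalen cumulative hazard instead of multiplying conditional survival probabilities. As a worked check against the table above, with \(d_i\) events and \(n_i\) subjects at risk at event time \(t_i\):

\[ \hat{H}(t) = \sum_{i:\, t_i \le t} \frac{d_i}{n_i}, \qquad \hat{S}_{FH}(t) = \exp\{-\hat{H}(t)\} \]

\[ \hat{H}(2) = \tfrac{1}{6} \approx 0.167, \qquad \hat{H}(4) = \tfrac{1}{6} + \tfrac{1}{5} \approx 0.367, \qquad \hat{H}(6) = \tfrac{1}{6} + \tfrac{1}{5} + \tfrac{1}{3} = 0.700 \]

\[ \hat{S}_{FH}(2) = e^{-0.167} \approx 0.846, \qquad \hat{S}_{FH}(4) = e^{-0.367} \approx 0.693, \qquad \hat{S}_{FH}(6) = e^{-0.700} \approx 0.497 \]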
# Nelson-Aalen (NA) survival curve (Fleming-Harrington estimator)
plot(result.fh,
ylab = "Survival probability",
xlab = "Time",
mark.time = T,
main="NA survival curve")
abline(h = 0.5, col = "sienna", lty = 3)
# Nelson-Aalen (NA) cumulative hazard curve
plot(result.fh,
ylab = "Cumulative hazard",
xlab = "Time",
mark.time = T,
fun="cumhaz",
main="NA cumulative hazard curve")
abline(h = log(2), col = "sienna", lty = 3) # S(t) = 0.5 corresponds to H(t) = log(2) on the cumulative hazard scale
The addicts dataset originates from an Australian study by Caplehorn and Bell (1991) and is featured in the survival analysis textbook by Kleinbaum and Klein. The primary goal is to analyze factors influencing retention time in methadone clinics, such as clinic type, methadone dose, and prison history. This dataset is widely used for illustrating Kaplan-Meier survival curves, Cox proportional hazards models, and other survival analysis techniques.
We are interested in either 1) the hazard of dropping out of the clinic or 2) the time (in days) until the person dropped out of the clinic or was censored. The predictor of interest is CLINIC (coded 1 or 2), which distinguishes two methadone clinics for heroin addicts. Covariates include DOSE (continuous), the methadone dose in mg/day, and PRISON (coded 1 if the patient has a prison record and 0 if not).
ID (id): Each patient has a unique numeric ID to distinguish them in the dataset.
Clinic (clinic): Indicates which methadone clinic the patient attended. Values are coded as: 1: Clinic 1; 2: Clinic 2
Status (status): Represents whether the patient experienced the event of interest (dropout) or was censored. Values are coded as: 0: censored; 1: endpoint (dropout)
Survival Time (survt): The time (in days) from admission to either dropout or censoring. This is the primary time-to-event variable used for survival analysis.
Prison Record (prison): Indicates whether the patient had a history of incarceration. Values are coded as: 0: no prison record; 1: prison record
Methadone Dose (dose): The maximum daily dose of methadone prescribed to the patient during treatment, measured in milligrams per day.
# Install and load necessary packages
# install.packages("haven")
library(haven)
library(survival)
library(eha)
library(tidyverse)
# Load the addicts dataset
addicts <- read_dta("http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta")
# Preview the dataset
head(addicts)
## # A tibble: 6 × 6
## id clinic status survt prison dose
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 428 0 50
## 2 2 1 1 275 1 55
## 3 3 1 1 262 0 55
## 4 4 1 1 183 0 30
## 5 5 1 1 259 1 65
## 6 6 1 1 714 0 55
summary(addicts)
## id clinic status survt
## Min. : 1.00 Min. :1.000 Min. :0.0000 Min. : 2.0
## 1st Qu.: 65.25 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 171.2
## Median :131.50 Median :1.000 Median :1.0000 Median : 367.5
## Mean :134.13 Mean :1.315 Mean :0.6303 Mean : 402.6
## 3rd Qu.:205.75 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 585.5
## Max. :266.00 Max. :2.000 Max. :1.0000 Max. :1076.0
## prison dose
## Min. :0.0000 Min. : 20.0
## 1st Qu.:0.0000 1st Qu.: 50.0
## Median :0.0000 Median : 60.0
## Mean :0.4664 Mean : 60.4
## 3rd Qu.:1.0000 3rd Qu.: 70.0
## Max. :1.0000 Max. :110.0
#install.packages("summarytools")
# Load required libraries
library(tidyverse) # For data manipulation
library(summarytools) # Optional: For detailed summary tables
# View the structure of the dataset
str(addicts)
## tibble [238 × 6] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:238] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "label")= chr "Subject ID"
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ clinic: num [1:238] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "Coded 1 or 2"
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ status: num [1:238] 1 1 1 1 1 1 1 0 1 1 ...
## ..- attr(*, "label")= chr "status (0=censored, 1=endpoint)"
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ survt : num [1:238] 428 275 262 183 259 714 438 796 892 393 ...
## ..- attr(*, "label")= chr "survival time in days"
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ prison: num [1:238] 0 1 0 0 1 0 1 1 0 1 ...
## ..- attr(*, "label")= chr "0=none, 1=prison record"
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ dose : num [1:238] 50 55 55 30 65 55 65 60 50 65 ...
## ..- attr(*, "label")= chr "methadone dose (mg/day)"
## ..- attr(*, "format.stata")= chr "%10.0g"
# Basic summary statistics for all variables
summary(addicts)
## id clinic status survt
## Min. : 1.00 Min. :1.000 Min. :0.0000 Min. : 2.0
## 1st Qu.: 65.25 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 171.2
## Median :131.50 Median :1.000 Median :1.0000 Median : 367.5
## Mean :134.13 Mean :1.315 Mean :0.6303 Mean : 402.6
## 3rd Qu.:205.75 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 585.5
## Max. :266.00 Max. :2.000 Max. :1.0000 Max. :1076.0
## prison dose
## Min. :0.0000 Min. : 20.0
## 1st Qu.:0.0000 1st Qu.: 50.0
## Median :0.0000 Median : 60.0
## Mean :0.4664 Mean : 60.4
## 3rd Qu.:1.0000 3rd Qu.: 70.0
## Max. :1.0000 Max. :110.0
# Descriptive statistics for numeric variables
numeric_summary <- addicts %>%
select_if(is.numeric) %>%
summarise_all(list(
Mean = mean,
Median = median,
SD = sd,
Min = min,
Max = max,
NAs = ~sum(is.na(.))
))
print(numeric_summary)
## # A tibble: 1 × 36
## id_Mean clinic_Mean status_Mean survt_Mean prison_Mean dose_Mean id_Median
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 134. 1.32 0.630 403. 0.466 60.4 132.
## # ℹ 29 more variables: clinic_Median <dbl>, status_Median <dbl>,
## # survt_Median <dbl>, prison_Median <dbl>, dose_Median <dbl>, id_SD <dbl>,
## # clinic_SD <dbl>, status_SD <dbl>, survt_SD <dbl>, prison_SD <dbl>,
## # dose_SD <dbl>, id_Min <dbl>, clinic_Min <dbl>, status_Min <dbl>,
## # survt_Min <dbl>, prison_Min <dbl>, dose_Min <dbl>, id_Max <dbl>,
## # clinic_Max <dbl>, status_Max <dbl>, survt_Max <dbl>, prison_Max <dbl>,
## # dose_Max <dbl>, id_NAs <int>, clinic_NAs <int>, status_NAs <int>, …
# Frequency tables for categorical variables (clinic and prison)
table(addicts$clinic)
##
## 1 2
## 163 75
table(addicts$prison)
##
## 0 1
## 127 111
# Optional: Use summarytools for a detailed report
library(summarytools)
# Generate a detailed descriptive statistics table
descr(addicts)
## Descriptive Statistics
## addicts
## N: 238
##
## clinic dose id prison status survt
## ----------------- -------- -------- -------- -------- -------- ---------
## Mean 1.32 60.40 134.13 0.47 0.63 402.57
## Std.Dev 0.47 14.45 79.29 0.50 0.48 267.85
## Min 1.00 20.00 1.00 0.00 0.00 2.00
## Q1 1.00 50.00 65.00 0.00 0.00 170.00
## Median 1.00 60.00 131.50 0.00 1.00 367.50
## Q3 2.00 70.00 206.00 1.00 1.00 587.00
## Max 2.00 110.00 266.00 1.00 1.00 1076.00
## MAD 0.00 14.83 104.52 0.00 0.00 306.16
## IQR 1.00 20.00 140.50 1.00 1.00 414.25
## CV 0.35 0.24 0.59 1.07 0.77 0.67
## Skewness 0.79 0.26 -0.01 0.13 -0.54 0.37
## SE.Skewness 0.16 0.16 0.16 0.16 0.16 0.16
## Kurtosis -1.38 0.08 -1.32 -1.99 -1.72 -0.87
## N.Valid 238.00 238.00 238.00 238.00 238.00 238.00
## N 238.00 238.00 238.00 238.00 238.00 238.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00
lm_fit <- glm(survt ~ clinic + dose + prison, data=addicts, family = "gaussian") # named lm_fit to avoid masking stats::lm()
b_lm = coef(lm_fit)
summary(lm_fit)
##
## Call:
## glm(formula = survt ~ clinic + dose + prison, family = "gaussian",
## data = addicts)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -109.405 76.216 -1.435 0.15249
## clinic 86.527 33.725 2.566 0.01092 *
## dose 7.244 1.087 6.666 1.87e-10 ***
## prison -84.357 31.072 -2.715 0.00712 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 57139.85)
##
## Null deviance: 17003606 on 237 degrees of freedom
## Residual deviance: 13370725 on 234 degrees of freedom
## AIC: 3288.3
##
## Number of Fisher Scoring iterations: 2
\[ y=b_0 +b_1 x_1 + b_2 x_2 + b_3 x_3 + \epsilon, \;\; \epsilon \sim N(0,\sigma^2) \]
Fitted equation for time to dropout (\(t\)), in days:
\[ t = (-109.4045583) + (86.5268733)\times clinic + (7.2438949)\times dose + (-84.3568952)\times prison \]
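To see what the fitted equation implies, the sketch below plugs a covariate profile into it. The coefficients are copied from the output above rather than re-estimated. The patient profile (clinic 1, 60 mg/day, no prison record) and the object names are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the summary output above (not re-estimated here)
b_lm_printed <- c("(Intercept)" = -109.4045583, clinic = 86.5268733,
                  dose = 7.2438949, prison = -84.3568952)

# Hypothetical covariate profile; the leading 1 multiplies the intercept
new_patient <- c("(Intercept)" = 1, clinic = 1, dose = 60, prison = 0)

# Linear predictor: b0 + b1*clinic + b2*dose + b3*prison
predicted_days <- sum(b_lm_printed * new_patient)
predicted_days

# Adjusted difference in predicted days for a 10 mg/day higher dose
dose_effect_10mg <- 10 * b_lm_printed[["dose"]]
dose_effect_10mg
```

Because this model treats censored times as if they were observed dropout times, the predicted value is best read as an expected follow-up time, not as an unbiased estimate of time to dropout.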
Interpretations
The following model includes an interaction term between clinic and prison. In this model specification, prison is a (mediator, moderator) on the association between clinic and time to dropout.
lm_int <- glm(survt~ clinic + dose + prison + clinic*prison,
data=addicts,
family = "gaussian")
b_lm_int = coef(lm_int)
summary(lm_int)
##
## Call:
## glm(formula = survt ~ clinic + dose + prison + clinic * prison,
## family = "gaussian", data = addicts)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60.688 84.631 -0.717 0.4740
## clinic 44.359 46.492 0.954 0.3410
## dose 7.350 1.088 6.756 1.12e-10 ***
## prison -200.229 93.392 -2.144 0.0331 *
## clinic:prison 87.989 66.891 1.315 0.1897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 56962.08)
##
## Null deviance: 17003606 on 237 degrees of freedom
## Residual deviance: 13272164 on 233 degrees of freedom
## AIC: 3288.5
##
## Number of Fisher Scoring iterations: 2
logi <- glm(status~ clinic + dose + prison,
data=addicts,
family = "binomial")
b_logi = coef(logi)
summary(logi)
##
## Call:
## glm(formula = status ~ clinic + dose + prison, family = "binomial",
## data = addicts)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.22797 0.78182 5.408 6.38e-08 ***
## clinic -1.54175 0.30493 -5.056 4.28e-07 ***
## dose -0.02630 0.01048 -2.509 0.0121 *
## prison -0.04155 0.29257 -0.142 0.8871
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 313.60 on 237 degrees of freedom
## Residual deviance: 276.33 on 234 degrees of freedom
## AIC: 284.33
##
## Number of Fisher Scoring iterations: 4
\[\ln\left(\frac{p(y)}{1-p(y)}\right)=b_0 + b_1x_1+b_2x_2+b_3x_3, \;\;\;\; y \sim Bernoulli(p(y))\]
\[\ln(\frac{p(DropOut)}{1-p(DropOut)}) = (4.2279749) + (-1.5417485)\times clinic + (-0.026296) \times dose + (-0.0415542) \times prison\]
What is the metric of \(y, b_0, b_1,\) and \(b_2\), respectively?
What is the difference between coefficients and \(\exp\)(coefficients)? Specify the metric.
What is the interpretation when 1) \(b_i=0\), 2) \(b_i<0\), or 3) \(b_i>0\)?
What is the interpretation when 1) \(\exp(b_i)=1\), 2) \(\exp(b_i)<1\) or 3) \(\exp(b_i)>1\)?
How would you compare the odds of dropout between two groups of people below? Is it additive or multiplicative?
Compare the following two groups:
Interpret \(b_0, b_1,\) and \(b_2\), respectively.
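For the questions above, a small numerical sketch may help. It converts the printed coefficients to odds ratios and compares two hypothetical patients who differ only in clinic. The coefficients are copied from the output above. The covariate values (60 mg/day, no prison record) and all object names are assumptions made for illustration.

```r
# Coefficients copied from the logistic regression output above
b_logi_printed <- c("(Intercept)" = 4.2279749, clinic = -1.5417485,
                    dose = -0.026296, prison = -0.0415542)

# exp(coefficient) is an odds ratio per one-unit increase in the covariate
odds_ratios <- exp(b_logi_printed[-1])
odds_ratios

# Linear predictors (log odds) for two hypothetical patients:
# order of values is intercept, clinic, dose, prison
lp_clinic1 <- sum(b_logi_printed * c(1, 1, 60, 0))
lp_clinic2 <- sum(b_logi_printed * c(1, 2, 60, 0))

# plogis() is the inverse logit: it maps log odds to probabilities
p_clinic1 <- plogis(lp_clinic1)
p_clinic2 <- plogis(lp_clinic2)

# The ratio of the two odds equals exp(b_clinic), whatever the dose and prison values
odds <- function(p) p / (1 - p)
odds(p_clinic2) / odds(p_clinic1)
odds_ratios[["clinic"]]
```

The coefficients are additive on the log-odds scale, so the corresponding comparison on the odds scale is multiplicative.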
The following model includes an interaction term between clinic and prison. In this model specification, prison is a (mediator, moderator) on the association between clinic and the odds of dropout.
logi_int <- glm(status~ clinic + dose + prison + clinic*prison,
data=addicts,
family = "binomial")
b_logi_int = coef(logi_int)
summary(logi_int)
##
## Call:
## glm(formula = status ~ clinic + dose + prison + clinic * prison,
## family = "binomial", data = addicts)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.11777 0.85788 4.800 1.59e-06 ***
## clinic -1.45279 0.42049 -3.455 0.00055 ***
## dose -0.02646 0.01048 -2.524 0.01159 *
## prison 0.21200 0.88104 0.241 0.80985
## clinic:prison -0.18671 0.61194 -0.305 0.76028
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 313.60 on 237 degrees of freedom
## Residual deviance: 276.24 on 233 degrees of freedom
## AIC: 286.24
##
## Number of Fisher Scoring iterations: 4
# Create a survival object
km_surv_obj <- Surv(time = addicts$survt, event = addicts$status == 1)
km_surv_obj[1:20]
## [1] 428 275 262 183 259 714 438 796+ 892 393 161+ 836 523 612 212
## [16] 399 771 514 512 624
# Fit Kaplan-Meier survival curves stratified by clinic
km_fit <- survfit(km_surv_obj ~ clinic, data = addicts, conf.type = "log-log")
# Plot Kaplan-Meier curves using base R
plot(km_fit, col = c("blue", "red"), lty = 1:2, xlab = "Time (days)", ylab = "Survival Probability",
main = "Kaplan-Meier Survival Curves by Clinic")
legend("bottomleft", legend = c("Clinic 1", "Clinic 2"), col = c("blue", "red"), lty = 1:2)
\[ S(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right) \], where the \(t_i\) are the distinct event times, \(d_i\) is the number of events at time \(t_i\), and \(n_i\) is the number of individuals at risk just before \(t_i\).
km_fit
## Call: survfit(formula = km_surv_obj ~ clinic, data = addicts, conf.type = "log-log")
##
## n events median 0.95LCL 0.95UCL
## clinic=1 163 122 428 341 512
## clinic=2 75 28 NA 661 NA
summary(km_fit)
## Call: survfit(formula = km_surv_obj ~ clinic, data = addicts, conf.type = "log-log")
##
## clinic=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 7 162 1 0.9938 0.00615 0.95699 0.9991
## 17 161 1 0.9877 0.00868 0.95154 0.9969
## 19 160 1 0.9815 0.01059 0.94369 0.9940
## 29 157 1 0.9752 0.01223 0.93535 0.9906
## 30 156 1 0.9690 0.01366 0.92708 0.9870
## 33 155 1 0.9627 0.01493 0.91892 0.9831
## 35 154 1 0.9565 0.01609 0.91087 0.9790
## 37 153 1 0.9502 0.01716 0.90293 0.9748
## 41 152 1 0.9440 0.01815 0.89509 0.9704
## 47 151 1 0.9377 0.01907 0.88734 0.9660
## 49 150 1 0.9315 0.01994 0.87967 0.9615
## 50 149 1 0.9252 0.02077 0.87207 0.9568
## 59 147 1 0.9189 0.02156 0.86446 0.9521
## 62 146 1 0.9126 0.02231 0.85692 0.9473
## 67 144 1 0.9063 0.02304 0.84937 0.9424
## 75 143 1 0.9000 0.02373 0.84188 0.9375
## 84 142 1 0.8936 0.02440 0.83444 0.9325
## 90 141 1 0.8873 0.02503 0.82704 0.9274
## 95 140 1 0.8809 0.02564 0.81969 0.9224
## 96 139 1 0.8746 0.02623 0.81239 0.9172
## 117 135 1 0.8681 0.02683 0.80491 0.9120
## 126 134 1 0.8616 0.02740 0.79748 0.9067
## 127 133 1 0.8552 0.02795 0.79009 0.9013
## 129 132 1 0.8487 0.02848 0.78274 0.8959
## 136 130 1 0.8422 0.02899 0.77535 0.8905
## 145 129 1 0.8356 0.02950 0.76800 0.8850
## 147 128 1 0.8291 0.02998 0.76068 0.8795
## 150 126 1 0.8225 0.03045 0.75332 0.8739
## 157 124 1 0.8159 0.03092 0.74592 0.8683
## 160 123 1 0.8093 0.03138 0.73856 0.8626
## 167 121 1 0.8026 0.03182 0.73114 0.8569
## 168 120 1 0.7959 0.03225 0.72376 0.8511
## 175 119 1 0.7892 0.03267 0.71641 0.8453
## 176 117 1 0.7824 0.03308 0.70901 0.8394
## 180 116 2 0.7690 0.03385 0.69429 0.8276
## 181 114 1 0.7622 0.03422 0.68697 0.8217
## 183 113 1 0.7555 0.03458 0.67968 0.8158
## 192 112 1 0.7487 0.03492 0.67241 0.8098
## 193 111 1 0.7420 0.03525 0.66517 0.8038
## 204 110 1 0.7352 0.03557 0.65795 0.7977
## 205 108 1 0.7284 0.03589 0.65067 0.7916
## 207 107 1 0.7216 0.03619 0.64341 0.7855
## 209 106 1 0.7148 0.03648 0.63618 0.7794
## 212 104 2 0.7011 0.03705 0.62161 0.7670
## 216 102 1 0.6942 0.03732 0.61436 0.7607
## 223 101 1 0.6873 0.03758 0.60713 0.7545
## 237 100 1 0.6804 0.03783 0.59992 0.7482
## 244 99 1 0.6736 0.03807 0.59273 0.7419
## 247 98 1 0.6667 0.03829 0.58557 0.7356
## 257 97 1 0.6598 0.03851 0.57842 0.7292
## 258 96 1 0.6530 0.03872 0.57129 0.7229
## 259 95 1 0.6461 0.03892 0.56418 0.7165
## 262 94 2 0.6323 0.03928 0.55002 0.7037
## 275 92 1 0.6255 0.03945 0.54296 0.6973
## 293 90 1 0.6185 0.03962 0.53583 0.6908
## 294 89 1 0.6116 0.03978 0.52872 0.6842
## 299 88 1 0.6046 0.03993 0.52163 0.6777
## 302 87 1 0.5977 0.04007 0.51456 0.6712
## 314 86 1 0.5907 0.04020 0.50750 0.6646
## 337 83 1 0.5836 0.04035 0.50026 0.6579
## 341 81 1 0.5764 0.04049 0.49294 0.6511
## 348 78 1 0.5690 0.04063 0.48541 0.6441
## 350 77 1 0.5616 0.04077 0.47791 0.6371
## 358 76 1 0.5542 0.04090 0.47043 0.6301
## 367 75 1 0.5468 0.04102 0.46297 0.6230
## 368 74 1 0.5394 0.04112 0.45554 0.6160
## 376 73 1 0.5321 0.04122 0.44812 0.6089
## 386 72 1 0.5247 0.04130 0.44073 0.6018
## 393 71 1 0.5173 0.04138 0.43336 0.5947
## 394 70 1 0.5099 0.04144 0.42601 0.5876
## 399 69 1 0.5025 0.04149 0.41868 0.5805
## 428 66 1 0.4949 0.04156 0.41111 0.5731
## 434 65 1 0.4873 0.04161 0.40357 0.5657
## 438 64 1 0.4797 0.04165 0.39605 0.5583
## 452 62 1 0.4719 0.04169 0.38841 0.5508
## 457 61 1 0.4642 0.04172 0.38079 0.5433
## 465 59 1 0.4563 0.04175 0.37305 0.5356
## 482 56 1 0.4482 0.04179 0.36501 0.5277
## 489 55 1 0.4400 0.04182 0.35699 0.5198
## 496 54 1 0.4319 0.04183 0.34901 0.5118
## 504 53 1 0.4237 0.04183 0.34106 0.5039
## 512 52 1 0.4156 0.04181 0.33315 0.4958
## 514 51 1 0.4074 0.04177 0.32526 0.4878
## 517 50 1 0.3993 0.04173 0.31740 0.4797
## 518 48 1 0.3910 0.04168 0.30938 0.4715
## 522 47 1 0.3826 0.04161 0.30140 0.4632
## 523 46 2 0.3660 0.04143 0.28553 0.4466
## 532 44 1 0.3577 0.04132 0.27765 0.4383
## 533 43 1 0.3494 0.04119 0.26981 0.4299
## 546 40 1 0.3406 0.04107 0.26153 0.4211
## 550 39 1 0.3319 0.04094 0.25329 0.4124
## 560 38 1 0.3232 0.04078 0.24510 0.4035
## 563 37 1 0.3144 0.04060 0.23695 0.3947
## 581 33 1 0.3049 0.04048 0.22794 0.3852
## 591 31 1 0.2951 0.04035 0.21865 0.3753
## 612 29 2 0.2747 0.04005 0.19953 0.3550
## 624 26 1 0.2641 0.03988 0.18965 0.3444
## 646 25 1 0.2536 0.03966 0.17987 0.3337
## 652 24 1 0.2430 0.03939 0.17020 0.3230
## 667 23 1 0.2325 0.03907 0.16064 0.3122
## 679 22 1 0.2219 0.03869 0.15118 0.3012
## 683 21 1 0.2113 0.03827 0.14184 0.2902
## 714 20 1 0.2008 0.03778 0.13261 0.2791
## 739 19 1 0.1902 0.03724 0.12350 0.2679
## 749 18 1 0.1796 0.03664 0.11451 0.2566
## 755 17 1 0.1691 0.03598 0.10565 0.2452
## 760 16 1 0.1585 0.03525 0.09692 0.2337
## 771 15 1 0.1479 0.03444 0.08834 0.2220
## 774 14 1 0.1374 0.03357 0.07991 0.2102
## 785 13 1 0.1268 0.03260 0.07165 0.1983
## 821 10 2 0.1014 0.03062 0.05164 0.1708
## 836 7 1 0.0869 0.02948 0.04051 0.1556
## 837 6 1 0.0725 0.02790 0.03022 0.1396
## 857 4 1 0.0543 0.02615 0.01784 0.1216
## 892 3 1 0.0362 0.02286 0.00809 0.1017
## 899 2 1 0.0181 0.01717 0.00171 0.0801
##
## clinic=2
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 13 74 1 0.986 0.0134 0.908 0.998
## 26 73 1 0.973 0.0189 0.896 0.993
## 35 72 1 0.959 0.0229 0.880 0.987
## 41 71 1 0.946 0.0263 0.862 0.979
## 79 68 1 0.932 0.0294 0.844 0.971
## 109 66 1 0.918 0.0321 0.826 0.962
## 122 65 1 0.904 0.0346 0.809 0.953
## 143 64 1 0.890 0.0368 0.791 0.943
## 149 62 1 0.875 0.0389 0.774 0.933
## 161 61 1 0.861 0.0408 0.757 0.923
## 170 60 1 0.847 0.0426 0.740 0.912
## 190 59 1 0.832 0.0442 0.723 0.901
## 216 58 1 0.818 0.0457 0.707 0.890
## 231 56 1 0.803 0.0472 0.690 0.879
## 232 55 1 0.789 0.0486 0.674 0.867
## 268 54 2 0.759 0.0510 0.642 0.843
## 280 52 1 0.745 0.0520 0.626 0.831
## 286 51 1 0.730 0.0530 0.610 0.819
## 322 50 1 0.716 0.0539 0.594 0.806
## 366 47 1 0.700 0.0549 0.578 0.794
## 389 45 1 0.685 0.0558 0.561 0.780
## 450 43 1 0.669 0.0568 0.544 0.767
## 460 41 1 0.653 0.0577 0.527 0.753
## 540 35 1 0.634 0.0590 0.507 0.737
## 661 23 1 0.606 0.0625 0.473 0.716
## 708 19 1 0.575 0.0669 0.433 0.693
## 878 10 1 0.517 0.0812 0.349 0.661
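The first rows of the clinic 1 table can be reproduced by hand from the product-limit formula, using the n.risk and n.event columns printed above:

\[ \hat{S}(7) = 1 - \tfrac{1}{162} \approx 0.9938, \qquad \hat{S}(17) = \hat{S}(7)\left(1 - \tfrac{1}{161}\right) \approx 0.9877, \qquad \hat{S}(19) = \hat{S}(17)\left(1 - \tfrac{1}{160}\right) \approx 0.9815 \]

The risk set at day 7 is 162 rather than 163, which implies that one clinic 1 observation was censored before the first dropout. Censored observations shrink \(n_i\) for later event times but contribute no factor of their own.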
summary(km_fit, times=c(0,100,200,300,400,500,600,700,800,900,1000))
## Call: survfit(formula = km_surv_obj ~ clinic, data = addicts, conf.type = "log-log")
##
## clinic=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 163 0 1.0000 0.0000 1.00000 1.0000
## 100 137 20 0.8746 0.0262 0.81239 0.9172
## 200 110 20 0.7420 0.0353 0.66517 0.8038
## 300 87 20 0.6046 0.0399 0.52163 0.6777
## 400 68 14 0.5025 0.0415 0.41868 0.5805
## 500 53 9 0.4319 0.0418 0.34901 0.5118
## 600 30 16 0.2951 0.0403 0.21865 0.3753
## 700 20 8 0.2113 0.0383 0.14184 0.2902
## 800 10 8 0.1268 0.0326 0.07165 0.1983
## 900 1 7 0.0181 0.0172 0.00171 0.0801
##
## clinic=2
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 75 0 1.000 0.0000 1.000 1.000
## 100 66 5 0.932 0.0294 0.844 0.971
## 200 58 7 0.832 0.0442 0.723 0.901
## 300 50 7 0.730 0.0530 0.610 0.819
## 400 43 3 0.685 0.0558 0.561 0.780
## 500 39 2 0.653 0.0577 0.527 0.753
## 600 27 1 0.634 0.0590 0.507 0.737
## 700 19 1 0.606 0.0625 0.473 0.716
## 800 11 1 0.575 0.0669 0.433 0.693
## 900 7 1 0.517 0.0812 0.349 0.661
## 1000 3 0 0.517 0.0812 0.349 0.661
#install.packages("survminer")
library(survminer)
## Loading required package: ggpubr
##
## Attaching package: 'survminer'
## The following object is masked from 'package:survival':
##
## myeloma
# Optional: Enhanced KM plot using survminer
ggsurvplot(km_fit, data = addicts,
conf.int = TRUE, # Add confidence intervals
pval = TRUE, # Add log-rank test p-value
risk.table = TRUE, # Add risk table below the plot
xlab = "Time (days)",
ylab = "Survival Probability",
title = "Kaplan-Meier Survival Curves by Clinic",
legend.labs = c("Clinic 1", "Clinic 2"),
palette = c("blue", "red"))
#log rank test
survdiff(Surv(survt,status)~clinic, data=addicts, rho=0) #log-rank
## Call:
## survdiff(formula = Surv(survt, status) ~ clinic, data = addicts,
## rho = 0)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## clinic=1 163 122 90.9 10.6 27.9
## clinic=2 75 28 59.1 16.4 27.9
##
## Chisq= 27.9 on 1 degrees of freedom, p= 1e-07
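The columns of the `survdiff()` table can be checked with a few lines of arithmetic. The sketch below uses the observed and expected counts copied from the output above and is not a re-analysis of the data. The reported test statistic is the (O-E)^2/V column, which uses a variance that cannot be recovered from the printed counts alone, so 27.9 is taken as given.

```r
# Observed and expected event counts copied from the survdiff() output above
observed <- c(clinic1 = 122, clinic2 = 28)
expected <- c(clinic1 = 90.9, clinic2 = 59.1)

# Per-group contributions (O - E)^2 / E, as printed in the table
contrib <- (observed - expected)^2 / expected
round(contrib, 1)

# p-value for the reported chi-square statistic on 1 degree of freedom
p_value <- pchisq(27.9, df = 1, lower.tail = FALSE)
signif(p_value, 1)
```

Clinic 1 has more dropouts than expected under the null hypothesis of equal hazards and clinic 2 has fewer, which is the direction seen in the Kaplan-Meier curves.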
# stratified log-rank test (comparing clinics within strata of prison)
survdiff(Surv(survt,status)~clinic + strata(prison), data=addicts, rho=0) #log-rank
## Call:
## survdiff(formula = Surv(survt, status) ~ clinic + strata(prison),
## data = addicts, rho = 0)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## clinic=1 163 122 91.7 10.0 26.9
## clinic=2 75 28 58.3 15.8 26.9
##
## Chisq= 26.9 on 1 degrees of freedom, p= 2e-07
\[h(t, X) = h_0(t) \exp\left( \sum_{i=1}^{p} \beta_i X_i \right), \;\; \text{where}\; X = (X_1, X_2, \ldots, X_p) \]
The Cox PH model is “robust” in the sense that its results closely approximate the results for the correct parametric model.
Along with “robustness”, the model specification of the Cox PH model has several good properties.
# Load the required package
library(survival)
# Load the addicts dataset
addicts <- read_dta("http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta")
# Fit the Cox proportional hazards model
Y=Surv(addicts$survt,addicts$status==1)
crude_cox <- coxph(Y~ prison , data=addicts)
adj_cox <- coxph(Y~ prison + dose , data=addicts)
# Display the summary of each model (crude first, then adjusted)
summary(crude_cox)
## Call:
## coxph(formula = Y ~ prison, data = addicts)
##
## n= 238, number of events= 150
##
## coef exp(coef) se(coef) z Pr(>|z|)
## prison 0.1838 1.2018 0.1642 1.119 0.263
##
## exp(coef) exp(-coef) lower .95 upper .95
## prison 1.202 0.8321 0.8711 1.658
##
## Concordance= 0.536 (se = 0.023 )
## Likelihood ratio test= 1.25 on 1 df, p=0.3
## Wald test = 1.25 on 1 df, p=0.3
## Score (logrank) test = 1.26 on 1 df, p=0.3
summary(adj_cox)
## Call:
## coxph(formula = Y ~ prison + dose, data = addicts)
##
## n= 238, number of events= 150
##
## coef exp(coef) se(coef) z Pr(>|z|)
## prison 0.18965 1.20883 0.16427 1.155 0.248
## dose -0.03608 0.96457 0.00600 -6.013 1.83e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## prison 1.2088 0.8272 0.8761 1.668
## dose 0.9646 1.0367 0.9533 0.976
##
## Concordance= 0.663 (se = 0.025 )
## Likelihood ratio test= 38.21 on 2 df, p=5e-09
## Wald test = 37.15 on 2 df, p=9e-09
## Score (logrank) test = 37.48 on 2 df, p=7e-09
Similar to other analytic models, we fit both crude and adjusted models.
First, let’s examine the model fit statistics.
Now, let’s examine coefficients.
\[ HR = \frac{\hat{h}(t, X^*)}{\hat{h}(t, X)} = \frac{h_0(t) \exp(\sum_{i=1}^{p} \beta_i X_i^*)}{h_0(t) \exp(\sum_{i=1}^{p} \beta_i X_i)} = \frac{\exp(\sum_{i=1}^{p} \beta_i X_i^*)}{\exp(\sum_{i=1}^{p} \beta_i X_i)} = \exp\left( {\sum_{i=1}^{p} \beta_i (X_i^* - X_i )}\right) \]
\[\hat{HR} = \exp \left( \sum_{i=1}^{p} \hat{\beta}_i (X_i^* - X_i) \right) = \exp [\hat{\beta}_1 (1 - 0)] = \exp (\hat{\beta}_1)\]
As with an odds ratio, it is easier to interpret an HR that exceeds the null value of 1 than an HR that is less than 1. Thus, the \(X\)'s are typically coded so that the group with the larger hazard corresponds to \(X^*\), and the group with the smaller hazard corresponds to \(X\).
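As a small illustration of the formula \(\hat{HR} = \exp(\hat{\beta}_1)\), the sketch below recomputes the hazard ratio and its 95% Wald confidence interval for prison from the coefficient and standard error printed by `summary(crude_cox)` earlier. The numbers are copied by hand from that output, and the object name `hr_ci` is only illustrative.

```r
# Values copied from the summary(crude_cox) output reported earlier
b  <- 0.1838            # coef for prison
se <- 0.1642            # se(coef) for prison
z  <- qnorm(0.975)      # normal quantile for a two-sided 95% interval

# HR = exp(b); Wald limits are exp(b -/+ z * se)
hr_ci <- c(HR = exp(b), lower = exp(b - z * se), upper = exp(b + z * se))
round(hr_ci, 4)
```

The result should agree, up to rounding, with the `exp(coef)`, `lower .95`, and `upper .95` columns of that summary.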
library(eha)
Cox.PH <- coxreg(Y ~ clinic + dose + prison,
data=addicts)
b_coxph = coef(Cox.PH)
Cox.PH
## Call:
## coxreg(formula = Y ~ clinic + dose + prison, data = addicts)
##
## Covariate Mean Coef Rel.Risk S.E. Wald p
## clinic 1.378 -1.010 0.364 0.215 0.000
## dose 64.317 -0.035 0.965 0.006 0.000
## prison 0.418 0.327 1.386 0.167 0.051
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -673.26
## LR test statistic 64.56
## Degrees of freedom 3
## Overall p-value 6.22835e-14
plot(Cox.PH, main = "Cumulative hazard function")
\[\ln h(t)= \ln h_0(t) + b_1x_1+b_2x_2+b_3x_3\]
or \[h(t)=h_0 (t)\exp(b_1x_1+b_2x_2+b_3x_3)\]
\[\ln(h(t)) = \ln h_0(t) + (-1.0098959)\times clinic + (-0.0353692) \times dose + (0.326555) \times prison\] or \[h(t)=h_0 (t)\exp((-1.0098959)\times clinic + (-0.0353692) \times dose + (0.326555)\times prison)\]
Cox.PH_int <- coxreg(Y ~ clinic + dose + prison + clinic*prison,
data=addicts)
b_coxph_int = coef(Cox.PH_int)
Cox.PH_int
## Call:
## coxreg(formula = Y ~ clinic + dose + prison + clinic * prison,
## data = addicts)
##
## Covariate Mean Coef Rel.Risk S.E. Wald p
## clinic 1.378 -0.655 0.519 0.289 0.023
## dose 64.317 -0.037 0.964 0.007 0.000
## prison 0.418 1.164 3.203 0.540 0.031
## clinic:prison
## : -0.699 0.497 0.429 0.103
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -671.93
## LR test statistic 67.22
## Degrees of freedom 4
## Overall p-value 8.78186e-14
Recall that
\[ HR = \frac{\hat{h}(t, X^*)}{\hat{h}(t, X)} = \frac{h_0(t) \exp(\sum_{i=1}^{p} \beta_i X_i^*)}{h_0(t) \exp(\sum_{i=1}^{p} \beta_i X_i)} = \frac{\exp(\sum_{i=1}^{p} \beta_i X_i^*)}{\exp(\sum_{i=1}^{p} \beta_i X_i)} = \exp\left( {\sum_{i=1}^{p} \beta_i (X_i^* - X_i )}\right) \]
Notice that the baseline hazard function \(h_0(t)\) appears in both the numerator and denominator of the hazard ratio and cancels out of the formula.
The final expression for the hazard ratio therefore involves the estimated coefficients \(\hat{\beta}_i\) and the values of \(X^*\) and \(X\) for each variable. However, because the baseline hazard has canceled out, the final expression does not involve time \(t\).
Thus, once the model is fitted and the values for \(X^*\) and \(X\) are specified, the value of the exponential expression for the estimated hazard ratio is a constant, which does not depend on time \(t\):
\[ \widehat{HR} = \frac{\hat{h}(t, X^*)}{\hat{h}(t, X)} = \exp \left( \sum_{i=1}^{p} \hat{\beta}_i (X_i^* - X_i) \right) = \hat{\theta} \quad \text{therefore,} \] \[ \hat{h}(t, X^*) = \hat{\theta}\, \hat{h}(t, X) \]
The Cox PH model assumes that the hazard ratio comparing any two specifications of predictors is constant over time. Equivalently, this means that the hazard for one individual is proportional to the hazard for any other individual, where the proportionality constant is independent of time.
The PH assumption is not met if the graphs of the hazards cross for two or more categories of a predictor of interest. However, even if the hazard functions do not cross, it is possible that the PH assumption is not met. Thus, rather than checking for crossing hazards, we must use other approaches to evaluate the reasonableness of the PH assumption.
plot(survfit(Y ~ clinic, data=addicts), col=c("black", "red"), fun="cloglog")
## Fit a model assuming PH for all variables
adj_cox <- coxph(Surv(time, status_dummy) ~ sex + age, data = lung)
## Use cox.zph. The survival times are transformed to ranks.
res.zph <- cox.zph(adj_cox, transform = c("km","rank","identity")[2])
## Print test results
res.zph
## chisq df p
## sex 2.378 1 0.12
## age 0.137 1 0.71
## GLOBAL 2.475 2 0.29
## Plot the scaled Schoenfeld residuals against transformed time.
## A non-horizontal trend means the HR changes over time.
plot(res.zph)
If the proportional hazards assumption is violated for the variable CLINIC but met for PRISON and DOSE, a stratified Cox model can be fitted with CLINIC as the stratification variable. The coxph function accepts a strata() term in the model formula. First we define the response variable Y with the Surv function, and then the coxph function is used to run a stratified Cox model (code and output shown below):
# Load the addicts dataset
addicts <- read_dta("http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta")
Y=Surv(addicts$survt,addicts$status==1)
coxph(Y~ prison + dose + strata(clinic),data=addicts)
## Call:
## coxph(formula = Y ~ prison + dose + strata(clinic), data = addicts)
##
## coef exp(coef) se(coef) z p
## prison 0.389605 1.476397 0.168930 2.306 0.0211
## dose -0.035115 0.965495 0.006465 -5.432 5.59e-08
##
## Likelihood ratio test=33.91 on 2 df, p=4.322e-08
## n= 238, number of events= 150
Interaction terms for CLINIC can be included directly in the model formula by including product terms using the : operator (clinic:prison and clinic:dose) (code and output follow)
coxph(Y~ prison + dose + clinic:prison + clinic:dose +strata(clinic),
data=addicts)
## Call:
## coxph(formula = Y ~ prison + dose + clinic:prison + clinic:dose +
## strata(clinic), data = addicts)
##
## coef exp(coef) se(coef) z p
## prison 1.085836 2.961914 0.538636 2.016 0.0438
## dose -0.034635 0.965958 0.019797 -1.750 0.0802
## prison:clinic -0.582989 0.558227 0.428135 -1.362 0.1733
## dose:clinic -0.001164 0.998837 0.014570 -0.080 0.9363
##
## Likelihood ratio test=35.77 on 4 df, p=3.222e-07
## n= 238, number of events= 150
Suppose we wish to estimate the hazard ratio for PRISON=1 vs. PRISON=0 for CLINIC=2. Let \(\beta_1\), \(\beta_2\), \(\beta_3\), and \(\beta_4\) denote the coefficients of PRISON, DOSE, CLINIC × PRISON, and CLINIC × DOSE, in the order of the output above. The hazard ratio can be estimated by exponentiating the coefficient for PRISON plus 2 times the coefficient for the CLINIC × PRISON interaction term. This expression is obtained by substituting the appropriate values into the hazard in both the numerator (for PRISON=1) and denominator (for PRISON=0):
\[ HR = \frac{h_0(t) \exp[(1)\beta_1 + \beta_2 DOSE + (2)(1)\beta_3 + (2)\beta_4 DOSE]}{h_0(t) \exp[(0)\beta_1 + \beta_2 DOSE + (2)(0)\beta_3 + (2)\beta_4 DOSE]} = \exp(\beta_1 + 2\beta_3) \]
The resulting hazard ratio, \(\exp(\beta_1 + 2\beta_3)\), is an exponentiated linear combination of parameters. Base R does not provide a direct counterpart to the lincom command in Stata or the estimate statement in SAS for calculating a linear combination of parameter estimates. However, an approach that can be used in any statistical software package is to recode the variable(s) of interest so that the desired estimate is no longer a linear combination of parameter estimates.
In this example, we are interested in the hazard ratio for PRISON=1 versus PRISON=0 for CLINIC=2. We can define a new variable CLINIC2 = CLINIC − 2, so that when CLINIC=2, CLINIC2=0.
addicts$clinic2=addicts$clinic-2
summary(coxph(Y~ prison + dose + clinic2:prison + clinic2:dose
+ strata(clinic2), data=addicts))
## Call:
## coxph(formula = Y ~ prison + dose + clinic2:prison + clinic2:dose +
## strata(clinic2), data = addicts)
##
## n= 238, number of events= 150
##
## coef exp(coef) se(coef) z Pr(>|z|)
## prison -0.080143 0.922985 0.384305 -0.209 0.83481
## dose -0.036964 0.963711 0.012346 -2.994 0.00275 **
## prison:clinic2 -0.582989 0.558227 0.428135 -1.362 0.17329
## dose:clinic2 -0.001164 0.998837 0.014570 -0.080 0.93632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## prison 0.9230 1.083 0.4346 1.9603
## dose 0.9637 1.038 0.9407 0.9873
## prison:clinic2 0.5582 1.791 0.2412 1.2919
## dose:clinic2 0.9988 1.001 0.9707 1.0278
##
## Concordance= 0.649 (se = 0.026 )
## Likelihood ratio test= 35.77 on 4 df, p=3e-07
## Wald test = 34.09 on 4 df, p=7e-07
## Score (logrank) test = 34.97 on 4 df, p=5e-07
The first line of code defines a new variable CLINIC2. CLINIC2 is used in the stratified Cox model rather than CLINIC. We are interested in the hazard ratio for PRISON=1 vs PRISON=0 for CLINIC2=0. When CLINIC2=0, the product terms cancel and the hazard ratio reduces to \(exp(β_1)\).
The second line of code applies the summary function to the coxph function. The summary function applied in this way produces additional output including 95% confidence intervals for the hazard ratios.
The estimate for \(\exp(\beta_1)\) can be found in the second table: \(exp(coef)\) for prison = 0.9230. The lower and upper confidence limits are 0.4346 and 1.9603, respectively. If we had not recoded the variable CLINIC, the problem would have been more complicated, because we would have had to use the variance–covariance matrix (which can be obtained with the vcov function) to calculate a 95% confidence interval for this hazard ratio.
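The variance–covariance route mentioned above can be sketched as follows. This is a minimal sketch, not a built-in facility: `lincom_hr` is a hypothetical helper name, and the example uses the `lung` data shipped with the survival package (with sex coded 1/2, playing the role of CLINIC) so that it runs without downloading the addicts file.

```r
library(survival)

# Hypothetical helper: HR and Wald CI for a linear combination sum(w * beta)
# of the coefficients of a fitted coxph model, using vcov() for the variance.
lincom_hr <- function(fit, w, level = 0.95) {
  b <- coef(fit)
  V <- vcov(fit)
  stopifnot(length(w) == length(b))
  est <- sum(w * b)                       # linear combination on the log scale
  se  <- sqrt(drop(t(w) %*% V %*% w))     # its standard error
  z   <- qnorm(1 - (1 - level) / 2)
  c(HR = exp(est), lower = exp(est - z * se), upper = exp(est + z * se))
}

# Illustration (assumed stand-in for the addicts model): stratify on sex and
# let the ph.ecog effect differ by sex through a product term.
lung2 <- survival::lung
fit <- coxph(Surv(time, status) ~ ph.ecog + age + ph.ecog:sex + strata(sex),
             data = lung2)
names(coef(fit))                          # ph.ecog, age, ph.ecog:sex

# HR per one-unit increase in ph.ecog when sex = 2: exp(b1 + 2 * b3)
res <- lincom_hr(fit, w = c(1, 0, 2))
res
```

Recoding sex as sex − 2 and refitting should give the same point estimate as the coefficient of ph.ecog, which mirrors the CLINIC2 trick used above.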
The Cox model is extended to contain product (i.e., interaction) terms involving the time-independent variable being assessed and some function of time. If the coefficient of the product term turns out to be significant, we can conclude that the PH assumption is violated.
Using a one-at-a-time model of the form \[ h(t, X) = h_0(t) \exp \left[ \beta X + \delta (X \times g(t)) \right], \] we assess the PH assumption by testing for the significance of the product term. The null hypothesis is therefore “\(\delta\) equal to zero.” Note that if the null hypothesis is true, the model reduces to a Cox PH model containing the single variable \(X\). The test can be carried out using either a Wald statistic or a likelihood ratio statistic.
To assess the PH assumption for several predictors simultaneously, the form of the extended model is \[ h(t, X) = h_0(t) \exp \left( \sum_{i=1}^{p} (\beta_i X_i + \delta_i (X_i \times g_i(t))) \right), \] where \(g_i(t)\) is a function of time for the \(i\)th predictor.
This model contains the predictors being assessed as main effect terms and also as product terms with some function of time. Note that different predictors may require different functions of time; hence, the notation \(g_i(t)\) is used to define the time function for the \(i\)th predictor.
With the above model, we test the PH assumption simultaneously by assessing the null hypothesis that all the \(\delta_i\) coefficients are equal to zero. This requires a likelihood ratio chi-square statistic with \(p\) degrees of freedom, where \(p\) denotes the number of predictors being assessed. The LR statistic is the difference between the log likelihood statistic (i.e., \(-2 \ln L\)) for the PH model and the log likelihood statistic for the extended Cox model. Note that under the null hypothesis, the model reduces to the Cox PH model.
If the above test is found to be significant, then we can conclude that the PH assumption is not satisfied for at least one of the predictors in the model. To determine which predictor(s) do not satisfy the PH assumption, we could proceed by backward elimination of nonsignificant product terms until a final model is attained.
The primary drawback of using an extended Cox model to assess the PH assumption concerns the choice of the functions \(g_i(t)\) for the time-dependent product terms in the model. This choice is typically not clear-cut, and it is possible that different choices, such as \(g(t)\) equal to \(t\) versus \(\log t\) versus a Heaviside function, may result in different conclusions about whether the PH assumption is satisfied.
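One way to fit such a product term in R is the `tt` argument of `coxph`. The sketch below is only an illustration under stated assumptions: it uses the `lung` data from the survival package rather than the addicts data, assesses sex as the predictor, and picks \(g(t) = \log t\), which is just one of the possible choices discussed above.

```r
library(survival)

lung_ex <- survival::lung   # status is coded 1 = censored, 2 = dead

# Extended Cox model: main effect of sex plus the product term sex * log(t).
# The tt() term in the formula is evaluated with the function passed as `tt`.
fit_ext <- coxph(Surv(time, status) ~ age + sex + tt(sex),
                 data = lung_ex,
                 tt = function(x, t, ...) x * log(t))

# Wald test of the null hypothesis delta = 0 for the product term
delta_row <- summary(fit_ext)$coefficients["tt(sex)", ]
delta_row[c("coef", "se(coef)", "z", "Pr(>|z|)")]
```

A small p-value for the `tt(sex)` row would suggest that the hazard ratio for sex changes with time; repeating the fit with a different \(g(t)\) is a simple way to see how sensitive that conclusion is.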
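Proportional hazards (PH) model: \[ h(t|x) = h_0(t) \cdot \exp(\mathbf{x}^\top \boldsymbol{\beta}_{PH}) \]
Accelerated failure time (AFT) model, written in the log-time convention in which a positive coefficient lengthens survival: \[ S(t|x) = S_0(t \cdot \exp(-\mathbf{x}^\top \boldsymbol{\beta}_{AFT})) \]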
Key Equations:
PH coefficients: \(\beta_{PH} = -\beta_{AFT}/\sigma\)
Shape parameter: \(\gamma = 1/\sigma\)
Hazard Ratio: \(HR = \exp(\beta_{PH})\)
Event Time Ratio: \(ETR = \exp(-\beta_{PH}/\gamma)\)
| Feature | AFT | PH |
|---|---|---|
| Parameterization | Log-time | Hazard ratio |
| Time Acceleration | Direct interpretation | Requires conversion |
| Diagnostic Tools | Residual plots | Stratified analysis |
| Covariate Effects | Event Time Ratios | Hazard Ratios |
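The key equations can be checked numerically for a Weibull model. The following is a minimal sketch, assuming the `survreg` parameterization (log-time coefficients and a scale \(\sigma\)) and using the `lung` data from the survival package; the object names are illustrative.

```r
library(survival)

lung_ex <- survival::lung   # status is coded 1 = censored, 2 = dead

# Weibull AFT fit: coefficients act on log(time), scale is sigma
aft <- survreg(Surv(time, status) ~ age + sex, data = lung_ex, dist = "weibull")

sigma    <- aft$scale
beta_aft <- coef(aft)[-1]          # drop the intercept
gamma    <- 1 / sigma              # Weibull shape parameter
beta_ph  <- -beta_aft / sigma      # PH coefficients implied by the AFT fit

HR  <- exp(beta_ph)                # hazard ratios
ETR <- exp(-beta_ph / gamma)       # event time ratios, equal to exp(beta_aft)

round(cbind(beta_aft, beta_ph, HR, ETR), 4)
```

Because the Weibull is both a PH and an AFT model, a covariate with an ETR above 1 (longer survival) always has an HR below 1, and vice versa.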
Examples
# Load required library and data
library(survival)
data(cancer, package = "survival") # makes the lung data frame available
# status is coded 1 = censored, 2 = dead; convert to 0/1 (censored/event)
lung$status <- as.numeric(lung$status == 2)
# Create a survival object
surv_obj <- with(lung, Surv(time, status))
# Fit a Cox proportional hazards model
cox_model <- coxph(surv_obj ~ age + sex + ph.ecog, data = lung)
# Summarize the model
summary(cox_model)
## Call:
## coxph(formula = surv_obj ~ age + sex + ph.ecog, data = lung)
##
## n= 227, number of events= 164
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.011067 1.011128 0.009267 1.194 0.232416
## sex -0.552612 0.575445 0.167739 -3.294 0.000986 ***
## ph.ecog 0.463728 1.589991 0.113577 4.083 4.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0111 0.9890 0.9929 1.0297
## sex 0.5754 1.7378 0.4142 0.7994
## ph.ecog 1.5900 0.6289 1.2727 1.9864
##
## Concordance= 0.637 (se = 0.025 )
## Likelihood ratio test= 30.5 on 3 df, p=1e-06
## Wald test = 29.93 on 3 df, p=1e-06
## Score (logrank) test = 30.5 on 3 df, p=1e-06
# Plot the survival curves for different groups
plot(survfit(cox_model), xlab = "Time", ylab = "Survival Probability",
main = "Cox Proportional Hazards Model")
# Fit an AFT model with Weibull distribution
aft_model <- survreg(Surv(time, status) ~ age + sex + ph.ecog, data = lung, dist = "weibull")
# Summarize the model
summary(aft_model)
##
## Call:
## survreg(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung,
## dist = "weibull")
## Value Std. Error z p
## (Intercept) 6.27344 0.45358 13.83 < 2e-16
## age -0.00748 0.00676 -1.11 0.2690
## sex 0.40109 0.12373 3.24 0.0012
## ph.ecog -0.33964 0.08348 -4.07 4.7e-05
## Log(scale) -0.31319 0.06135 -5.11 3.3e-07
##
## Scale= 0.731
##
## Weibull distribution
## Loglik(model)= -1132.4 Loglik(intercept only)= -1147.4
## Chisq= 29.98 on 3 degrees of freedom, p= 1.4e-06
## Number of Newton-Raphson Iterations: 5
## n=227 (1 observation deleted due to missingness)
# Fit an intercept-only Weibull model so the curve is comparable to the overall KM estimate
weib_null <- survreg(Surv(time, status) ~ 1, data = lung, dist = "weibull")
# survreg parameterization: Weibull scale = exp(intercept), Weibull shape = 1 / survreg scale
weib_scale <- exp(coef(weib_null)[1])
weib_shape <- 1 / weib_null$scale
# Plot Kaplan-Meier curve and overlay Weibull fit
km_fit <- survfit(Surv(time, status) ~ 1, data = lung)
plot(km_fit, xlab = "Time", ylab = "Survival Probability", conf.int = TRUE,
main = "Kaplan-Meier Curve with Weibull Fit")
# Overlay Weibull survival curve S(t) = exp(-(t / scale)^shape)
time_seq <- seq(0, max(lung$time), length.out = 100)
weibull_surv <- exp(-((time_seq / weib_scale)^weib_shape))
lines(time_seq, weibull_surv, col = "red", lwd = 2)
legend("topright", legend = c("Kaplan-Meier", "Weibull Fit"), col = c("black", "red"), lty = 1)
In parametric survival models, the shape and scale parameters play key roles in defining the distribution of survival times and the behavior of the hazard function. Below is an explanation of these parameters, particularly in the context of commonly used distributions like the Weibull model.
The shape parameter (\(\beta\) or \(k\), depending on notation) controls the form of the hazard function over time. It determines whether the hazard rate (the risk of an event occurring at a given time) is constant, increasing, or decreasing.
For the Weibull distribution, the hazard function is:
\[ h(t) = \frac{\beta}{\lambda} \left(\frac{t}{\lambda}\right)^{\beta - 1} \]
where \(\beta\) is the shape parameter and \(\lambda\) is the scale parameter.
The shape parameter reflects how risk evolves: in medical studies, a higher shape parameter may indicate that risks increase with age or disease progression.
The scale parameter (\(\lambda\)) stretches or compresses the time axis. It determines how quickly events occur on average.
For the Weibull distribution, the survival function is:
\[ S(t) = e^{-(t / \lambda)^\beta} \]
where \(\lambda > 0\) is the scale parameter.
The scale parameter reflects the “average” time to event: in medical studies, a larger scale parameter might indicate longer expected survival.
The interaction between shape and scale parameters determines the overall behavior of the hazard function:
When \(\beta = 1\), the Weibull model reduces to the exponential model:
\[ h(t) = \frac{1}{\lambda} \]
This implies a constant hazard rate, independent of time.
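For example, with \(\beta = 2\) and \(\lambda = 2\), \(S(t) = e^{-(t/2)^2}\), so \(S(1) \approx 0.779\), \(S(2) \approx 0.368\), and \(S(3) \approx 0.105\).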
| Parameter | Role | Interpretation |
|---|---|---|
| Shape (\(\beta\)) | Determines hazard trend | Constant (\(=1\)), increasing (\(>1\)), or decreasing (\(<1\)) |
| Scale (\(\lambda\)) | Stretches/compresses time axis | Larger values: longer survival times; smaller values: shorter survival times |
These parameters provide flexibility for modeling various real-world scenarios in survival analysis.
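The roles of the two parameters can be explored with a few lines of base R. This is a minimal sketch of the hazard and survival formulas given above; the function names `weib_h` and `weib_S` are hypothetical, and the parameter values are arbitrary.

```r
# Weibull hazard and survival functions in the (shape = beta, scale = lambda)
# parameterization used in the formulas above
weib_h <- function(t, shape, scale) (shape / scale) * (t / scale)^(shape - 1)
weib_S <- function(t, shape, scale) exp(-(t / scale)^shape)

t_grid <- c(0.5, 1, 2, 4)

# Hazard over time for a decreasing, constant, and increasing shape (scale = 2)
sapply(c(decreasing = 0.5, constant = 1, increasing = 2),
       function(b) weib_h(t_grid, shape = b, scale = 2))

# Survival at the same times for a small and a large scale (shape = 2)
sapply(c(scale_2 = 2, scale_4 = 4),
       function(l) weib_S(t_grid, shape = 2, scale = l))
```

Each column of the first table moves in the direction its name suggests, and the second table shows higher survival at every time point for the larger scale.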
Exp.PH <- phreg(Y ~ clinic + dose + prison,
data=addicts, shape=1, dist="weibull", param="survreg")
b_Exp.PH = coef(Exp.PH)
Exp.PH
## Call:
## phreg(formula = Y ~ clinic + dose + prison, data = addicts, dist = "weibull",
## shape = 1, param = "survreg")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## clinic 1.378 -0.881 0.415 0.211 0.000
## dose 64.317 -0.029 0.971 0.006 0.000
## prison 0.418 0.253 1.287 0.165 0.125
##
## log(scale) 3.684 0.431 0.000
##
## Shape is fixed at 1
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1094
## LR test statistic 49.91
## Degrees of freedom 3
## Overall p-value 8.34931e-11
plot(Exp.PH)
\[\ln h(t)= b_0 + b_1x_1+b_2x_2+b_3x_3\]
or \[h(t)=h_0 (t)\exp(b_1x_1+b_2x_2+b_3x_3), \;\; h_0(t)=\exp(b_0)\] Note: we will get back to \(h_0(t)\) later.
For the exponential model the baseline hazard is the constant \(1/scale\), so \(b_0 = -\log(scale) = -3.6843409\): \[\ln(h(t)) = (-3.6843409) + (-0.8805819)\times clinic + (-0.0289167) \times dose + (0.2526491) \times prison\] or \[h(t)=\exp((-3.6843409) + (-0.8805819)\times clinic + (-0.0289167) \times dose + (0.2526491)\times prison)\]
Exp.PH_int <- phreg(Y ~ clinic + dose + prison + clinic*prison,
data=addicts, shape=1, dist="weibull", param="survreg")
b_Exp.PH_int = coef(Exp.PH_int)
Exp.PH_int
## Call:
## phreg(formula = Y ~ clinic + dose + prison + clinic * prison,
## data = addicts, dist = "weibull", shape = 1, param = "survreg")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## clinic 1.378 -0.670 0.512 0.288 0.020
## dose 64.317 -0.030 0.971 0.006 0.000
## prison 0.418 0.754 2.127 0.529 0.154
## clinic:prison
## : -0.421 0.656 0.423 0.319
##
## log(scale) 3.893 0.475 0.000
##
## Shape is fixed at 1
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1093.5
## LR test statistic 50.91
## Degrees of freedom 4
## Overall p-value 2.33502e-10
Exp.AFT <- aftreg(Y ~ clinic + dose + prison,
data=addicts, shape=1, dist="weibull", param="survreg")
## Warning in aftreg(Y ~ clinic + dose + prison, data = addicts, shape = 1, :
## 'survreg' is a deprecated argument value
a_Exp.AFT = coef(Exp.AFT)
Exp.AFT
## Call:
## aftreg(formula = Y ~ clinic + dose + prison, data = addicts,
## dist = "weibull", shape = 1, param = "survreg")
##
## Covariate W.mean Coef Life-Expn se(Coef) Wald p
## clinic 1.378 0.881 2.412 0.211 0.000
## dose 64.317 0.029 1.029 0.006 0.000
## prison 0.418 -0.253 0.777 0.165 0.125
##
## Baseline parameters:
## log(scale) 3.684 0.431 0.000
## Baseline life expectancy: NA
##
## Shape is fixed at 1
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1094
## LR test statistic 49.9
## Degrees of freedom 3
## Overall p-value 8.34931e-11
#plot(Exp.AFT)
#check.dist(Cox.PH, Exp.AFT)
\[\ln (t)= a_0 + a_1x_1 + a_2x_2 + a_3x_3 + \epsilon\] Note: we will get back to \(h_0(t)\) later.
\[\ln(t) = (3.6843122) + (0.880538)\times clinic + (0.0289172) \times dose + (-0.2526278) \times prison\]
Exp.AFT_int <- aftreg(Y ~ clinic + dose + prison + clinic*prison,
data=addicts, shape=1, dist="weibull", param="survreg")
## Warning in aftreg(Y ~ clinic + dose + prison + clinic * prison, data = addicts,
## : 'survreg' is a deprecated argument value
a_Exp.AFT_int = coef(Exp.AFT_int)
Exp.AFT_int
## Call:
## aftreg(formula = Y ~ clinic + dose + prison + clinic * prison,
## data = addicts, dist = "weibull", shape = 1, param = "survreg")
##
## Covariate W.mean Coef Life-Expn se(Coef) Wald p
## clinic 1.378 0.670 1.955 0.288 0.020
## dose 64.317 0.030 1.030 0.006 0.000
## prison 0.418 -0.754 0.470 0.529 0.154
## clinic:prison
## : 0.421 1.524 0.423 0.319
##
## Baseline parameters:
## log(scale) 3.893 0.475 0.000
## Baseline life expectancy: NA
##
## Shape is fixed at 1
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1093.5
## LR test statistic 50.9
## Degrees of freedom 4
## Overall p-value 2.33502e-10
Weib.PH <- phreg(Y ~ clinic + dose + prison,
data=addicts, dist="weibull", param="survreg")
b_Weib.PH = coef(Weib.PH)
Weib.PH
## Call:
## phreg(formula = Y ~ clinic + dose + prison, data = addicts, dist = "weibull",
## param = "survreg")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## clinic 1.378 -0.972 0.379 0.212 0.000
## dose 64.317 -0.033 0.967 0.006 0.000
## prison 0.418 0.314 1.369 0.166 0.058
##
## log(scale) 4.105 0.328 0.000
## log(shape) 0.315 0.068 0.000
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1084.5
## LR test statistic 60.89
## Degrees of freedom 3
## Overall p-value 3.8014e-13
plot(Weib.PH)
\[\ln h(t)= \ln h_0(t) + b_1x_1+b_2x_2+b_3x_3\]
or \[h(t)=h_0 (t)\exp(b_1x_1+b_2x_2+b_3x_3)\] Note: we will get back to \(h_0(t)\) later.
Here the Weibull baseline hazard is \(h_0(t) = \frac{p}{\lambda}\left(\frac{t}{\lambda}\right)^{p-1}\), with \(\ln \lambda = 4.1048451\) (log(scale)) and \(\ln p = 0.315\) (log(shape)): \[\ln(h(t)) = \ln h_0(t) + (-0.9715245)\times clinic + (-0.0334675) \times dose + (0.3144143) \times prison\] or \[h(t)=h_0(t)\exp((-0.9715245)\times clinic + (-0.0334675) \times dose + (0.3144143)\times prison)\]
Weib.AFT <- aftreg(Y ~ clinic + dose + prison,
data=addicts, dist="weibull", param="survreg")
## Warning in aftreg(Y ~ clinic + dose + prison, data = addicts, dist = "weibull",
## : 'survreg' is a deprecated argument value
a_Weib.AFT = coef(Weib.AFT)
Weib.AFT
## Call:
## aftreg(formula = Y ~ clinic + dose + prison, data = addicts,
## dist = "weibull", param = "survreg")
##
## Covariate W.mean Coef Life-Expn se(Coef) Wald p
## clinic 1.378 0.709 2.032 0.157 0.000
## dose 64.317 0.024 1.025 0.005 0.000
## prison 0.418 -0.230 0.795 0.121 0.057
##
## Baseline parameters:
## log(scale) 4.105 0.328 0.000
## log(shape) 0.315 0.068 0.000
## Baseline life expectancy: 55.5
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1084.5
## LR test statistic 60.9
## Degrees of freedom 3
## Overall p-value 3.8014e-13
plot(Weib.AFT)
#check.dist(Cox.PH, Weib.AFT)
\[\ln (t)= a_0 + a_1x_1 + a_2x_2 + a_3x_3 + \epsilon\]
\[\ln(t) = (4.1051824) + (0.7089269)\times clinic + (0.0244211) \times dose + (-0.2296602) \times prison\]
Llogis.PO <- phreg(Y ~ clinic + dose + prison,
data=addicts, dist="loglogistic", param="survreg")
b_Llogis.PO = coef(Llogis.PO)
Llogis.PO
## Call:
## phreg(formula = Y ~ clinic + dose + prison, data = addicts, dist = "loglogistic",
## param = "survreg")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## (Intercept) 18.490 781.991 0.981
## clinic 1.378 -0.972 0.378 0.212 0.000
## dose 64.317 -0.033 0.967 0.006 0.000
## prison 0.418 0.314 1.369 0.166 0.058
##
## log(scale) 17.598 570.780 0.975
## log(shape) 0.315 0.068 0.000
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1084.5
## LR test statistic 60.89
## Degrees of freedom 3
## Overall p-value 3.78697e-13
\[\ln h(t)= \ln h_0(t) + b_1x_1+b_2x_2+b_3x_3\]
or \[h(t)=h_0 (t)\exp(b_1x_1+b_2x_2+b_3x_3)\]
\[\ln(h(t)) = \ln h_0(t) + (-0.9715453)\times clinic + (-0.0334685) \times dose + (0.3144273) \times prison\] or \[h(t)=h_0(t)\exp((-0.9715453)\times clinic + (-0.0334685) \times dose + (0.3144273)\times prison)\]
Here the baseline \(h_0(t)\) absorbs the (Intercept), log(scale), and log(shape) rows of the output. The intercept estimate (18.490, standard error 781.991) is essentially unidentified, and the covariate coefficients and log likelihood coincide with those of the Weibull PH fit.
Please note that the log-logistic distribution itself does not yield proportional hazards; it yields proportional odds, as the following derivation shows.
\[S(t)=\frac{1}{1+\lambda t^p} \] \[1-S(t) = 1- \frac{1}{1+\lambda t^p}= \frac{1+\lambda t^p - 1}{1+\lambda t^p} = \frac{\lambda t^p}{1+\lambda t^p} \] \[ Survival\;\;odds\; (SO) = \frac{S(t)}{1-S(t)}= \frac{\frac{1}{1+\lambda t^p}}{\frac{\lambda t^p}{1+\lambda t^p}}= \frac{1}{\lambda t^p}\] \[ Failure\;\; odds\; (FO) = \frac{1-S(t)}{S(t)} = \lambda t^p \] \[ \ln (FO) = \ln(\lambda)+p \times \ln(t) \]
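The last identity says that the log failure odds are a straight line in \(\ln(t)\) with intercept \(\ln(\lambda)\) and slope \(p\). The sketch below checks this numerically in base R; the values of \(\lambda\) and \(p\) are arbitrary choices for illustration, not estimates from the addicts data.

```r
# Arbitrary illustrative parameter values
lambda <- 0.01
p      <- 1.7

times  <- c(10, 50, 100, 500, 1000)
S      <- 1 / (1 + lambda * times^p)      # log-logistic survival function
log_FO <- log((1 - S) / S)                # log failure odds

# Regressing log failure odds on log time recovers ln(lambda) and p
line_fit <- lm(log_FO ~ log(times))
coef(line_fit)
c(log_lambda = log(lambda), p = p)
```

With real data the same idea gives a graphical check: plotting the log of the estimated failure odds from a Kaplan-Meier curve against log time should look roughly linear if a log-logistic model is reasonable.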
Llogis.AFT <- aftreg(Y ~ clinic + dose + prison,
data=addicts, dist="loglogistic", param="survreg")
## Warning in aftreg(Y ~ clinic + dose + prison, data = addicts, dist =
## "loglogistic", : 'survreg' is a deprecated argument value
a_Llogis.AFT = coef(Llogis.AFT)
Llogis.AFT
## Call:
## aftreg(formula = Y ~ clinic + dose + prison, data = addicts,
## dist = "loglogistic", param = "survreg")
##
## Covariate W.mean Coef Life-Expn se(Coef) Wald p
## clinic 1.378 0.581 1.787 0.172 0.001
## dose 64.317 0.032 1.032 0.006 0.000
## prison 0.418 -0.291 0.747 0.144 0.043
##
## Baseline parameters:
## log(scale) 3.563 0.389 0.000
## log(shape) 0.533 0.069 0.000
## Baseline life expectancy: NA
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1093.9
## LR test statistic 52.2
## Degrees of freedom 3
## Overall p-value 2.73866e-11
plot(Llogis.AFT)
#check.dist(Cox.PH, Llogis.AFT)
\[\ln (t)= a_0 + a_1x_1 + a_2x_2 + a_3x_3 + \epsilon\]
\[\ln(t) = (3.5634386) + (0.5805364)\times clinic + (0.0316117) \times dose + (-0.2912455) \times prison\]
G.PH <- phreg(Y ~ clinic + dose + prison,
data=addicts, dist="gompertz")
G.PH
## Call:
## phreg(formula = Y ~ clinic + dose + prison, data = addicts, dist = "gompertz")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## clinic 1.378 -1.030 0.357 0.214 0.000
## dose 64.317 -0.035 0.965 0.006 0.000
## prison 0.418 0.327 1.386 0.166 0.050
##
## log(scale) 6.262 0.195 0.000
## log(shape) 2.508 0.664 0.000
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1081.5
## LR test statistic 65.61
## Degrees of freedom 3
## Overall p-value 3.70814e-14
plot(G.PH)
#check.dist(Cox.PH, G.PH)
G.AFT <- aftreg(Y ~ clinic + dose + prison,
data=addicts, dist="gompertz", param="survreg")
## Warning in aftreg(Y ~ clinic + dose + prison, data = addicts, dist =
## "gompertz", : 'survreg' is a deprecated argument value
G.AFT
## Call:
## aftreg(formula = Y ~ clinic + dose + prison, data = addicts,
## dist = "gompertz", param = "survreg")
##
## Covariate W.mean Coef Life-Expn se(Coef) Wald p
## clinic 1.378 0.726 2.067 0.153 0.000
## dose 64.317 0.019 1.019 0.004 0.000
## prison 0.418 -0.199 0.819 0.098 0.043
##
## Baseline parameters:
## log(scale) 4.437 0.271 0.000
## log(shape) -0.573 0.336 0.088
## Baseline life expectancy: 72.6
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1081.6
## LR test statistic 65.5
## Degrees of freedom 3
## Overall p-value 4.00791e-14
plot(G.AFT)
#check.dist(Cox.PH, G.AFT)
Lognormal.AFT <- aftreg(Y ~ clinic + dose + prison,
data=addicts, dist="lognormal", param="survreg")
## Warning in aftreg(Y ~ clinic + dose + prison, data = addicts, dist =
## "lognormal", : 'survreg' is a deprecated argument value
Lognormal.AFT
## Call:
## aftreg(formula = Y ~ clinic + dose + prison, data = addicts,
## dist = "lognormal", param = "survreg")
##
## Covariate W.mean Coef Life-Expn se(Coef) Wald p
## clinic 1.378 0.576 1.780 0.176 0.001
## dose 64.317 0.034 1.034 0.006 0.000
## prison 0.418 -0.309 0.734 0.154 0.045
##
## Baseline parameters:
## log(scale) 3.407 0.398 0.000
## log(shape) -0.075 0.059 0.207
## Baseline life expectancy: NA
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1097.8
## LR test statistic 51.9
## Degrees of freedom 3
## Overall p-value 3.22167e-11
plot(Lognormal.AFT)
#check.dist(Cox.PH, Lognormal.AFT)
library(flexsurv) # provides flexsurvreg()
Exp.AFT <- flexsurvreg(Y ~ clinic + dose + prison,
data=addicts, dist="exponential")
Exp.AFT
## Call:
## flexsurvreg(formula = Y ~ clinic + dose + prison, data = addicts,
## dist = "exponential")
##
## Estimates:
## data mean est L95% U95% se exp(est) L95%
## rate NA 0.02511 0.01080 0.05842 0.01082 NA NA
## clinic 1.31513 -0.88058 -1.29340 -0.46776 0.21063 0.41454 0.27434
## dose 60.39916 -0.02892 -0.04096 -0.01687 0.00614 0.97150 0.95987
## prison 0.46639 0.25265 -0.07052 0.57582 0.16489 1.28743 0.93191
## U95%
## rate NA
## clinic 0.62640
## dose 0.98327
## prison 1.77859
##
## N = 238, Events: 150, Censored: 88
## Total time at risk: 95812
## Log-likelihood = -1093.971, df = 4
## AIC = 2195.942
Weib.AFT <- flexsurvreg(Y ~ clinic + dose + prison,
data=addicts, dist="weibull")
Weib.AFT
## Call:
## flexsurvreg(formula = Y ~ clinic + dose + prison, data = addicts,
## dist = "weibull")
##
## Estimates:
## data mean est L95% U95% se exp(est)
## shape NA 1.37019 1.20026 1.56418 0.09257 NA
## scale NA 60.63335 31.87629 115.33346 19.89127 NA
## clinic 1.31513 0.70904 0.40089 1.01720 0.15722 2.03204
## dose 60.39916 0.02443 0.01543 0.03342 0.00459 1.02473
## prison 0.46639 -0.22947 -0.46621 0.00727 0.12079 0.79496
## L95% U95%
## shape NA NA
## scale NA NA
## clinic 1.49315 2.76543
## dose 1.01555 1.03399
## prison 0.62738 1.00730
##
## N = 238, Events: 150, Censored: 88
## Total time at risk: 95812
## Log-likelihood = -1084.477, df = 5
## AIC = 2178.953
Llogis.AFT <- flexsurvreg(Y ~ clinic + dose + prison,
data=addicts, dist="llogis")
Llogis.AFT
## Call:
## flexsurvreg(formula = Y ~ clinic + dose + prison, data = addicts,
## dist = "llogis")
##
## Estimates:
## data mean est L95% U95% se exp(est) L95%
## shape NA 1.70428 1.48980 1.94964 0.11696 NA NA
## scale NA 35.27831 16.44372 75.68597 13.73944 NA NA
## clinic 1.31513 0.58060 0.24432 0.91687 0.17157 1.78711 1.27675
## dose 60.39916 0.03161 0.02079 0.04244 0.00552 1.03212 1.02101
## prison 0.46639 -0.29127 -0.57344 -0.00910 0.14397 0.74731 0.56358
## U95%
## shape NA
## scale NA
## clinic 2.50146
## dose 1.04335
## prison 0.99094
##
## N = 238, Events: 150, Censored: 88
## Total time at risk: 95812
## Log-likelihood = -1093.915, df = 5
## AIC = 2197.83
plot(Llogis.AFT, ci=FALSE, conf.int=FALSE, ylab="Survival", xlab="Time")
lines(Weib.AFT, col="blue", ci=FALSE)
lines(Exp.AFT, col="green", ci=FALSE)
legend("topright", lty=c(1,1,1), lwd=c(2,2,2), col=c("green", "blue", "red"),
c("Exp", "Weibull", "Loglogistic"))
The quantity \(\exp(\beta_1)\) is a ratio of hazards, thus:
- when \(\exp(\beta_1)=1\): there is no difference in hazards between the two groups;
- when \(\exp(\beta_1)<1\): the hazard of dropout in the comparison group (i.e., numerator) is lower than the hazard of dropout in the reference group (i.e., denominator) by \((1-\exp(\beta_1)) \times 100\%\);
- when \(\exp(\beta_1)>1\): the hazard of dropout in the comparison group (i.e., numerator) is \(\exp(\beta_1)\) times the hazard of dropout in the reference group (i.e., denominator), that is, higher by \((\exp(\beta_1)-1) \times 100\%\).
The same rule applies to logistic regression and other multiplicative models; simply replace “hazard” with “odds” (or the appropriate metric for the model). How about an additive model (e.g., linear regression)? There the coefficient is a difference rather than a ratio, so the value indicating no effect is 0 rather than 1.
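As a small illustration of the rule above, the sketch below turns a ratio into a short description. The helper name `interpret_ratio` is hypothetical, and the two example values are the exp(coef) estimates for dose and prison taken from the stratified Cox output shown later in this document.

```r
# Hypothetical helper: describe a multiplicative effect such as a hazard ratio
interpret_ratio <- function(ratio) {
  if (isTRUE(all.equal(ratio, 1))) {
    "no difference"
  } else if (ratio < 1) {
    sprintf("%.1f%% lower", (1 - ratio) * 100)
  } else {
    sprintf("%.2f times as high", ratio)
  }
}

interpret_ratio(0.965495)  # dose: hazard per additional unit of dose
interpret_ratio(1.476397)  # prison: hazard for prison = 1 versus prison = 0
```

The same helper can be reused for odds ratios, since only the name of the metric changes.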
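Akaike’s information criterion (AIC) provides an approach for comparing the fit of models with different underlying distributions, making use of the -2 log likelihood statistic: \(AIC = -2\log L + 2k\), where \(k\) is the number of estimated parameters (the df in the output above). A smaller AIC indicates a better trade-off between fit and model complexity.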
Figure: “AIC Example”
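The AIC values printed by flexsurvreg can be reproduced by hand. The sketch below assumes the usual definition, -2 times the log-likelihood plus 2 times the number of parameters, and uses the log-likelihoods and df copied from the three outputs above; the helper name `aic` is only illustrative.

```r
# AIC = -2 * logLik + 2 * (number of estimated parameters)
aic <- function(loglik, df) -2 * loglik + 2 * df

# Log-likelihoods and degrees of freedom copied from the flexsurvreg output above
fits <- data.frame(
  model  = c("Exponential", "Weibull", "Log-logistic"),
  loglik = c(-1093.971, -1084.477, -1093.915),
  df     = c(4, 5, 5)
)
fits$AIC <- aic(fits$loglik, fits$df)

# Smallest AIC first
fits[order(fits$AIC), ]
```

On these numbers the Weibull model has the smallest AIC, which agrees with the printed values up to rounding.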
The exponential distribution is characterized by the fact that it lacks memory. In other words, items whose life lengths follow an exponential distribution do not age; no matter how old they are, if they are alive they are as good as new. This concept is not useful when it comes to human lives, but the life lengths of electronic components are often modeled by the exponential distribution in reliability theory.
If the exponential distribution is not useful in describing human lives, it may be so for short segments of life. At least it will be a good approximation if the segment is short enough. This is the idea behind the piecewise constant hazards distribution. Its definition involves a partition of the time (age) axis, and one positive constant (the hazard level) corresponding to each interval. Note that the last interval will be open, with infinite length; only a finite number of cut points are allowed. The definition of the hazard function \(h(t)\) becomes, with the cuts denoted \(\mathbf{t}=(t_1 < \cdots < t_n)\) and the levels denoted \(\mathbf{h}=(h_1, \dots, h_{n+1})\): \[h(t;\mathbf{t},\mathbf{h})=\begin{cases} h_1, & t \le t_1,\\ h_i, & t_{i-1} < t \le t_i,\; i=2,\dots,n,\\ h_{n+1}, & t_n<t. \end{cases}\]
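The definition can be made concrete with a few lines of base R. This is a minimal sketch, not the implementation used by eha; the function names `pch_hazard`, `pch_cumhaz`, and `pch_survival` and the example cuts and levels are made up for illustration.

```r
# Piecewise constant hazard: level h_1 up to the first cut,
# h_i on (t_{i-1}, t_i], and h_{n+1} after the last cut
pch_hazard <- function(t, cuts, levels) {
  stopifnot(length(levels) == length(cuts) + 1, all(levels > 0))
  levels[findInterval(t, cuts, left.open = TRUE) + 1]
}

# Cumulative hazard: each level times the time spent in its interval before t
pch_cumhaz <- function(t, cuts, levels) {
  lower <- c(0, cuts)
  upper <- c(cuts, Inf)
  sapply(t, function(x) sum(levels * pmax(0, pmin(x, upper) - lower)))
}

# Survival function S(t) = exp(-H(t))
pch_survival <- function(t, cuts, levels) exp(-pch_cumhaz(t, cuts, levels))

cuts   <- c(1, 3)        # illustrative cut points
levels <- c(0.5, 1, 2)   # illustrative hazard levels, one more than the cuts

pch_hazard(c(0.5, 1, 2, 3, 4), cuts, levels)  # 0.5 0.5 1.0 1.0 2.0
pch_cumhaz(4, cuts, levels)                   # 0.5*1 + 1*2 + 2*1 = 4.5
pch_survival(4, cuts, levels)
```

Within each interval the survival curve decays exponentially at the local rate, which is the sense in which the exponential model is used only for short segments.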
Piecewise constant hazards function, PH (pchreg fits a proportional hazards model; the object name PCH.AFT1 is kept as in the code)
PCH.AFT1 <- pchreg(Y ~ clinic + dose + prison, data=addicts, cuts=1:1000)
PCH.AFT1
## Call:
## pchreg(formula = Y ~ clinic + dose + prison, data = addicts,
## cuts = 1:1000)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## clinic 1.378 -1.009 0.365 0.215 0.000
## dose 64.317 -0.035 0.965 0.006 0.000
## prison 0.418 0.327 1.386 0.167 0.051
##
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -809.54
## LR test statistic 64.52
## Degrees of freedom 3
## Overall p-value 6.36158e-14
plot(PCH.AFT1)
check.dist(Cox.PH, PCH.AFT1)
Suppose that we collect data measuring time (variable \(time\)) from the onset of risk at time zero until occurrence of an event of interest (variable \(fail\)) on patients from different hospitals (variable \(hospital\)). We want to study patients’ survival as a function of some risk factors, say age and gender (variables \(age\) and \(gender\)).
There are various ways of adjusting for group effects. By group effects we mean that subjects’ failure times are correlated within a group, or that subjects are heterogeneous across groups. Each approach depends on the nature of the grouping of subjects and on the assumptions we are willing to make about the effect of grouping on subjects’ survival.
In sum, there is no definitive recommendation on how to account for the group effect and on which model is the most appropriate when analyzing data.
A widely used technique for adjusting for the correlation among outcomes on the same subject is called robust estimation (also referred to as empirical estimation). This technique essentially involves adjusting the estimated variances of regression coefficients obtained for a fitted model to account for misspecification of the assumed correlation structure. In the example below, the second model passes id=RANDID to coxph, and its output reports a robust se column next to the usual se(coef).
library(survival)
fram <- read.csv("frmgham2.csv", header = TRUE)
attach(fram)
#head(fram)
Y=Surv(TIMECVD, CVD==1)
mod1=coxph(Y ~ BMI + factor(SEX), data=fram)
summary(mod1)
## Call:
## coxph(formula = Y ~ BMI + factor(SEX), data = fram)
##
## n= 11575, number of events= 2879
## (52 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## BMI 0.047292 1.048428 0.004396 10.76 <2e-16 ***
## factor(SEX)2 -0.704816 0.494200 0.037834 -18.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BMI 1.0484 0.9538 1.0394 1.0575
## factor(SEX)2 0.4942 2.0235 0.4589 0.5322
##
## Concordance= 0.618 (se = 0.005 )
## Likelihood ratio test= 487.5 on 2 df, p=<2e-16
## Wald test = 476.4 on 2 df, p=<2e-16
## Score (logrank) test = 495.1 on 2 df, p=<2e-16
mod1_id=coxph(Y ~ BMI + factor(SEX), id=RANDID, data=fram)
summary(mod1_id)
## Call:
## coxph(formula = Y ~ BMI + factor(SEX), data = fram, id = RANDID)
##
## n= 11575, number of events= 2879
## (52 observations deleted due to missingness)
##
## coef exp(coef) se(coef) robust se z Pr(>|z|)
## BMI 0.047292 1.048428 0.004396 0.007177 6.59 4.41e-11 ***
## factor(SEX)2 -0.704816 0.494200 0.037834 0.062677 -11.24 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BMI 1.0484 0.9538 1.0338 1.0633
## factor(SEX)2 0.4942 2.0235 0.4371 0.5588
##
## Concordance= 0.618 (se = 0.008 )
## Likelihood ratio test= 487.5 on 2 df, p=<2e-16
## Wald test = 187.1 on 2 df, p=<2e-16
## Score (logrank) test = 495.1 on 2 df, p=<2e-16, Robust = 174.5 p=<2e-16
##
## (Note: the likelihood ratio and score tests assume independence of
## observations within a cluster, the Wald and robust score tests do not).
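To see what robust estimation changes, the sketch below recomputes the Wald z statistic and the 95% confidence interval for the BMI hazard ratio from the two standard errors printed above. The coefficient is identical in both fits; only the standard error differs. The helper name `wald_ci` is hypothetical.

```r
# Estimates copied from the two coxph outputs above (BMI row)
coef_bmi  <- 0.047292
se_naive  <- 0.004396   # se(coef), assumes independent observations
se_robust <- 0.007177   # robust se, allows correlation within RANDID

# Wald z statistic and confidence limits on the hazard ratio scale
wald_ci <- function(est, se, level = 0.95) {
  crit <- qnorm(1 - (1 - level) / 2)
  c(z = est / se, lower = exp(est - crit * se), upper = exp(est + crit * se))
}

round(wald_ci(coef_bmi, se_naive), 4)
round(wald_ci(coef_bmi, se_robust), 4)
```

The robust interval is wider and the z statistic smaller, matching the second summary, while the point estimate exp(coef) is unchanged.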
There are occasions we want to estimate the baseline hazard:
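The full hazard function for the Weibull PH model is \[h(t)=\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)pt^{p-1}\] The cumulative hazard is \(H(t)=\int_0^t h(u)\,du=\exp(b_0 + b_1 x_1 + \cdots + b_n x_n)t^p\). Therefore, in terms of \(S(t)=\exp(-H(t))\), \[ S(t)=\exp(-\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t^p) \]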
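With the Weibull PH model estimates from the “Addicts” dataset, the output reports log(scale) = 4.1048451 and log(shape) = 0.3149526, so \(p=\exp(0.3149526)\approx 1.3702\) and \(\lambda=\exp(4.1048451)\approx 60.63\). These agree with the shape and scale in the flexsurvreg Weibull output above, which indicates that the scale \(\lambda\) is measured on the time axis, i.e. the baseline hazard is \(h_0(t)=(p/\lambda)(t/\lambda)^{p-1}\). In the notation above this gives \(b_0=-p\log\lambda=-\exp(0.3149526)\times 4.1048451\approx -5.6244\), and the fitted model is \[\hat{h}(t)=\exp((-5.6244) + (-0.9715245)\times clinic + (-0.0334675) \times dose + (0.3144143) \times prison) \exp(0.3149526)t^{\exp(0.3149526)-1}\]
\[ \hat{S}(t)=\exp(-\exp((-5.6244) + (-0.9715245)\times clinic + (-0.0334675) \times dose + (0.3144143) \times prison)t^{\exp(0.3149526)}) \]
In the eha package in R, the baseline parameters are presented as
log(scale) (i.e., \(\log\lambda =
4.1048451\)) and log(shape) (i.e., the shape parameter is
\(p=\exp(0.3149526)\)) in the output.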
Please note that the parameterization and naming are different across
statistical packages/R-packages.
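Because parameterizations differ, it helps to cross-check one package against another. The sketch below takes the log(scale), log(shape), and PH coefficients quoted above and converts them to the shape, scale, and AFT coefficients that flexsurvreg prints for the same Weibull model. It assumes that the reported scale is the Weibull scale \(\lambda\) on the time axis, with baseline hazard \((p/\lambda)(t/\lambda)^{p-1}\); the agreement with the flexsurvreg output is what supports that assumption. The object and function names are illustrative.

```r
# Values copied from the text above (Weibull PH fit to the addicts data)
log_scale <- 4.1048451
log_shape <- 0.3149526
beta_ph   <- c(clinic = -0.9715245, dose = -0.0334675, prison = 0.3144143)

p      <- exp(log_shape)   # Weibull shape
lambda <- exp(log_scale)   # Weibull scale on the time axis (assumption)

# With h0(t) = (p / lambda) * (t / lambda)^(p - 1), the intercept b0 in
# h(t) = exp(b0 + x'b) * p * t^(p - 1) equals -p * log(lambda)
b0 <- -p * log_scale

# For the Weibull model, AFT and PH coefficients satisfy beta_AFT = -beta_PH / p
beta_aft <- -beta_ph / p

# Baseline hazard and survival functions (all covariates equal to zero)
baseline_hazard   <- function(t) exp(b0) * p * t^(p - 1)
baseline_survival <- function(t) exp(-exp(b0) * t^p)

round(c(shape = p, scale = lambda, b0 = b0), 4)
round(beta_aft, 5)   # compare with the est column of the flexsurvreg Weibull fit
```

The converted values reproduce the flexsurvreg estimates (shape 1.37019, scale 60.63335, clinic 0.70904, dose 0.02443, prison -0.22947) to the printed precision.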
For the exponential model (\(p\)=1 in the Weibull model), the full hazard function is \[h(t)=\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)pt^{p-1}=\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)\] Therefore, in terms of \(S(t)\), \[ S(t)=\exp(-\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t^p)=\exp(-\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t) \]
For the log-logistic model, the full hazard function is \[h(t)=\frac{\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)pt^{p-1}}{1+\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t^{p}}\] Therefore, in terms of \(S(t)\), \[ S(t)=\frac{1}{1+\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t^{p}} \]
Let us assume that in a follow-up study, the cohort is not homogeneous but instead consists of two equally sized groups with differing hazard rates. Assume further that we have no indication of which group an individual belongs to, and that members of both groups follow an exponential life length distribution: \[ h_1(t)=\lambda_1, \;\; t>0 \] \[ h_2(t)=\lambda_2, \;\; t>0 \] This implies that the corresponding survival functions \(S_1\) and \(S_2\) are \[ S_1(t)=e^{-\lambda_1 t}, \;\; t>0 \] \[ S_2(t)=e^{-\lambda_2 t}, \;\; t>0 \] and a randomly chosen individual will follow the “population mortality” \(S\), which is a mixture of the two distributions: \[ S(t)=\frac{1}{2} S_1(t) + \frac{1}{2} S_2(t), \;\; t>0.\]
Let us calculate the hazard function for this mixture. We start by finding the density function \(f(t)\): \[ f(t)=-\frac{dS(t)}{dt} =\frac{1}{2} (\lambda_1 e^{-\lambda_1 t} + \lambda_2 e^{-\lambda_2 t}), \;\; t>0\] Then, by the definition of \(h(t)\), we get \[ h(t) = \frac{f(t)}{S(t)} = \omega (t) \lambda_1 + (1- \omega (t)) \lambda_2, \;\; t>0\] with \[\omega(t)=\frac{e^{-\lambda_1 t}}{e^{-\lambda_1 t} + e^{-\lambda_2 t}} \] It is easy to see that as \(t \rightarrow \infty\), \[\omega(t) \rightarrow \begin{cases} 1, & \lambda_1 < \lambda_2,\\ 1/2, & \lambda_1 = \lambda_2,\\ 0, & \lambda_1 > \lambda_2, \end{cases}\]
implying that
\[h(t) \rightarrow \min(\lambda_1, \lambda_2), \;\; t \rightarrow \infty\] The following figure shows the population hazard function (solid line). The dashed lines are the hazard functions of each group, \(\lambda_1=1\), \(\lambda_2=2\).
“Population hazard function”
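The mixture hazard can be evaluated directly from the formulas above. The following sketch uses \(\lambda_1=1\) and \(\lambda_2=2\) as in the figure; the function names are illustrative. It also checks that the weighted form of \(h(t)\) agrees with \(f(t)/S(t)\).

```r
lambda1 <- 1
lambda2 <- 2

# Population survival and density for the 50/50 mixture
S <- function(t) 0.5 * exp(-lambda1 * t) + 0.5 * exp(-lambda2 * t)
f <- function(t) 0.5 * (lambda1 * exp(-lambda1 * t) + lambda2 * exp(-lambda2 * t))

# Weight of group 1 among survivors at time t, and the population hazard
omega <- function(t) exp(-lambda1 * t) / (exp(-lambda1 * t) + exp(-lambda2 * t))
h_pop <- function(t) omega(t) * lambda1 + (1 - omega(t)) * lambda2

t_grid <- c(0, 0.5, 1, 2, 5, 10)
cbind(t = t_grid, hazard = h_pop(t_grid), check = f(t_grid) / S(t_grid))

# Plot in the spirit of the figure: population hazard (solid), group hazards (dashed)
curve(h_pop, from = 0, to = 5, ylim = c(0, 2.5), xlab = "Time", ylab = "Hazard")
abline(h = c(lambda1, lambda2), lty = 2)
```

The population hazard starts at the average of the two rates and declines toward the smaller one, because the high-risk group is depleted first.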
Frailty models in survival analysis correspond to hierarchical models in linear or generalized linear models. They are also called mixed effects models. They contain an extra random component designed to account for individual-level (or subgroup-level) differences in the hazard otherwise unaccounted for by the model. The frailty, \(\alpha\), is a multiplicative effect on the hazard assumed to follow some distribution. The hazard function conditional on the frailty can be expressed as \(h(t|\alpha)=\alpha [h(t)]\).
Vaupel et al. (1979) described an individual frailty model, \[h(t;x,Z)=h_0(t)Z e^{\beta x}, \;\; t>0,\] where \(Z\) is assumed to be drawn independently for each individual. Hazard rates for “random survivors” are not proportional, but converging (to each other) if the frailty distribution has finite variance. Thus, the problem may be less pronounced in AFT than in PH regression.
R offers three choices for the distribution of the frailty: the gamma, Gaussian, and \(t\) distributions. The variance (theta) of the frailty component is a parameter typically estimated by the model. If theta = 0, then there is no frailty.
First, we rerun a stratified Cox model without frailty. The stratified variable is CLINIC while PRISON and DOSE are predictor variables. A stratified Cox model is appropriate if the PH assumption is violated for CLINIC and met for PRISON and DOSE and our interest is in estimating a hazard ratio for PRISON or DOSE.
Y=Surv(addicts$survt,addicts$status==1)
coxph(Y~ prison + dose + strata(clinic),
data=addicts)
## Call:
## coxph(formula = Y ~ prison + dose + strata(clinic), data = addicts)
##
## coef exp(coef) se(coef) z p
## prison 0.389605 1.476397 0.168930 2.306 0.0211
## dose -0.035115 0.965495 0.006465 -5.432 5.59e-08
##
## Likelihood ratio test=33.91 on 2 df, p=4.322e-08
## n= 238, number of events= 150
The estimated hazard ratio for PRISON=1 versus PRISON=0 is \(\exp(0.3896) = 1.476\). Next we illustrate how to include a frailty component in this model.
coxph(Y~ prison + dose + strata(clinic)
+ frailty(id, distribution="gamma"),
data=addicts)
## Warning in coxpenal.fit(X, Y, istrat, offset, init = init, control, weights =
## weights, : Inner loop failed to coverge for iterations 2
## Call:
## coxph(formula = Y ~ prison + dose + strata(clinic) + frailty(id,
## distribution = "gamma"), data = addicts)
##
## coef se(coef) se2 Chisq DF p
## prison 0.39003 0.16916 0.16893 5.31590 1.00 0.021
## dose -0.03517 0.00647 0.00647 29.50946 1.00 5.6e-08
## frailty(id, distribution 0.34134 0.32 0.314
##
## Iterations: 5 outer, 41 Newton-Raphson
## Variance of random effect= 0.00227 I-likelihood = -597.5
## Degrees of freedom for terms= 1.0 1.0 0.3
## Likelihood ratio test=34.6 on 2.32 df, p=5e-08
## n= 238, number of events= 150
The term “+ frailty(id, distribution=“gamma”)” is included in the model formula. The first argument of the frailty function is the variable id and indicates that the unmeasured heterogeneity (the frailty) is at the individual level. The second argument indicates that the distribution of the random component is the gamma distribution.
Under the table of parameter estimates, the output indicates that the variance of the random effect is 0.00227. The p-value for the frailty component, 0.314, is provided in the third row and rightmost column of the table and indicates that the frailty component is not significant. We find no evidence that the variance of the random component differs from zero for this model (i.e., no evidence of frailty). The parameter estimates for PRISON and DOSE changed minimally in this model compared to the model previously run without the frailty.
Now, suppose the variable CLINIC was unmeasured. Next we consider a Cox model (without frailty) that does not contain CLINIC.
coxph(Y~ prison + dose, data=addicts)
## Call:
## coxph(formula = Y ~ prison + dose, data = addicts)
##
## coef exp(coef) se(coef) z p
## prison 0.18965 1.20883 0.16427 1.155 0.248
## dose -0.03608 0.96457 0.00600 -6.013 1.83e-09
##
## Likelihood ratio test=38.21 on 2 df, p=5.045e-09
## n= 238, number of events= 150
The estimated hazard ratio for PRISON=1 versus PRISON=0 is \(\exp(0.1897) = 1.209\) as compared to \(\exp(0.3896) = 1.476\) that was observed in the model that contained CLINIC as a stratified variable. In previous sections CLINIC was shown to be an important predictor that violates the proportional hazards assumption. If CLINIC was unaccounted for (as in the model above), there may be a source of unobserved heterogeneity that a frailty component might address.
The next model omits CLINIC but includes a frailty component and the predictors PRISON and DOSE. We also use the summary function to get exponentiated estimates.
summary(coxph(Y~ prison + dose
+ frailty(id, distribution="gamma"),
data=addicts))
## Call:
## coxph(formula = Y ~ prison + dose + frailty(id, distribution = "gamma"),
## data = addicts)
##
## n= 238, number of events= 150
##
## coef se(coef) se2 Chisq DF p
## prison 0.41441 0.221604 0.17590 3.5 1.00 6.1e-02
## dose -0.05166 0.008448 0.00699 37.4 1.00 9.6e-10
## frailty(id, distribution 100.5 69.34 8.6e-03
##
## exp(coef) exp(-coef) lower .95 upper .95
## prison 1.5135 0.6607 0.9803 2.3367
## dose 0.9496 1.0530 0.9341 0.9655
##
## Iterations: 6 outer, 44 Newton-Raphson
## Variance of random effect= 0.6495364 I-likelihood = -685.4
## Degrees of freedom for terms= 0.6 0.7 69.3
## Concordance= 0.854 (se = 0.015 )
## Likelihood ratio test= 190.4 on 70.65 df, p=6e-13
The variance of the frailty component is estimated at 0.65 compared to 0.00227 for the model that we showed previously that contained CLINIC as the stratified variable. The p-value for the frailty is highly significant at 8.6e–3 = 0.0086. The hazard ratio for the effect of PRISON is \(\exp(0.4144) = 1.51\). The summary function can be applied to the coxph function to get R to exponentiate the parameter estimates (with 95% CI) when a frailty component is included in a Cox model.
It is interesting that the estimated hazard ratio for PRISON (1.51) obtained in this model (without CLINIC but with the frailty component) is closer to the corresponding hazard ratio obtained from the model that included CLINIC (1.476) compared to the one that did not include CLINIC (1.209). In this example, the frailty component might be accounting to some extent for the fact that CLINIC was omitted from the model.
A simple way to eliminate the effect of clustering is to stratify on the clusters.
Generalized stratified models
\[h_g(t,X)=h_{0g}(t) \exp[\beta_1X_1+\beta_2X_2+ \cdots+\beta_p X_p]\]
If the proportional hazards assumption is violated for the variable CLINIC but met for PRISON and DOSE, a stratified Cox model can be fitted with CLINIC as the stratification variable. The coxph function accepts a strata() term in the model formula. First we define the response variable \(Y\) with the Surv function and then the coxph function is used to run a stratified Cox model (code and output shown below):
Y=Surv(addicts$survt,addicts$status==1)
coxph(Y~ prison + dose + strata(clinic),data=addicts)
## Call:
## coxph(formula = Y ~ prison + dose + strata(clinic), data = addicts)
##
## coef exp(coef) se(coef) z p
## prison 0.389605 1.476397 0.168930 2.306 0.0211
## dose -0.035115 0.965495 0.006465 -5.432 5.59e-08
##
## Likelihood ratio test=33.91 on 2 df, p=4.322e-08
## n= 238, number of events= 150
Interaction terms for CLINIC can be included directly in the model formula by including product terms using the : operator (clinic:prison and clinic:dose) (code and output follow):
coxph(Y~ prison + dose + clinic:prison + clinic:dose +strata(clinic),
data=addicts)
## Call:
## coxph(formula = Y ~ prison + dose + clinic:prison + clinic:dose +
## strata(clinic), data = addicts)
##
## coef exp(coef) se(coef) z p
## prison 1.085836 2.961914 0.538636 2.016 0.0438
## dose -0.034635 0.965958 0.019797 -1.750 0.0802
## prison:clinic -0.582989 0.558227 0.428135 -1.362 0.1733
## dose:clinic -0.001164 0.998837 0.014570 -0.080 0.9363
##
## Likelihood ratio test=35.77 on 4 df, p=3.222e-07
## n= 238, number of events= 150
Suppose we wish to estimate the hazard ratio for PRISON=1 vs. PRISON=0 for CLINIC=2. This hazard ratio can be estimated by exponentiating the coefficient for prison plus 2 times the coefficient for the CLINIC* PRISON interaction term. This expression is obtained by substituting the appropriate values into the hazard in both the numerator (for PRISON=1) and denominator (for PRISON=0) (see below):
\[HR=\frac{h_0(t) \exp[1\beta_1+\beta_2 DOSE + (2)(1)\beta_3 + \beta_4 clinic \times DOSE]}{h_0(t) \exp[(0)\beta_1+\beta_2 DOSE + (2)(0)\beta_3 + \beta_4 clinic \times DOSE]}=\exp(\beta_1 + 2 \beta_3)\]
The resulting hazard ratio, \(\exp(\beta_1
+ 2 \beta_3)\), is an exponentiated linear combination of
parameters. Unfortunately, the survival package does not have a direct
counterpart to the lincom command that Stata provides or the estimate
statement that SAS provides in order to
calculate a linear combination of parameter estimates. However, an
approach that can be used in any statistical software package for such a
situation is to recode the variable(s) of interest such that the desired
estimate is no longer a linear combination of parameter estimates.
In this example, we are interested in a hazard ratio for PRISON=1 versus PRISON=0 for CLINIC=2. We can define a new variable CLINIC2 = CLINIC \(-\) 2, so that when CLINIC=2, CLINIC2=0.
addicts$clinic2=addicts$clinic-2
summary(coxph(Y~ prison + dose + clinic2:prison + clinic2:dose
+ strata(clinic2), data=addicts))
## Call:
## coxph(formula = Y ~ prison + dose + clinic2:prison + clinic2:dose +
## strata(clinic2), data = addicts)
##
## n= 238, number of events= 150
##
## coef exp(coef) se(coef) z Pr(>|z|)
## prison -0.080143 0.922985 0.384305 -0.209 0.83481
## dose -0.036964 0.963711 0.012346 -2.994 0.00275 **
## prison:clinic2 -0.582989 0.558227 0.428135 -1.362 0.17329
## dose:clinic2 -0.001164 0.998837 0.014570 -0.080 0.93632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## prison 0.9230 1.083 0.4346 1.9603
## dose 0.9637 1.038 0.9407 0.9873
## prison:clinic2 0.5582 1.791 0.2412 1.2919
## dose:clinic2 0.9988 1.001 0.9707 1.0278
##
## Concordance= 0.649 (se = 0.026 )
## Likelihood ratio test= 35.77 on 4 df, p=3e-07
## Wald test = 34.09 on 4 df, p=7e-07
## Score (logrank) test = 34.97 on 4 df, p=5e-07
The first line of code defines a new variable CLINIC2. CLINIC2 is used in the stratified Cox model rather than CLINIC. We are interested in the hazard ratio for PRISON=1 vs PRISON=0 for CLINIC2=0. When CLINIC2=0, the product terms cancel and the hazard ratio reduces to \(\exp(\beta_1)\).
The second line of code applies the summary function to the coxph function. The summary function applied in this way produces additional output including 95% confidence intervals for the hazard ratios.
The estimate for \(\exp(\beta_1)\) can be found in the second table, \(\exp(coef)\) for prison = 0.9230. The lower and upper confidence limits are 0.4346 and 1.9603, respectively. If we did not recode the variable CLINIC the problem would have been more complicated in that we would have had to use the variance–covariance matrix (which can be obtained with the vcov function) to calculate a 95% confidence interval for this hazard ratio.
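The alternative route through the variance–covariance matrix can be sketched as follows. The point estimate uses the coefficients printed for the interaction model with the original CLINIC coding; the helper `lincom_hr` is a hypothetical function, not part of the survival package, and relies only on coef() and vcov(), which are available for coxph fits.

```r
# Coefficients copied from the interaction model output (original CLINIC coding)
b <- c(prison = 1.085836, dose = -0.034635,
       "prison:clinic" = -0.582989, "dose:clinic" = -0.001164)

# Weights for beta_prison + 2 * beta_prison:clinic (PRISON effect when CLINIC = 2)
w <- c(1, 0, 2, 0)
log_hr <- sum(w * b)
hr <- exp(log_hr)
hr   # close to the exp(coef) for prison in the recoded model

# Hypothetical helper: hazard ratio and Wald CI for a linear combination w'beta,
# using var(w'beta) = w' V w with V = vcov(fit)
lincom_hr <- function(fit, w, level = 0.95) {
  est  <- sum(w * coef(fit))
  se   <- sqrt(drop(t(w) %*% vcov(fit) %*% w))
  crit <- qnorm(1 - (1 - level) / 2)
  c(HR = exp(est), lower = exp(est - crit * se), upper = exp(est + crit * se))
}
```

Assuming the interaction model had been stored in an object, say `fit_int`, a call such as `lincom_hr(fit_int, c(1, 0, 2, 0))` would be expected to give the same interval as the recoding approach.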
Weib.PH_st <- phreg(Y ~ clinic + dose,
data=addicts, dist="weibull", param="survreg")
b_Weib.PH_st = coef(Weib.PH_st)
Weib.PH_st
## Call:
## phreg(formula = Y ~ clinic + dose, data = addicts, dist = "weibull",
## param = "survreg")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## clinic 1.378 -0.925 0.397 0.211 0.000
## dose 64.317 -0.033 0.968 0.006 0.000
##
## log(scale) 4.066 0.326 0.000
## log(shape) 0.304 0.068 0.000
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1086.3
## LR test statistic 57.34
## Degrees of freedom 2
## Overall p-value 3.54494e-13
Weib.PH_st <- phreg(Y ~ clinic + dose + strata(prison),
data=addicts, dist="weibull", param="survreg")
b_Weib.PH_st = coef(Weib.PH_st)
Weib.PH_st
## Call:
## phreg(formula = Y ~ clinic + dose + strata(prison), data = addicts,
## dist = "weibull", param = "survreg")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## clinic 1.378 -0.953 0.385 0.213 0.000
## dose 64.317 -0.034 0.967 0.006 0.000
##
## log(scale):1 4.256 0.332 0.000
## log(shape):1 0.388 0.093 0.000
## log(scale):2 3.700 0.408 0.000
## log(shape):2 0.241 0.098 0.013
##
## Events 150
## Total time at risk 95812
## Max. log. likelihood -1083.9
## LR test statistic 59.36
## Degrees of freedom 2
## Overall p-value 1.28675e-13
Here are some differences between two models:
plot(Weib.PH_st)
library(survival)
ht00 <- coxph(Surv(futime,fustat)~ transplant + age + surgery,
data=jasa)
summary(ht00)
## Call:
## coxph(formula = Surv(futime, fustat) ~ transplant + age + surgery,
## data = jasa)
##
## n= 103, number of events= 75
##
## coef exp(coef) se(coef) z Pr(>|z|)
## transplant -1.71711 0.17958 0.27853 -6.165 7.05e-10 ***
## age 0.05889 1.06065 0.01505 3.913 9.12e-05 ***
## surgery -0.41902 0.65769 0.37118 -1.129 0.259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## transplant 0.1796 5.5684 0.1040 0.310
## age 1.0607 0.9428 1.0298 1.092
## surgery 0.6577 1.5205 0.3177 1.361
##
## Concordance= 0.732 (se = 0.031 )
## Likelihood ratio test= 45.85 on 3 df, p=6e-10
## Wald test = 47.15 on 3 df, p=3e-10
## Score (logrank) test = 52.63 on 3 df, p=2e-11
coef_ht00 <- coef(summary(ht00))
exp.ci_ht00 <- summary(ht00)$conf.int
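# Keep only patients with at least 30 days of follow-up
ind30 <- jasa$futime >= 30
# TRUE if the patient received a transplant within the first 30 days
transplant30 <- (jasa$transplant == 1) & (jasa$wait.time < 30)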
ht01 <- coxph(Surv(futime,fustat)~ transplant30 + age + surgery,
data=jasa, subset=ind30)
summary(ht01)
## Call:
## coxph(formula = Surv(futime, fustat) ~ transplant30 + age + surgery,
## data = jasa, subset = ind30)
##
## n= 79, number of events= 52
##
## coef exp(coef) se(coef) z Pr(>|z|)
## transplant30TRUE -0.04214 0.95874 0.28377 -0.148 0.8820
## age 0.03720 1.03790 0.01714 2.170 0.0300 *
## surgery -0.81966 0.44058 0.41297 -1.985 0.0472 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## transplant30TRUE 0.9587 1.0430 0.5497 1.6720
## age 1.0379 0.9635 1.0036 1.0734
## surgery 0.4406 2.2697 0.1961 0.9898
##
## Concordance= 0.618 (se = 0.044 )
## Likelihood ratio test= 9.5 on 3 df, p=0.02
## Wald test = 8.61 on 3 df, p=0.03
## Score (logrank) test = 8.94 on 3 df, p=0.03
coef_ht01 <- coef(summary(ht01))
exp.ci_ht01 <- summary(ht01)$conf.int
id <- 1:nrow(jasa)
jasaT <- data.frame(id, jasa)
id.simple <- c(2,5,10,12,28,95)
heart.simple <- jasaT[id.simple, c(1,10,9,6,11)]
heart.simple
## id wait.time futime fustat transplant
## 2 2 NA 5 1 0
## 5 5 NA 17 1 0
## 10 10 11 57 1 1
## 12 12 NA 7 1 0
## 28 28 70 71 1 1
## 95 95 1 15 1 1
# Ignore that transplant is time dependent
ht02 <- coxph(Surv(futime,fustat)~ transplant, data=heart.simple)
summary(ht02)
## Call:
## coxph(formula = Surv(futime, fustat) ~ transplant, data = heart.simple)
##
## n= 6, number of events= 6
##
## coef exp(coef) se(coef) z Pr(>|z|)
## transplant -1.6878 0.1849 1.1718 -1.44 0.15
##
## exp(coef) exp(-coef) lower .95 upper .95
## transplant 0.1849 5.408 0.0186 1.838
##
## Concordance= 0.733 (se = 0.077 )
## Likelihood ratio test= 2.47 on 1 df, p=0.1
## Wald test = 2.07 on 1 df, p=0.1
## Score (logrank) test = 2.56 on 1 df, p=0.1
coef_ht02 <- coef(summary(ht02))
exp.ci_ht02 <- summary(ht02)$conf.int
Sample of six patients from the Stanford heart transplant dataset
# Accounting for time-dependent covariates
sdata <- tmerge(heart.simple, heart.simple, id=id, death=event(futime,fustat),
transpl=tdc(wait.time))
sdata
## id wait.time futime fustat transplant tstart tstop death transpl
## 1 2 NA 5 1 0 0 5 1 0
## 2 5 NA 17 1 0 0 17 1 0
## 3 10 11 57 1 1 0 11 0 0
## 4 10 11 57 1 1 11 57 1 1
## 5 12 NA 7 1 0 0 7 1 0
## 6 28 70 71 1 1 0 70 0 0
## 7 28 70 71 1 1 70 71 1 1
## 8 95 1 15 1 1 0 1 0 0
## 9 95 1 15 1 1 1 15 1 1
heart.simple.counting <- sdata[,-(2:5)] # drop columns 2 through 5
heart.simple.counting
## id tstart tstop death transpl
## 1 2 0 5 1 0
## 2 5 0 17 1 0
## 3 10 0 11 0 0
## 4 10 11 57 1 1
## 5 12 0 7 1 0
## 6 28 0 70 0 0
## 7 28 70 71 1 1
## 8 95 0 1 0 0
## 9 95 1 15 1 1
Start-stop counting process
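To make clear what tmerge produced, the same start-stop rows can be built by hand in base R. This is only a sketch for the six-patient sample, with the data re-entered from the table above; the helper name `split_one` is made up. A patient who never receives a transplant contributes one row, and a transplanted patient contributes one row before and one row after the transplant.

```r
# The six patients, re-entered from the printed table
heart.simple <- data.frame(
  id         = c(2, 5, 10, 12, 28, 95),
  wait.time  = c(NA, NA, 11, NA, 70, 1),
  futime     = c(5, 17, 57, 7, 71, 15),
  fustat     = 1,
  transplant = c(0, 0, 1, 0, 1, 1)
)

# Split one patient's follow-up at the transplant time, if there is one
split_one <- function(row) {
  if (is.na(row$wait.time)) {
    data.frame(id = row$id, tstart = 0, tstop = row$futime,
               death = row$fustat, transpl = 0)
  } else {
    data.frame(id = row$id,
               tstart  = c(0, row$wait.time),
               tstop   = c(row$wait.time, row$futime),
               death   = c(0, row$fustat),   # death can only end the last interval
               transpl = c(0, 1))            # covariate switches on at transplant
  }
}

counting <- do.call(rbind, lapply(seq_len(nrow(heart.simple)),
                                  function(i) split_one(heart.simple[i, ])))
counting
```

The result has nine rows and six events, the same layout as heart.simple.counting.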
Using this example, we fit a Cox model.
ht03 <- coxph(Surv(tstart,tstop, death)~ transpl,
data=heart.simple.counting)
summary(ht03)
## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ transpl, data = heart.simple.counting)
##
## n= 9, number of events= 6
##
## coef exp(coef) se(coef) z Pr(>|z|)
## transpl 0.2846 1.3292 0.9609 0.296 0.767
##
## exp(coef) exp(-coef) lower .95 upper .95
## transpl 1.329 0.7523 0.2021 8.74
##
## Concordance= 0.5 (se = 0.082 )
## Likelihood ratio test= 0.09 on 1 df, p=0.8
## Wald test = 0.09 on 1 df, p=0.8
## Score (logrank) test = 0.09 on 1 df, p=0.8
coef_ht03 <- coef(summary(ht03))
exp.ci_ht03 <- summary(ht03)$conf.int
Now, we apply this approach accounting for time dependent covariates to the full dataset.
jasa$subject <- 1:nrow(jasa) #we need an identifier variable
tdata <- with(jasa, data.frame(subject = subject,
futime= pmax(.5, fu.date - accept.dt),
txtime= ifelse(tx.date== fu.date,
(tx.date -accept.dt) -.5,
(tx.date - accept.dt)),
fustat = fustat
))
xdata <- tmerge(jasa, tdata, id=subject,
death = event(futime, fustat),
transplant = tdc(txtime),
options= list(idname="subject"))
## Warning in tmerge(jasa, tdata, id = subject, death = event(futime, fustat), :
## replacement of variable 'transplant'
ht04 <- coxph(Surv(tstart, tstop, death) ~ transplant + age + surgery,
data= xdata, ties="breslow")
summary(ht04)
## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ transplant + age +
## surgery, data = xdata, ties = "breslow")
##
## n= 170, number of events= 75
##
## coef exp(coef) se(coef) z Pr(>|z|)
## transplant 0.01238 1.01246 0.30815 0.040 0.9680
## age 0.03055 1.03102 0.01390 2.198 0.0279 *
## surgery -0.77155 0.46230 0.35967 -2.145 0.0319 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## transplant 1.0125 0.9877 0.5534 1.8522
## age 1.0310 0.9699 1.0033 1.0595
## surgery 0.4623 2.1631 0.2284 0.9356
##
## Concordance= 0.6 (se = 0.036 )
## Likelihood ratio test= 10.68 on 3 df, p=0.01
## Wald test = 9.65 on 3 df, p=0.02
## Score (logrank) test = 9.97 on 3 df, p=0.02
coef_ht04 <- coef(summary(ht04))
exp.ci_ht04 <- summary(ht04)$conf.int
SHS <- cbind(coef_ht00, coef_ht04)
SHS
## coef exp(coef) se(coef) z Pr(>|z|) coef
## transplant -1.7171129 0.1795839 0.27852978 -6.164917 7.052018e-10 0.01238011
## age 0.0588863 1.0606546 0.01504927 3.912902 9.119349e-05 0.03054984
## surgery -0.4190159 0.6576938 0.37117505 -1.128890 2.589442e-01 -0.77154747
## exp(coef) se(coef) z Pr(>|z|)
## transplant 1.0124571 0.30815300 0.0401752 0.96795345
## age 1.0310213 0.01389822 2.1981113 0.02794117
## surgery 0.4622971 0.35967149 -2.1451449 0.03194126
In contrast to Stata, SAS, and SPSS, in order to run an extended Cox model in R, the analytic dataset must be in the counting process (start, stop) format. Unfortunately, the addicts dataset is not in that format, so it needs to be altered in order to include a time-varying covariate. This can be accomplished with the survSplit function. The survSplit function can create a dataset that provides multiple observations for the same subject allowing a subject’s covariate to change values from observation to observation. The user supplies the time cutpoint(s).
The most general choice for time cutpoints that can accommodate the modeling of any time-varying covariate is a vector of time cutpoints that includes all event times in the data. The variable SURVT in the addicts dataset contains each individual’s time-to-event or time-to-censorship. The following code creates a new analytic dataset (called addicts.cp) which puts the addicts data in the counting process format using the survSplit function:
#addicts <- read.csv(file="addicts.csv", header = TRUE)
addicts.cp=survSplit(addicts,cut=addicts$survt[addicts$status==1],
end="survt", event="status",start="start",id="iid")
The first argument of the survSplit function specifies the data frame (addicts) to be manipulated into the counting process format. The cut=addicts$survt[addicts$status==1] option specifies that the time cutpoints are the values of the SURVT variable for observations where the STATUS variable equals 1 (i.e., keeping the event times but omitting censoring times). The event="status" option specifies STATUS as the variable indicating whether the individual had an event or was censored. The start="start" option creates a new variable called START. This newly defined variable for the starting times for each observation is necessary for the data to be in counting process (start, stop) format. The end="survt" option defines SURVT as the stop variable (i.e., the time-to-event variable). The option id="iid" creates a new variable called IID that identifies each individual. The survSplit function creates multiple observations for individuals at risk at multiple time points. The dataset addicts.cp created above contains 18,708 observations from the 238 observations in the addicts dataset (use the code nrow(addicts.cp) to return the number of observations).
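The effect of survSplit is easier to see on a tiny example. The sketch below uses a made-up data frame of three subjects (not the addicts data) and the formula interface of survSplit, cutting at the two event times.

```r
library(survival)

# Toy data: subjects 1 and 3 have events at days 5 and 12, subject 2 is censored at day 8
toy <- data.frame(subject = 1:3,
                  survt   = c(5, 8, 12),
                  status  = c(1, 0, 1))

# Cut follow-up at every event time, as was done for the addicts data
toy.cp <- survSplit(Surv(survt, status) ~ ., data = toy,
                    cut = toy$survt[toy$status == 1],
                    start = "start", id = "iid")
toy.cp
nrow(toy.cp)   # one row per subject per interval in which the subject is at risk
```

Subject 1 is at risk in one interval, and subjects 2 and 3 in two intervals each, so the split dataset has five rows while the number of events stays at two.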
Suppose the PH assumption was violated for the variable DOSE and we were interested in defining a time-varying covariate as the product of DOSE and the natural log of time (SURVT). This variable can easily be defined if the dataset is in counting process form with time cutpoints at each event time as shown below:
addicts.cp$logtdose=addicts.cp$dose*log(addicts.cp$survt)
We now have a new variable in the dataset (called LOGTDOSE = DOSE \(\times\) ln(T)) that varies over time. We print the dataset for one individual (iid=106) who had an event at time=35 days. Rather than print all the variables, we request a subset of them with the c function:
addicts.cp[addicts.cp$iid==106,
c("iid","start","survt","status","dose","logtdose")]
## iid start survt status dose logtdose
## 9802 106 0 7 0 40 77.83641
## 9803 106 7 13 0 40 102.59797
## 9804 106 13 17 0 40 113.32853
## 9805 106 17 19 0 40 117.77756
## 9806 106 19 26 0 40 130.32386
## 9807 106 26 29 0 40 134.69183
## 9808 106 29 30 0 40 136.04790
## 9809 106 30 33 0 40 139.86030
## 9810 106 33 35 1 40 142.21392
The variable LOGTDOSE is time dependent as its values increase with time as expected. The variable SURVT lists all the event times in the addicts dataset up to day 35 when this individual had an event. Notice STATUS=1 when the event occurred and STATUS=0 prior to the event. Next we run an extended Cox model including the predictors PRISON, DOSE, and CLINIC and the time-dependent variable LOGTDOSE:
coxph(Surv(addicts.cp$start,addicts.cp$survt,addicts.cp$status) ~
prison + dose + clinic + logtdose + cluster(iid), data=addicts.cp)
## Call:
## coxph(formula = Surv(addicts.cp$start, addicts.cp$survt, addicts.cp$status) ~
## prison + dose + clinic + logtdose, data = addicts.cp, cluster = iid)
##
## coef exp(coef) se(coef) robust se z p
## prison 0.340633 1.405837 0.167474 0.159717 2.133 0.03295
## dose -0.082625 0.920696 0.035984 0.029601 -2.791 0.00525
## clinic -1.019875 0.360640 0.215416 0.236365 -4.315 1.6e-05
## logtdose 0.008615 1.008652 0.006455 0.005248 1.642 0.10068
##
## Likelihood ratio test=66.35 on 4 df, p=1.338e-13
## n= 18708, number of events= 150
The Surv function now takes three arguments: the start variable (called START), the stop variable (called SURVT), and the status variable (called STATUS). The term cluster(iid) in the model formula indicates that there are multiple observations (clusters) from the same subject and requests that robust standard errors be produced for the coefficient estimates. These robust standard errors are designed to account for the non-independence of observations from the same subject. The Wald test z statistic of 1.64 (p = 0.10) is not significant for LOGTDOSE, providing no evidence that the proportional hazards assumption is violated for DOSE.
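Under this extended model the hazard ratio for a one-unit increase in DOSE is not a single number: at time t it equals exp(β_dose + β_logtdose × ln t). The short sketch below evaluates that expression with the coefficients copied from the output above; the time points in days and the helper name hr_dose are illustrative assumptions.

```r
# Coefficients copied from the extended Cox model output above
b_dose     <- -0.082625
b_logtdose <-  0.008615

# Hazard ratio per one-unit increase in DOSE at day t:
# exp(b_dose + b_logtdose * log(t))
hr_dose <- function(t) exp(b_dose + b_logtdose * log(t))

days <- c(35, 365, 730)   # illustrative time points (an assumption)
data.frame(day = days, hr_per_unit_dose = round(hr_dose(days), 4))
```

Because the LOGTDOSE coefficient is positive, the estimated hazard ratio drifts toward 1 as time increases, although the Wald test gives no evidence that this drift is real.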
Next we run an extended Cox model with heaviside functions for CLINIC defined about the time cutpoint of 365 days. We could use the dataset that we just created, addicts.cp, but since there is now only one cutpoint, we illustrate how to create a dataset in counting process format with only one cutpoint. The new dataset (called addicts.cp365) will have 360 observations compared to 18,708 in the dataset we previously had created called addicts.cp.
addicts.cp365=survSplit(addicts,cut=365,end="survt",
event="status",start="start",id="iid")
The cut=365 option in the survSplit function requests that day 365 be the only cutpoint. Next we create the two time dependent variables (HV1 and HV2). HV1 is defined to equal the value of CLINIC if survival time is less than 365 days and 0 otherwise. HV2 is defined to equal 0 if survival time is less than 365 days and equal the value of CLINIC otherwise:
addicts.cp365$hv1=addicts.cp365$clinic*(addicts.cp365$start<365)
addicts.cp365$hv2=addicts.cp365$clinic*(addicts.cp365$start>=365)
The conditional statements (addicts.cp365$start<365) and (addicts.cp365$start>=365) take the value 1 if true and 0 if false, and are then multiplied by the variable CLINIC to define HV1 and HV2. Next we’ll sort the dataset by the variables ID and START. This is not a necessary step but it is easier to view and understand the data when multiple observations from the same subject are consecutive. The order function sorts the dataset:
addicts.cp365=addicts.cp365[order(addicts.cp365$iid,addicts.cp365$start), ]
Next we print the first 10 observations for selected variables:
addicts.cp365[1:10,c('iid','start','survt','status','clinic','hv1','hv2')]
## iid start survt status clinic hv1 hv2
## 1 1 0 365 0 1 1 0
## 2 1 365 428 1 1 0 1
## 3 2 0 275 1 1 1 0
## 4 3 0 262 1 1 1 0
## 5 4 0 183 1 1 1 0
## 6 5 0 259 1 1 1 0
## 7 6 0 365 0 1 1 0
## 8 6 365 714 1 1 0 1
## 9 7 0 365 0 1 1 0
## 10 7 365 438 1 1 0 1
Because the ID variable (iid) is numeric here, the sorted IDs run 1, 2, 3, and so on; had it been stored as a character variable, the sort would have been “alphabetical” (1, 10, 100, ...) rather than numerical. The first subject (iid=1) had an event at 428 days, so was censored (STATUS=0) during the first time interval (0, 365) but had an event (STATUS=1) during the second interval (365, 428). This subject has the value CLINIC=1, and thus has the time-dependent values HV1=1 and HV2=0 over the first interval and HV1=0 and HV2=1 over the second interval.
Before running an extended Cox model with these heaviside functions we define an object (called Y365) for the response variable using the Surv function. This object is then used in the coxph model formula. It is not necessary to explicitly define this object and we did not do so for the previous extended Cox model that we ran containing LOGTDOSE, but the code is more readable with the notation for the response variable simplified.
Y365=Surv(addicts.cp365$start,addicts.cp365$survt,addicts.cp365$status)
Next we run the model with two heaviside functions:
coxph(Y365 ~ prison + dose + hv1 + hv2 + cluster(iid), data=addicts.cp365)
## Call:
## coxph(formula = Y365 ~ prison + dose + hv1 + hv2, data = addicts.cp365,
## cluster = iid)
##
## coef exp(coef) se(coef) robust se z p
## prison 0.377951 1.459291 0.168415 0.167650 2.254 0.0242
## dose -0.035480 0.965142 0.006435 0.006520 -5.442 5.27e-08
## hv1 -0.459373 0.631679 0.255290 0.259983 -1.767 0.0772
## hv2 -1.830517 0.160331 0.385954 0.398376 -4.595 4.33e-06
##
## Likelihood ratio test=74.25 on 4 df, p=2.868e-15
## n= 360, number of events= 150
The estimated hazard ratio (CLINIC=2 vs. CLINIC=1) is 0.632 for days \(< 365\) and 0.160 for days \(\ge 365\) (found in the second numeric column, under exp(coef)). If we wish to match the SAS, Stata, and SPSS output, we can run the model without robust standard errors and use method="breslow" to handle simultaneous events (ties) in the Cox likelihood.
coxph(Y365 ~ prison + dose + hv1 + hv2, data=addicts.cp365, method="breslow")
## Call:
## coxph(formula = Y365 ~ prison + dose + hv1 + hv2, data = addicts.cp365,
## method = "breslow")
##
## coef exp(coef) se(coef) z p
## prison 0.377704 1.458931 0.168402 2.243 0.0249
## dose -0.035512 0.965112 0.006435 -5.518 3.43e-08
## hv1 -0.459563 0.631560 0.255291 -1.800 0.0718
## hv2 -1.828228 0.160698 0.385946 -4.737 2.17e-06
##
## Likelihood ratio test=74.17 on 4 df, p=2.978e-15
## n= 360, number of events= 150
To run an equivalent model with one heaviside function, we need to include the CLINIC variable in the model in place of HV1:
coxph(Y365 ~ prison + dose + clinic + hv2 + cluster(iid), data=addicts.cp365)
The coefficient estimates are different with this model compared to the model with two heaviside functions but the estimated hazard ratios are the same. The estimated hazard ratio (CLINIC=2 vs. CLINIC=1) is 0.632 for days \(< 365\) (exponentiate the coefficient for CLINIC, exp(-0.4594)). In order to estimate the hazard ratio for days \(\ge 365\), we need to sum the coefficient estimates for CLINIC and HV2 and then exponentiate: exp(-0.4594 + (-1.3711)) = exp(-1.8305) = 0.160. The significant p-value for the estimated coefficient for HV2 (p = 3.6e-03, i.e., p = 0.0036) suggests that the hazard ratios for CLINIC for the two different time periods are not equal. In other words, the significant p-value provides evidence that the proportional hazards assumption is violated for CLINIC.
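The arithmetic behind the two hazard ratios can be checked directly. This is only a sketch using the rounded CLINIC and HV2 coefficients quoted in the text; the variable names are made up for the example.

```r
# Rounded coefficients quoted in the text for the one-heaviside model
b_clinic <- -0.4594   # CLINIC: log hazard ratio before day 365
b_hv2    <- -1.3711   # HV2: additional log hazard ratio from day 365 onward

hr_early <- exp(b_clinic)           # hazard ratio for days < 365
hr_late  <- exp(b_clinic + b_hv2)   # hazard ratio for days >= 365

round(c(hr_early = hr_early, hr_late = hr_late), 3)
```

The sum of the two coefficients reproduces the HV2 coefficient of the two-heaviside model (about -1.8305), which is why the two parameterizations give the same hazard ratios.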
#coxph(Surv(time, status==2)~age, data=lung)
#coxph(Surv(time, status==2)~ tt(age), data=lung,
# tt=function(x,t, ...)
# {age<-x + t/365.25})
#install.packages("asaur")
library(asaur)
head(pancreatic)
## stage onstudy progression death
## 1 M 12/16/2005 2/2/2006 10/19/2006
## 2 M 1/6/2006 2/26/2006 4/19/2006
## 3 LA 2/3/2006 8/2/2006 1/19/2007
## 4 M 3/30/2006 . 5/11/2006
## 5 LA 4/27/2006 3/11/2007 5/29/2007
## 6 M 5/7/2006 6/25/2006 10/11/2006
attach(pancreatic)
# convert the text dates into R dates; the years have four digits, so the
# format needs %Y (with %y only "20" is read as the year, which produces
# wrong and partly negative time differences)
pdd <- as.Date(as.character(progression), format="%m/%d/%Y")
odd <- as.Date(as.character(onstudy), format="%m/%d/%Y")
ddd <- as.Date(as.character(death), format="%m/%d/%Y")
pd <- (pdd - odd) # days from study entry to progression (NA if none recorded)
od <- (ddd - odd) # days from study entry to death
pfs <- pd
pfs[is.na(pfs)] <- od[is.na(pfs)] # no recorded progression: use time to death
pfs # the first six values are 48 51 180 42 318 49, as in head(pancreatic2)
plot(survfit(Surv(pfs)~stage), xlab="Time in days", col=c("blue", "red"), lwd=2)
legend("topright", legend=c("Locally advanced PC","Metastatic PC"), col=c("blue","red"), lwd=2)
survdiff(Surv(pfs)~stage, rho=0)
# pancreatic data2
head(pancreatic2)
## pfs os status stage
## 1 48 307 1 M
## 2 51 103 1 M
## 3 180 350 1 LA
## 4 42 42 1 M
## 5 318 397 1 LA
## 6 49 157 1 M
stage.n <- rep(0, nrow(pancreatic2))
stage.n[pancreatic2$stage=="M"]<-1
result.panc <- coxph(Surv(pfs)~stage.n, data=pancreatic2)
result.panc
## Call:
## coxph(formula = Surv(pfs) ~ stage.n, data = pancreatic2)
##
## coef exp(coef) se(coef) z p
## stage.n 0.5931 1.8095 0.4007 1.48 0.139
##
## Likelihood ratio test=2.43 on 1 df, p=0.1188
## n= 41, number of events= 41
result.sch.resid <- cox.zph(result.panc)
plot(result.sch.resid)
result.panc2.tt <- coxph(Surv(pfs)~stage.n + tt(stage.n), data=pancreatic2,
tt=function(x,t, ...) x*log(t))
result.panc2.tt
## Call:
## coxph(formula = Surv(pfs) ~ stage.n + tt(stage.n), data = pancreatic2,
## tt = function(x, t, ...) x * log(t))
##
## coef exp(coef) se(coef) z p
## stage.n 6.0096 407.3394 3.0598 1.964 0.0495
## tt(stage.n) -1.0858 0.3376 0.5889 -1.844 0.0652
##
## Likelihood ratio test=6.33 on 2 df, p=0.04229
## n= 41, number of events= 41
survdiff(Surv(pfs)~stage, rho=0, data=pancreatic2)
## Call:
## survdiff(formula = Surv(pfs) ~ stage, data = pancreatic2, rho = 0)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## stage=LA 8 8 12.3 1.49 2.25
## stage=M 33 33 28.7 0.64 2.25
##
## Chisq= 2.2 on 1 degrees of freedom, p= 0.1
survdiff(Surv(pfs)~stage, rho=1, data=pancreatic2)
## Call:
## survdiff(formula = Surv(pfs) ~ stage, data = pancreatic2, rho = 1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## stage=LA 8 2.34 5.88 2.128 4.71
## stage=M 33 18.76 15.22 0.822 4.71
##
## Chisq= 4.7 on 1 degrees of freedom, p= 0.03
result.sch.resid <- cox.zph(result.panc, transform = function(pfs) log(pfs))
plot(result.sch.resid)
abline(coef(result.panc2.tt), col="red")
result.panc3.tt <- coxph(Surv(pfs)~stage.n + tt(stage.n), data=pancreatic2,
tt=function(x,t, ...) x*t)
result.panc3.tt
## Call:
## coxph(formula = Surv(pfs) ~ stage.n + tt(stage.n), data = pancreatic2,
## tt = function(x, t, ...) x * t)
##
## coef exp(coef) se(coef) z p
## stage.n 1.278099 3.589808 0.661027 1.934 0.0532
## tt(stage.n) -0.003656 0.996350 0.002532 -1.444 0.1487
##
## Likelihood ratio test=4.56 on 2 df, p=0.1025
## n= 41, number of events= 41
Survival times and backwards recurrence times for data from a comparative clinical trial. Patients marked “T” received the experimental treatment, and those marked “C” received the standard therapy.
Re-aligned data with left truncation
Using time-on-study as the time scale can give a different view of the survival data (i.e., a closed cohort) than found when using age as the time scale (i.e., an open cohort). So which time scale should be used and how do we make such a decision in general?
One way to account for the age difference at entry would simply be to control for age at entry (i.e., \(a_0\)) as a covariate in one’s survival analysis by adding the variable \(a_0\) to a Cox PH model. This approach is reasonable provided the model is specified correctly (e.g., the proportional hazards assumption is met for age). Alternatively, considering subjects I and J, who entered at the same time but differ in age by 9 years, we might consider using age as the time scale to represent a subject’s potential for failure, which we now describe.
At this point, we might again ask when, if at all, a model based on h(a, X) would be preferable to a model of the form h(t, X, \(a_0\)), where t denotes time on follow-up and \(a_0\) denotes age at entry.
We now specify several alternative forms that a Cox PH model might take to account for age-truncated survival data.
Time-on-study: these models are time-on-study models that control for age at entry (\(a_0\)), but do so differently.
Age as time scale: these models use age-at-event or censorship rather than time-on-study as the outcome variable. These models differ in the way the baseline hazard function is specified.
In summary, of the seven models we have presented, Models 0 and 4 are inappropriate because Model 0 does not account for age at all and Model 4 ignores age truncation by incorrectly assuming that all study subjects were observed for the outcome from birth. The other five models (i.e., 1–3, 5, 6) all adjust for age at study entry in some way. A logical question at this point is whether in practice, it makes a difference which model is used to analyze age-truncated survival data?
The above question was actually addressed by Pencina et al. (Statist. Med., 2007) by comparing Models 1–6 above in terms of the estimated regression coefficients they produce. These authors consider Model 5, the age-truncated age scale model, to be “possibly the most appropriate refinement” to account for age-truncation. They also view time-on-study Models 1 and 2, which use linear and/or quadratic terms to adjust for entry age as a covariate as “attempts to approximate” Model 5.
Nevertheless, by considering numerical simulations as well as four practical examples from the Framingham Heart Study, Pencina et al. conclude that correct adjustment for the age at entry is crucial in reducing bias of the estimated coefficients. The unadjusted age-scale model (Model 4) is inferior to any of the five other models considered, regardless of their choice of time scale. Moreover, if correct adjustment for age at entry is made, as in Models 1–3, 5, and 6, their analyses suggest that there exists little if any practical or meaningful difference in the estimated regression coefficients depending on the choice of time scale.
To illustrate, we consider results from Pencina et al. corresponding to Models 1–6 applied to 12-year follow-up Framingham Heart Study data. The outcome considered here is coronary heart disease (CHD) in men. These results focus on two risk factors measured at baseline: diabetes mellitus status and education status, the latter categorized into two groups defined by post-high-school education (yes/no). The estimated regression coefficients (separately) relating these two risk factors to CHD outcome are presented in the table.
As expected, the table shows a substantial difference in the coefficient of the risk group variable estimated by the unadjusted age-scale model (Model 4) and the five other models. Moreover, the results for Models 1–3, 5, and 6 are all quite similar.
Pencina et al. also point out that the coefficients for these five models have the signs anticipated conceptually, e.g., diabetes coefficients are positive, whereas education coefficients are negative. The quadratic baseline age term (Model 2) was significant for both CHD risk factors. This suggests potential misspecification in the modeling of the relationship between CHD and age introduced by Model 1, which treats entry age as linear. However, its inclusion in the time-on-study model did not materially influence the magnitude or significance of the estimated exposure variable (diabetes or education status) coefficient.
When using age as the time scale and accounting for age truncation (i.e., using Model 5 above), the data layout requires the counting process (CP) start–stop format with \(a_0\) as the start variable and a as the stop variable. However, since we are not considering recurrent events data here, the CP format for age-truncated survival data has a simpler form, involving only one line of data for each study subject. The computer code needed to program the analysis is described in the Computer Appendix for the Stata, SAS, SPSS, and R packages.
Note that the CP format corresponding to Model 4, which assumes the starting time is birth, would modify the Model 5 layout by letting \(a_0\) = 0 in the \(a_0\) column for all subjects. This layout would be equivalent to the “standard” layout that omits the \(a_0\) column and simply treats the a column data as time-on-study information. Again, since Model 4 appears to be inferior to the other models, we caution the reader not to use this format unless the risk period was observed since birth.
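The two main layouts can be compared on simulated data. The sketch below is only an illustration under assumed values: the data frame dat, the exposure x, and all simulation settings are hypothetical, and the two fits correspond to the time-on-study model with entry age as a linear covariate and to the age-scale model with left truncation at entry age.

```r
library(survival)

set.seed(1)                                   # simulated data, purely illustrative
n  <- 200
a0 <- runif(n, 40, 70)                        # age at study entry
x  <- rbinom(n, 1, 0.5)                       # binary exposure
t_event <- rexp(n, rate = 0.05 * exp(0.5 * x))  # time from entry to event
t_cens  <- runif(n, 5, 15)                    # administrative censoring time
tos <- pmin(t_event, t_cens)                  # observed time on study
d   <- as.integer(t_event <= t_cens)          # 1 = event, 0 = censored
dat <- data.frame(a0 = a0, a = a0 + tos, tos = tos, d = d, x = x)

# Time-on-study scale, entry age as a covariate
fit_tos <- coxph(Surv(tos, d) ~ x + a0, data = dat)

# Age as the time scale, left-truncated at entry age:
# one (a0, a] interval per subject
fit_age <- coxph(Surv(a0, a, d) ~ x, data = dat)

c(time_on_study = unname(coef(fit_tos)["x"]),
  age_scale     = unname(coef(fit_age)["x"]))
```

In this simulation the hazard does not actually depend on age, so the two exposure coefficients are expected to be similar; with real cohort data the comparison is the empirical question studied by Pencina et al.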
tt <- c(6,7,10,15,19,25)
status <- c(1,0,1,1,0,1)
grp <- c(0,0,1,0,1,1)
backtime <- c(-3,-11,-3,-7,-10,-5)
Survival times and backwards recurrence times for data from a comparative clinical trial. Patients marked “T” received the experimental treatment, and those marked “C” received the standard therapy.
coxph(Surv(tt,status)~grp)
## Call:
## coxph(formula = Surv(tt, status) ~ grp)
##
## coef exp(coef) se(coef) z p
## grp -1.3261 0.2655 1.2509 -1.06 0.289
##
## Likelihood ratio test=1.21 on 1 df, p=0.2715
## n= 6, number of events= 4
tm.enter <- -backtime
tm.exit <- tt - backtime
Re-aligned data with left truncation
coxph(Surv(tm.enter, tm.exit, status)~grp)
## Call:
## coxph(formula = Surv(tm.enter, tm.exit, status) ~ grp)
##
## coef exp(coef) se(coef) z p
## grp -1.073 0.342 1.235 -0.869 0.385
##
## Likelihood ratio test=0.81 on 1 df, p=0.3677
## n= 6, number of events= 4
head(ChanningHouse)
## sex entry exit time cens
## 1 Male 782 909 127 1
## 2 Male 1020 1128 108 1
## 3 Male 856 969 113 1
## 4 Male 915 957 42 1
## 5 Male 863 983 120 1
## 6 Male 906 1012 106 1
ChanningHouse <- within(ChanningHouse,
{
entryYears <- entry/12
exitYears <- exit/12
})
head(ChanningHouse)
## sex entry exit time cens exitYears entryYears
## 1 Male 782 909 127 1 75.75000 65.16667
## 2 Male 1020 1128 108 1 94.00000 85.00000
## 3 Male 856 969 113 1 80.75000 71.33333
## 4 Male 915 957 42 1 79.75000 76.25000
## 5 Male 863 983 120 1 81.91667 71.91667
## 6 Male 906 1012 106 1 84.33333 75.50000
ChanningMales <- ChanningHouse[ChanningHouse$sex == "Male",]
head(ChanningMales)
## sex entry exit time cens exitYears entryYears
## 1 Male 782 909 127 1 75.75000 65.16667
## 2 Male 1020 1128 108 1 94.00000 85.00000
## 3 Male 856 969 113 1 80.75000 71.33333
## 4 Male 915 957 42 1 79.75000 76.25000
## 5 Male 863 983 120 1 81.91667 71.91667
## 6 Male 906 1012 106 1 84.33333 75.50000
result.km <- survfit(Surv(entryYears, exitYears, cens)~1, data=ChanningMales)
result.km
## Call: survfit(formula = Surv(entryYears, exitYears, cens) ~ 1, data = ChanningMales)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## [1,] 96 39 2 46 64.9 64.8 NA
plot(result.km, xlim=c(64,101), xlab="Age",
ylim=c(0.0, 1.0), ylab="Survival probability", conf.int = F)
result.naa <- survfit(Surv(entryYears, exitYears, cens)~1,
data=ChanningMales, type="fleming-harrington")
result.naa
## Call: survfit(formula = Surv(entryYears, exitYears, cens) ~ 1, data = ChanningMales,
## type = "fleming-harrington")
##
## records n.max n.start events median 0.95LCL 0.95UCL
## [1,] 96 39 2 46 65.1 64.8 90
lines(result.naa, col="blue", conf.int=F)
result.km.68 <- survfit(Surv(entryYears, exitYears, cens)~1,
start.time=68, data=ChanningMales)
result.km.68
## Call: survfit(formula = Surv(entryYears, exitYears, cens) ~ 1, data = ChanningMales,
## start.time = 68)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## [1,] 94 39 12 44 84.1 80.5 86.9
lines(result.km.68, col="green", conf.int=F)
legend("topright", legend=c("KM", "NAA", "KM 68 and older"),
lty=1, col=c("black","blue", "green"))
The black curve is the KM estimate. It plunges to zero at age 65 because, at this early age, the size of the risk set is small, and in fact reduces to 0. This forces the survival curve to zero. And, since the KM curve is a cumulative product, once it reaches zero it can never move away from zero.
The NAA estimate, shown in blue, is based on exponentiating a cumulative sum, so it doesn’t share this problem of going to zero early on. Still, it does take an early plunge, also due to the small size of the risk set at the younger ages. The problem here is that there is too little data to accurately estimate the overall survival distribution of men.
Instead, we can condition on men reaching the age of 68, using the “start.time” option, and estimate the survival among that cohort (the green line). The survival curve is much better behaved. So the practical solution to the problem of a small risk set with left-truncated data is to select a realistic target (here, survival of men conditional on living to age 68) for which there is sufficient data to obtain a valid estimate.
Cox model for the left-truncated data (entry age as the start time), first for all subjects and then for those who exited at age 68 or older
coxph(Surv(entryYears, exitYears, cens)~ sex,
data=ChanningHouse)
## Call:
## coxph(formula = Surv(entryYears, exitYears, cens) ~ sex, data = ChanningHouse)
##
## coef exp(coef) se(coef) z p
## sexMale 0.3219 1.3798 0.1733 1.857 0.0633
##
## Likelihood ratio test=3.28 on 1 df, p=0.07021
## n= 457, number of events= 175
channing68 <- ChanningHouse[ChanningHouse$exitYears >= 68,]
coxph(Surv(entryYears, exitYears, cens)~ sex,
data=channing68)
## Call:
## coxph(formula = Surv(entryYears, exitYears, cens) ~ sex, data = channing68)
##
## coef exp(coef) se(coef) z p
## sexMale 0.2733 1.3143 0.1762 1.552 0.121
##
## Likelihood ratio test=2.3 on 1 df, p=0.1292
## n= 451, number of events= 172
library(eha)
phreg(Surv(entryYears, exitYears, cens) ~ sex,
data=ChanningHouse, dist="weibull", param="survreg")
## Call:
## phreg(formula = Surv(entryYears, exitYears, cens) ~ sex, data = ChanningHouse,
## dist = "weibull", param = "survreg")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## sex
## Female 0.807 0 1 (reference)
## Male 0.193 0.355 1.427 0.172 0.039
##
## log(scale) 4.476 0.011 0.000
## log(shape) 2.185 0.111 0.000
##
## Events 175
## Total time at risk 3088.3
## Max. log. likelihood -642.63
## LR test statistic 4.04
## Degrees of freedom 1
## Overall p-value 0.0445388
phreg(Surv(entryYears, exitYears, cens) ~ sex,
data=channing68, dist="weibull", param="survreg")
## Call:
## phreg(formula = Surv(entryYears, exitYears, cens) ~ sex, data = channing68,
## dist = "weibull", param = "survreg")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## sex
## Female 0.808 0 1 (reference)
## Male 0.192 0.318 1.375 0.175 0.068
##
## log(scale) 4.480 0.010 0.000
## log(shape) 2.256 0.104 0.000
##
## Events 172
## Total time at risk 3074.3
## Max. log. likelihood -629.16
## LR test statistic 3.15
## Degrees of freedom 1
## Overall p-value 0.0761057
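Assume that we have \(n\) independent individuals \((i=1,2,\dots,n)\).
For each individual \(i\), the data consist of three parts: \(t_i\), \(\delta_i\), and \(x_i\), where \(t_i\) is the time of the event or the time of censoring, \(\delta_i\) is an indicator that equals 1 if \(t_i\) is uncensored and 0 if it is censored, and \(x_i\) is a vector of covariate values.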
Suppose that all the observations are uncensored. Because we are assuming independence, it follows that the probability of the entire data is found by taking the product of the probabilities of the data for every individual. Because \(t_i\) is assumed to be measured on a continuum, the probability that it will take on any specific value is 0.
Instead, we represent the probability of each observation by the probability density function (p.d.f.), \(f_i(t_i)\). Thus, the probability (or likelihood) of the data is given by the following expression, where \(\prod\) indicates repeated multiplication: \[L=\prod_{i=1}^{n} f_i(t_i)\]
To proceed further, we need to substitute an expression for \(f_i(t_i)\) that involves covariates and the unknown parameters.
Once we choose a particular model, we can substitute appropriate expressions for the p.d.f. and the survivor function.
Although this expression can be maximized directly, it is generally easier to work with the natural logarithm of the likelihood function because products get converted into sums and exponents become coefficients. Because the logarithm is an increasing function, whatever maximizes the logarithm also maximizes the original function.
Taking the logarithm of the likelihood, we get \[\log L= \sum_{i=1}^{n} \delta_i \log \lambda_i - \sum_{i=1}^{n} \lambda_i t_i = -\beta \sum_{i=1}^{n} \delta_i x_i - \sum_{i=1}^{n} t_i e^{-\beta x_i}\]
Now we are ready for step 2, finding values of \(\beta\) that make this expression as large as possible. There are many different methods for maximizing functions like this. One well-known approach is to find the derivative of the function with respect to \(\beta\), set the derivative equal to 0, and then solve for \(\beta\).
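As a rough illustration of this step, the sketch below maximizes the exponential log-likelihood shown above, with \(\lambda_i = e^{-\beta x_i}\), on simulated data. All data values and the names loglik and score are assumptions made for the example; the score function is the derivative of the log-likelihood with respect to \(\beta\), which should be close to 0 at the maximum.

```r
set.seed(42)                               # simulated data, purely illustrative
n        <- 100
x        <- runif(n)                       # a single covariate
t_event  <- rexp(n, rate = exp(-0.5 * x))  # lambda_i = exp(-beta * x_i), beta = 0.5
t_cens   <- rexp(n, rate = 0.3)            # independent censoring times
obs_time <- pmin(t_event, t_cens)          # observed time t_i
delta    <- as.integer(t_event <= t_cens)  # 1 = event, 0 = censored

# log L(beta) = -beta * sum(delta_i * x_i) - sum(t_i * exp(-beta * x_i))
loglik <- function(beta) -beta * sum(delta * x) - sum(obs_time * exp(-beta * x))

# Derivative of log L with respect to beta
score <- function(beta) -sum(delta * x) + sum(obs_time * x * exp(-beta * x))

fit      <- optimize(loglik, interval = c(-10, 10), maximum = TRUE, tol = 1e-10)
beta_hat <- fit$maximum
c(beta_hat = beta_hat, score_at_max = score(beta_hat))
```

A one-dimensional search is enough here because there is a single coefficient; with several covariates a Newton-type method would normally be used instead.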
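\[h_i(t)=h_0(t)\exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} )\]
The likelihood function for the proportional hazards model of this equation can be factored into two parts: one part that depends on both \(h_0(t)\) and \(\beta\), and a second part that depends on \(\beta\) alone.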
What partial likelihood does, in effect, is discard the first part and treat the second part - the partial likelihood function - as though it were an ordinary likelihood function. You get estimates by finding values of \(\beta\) that maximize the partial likelihood.
Now let’s take a closer look at how the partial likelihood works.
Using the same notation as for maximum likelihood estimation, we can write the partial likelihood as a product of the likelihoods for all the events that are observed. Thus if \(J\) is the number of events, \[PL=\prod_{j=1}^{J}L_j\]
Next we need to see how the individual \(L_j\)s are constructed. This is best explained by way of an example.
Until now, we made no assumptions about the form of the hazard function. Now, we invoke the proportional hazards model and substitute the expression for the hazard into the expression for \(L_1\), \[L_1=\frac{h_0(5)\exp[\beta x_1]}{h_0(5)\exp[\beta x_1]+h_0(5)\exp[\beta x_2]+\dots +h_0(5)\exp[\beta x_{45}]}\] where \(x_i\) is the value of \(x\) for the \(i^{th}\) patient.
We can also see that the partial likelihood depends only on the order of the event times, not on their exact values.
Therefore, a general expression for the partial likelihood for data with time-invariant covariates from a proportional hazards model is \[PL=\prod_{i=1}^{n}\left[\frac{\exp(\beta x_i)}{\sum_{j=1}^{n} Y_{ij}\exp(\beta x_j)}\right]^{\delta_i}\] where \(Y_{ij}=1\) if \(t_j \ge t_i\) (individual \(j\) is still at risk at time \(t_i\)) and \(Y_{ij}=0\) otherwise.
Once the partial likelihood is constructed, we can maximize it with respect to \(\beta\) just like an ordinary likelihood. As before, it is convenient to work with its logarithm, which is \[\log PL=\sum_{i=1}^{n} \delta_i \left[\beta x_i - \log \sum_{j=1}^{n} Y_{ij}\exp(\beta x_j)\right]\]
Most partial likelihood programs use some version of the Newton-Raphson algorithm to maximize this function with respect to \(\beta\).
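The log partial likelihood can be coded almost literally. The following sketch uses a small made-up dataset with no tied event times (all values and the function name log_pl are assumptions), maximizes the function by a simple one-dimensional search rather than Newton-Raphson, and compares the result with coxph. It also refits after replacing the times by their ranks, to illustrate that only the ordering of the times matters.

```r
library(survival)

# Toy data (hypothetical values, no tied event times)
time   <- c(2, 4, 5, 7, 9, 11, 13, 16)
status <- c(1, 1, 0, 1, 1, 0, 1, 1)        # 1 = event, 0 = censored
x      <- c(1.2, 0.3, 0.8, 1.5, 0.1, 0.9, 0.4, 0.7)

# log PL(beta) = sum_i delta_i * [beta*x_i - log(sum_j Y_ij * exp(beta*x_j))],
# where Y_ij = 1 if subject j is still at risk at time t_i (t_j >= t_i)
log_pl <- function(beta, time, status, x) {
  Y <- outer(time, time, function(ti, tj) as.numeric(tj >= ti))  # Y[i, j]
  risk_sum <- as.vector(Y %*% exp(beta * x))
  sum(status * (beta * x - log(risk_sum)))
}

beta_hat <- optimize(log_pl, interval = c(-10, 10), maximum = TRUE,
                     time = time, status = status, x = x, tol = 1e-10)$maximum
beta_cox <- unname(coef(coxph(Surv(time, status) ~ x)))

# Same search after replacing the times by their ranks
beta_rank <- optimize(log_pl, interval = c(-10, 10), maximum = TRUE,
                      time = rank(time), status = status, x = x,
                      tol = 1e-10)$maximum

c(direct = beta_hat, coxph = beta_cox, ranked_times = beta_rank)
```

With tied event times the three approaches to ties discussed later would give slightly different answers, so the agreement with coxph here relies on the times being distinct.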
To account for time-varying covariates, we need to modify the partial likelihood function to accommodate these types of variables. Essentially, at each failure time, there are a certain number of patients at risk, and one fails. However, the contributions of each subject can change from one failure time to the next. The hazard function is given by \(h(t)=h_0 (t)e^{z_k(t_i)\beta}\), where the covariate \(z_k(t_i)\) is the value of the time-varying covariate for the \(k^{th}\) subject at time \(t_i\).
Sample of six patients from the Stanford heart transplant dataset
Start-stop counting process
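For readers who want to inspect such a layout, the survival package ships a start-stop version of the Stanford heart transplant data called heart. The sketch below simply prints the first few rows (not necessarily the six patients referred to above) and fits an extended Cox model; the choice of covariates is an assumption made for illustration.

```r
library(survival)

# 'heart' has one row per (start, stop] interval; 'transplant' switches on
# from the interval in which the patient has received a new heart
head(heart[, c("id", "start", "stop", "event", "transplant")], 8)

# Extended Cox model on the counting process layout
fit_heart <- coxph(Surv(start, stop, event) ~ transplant + age + surgery,
                   data = heart)
fit_heart
```

Patients who received a transplant contribute two rows, one before and one after the transplant, so the time-varying covariate takes the correct value in every risk set.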
The dataset considered here is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months. The data are available in binary format from the Stata website and consist of 1445 observations on 18 variables.
library(survival)
library(dplyr)
library(foreign)
recid <- read.dta("https://www.stata.com/data/jwooldridge/eacsap/recid.dta")
## Warning in read.dta("https://www.stata.com/data/jwooldridge/eacsap/recid.dta"):
## cannot read factor labels from Stata 5 files
head(recid)
## black alcohol drugs super married felon workprg property person priors educ
## 1 0 1 0 1 1 0 1 0 0 0 7
## 2 1 0 0 1 0 1 1 1 0 0 12
## 3 0 0 0 0 0 0 1 1 0 0 9
## 4 0 0 1 1 0 1 1 1 0 2 9
## 5 0 0 1 1 0 0 0 0 0 0 9
## 6 1 0 0 1 0 0 1 0 0 1 12
## rules age tserved follow durat cens ldurat
## 1 2 441 30 72 72 1 4.276666
## 2 0 307 19 75 75 1 4.317488
## 3 5 262 27 81 9 0 2.197225
## 4 3 253 38 76 25 0 3.218876
## 5 0 244 4 81 81 1 4.394449
## 6 0 277 13 79 79 1 4.369448
recid$fail <- 1 - recid$cens
recidx <- survSplit(recid, cut = seq(12, 60, 12),
start = "t0", end = "durat",
event = "fail",
episode = "interval")
labels <- paste("(",seq(0,60,12),",",c(seq(12,60,12),81), "]",sep="")
recidx <- mutate(recidx, exposure = durat - t0,
interval = factor(interval + 1, labels = labels))
mf <- Surv(durat, fail) ~ workprg + priors + tserved + felon +
alcohol + drugs + black + married + educ + age
Let’s begin with the exact method because its underlying model is probably more plausible for most applications. Since events can occur at any point in time, it’s reasonable to suppose that ties are merely the result of imprecise measurement of time and that there is a true but unknown time ordering for the tied events.
If we knew that ordering, we could construct the partial likelihood in the usual way. In the absence of any knowledge of that ordering, however, we have to consider all the possibilities.
For example, with five tied events, there are \(5!=120\) different possible orderings.
# Fit on the original one-row-per-subject data (recid): the formula mf has no
# start time, so it should not be applied to the split dataset recidx
cox_efron <- coxph(mf, data = recid, ties="efron")
cox_breslow <- coxph(mf, data = recid, ties="breslow")
cox_exact <- coxph(mf, data = recid, ties="exact")
# Coefficients from the three tie-handling methods, side by side
data.frame(exactp = coef(cox_exact),
efron = coef(cox_efron),
breslow = coef(cox_breslow))
The discrete method is also an exact method but one based on a fundamentally different model.
In fact, this is NOT a proportional hazard model at all. The model does fall within the framework of Cox regression, however, because it was proposed by Cox in his original 1972 paper and because the estimation method is a form of partial likelihood.
Unlike the exact model, which assumes that ties are merely the result of imprecise measurement of time, the discrete model assumes that time is really discrete.
When two or more events appear to happen at the same time, there is no underlying ordering - they really happened at the same time.
Cox’s model for discrete-time data can be described as follows. The time variable \(t\) can only take on integer values. Let \(P_{it}\) be the conditional probability that individual \(i\) has an event at time \(t\), given that an event has not already occurred to that individual.
This probability is sometimes called the discrete-time hazard. The model says that \(P_{it}\) is related to the covariates by a logistic regression equation: \[\log\left[\frac{P_{it}}{1-P_{it}}\right]=\alpha_t + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}\]
The expression on the left side of the equation is the logit or log-odds of \(P_{it}\). On the right side, we have a linear function of the covariates, plus a term \(\alpha_t\) that represents a set of constants that can vary arbitrarily from one time point to another.
This model can be described as a proportional odds model. The odds that individual \(i\) has an event at time \(t\) (given that \(i\) did not already have an event) is \(O_{it}=\frac{P_{it}}{1-P_{it}}\).
The model implies that the ratio of the odds for any two individuals \(\frac{O_{it}}{O_{jt}}\) does not depend on time (although it may vary with covariates).
Estimation with partial likelihood: \(PL=\prod_{j=1}^{J} L_j\), where \(L_j\) is the partial likelihood of the \(j^{th}\) event.
Estimation with maximum likelihood method
Let’s have another look at the recidivism data. We will split duration into single years with an open-ended category at 5+ and fit a piecewise exponential model with the same covariates as Wooldridge.
We will then treat the data as discrete, assuming that all we know is that recidivism occurred somewhere in the year. We will fit two binary data models: one with a logit link, which corresponds to the discrete time model, and one with a complementary log-log link, which corresponds to a grouped continuous time model.
\[\log \mu = \beta_0 + \beta_1 x_1\]
The mean satisfies the exponential relationship \[\mu = \exp(\beta_0 + \beta_1 x_1) = \exp(\beta_0)\exp(\beta_1 x_1)\]
A one-unit increase in \(x\) has a multiplicative impact of \(\exp(\beta_1)\) on \(\mu\): the mean of \(Y\) at \(x+1\) equals the mean of \(Y\) at \(x\) multiplied by \(\exp(\beta_1)\). If \(\beta_1=0\), then \(\exp(\beta_1)=\exp(0)=1\) and the multiplicative factor is 1, so the mean of \(Y\) does not change as \(x\) changes. If \(\beta_1 >0\), then \(\exp(\beta_1)>1\), and the mean of \(Y\) increases as \(x\) increases. If \(\beta_1 <0\), then \(\exp(\beta_1)<1\), and the mean of \(Y\) decreases as \(x\) increases.
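A short numerical check of the multiplicative interpretation, using hypothetical coefficient values (`b0` and `b1` are not estimates from the recidivism data):

```r
# Hypothetical coefficients of a log-linear (Poisson) model
b0 <- 0.5
b1 <- 0.3

# Mean of Y as a function of x under log(mu) = b0 + b1 * x
mu <- function(x) exp(b0 + b1 * x)

# Ratio of means for a one-unit increase in x, at two different starting points
ratio_2_to_3 <- mu(3) / mu(2)
ratio_7_to_8 <- mu(8) / mu(7)

c(ratio_2_to_3, ratio_7_to_8, exp(b1))   # all three values coincide
```

The ratio does not depend on the starting value of \(x\), which is what makes \(\exp(\beta_1)\) interpretable as a rate ratio.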
Overdispersion
Negative binomial regression
mmf <- fail ~ interval + workprg + priors + tserved +
felon + alcohol + drugs + black + married + educ + age
pwe <- glm(mmf, offset = log(exposure), data = recidx, family = poisson)
coef(summary(pwe))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.830127469 0.280267334 -13.6659789 1.621090e-42
## interval(12,24] 0.036531989 0.109361775 0.3340471 7.383440e-01
## interval(24,36] -0.373815644 0.129611909 -2.8841150 3.925154e-03
## interval(36,48] -0.811543632 0.156401452 -5.1888497 2.115971e-07
## interval(48,60] -0.938231113 0.168321156 -5.5740534 2.488794e-08
## interval(60,81] -1.547177936 0.203348918 -7.6084886 2.773196e-14
## workprg 0.083829106 0.090794162 0.9232874 3.558575e-01
## priors 0.087245826 0.013473463 6.4753825 9.457203e-11
## tserved 0.013008862 0.001685901 7.7162667 1.197865e-14
## felon -0.283925203 0.106148770 -2.6747856 7.477705e-03
## alcohol 0.432442493 0.105721133 4.0904073 4.306163e-05
## drugs 0.274714115 0.097863462 2.8071162 4.998720e-03
## black 0.433555955 0.088362277 4.9065729 9.268154e-07
## married -0.154047742 0.109211869 -1.4105403 1.583802e-01
## educ -0.021416177 0.019444026 -1.1014271 2.707108e-01
## age -0.003580003 0.000522249 -6.8549738 7.132557e-12
recidx <- filter(recidx, interval != "(60,81]")
logit <- glm(mmf, data = recidx, family = binomial) # no offset
coef(summary(logit))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.140802599 0.3084159337 -3.6989094 2.165279e-04
## interval(12,24] 0.030528163 0.1193582701 0.2557692 7.981291e-01
## interval(24,36] -0.413140262 0.1384064532 -2.9849783 2.835984e-03
## interval(36,48] -0.864148699 0.1639957690 -5.2693353 1.369186e-07
## interval(48,60] -0.993662524 0.1756321916 -5.6576332 1.534747e-08
## workprg 0.110988653 0.1003087410 1.1064704 2.685230e-01
## priors 0.099292063 0.0164653717 6.0303566 1.635983e-09
## tserved 0.014922136 0.0021429307 6.9634244 3.320994e-12
## felon -0.319662098 0.1178116529 -2.7133318 6.661038e-03
## alcohol 0.472499810 0.1184176515 3.9901130 6.604183e-05
## drugs 0.316729032 0.1086092071 2.9162264 3.542934e-03
## black 0.458027506 0.0973977193 4.7026512 2.568049e-06
## married -0.204807338 0.1204592720 -1.7002206 8.908944e-02
## educ -0.026725931 0.0215052145 -1.2427651 2.139544e-01
## age -0.004023087 0.0005840427 -6.8883431 5.644594e-12
Finally we use a complementary log-log link
cloglog <- glm(mmf, data = recidx, family = binomial(link = cloglog))
coef(summary(cloglog))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.238795113 0.2893607427 -4.2811444 1.859347e-05
## interval(12,24] 0.021613951 0.1095604758 0.1972787 8.436094e-01
## interval(24,36] -0.392613793 0.1297681301 -3.0255024 2.482204e-03
## interval(36,48] -0.824996440 0.1566132100 -5.2677321 1.381194e-07
## interval(48,60] -0.948338328 0.1684247392 -5.6306356 1.795467e-08
## workprg 0.104466422 0.0934228228 1.1182109 2.634769e-01
## priors 0.088706984 0.0145113085 6.1129556 9.780261e-10
## tserved 0.013266906 0.0018142064 7.3127875 2.616567e-13
## felon -0.288542238 0.1096491770 -2.6315039 8.500789e-03
## alcohol 0.439780479 0.1090998881 4.0309893 5.554258e-05
## drugs 0.299102966 0.1003895869 2.9794222 2.887925e-03
## black 0.427210098 0.0910947168 4.6897352 2.735589e-06
## married -0.183040394 0.1136073451 -1.6111669 1.071433e-01
## educ -0.023334468 0.0202349457 -1.1531767 2.488379e-01
## age -0.003851008 0.0005486362 -7.0192372 2.230827e-12
cbind(coef(pwe)[-6], coef(cloglog), coef(logit))
## [,1] [,2] [,3]
## (Intercept) -3.830127469 -1.238795113 -1.140802599
## interval(12,24] 0.036531989 0.021613951 0.030528163
## interval(24,36] -0.373815644 -0.392613793 -0.413140262
## interval(36,48] -0.811543632 -0.824996440 -0.864148699
## interval(48,60] -0.938231113 -0.948338328 -0.993662524
## workprg 0.083829106 0.104466422 0.110988653
## priors 0.087245826 0.088706984 0.099292063
## tserved 0.013008862 0.013266906 0.014922136
## felon -0.283925203 -0.288542238 -0.319662098
## alcohol 0.432442493 0.439780479 0.472499810
## drugs 0.274714115 0.299102966 0.316729032
## black 0.433555955 0.427210098 0.458027506
## married -0.154047742 -0.183040394 -0.204807338
## educ -0.021416177 -0.023334468 -0.026725931
## age -0.003580003 -0.003851008 -0.004023087
Apart from the intercept, the estimated coefficients are close across the three models, and all three approaches lead to similar predicted survival probabilities.
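To see how the three sets of estimates translate into survival probabilities, the sketch below plugs the printed intercepts and interval effects into each model's link function for a reference individual with all covariates equal to zero. Two assumptions are made here that the output does not confirm: that `exposure` is measured in months, and that a survivor contributes a full 12 months to each interval. The all-zero covariate pattern is not a realistic person and serves only as a common baseline.

```r
# Intercept and interval effects copied from the coefficient table above
b_pwe     <- c(-3.830127469, 0.036531989, -0.373815644, -0.811543632, -0.938231113)
b_cloglog <- c(-1.238795113, 0.021613951, -0.392613793, -0.824996440, -0.948338328)
b_logit   <- c(-1.140802599, 0.030528163, -0.413140262, -0.864148699, -0.993662524)

# Linear predictor in each of the five 12-month intervals (covariates all 0)
eta <- function(b) b[1] + c(0, b[-1])

# Conditional probability of failing within each interval
p_pwe     <- 1 - exp(-12 * exp(eta(b_pwe)))   # assumes 12 months of exposure
p_cloglog <- 1 - exp(-exp(eta(b_cloglog)))    # inverse complementary log-log
p_logit   <- plogis(eta(b_logit))             # inverse logit

# Survival to the end of each interval is the running product of (1 - p)
surv <- cbind(pwe     = cumprod(1 - p_pwe),
              cloglog = cumprod(1 - p_cloglog),
              logit   = cumprod(1 - p_logit))
rownames(surv) <- c("12", "24", "36", "48", "60")
round(surv, 3)
```

The three columns can then be compared row by row to judge how much the choice of model matters for the predicted survival curve.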
Discrete data are often the result of interval-censoring. Events might happen in a continuous range of time, but they can only be observed at discrete moments (e.g., longitudinal data by waves).
Under a particular data structure, the log-likelihood for a conditional logistic regression model is the same as the log-likelihood from a Cox model. A stratified Cox model with each case/control group assigned to its own stratum, time set to a constant, status of 1=case 0=control, and using the exact partial likelihood has the same likelihood formula as a conditional logistic regression. The clogit routine creates the necessary dummy variable of times (all 1) and the strata, then calls coxph.
cox1 <- coxph(Surv(durat, fail) ~ workprg + married + educ + age + strata(black),
data = recidx, ties="efron")
coef(summary(cox1))
## coef exp(coef) se(coef) z Pr(>|z|)
## workprg 0.156200755 1.1690609 0.0897549296 1.740303 8.180586e-02
## married -0.323341332 0.7237268 0.1133119425 -2.853550 4.323368e-03
## educ -0.049860264 0.9513624 0.0190833630 -2.612761 8.981412e-03
## age -0.001898137 0.9981037 0.0004532662 -4.187687 2.818123e-05
clogit <- clogit(fail ~ workprg + married + educ + age + strata(black),
data = recidx, method=c("efron"))
coef(summary(clogit))
## coef exp(coef) se(coef) z Pr(>|z|)
## workprg 0.151890146 1.1640324 0.0897545659 1.692283 9.059199e-02
## married -0.310120691 0.7333584 0.1132727550 -2.737822 6.184746e-03
## educ -0.047916781 0.9532131 0.0191298600 -2.504816 1.225151e-02
## age -0.001812212 0.9981894 0.0004512763 -4.015749 5.925726e-05
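The equivalence described above can be verified on a matched case-control data set. The sketch below uses `infert`, which ships with base R, rather than the recidivism data; the column `dummy_time` is a hypothetical name for the constant time variable.

```r
library(survival)

# Conditional logistic regression on matched sets (clogit defaults to the
# exact method)
fit_clogit <- clogit(case ~ spontaneous + induced + strata(stratum),
                     data = infert)

# The same model written as a stratified Cox model: constant time,
# case status as the event indicator, exact handling of ties
infert2 <- transform(infert, dummy_time = 1)
fit_cox <- coxph(Surv(dummy_time, case) ~ spontaneous + induced + strata(stratum),
                 data = infert2, ties = "exact")

cbind(clogit = coef(fit_clogit), coxph = coef(fit_cox))
```

The agreement relies on the exact partial likelihood. With an approximation such as Efron's, the two fits need not match, which may explain the small differences between the two tables above.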
Up to this point, we have assumed that the event of interest can occur only once for a given subject. However, in many research scenarios in which the event of interest is not death, a subject may experience an event several times over follow-up. Examples of recurrent event data include:
An objective for such data is to assess the relationship of relevant predictors to the rate at which events occur, allowing for multiple events per subject.
\[h(t,x)=h_0(t)\exp[\sum \beta_i x_i]\]
The Cox PH model requires
In nonrecurrent event data,
Whereas, in recurrent event data,
The “strata” variable for each approach treats the time interval number as a categorical variable. For example, if the maximum number of failures that occur on any given subject in the dataset is, say, 4, then time interval #1 is assigned to stratum 1, time interval #2 to stratum 2, and so on.
Both Stratified CP and Gap Time approaches focus on survival time between two events. Stratified CP uses the actual times of the two events from study entry, whereas Gap Time starts survival time at 0 for the earlier event and stops at the later event.
The Marginal approach, in contrast to each conditional approach, focuses on total survival time from study entry until the occurrence of a specific (e.g., kth) event; this approach is suggested when recurrent events are viewed to be of different types.
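The three data layouts can be compared side by side for a single subject. This is a sketch for a hypothetical subject who has events at 12 and 16 months and is censored at 18 months, in a study where at most four events are observed per subject; the data frame names are illustrative.

```r
# Hypothetical subject: events at 12 and 16 months, censored at 18 months
event_times  <- c(12, 16)
end_followup <- 18

# Stratified CP: actual (start, stop) times measured from study entry
cp <- data.frame(stratum = 1:3,
                 start   = c(0, event_times),
                 stop    = c(event_times, end_followup),
                 event   = c(1, 1, 0))

# Gap Time: same rows, but the clock restarts at 0 after each event
gap <- transform(cp, stop = stop - start, start = 0)

# Marginal: the subject is at risk for every one of the 4 possible events
# from study entry; unobserved events are censored at the end of follow-up
marginal <- data.frame(stratum = 1:4,
                       start   = 0,
                       stop    = c(event_times, end_followup, end_followup),
                       event   = c(1, 1, 0, 0))

cp; gap; marginal
```

Note that the marginal layout has one row per possible event, so this subject contributes four rows rather than three.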
The modeling of recurrent events is illustrated with the bladder cancer dataset. Recurrent events are represented in the data with multiple observations for subjects having multiple events. The data layout for the bladder cancer dataset is in the counting process (start, stop) format with time intervals defined for each observation. Below, the data are read from a CSV file (bladder.csv) with the read.csv function; if the data are instead stored as a saved R dataframe (bladder.rda), the load function can be used.
##### data format
bladder <- read.csv(file="bladder.csv", header = TRUE)
bladder[12:20,]
## ID EVENT INTERVAL INTTIME START STOP TX NUM SIZE
## 12 10 1 1 12 0 12 0 1 1
## 13 10 1 2 4 12 16 0 1 1
## 14 10 0 3 2 16 18 0 1 1
## 15 11 0 1 23 0 23 0 3 3
## 16 12 1 1 10 0 10 0 1 3
## 17 12 1 2 5 10 15 0 1 3
## 18 12 0 3 8 15 23 0 1 3
## 19 13 1 1 3 0 3 0 1 1
## 20 13 1 2 13 3 16 0 1 1
There are three observations for ID=10, one observation for ID=11, three observations for ID=12, and two observations for ID=13. The variables START and STOP represent the time interval for the risk period specific to that observation. The variable EVENT indicates whether an event (coded 1) occurred. The first three observations indicate that the subject with ID=10 had an event at 12 months, another event at 16 months, and was censored at 18 months.
Recall we analyzed data in the counting process format when we ran extended Cox models. We saw how a subject’s covariate can change values from time-interval to time-interval. With the bladder dataset, the (start,stop) data format provides a way to indicate that a subject experienced multiple events.
The coxph function can be used to run Cox models with recurrent events. First, we’ll define a response variable using the Surv function (called \(Y\)):
library(survival)
Y=Surv(bladder$START,bladder$STOP,bladder$EVENT==1)
## Warning in Surv(bladder$START, bladder$STOP, bladder$EVENT == 1): Stop time
## must be > start time, NA created
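The warning arises when a row has a stop time that is not strictly greater than its start time; Surv sets the response for that row to NA. A quick way to locate such rows is sketched below on a small made-up data frame (`toy` and its values are hypothetical; the column names follow the bladder data).

```r
# Hypothetical counting-process data with one zero-length interval
toy <- data.frame(ID    = c(1, 1, 2),
                  START = c(0, 5, 0),
                  STOP  = c(5, 5, 9),
                  EVENT = c(1, 0, 0))

# Rows that violate STOP > START and would be turned into NA by Surv()
bad_rows <- toy[toy$STOP <= toy$START, ]
bad_rows

# Number of rows that remain usable
n_usable <- sum(toy$STOP > toy$START)
n_usable
```

Whether to drop such rows or to correct them depends on how the zero-length interval came about, so it is worth inspecting them before fitting.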
The Surv function requires three arguments with data in the counting process format: the start variable (called START), the stop variable (called STOP), and the status variable (called EVENT). The code bladder$EVENT==1 indicates that an event is coded 1. R recognizes the value 1 as the default coding of an event, so it was not necessary to state this explicitly in the Surv function as we did. Next, a recurrent-events Cox model is run with the predictors: treatment status (TX), initial number of tumors (NUM), and the initial size of tumors (SIZE):
coxph(Y ~ TX + NUM + SIZE + cluster(ID), data=bladder)
## Call:
## coxph(formula = Y ~ TX + NUM + SIZE, data = bladder, cluster = ID)
##
## coef exp(coef) se(coef) robust se z p
## TX -0.41164 0.66256 0.19989 0.24876 -1.655 0.09798
## NUM 0.16367 1.17782 0.04777 0.05842 2.801 0.00509
## SIZE -0.04108 0.95975 0.07029 0.07421 -0.554 0.57991
##
## Likelihood ratio test=14.66 on 3 df, p=0.002127
## n= 190, number of events= 112
## (1 observation deleted due to missingness)
The term + cluster(ID) in the model formula requests robust standard errors for the parameter estimates.
The treatment variable (TX) is coded 1 for treatment with thiotepa and 0 for the placebo. The estimated hazard ratio (TX=1 vs. TX=0) is 0.663 (with a p-value of 0.0980). There are two sets of standard errors presented in the table under the columns labeled: se(coef) and robust se. The p-values and z-test statistics in this table are calculated using the robust standard errors. We could obtain additional model output (including 95% CIs) by applying the summary function to the fitted coxph object.
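As a sketch of how the confidence intervals can be extracted, the example below uses `bladder2`, a counting-process version of a bladder cancer data set that ships with the survival package. It is assumed, not confirmed, to correspond to the bladder.csv file used here; its variables are named `rx`, `number`, `size`, `start`, `stop`, `event`, and `id`, and `rx` is coded 1 = placebo, 2 = thiotepa.

```r
library(survival)

# Recurrent-events Cox model with robust (cluster) standard errors
fit <- coxph(Surv(start, stop, event) ~ rx + number + size + cluster(id),
             data = bladder2)

# summary() returns, among other things, a matrix of hazard ratios
# with 95% confidence limits
ci <- summary(fit)$conf.int
ci[, c("exp(coef)", "lower .95", "upper .95")]
```

Because the model includes a cluster term, the interval limits returned by summary are based on the robust standard errors.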
A stratified Cox model can also be run using the data in this format with the variable INTERVAL as the stratified variable. The stratified variable indicates whether the subject was at risk for their first, second, third, or fourth event. This approach is called a Stratified CP recurrent event model and is used if the investigator wants to distinguish the order in which recurrent events occur. The bladder data is in the proper format to run this model.
coxph(Y ~ TX + NUM + SIZE + strata(INTERVAL) + cluster(ID), data=bladder)
## Call:
## coxph(formula = Y ~ TX + NUM + SIZE + strata(INTERVAL), data = bladder,
## cluster = ID)
##
## coef exp(coef) se(coef) robust se z p
## TX -0.333489 0.716420 0.216168 0.204787 -1.628 0.1034
## NUM 0.119617 1.127065 0.053338 0.051387 2.328 0.0199
## SIZE -0.008495 0.991541 0.072762 0.061635 -0.138 0.8904
##
## Likelihood ratio test=6.51 on 3 df, p=0.08928
## n= 190, number of events= 112
## (1 observation deleted due to missingness)
The only additional code from the previous model is the term + strata(INTERVAL) in the model formula, which indicates that INTERVAL is the stratified variable. Interaction terms between the treatment variable (TX) and the stratified variable could be created to examine whether the effect of treatment differed for the 1st, 2nd, 3rd, or 4th event.
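One way to set up such interaction terms is to build them by hand as products of the treatment indicator and indicators for the event number. The sketch below again uses `bladder2` from the survival package as a stand-in for bladder.csv (an assumption); the new columns `trt`, `trt_e2`, `trt_e3`, and `trt_e4` are hypothetical names, and `enum` plays the role of INTERVAL.

```r
library(survival)

bl <- bladder2
bl$trt    <- as.numeric(bl$rx == 2)      # 1 = thiotepa, 0 = placebo
bl$trt_e2 <- bl$trt * (bl$enum == 2)     # extra treatment effect, 2nd event
bl$trt_e3 <- bl$trt * (bl$enum == 3)     # extra treatment effect, 3rd event
bl$trt_e4 <- bl$trt * (bl$enum == 4)     # extra treatment effect, 4th event

# Stratified CP model with event-specific treatment effects
fit_int <- coxph(Surv(start, stop, event) ~ trt + trt_e2 + trt_e3 + trt_e4 +
                   number + size + strata(enum) + cluster(id),
                 data = bl)
coef(fit_int)
```

In this parameterization `trt` is the log hazard ratio for the first event, and each `trt_e*` coefficient is the difference between the treatment effect for that event and the effect for the first event.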
Another stratified approach (called Gap Time) is a slight variation of the Stratified CP approach. The difference is in the way the time intervals for the recurrent events are defined. There is no difference in the time intervals when subjects are at risk for their first event. However, with the Gap Time approach, the starting time at risk gets reset to zero for each subsequent event. To run a Gap Time model, we need to create two new (start, stop) variables in the bladder dataset, which we’ll call START2 and STOP2:
bladder$START2=0
bladder$STOP2=bladder$STOP - bladder$START
The first of the two newly defined variables (START2) is always zero. The second (STOP2) is defined as the time between each event (STOP–START). To print a subset of these variables, we can use the data.frame function. The attach function allows variables in the bladder dataset to be listed without the bladder$ prefix (code and output for printing the 12th–20th observation below).
attach(bladder)
data.frame(ID,EVENT,START,STOP,START2,STOP2)[12:20, ]
## ID EVENT START STOP START2 STOP2
## 12 10 1 0 12 0 12
## 13 10 1 12 16 0 4
## 14 10 0 16 18 0 2
## 15 11 0 0 23 0 23
## 16 12 1 0 10 0 10
## 17 12 1 10 15 0 5
## 18 12 0 15 23 0 8
## 19 13 1 0 3 0 3
## 20 13 1 3 16 0 13
Next we need to reset our response variable using the Surv function by changing our time intervals from (START, STOP) to (START2, STOP2):
Y2=Surv(bladder$START2,bladder$STOP2,bladder$EVENT)
## Warning in Surv(bladder$START2, bladder$STOP2, bladder$EVENT): Stop time must
## be > start time, NA created
Next we run a Gap Time model with the bladder data using similar code that was used for the Stratified CP model except we use Y2 rather than Y as our response variable.
coxph(Y2 ~ TX + NUM + SIZE + strata(INTERVAL) + cluster(ID),data=bladder)
## Call:
## coxph(formula = Y2 ~ TX + NUM + SIZE + strata(INTERVAL), data = bladder,
## cluster = ID)
##
## coef exp(coef) se(coef) robust se z p
## TX -0.279005 0.756536 0.207348 0.215624 -1.294 0.19569
## NUM 0.158046 1.171220 0.051942 0.050940 3.103 0.00192
## SIZE 0.007415 1.007443 0.070023 0.064333 0.115 0.90824
##
## Likelihood ratio test=9.33 on 3 df, p=0.02517
## n= 190, number of events= 112
## (1 observation deleted due to missingness)
The results using the Gap Time approach vary slightly from those obtained using the Stratified CP approach.
Until now we have considered survival times with a single, well-defined outcome, such as death or some other event. In some applications, however, a patient may potentially experience multiple events, only the first-occurring of which can be observed. For example, we may be interested in time from diagnosis with prostate cancer until death from that disease (Cause 1) or death from some other cause (Cause 2), but for a particular patient we can only observe the time to the first event. Of course, a patient may also be censored if he is still alive at the last follow-up time. If interest centers on a particular outcome, time to prostate cancer death, for example, a simplistic analysis method would be to treat death from other causes as a type of censoring. This approach has the advantage that implementing it is straightforward using the survival analysis methods we have discussed. However, a key assumption about censoring is that it is independent of the event in question. In most competing risk applications, this assumption may be questionable, and in some cases may be quite unrealistic. Furthermore, it is not possible to test the independence assumption using only the competing risks data. The only hope of evaluating the accuracy of the assumption would be to examine other data or appeal to theories concerning the etiology of the various death causes. Consequently, interpretation of survival analyses in the presence of competing risks will always be subject to at least some ambiguity due to uncertainty about the degree of dependence among the competing outcomes.
We begin with estimating a survival curve in a single sample in the presence of competing events. The simplest method would be to select each event in turn as the primary event, and to treat the other as a censoring event. However, to obtain unbiased estimates of survival curves, this simplistic method would require the usually false assumption that the two causes of death are independent. We may illustrate this problem by considering prostate cancer patients ages 80 and over diagnosed with stage T2 poorly differentiated prostate cancer. We define indicator variables “status.other” and “status.prost”, and then select the subset “prostateSurvival.highrisk” as follows, using the “prostate survival” data.
library(asaur)
prostateSurvival <- within(prostateSurvival,
{status.prost <- as.numeric({status==1})
status.other <- as.numeric({status==2})})
attach(prostateSurvival)
## The following object is masked _by_ .GlobalEnv:
##
## status
## The following object is masked from pancreatic:
##
## stage
prostateSurvival.highrisk <-
prostateSurvival [{{grade=="poor"} & {stage=="T2"} & {ageGroup=="80+"}},]
head(prostateSurvival.highrisk)
## grade stage ageGroup survTime status status.other status.prost
## 13 poor T2 80+ 21 0 0 0
## 38 poor T2 80+ 105 0 0 0
## 41 poor T2 80+ 2 1 0 1
## 47 poor T2 80+ 67 2 1 0
## 78 poor T2 80+ 2 0 0 0
## 93 poor T2 80+ 60 2 1 0
Let us consider two analyses, one with death due to other causes (status = 2) as censored, and the other with death due to prostate cancer (status = 1) as censored. We set these up as follows:
status.prost <- {prostateSurvival.highrisk$status==1}
status.other <- {prostateSurvival.highrisk$status==2}
The Kaplan-Meier estimate of survival, defined as time to death from prostate cancer (with other causes of death considered as censored), is obtained as follows:
result.prostate.km <- survfit(Surv(survTime, event = status.prost) ~ 1, data=prostateSurvival.highrisk)
Similarly, to estimate survival for time to death from other causes, we have
result.other.km <- survfit(Surv(survTime, event = status.other) ~ 1, data=prostateSurvival.highrisk)
To illustrate the problem with this analysis, let us first extract the Kaplan-Meier survival curve for death from other causes:
surv.other.km <- result.other.km$surv
time.km <-result.other.km$time/12
Now let’s extract the corresponding survival curve for death from prostate cancer, and then express it as a cumulative incidence function, that is, as one minus the survival curve (this quantity is also known as the cumulative distribution function):
surv.prost.km <- result.prostate.km$surv
cumDist.prost.km <- 1 - surv.prost.km
Now we may plot both on the same graph, using the plot option ‘type = “s”’ to produce step functions:
plot(cumDist.prost.km ~ time.km, type="s",
ylim=c(0,1), lwd=2,
xlab="Years from prostate cancer diagnosis", col="blue")
lines(surv.other.km~time.km, type="s", col="green", lwd=2)
The result, shown in Figure, shows that the two curves cross. At 10 years, for example, the probability of dying of prostate cancer is 0.46, and of other causes it is 0.88. The fact that the sum of these two probabilities exceeds one demonstrates that these estimates, viewed as probabilities that a particular patient would die of prostate cancer or something else, are severely biased. One might be tempted to view these curves as estimates of the probability of death from one cause if the other cause were eliminated as a possibility, but such an exercise would require the assumption that the causes be independent. This assumption cannot be tested from the data, and in any case the meaning of the resulting estimates would be purely hypothetical.
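The same phenomenon can be reproduced on a very small scale. The sketch below uses six hypothetical observations with two causes of failure (1 and 2) and censoring (0), and computes one minus the Kaplan-Meier estimate for each cause while treating the other cause as censoring.

```r
library(survival)

# Six hypothetical observations: status 1 and 2 are the two causes, 0 = censored
tt     <- c(2, 7, 5, 3, 4, 6)
status <- c(1, 2, 1, 2, 0, 0)

# Naive analyses: each cause in turn, the other cause treated as censored
km1 <- survfit(Surv(tt, status == 1) ~ 1)
km2 <- survfit(Surv(tt, status == 2) ~ 1)

# "Probability" of failing from each cause by the end of follow-up
naive1 <- 1 - min(km1$surv)
naive2 <- 1 - min(km2$surv)
naive_sum <- naive1 + naive2

c(cause1 = naive1, cause2 = naive2, total = naive_sum)
```

The total exceeds one, so the two naive estimates cannot both be read as probabilities of what actually happens to a subject.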
“Subject can die of only one of \(K\) causes”
To develop a formal model to accommodate competing risks, let us suppose that there are \(K\) distinct causes of death, which we may diagram as in Figure.
The distinguishing feature of this competing causes framework is that each subject can experience at most one of the \(K\) causes of death; the times at which the subject would have experienced the remaining causes are thus unknown. This framework can also accommodate applications with non-fatal events, as long as all of the events are mutually exclusive. With competing risks, it is helpful to define, for each cause of interest, a function known as the cumulative risk function, also called the sub-distribution function. This is the cumulative probability that an individual dies from that particular cause by time \(t\), and is given by
\[F_j(t)=P(T \le t, C=j)=\int_0^t h_j(u)S(u)du\] This function is similar to the cumulative distribution function in that it is always increasing (or more precisely, non-decreasing). But unlike a cumulative distribution function, it goes, in the limit, to the probability of death from that particular cause, rather than to 1. Formally, we have
\[F_j(\infty) =P(C=j)\] The cause-specific hazard is defined in a manner similar to the hazard function, but now it is the probability that a specific event occurs at time \(t\) given that the individual survives that long:
\[h_j(t)=\lim_{\delta \to 0} (\frac{P(t<T<t+\delta, C=j|T>t)}{\delta})\] If we add up all of the cause-specific hazards at a particular time, we get the hazard function
\[h(t)=\sum_{j=1}^{K} h_j(t)\] That is, the risk of death at a particular time is the sum of the risks of all of the specific causes of death at that time.
Suppose now that we have \(D\) distinct ordered failure times \(t_1,t_2,\dots, t_D\). We may estimate the hazard at the \(i^{th}\) time \(t_i\) using \(\hat{h}(t_i)=\frac{d_i}{n_i}\), as we have seen in previous chapters. The cause-specific hazard for the \(k^{th}\) cause may be written in a similar form as \(\hat{h}_k(t_i)=\frac{d_{ik}}{n_i}\). This is just the number of events of type \(k\) at that time divided by the number at risk at that time. The sum over all cause-specific hazards is the overall hazard, \(\hat{h}(t_i)=\sum_k \hat{h}_k(t_i)=\frac{\sum_k d_{ik}}{n_i}\). The probability of failure from any cause at time \(t_i\) is the product of \(\hat{S}(t_{i-1})\), the probability of being alive just before \(t_i\), and \(\hat{h}(t_i)\), the risk of dying at \(t_i\). Similarly, the probability of failure due to cause \(k\) at that time is \(\hat{S}(t_{i-1})\hat{h}_k(t_i)\). The sub-distribution function, or cumulative incidence function, is the probability of dying of cause \(k\) by time \(t\). This is the sum of all probabilities of dying of this cause up to time \(t\) and is given by
\[\hat{F}_k(t)=\sum_{t_i \le t} \hat{S}(t_{i-1})\hat{h}_k(t_i)\] That is, once we have an estimate of the overall survival function \(\hat{S}(t)\), we can obtain the cumulative incidence function for a particular cause by summing over the product of this and the cause-specific hazards for that cause.
To illustrate this methodology, let us consider a simple hypothetical data set with six observations and two possible causes of death, displayed in Fig. 9.3. Denoting the event types with the numbers 1 and 2, and the censored observations with the number 0, we may enter the data into R as follows:
“Competing risk survival data”
# install.packages("mstate")
library(survival)
library(mstate)
tt <- c(2,7,5,3,4,6)
status <- c(1,2,1,2,0,0)
We first compute the overall survival distribution,
status.any <- as.numeric(status >= 1)
result.any <-survfit(Surv(tt,status.any)~1)
result.any$surv
## [1] 0.8333333 0.6666667 0.6666667 0.4444444 0.4444444 0.0000000
We compute the cumulative incidence functions as in the following table.
“Competing risk survival data”
For example, the probability of event type 1 at the first time \((t=2)\) is given by \(1.000 \times \frac{1}{6}=0.167\). This is also the estimate of the cumulative incidence function at this time. The probability of an event of this type at time \(t=5\) is \(0.667 \times \frac{1}{3}=0.222\). Then the cumulative incidence for this event at time \(t=5\) is \(0.167+0.222=0.389\). These results may be more easily obtained using the “Cuminc” function in the “mstate” R package:
ci <- Cuminc(time=tt, status=status)
ci
## time Surv CI.1 CI.2 seSurv seCI.1 seCI.2
## 1 2 0.8333333 0.1666667 0.0000000 0.1521452 0.1521452 0.0000000
## 2 3 0.6666667 0.1666667 0.1666667 0.1924501 0.1521452 0.1521452
## 3 5 0.4444444 0.3888889 0.1666667 0.2222222 0.2187224 0.1521452
## 4 7 0.0000000 0.3888889 0.6111111 0.0000000 0.2187224 0.2187224
The standard errors for the survival curve are computed using Greenwood’s formula and the standard errors for the cumulative incidence functions are computed in an analogous manner.
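The cumulative incidence estimates in this output can also be reproduced step by step in base R, following the formula \(\hat{F}_k(t)=\sum_{t_i \le t} \hat{S}(t_{i-1})\hat{h}_k(t_i)\). This is a sketch for the six-observation example only; the helper names (`ev_times`, `n_risk`, `cif1`, and so on) are illustrative.

```r
tt     <- c(2, 7, 5, 3, 4, 6)
status <- c(1, 2, 1, 2, 0, 0)

# Distinct times at which a failure of either type occurs
ev_times <- sort(unique(tt[status > 0]))

# Number at risk and number of failures of each type at those times
n_risk <- sapply(ev_times, function(t) sum(tt >= t))
d1     <- sapply(ev_times, function(t) sum(tt == t & status == 1))
d2     <- sapply(ev_times, function(t) sum(tt == t & status == 2))

# Overall Kaplan-Meier survival, and its value just before each event time
S      <- cumprod(1 - (d1 + d2) / n_risk)
S_prev <- c(1, head(S, -1))

# Cumulative incidence: running sum of S(t_{i-1}) times cause-specific hazard
cif1 <- cumsum(S_prev * d1 / n_risk)
cif2 <- cumsum(S_prev * d2 / n_risk)

data.frame(time = ev_times, Surv = S, CI.1 = cif1, CI.2 = cif2)
```

At every event time the two cumulative incidences and the overall survival add up to one, which is the property that the naive one-minus-Kaplan-Meier estimates lack.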
Returning to the prostate cancer example of Fig. 9.1, we may now estimate the competing risks cumulative incidence functions using the “Cuminc” function in the R package “mstate” as follows:
library(mstate)
ci.prostate <- Cuminc(time=prostateSurvival.highrisk$survTime,
status=prostateSurvival.highrisk$status)
head(ci.prostate)
## time Surv CI.1 CI.2 seSurv seCI.1 seCI.2
## 1 0 1.0000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 2 1 0.9940758 0.000000000 0.005924171 0.002641510 0.000000000 0.002641510
## 3 2 0.9880511 0.006024702 0.005924171 0.003756151 0.002686199 0.002641510
## 4 3 0.9843644 0.008482541 0.007153090 0.004303185 0.003192654 0.002910105
## 5 4 0.9831168 0.009730151 0.007153090 0.004474935 0.003423736 0.002910105
## 6 5 0.9780751 0.014771775 0.007153090 0.005112934 0.004233796 0.002910105
The first column, “time” is the time in months. The column “Surv” is the Kaplan-Meier survival estimate for time to death from any cause (prostate or something else). The next two columns are the cumulative incidence function estimates for causes 1 (prostate) and 2 (other). The remaining columns are standard errors of the respective estimates. We may plot the cause-specific cumulative incidence functions as follows:
ci1 <- ci.prostate$CI.1 # CI.1 is for prostate cancer
ci2 <- ci.prostate$CI.2 # CI.2 is for other causes
times <- ci.prostate$time/12 # convert months to years
Rci2 <- 1 - ci2
We may plot the cumulative incidence function for death from prostate cancer in solid blue and, for death from other causes, one minus the cumulative incidence function in solid green, together with the previous estimates as thin lines of the same (but lighter) colors:
plot(Rci2 ~ times, type="s", ylim=c(0,1), lwd=2, col="green",
xlab="Time in years", ylab="Survival probability")
lines(ci1 ~ times, type="s", lwd=2, col="blue")
lines(surv.other.km ~ time.km, type="s", col="lightgreen", lwd=1)
lines(cumDist.prost.km ~ time.km, type="s", col="lightblue", lwd=1)
Figure clearly illustrates the value of displaying competing risks cumulative incidence functions. These curves represent estimates of the actual probabilities that a patient will die of a particular cause, rather than hypothetical probabilities that he would die of one cause in the absence of the other.
A common way to display competing risk cumulative incidence curves is via a stacked plot, as shown in Fig. 9.5. The lower, blue curve represents the cumulative probability of death from prostate cancer, and the difference between the blue and upper, green curve represents the probability of death from other causes. The sum of the two probabilities of death, i.e. the upper, green curve, represents the cumulative probability of death from any cause, and is equal to one minus the Kaplan-Meier survival curve for death from any cause.
ci1 <- ci.prostate$CI.1 # CI.1 is for prostate cancer
ci2 <- ci.prostate$CI.2 # CI.2 is for other causes
times <- ci.prostate$time/12 # convert months to years
sumci <- ci1 +ci2
plot(sumci ~ times, type="s", ylim=c(0,1), lwd=2, col="green",
xlab="Years from prostate cancer diagnosis",
ylab="probability patient has died")
lines(ci1 ~ times, type="s", lwd=2, col="blue")
When there is a single outcome of interest, the Cox proportional hazards model provides an elegant method for accommodating covariate information. However, modeling covariate information for competing risks data presents special challenges, since it is difficult to define precisely the hazard function on which the covariates should operate. The first method we will discuss is the most direct. We will illustrate using the prostate cancer data, this time restricting our attention (for now) to patients with stage T2 prostate cancer. Essentially, we will study the effects of the remaining covariates (grade and age) on prostate cancer death, treating other causes of death as censoring indicators, and vice versa for the effects of the covariates on other causes of death. We set up the data as follows:
prostateSurvival.T2 <- prostateSurvival[prostateSurvival$stage=="T2",]
attach(prostateSurvival.T2)
## The following objects are masked _by_ .GlobalEnv:
##
## status, status.other, status.prost
## The following objects are masked from prostateSurvival:
##
## ageGroup, grade, stage, status, status.other, status.prost,
## survTime
## The following object is masked from pancreatic:
##
## stage
We then fit a standard Cox model for prostate cancer death as follows:
result.prostate <- coxph(Surv(survTime, status.prost) ~ grade + ageGroup,
data=prostateSurvival.T2)
summary(result.prostate)
## Call:
## coxph(formula = Surv(survTime, status.prost) ~ grade + ageGroup,
## data = prostateSurvival.T2)
##
## n= 5920, number of events= 410
##
## coef exp(coef) se(coef) z Pr(>|z|)
## gradepoor 1.2199 3.3867 0.1004 12.154 < 2e-16 ***
## ageGroup70-74 -0.2860 0.7513 0.2595 -1.102 0.2704
## ageGroup75-79 0.4027 1.4958 0.2257 1.784 0.0744 .
## ageGroup80+ 0.9728 2.6454 0.2148 4.529 5.92e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## gradepoor 3.3867 0.2953 2.7819 4.123
## ageGroup70-74 0.7513 1.3310 0.4518 1.249
## ageGroup75-79 1.4958 0.6685 0.9611 2.328
## ageGroup80+ 2.6454 0.3780 1.7364 4.030
##
## Concordance= 0.74 (se = 0.012 )
## Likelihood ratio test= 252.4 on 4 df, p=<2e-16
## Wald test = 243.6 on 4 df, p=<2e-16
## Score (logrank) test = 278.9 on 4 df, p=<2e-16
These results show that patients having poorly differentiated disease (grade = poor) have a much worse prognosis than do patients with moderately differentiated disease (the reference group here), with a log-hazard ratio of 1.2199. These results also show that the hazard of dying from prostate cancer generally increases with age at diagnosis (the reference is the youngest age group, 65–69); the estimate for ages 70–74 is slightly below the reference, but the difference is not statistically significant.
Considering death from other causes as the event of interest, we have
result.other <- coxph(Surv(survTime, status.other) ~ grade + ageGroup,
data=prostateSurvival.T2)
summary(result.other)
## Call:
## coxph(formula = Surv(survTime, status.other) ~ grade + ageGroup,
## data = prostateSurvival.T2)
##
## n= 5920, number of events= 1345
##
## coef exp(coef) se(coef) z Pr(>|z|)
## gradepoor 0.28104 1.32451 0.05875 4.784 1.72e-06 ***
## ageGroup70-74 0.09462 1.09924 0.12492 0.757 0.44879
## ageGroup75-79 0.31330 1.36793 0.11709 2.676 0.00746 **
## ageGroup80+ 0.79012 2.20367 0.11204 7.052 1.76e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## gradepoor 1.325 0.7550 1.1805 1.486
## ageGroup70-74 1.099 0.9097 0.8605 1.404
## ageGroup75-79 1.368 0.7310 1.0874 1.721
## ageGroup80+ 2.204 0.4538 1.7692 2.745
##
## Concordance= 0.583 (se = 0.008 )
## Likelihood ratio test= 159.6 on 4 df, p=<2e-16
## Wald test = 159.3 on 4 df, p=<2e-16
## Score (logrank) test = 164.6 on 4 df, p=<2e-16
Taken at face value, these results indicate that patients with poorly differentiated cancer have a higher risk of death from non-prostate-cancer related disease than do those with moderately differentiated disease. While the log hazard ratio is much smaller than with prostate cancer death as the outcome (0.28104 vs. 1.2199), one might expect that cancer grade wouldn’t have any effect on death from non-prostate-cancer causes. These hazard ratios refer to hazard functions for death from prostate cancer and for death from other causes, and these are assumed to be operating independently. As we have discussed previously, these assumptions are highly suspect, and it is unclear to what extent the hazard functions that have been estimated correspond to actual (and unobservable) hazards.
To address this issue, Fine and Gray developed an alternative method for modeling covariate data with competing risks. Instead of defining the effects of covariates on the cause-specific hazards, they define a “sub-distribution hazard”
\[\bar{h}_k(t)=\lim_{\delta \to 0}\frac{P(t<T_k<t+\delta \mid E)}{\delta}\] where the conditioning event is given by \[E=\{T_k>t\}\;\text{or}\;\{T_{k^{'}} \le t\;\text{and}\;k^{'} \ne k\}\] That is, the sub-distribution hazard for cause \(k\), like the definition of the ordinary hazard function, is essentially the probability that the failure time lies in a small interval at \(t\) conditional on an event \(E\), divided by the length of that small interval. The difference is that, in addition to referring to the \(k^{th}\) failure time, the conditioning set specifies not only that \(T_k > t\) but also allows inclusion of events other than the \(k^{th}\) event in question, in which case we must have \(T_{k^{'}} \le t\). Thus, when computing these sub-distribution hazards, the risk set includes not only those currently alive and at risk for the \(k^{th}\) event type, but also those who died earlier of other causes.
Consider, for example, for the data in Fig. 9.3, the risk set for death from Cause #2 (triangles) at time \(t=7\) consists not only of Patient 2, the sole patient still alive at that time, but also Patients 1 and 3, since they died of Cause #1 (squares) earlier. Patient 4 is not in the risk set for death from Cause #2 at time \(t=7\) since that person died earlier from Cause #2, the same cause as Patient 2. Patients 5 and 6 also are not in the risk set at this time since they were censored. The sub-distribution hazard may be written in a more compact equivalent form as
\[\bar{h}_k(t)=-\frac{d \log(1-F_k(t))}{dt}\].
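The risk-set rule described above can be sketched in a few lines of R. The data frame below is hypothetical: its times were chosen only to mimic the pattern described for the six patients and are not the values behind Fig. 9.3. The helper `subdist.risk.set` is an illustrative name, and it ignores the censoring weights that the full Fine and Gray estimator applies.

```r
# Hypothetical toy data; cause 0 = censored, 1 = Cause #1, 2 = Cause #2
toy <- data.frame(
  id    = 1:6,
  time  = c(2, 7, 4, 5, 3, 6),
  cause = c(1, 2, 1, 2, 0, 0)
)

# Simplified sub-distribution risk set for cause k at time t:
# subjects still event-free at t, plus subjects who failed earlier
# from a cause other than k (censored subjects are simply dropped here)
subdist.risk.set <- function(data, k, t) {
  still.at.risk <- data$time >= t
  failed.other  <- data$time < t & !(data$cause %in% c(0, k))
  data$id[still.at.risk | failed.other]
}

risk.ids <- subdist.risk.set(toy, k = 2, t = 7)
risk.ids
```

With these toy values the risk set for Cause #2 at \(t=7\) contains Patients 1, 2 and 3, whereas the ordinary cause-specific risk set would contain Patient 2 only.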
The Fine and Gray method uses these sub-distribution hazards for modeling the effects of covariates on a specific cause of death analogously to the Cox model,
\[\bar{h}_k(t;z,\beta)=\bar{h}_{0k}(t)e^{z\beta}\].
That is, the sub-distribution hazard for a subject with covariates \(z\) is proportional to a baseline sub-distribution hazard function \(\bar{h}_{0k}(t)\).
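One consequence of this model is worth writing out. Integrating the compact form \(\bar{h}_k(t)=-d\log(1-F_k(t))/dt\) and substituting the proportional sub-distribution hazard gives a direct link between the covariates and the cumulative incidence function. The symbol \(F_{0k}\), the cumulative incidence function at baseline, is introduced here for illustration.

```latex
\[1-F_k(t;z)=\exp\left(-\int_0^t \bar{h}_k(s;z,\beta)\,ds\right)
=\exp\left(-e^{z\beta}\int_0^t \bar{h}_{0k}(s)\,ds\right)
=\left[1-F_{0k}(t)\right]^{e^{z\beta}}\]
```

A positive coefficient therefore raises the cumulative incidence of cause \(k\) at every time point, which is why Fine and Gray coefficients are read as effects on cumulative incidence rather than on the cause-specific hazard.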
The Fine and Gray methods are implemented in the “crr” function in the R package “cmprsk”. Before we can use the competing risk function “crr” in this package, we need to put the covariates into a model matrix using the “model.matrix” function. Using our attached data set “prostateSurvival.T2”, we do this as follows:
#install.packages("cmprsk")
library(cmprsk)
cov.matrix <- model.matrix(~ grade + ageGroup, data=prostateSurvival.T2)
head(cov.matrix)
## (Intercept) gradepoor ageGroup70-74 ageGroup75-79 ageGroup80+
## 4 1 0 1 0 0
## 6 1 1 0 1 0
## 10 1 1 0 1 0
## 13 1 1 0 0 1
## 15 1 0 0 1 0
## 18 1 0 0 1 0
cov.matrix.use <- cov.matrix[,-1] # drop the first column
We obtain estimates for death from prostate cancer as follows, dropping the first (intercept) column of the covariate matrix:
library(cmprsk)
prostateSurvival.T2 <- prostateSurvival[prostateSurvival$stage=="T2",]
attach(prostateSurvival.T2)
## The following objects are masked _by_ .GlobalEnv:
##
## status, status.other, status.prost
## The following objects are masked from prostateSurvival.T2 (pos = 4):
##
## ageGroup, grade, stage, status, status.other, status.prost,
## survTime
## The following objects are masked from prostateSurvival:
##
## ageGroup, grade, stage, status, status.other, status.prost,
## survTime
## The following object is masked from pancreatic:
##
## stage
#result.prostate.crr <- crr(survTime, status, cov1=cov.matrix[,-1], failcode=1)
#summary(result.prostate.crr)
“Death from prostate cancer”
The argument “failcode=1” refers to death from prostate cancer. For death from other causes, we use “failcode=2”,
#result.other.crr <- crr(survTime, status, cov1=cov.matrix[,-1], failcode=2)
#summary(result.other.crr)
“Death from other causes”
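Since the “crr” calls above are shown commented out, here is a minimal, self-contained sketch of the same call pattern on simulated data. Everything about the simulation (sample size, rates, the single binary covariate `x`, and the latent-failure-time construction) is an assumption made for illustration and has nothing to do with the prostate data.

```r
library(cmprsk)

set.seed(1)
n  <- 300
x  <- rbinom(n, 1, 0.5)                      # hypothetical binary covariate

# Illustrative latent times: cause 1 depends on x, cause 2 does not
t1   <- rexp(n, rate = 0.10 * exp(0.8 * x))
t2   <- rexp(n, rate = 0.10)
cens <- runif(n, 5, 15)

times   <- cbind(cens, t1, t2)
ftime   <- apply(times, 1, min)
fstatus <- apply(times, 1, which.min) - 1    # 0 = censored, 1 = cause 1, 2 = cause 2

cov1 <- cbind(x = x)                         # covariate matrix without an intercept

# failcode selects the cause of interest; cencode marks censored observations
fit.cause1 <- crr(ftime, fstatus, cov1, failcode = 1, cencode = 0)
fit.cause2 <- crr(ftime, fstatus, cov1, failcode = 2, cencode = 0)

summary(fit.cause1)
```

Passing the time and status vectors explicitly, rather than relying on an attached data frame, avoids the kind of masking reported in the attach messages above.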
Again we see that poorly differentiated patients have higher risk for death from other causes (coefficient = 0.126), but the effect size is smaller than we obtained from the Putter et al. method (coefficient 0.281). The estimated effect on death from prostate cancer of having poorly differentiated disease is similar for both methods (coefficient of 1.22 for Putter et al. vs. 1.132 for Fine and Gray). Note that these values are log hazard ratios (log sub-distribution hazard ratios for Fine and Gray), not ratios themselves.
An advantage of the Putter et al. method over the Fine and Gray method is the ease with which we can compare the effects of a covariate on, for example, death from prostate cancer and death from other causes. For example, we know that the risk of both causes of death increases with age. But does the effect of age differ for these two causes? To answer this question, we first need to convert the data set from the original one where each patient has his own row in the data set into one where each patient’s data is split into separate rows, one for each cause of death. In the prostate cancer case, we need to create, for each patient, two rows, one for death from prostate cancer and one for death from other causes. To simplify this process, we can use utilities in the “mstate” package. This package is capable of handling complex multistate survival models, but can also be used to set up competing risks as a special case. We begin by setting up a “transition” matrix using the function “trans.comprisk”,
library(mstate)
tmat <- trans.comprisk(2, names = c("event-free", "prostate", "other"))
tmat
## to
## from event-free prostate other
## event-free NA 1 2
## prostate NA NA NA
## other NA NA NA
The first argument is the number of specific outcomes, and the second argument (“names”) gives the name of the censored outcome and the two other outcomes. The resulting matrix states that a patient’s status can change from “event-free” to either “prostate” or “other”, these latter two being causes of death. The other entries of the matrix simply state that once a patient dies of one cause, they cannot change to another cause or return to the “event-free” status. Next, we use the function “msprep” to create the new data set, and examine the first few rows:
#attach(prostateSurvival.T2)
#prostate.long <- msprep(time = cbind(NA, survTime, survTime),
# status = cbind(NA, status.prost, status.other),
# keep = data.frame(grade, ageGroup), trans = tmat)
#head(prostate.long)
In this “msprep” function, the argument “time” consists of three columns, each corresponding to one of the states defined by the “tmat” transition matrix. The first “event-free” state is represented by a placeholder, “NA”; the second and third by the survival times for time to death from prostate cancer and from other causes. In our data set, both are represented by the “survTime” vector. The two times are distinguished in the next argument, “status”. This also has three columns. The first is a placeholder, “NA” as before; the second is the censoring indicator for prostate cancer (“status.prost”), and the third is for other causes (“status.other”). These latter two variables were defined earlier from the “status” column of the data frame “prostateSurvival.T2”. Finally, the transition matrix is defined by “trans = tmat”. Note that the variables “survTime”, “grade”, and “ageGroup” from the “prostateSurvival.T2” file are available for use to us because we have previously attached it.
The output file has twice as many rows as the original “prostateSurvival.T2” file. The first column, “id”, refers to the patient number in the original file; here, each is repeated twice. For our purposes, we can ignore the columns “from” and “to”. The column “trans” will be important, because it contains an indicator of the cause of death; here “1” refers to death from prostate cancer and “2” refers to death from other causes. The “Tstart” column contains all 0’s, since for our data, “time = 0” indicates the diagnosis with prostate cancer. We can ignore “Tstop”, and use the “time” column as the survival time and the “status” column as the censoring indicator. Note that for each patient, there are two entries for “status”. Both can be 0, or one can be 1 and the other 0; they can’t both be 1 because each patient can die of only one cause, not both. Finally, the last two columns are covariate columns we carried over from the original “prostateSurvival.T2” data frame. Each original value is doubled, since each patient has one covariate value, regardless of their cause of death.
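To make the long format concrete, the following sketch runs “msprep” on a four-patient toy data frame. The data values and the object names `toy`, `tmat.toy` and `toy.long` are hypothetical; the call uses the column-name form of the “time” and “status” arguments together with “data”, so nothing needs to be attached.

```r
library(mstate)

# Hypothetical wide-format data: one row per patient
toy <- data.frame(
  survTime     = c(5, 8, 3, 10),
  status.prost = c(1, 0, 0, 0),   # 1 = died of prostate cancer
  status.other = c(0, 1, 0, 0),   # 1 = died of another cause
  grade        = factor(c("poor", "mode", "mode", "poor"))
)

tmat.toy <- trans.comprisk(2, names = c("event-free", "prostate", "other"))

# Columns are referenced by name; NA is the placeholder for the starting state
toy.long <- msprep(time   = c(NA, "survTime", "survTime"),
                   status = c(NA, "status.prost", "status.other"),
                   data   = toy, keep = "grade", trans = tmat.toy)

toy.long                       # two rows per patient, one per cause
events(toy.long)$Frequencies   # counts of each transition
```

Each patient contributes one row with “trans = 1” and one with “trans = 2”, and at most one of the two “status” entries equals 1.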
We may obtain a summary of the numbers of events of each type as follows:
#events(prostate.long)$Frequencies
These results indicate that there are 410 deaths due to prostate cancer, 1345 due to other causes, and 4165 censored observations, for 5920 total. (We may ignore the second two rows, which are relevant only for multistate models.)
To show how to use our newly expanded data set, we can use it to reproduce our analysis from the previous section. To obtain these estimates of the effects of covariates on prostate-specific and other death causes, we use separate commands, one for “trans = 1” (prostate cancer) and the other for “trans = 2” (other causes of death), as follows:
#summary(coxph(Surv(time, status) ~ grade + ageGroup, data=prostate.long, subset={trans==1}))
#summary(coxph(Surv(time, status) ~ grade + ageGroup, data=prostate.long, subset={trans==2}))
The results (not shown) are identical to what we obtained before.
If we stratify on cause of death using “strata(trans)” we get estimates of the effect of the covariates on cause of death under the assumption that they affect both causes of death equally,
#summary(coxph(Surv(time, status) ~ grade + ageGroup + strata(trans),
# data=prostate.long))
“Results”
In this example, this model wouldn’t be appropriate, since we would expect that cancer grade affects prostate cancer death differently than it does death from other causes. To test this formally, we fit the following model:
#summary(coxph(Surv(time, status) ~
# grade*factor(trans) + ageGroup + strata(trans),
# data=prostate.long))
“Results”
The coefficient estimate 1.239 for “gradepoor” is the effect of grade on prostate cancer death, and is similar to the estimate we got earlier (1.220) for prostate cancer death alone. Here, however, we also have an estimate in the last row for the difference between the effect on prostate cancer death and death from other causes. This is the interaction between a grade of “poor” and cause “2” (other death). The estimate, -0.963, which is highly statistically significant, represents the additional effect of poor grade on risk of death from other causes relative to its effect on prostate cancer death. Specifically, the hazard ratio for poor grade on death from other causes is exp(-0.963) ≈ 0.38 times the corresponding hazard ratio on death from prostate cancer.
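The arithmetic linking the main effect and the interaction can be checked directly. The sketch below uses only the two coefficients quoted in the text; the variable names are illustrative.

```r
# Coefficients quoted in the text
b.grade <- 1.239    # gradepoor: effect on prostate cancer death (trans = 1)
b.inter <- -0.963   # gradepoor x cause 2 interaction

# Effect of poor grade on death from other causes, on the log scale
b.other <- b.grade + b.inter

# Hazard ratios for each cause, and the ratio between the two hazard ratios
exp(c(prostate = b.grade, other = b.other, ratio = b.inter))
```

The implied log hazard ratio for other causes, 1.239 - 0.963 = 0.276, is close to the 0.281 estimated earlier from the separate model for death from other causes.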
We have determined that having a poor grade of prostate cancer strongly affects the risk of dying from prostate cancer, and this effect is much stronger on the risk of death from prostate cancer than on the risk of death from other causes. We may next ask how increasing age affects the risk of dying from prostate cancer and of other causes. Unsurprisingly, the trend is clear in both cases, as we have seen above. But is the effect any different on these two causes? We can answer this by examining the interaction between age group and cause of death as follows:
#summary(coxph(Surv(time, status) ~
# (grade + ageGroup)*trans + ageGroup + strata(trans),
# data=prostate.long))
“Results”
The results are in the last three rows of parameter estimates. None of these differences are statistically significant, so we conclude that there is no difference in the effect of age on the two death causes, after adjusting for grade.
Standard survival data measure the time span from some time origin until the occurrence of one type of event. If several types of events occur, a model describing progression to each of these competing risks is needed. Multi-state models generalize competing risks models by also describing transitions to intermediate events. Methods to analyze such models have been developed over the last two decades. Fortunately, most of the analyses can be performed within the standard statistical packages, but may require some extra effort with respect to data preparation and programming. This tutorial aims to review statistical methods for the analysis of competing risks and multi-state models. Although some conceptual issues are covered, the emphasis is on practical issues like data preparation, estimation of the effect of covariates, and estimation of cumulative incidence functions and state and transition probabilities. Examples of analysis with standard software are shown.
The data used in Section 4 of the tutorial are 2204 patients transplanted at the EBMT between 1995 and 1998. These data are included in the mstate package.
EBMT platelet recovery data (from the mstate R documentation): Data from the European Society for Blood and Marrow Transplantation (EBMT)
Description: A data frame of 2204 patients transplanted at the EBMT between 1995 and 1998. These data were used in Section 4 of the tutorial on competing risks and multi-state models (Putter, Fiocco & Geskus, 2007). The included variables are id, prtime, prstat, rfstime, rfsstat, dissub, age, drmatch and tcd.
Source: We acknowledge the European Society for Blood and Marrow Transplantation (EBMT) for making available these data. Disclaimer: these data were simplified for the purpose of illustration of the analysis of competing risks and multi-state models and do not reflect any real life situation. No clinical conclusions should be drawn from these data.
References: Putter H, Fiocco M, Geskus RB (2007). Tutorial in biostatistics: Competing risks and multi-state models. Statistics in Medicine 26, 2389-2430.
library(survival)
library(dplyr)
library(mstate)
data(ebmt3)
head(ebmt3)
## id prtime prstat rfstime rfsstat dissub age drmatch tcd
## 1 1 23 1 744 0 CML >40 Gender mismatch No TCD
## 2 2 35 1 360 1 CML >40 No gender mismatch No TCD
## 3 3 26 1 135 1 CML >40 No gender mismatch No TCD
## 4 4 22 1 995 0 AML 20-40 No gender mismatch No TCD
## 5 5 29 1 422 1 AML 20-40 No gender mismatch No TCD
## 6 6 38 1 119 1 ALL >40 No gender mismatch No TCD
#help(ebmt3)
n <- nrow(ebmt3)
table(ebmt3$dissub)
##
## AML ALL CML
## 853 447 904
round(100 * table(ebmt3$dissub)/n)
##
## AML ALL CML
## 39 20 41
table(ebmt3$age)
##
## <=20 20-40 >40
## 419 1057 728
round(100 * table(ebmt3$age)/n)
##
## <=20 20-40 >40
## 19 48 33
table(ebmt3$drmatch)
##
## No gender mismatch Gender mismatch
## 1648 556
round(100 * table(ebmt3$drmatch)/n)
##
## No gender mismatch Gender mismatch
## 75 25
table(ebmt3$tcd)
##
## No TCD TCD
## 1928 276
round(100 * table(ebmt3$tcd)/n)
##
## No TCD TCD
## 87 13
The European Society for Blood and Marrow Transplantation (EBMT) illness-death model
tmat <- matrix(NA, 3, 3)
tmat[1, 2:3] <- 1:2
tmat[2, 3] <- 3
dimnames(tmat) <- list(from = c("Tx", "PR", "RelDeath"),
to = c("Tx", "PR", "RelDeath"))
tmat
## to
## from Tx PR RelDeath
## Tx NA 1 2
## PR NA NA 3
## RelDeath NA NA NA
tmat <- transMat(x = list(c(2, 3), c(3), c()), names = c("Tx", "PR", "RelDeath"))
tmat
## to
## from Tx PR RelDeath
## Tx NA 1 2
## PR NA NA 3
## RelDeath NA NA NA
tmat <- trans.illdeath(names = c("Tx", "PR", "RelDeath"))
tmat
## to
## from Tx PR RelDeath
## Tx NA 1 2
## PR NA NA 3
## RelDeath NA NA NA
paths(tmat)
## [,1] [,2] [,3]
## [1,] 1 NA NA
## [2,] 1 2 NA
## [3,] 1 2 3
## [4,] 1 3 NA
ebmt3$prtime <- ebmt3$prtime/365.25
ebmt3$rfstime <- ebmt3$rfstime/365.25
covs <- c("dissub", "age", "drmatch", "tcd", "prtime")
msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
status = c(NA,"prstat", "rfsstat"),
data = ebmt3,
trans = tmat,
keep = covs)
head(msbmt)
## An object of class 'msdata'
##
## Data:
## id from to trans Tstart Tstop time status dissub age
## 1 1 1 2 1 0.00000000 0.06297057 0.06297057 1 CML >40
## 2 1 1 3 2 0.00000000 0.06297057 0.06297057 0 CML >40
## 3 1 2 3 3 0.06297057 2.03696099 1.97399042 0 CML >40
## 4 2 1 2 1 0.00000000 0.09582478 0.09582478 1 CML >40
## 5 2 1 3 2 0.00000000 0.09582478 0.09582478 0 CML >40
## 6 2 2 3 3 0.09582478 0.98562628 0.88980151 1 CML >40
## drmatch tcd prtime
## 1 Gender mismatch No TCD 0.06297057
## 2 Gender mismatch No TCD 0.06297057
## 3 Gender mismatch No TCD 0.06297057
## 4 No gender mismatch No TCD 0.09582478
## 5 No gender mismatch No TCD 0.09582478
## 6 No gender mismatch No TCD 0.09582478
In the above call of msprep, the time and status arguments specify the column names in the data ebmt3 corresponding to the three states in the multi-state model. Since all the patients start in state 1 at time 0, the time and status arguments corresponding to the first state do not really have a value. In such cases, the corresponding elements of time and status may be given the value NA. An alternative way of specifying time and status (and keep as well) is as matrices of dimension \(n \times S\) with \(S\) the number of states (and \(n \times p\) with \(p\) the number of covariates for keep). The data argument doesn’t need to be specified then.
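The matrix form of the arguments can be sketched on a small hypothetical illness-death data set. The four patients, their times, and the object names below are made up for illustration; as in ebmt3, a patient without platelet recovery carries the relapse/death or censoring time in the recovery-time column with status 0.

```r
library(mstate)

tmat.toy <- trans.illdeath(names = c("Tx", "PR", "RelDeath"))

# Hypothetical data for four patients
pr.time  <- c(2, 4, 3, 7)   # time of platelet recovery (or last follow-up)
pr.stat  <- c(1, 0, 1, 0)   # 1 = platelet recovery observed
rfs.time <- c(5, 4, 6, 7)   # time of relapse/death (or last follow-up)
rfs.stat <- c(1, 1, 0, 0)   # 1 = relapse or death observed
age      <- factor(c("<=20", "20-40", ">40", "20-40"))

# time and status are n x S matrices (S = 3 states); keep is n x p.
# No data argument is needed in this form.
toy.ms <- msprep(time   = cbind(NA, pr.time, rfs.time),
                 status = cbind(NA, pr.stat, rfs.stat),
                 trans  = tmat.toy,
                 keep   = data.frame(age))
toy.ms
```

Patients who reach the intermediate state contribute three rows (transitions 1, 2 and 3), while the others contribute only the two rows for transitions out of the initial state.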
The number of events in the data can be summarized with the function events.
events(msbmt)
## $Frequencies
## to
## from Tx PR RelDeath no event total entering
## Tx 0 1169 458 577 2204
## PR 0 0 383 786 1169
## RelDeath 0 0 0 841 841
##
## $Proportions
## to
## from Tx PR RelDeath no event
## Tx 0.0000000 0.5303993 0.2078040 0.2617967
## PR 0.0000000 0.0000000 0.3276305 0.6723695
## RelDeath 0.0000000 0.0000000 0.0000000 1.0000000
expcovs <- expand.covs(msbmt, covs[2:3], append = FALSE)
head(expcovs)
## age20.40.1 age20.40.2 age20.40.3 age.40.1 age.40.2 age.40.3
## 1 0 0 0 1 0 0
## 2 0 0 0 0 1 0
## 3 0 0 0 0 0 1
## 4 0 0 0 1 0 0
## 5 0 0 0 0 1 0
## 6 0 0 0 0 0 1
## drmatchGender.mismatch.1 drmatchGender.mismatch.2 drmatchGender.mismatch.3
## 1 1 0 0
## 2 0 1 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
msbmt <- expand.covs(msbmt, covs, append = TRUE, longnames = FALSE)
head(msbmt)
## An object of class 'msdata'
##
## Data:
## id from to trans Tstart Tstop time status dissub age
## 1 1 1 2 1 0.00000000 0.06297057 0.06297057 1 CML >40
## 2 1 1 3 2 0.00000000 0.06297057 0.06297057 0 CML >40
## 3 1 2 3 3 0.06297057 2.03696099 1.97399042 0 CML >40
## 4 2 1 2 1 0.00000000 0.09582478 0.09582478 1 CML >40
## 5 2 1 3 2 0.00000000 0.09582478 0.09582478 0 CML >40
## 6 2 2 3 3 0.09582478 0.98562628 0.88980151 1 CML >40
## drmatch tcd prtime dissub1.1 dissub1.2 dissub1.3 dissub2.1
## 1 Gender mismatch No TCD 0.06297057 0 0 0 1
## 2 Gender mismatch No TCD 0.06297057 0 0 0 0
## 3 Gender mismatch No TCD 0.06297057 0 0 0 0
## 4 No gender mismatch No TCD 0.09582478 0 0 0 1
## 5 No gender mismatch No TCD 0.09582478 0 0 0 0
## 6 No gender mismatch No TCD 0.09582478 0 0 0 0
## dissub2.2 dissub2.3 age1.1 age1.2 age1.3 age2.1 age2.2 age2.3 drmatch.1
## 1 0 0 0 0 0 1 0 0 1
## 2 1 0 0 0 0 0 1 0 0
## 3 0 1 0 0 0 0 0 1 0
## 4 0 0 0 0 0 1 0 0 0
## 5 1 0 0 0 0 0 1 0 0
## 6 0 1 0 0 0 0 0 1 0
## drmatch.2 drmatch.3 tcd.1 tcd.2 tcd.3 prtime.1 prtime.2 prtime.3
## 1 0 0 0 0 0 0.06297057 0.00000000 0.00000000
## 2 1 0 0 0 0 0.00000000 0.06297057 0.00000000
## 3 0 1 0 0 0 0.00000000 0.00000000 0.06297057
## 4 0 0 0 0 0 0.09582478 0.00000000 0.00000000
## 5 0 0 0 0 0 0.00000000 0.09582478 0.00000000
## 6 0 0 0 0 0 0.00000000 0.00000000 0.09582478
The names indeed are quite a bit shorter. The downside, however, is that we need to remember for ourselves which category the number 1 in age1.2 corresponds to (age 20-40, with \(\le\) 20 as the reference category).
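One way to keep track of the short names is to build a small lookup from the factor levels. The sketch below assumes, as the tables above suggest, that the levels of age are ordered “<=20”, “20-40”, “>40” with the first level as reference; the object `key` is an illustrative helper, not part of mstate.

```r
# Factor with the same level order as ebmt3$age (first level is the reference)
age <- factor(c("<=20", "20-40", ">40"), levels = c("<=20", "20-40", ">40"))

# Non-reference levels are numbered 1, 2, ... in the short names
non.ref <- levels(age)[-1]
key <- setNames(non.ref, paste0("age", seq_along(non.ref)))
key
```

In the expanded names, the suffix after the dot is the transition number, so age1.2 is the “20-40” indicator for transition 2.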
After having prepared the data in long format, estimation of covariate effects using Cox regression is straightforward using the coxph function of the survival package. This is not at all a feature of the mstate package, other than that msprep has facilitated preparation of the data. Let us consider the Markov model, where we assume different effects of the covariates for different transitions; hence we use the transition-specific covariates obtained by expand.covs. The delayed entry aspect of this model for transition 3 (see discussion in the tutorial) is achieved by specifying Surv(Tstart,Tstop,status), where (this is reflected in the long format data) Tstart is the time of entry in the state, and Tstop the event or censoring time, depending on the value of status. We consider first the model without any proportionality assumption on the baseline hazards; this is achieved by adding strata(trans) to the formula, which estimates separate baseline hazards for different values of trans (the transitions). The results appear in the left column of Table III of the tutorial.
In the disease/recovery process, often more than one type of event plays a role. Usually, one type of event can be singled out as the event of interest. The other event types may prevent the event of interest from occurring. Leukaemia relapse or AIDS may be unobservable because the person died before the diagnosis of these events. Caution is needed in estimating the probability of the event of interest occurring in the presence of these so-called competing risks. Treating the events of the competing causes as censored observations will lead to a bias in the Kaplan-Meier estimate if one of the fundamental assumptions underlying the Kaplan-Meier estimator is violated: the assumption of independence of the time to event and the censoring distributions. The Cox proportional hazards model can still be used, but the interpretation of the results is different. This will be outlined in some detail in Section 3.
In other situations, another event may substantially change the risk of the event of interest to occur. If one is only interested in the event of interest as a first event, the other event can still be seen as competing. Often, one is also interested in what happens after the first non-fatal event. Then intermediate event types provide more detailed information on the disease/recovery process and allow for more precision in predicting the prognosis of patients. For a leukaemia patient, if the event of interest is death, then relapse becomes an intermediate event worth modelling and not preventing death. Such non-fatal events during the disease course can be seen as transitions from one state to another. The time origin is characterized by a transition into an initial, transient, state, such as the start of treatment; the endpoint is an ‘absorbing’ final transition. Instead of survival data or time-to-event data, data on the history of events is available. Multi-state models provide a framework that allows for the analysis of such event history data. They are an extension of competing risk models, since they extend the analysis to what happens after the first event. Multi-state models are the subject of Section 4.
Several of the ideas presented in the sections on competing risks and multi-state models can also be found in Reference [1]. For more information on competing risks and multi-state models we refer to the relevant chapters in the textbooks [2-7]. A recent issue of Statistical Methods in Medical Research, entirely devoted to multi-state models, is also of interest, see e.g. References [1, 8, 9]
This tutorial reviews statistical methods for the analysis of competing risks and multi-state models. Fortunately, the theory that has been developed over the past two decades for the analysis of right censored survival data can be applied to competing risks and multi-state models as well, and often most of the analyses can be performed within the standard statistical packages, but may require some extra effort with respect to data preparation and programming. Section 2 introduces background and notation needed for the sequel of the paper and discusses the implications of the (lack of) independence between the censoring and time-to-event distributions. Sections 3 and 4 discuss competing risks and multi-state models respectively. Each of these sections is concluded with a subsection on available software. We illustrate estimation and modelling aspects of competing risks and multi-state models using the statistical package R [10]. The full code for the analyses performed in this tutorial as well as the data used are available at http://www.msbi.nl/multistate.
A multistate model for breast cancer
The illness-death model
Typically, a multi-state model contains one initial state, which we will assign the number 1. In the above examples, this state is entered at the moment of surgery for cancer, bone marrow transplantation and HIV infection respectively. Some states represent an endpoint; when a patient enters such a state, he or she will remain there or one is not interested in what happens after this state has been reached. We call these states final or absorbing states (the latter terminology comes from the theory of Markov chains and processes). The absorbing states in our examples are death (in the cancer example), relapse and death (BMT), AIDS (HIV/AIDS). States that are neither initial nor absorbing states are called intermediate or transient states (again borrowed from Markov chain theory); strictly speaking, the initial state is also transient.
Transitions are represented by arrows going from one state to another. When we assign numbers to all states, we represent a transition from state \(i\) to \(j\) by \(i \to j\). If \(T\) denotes the time of reaching state \(j\) from state \(i\), we denote the hazard rate (transition intensity) of the \(i \to j\) transition by \[h_{ij}(t)=\lim_{\Delta t \to 0} \frac{P(t \le T < t+\Delta t \mid T\ge t)}{\Delta t}\]
We define the cumulative hazard for transition \(i \to j\) by \[H_{ij}(t)=\int_{0}^{t} h_{ij}(s)ds\]
In the above definition, the question remains: what is \(t\), or more precisely, what is the time scale to which t refers? Two approaches are in frequent use, which we shall denote here by the ‘clock forward’ or ‘clock reset’ approach.
Clock forward: Time \(t\) refers to the time since the patient entered the initial state. The clock keeps moving forward for the patient, also when intermediate events occur.
Clock reset: Time \(t\) in \(h_{ij}(t)\) refers to the time since entry in state \(i\), also called backward recurrence time. The clock is reset to 0 each time the patient enters a new state.
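In the long-format data the two time scales correspond to different columns. The sketch below rebuilds the three rows for patient 1 from the head(msbmt) output shown earlier and forms the response for each approach; the data frame name `p1` is illustrative.

```r
library(survival)

# Rows for patient 1, copied from head(msbmt) above (times in years)
p1 <- data.frame(
  trans  = 1:3,
  Tstart = c(0, 0, 0.06297057),
  Tstop  = c(0.06297057, 0.06297057, 2.03696099),
  status = c(1, 0, 0)
)

# Clock reset: time since entry into the current state
p1$time <- p1$Tstop - p1$Tstart

# Clock forward: counting-process response on the time-since-transplant scale
with(p1, Surv(Tstart, Tstop, status))

# Clock reset: ordinary right-censored response on the sojourn-time scale
with(p1, Surv(time, status))
```

For transitions out of the initial state the two scales coincide because Tstart is 0; they differ only for transition 3, where the clock-reset time is the time elapsed since platelet recovery.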
The difference between the two approaches is illustrated in the Figure.
Illustration of the ‘clock forward’ and ‘clock reset’ approach.
The upper half shows the dates of surgery and subsequent events for a cancer patient. At 13 May 2005, the patient is still alive. The lower picture shows the patient time-scale, first in the ‘clock forward’ approach, where time is measured from date of surgery, then in the ‘clock reset’ approach, where time intervals between state visits are recorded. In both instances the patient is censored for the last event, due to the end of follow-up.
A property that is often assumed in practice is that the multi-state model is a Markov model. Loosely speaking, the Markov property states that the future depends on the history only through the present. For a multi-state model this means that, given the present state and the event history of a patient, the next state to be visited and the time at which this will occur will only depend on the present state. Strictly speaking, only ‘clock forward’ models can be Markov models; for ‘clock reset’ models the Markov property cannot hold since the time scale itself depends on the history through the time since the current state was reached. However, if it is assumed that the sojourn times depend on the history of the process only through the present state and the time since entry of that state, the resulting multi-state model forms a sequence of embedded Markov models, called a Markov renewal model or also a semi-Markov model.
Data structure: long format
We will illustrate estimation of the effect of prognostic factors on the transition rates in multi-state models, using the simplest non-trivial multi-state model, the illness-death model. Some aspects that play a role and that we will try to cover here are:
We will use data from the European Blood and Marrow Transplant registry (EBMT) for illustration in this and the next subsection. The data consists of 2204 patients in this registry, who received bone marrow transplantation between 1995 and 1998, and who had complete information on the prognostic factors considered here. These are as summarized in Table below
The European Society for Blood and Marrow Transplantation (EBMT) illness-death model
The multi-state model that we shall use for illustration here and in the next subsection is the bone marrow transplantation illness-death model. Here, the ‘illness’ state corresponds to platelet recovery and ‘death’ corresponds to relapse or death. The model is illustrated in Figure 13 along with the number of events. We can see that for 1169 of 2204 patients (53 per cent), platelet levels returned to normal levels; 383 of these 1169 (33 per cent) subsequently relapsed or died, the remaining 786 (67 per cent) did not relapse or die after platelet recovery. There were 458 patients (21 per cent) that relapsed or died without platelet recovery prior to relapse or death. Finally, 577 (26 per cent) of all 2204 patients did not experience any event in our data.
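The percentages quoted in this paragraph follow directly from the event counts. A short check, with illustrative variable names:

```r
# Counts quoted in the text
n.total    <- 2204   # all patients
n.pr       <- 1169   # platelet recovery
n.pr.event <- 383    # relapse or death after platelet recovery
n.direct   <- 458    # relapse or death without platelet recovery
n.none     <- 577    # no event observed

# The three first-event categories partition the cohort
stopifnot(n.pr + n.direct + n.none == n.total)

pct <- round(100 * c(pr                = n.pr / n.total,
                     event.after.pr    = n.pr.event / n.pr,
                     no.event.after.pr = (n.pr - n.pr.event) / n.pr,
                     direct            = n.direct / n.total,
                     none              = n.none / n.total))
pct
```

Note that the denominators differ: the percentages for events after platelet recovery are relative to the 1169 patients who recovered, the others to all 2204 patients.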
Using Cox PH notation, the hazard for transition \(i \to j\) given covariates \(x\) is modeled as \[h_{ij}(t|x)=h_{ij, 0}(t)\exp(\beta_{ij}^{\top} x)\]
Parameter estimates in different models; ‘clock forward’ approach
c1 <- coxph(Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 +
age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
age1.3 + age2.3 + drmatch.3 + tcd.3 + strata(trans), data = msbmt,
method = "breslow")
c1
## Call:
## coxph(formula = Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 +
## age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
## age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
## age1.3 + age2.3 + drmatch.3 + tcd.3 + strata(trans), data = msbmt,
## method = "breslow")
##
## coef exp(coef) se(coef) z p
## dissub1.1 -0.04359 0.95734 0.07789 -0.560 0.575698
## dissub2.1 -0.29724 0.74287 0.06800 -4.371 1.23e-05
## age1.1 -0.16461 0.84822 0.07905 -2.082 0.037317
## age2.1 -0.08979 0.91412 0.08647 -1.038 0.299075
## drmatch.1 0.04575 1.04681 0.06660 0.687 0.492127
## tcd.1 0.42907 1.53583 0.08043 5.335 9.57e-08
## dissub1.2 0.25589 1.29161 0.13520 1.893 0.058411
## dissub2.2 0.01675 1.01689 0.10838 0.155 0.877188
## age1.2 0.25516 1.29067 0.15103 1.689 0.091127
## age2.2 0.52649 1.69298 0.15790 3.334 0.000855
## drmatch.2 -0.07525 0.92751 0.11028 -0.682 0.495006
## tcd.2 0.29673 1.34545 0.15007 1.977 0.048006
## dissub1.3 0.13646 1.14621 0.14804 0.922 0.356634
## dissub2.3 0.24692 1.28007 0.11685 2.113 0.034596
## age1.3 0.06156 1.06350 0.15343 0.401 0.688239
## age2.3 0.58075 1.78737 0.16014 3.627 0.000287
## drmatch.3 0.17280 1.18863 0.11452 1.509 0.131315
## tcd.3 0.20088 1.22248 0.12636 1.590 0.111873
##
## Likelihood ratio test=117.7 on 18 df, p=< 2.2e-16
## n= 5577, number of events= 2010
# pr flags rows of transition 3 (PR -> RelDeath); it is 0 for transitions 1 and 2
msbmt$pr <- 0
msbmt$pr[msbmt$trans == 3] <- 1
c2 <- coxph(Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 +
age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + strata(to), data = msbmt,
method = "breslow")
c2
## Call:
## coxph(formula = Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 +
## age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
## age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
## age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + strata(to), data = msbmt,
## method = "breslow")
##
## coef exp(coef) se(coef) z p
## dissub1.1 -0.043592 0.957345 0.077887 -0.560 0.575698
## dissub2.1 -0.297240 0.742866 0.067996 -4.371 1.23e-05
## age1.1 -0.164613 0.848222 0.079054 -2.082 0.037317
## age2.1 -0.089790 0.914123 0.086468 -1.038 0.299075
## drmatch.1 0.045751 1.046814 0.066602 0.687 0.492127
## tcd.1 0.429071 1.535831 0.080432 5.335 9.57e-08
## dissub1.2 0.260968 1.298186 0.135182 1.930 0.053546
## dissub2.2 0.003637 1.003644 0.108368 0.034 0.973226
## age1.2 0.250894 1.285174 0.151057 1.661 0.096727
## age2.2 0.525790 1.691796 0.157895 3.330 0.000868
## drmatch.2 -0.072067 0.930469 0.110260 -0.654 0.513364
## tcd.2 0.318537 1.375114 0.149970 2.124 0.033669
## dissub1.3 0.139811 1.150056 0.147981 0.945 0.344767
## dissub2.3 0.250328 1.284447 0.116788 2.143 0.032078
## age1.3 0.055559 1.057131 0.153372 0.362 0.717166
## age2.3 0.562484 1.755027 0.159970 3.516 0.000438
## drmatch.3 0.169149 1.184297 0.114446 1.478 0.139414
## tcd.3 0.211029 1.234948 0.126198 1.672 0.094484
## pr -0.378633 0.684797 0.211523 -1.790 0.073449
##
## Likelihood ratio test=135.3 on 19 df, p=< 2.2e-16
## n= 5577, number of events= 2010
cox.zph(c2)
## chisq df p
## dissub1.1 2.46e+01 1 6.9e-07
## dissub2.1 9.68e+00 1 0.00187
## age1.1 1.05e-01 1 0.74633
## age2.1 6.48e+00 1 0.01092
## drmatch.1 6.99e+00 1 0.00821
## tcd.1 1.41e+01 1 0.00017
## dissub1.2 5.43e+00 1 0.01975
## dissub2.2 4.43e+00 1 0.03535
## age1.2 4.79e+00 1 0.02863
## age2.2 1.46e+00 1 0.22647
## drmatch.2 1.12e-01 1 0.73759
## tcd.2 1.07e+00 1 0.30179
## dissub1.3 4.93e-05 1 0.99440
## dissub2.3 2.41e+01 1 9.4e-07
## age1.3 2.64e+00 1 0.10394
## age2.3 6.80e+00 1 0.00913
## drmatch.3 4.65e+00 1 0.03109
## tcd.3 1.83e+01 1 1.9e-05
## pr 1.64e+01 1 5.2e-05
## GLOBAL 1.17e+02 19 4.8e-16
The tutorial on which this section is based reports no evidence of non-proportionality of the baseline transition intensities of transitions 2 and 3 (p=0.496 for pr). The cox.zph output shown above does not reproduce this: it gives p=5.2e-05 for pr, so with the present output the proportionality of these two baseline intensities is itself in doubt. There is strong evidence that the proportional hazards assumption for dissub2 (CML vs AML) is violated, at least for the transitions into relapse and death. This makes sense clinically, since CML and AML are two diseases with completely different biological pathways. It would have been much better to study separate multi-state models for the three disease subclassifications. However, since the purpose of this manuscript is to illustrate the use of mstate, we will blatantly ignore the clear evidence of non-proportionality for the disease subclassifications.
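In model c2, transitions 2 and 3 share one baseline hazard through strata(to), and the coefficient of pr scales that baseline for transition 3. As a small sketch of how to read that coefficient, the following base R lines recompute the hazard ratio, Wald statistic, p-value and a 95 per cent Wald interval from the coef and se(coef) values printed for pr in the c2 output. The numbers are copied by hand from that output; nothing is refitted.

```r
# Coefficient and standard error of `pr`, copied from the c2 output above.
coef_pr <- -0.378633
se_pr   <-  0.211523

hr_pr <- exp(coef_pr)                  # ratio of the two baseline hazards
z_pr  <- coef_pr / se_pr               # Wald statistic
p_pr  <- 2 * pnorm(-abs(z_pr))         # two-sided p-value
ci_pr <- exp(coef_pr + c(-1, 1) * qnorm(0.975) * se_pr)  # 95% Wald interval

print(round(c(HR = hr_pr, z = z_pr, p = p_pr,
              lower = ci_pr[1], upper = ci_pr[2]), 4))
```

For the reference covariate values, the baseline hazard of relapse or death after platelet recovery is thus estimated at roughly two thirds of the baseline hazard without platelet recovery, with an interval that includes 1. This reading assumes that the two baselines are in fact proportional.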
Building on the Markov PH model, we can investigate whether the time at which a patient arrived in state 2 (PR) influences the subsequent RFS rate, that is, the transition hazard of PR -> RelDeath. Here the purpose of expanding prtime becomes apparent. Since prtime only makes sense for transition 3 (PR -> RelDeath), we need the transition-specific covariate of prtime for transition 3, which is prtime.3. The corresponding model is termed the “state arrival extended Markov PH” model in the tutorial, and appears on the right of Table III.
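To make the idea of a transition-specific covariate concrete, here is a base R sketch on a toy long-format data frame. The data values and the data frame name toy are hypothetical and are not taken from ebmt3; the sketch mimics what the expanded column prtime.3 contains and is not the expand.covs call itself.

```r
# Toy long-format rows (hypothetical values): patient 1 contributes rows for
# transitions 1, 2 and 3; patient 2 contributes rows for transitions 1 and 2.
toy <- data.frame(
  id     = c(1, 1, 1, 2, 2),
  trans  = c(1, 2, 3, 1, 2),
  prtime = c(0.10, 0.10, 0.10, 1.50, 1.50)
)

# Transition-specific copy: equal to prtime on transition 3 rows, 0 elsewhere,
# so its coefficient can only affect the hazard of transition 3.
toy$prtime.3 <- ifelse(toy$trans == 3, toy$prtime, 0)
print(toy)
```

Because prtime.3 is zero on all rows for transitions 1 and 2, adding it to the model leaves the linear predictors of those transitions unchanged.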
c3 <- coxph(Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 +
age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + prtime.3 + strata(to),
data = msbmt, method = "breslow")
c3
## Call:
## coxph(formula = Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 +
## age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
## age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
## age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + prtime.3 + strata(to),
## data = msbmt, method = "breslow")
##
## coef exp(coef) se(coef) z p
## dissub1.1 -0.043592 0.957345 0.077887 -0.560 0.575698
## dissub2.1 -0.297240 0.742866 0.067996 -4.371 1.23e-05
## age1.1 -0.164613 0.848222 0.079054 -2.082 0.037317
## age2.1 -0.089790 0.914123 0.086468 -1.038 0.299075
## drmatch.1 0.045751 1.046814 0.066602 0.687 0.492127
## tcd.1 0.429071 1.535831 0.080432 5.335 9.57e-08
## dissub1.2 0.260899 1.298097 0.135182 1.930 0.053609
## dissub2.2 0.003761 1.003768 0.108368 0.035 0.972315
## age1.2 0.250952 1.285248 0.151056 1.661 0.096649
## age2.2 0.525772 1.691764 0.157894 3.330 0.000869
## drmatch.2 -0.072088 0.930449 0.110260 -0.654 0.513238
## tcd.2 0.318238 1.374703 0.149971 2.122 0.033838
## dissub1.3 0.132021 1.141132 0.148849 0.887 0.375109
## dissub2.3 0.251811 1.286353 0.116823 2.155 0.031123
## age1.3 0.058227 1.059956 0.153426 0.380 0.704306
## age2.3 0.565752 1.760771 0.160011 3.536 0.000407
## drmatch.3 0.166817 1.181538 0.114556 1.456 0.145334
## tcd.3 0.207404 1.230480 0.126431 1.640 0.100911
## pr -0.406872 0.665729 0.219075 -1.857 0.063279
## prtime.3 0.295226 1.343430 0.594952 0.496 0.619741
##
## Likelihood ratio test=135.5 on 20 df, p=< 2.2e-16
## n= 5577, number of events= 2010
Parameter estimates in different models; ‘clock reset’ approach
c4 <- coxph(Surv(time, status) ~ dissub1.1 + dissub2.1 + age1.1 +
age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + age1.2 +
age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + age1.3 +
age2.3 + drmatch.3 + tcd.3 + strata(trans), data = msbmt,
method = "breslow")
c4
## Call:
## coxph(formula = Surv(time, status) ~ dissub1.1 + dissub2.1 +
## age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
## age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
## age1.3 + age2.3 + drmatch.3 + tcd.3 + strata(trans), data = msbmt,
## method = "breslow")
##
## coef exp(coef) se(coef) z p
## dissub1.1 -0.04359 0.95734 0.07789 -0.560 0.575698
## dissub2.1 -0.29724 0.74287 0.06800 -4.371 1.23e-05
## age1.1 -0.16461 0.84822 0.07905 -2.082 0.037317
## age2.1 -0.08979 0.91412 0.08647 -1.038 0.299075
## drmatch.1 0.04575 1.04681 0.06660 0.687 0.492127
## tcd.1 0.42907 1.53583 0.08043 5.335 9.57e-08
## dissub1.2 0.25589 1.29161 0.13520 1.893 0.058411
## dissub2.2 0.01675 1.01689 0.10838 0.155 0.877188
## age1.2 0.25516 1.29067 0.15103 1.689 0.091127
## age2.2 0.52649 1.69298 0.15790 3.334 0.000855
## drmatch.2 -0.07525 0.92751 0.11028 -0.682 0.495006
## tcd.2 0.29673 1.34545 0.15007 1.977 0.048006
## dissub1.3 0.12026 1.12779 0.14793 0.813 0.416269
## dissub2.3 0.25245 1.28717 0.11685 2.160 0.030737
## age1.3 0.06541 1.06760 0.15338 0.426 0.669773
## age2.3 0.58154 1.78880 0.16002 3.634 0.000279
## drmatch.3 0.16974 1.18499 0.11453 1.482 0.138341
## tcd.3 0.19676 1.21745 0.12633 1.557 0.119365
##
## Likelihood ratio test=118.1 on 18 df, p=< 2.2e-16
## n= 5577, number of events= 2010
In the clock-forward state arrival extended Markov PH model (c3), the influence of the time at which platelet recovery occurred seems small and is not significant (p=0.62, the prtime.3 row of the c3 output).
The clock-reset models, of which c4 above is the first, are obtained in much the same way as the clock-forward models. The only difference is that Surv(Tstart,Tstop,status) is replaced by Surv(time,status). This reflects the fact that in our long format data each row corresponds to a transition, and that for each transition the clock now starts at 0 rather than at Tstart, the time since start of study at which the state was entered. The clock-reset counterparts of c2 and c3 follow as c5 and c6.
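The difference between the two time scales is easiest to see on a few rows. The following base R sketch uses one hypothetical patient (the times are made up for illustration) and derives the clock-reset variable time from Tstart and Tstop.

```r
# Toy long-format rows for one patient (hypothetical times, in years):
# platelet recovery at 0.2, relapse or death at 1.0.
toy <- data.frame(
  trans  = c(1, 2, 3),
  Tstart = c(0.0, 0.0, 0.2),
  Tstop  = c(0.2, 0.2, 1.0),
  status = c(1, 0, 1)
)

# Clock-reset time: time spent in the current state before the row ends.
toy$time <- toy$Tstop - toy$Tstart
print(toy)

# Clock forward uses the interval (Tstart, Tstop] for each row;
# clock reset uses (0, time], restarting the clock on entry into the state.
```

For transitions out of the initial state Tstart is 0, so the two scales coincide there. This is consistent with the identical estimates for the .1 covariates across the outputs above.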
c5 <- coxph(Surv(time, status) ~ dissub1.1 + dissub2.1 + age1.1 +
age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + age1.2 +
age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + age1.3 +
age2.3 + drmatch.3 + tcd.3 + pr + strata(to), data = msbmt,
method = "breslow")
c5
## Call:
## coxph(formula = Surv(time, status) ~ dissub1.1 + dissub2.1 +
## age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
## age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
## age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + strata(to), data = msbmt,
## method = "breslow")
##
## coef exp(coef) se(coef) z p
## dissub1.1 -0.043592 0.957345 0.077887 -0.560 0.575698
## dissub2.1 -0.297240 0.742866 0.067996 -4.371 1.23e-05
## age1.1 -0.164613 0.848222 0.079054 -2.082 0.037317
## age2.1 -0.089790 0.914123 0.086468 -1.038 0.299075
## drmatch.1 0.045751 1.046814 0.066602 0.687 0.492127
## tcd.1 0.429071 1.535831 0.080432 5.335 9.57e-08
## dissub1.2 0.258695 1.295239 0.135188 1.914 0.055672
## dissub2.2 0.008247 1.008281 0.108339 0.076 0.939324
## age1.2 0.252081 1.286701 0.151041 1.669 0.095126
## age2.2 0.527568 1.694805 0.157887 3.341 0.000833
## drmatch.2 -0.072862 0.929729 0.110261 -0.661 0.508733
## tcd.2 0.310010 1.363439 0.149921 2.068 0.038657
## dissub1.3 0.117420 1.124592 0.147863 0.794 0.427128
## dissub2.3 0.253025 1.287915 0.116781 2.167 0.030260
## age1.3 0.063925 1.066013 0.153279 0.417 0.676641
## age2.3 0.574319 1.775920 0.159857 3.593 0.000327
## drmatch.3 0.164298 1.178565 0.114469 1.435 0.151199
## tcd.3 0.200781 1.222357 0.126189 1.591 0.111583
## pr -0.415591 0.659950 0.210949 -1.970 0.048827
##
## Likelihood ratio test=142.1 on 19 df, p=< 2.2e-16
## n= 5577, number of events= 2010
c6 <- coxph(Surv(time, status) ~ dissub1.1 + dissub2.1 + age1.1 +
age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + age1.2 +
age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + age1.3 +
age2.3 + drmatch.3 + tcd.3 + pr + prtime.3 + strata(to),
data = msbmt, method = "breslow")
c6
## Call:
## coxph(formula = Surv(time, status) ~ dissub1.1 + dissub2.1 +
## age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
## age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
## age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + prtime.3 + strata(to),
## data = msbmt, method = "breslow")
##
## coef exp(coef) se(coef) z p
## dissub1.1 -0.043592 0.957345 0.077887 -0.560 0.575698
## dissub2.1 -0.297240 0.742866 0.067996 -4.371 1.23e-05
## age1.1 -0.164613 0.848222 0.079054 -2.082 0.037317
## age2.1 -0.089790 0.914123 0.086468 -1.038 0.299075
## drmatch.1 0.045751 1.046814 0.066602 0.687 0.492127
## tcd.1 0.429071 1.535831 0.080432 5.335 9.57e-08
## dissub1.2 0.258710 1.295258 0.135188 1.914 0.055658
## dissub2.2 0.008234 1.008268 0.108339 0.076 0.939419
## age1.2 0.252110 1.286737 0.151041 1.669 0.095089
## age2.2 0.527581 1.694827 0.157887 3.342 0.000833
## drmatch.2 -0.072845 0.929744 0.110261 -0.661 0.508829
## tcd.2 0.310030 1.363465 0.149921 2.068 0.038644
## dissub1.3 0.137509 1.147412 0.148842 0.924 0.355560
## dissub2.3 0.249439 1.283306 0.116828 2.135 0.032754
## age1.3 0.058214 1.059942 0.153456 0.379 0.704425
## age2.3 0.567443 1.763752 0.160190 3.542 0.000397
## drmatch.3 0.170000 1.185305 0.114543 1.484 0.137766
## tcd.3 0.209244 1.232746 0.126368 1.656 0.097755
## pr -0.350169 0.704569 0.218695 -1.601 0.109339
## prtime.3 -0.658183 0.517791 0.584642 -1.126 0.260255
##
## Likelihood ratio test=143.5 on 20 df, p=< 2.2e-16
## n= 5577, number of events= 2010
In the preceding subsection, we modelled the effects of covariates on the transition hazards. In Section 3 on competing risks we have already seen that effects on the cumulative incidence function may differ from what the regression coefficients suggest. In a multi-state setting, this becomes even more of an issue, since intermediate events also contribute to effects on the cumulative scale. This subsection is devoted to estimation of cumulative effects, or prediction, to answer clinically important questions such as the following in our example:
* Given a bone marrow transplantation patient whose platelets have recovered after 60 days and who has had no further events at one year post-transplant, what is the probability of surviving relapse-free for 2 more years? How does this probability compare to that of a patient whose platelets have not yet recovered?
newd <- data.frame(dissub = rep(0, 3), age = rep(0, 3),
drmatch = rep(0,3), tcd = rep(0, 3), trans = 1:3)
newd$dissub <- factor(newd$dissub, levels = 0:2, labels = levels(ebmt3$dissub))
newd$age <- factor(newd$age, levels = 0:2, labels = levels(ebmt3$age))
newd$drmatch <- factor(newd$drmatch, levels = 0:1, labels = levels(ebmt3$drmatch))
newd$tcd <- factor(newd$tcd, levels = 0:1, labels = levels(ebmt3$tcd))
attr(newd, "trans") <- tmat
class(newd) <- c("msdata", "data.frame")
newd <- expand.covs(newd, covs[1:4], longnames = FALSE)
newd$strata = 1:3
newd
## An object of class 'msdata'
##
## Data:
## dissub age drmatch tcd trans dissub1.1 dissub1.2 dissub1.3
## 1 AML <=20 No gender mismatch No TCD 1 0 0 0
## 2 AML <=20 No gender mismatch No TCD 2 0 0 0
## 3 AML <=20 No gender mismatch No TCD 3 0 0 0
## dissub2.1 dissub2.2 dissub2.3 age1.1 age1.2 age1.3 age2.1 age2.2 age2.3
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## drmatch.1 drmatch.2 drmatch.3 tcd.1 tcd.2 tcd.3 strata
## 1 0 0 0 0 0 0 1
## 2 0 0 0 0 0 0 2
## 3 0 0 0 0 0 0 3
msf1 <- msfit(c1, newdata = newd, trans = tmat)
summary(msf1)
##
## Transition 1 (head and tail):
## time Haz seHaz lower upper
## 1 0.002737851 0.0005277714 0.0005290102 7.400248e-05 0.003763964
## 2 0.008213552 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 3 0.010951403 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 4 0.016427105 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 5 0.019164956 0.0015857558 0.0009219748 5.073865e-04 0.004956027
## 6 0.021902806 0.0015857558 0.0009219748 5.073865e-04 0.004956027
##
## ...
## time Haz seHaz lower upper
## 500 6.253251 0.9513165 0.07182285 0.8204662 1.103035
## 501 6.357290 0.9513165 0.07182285 0.8204662 1.103035
## 502 6.362765 0.9513165 0.07182285 0.8204662 1.103035
## 503 6.798084 0.9513165 0.07182285 0.8204662 1.103035
## 504 7.110198 0.9513165 0.07182285 0.8204662 1.103035
## 505 7.731691 0.9513165 0.07182285 0.8204662 1.103035
##
## Transition 2 (head and tail):
## time Haz seHaz lower upper
## 506 0.002737851 0.0003046955 0.0003077143 4.209506e-05 0.002205469
## 507 0.008213552 0.0003046955 0.0003077143 4.209506e-05 0.002205469
## 508 0.010951403 0.0006097444 0.0004396591 1.483833e-04 0.002505594
## 509 0.016427105 0.0012203981 0.0006340496 4.408243e-04 0.003378606
## 510 0.019164956 0.0018316171 0.0007912068 7.854882e-04 0.004271001
## 511 0.021902806 0.0024438486 0.0009303805 1.158829e-03 0.005153820
##
## ...
## time Haz seHaz lower upper
## 1005 6.253251 0.5020560 0.08219369 0.3642490 0.6919997
## 1006 6.357290 0.5020560 0.08219369 0.3642490 0.6919997
## 1007 6.362765 0.5248419 0.08821373 0.3775385 0.7296182
## 1008 6.798084 0.5248419 0.08821373 0.3775385 0.7296182
## 1009 7.110198 0.5248419 0.08821373 0.3775385 0.7296182
## 1010 7.731691 0.5248419 0.08821373 0.3775385 0.7296182
##
## Transition 3 (head and tail):
## time Haz seHaz lower upper
## 1011 0.002737851 0 0 0 0
## 1012 0.008213552 0 0 0 0
## 1013 0.010951403 0 0 0 0
## 1014 0.016427105 0 0 0 0
## 1015 0.019164956 0 0 0 0
## 1016 0.021902806 0 0 0 0
##
## ...
## time Haz seHaz lower upper
## 1510 6.253251 0.3291154 0.05058502 0.2435110 0.4448133
## 1511 6.357290 0.3427115 0.05413323 0.2514645 0.4670688
## 1512 6.362765 0.3427115 0.05413323 0.2514645 0.4670688
## 1513 6.798084 0.3693677 0.06340696 0.2638388 0.5171055
## 1514 7.110198 0.4647197 0.12159613 0.2782724 0.7760899
## 1515 7.731691 0.4647197 0.12159613 0.2782724 0.7760899
vH1 <- msf1$varHaz
head(vH1[vH1$trans1 == 1 & vH1$trans2 == 1, ])
## time varHaz trans1 trans2
## 1 0.002737851 2.798518e-07 1 1
## 2 0.008213552 5.629062e-07 1 1
## 3 0.010951403 5.629062e-07 1 1
## 4 0.016427105 5.629062e-07 1 1
## 5 0.019164956 8.500376e-07 1 1
## 6 0.021902806 8.500376e-07 1 1
tail(vH1[vH1$trans1 == 1 & vH1$trans2 == 1, ])
## time varHaz trans1 trans2
## 500 6.253251 0.005158522 1 1
## 501 6.357290 0.005158522 1 1
## 502 6.362765 0.005158522 1 1
## 503 6.798084 0.005158522 1 1
## 504 7.110198 0.005158522 1 1
## 505 7.731691 0.005158522 1 1
tail(vH1[vH1$trans1 == 1 & vH1$trans2 == 2, ])
## time varHaz trans1 trans2
## 1005 6.253251 0 1 2
## 1006 6.357290 0 1 2
## 1007 6.362765 0 1 2
## 1008 6.798084 0 1 2
## 1009 7.110198 0 1 2
## 1010 7.731691 0 1 2
tail(vH1[vH1$trans1 == 1 & vH1$trans2 == 3, ])
## time varHaz trans1 trans2
## 1510 6.253251 0 1 3
## 1511 6.357290 0 1 3
## 1512 6.362765 0 1 3
## 1513 6.798084 0 1 3
## 1514 7.110198 0 1 3
## 1515 7.731691 0 1 3
tail(vH1[vH1$trans1 == 2 & vH1$trans2 == 3, ])
## time varHaz trans1 trans2
## 2520 6.253251 0 2 3
## 2521 6.357290 0 2 3
## 2522 6.362765 0 2 3
## 2523 6.798084 0 2 3
## 2524 7.110198 0 2 3
## 2525 7.731691 0 2 3
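As a reading aid for the msfit output, the following base R lines check how the printed columns relate to each other, using the last row of transition 1. The numbers are copied by hand from the output above. The log-scale interval is an assumption tested against the printed limits, not a statement about how msfit computes them internally.

```r
# Last row of transition 1 in summary(msf1), and the matching varHaz entry.
haz     <- 0.9513165
se_haz  <- 0.07182285
var_haz <- 0.005158522
lower   <- 0.8204662
upper   <- 1.103035

# seHaz agrees with the square root of the trans1 == trans2 == 1 variance.
print(sqrt(var_haz))

# A 95% Wald interval built on the log scale reproduces the printed limits.
z  <- qnorm(0.975)
ci <- haz * exp(c(-1, 1) * z * se_haz / haz)
print(ci)
```

The off-diagonal entries of varHaz (trans1 different from trans2) are the covariances between the cumulative hazards of two transitions. They are all zero for msf1 in the output above.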
# With strata(to), transitions 2 and 3 share the stratum of state 3; pr = 1 marks transition 3
newd$strata <- c(1, 2, 2)
newd$pr <- c(0, 0, 1)
msf2 <- msfit(c2, newdata = newd, trans = tmat)
summary(msf2)
##
## Transition 1 (head and tail):
## time Haz seHaz lower upper
## 1 0.002737851 0.0005277714 0.0005290102 7.400248e-05 0.003763964
## 2 0.008213552 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 3 0.010951403 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 4 0.016427105 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 5 0.019164956 0.0015857558 0.0009219748 5.073865e-04 0.004956027
## 6 0.021902806 0.0015857558 0.0009219748 5.073865e-04 0.004956027
##
## ...
## time Haz seHaz lower upper
## 500 6.253251 0.9513165 0.07182285 0.8204662 1.103035
## 501 6.357290 0.9513165 0.07182285 0.8204662 1.103035
## 502 6.362765 0.9513165 0.07182285 0.8204662 1.103035
## 503 6.798084 0.9513165 0.07182285 0.8204662 1.103035
## 504 7.110198 0.9513165 0.07182285 0.8204662 1.103035
## 505 7.731691 0.9513165 0.07182285 0.8204662 1.103035
##
## Transition 2 (head and tail):
## time Haz seHaz lower upper
## 506 0.002737851 0.0003053084 0.0003083331 4.217979e-05 0.002209902
## 507 0.008213552 0.0003053084 0.0003083331 4.217979e-05 0.002209902
## 508 0.010951403 0.0006107971 0.0004404176 1.486397e-04 0.002509915
## 509 0.016427105 0.0012223306 0.0006350522 4.415233e-04 0.003383948
## 510 0.019164956 0.0018344413 0.0007924245 7.867013e-04 0.004277576
## 511 0.021902806 0.0024473467 0.0009317088 1.160491e-03 0.005161183
##
## ...
## time Haz seHaz lower upper
## 1005 6.253251 0.5040408 0.07806657 0.3720749 0.6828118
## 1006 6.357290 0.5146993 0.08030652 0.3790914 0.6988167
## 1007 6.362765 0.5255361 0.08256535 0.3862540 0.7150431
## 1008 6.798084 0.5476683 0.08851937 0.3989682 0.7517906
## 1009 7.110198 0.6357669 0.13427464 0.4202651 0.9617730
## 1010 7.731691 0.6357669 0.13427464 0.4202651 0.9617730
##
## Transition 3 (head and tail):
## time Haz seHaz lower upper
## 1011 0.002737851 0.0002090742 0.0002116301 2.875366e-05 0.001520225
## 1012 0.008213552 0.0002090742 0.0002116301 2.875366e-05 0.001520225
## 1013 0.010951403 0.0004182719 0.0003029499 1.011445e-04 0.001729717
## 1014 0.016427105 0.0008370481 0.0004386272 2.997137e-04 0.002337729
## 1015 0.019164956 0.0012562195 0.0005493845 5.330994e-04 0.002960212
## 1016 0.021902806 0.0016759351 0.0006481990 7.853066e-04 0.003576640
##
## ...
## time Haz seHaz lower upper
## 1510 6.253251 0.3451655 0.05260815 0.2560308 0.4653317
## 1511 6.357290 0.3524644 0.05411648 0.2608699 0.4762189
## 1512 6.362765 0.3598855 0.05563688 0.2658103 0.4872555
## 1513 6.798084 0.3750415 0.05964162 0.2746095 0.5122042
## 1514 7.110198 0.4353712 0.09072076 0.2893943 0.6549820
## 1515 7.731691 0.4353712 0.09072076 0.2893943 0.6549820
vH2 <- msf2$varHaz
tail(vH2[vH2$trans1 == 1 & vH2$trans2 == 2, ])
## time varHaz trans1 trans2
## 1005 6.253251 0 1 2
## 1006 6.357290 0 1 2
## 1007 6.362765 0 1 2
## 1008 6.798084 0 1 2
## 1009 7.110198 0 1 2
## 1010 7.731691 0 1 2
tail(vH2[vH2$trans1 == 1 & vH2$trans2 == 3, ])
## time varHaz trans1 trans2
## 1510 6.253251 0 1 3
## 1511 6.357290 0 1 3
## 1512 6.362765 0 1 3
## 1513 6.798084 0 1 3
## 1514 7.110198 0 1 3
## 1515 7.731691 0 1 3
tail(vH2[vH2$trans1 == 2 & vH2$trans2 == 3, ])
## time varHaz trans1 trans2
## 2520 6.253251 0.0004142378 2 3
## 2521 6.357290 0.0005227029 2 3
## 2522 6.362765 0.0006348311 2 3
## 2523 6.798084 0.0011112104 2 3
## 2524 7.110198 0.0088628795 2 3
## 2525 7.731691 0.0088628795 2 3
par(mfrow = c(1, 2))
plot(msf1, cols = rep(1, 3), lwd = 2, lty = 1:3, xlab = "Years since transplant",
ylab = "Stratified baseline hazards", legend.pos = c(2, 0.9))
plot(msf2, cols = rep(1, 3), lwd = 2, lty = 1:3, xlab = "Years since transplant",
ylab = "Proportional baseline hazards", legend.pos = c(2, 0.9))
par(mfrow = c(1, 1))
Figure 1: Baseline cumulative hazard curves for the EBMT illness-death model. On the left the Markov stratified hazards model, on the right the Markov PH model.
Define the multi-state model as \(X(t)\), a random process taking values in \(\{1, \dots, S\}\) (\(S\) being the number of states). We are interested in estimating so-called transition probabilities \(P_{gh}(s,t)=P(X(t)=h \mid X(s)=g)\), possibly depending on covariates. For instance, \(P_{13}(0, t)\) is the probability of having relapsed or died (state 3) by time \(t\), given that the individual was alive without relapse or platelet recovery (state 1) at time \(s = 0\). By fixing \(s\) and varying \(t\), we can predict the future behavior of the multi-state model given the present at time \(s\). For Markov models, these probabilities depend only on the state at time \(s\), not on what happened before. For these Markov models there is a powerful relation between the transition probabilities and the transition intensities, given by
\[P(s,t)=\prod_{u \in (s,t]} (I+d\Lambda(u)) \tag{1}\]
Here \(P(s,t)\) is an \(S \times S\) matrix whose \((g, h)\) element is the \(P_{gh}(s,t)\) in which we are interested, and \(\Lambda(t)\) is an \(S \times S\) matrix whose off-diagonal \((g, h)\) elements are the cumulative transition intensities \(\Lambda_{gh}(t)\) of transition \(g \to h\). If such a direct transition is not possible, then \(\Lambda_{gh}(t)=0\). The diagonal elements of \(\Lambda(t)\) are defined as \(\Lambda_{gg}(t) = -\sum_{h \neq g} \Lambda_{gh}(t)\), i.e. as minus the sum of the cumulative transition intensities of the transitions out of state \(g\). Finally, \(I\) is the \(S \times S\) identity matrix. This equation describes a theoretical relation between the true underlying transition intensities and transition probabilities. The product is a so-called product integral (Andersen et al. 1993) when the transition intensities are continuous.
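For the illness-death model used here (\(S = 3\), with direct transitions \(1 \to 2\), \(1 \to 3\) and \(2 \to 3\)), these definitions can be written out explicitly. This is only the general definition applied to our transition structure:

\[\Lambda(t)=\begin{pmatrix} -(\Lambda_{12}(t)+\Lambda_{13}(t)) & \Lambda_{12}(t) & \Lambda_{13}(t)\\ 0 & -\Lambda_{23}(t) & \Lambda_{23}(t)\\ 0 & 0 & 0\end{pmatrix}\]

Each row sums to zero, so each row of \(I + d\Lambda(u)\) sums to one. The third row of \(P(s,t)\) is always \((0, 0, 1)\) because state 3 is absorbing.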
We already have estimates of all the transition intensities. If we gather these in a matrix and plug them in equation (1), we get
\[\hat{P}(s,t)=\prod_{s<u\le t} (I+d\hat{\Lambda}(u)) \tag{2}\]
as an estimate of the transition probabilities. This estimator is called the Aalen-Johansen estimator, and it is implemented in probtrans. By working with matrices, we immediately get all the transition probabilities from all the starting states \(g\) to all the receiving states \(h\) in one go. When we fix \(s\), we can calculate all these transition probabilities by forward matrix multiplications using the simple recursive relation
\[\hat{P}(s,t+)=\hat{P}(s,t) \cdot (I+d\hat{\Lambda}(t+))\]
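The forward recursion can be sketched in a few lines of base R, reading \(t+\) as the next event time after \(t\). The cumulative hazards below are the first three rows per transition of the summary(msf2) output shown earlier, copied by hand. The loop is an illustration of the matrix recursion, not the probtrans implementation, and it produces no standard errors.

```r
# Cumulative hazards of the reference patient at the first three event times,
# copied from summary(msf2) (columns: transitions 1->2, 1->3, 2->3).
cumhaz <- rbind(
  c(0.0005277714, 0.0003053084, 0.0002090742),
  c(0.0010560892, 0.0003053084, 0.0002090742),
  c(0.0010560892, 0.0006107971, 0.0004182719)
)

# Increments dLambda at each event time (the cumulative hazard starts at 0).
d_haz <- diff(rbind(0, cumhaz))

P <- diag(3)  # P(0, 0) is the identity matrix
for (k in seq_len(nrow(d_haz))) {
  dL <- matrix(0, 3, 3)
  dL[1, 2] <- d_haz[k, 1]
  dL[1, 3] <- d_haz[k, 2]
  dL[2, 3] <- d_haz[k, 3]
  diag(dL) <- -rowSums(dL)   # each row of dLambda sums to zero
  P <- P %*% (diag(3) + dL)  # forward step: multiply on the right
}
print(round(P, 7))
```

Row \(g\) of the result holds the estimated probabilities of being in states 1, 2 and 3 at the third event time for a patient who started in state \(g\) at time 0. The first two entries of row 1 should agree with the pstate1 and pstate2 values that summary(pt, from = 1) reports at time 0.010951403.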
pt <- probtrans(msf2, predt = 0)
head(pt[[3]])
## time pstate1 pstate2 pstate3 se1 se2 se3
## 1 0.000000000 0 0 1 0 0 0
## 2 0.002737851 0 0 1 0 0 0
## 3 0.008213552 0 0 1 0 0 0
## 4 0.010951403 0 0 1 0 0 0
## 5 0.016427105 0 0 1 0 0 0
## 6 0.019164956 0 0 1 0 0 0
tail(pt[[3]])
## time pstate1 pstate2 pstate3 se1 se2 se3
## 501 6.253251 0 0 1 0 0 0
## 502 6.357290 0 0 1 0 0 0
## 503 6.362765 0 0 1 0 0 0
## 504 6.798084 0 0 1 0 0 0
## 505 7.110198 0 0 1 0 0 0
## 506 7.731691 0 0 1 0 0 0
#Output Suppressed
#summary(pt, from = 2)
#Output Suppressed
#summary(pt, from = 1)
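# Split relapse/death in two: state 3 = after PR (from state 2), state 4 = without PR (from state 1)
tmat2 <- transMat(x = list(c(2, 4), c(3), c(), c()))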
tmat2
## to
## from State 1 State 2 State 3 State 4
## State 1 NA 1 NA 2
## State 2 NA NA 3 NA
## State 3 NA NA NA NA
## State 4 NA NA NA NA
msf2$trans <- tmat2
pt <- probtrans(msf2, predt = 0)
summary(pt, from = 1)
##
## Prediction from state 1 (head and tail):
## time pstate1 pstate2 pstate3 pstate4 se1
## 1 0.000000000 1.0000000 0.0000000000 0.000000e+00 0.0000000000 0.0000000000
## 2 0.002737851 0.9991669 0.0005277714 0.000000e+00 0.0003053084 0.0006117979
## 3 0.008213552 0.9986390 0.0010556490 0.000000e+00 0.0003053084 0.0008100529
## 4 0.010951403 0.9983340 0.0010554282 2.208393e-07 0.0006103813 0.0008685356
## 5 0.016427105 0.9977235 0.0010549862 6.628276e-07 0.0012208961 0.0009807157
## 6 0.019164956 0.9965843 0.0015830048 1.105048e-06 0.0018316132 0.0012115670
## se2 se3 se4 lower1 lower2 lower3
## 1 0.0000000000 0.000000e+00 0.0000000000 1.0000000 0.000000e+00 0.000000e+00
## 2 0.0005285695 1.116923e-07 0.0003080762 0.9979685 7.412369e-05 0.000000e+00
## 3 0.0007492497 1.116923e-07 0.0003080762 0.9970526 2.626497e-04 0.000000e+00
## 4 0.0007490930 2.989514e-07 0.0004397978 0.9966331 2.625948e-04 1.555250e-08
## 5 0.0007487794 6.308958e-07 0.0006336859 0.9958031 2.624848e-04 1.026138e-07
## 6 0.0009191199 1.032427e-06 0.0007900509 0.9942125 5.072942e-04 1.770590e-07
## lower4 upper1 upper2 upper3 upper4
## 1 0.0000000000 1.0000000 0.000000000 0.000000e+00 0.000000000
## 2 0.0000422494 1.0000000 0.003757809 NaN 0.002206261
## 3 0.0000422494 1.0000000 0.004242894 NaN 0.002206261
## 4 0.0001486912 1.0000000 0.004242006 3.135832e-06 0.002505631
## 5 0.0004414450 0.9996475 0.004240230 4.281495e-06 0.003376609
## 6 0.0007864573 0.9989617 0.004939745 6.896741e-06 0.004265720
##
## ...
## time pstate1 pstate2 pstate3 pstate4 se1 se2
## 501 6.253251 0.2308531 0.4336481 0.1681264 0.1673724 0.02448884 0.02974526
## 502 6.357290 0.2283925 0.4304829 0.1712916 0.1698330 0.02460675 0.03002904
## 503 6.362765 0.2259175 0.4272883 0.1744862 0.1723080 0.02472281 0.03031296
## 504 6.798084 0.2209174 0.4208123 0.1809622 0.1773081 0.02518284 0.03119272
## 505 7.110198 0.2014549 0.3954248 0.2063497 0.1967706 0.03067690 0.03987257
## 506 7.731691 0.2014549 0.3954248 0.2063497 0.1967706 0.03067690 0.03987257
## se3 se4 lower1 lower2 lower3 lower4 upper1
## 501 0.02379684 0.02100629 0.1875169 0.3790974 0.1273960 0.1308738 0.2842045
## 502 0.02430502 0.02136056 0.1849160 0.3754732 0.1297050 0.1327282 0.2820911
## 503 0.02480762 0.02170882 0.1823058 0.3718215 0.1320509 0.1346059 0.2799621
## 504 0.02616939 0.02264879 0.1766850 0.3639092 0.1362993 0.1380380 0.2762233
## 505 0.03690104 0.02987965 0.1494719 0.3245138 0.1453401 0.1461185 0.2715164
## 506 0.03690104 0.02987965 0.1494719 0.3245138 0.1453401 0.1461185 0.2715164
## upper2 upper3 upper4
## 501 0.4960483 0.2218790 0.2140499
## 502 0.4935519 0.2262118 0.2173106
## 503 0.4910294 0.2305584 0.2205703
## 504 0.4866130 0.2402604 0.2277500
## 505 0.4818309 0.2929694 0.2649813
## 506 0.4818309 0.2929694 0.2649813
plot(pt, ord = c(2, 3, 4, 1), lwd = 2, xlab = "Years since transplant",
ylab = "Prediction probabilities", cex = 0.75,
legend = c("Alive in remission, no PR",
"Alive in remission, PR", "Relapse or death after PR",
"Relapse or death without PR"))
pt <- probtrans(msf2, predt = 0.5)
summary(pt, from = 1)
##
## Prediction from state 1 (head and tail):
## time pstate1 pstate2 pstate3 pstate4 se1
## 1 0.5000000 1.0000000 0.000000000 0.000000e+00 0.000000000 0.000000000
## 2 0.5010267 0.9985898 0.000000000 0.000000e+00 0.001410218 0.003237571
## 3 0.5037645 0.9976488 0.000000000 0.000000e+00 0.002351164 0.004183373
## 4 0.5065024 0.9955387 0.001639506 0.000000e+00 0.002821775 0.006169060
## 5 0.5092402 0.9938957 0.003282495 0.000000e+00 0.002821775 0.007422321
## 6 0.5119781 0.9915469 0.003277183 5.312169e-06 0.005170580 0.008513835
## se2 se3 se4 lower1 lower2 lower3
## 1 0.000000000 0.000000e+00 0.000000000 1.0000000 0.000000e+00 0.0000e+00
## 2 0.000000000 0.000000e+00 0.003237571 0.9922644 0.000000e+00 0.0000e+00
## 3 0.000000000 0.000000e+00 0.004183373 0.9894832 0.000000e+00 0.0000e+00
## 4 0.004136138 2.101143e-06 0.004583357 0.9835207 1.167630e-05 0.0000e+00
## 5 0.005848968 2.101143e-06 0.004583357 0.9794542 9.987955e-05 0.0000e+00
## 6 0.005839510 1.353036e-05 0.006209919 0.9749997 9.971745e-05 3.6076e-08
## lower4 upper1 upper2 upper3 upper4
## 1 0.000000e+00 1 0.0000000 0.0000000000 0.00000000
## 2 1.567120e-05 1 0.0000000 0.0000000000 0.12690255
## 3 7.190497e-05 1 0.0000000 0.0000000000 0.07687883
## 4 1.169315e-04 1 0.2302081 NaN 0.06809471
## 5 1.169315e-04 1 0.1078777 NaN 0.06809471
## 6 4.911765e-04 1 0.1077036 0.0007822136 0.05443032
##
## ...
## time pstate1 pstate2 pstate3 pstate4 se1 se2
## 330 6.253251 0.6872018 0.02597812 0.005991102 0.2808290 0.05248379 0.01448894
## 331 6.357290 0.6798772 0.02578851 0.006180714 0.2881535 0.05348008 0.01438691
## 332 6.362765 0.6725095 0.02559713 0.006372091 0.2955212 0.05445049 0.01428397
## 333 6.798084 0.6576254 0.02520918 0.006760043 0.3104053 0.05723289 0.01407791
## 334 7.110198 0.5996895 0.02368832 0.008280903 0.3683412 0.07993696 0.01332734
## 335 7.731691 0.5996895 0.02368832 0.008280903 0.3683412 0.07993696 0.01332734
## se3 se4 lower1 lower2 lower3 lower4
## 330 0.003565503 0.05117341 0.5916642 0.008706862 0.001866073 0.1964870
## 331 0.003675647 0.05224080 0.5827386 0.008640867 0.001926781 0.2019786
## 332 0.003786522 0.05327926 0.5738257 0.008574230 0.001988236 0.2075517
## 333 0.004019125 0.05620683 0.5544966 0.008437438 0.002108021 0.2176694
## 334 0.005060910 0.07944552 0.4618104 0.007863898 0.002499552 0.2413567
## 335 0.005060910 0.07944552 0.4618104 0.007863898 0.002499552 0.2413567
## upper1 upper2 upper3 upper4
## 330 0.7981661 0.07750930 0.01923468 0.4013749
## 331 0.7932082 0.07696533 0.01982645 0.4110953
## 332 0.7881646 0.07641656 0.02042190 0.4207761
## 333 0.7799349 0.07531940 0.02167824 0.4426505
## 334 0.7787343 0.07135602 0.02743426 0.5621360
## 335 0.7787343 0.07135602 0.02743426 0.5621360
plot(pt, ord = c(2, 3, 4, 1), lwd = 2, xlab = "Years since transplant",
ylab = "Prediction probabilities", cex = 0.75,
legend = c("Alive in remission, no PR",
"Alive in remission, PR", "Relapse or death after PR",
"Relapse or death without PR"))
msf2$trans <- tmat
msf.20 <- msf2 # copy msfit result for reference (young) patient
newd <- newd[,1:5] # use the basic covariates of the reference patient
newd2 <- newd
newd2$age <- 1
newd2$age <- factor(newd2$age,levels=0:2,labels=levels(ebmt3$age))
attr(newd2, "trans") <- tmat
class(newd2) <- c("msdata","data.frame")
newd2 <- expand.covs(newd2,covs[1:4],longnames=FALSE)
newd2$strata=c(1,2,2)
newd2$pr <- c(0,0,1)
msf.2040 <- msfit(c2, newdata=newd2, trans=tmat)
newd3 <- newd
newd3$age <- 2
newd3$age <- factor(newd3$age,levels=0:2,labels=levels(ebmt3$age))
attr(newd3, "trans") <- tmat
class(newd3) <- c("msdata","data.frame")
newd3 <- expand.covs(newd3,covs[1:4],longnames=FALSE)
newd3$strata=c(1,2,2)
newd3$pr <- c(0,0,1)
msf.40 <- msfit(c2, newdata=newd3, trans=tmat)
pt.20 <- probtrans(msf.20,predt=0) # original young (<= 20) patient
pt.201 <- pt.20[[1]]; pt.202 <- pt.20[[2]]
pt.2040 <- probtrans(msf.2040,predt=0) # patient 20-40
pt.20401 <- pt.2040[[1]]; pt.20402 <- pt.2040[[2]]
pt.40 <- probtrans(msf.40,predt=0) # patient > 40
pt.401 <- pt.40[[1]]; pt.402 <- pt.40[[2]]
pt.201[488:489,] # 5 years falls between 488th and 489th time point
## time pstate1 pstate2 pstate3 se1 se2 se3
## 488 4.985626 0.2452605 0.4519872 0.3027523 0.02411439 0.02853645 0.02693539
## 489 5.084189 0.2445602 0.4511034 0.3043365 0.02412385 0.02858110 0.02707436
pt.202[488:489,] # 5-years probabilities
## time pstate1 pstate2 pstate3 se1 se2 se3
## 488 4.985626 0 0.7378970 0.2621030 0 0.03339911 0.03339911
## 489 5.084189 0 0.7364541 0.2635459 0 0.03356217 0.03356217
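Since the estimated transition probabilities are step functions that change only at event times, the value at exactly 5 years is the one in the last row at or before 5 years, here row 488. The following base R sketch makes that lookup explicit with findInterval, using values copied by hand from the two outputs above.

```r
# Rows 488 and 489 of pt.201 and pt.202, copied from the output above.
times <- c(4.985626, 5.084189)
p13   <- c(0.3027523, 0.3043365)  # from state 1 (no PR) to relapse/death
p23   <- c(0.2621030, 0.2635459)  # from state 2 (PR) to relapse/death

# Index of the last event time at or before 5 years.
idx <- findInterval(5, times)

# 5-year relapse-free survival is one minus the probability of state 3.
rfs_5y <- c(no_PR = 1 - p13[idx], PR = 1 - p23[idx])
print(round(rfs_5y, 4))
```

With the full objects available, the same lookup could be done directly on pt.201$time and pt.202$time instead of on copied values.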
plot(pt.201$time, 1 - pt.201$pstate3, ylim = c(0.425, 1), type = "s",
lwd = 2, col = "red", xlab = "Years since transplant", ylab = "Relapse-free survival")
lines(pt.20401$time, 1 - pt.20401$pstate3, type = "s", lwd = 2,
col = "blue")
lines(pt.401$time, 1 - pt.401$pstate3, type = "s", lwd = 2, col = "green")
lines(pt.202$time, 1 - pt.202$pstate3, type = "s", lwd = 2, col = "red",
lty = 2)
lines(pt.20402$time, 1 - pt.20402$pstate3, type = "s", lwd = 2,
col = "blue", lty = 2)
lines(pt.402$time, 1 - pt.402$pstate3, type = "s", lwd = 2, col = "green",
lty = 2)
legend(6, 1, c("no PR", "PR"), lwd = 2, lty = 1:2, xjust = 1,
bty = "n")
legend("topright", c("<=20", "20-40", ">40"), lwd = 2,
col = c("red", "blue", "green"), bty = "n")
Figure 4: Predicted relapse-free survival probabilities for three patients in different age categories, given platelet recovery (dashed) and given no platelet recovery (solid). The time of prediction was at transplant (note: in the tutorial this was at 1 month after transplant).
It is also possible to do prediction with a fixed horizon. This should not be understood as attempting to predict the past. It means that in our prediction probabilities \(P_{gh}(s, t)\), we fix \(t\), a time horizon, and we want to study how \(P_{gh}(s, t)\) changes as more and more information on a patient becomes available. From a computational point of view this just means that the order of the matrix multiplication in (2) is reversed. We will plot \(1 -\hat{P}_{13}(s, 5)\) and \(1 -\hat{P}_{23}(s, 5)\), the 5-years relapse-free survival probabilities given that the patient is in state 1 (no PR) and in state 2 (PR), respectively, for the same three patients as before.
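To make the reversed multiplication order concrete, here is a minimal base R sketch that does not use mstate. The three one-step transition matrices are hypothetical numbers for a three-state model (1 = no PR, 2 = PR, 3 = relapse or death); they stand in for the estimated factors that are multiplied together to obtain the transition probabilities, and the helper `make_step` is an illustrative name only.

```r
# Hypothetical one-step transition matrices at three successive event times.
# Each row sums to 1; state 3 is absorbing.
make_step <- function(a12, a13, a23) {
  matrix(c(1 - a12 - a13, a12,     a13,
           0,             1 - a23, a23,
           0,             0,       1),
         nrow = 3, byrow = TRUE)
}
steps <- list(make_step(0.10, 0.05, 0.02),
              make_step(0.08, 0.04, 0.03),
              make_step(0.05, 0.06, 0.04))

# Forward prediction: s is fixed at the start and t grows, so each new
# factor is multiplied on the right. forward[[j]] covers steps 1..j.
forward <- Reduce(function(P, B) P %*% B, steps, accumulate = TRUE)

# Fixed horizon: t is fixed after the last step and s moves back in time,
# so each new factor is multiplied on the left. fixed[[j]] covers steps j..3.
fixed <- Reduce(function(B, P) B %*% P, steps, accumulate = TRUE, right = TRUE)

# Probability of having reached state 3 by the horizon, starting from
# state 1 (no PR) or state 2 (PR), as the prediction time s advances
sapply(fixed, function(P) P[1:2, 3])
```

Both orderings give the same matrix when they span the whole interval (`forward[[3]]` equals `fixed[[1]]`); they differ only in which end of the interval is held fixed.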
pt.20 <- probtrans(msf.20, direction = "fixedhorizon", predt = 5)
pt.201 <- pt.20[[1]]
pt.202 <- pt.20[[2]]
head(pt.201)
## time pstate1 pstate2 pstate3 se1 se2 se3
## 1 0.000000000 0.2452605 0.4519872 0.3027523 0.02411439 0.02853645 0.02693539
## 2 0.002737851 0.2454650 0.4519742 0.3025608 0.02413403 0.02854695 0.02694328
## 3 0.008213552 0.2455948 0.4518230 0.3025823 0.02414644 0.02854909 0.02694380
## 4 0.010951403 0.2456698 0.4519611 0.3023691 0.02415369 0.02855746 0.02695114
## 5 0.016427105 0.2458201 0.4522376 0.3019422 0.02416821 0.02857418 0.02696574
## 6 0.019164956 0.2461011 0.4523628 0.3015361 0.02419520 0.02859303 0.02698076
head(pt.202)
## time pstate1 pstate2 pstate3 se1 se2 se3
## 1 0.000000000 0 0.7378970 0.2621030 0 0.03339911 0.03339911
## 2 0.002737851 0 0.7380513 0.2619487 0 0.03340572 0.03340572
## 3 0.008213552 0 0.7380513 0.2619487 0 0.03340572 0.03340572
## 4 0.010951403 0 0.7382057 0.2617943 0 0.03341233 0.03341233
## 5 0.016427105 0 0.7385150 0.2614850 0 0.03342551 0.03342551
## 6 0.019164956 0 0.7388247 0.2611753 0 0.03343863 0.03343863
pt.2040 <- probtrans(msf.2040, direction = "fixedhorizon", predt = 5)
pt.20401 <- pt.2040[[1]]
pt.20402 <- pt.2040[[2]]
pt.40 <- probtrans(msf.40, direction = "fixedhorizon", predt = 5)
pt.401 <- pt.40[[1]]
pt.402 <- pt.40[[2]]
plot(pt.201$time, 1 - pt.201$pstate3, ylim = c(0.425, 1), type = "s",
lwd = 2, col = "red", xlab = "Years since transplant", ylab = "Relapse-free survival")
lines(pt.20401$time, 1 - pt.20401$pstate3, type = "s", lwd = 2,
col = "blue")
lines(pt.401$time, 1 - pt.401$pstate3, type = "s", lwd = 2, col = "green")
lines(pt.202$time, 1 - pt.202$pstate3, type = "s", lwd = 2, col = "red",
lty = 2)
lines(pt.20402$time, 1 - pt.20402$pstate3, type = "s", lwd = 2,
col = "blue", lty = 2)
lines(pt.402$time, 1 - pt.402$pstate3, type = "s", lwd = 2, col = "green",
lty = 2)
legend("topleft", c("<=20", "20-40", ">40"), lwd = 2,
col = c("red", "blue", "green"), bty = "n")
legend(1, 1, c("no PR", "PR"), lwd = 2, lty = 1:2, bty = "n")
title(main = "Backward prediction")
* Figure 5: Predicted probabilities of 5-years relapse-free survival,
conditional on being alive without relapse with (PR) and without
platelet recovery (no PR). Patients in three age categories.
Competing risks concern the situation where more than one cause of failure is possible. If failures are different causes of death, only the first of these to occur is observed. In other situations, observations after the first failure may be observable, but not of interest. We can represent a competing risks model graphically with an initial state (alive or more generally event-free) and a number of different endpoints.
A competing risks situation with K causes of failure
The subject of competing risks goes as far back as the 18th century, when Bernoulli [12] studied the possible consequences of eradication of smallpox on mortality rates. Indeed, the problem of estimation of failure probabilities after elimination (or modification) of one of the competing risks has been of great importance and has been the subject of much debate in the 1970s [13, 14].
The central criticism is the assumption that upon removal of one cause of failure, the risks of failure from the remaining causes are unchanged. While this may be a reasonable assumption in the industrial setting, in human studies it will rarely be true.
In some cases, each failure type is equally important. In other cases, one failure type can be singled out as the event of interest, while the remaining failure types are of less importance. One is then interested in the probability of failing from the cause of interest in the presence of competing risks (or, as in the first example, each of the death causes in turn is the cause of interest, with all the other death causes taken as competing risks).
One method that is often used to estimate this failure probability is the Kaplan–Meier estimate, where the failures from the competing causes are treated as censored observations.
However, this naive Kaplan–Meier is biased. The basic issue in competing risks models that results in the bias of the naive Kaplan–Meier estimator is the violation of one of the assumptions underlying the Kaplan–Meier estimator: the assumption of independence of the censoring distribution, i.e. the distribution of the time to the competing events.
The bias is greater when the competition is heavier, i.e. when the hazard of the competing events is larger. This is different from censoring due to end of study or loss to follow-up. In the latter situations, individuals may still fail at a later time point.
For illustration of several concepts and techniques we will use data from 329 homosexual men from the Amsterdam Cohort Studies on HIV infection and AIDS [15]. During the course of HIV infection, the so-called syncytium inducing (SI) HIV phenotype appears in many individuals. Prognosis is strongly impaired after the appearance of this SI phenotype [16]. Little is known about factors that induce the appearance of SI phenotype. When analysing time to SI appearance before AIDS diagnosis, AIDS acts as a competing event.
Using a naive KM approach, for time to AIDS, all individuals in which SI phenotype appeared first were treated as censored, while for SI appearance, all AIDS diagnoses were treated as censored.
The cause-specific hazard of cause \(k\) is defined as \[h_k(t)=\lim_{\Delta t \to 0} \frac{P(t \le T < t+\Delta t, D=k \mid T\ge t)}{\Delta t}\]
In standard survival analysis with a single endpoint, the hazard \(h(t)\) and the survival function \(S(t)\) are related through \[h(t)=\frac{1}{S(t)}\lim_{\Delta t \to 0} \frac{S(t)-S(t+\Delta t)}{\Delta t}=-\frac{d \log S(t)}{dt}\]
The cumulative hazard is defined as \[H(t)=\int_{0}^{t} h(s)ds\]
The survival function can be found from the cumulative hazard through the relation \[S(t)=\exp(-H(t))\]
For the cause-specific hazards we define, by analogy, \[H_k(t)=\int_{0}^{t} h_k(s)ds\] \[S_k(t)=\exp(-H_k(t))\]
Since the overall hazard is the sum of the cause-specific hazards, \[S(t)=\exp\left(-\sum_{k=1}^{K} H_k(t)\right)\]
Unlike \(S_k(t)\), this overall survival function does have an interpretation: it is the probability of not having failed from any cause at time \(t\). The cumulative incidence function of cause \(k\), \(I_k(t)\), is defined as the probability \(P(T \le t, D=k)\) of failing from cause \(k\) before time \(t\): \[I_k(t)=\int_{0}^{t} h_k(s)S(s)ds\]
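As a numerical check of these relations, the following sketch assumes two constant cause-specific hazards (the values 0.10 and 0.20 and the time point 5 are hypothetical, chosen only for illustration). With constant hazards the cumulative incidence has the closed form \(I_k(t)=\frac{h_k}{h_1+h_2}\left(1-S(t)\right)\), which can be compared with numerical integration of \(h_k S(s)\).

```r
h1 <- 0.10   # hypothetical constant hazard of cause 1
h2 <- 0.20   # hypothetical constant hazard of cause 2

# Overall survival: S(t) = exp(-(H_1(t) + H_2(t)))
S <- function(t) exp(-(h1 + h2) * t)

# Cumulative incidence by numerical integration of h_k * S(s) over [0, t]
cuminc <- function(t, hk) integrate(function(s) hk * S(s), 0, t)$value

t0 <- 5
I1 <- cuminc(t0, h1)
I2 <- cuminc(t0, h2)

# Closed form for constant hazards
I1_closed <- h1 / (h1 + h2) * (1 - S(t0))

# "Naive" quantity 1 - S_1(t), which ignores the competing cause
naive1 <- 1 - exp(-h1 * t0)

c(I1 = I1, I1_closed = I1_closed, naive1 = naive1, total = S(t0) + I1 + I2)
```

The three probabilities \(S(t)\), \(I_1(t)\) and \(I_2(t)\) add up to one, while \(1-S_1(t)\) exceeds \(I_1(t)\) because it ignores that subjects can fail from cause 2 first.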
Alternatively, the latent failure time approach focuses on the joint distribution of the times to the \(K\) different events, as described by the joint survival function \[\bar{S}(t_1, \dots, t_K)=P(\tilde{T}_1 > t_1, \dots, \tilde{T}_K > t_K)\]
Estimation of the cumulative incidence functions.
Let \(0 < t_1 < t_2 < \dots < t_N\) be the ordered distinct time points at which failures of any cause occur. Let \(d_{kj}\) denote the number of patients failing from cause \(k\) at \(t_j\), and let \(d_j = \sum_{k=1}^{K} d_{kj}\) denote the total number of failures (from any cause) at \(t_j\). In the absence of ties only one of the \(d_{kj}\) equals 1 for a given \(j\), and \(d_j = 1\).
The formulas are also valid, however, in the presence of ties. Let \(n_j\) be the number of patients at risk (i.e. that are still in follow-up and have not failed from any cause) at time \(t_j\). The overall survival probability \(S(t)\) at \(t\) can be estimated, without considering the cause of failure, by the Kaplan–Meier estimator \[\hat{S}(t)=\prod_{j:t_j \le t} (1-\frac{d_j}{n_j})\]
A discretized version of the cause-specific hazard is the proportion of subjects at risk that fail from cause \(k\), \[h_k(t_j)=P(T=t_j, D=k \mid T>t_{j-1})\]
This quantity would be estimated by \[\hat{h}_k(t_j)=\frac{d_{kj}}{n_j}\]
Thus, \[\hat{S}(t)=\prod_{j:t_j \le t} (1-\sum_{k=1}^{K} \hat{h}_k(t_j))\]
The unconditional probability of failing from cause \(k\) at \(t_j\), \(p_k(t_j)=P(T=t_j, D=k)\) is the product of the hazard and the probability of being event-free at \(t_j\) and is estimated as \[\hat{p}_k (t_j)=\hat{h}_k (t_j) \hat{S}(t_{j-1})\]
Finally, the cumulative incidence \(I_{k}(t)\) of cause \(k\) at \(t\) is estimated as the sum of these terms over all time points up to and including \(t\); in summary \[\hat{I}_k (t)=\sum_{j:t_j \le t} \hat{p}_k (t_j),\] \[\hat{p}_k (t_j)=\hat{h}_k (t_j) \hat{S}(t_{j-1}),\] \[\hat{h}_k(t_j)=\frac{d_{kj}}{n_j}\]
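The estimation steps can be traced by hand on a small example. The sketch below uses base R only and a hypothetical data set of eight subjects (status 0 = censored, 1 and 2 = the two causes of failure); the variable names are illustrative. It also computes the naive Kaplan–Meier quantity for cause 1, in which failures from cause 2 are treated as censored.

```r
# Hypothetical toy data: no ties, two censored observations
time   <- c(1, 2, 3, 4, 5, 6, 7, 8)
status <- c(1, 2, 0, 1, 2, 1, 0, 2)

# Ordered distinct failure times t_j (any cause) and numbers at risk n_j
tj <- sort(unique(time[status != 0]))
nj <- sapply(tj, function(t) sum(time >= t))

# Cause-specific numbers of failures d_kj and hazards h_k(t_j) = d_kj / n_j
count_at <- function(cause) {
  sapply(tj, function(t) sum(time == t & status == cause))
}
h1 <- count_at(1) / nj
h2 <- count_at(2) / nj

# Overall Kaplan-Meier estimate S(t_j) and its left-hand value S(t_{j-1})
S      <- cumprod(1 - (h1 + h2))
S_prev <- c(1, head(S, -1))

# Cumulative incidence: running sum of p_k(t_j) = h_k(t_j) * S(t_{j-1})
I1 <- cumsum(h1 * S_prev)
I2 <- cumsum(h2 * S_prev)

# Naive estimate for cause 1: cause 2 failures treated as censored
naive1 <- 1 - cumprod(1 - h1)

data.frame(time = tj, n = nj, Surv = S, CI.1 = I1, CI.2 = I2, naive1 = naive1)
```

At every time point `Surv + CI.1 + CI.2` equals one, and `naive1` is never smaller than `CI.1`, which illustrates the upward bias of the naive estimator.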
The effect of covariates on disease progression is most often modelled using the Cox proportional hazards model. In its simplest form, the hazard for a subject with covariate values \(x = (x_1, \dots, x_p)\) is assumed to be \[h(t|x)=h_0(t)\exp(\beta x)\]
Assuming all event times are distinct, the parameter vector \(\beta\) is found by maximizing the partial likelihood. This is a product, over the event times, of a quotient that compares the hazard of the individual with the event at \(t_j\) to the hazard of all the individuals at risk at \(t_j\), where \(R_j\) denotes the risk set at \(t_j\): \[L(\beta)=\prod_{j=1}^{N} \frac{\exp(\beta x_j)}{\sum_{l \in R_j}\exp(\beta x_l)}\]
Note that the baseline hazard cancels out. The estimate \(\hat{\beta}\) is used in Breslow’s estimate of the baseline cumulative hazard \[\hat{H}_0(t)=\sum_{j:t_j \le t} \frac{1}{\sum_{l \in R_j}\exp(\hat{\beta}x_l)}\]
Sometimes, one may want to allow the baseline hazard to be different across subgroups \(h = 1, \dots, m,\) called strata: \[h_h(t|x)=h_{h,0}(t)\exp(\beta x)\]
Parameter estimation in this stratified Cox model is performed by maximization of the product of the partial likelihoods per stratum \[L(\beta)=\prod_{h=1}^{m} L_h(\beta)\] with \[L_h(\beta)=\prod_{j=1}^{N} \frac{\exp(\beta x_j)}{\sum_{l \in R_{h,j}}\exp(\beta x_l)}\]
Here, the product in \(L_h(\beta)\) is only taken over the event times from individuals in stratum \(h\), and \(R_{h,j}\) denotes the risk set at event time \(t_j\) in stratum \(h\). If all relative risk parameters \(\beta\) are allowed to differ per stratum, then the \(L_h(\beta_h)\) have nothing in common and fitting such a stratified Cox model boils down to fitting \(m\) different Cox models, i.e. one per stratum.
The results from a Cox model, which models effects of covariates on the hazard, can also be used to describe cumulative effects. For the moment, assume that only effects of time-fixed covariates have been modelled. If an individual has covariate values \(x\), then his or her survival curve is estimated as \[\hat{S}(t \mid x)=\exp(-\hat{H}_0(t) \exp(\hat{\beta} x))=\hat{S}_0(t)^{\exp(\hat{\beta} x)},\] where \(\hat{S}_0(t)=\exp(-\hat{H}_0(t))\) is the estimated baseline survival function.
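The relation between the baseline cumulative hazard and an individual survival curve can be checked with the survival package. The sketch below is an illustration on the lung data with a single covariate (age); the covariate value of 60 years is an arbitrary choice, and the comparison assumes that `basehaz()` and `survfit()` are used with their default settings.

```r
library(survival)

# Cox model with one time-fixed covariate
fit <- coxph(Surv(time, status) ~ age, data = lung)

# Baseline cumulative hazard H_0(t), i.e. for covariate value age = 0
H0 <- basehaz(fit, centered = FALSE)

# Survival curve for a hypothetical 60-year-old, built from the formula
# S(t | x) = exp(-H_0(t) * exp(beta * x))
x <- 60
S_formula <- exp(-H0$hazard * exp(coef(fit)[["age"]] * x))

# The same curve as returned directly by survfit()
sf <- survfit(fit, newdata = data.frame(age = x))

# Largest absolute difference between the two versions
max(abs(S_formula - as.numeric(sf$surv)))
```

The difference should be at the level of rounding error, since `survfit()` applies the same exponential relation internally for Cox models under its defaults.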
Illustration of the steps used in estimating the cumulative incidence functions for AIDS and SI appearance in the SI data
data(aidssi)
si <- aidssi # Just a shorter name
head(si)
## patnr time status cause ccr5
## 1 1 9.106 1 AIDS WW
## 2 2 11.039 0 event-free WM
## 3 3 2.234 1 AIDS WW
## 4 4 9.878 2 SI WM
## 5 5 3.819 1 AIDS WW
## 6 6 6.801 1 AIDS WW
table(si$status)
##
## 0 1 2
## 107 114 108
Here a single time and cause variable are used to indicate time of failure (or censoring) and cause of failure. The variable status is just a numeric representation of cause. The whole data set represented in this format will be called si.
An alternative way of representing the same data is in long format (the SI data set in long format is called silong). We will see later that this representation allows for more flexibility in modelling the effect of covariates.
To prepare data in long format, it is possible to use msprep. In this case there is not a huge advantage in using msprep; the long data may just as easily be prepared directly. Nevertheless we will illustrate the use of msprep to obtain data in long format. The function trans.comprisk prepares a transition matrix for competing risks models. The first argument is the number of causes of failure; in the names argument a character vector of length three (the total number of states in the multi-state model including the failure-free state) may be given. The transition matrix has three states, with state 1 being the failure-free state and the subsequent states representing the different causes of failure.
tmat <- trans.comprisk(2, names = c("event-free", "AIDS", "SI"))
tmat
## to
## from event-free AIDS SI
## event-free NA 1 2
## AIDS NA NA NA
## SI NA NA NA
si$stat1 <- as.numeric(si$status == 1)
si$stat2 <- as.numeric(si$status == 2)
silong <- msprep(time = c(NA, "time", "time"),
status = c(NA, "stat1", "stat2"),
data = si, keep = "ccr5",
trans = tmat)
events(silong)
## $Frequencies
## to
## from event-free AIDS SI no event total entering
## event-free 0 114 108 107 329
## AIDS 0 0 0 114 114
## SI 0 0 0 108 108
##
## $Proportions
## to
## from event-free AIDS SI no event
## event-free 0.0000000 0.3465046 0.3282675 0.3252280
## AIDS 0.0000000 0.0000000 0.0000000 1.0000000
## SI 0.0000000 0.0000000 0.0000000 1.0000000
silong <- expand.covs(silong, "ccr5")
silong[1:8, ]
## An object of class 'msdata'
##
## Data:
## id from to trans Tstart Tstop time status ccr5 ccr5WM.1 ccr5WM.2
## 1 1 1 2 1 0 9.106 9.106 1 WW 0 0
## 2 1 1 3 2 0 9.106 9.106 0 WW 0 0
## 3 2 1 2 1 0 11.039 11.039 0 WM 1 0
## 4 2 1 3 2 0 11.039 11.039 0 WM 0 1
## 5 3 1 2 1 0 2.234 2.234 1 WW 0 0
## 6 3 1 3 2 0 2.234 2.234 0 WW 0 0
## 7 4 1 2 1 0 9.878 9.878 0 WM 1 0
## 8 4 1 3 2 0 9.878 9.878 1 WM 0 1
c1 <- coxph(Surv(time, status) ~ 1, data = silong,
subset = (trans == 1), method = "breslow")
c2 <- coxph(Surv(time, status) ~ 1, data = silong,
subset = (trans == 2), method = "breslow")
h1 <- survfit(c1)
h1 <- data.frame(time = h1$time, surv = h1$surv)
h2 <- survfit(c2)
h2 <- data.frame(time = h2$time, surv = h2$surv)
idx1 <- (h1$time<13) # this restricts the plot to the first 13 years
plot(c(0,h1$time[idx1],13),c(1,h1$surv[idx1],min(h1$surv[idx1])),type="s",
xlim=c(0,13),ylim=c(0,1),xlab="Years from HIV infection",ylab="Probability",lwd=2)
idx2 <- (h2$time<13)
lines(c(0,h2$time[idx2],13),c(0,1-h2$surv[idx2],max(1-h2$surv[idx2])),type="s",lwd=2)
text(8,0.71,adj=0,"AIDS")
text(8,0.32,adj=0,"SI")
* The figure shows the estimated survival curve for AIDS and the
probability of SI appearance, based on the naive Kaplan-Meier estimator.
ci <- Cuminc(time = si$time, status = si$status)
ci <- Cuminc(time = "time", status = "status", data = aidssi)
The result is a data frame containing the failure-free probabilities (Surv) and the cumulative incidence functions (CI.1, CI.2) with their standard errors. Other arguments allow you to specify the codes for the causes of failure and a group identifier.
head(ci)
## time Surv CI.1 CI.2 seSurv seCI.1 seCI.2
## 1 0.112 0.9969605 0 0.003039514 0.003034891 0 0.003034891
## 2 0.137 0.9939210 0 0.006079027 0.004285436 0 0.004285436
## 3 0.474 0.9908628 0 0.009137246 0.005251290 0 0.005251290
## 4 0.824 0.9877760 0 0.012224046 0.006074796 0 0.006074796
## 5 0.884 0.9846795 0 0.015320522 0.006799283 0 0.006799283
## 6 0.969 0.9815830 0 0.018416998 0.007449696 0 0.007449696
tail(ci)
## time Surv CI.1 CI.2 seSurv seCI.1 seCI.2
## 211 11.943 0.2312339 0.4035707 0.3651954 0.02638091 0.02978948 0.02881464
## 212 12.129 0.2266092 0.4081954 0.3651954 0.02625552 0.02989297 0.02881464
## 213 12.400 0.2219845 0.4081954 0.3698201 0.02612382 0.02989297 0.02896110
## 214 12.936 0.2165702 0.4081954 0.3752344 0.02604167 0.02989297 0.02919663
## 215 13.361 0.2067261 0.4180395 0.3752344 0.02665370 0.03089977 0.02919663
## 216 13.936 0.0000000 0.4180395 0.5819605 0.00000000 0.03089977 0.03089977
idx0 <- (ci$time < 13)
plot(c(0, ci$time[idx0], 13), c(1, 1 - ci$CI.1[idx0],
min(1 - ci$CI.1[idx0])),
type = "s", xlim = c(0, 13), ylim = c(0, 1),
xlab = "Years from HIV infection",
ylab = "Probability", lwd = 2)
idx1 <- (h1$time < 13)
lines(c(0, h1$time[idx1], 13), c(1, h1$surv[idx1], min(h1$surv[idx1])),
type = "s", lwd = 2, col = 8)
lines(c(0, ci$time[idx0], 13), c(0, ci$CI.2[idx0], max(ci$CI.2[idx0])),
type = "s", lwd = 2)
idx2 <- (h2$time < 13)
lines(c(0, h2$time[idx2], 13), c(0, 1 - h2$surv[idx2],
max(1 - h2$surv[idx2])),
type = "s", lwd = 2, col = 8)
text(8, 0.77, adj = 0, "AIDS")
text(8, 0.275, adj = 0, "SI")
idx0 <- (ci$time < 13)
plot(c(0, ci$time[idx0]), c(0, ci$CI.1[idx0]), type = "s",
xlim = c(0,13),
ylim = c(0, 1), xlab = "Years from HIV infection", ylab = "Probability",
lwd = 2)
lines(c(0, ci$time[idx0]), c(0, ci$CI.1[idx0] + ci$CI.2[idx0]),
type = "s", lwd = 2)
text(13, 0.5 * max(ci$CI.1[idx0]), adj = 1, "AIDS")
text(13, max(ci$CI.1[idx0]) + 0.5 * max(ci$CI.2[idx0]), adj = 1, "SI")
text(13, 0.5 + 0.5 * max(ci$CI.1[idx0]) + 0.5 * max(ci$CI.2[idx0]),
adj = 1, "Event-free")
* The figure shows the cumulative incidence curves of AIDS and SI
appearance. The cumulative incidence functions are stacked; the
distances between two curves represent the probabilities of the
different events.
Just like in standard survival analysis, the effect of one or two binary covariates is most easily investigated by estimating cumulative incidence curves non-parametrically and testing whether the curves differ by covariate value. Gray [20] developed a log-rank type test for equality of cumulative incidence curves.
In proportional hazards regression on the cause-specific hazards, we model the cause-specific hazard of cause \(k\) for a subject with covariate vector \(x\) as \[h_k(t|x)=h_{k,0}(t)\exp(\beta_k x),\] where \(h_{k,0}(t)\) is the baseline cause-specific hazard of cause \(k\), and the vector \(\beta_k\) represents the covariate effects on cause \(k\).
The analysis is completely standard, but the interpretation requires caution. At each time some person moves to state \(k\), the covariate values of this individual are compared with the covariates of all other individuals still event-free and in follow-up. Persons who move to another state are censored at their transition time.
In this subsection we shall illustrate the use of R in carrying out some of the regression analyses based on the SI data set. A specific deletion in the C–C chemokine receptor 5 gene (CCR5 \(\Delta\) 32) has been associated with reduced susceptibility to HIV infection and delayed AIDS progression. Since NSI viruses use CCR5 for cell entry, whereas SI viruses can also use C-X-C chemokine receptor 4 (CXCR4), the latter virus type may have an advantage in persons with the deletion.
Therefore, we investigate whether in persons with the deletion the SI phenotype appears more rapidly. This question has been addressed using standard survival analysis techniques, which implicitly assumed that a switch to SI and progression to AIDS are independent mechanisms. The CCR5 genotype is incorporated in the SI data set through the covariate ccr5. Persons without the deletion (‘wild type’) have WW, the reference category, whereas individuals who have the deletion on one of the chromosomes have WM (individuals with the deletion on both chromosomes were not present in our data).
Let us look at the effect of CCR5 (classified as wild-type (WW) or mutant (WM)) on AIDS and SI appearance. A total of 259 out of 324 patients (80 per cent) had the wild-type variant, while 65 patients (20 per cent) had the mutant variant. Five patients had an unknown CCR5 genotype.
coxph(Surv(time, status == 1) ~ ccr5, data = si) # AIDS
## Call:
## coxph(formula = Surv(time, status == 1) ~ ccr5, data = si)
##
## coef exp(coef) se(coef) z p
## ccr5WM -1.2358 0.2906 0.3071 -4.024 5.72e-05
##
## Likelihood ratio test=21.98 on 1 df, p=2.756e-06
## n= 324, number of events= 113
## (5 observations deleted due to missingness)
coxph(Surv(time, status == 2) ~ ccr5, data = si) # SI appearance
## Call:
## coxph(formula = Surv(time, status == 2) ~ ccr5, data = si)
##
## coef exp(coef) se(coef) z p
## ccr5WM -0.2542 0.7755 0.2380 -1.068 0.286
##
## Likelihood ratio test=1.19 on 1 df, p=0.2748
## n= 324, number of events= 107
## (5 observations deleted due to missingness)
The estimated coefficient for the mutant with respect to the wild-type variant for AIDS was -1.24 (SE 0.31), giving a significant protective effect of the mutant variant (hazard ratio (HR) = 0.29, P \(<\) 0.0001). The effect of CCR5 on SI appearance was not significant (coefficient -0.25, SE 0.24, HR 0.78, p = 0.29).
The same analysis can be performed using the long format dataset silong in several ways. For instance, as separate Cox regressions.
coxph(Surv(time, status) ~ ccr5, data = silong,
subset = (trans == 1), method = "breslow")
## Call:
## coxph(formula = Surv(time, status) ~ ccr5, data = silong, subset = (trans ==
## 1), method = "breslow")
##
## coef exp(coef) se(coef) z p
## ccr5WM -1.2358 0.2906 0.3071 -4.024 5.73e-05
##
## Likelihood ratio test=21.98 on 1 df, p=2.758e-06
## n= 324, number of events= 113
## (5 observations deleted due to missingness)
coxph(Surv(time, status) ~ ccr5, data = silong,
subset = (trans == 2), method = "breslow")
## Call:
## coxph(formula = Surv(time, status) ~ ccr5, data = silong, subset = (trans ==
## 2), method = "breslow")
##
## coef exp(coef) se(coef) z p
## ccr5WM -0.2542 0.7755 0.2380 -1.068 0.286
##
## Likelihood ratio test=1.19 on 1 df, p=0.2748
## n= 324, number of events= 107
## (5 observations deleted due to missingness)
coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + strata(trans),
data = silong)
## Call:
## coxph(formula = Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + strata(trans),
## data = silong)
##
## coef exp(coef) se(coef) z p
## ccr5WM.1 -1.2358 0.2906 0.3071 -4.024 5.72e-05
## ccr5WM.2 -0.2542 0.7755 0.2380 -1.068 0.286
##
## Likelihood ratio test=23.17 on 2 df, p=9.294e-06
## n= 648, number of events= 220
## (10 observations deleted due to missingness)
The n = 648 mentioned here equals the number of rows (two times 324) in the long data set without missing data; R reports that the remaining 10 rows were deleted because of missing covariate values. The same model can also be fitted by adding an interaction term between the cause (transition) variable and the covariate ccr5.
The same model, but now using a covariate by cause interaction.
coxph(Surv(time, status) ~ ccr5 * factor(trans) +
strata(trans),
data = silong)
## Call:
## coxph(formula = Surv(time, status) ~ ccr5 * factor(trans) + strata(trans),
## data = silong)
##
## coef exp(coef) se(coef) z p
## ccr5WM -1.2358 0.2906 0.3071 -4.024 5.72e-05
## factor(trans)2 NA NA 0.0000 NA NA
## ccr5WM:factor(trans)2 0.9816 2.6688 0.3886 2.526 0.0115
##
## Likelihood ratio test=23.17 on 2 df, p=9.294e-06
## n= 648, number of events= 220
## (10 observations deleted due to missingness)
Now we see the advantage of the use of the long format. The notation allows the effect of the covariates to be different for each failure cause. Use of the long format makes it possible to assume that the effects of CCR5 are identical for the different causes and to test for equality of the effects of CCR5 on AIDS and SI appearance. The coefficient -1.236 is (as before) for the effect of CCR5 on AIDS. The deviant coefficient 0.982 now represents the difference in the effect of CCR5 on the two cause-specific hazards. The CCR5 genotype by cause interaction term is significant, indicating that the effect of CCR5 is quite different on AIDS and SI appearance. The effect of CCR5 on SI appearance is thus given by -1.236 + 0.982 = -0.254, as before. Note that the second row with NA’s in the output above is caused by the fact that the cause main effect cannot be estimated, since the baseline cause-specific hazards are both freely estimated.
If we were to assume that the effect of CCR5 on the two cause-specific hazards is equal, we could fit a stratified model with a single CCR5 coefficient, as in the model below. The significant interaction in the model we just saw indicates that this assumption is not appropriate here.
coxph(Surv(time, status) ~ ccr5 + strata(trans), data = silong)
## Call:
## coxph(formula = Surv(time, status) ~ ccr5 + strata(trans), data = silong)
##
## coef exp(coef) se(coef) z p
## ccr5WM -0.7012 0.4960 0.1860 -3.77 0.000163
##
## Likelihood ratio test=16.46 on 1 df, p=4.972e-05
## n= 648, number of events= 220
## (10 observations deleted due to missingness)
coxph(Surv(time, status) ~ ccr5, data = silong)
## Call:
## coxph(formula = Surv(time, status) ~ ccr5, data = silong)
##
## coef exp(coef) se(coef) z p
## ccr5WM -0.7012 0.4960 0.1860 -3.771 0.000163
##
## Likelihood ratio test=16.46 on 1 df, p=4.964e-05
## n= 648, number of events= 220
## (10 observations deleted due to missingness)
coxph(Surv(time, status != 0) ~ ccr5, data = si)
## Call:
## coxph(formula = Surv(time, status != 0) ~ ccr5, data = si)
##
## coef exp(coef) se(coef) z p
## ccr5WM -0.7013 0.4959 0.1860 -3.771 0.000163
##
## Likelihood ratio test=16.47 on 1 df, p=4.953e-05
## n= 324, number of events= 220
## (5 observations deleted due to missingness)
coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + factor(trans),
data = silong)
## Call:
## coxph(formula = Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + factor(trans),
## data = silong)
##
## coef exp(coef) se(coef) z p
## ccr5WM.1 -1.1664 0.3115 0.3063 -3.808 0.00014
## ccr5WM.2 -0.3316 0.7178 0.2366 -1.401 0.16112
## factor(trans)2 -0.1843 0.8317 0.1477 -1.248 0.21201
##
## Likelihood ratio test=21.54 on 3 df, p=8.124e-05
## n= 648, number of events= 220
## (10 observations deleted due to missingness)
The coefficient -0.184 and its hazard ratio 0.832 would indicate that (under the assumption of the cause-specific hazards being proportional) the baseline cause-specific hazard of SI appearance is somewhat smaller than that of AIDS, though not significant (p = 0.21).
Even though the assumption of proportional baseline cause-specific hazards will often be unrealistic, this proportional risk model has the nice property that, given that a failure occurs, the probability that the individual fails from cause \(k\) follows a logistic model.
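The logistic form can be made explicit with the coefficients printed above. Under proportional baseline hazards, the baseline hazard cancels in the ratio \(h_{SI}/(h_{AIDS}+h_{SI})\), leaving a logistic function of the covariate. The sketch below simply plugs in the three estimates from the output; the object names are illustrative.

```r
# Coefficients copied from the coxph() output above
g  <- -0.1843   # factor(trans)2: log ratio of baseline hazards, SI vs AIDS
b1 <- -1.1664   # ccr5WM.1: effect of WM on the AIDS hazard
b2 <- -0.3316   # ccr5WM.2: effect of WM on the SI hazard

# P(cause = SI | failure, x) as a ratio of hazards; x = 0 for WW, 1 for WM.
# The common baseline hazard cancels and is therefore omitted.
p_si <- function(x) {
  h_aids <- exp(b1 * x)
  h_si   <- exp(g + b2 * x)
  h_si / (h_aids + h_si)
}

# Equivalent logistic form: intercept g, slope b2 - b1
p_si_logit <- function(x) plogis(g + (b2 - b1) * x)

rbind(ratio = p_si(c(WW = 0, WM = 1)),
      logistic = p_si_logit(c(WW = 0, WM = 1)))
```

The slope `b2 - b1` equals 0.8348, the covariate by cause interaction coefficient in the model without strata.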
Or, again using covariate by cause (transition) interaction.
coxph(Surv(time, status) ~ ccr5 * factor(trans), data = silong)
## Call:
## coxph(formula = Surv(time, status) ~ ccr5 * factor(trans), data = silong)
##
## coef exp(coef) se(coef) z p
## ccr5WM -1.1664 0.3115 0.3063 -3.808 0.00014
## factor(trans)2 -0.1843 0.8317 0.1477 -1.248 0.21201
## ccr5WM:factor(trans)2 0.8348 2.3044 0.3855 2.165 0.03035
##
## Likelihood ratio test=21.54 on 3 df, p=8.124e-05
## n= 648, number of events= 220
## (10 observations deleted due to missingness)
coxph(Surv(time, status) ~ ccr5 * factor(trans) + cluster(id),
data = silong)
## Call:
## coxph(formula = Surv(time, status) ~ ccr5 + factor(trans) + ccr5:factor(trans),
## data = silong, cluster = id)
##
## coef exp(coef) se(coef) robust se z p
## ccr5WM -1.1664 0.3115 0.3063 0.2928 -3.983 6.81e-05
## factor(trans)2 -0.1843 0.8317 0.1477 0.1477 -1.248 0.2121
## ccr5WM:factor(trans)2 0.8348 2.3044 0.3855 0.3855 2.165 0.0304
##
## Likelihood ratio test=21.54 on 3 df, p=8.124e-05
## n= 648, number of events= 220
## (10 observations deleted due to missingness)
c1 <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + strata(trans),
data = silong, method = "breslow")
WW <- data.frame(ccr5WM.1 = c(0, 0), ccr5WM.2 = c(0, 0),
trans = c(1,2), strata = c(1, 2))
msf.WW <- msfit(c1, WW, trans = tmat)
pt.WW <- probtrans(msf.WW, 0)[[1]]
WM <- data.frame(ccr5WM.1 = c(1, 0), ccr5WM.2 = c(0, 1),
trans = c(1, 2), strata = c(1, 2))
msf.WM <- msfit(c1, WM, trans = tmat)
pt.WM <- probtrans(msf.WM, 0)[[1]]
idx1 <- (pt.WW$time < 13)
idx2 <- (pt.WM$time < 13)
plot(c(0, pt.WW$time[idx1]), c(0, pt.WW$pstate2[idx1]), type = "s",
ylim = c(0, 0.5), xlab = "Years from HIV infection", ylab = "Probability",
lwd = 2)
lines(c(0, pt.WM$time[idx2]), c(0, pt.WM$pstate2[idx2]), type = "s",
lwd = 2, col = 8)
title(main = "AIDS")
text(9.2, 0.345, "WW", adj = 0, cex = 0.75)
text(9.2, 0.125, "WM", adj = 0, cex = 0.75)
plot(c(0, pt.WW$time[idx1]), c(0, pt.WW$pstate3[idx1]), type = "s",
ylim = c(0, 0.5), xlab = "Years from HIV infection", ylab = "Probability",
lwd = 2)
lines(c(0, pt.WM$time[idx2]), c(0, pt.WM$pstate3[idx2]), type = "s",
lwd = 2, col = 8)
title(main = "SI appearance")
text(7.5, 0.31, "WW", adj = 0, cex = 0.75)
text(7.5, 0.245, "WM", adj = 0, cex = 0.75)
The difference between covariate effects on cause-specific hazards and cumulative incidence explained
We assume that WW individuals have a constant failure rate of 30 per cent at discrete time points, for both endpoints. The mutation WM is protective for the cause-specific hazard to SI appearance (hazard ratio 0.90). However, it is even more protective for AIDS diagnosis (hazard ratio 0.33). This latter aspect causes more individuals to remain at risk after the first round for WM. Hence, in the second round, SI appears in more individuals with WM than in individuals with WW (1701 to 1200).
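The two-round arithmetic can be reproduced in a few lines. The sketch assumes a hypothetical cohort of 10,000 individuals per genotype and rounds the WM hazard of AIDS to 0.10 (0.30 times the hazard ratio 0.33); under these assumptions the second-round counts of 1701 and 1200 are recovered.

```r
n0 <- 10000  # hypothetical cohort size per genotype

# Per-round failure probabilities for each endpoint
haz <- list(WW = c(AIDS = 0.30, SI = 0.30),
            WM = c(AIDS = 0.10, SI = 0.27))  # 0.30 * 0.33 (rounded), 0.30 * 0.90

two_rounds <- function(h, n0) {
  at_risk <- n0
  si <- numeric(2)
  for (r in 1:2) {
    # SI events among those still event-free at the start of the round
    si[r] <- at_risk * h[["SI"]]
    # Both endpoints remove individuals from the risk set
    at_risk <- at_risk * (1 - h[["AIDS"]] - h[["SI"]])
  }
  c(SI_round1 = si[1], SI_round2 = si[2], cuminc_SI = sum(si) / n0)
}

res <- sapply(haz, two_rounds, n0 = n0)
round(res, 4)
```

Although WM has fewer SI events in the first round, more WM individuals remain at risk, so the second round produces more SI events for WM than for WW.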
As a result, after the second round, the cumulative incidence for SI appearance is higher for individuals with WM than for individuals with WW genotype. The second illustration of this phenomenon is through the figures below, which show what would happen if we were to change the baseline hazard of AIDS by multiplying the estimate from the data by different multiplication factors, while keeping everything else (the baseline cause-specific hazard of SI appearance, and the effects of CCR5 on both cause-specific hazards) the same.
The sub-plot with factor = 0 corresponds to the standard Cox regression in the absence of the competing risk ‘AIDS’. Here the difference in probabilities of SI appearance between wild-type and mutant indeed increases with time. As the competition from AIDS is increased, the higher cause-specific hazard for SI appearance, \(h_{SI}(s)\), for WW compared to WM is offset against an increasingly smaller contribution from the overall survival \(S(s) = \exp(-(H_{AIDS}(s)+ H_{SI}(s)))\) for WW, where the contribution of AIDS, \(H_{AIDS}(s)\), increases as the multiplication factor increases. At first this results in a crossing of the cumulative incidence curves (see e.g. factor = 1; this is not possible in the absence of competing risks), which occurs earlier with increasing multiplication factor. With factor = 4, the effect of CCR5 on the cumulative incidence of SI appearance is inverse to what the hazard ratio of 0.78 of WM with respect to WW seems to suggest.
The phenomenon that the same cause-specific hazard ratio may have different effects on the cumulative incidences can be illustrated in R by replacing the appropriate parts of the cumulative hazard of AIDS (trans=1) and calling probtrans. We are interested in SI appearance and adjust the hazards of the competing risk (AIDS) while keeping the remainder the same. The result is shown below. We multiply the baseline hazard of AIDS by the factors ff = 0, 0.5, 1, 1.5, 2, 4.
# Multiplication factors applied to the cumulative hazard of AIDS (trans == 1)
ffs <- c(0, 0.5, 1, 1.5, 2, 4)
newmsf.WW <- msf.WW
newmsf.WM <- msf.WM
par(mfrow = c(2, 3))
for (ff in ffs) {
  # Rescale the AIDS hazard for WW and recompute the transition probabilities
  newmsf.WW$Haz$Haz[newmsf.WW$Haz$trans == 1] <- ff * msf.WW$Haz$Haz[msf.WW$Haz$trans == 1]
  pt.WW <- probtrans(newmsf.WW, 0, variance = FALSE)[[1]]
  # Same for WM
  newmsf.WM$Haz$Haz[newmsf.WM$Haz$trans == 1] <- ff * msf.WM$Haz$Haz[msf.WM$Haz$trans == 1]
  pt.WM <- probtrans(newmsf.WM, 0, variance = FALSE)[[1]]
  # Restrict the plots to the first 13 years
  idx1 <- (pt.WW$time < 13)
  idx2 <- (pt.WM$time < 13)
  # pstate3 is the probability of SI appearance: WW in black, WM in grey
  plot(c(0, pt.WW$time[idx1]), c(0, pt.WW$pstate3[idx1]), type = "s",
       ylim = c(0, 0.52), xlab = "Years from HIV infection",
       ylab = "Probability", lwd = 2)
  lines(c(0, pt.WM$time[idx2]), c(0, pt.WM$pstate3[idx2]),
        type = "s", lwd = 2, col = 8)
  title(main = paste("Factor =", ff))
}
par(mfrow = c(1, 1))
This figure shows cumulative incidence functions for SI appearance, for CCR5 wild-type WW (black) and mutant WM (grey). The baseline hazard of AIDS was multiplied by different factors, while keeping everything else the same.
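The crossing seen in these panels can be reproduced in closed form with constant hazards. The following is a minimal sketch, not the analysis of the SI data: the baseline hazards and the hazard ratio for AIDS are hypothetical values chosen for illustration, and only the hazard ratio of 0.78 for SI appearance is taken from the text. With constant cause-specific hazards, the cumulative incidence of cause \(k\) is \(I_k(t) = \frac{h_k}{h_k + h_{other}}\left(1 - e^{-(h_k + h_{other})t}\right)\).

```r
# Cumulative incidence of cause k under constant cause-specific hazards
cif_const <- function(t, h_k, h_other) {
  h_tot <- h_k + h_other
  h_k / h_tot * (1 - exp(-h_tot * t))
}

# Hypothetical hazards per year (illustrative only, not estimated from the data)
h_si_ww   <- 0.10   # SI appearance, WW
h_aids_ww <- 0.10   # AIDS, WW
hr_si     <- 0.78   # WM vs WW for SI appearance (value quoted in the text)
hr_aids   <- 0.30   # WM vs WW for AIDS (assumed for illustration)

t_eval <- 12                       # years from HIV infection
ffs <- c(0, 0.5, 1, 1.5, 2, 4)     # multiplication factors for the AIDS hazard

res <- data.frame(
  factor = ffs,
  WW = cif_const(t_eval, h_si_ww, ffs * h_aids_ww),
  WM = cif_const(t_eval, hr_si * h_si_ww, ffs * hr_aids * h_aids_ww)
)
res$WM_minus_WW <- res$WM - res$WW
print(round(res, 3))
```

With factor = 0 the WW group has the higher probability of SI appearance at 12 years, as the hazard ratio suggests; with factor = 4 the sign of the difference flips, even though the cause-specific hazard ratio for SI appearance is unchanged.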
The use of long format, in particular in combination with the use of cause-specific dummies (ccr5.1 and ccr5.2 in our example) and stratified Cox regression offers great flexibility in modelling the effect of covariates on the cause-specific intensity rates, while using standard statistical software.
Several authors have suggested that robust estimates of standard errors should be used in order to correct for the correlation caused by multiplication of the data set. However, each individual still has at most one event, so that standard estimates of the standard error do suffice.
In order to avoid the highly nonlinear effects of covariates on the cumulative incidence functions when modelling is done on the cause-specific hazards, Fine and Gray introduced a way to regress directly on cumulative incidence functions. In analogy with the relation between hazard and survival, they defined a subdistribution hazard \[\bar{h}_k(t)=-\frac{d \log(1-I_k(t))}{dt}\]
This is not the cause-specific hazard. In terms of estimates of this quantity, the difference is in the risk set. For the cause-specific hazard, the risk set decreases at each time point at which there is a failure of another cause. For \(\bar{h}_k(t)\), persons who fail from another cause remain in the risk set. If there is no censoring, they remain in the risk set forever and once these individuals are given a censoring time that is larger than all event times, the analysis becomes completely standard. If there is censoring, they remain in the risk set until their potential censoring time, which is not observed if they experienced another event before. With administrative censoring, the potential censoring time is still known. If individuals may also be lost to follow-up, a censoring distribution is estimated from the data. Fine and Gray imposed a proportional hazards assumption on the subdistribution hazards: \[\bar{h}_k(t|x)=\bar{h}_{k,0}(t)\exp(\beta_k x)\]
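The difference between the two risk sets can be made concrete with a small example. The sketch below uses hypothetical toy data without censoring (the vectors `time` and `cause` are invented for illustration). It keeps individuals who failed from the other cause in the subdistribution risk set forever, and checks that the product-limit estimate built on that risk set reproduces the empirical cumulative incidence.

```r
# Hypothetical toy data without censoring: event times and causes (1 or 2)
time  <- c(1, 2, 3, 4, 5, 6, 7, 8)
cause <- c(1, 2, 1, 2, 2, 1, 1, 2)

k <- 1                                   # cause of interest
event_times <- sort(time[cause == k])

# Cause-specific risk set: everyone still event-free just before t
n_cs <- sapply(event_times, function(t) sum(time >= t))

# Subdistribution risk set: additionally keep those who already failed
# from another cause (they never leave when there is no censoring)
n_sd <- sapply(event_times, function(t) sum(time >= t | (time < t & cause != k)))

# Number of cause-k events at each event time
d <- sapply(event_times, function(t) sum(time == t & cause == k))

# Product-limit estimate based on the subdistribution hazard
cif_sd <- 1 - cumprod(1 - d / n_sd)

# Empirical cumulative incidence: proportion with a cause-k event by time t
cif_emp <- sapply(event_times, function(t) mean(time <= t & cause == k))

print(data.frame(time = event_times, n_cs, n_sd, cif_sd, cif_emp))
```

The cause-specific risk set shrinks at every failure, while the subdistribution risk set only loses individuals who fail from cause `k`, which is why the two columns `cif_sd` and `cif_emp` agree.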
Estimation follows the partial likelihood approach used in a standard Cox model. In a later paper, Fine extended this idea to other link functions using an estimating equations approach.
Fine and Gray regression on cumulative incidence functions is not implemented in mstate, but in the R package cmprsk. Using the R library cmprsk we obtain the following results (after removing the five subjects with missing CCR5 covariate values and making ccr5 numeric).
library(cmprsk)
sic <- si[!is.na(si$ccr5),]
ftime <- sic$time
fstatus <- sic$status
cov <- as.numeric(sic$ccr5)-1
# for failures of type 1 (AIDS)
z1 <- crr(ftime,fstatus,cov)
z1
## convergence: TRUE
## coefficients:
## cov1
## -1.004
## standard errors:
## [1] 0.295
## two-sided p-values:
## cov1
## 0.00066
# for failures of type 2 (SI)
z2 <- crr(ftime,fstatus,cov,failcode=2)
z2
## convergence: TRUE
## coefficients:
## cov1
## 0.02359
## standard errors:
## [1] 0.2266
## two-sided p-values:
## cov1
## 0.92
z1.pr <- predict(z1,matrix(c(0,1),2,1))
# this will contain predicted cum inc curves, both for WW (2nd column) and WM (3rd)
z2.pr <- predict(z2,matrix(c(0,1),2,1))
# Standard plots, not shown
par(mfrow=c(1,2))
plot(z1.pr,lty=1,lwd=2,color=c(8,1))
plot(z2.pr,lty=1,lwd=2,color=c(8,1))
par(mfrow=c(1,1))
## AIDS
n1 <- nrow(z1.pr) # remove last jump
plot(c(0,z1.pr[-n1,1]),c(0,z1.pr[-n1,2]),type="s",ylim=c(0,0.5),
xlab="Years from HIV infection",ylab="Probability",lwd=2)
lines(c(0,z1.pr[-n1,1]),c(0,z1.pr[-n1,3]),type="s",lwd=2,col=8)
title(main="AIDS")
text(9.3,0.35,"WW",adj=0,cex=0.75)
text(9.3,0.14,"WM",adj=0,cex=0.75)
## SI appearance
n2 <- nrow(z2.pr) # again remove last jump
plot(c(0,z2.pr[-n2,1]),c(0,z2.pr[-n2,2]),type="s",ylim=c(0,0.5),
xlab="Years from HIV infection",ylab="Probability",lwd=2)
lines(c(0,z2.pr[-n2,1]),c(0,z2.pr[-n2,3]),type="s",lwd=2,col=8)
title(main="SI appearance")
text(7.9,0.28,"WW",adj=0,cex=0.75)
text(7.9,0.31,"WM",adj=0,cex=0.75)
This figure represents cumulative incidence functions for AIDS (left) and SI appearance (right), for CCR5 wild-type WW and mutant WM, based on the Fine and Gray model.
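The coefficients reported by crr are log subdistribution hazard ratios. As a minimal sketch of how they might be read, the code below copies the estimates and standard errors from the output above, exponentiates them, and adds Wald-type 95% confidence intervals. It also uses the relation implied by the proportional subdistribution hazards model, \(1 - I_k(t|x=1) = (1 - I_k(t|x=0))^{\exp(\beta_k)}\); the reference value `cif_ww` is hypothetical.

```r
# Estimates copied from the crr output shown earlier
fg <- data.frame(
  cause = c("AIDS", "SI"),
  coef  = c(-1.004, 0.02359),
  se    = c(0.295, 0.2266)
)

# Subdistribution hazard ratios (WM vs WW) with Wald-type 95% intervals
z <- qnorm(0.975)
fg$shr   <- exp(fg$coef)
fg$lower <- exp(fg$coef - z * fg$se)
fg$upper <- exp(fg$coef + z * fg$se)
print(fg, digits = 3)

# Mapping a WW cumulative incidence of AIDS to the WM value under the model
cif_ww <- 0.30                               # hypothetical reference value
cif_wm <- 1 - (1 - cif_ww)^fg$shr[fg$cause == "AIDS"]
cif_wm
```

Because the model acts directly on the cumulative incidence, a subdistribution hazard ratio below 1 always translates into a lower cumulative incidence, which is not guaranteed for a cause-specific hazard ratio.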
To judge the “fit” of the cause-specific and Fine & Gray regression models we estimate cumulative incidence curves nonparametrically, separately for the two subgroups of WW and WM CCR5 genotypes. Here we can use the group argument of Cuminc.
ci <- Cuminc(si$time, si$status, group = si$ccr5)
ci.WW <- ci[ci$group == "WW", ]
ci.WM <- ci[ci$group == "WM", ]
idx1 <- (ci.WW$time < 13)
idx2 <- (ci.WM$time < 13)
plot(c(0, ci.WW$time[idx1]), c(0, ci.WW$CI.1[idx1]), type = "s",
ylim = c(0, 0.5), xlab = "Years from HIV infection", ylab = "Probability",
lwd = 2)
lines(c(0, ci.WM$time[idx2]), c(0, ci.WM$CI.1[idx2]), type = "s",
lwd = 2, col = 8)
title(main = "AIDS")
text(9.3, 0.35, "WW", adj = 0, cex = 0.75)
text(9.3, 0.11, "WM", adj = 0, cex = 0.75)
plot(c(0, ci.WW$time[idx1]), c(0, ci.WW$CI.2[idx1]), type = "s",
ylim = c(0, 0.5), xlab = "Years from HIV infection", ylab = "Probability",
lwd = 2)
lines(c(0, ci.WM$time[idx2]), c(0, ci.WM$CI.2[idx2]), type = "s",
lwd = 2, col = 8)
title(main = "SI appearance")
text(7.9, 0.32, "WW", adj = 0, cex = 0.75)
text(7.9, 0.245, "WM", adj = 0, cex = 0.75)
This figure shows non-parametric cumulative incidence functions for AIDS and SI appearance, for CCR5 wild-type WW and mutant WM.
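For readers who want to see what a non-parametric cumulative incidence estimate involves, here is a minimal base R sketch of the standard estimator \(\hat{I}_k(t) = \sum_{t_j \le t} \hat{S}(t_j-)\, d_{kj}/n_j\). It is not necessarily how Cuminc is implemented internally; the function `cuminc_np` and the toy data are hypothetical and only serve as illustration.

```r
# Hypothetical toy data: status 0 = censored, 1 and 2 = competing causes
time   <- c(1, 2, 2, 3, 4, 5, 6, 7)
status <- c(1, 2, 0, 1, 2, 0, 1, 2)

cuminc_np <- function(time, status, cause) {
  tt    <- sort(unique(time[status != 0]))          # distinct event times
  n     <- sapply(tt, function(t) sum(time >= t))   # at risk just before t
  d_all <- sapply(tt, function(t) sum(time == t & status != 0))
  d_k   <- sapply(tt, function(t) sum(time == t & status == cause))
  surv  <- cumprod(1 - d_all / n)                   # overall survival S(t)
  surv_before <- c(1, surv)[seq_along(tt)]          # S(t-), just before t
  data.frame(time = tt, surv = surv, CI = cumsum(surv_before * d_k / n))
}

ci1 <- cuminc_np(time, status, cause = 1)
ci2 <- cuminc_np(time, status, cause = 2)

# At every event time the state probabilities add up to one
print(data.frame(time = ci1$time, surv = ci1$surv,
                 CI.1 = ci1$CI, CI.2 = ci2$CI,
                 total = ci1$surv + ci1$CI + ci2$CI))
```

The increments use the overall survival just before each event time, so a failure from one cause lowers the future increments for the other cause as well.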
As far as we know the Fine and Gray regression does not yet allow the flexibility (e.g. in testing for or assuming equality of covariate effects across different causes) of regression on cause-specific hazards. Also, it is not clear how left truncated data or time-dependent covariates can be included in their approach.
Goran Brostrom, Event History Analysis with R; Kleinbaum and Klein, Survival Analysis↩︎
This section is a summary from Cleves et al, An Introduction to Survival Analysis Using Stata↩︎
Cleves MA. An Introduction to Survival Analysis Using Stata. 3rd ed. Stata Press; 2010.↩︎
This section is a summary from 1) Cleves et al, An Introduction to Survival Analysis Using Stata, 2) Goran Brostrom, Event History Analysis with R, and 3) Kleinbaum and Klein, Survival Analysis↩︎
Goran Brostrom, Event History Analysis with R↩︎
This section is a summarized excerpt from Goran Brostrom, Event History Analysis with R↩︎
This section is a summarized excerpt from Goran Brostrom, Event History Analysis with R and Kleinbaum and Klein, Survival Analysis↩︎
This section is an excerpt and summary from Kleinbaum and Klein, Survival Analysis↩︎
Moore, Dirk. 2016. Applied Survival Analysis Using R↩︎
For more details, see Therneau et al., 2020. Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model (https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf)↩︎
Kleinbaum and Klein, Survival Analysis: A Self-Learning Text, \(3^{rd}\) edition↩︎
Moore, Dirk. 2016. Applied Survival Analysis Using R↩︎
Moore, Dirk. 2016. Applied Survival Analysis Using R↩︎
Moore, Dirk. 2016. Applied Survival Analysis Using R; Kleinbaum and Klein, Survival Analysis: A Self-Learning Text, \(3^{rd}\) edition↩︎
Right truncation is another form of length-biased sampling, but it is much more difficult to accommodate than left truncation.↩︎
Pencina MJ, Larson MG, D’Agostino RB. Choice of time scale and its effect on significance of predictors in longitudinal studies. Statistics in Medicine. 2007;26(6):1343-1359. doi:https://doi.org/10.1002/sim.2699↩︎
Moore, Dirk. 2016. Applied Survival Analysis Using R↩︎
Allison, P. Survival Analysis Using SAS. \(2^{nd}\) edition.↩︎
Germán Rodríguez, Survival Analysis, https://data.princeton.edu/pop509/recid1↩︎
Germán Rodríguez, Survival Analysis, https://data.princeton.edu/pop509/recid3↩︎
Agresti, Categorical Data Analysis; Germán Rodríguez, Survival Analysis, https://data.princeton.edu/pop509/recid3↩︎
Hosmer, Lemeshow, and May. Applied Survival Analysis↩︎
Conditional logistic regression. https://rdrr.io/cran/survival/man/clogit.html↩︎
Moore DF. Applied Survival Analysis Using R. Springer International Publishing; 2016. doi:10.1007/978-3-319-31245-3. Revised details on the R code are available in Putter H. 2020. Tutorial in biostatistics: Competing risks and multi-state models. Analyses using the mstate package. (https://cran.r-project.org/web/packages/mstate/vignettes/Tutorial.pdf)↩︎
Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Statistics in Medicine. 2007;26(11):2389-2430. doi:https://doi.org/10.1002/sim.2712; Putter H. Tutorial in biostatistics: Competing risks and multi-state models Analyses using the mstate package.↩︎
For theoretical background, please refer to Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Statistics in Medicine. 2007;26(11):2389-2430. doi:https://doi.org/10.1002/sim.2712; Putter H. Tutorial in biostatistics: Competing risks and multi-state models Analyses using the mstate package. :64.↩︎