Example: Lung data

The “lung” dataset contains data on patients with advanced lung cancer from the North Central Cancer Treatment Group.
Key Variables:
- time: Survival time in days.
- status: Censoring status (1 = censored, 2 = dead).
- age: Age in years.
- sex: Male (1) or Female (2).
- ph.ecog: ECOG performance score (0 = good to 5 = dead).
- Other variables include Karnofsky performance scores and calorie intake.
The dataset is commonly used to illustrate survival analysis techniques, including Kaplan-Meier survival curves and Cox proportional hazards models, and to explore factors associated with survival among lung cancer patients.
The outcome is defined by two variables together: time, the survival duration, and status, which indicates whether the event of interest (death) occurred or the observation was censored. Performance scores (ph.ecog, ph.karno, and pat.karno) reflect the functional status of patients and are important predictors of survival.
Key Notes:
- Censoring (status): Observations with a status of “1” are censored, meaning that the patient was still alive at their last follow-up.
- ECOG and Karnofsky Scores: These scores assess a patient’s ability to perform daily activities and are commonly used as prognostic indicators in cancer studies.
- Dietary Intake (meal.cal) and Weight Loss (wt.loss): These variables may reflect nutritional status, which can influence survival outcomes.
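
As a minimal sketch of how this coding is handled in R (assuming the survival package is installed and provides lung), the deaths and censored observations can be counted and the two outcome variables combined with Surv(), which accepts the 1/2 coding and treats 2 as the event:

```r
library(survival)

# status is coded 1 = censored, 2 = dead
n_events   <- sum(lung$status == 2)
n_censored <- sum(lung$status == 1)
c(events = n_events, censored = n_censored)

# Combine time and status into one survival outcome;
# censored times print with a trailing "+"
head(Surv(lung$time, lung$status))
```

The counts should match the table of the recoded death indicator shown later in this section.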

Setup

#install.packages("survival")
library(survival)
library(eha)
library(tidyverse)      # For data manipulation

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(summarytools)  # Optional: For detailed summary tables

## Warning in fun(libname, pkgname): couldn't connect to display ":0"

## system might not have X11 capabilities; in case of errors when using dfSummary(), set st_options(use.x11 = FALSE)
## 
## Attaching package: 'summarytools'
## 
## The following object is masked from 'package:tibble':
## 
##     view

library(tinytex)
library(flexsurv)

## 
## Attaching package: 'flexsurv'
## 
## The following objects are masked from 'package:eha':
## 
##     dgompertz, dllogis, hgompertz, Hgompertz, hllogis, Hllogis, hlnorm,
##     Hlnorm, hweibull, Hweibull, pgompertz, pllogis, qgompertz, qllogis,
##     rgompertz, rllogis

# Copy the lung data from the survival package into the workspace
lung <- lung
descr(lung) # descriptive statistics from summarytools

## Descriptive Statistics  
## lung  
## N: 228  
## 
##                        age     inst   meal.cal   pat.karno   ph.ecog   ph.karno      sex   status
## ----------------- -------- -------- ---------- ----------- --------- ---------- -------- --------
##              Mean    62.45    11.09     928.78       79.96      0.95      81.94     1.39     1.72
##           Std.Dev     9.07     8.30     402.17       14.62      0.72      12.33     0.49     0.45
##               Min    39.00     1.00      96.00       30.00      0.00      50.00     1.00     1.00
##                Q1    56.00     3.00     635.00       70.00      0.00      70.00     1.00     1.00
##            Median    63.00    11.00     975.00       80.00      1.00      80.00     1.00     2.00
##                Q3    69.00    16.00    1150.00       90.00      1.00      90.00     2.00     2.00
##               Max    82.00    33.00    2600.00      100.00      3.00     100.00     2.00     2.00
##               MAD     9.64     8.90     296.52       14.83      1.48      14.83     0.00     0.00
##               IQR    13.00    13.00     515.00       20.00      1.00      15.00     1.00     1.00
##                CV     0.15     0.75       0.43        0.18      0.75       0.15     0.35     0.26
##          Skewness    -0.37     0.66       1.00       -0.60      0.14      -0.57     0.43    -0.99
##       SE.Skewness     0.16     0.16       0.18        0.16      0.16       0.16     0.16     0.16
##          Kurtosis    -0.40    -0.22       3.35        0.13     -0.85      -0.20    -1.82    -1.02
##           N.Valid   228.00   227.00     181.00      225.00    227.00     227.00   228.00   228.00
##                 N   228.00   228.00     228.00      228.00    228.00     228.00   228.00   228.00
##         Pct.Valid   100.00    99.56      79.39       98.68     99.56      99.56   100.00   100.00
## 
## Table: Table continues below
## 
##  
## 
##                        time   wt.loss
## ----------------- --------- ---------
##              Mean    305.23      9.83
##           Std.Dev    210.65     13.14
##               Min      5.00    -24.00
##                Q1    166.50      0.00
##            Median    255.50      7.00
##                Q3    399.00     16.00
##               Max   1022.00     68.00
##               MAD    160.86     10.38
##               IQR    229.75     15.75
##                CV      0.69      1.34
##          Skewness      1.08      1.17
##       SE.Skewness      0.16      0.17
##          Kurtosis      0.86      2.33
##           N.Valid    228.00    214.00
##                 N    228.00    228.00
##         Pct.Valid    100.00     93.86

Modeling

OLS

# Perform OLS regression with survival time as the dependent variable
ols_model <- lm(time ~ age  , data = lung)

# Summary of the OLS model
summary(ols_model)

## 
## Call:
## lm(formula = time ~ age, data = lung)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -295.61 -143.37  -57.01  107.12  737.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  418.375     97.147   4.307 2.47e-05 ***
## age           -1.812      1.540  -1.177    0.241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 210.5 on 226 degrees of freedom
## Multiple R-squared:  0.006091,   Adjusted R-squared:  0.001693 
## F-statistic: 1.385 on 1 and 226 DF,  p-value: 0.2405

Logistic

# Recode the 'status' variable: 1 = censored (0), 2 = dead (1)
lung$death <- ifelse(lung$status == 2, 1, 0)

# Perform logistic regression with survival status as the dependent variable
logistic_model <- glm(death ~ age + sex , data = lung, family = binomial)

# Summary of the logistic regression model
summary(logistic_model)

## 
## Call:
## glm(formula = death ~ age + sex, family = binomial, data = lung)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.51507    1.17422   0.439 0.660916    
## age          0.03189    0.01701   1.875 0.060854 .  
## sex         -1.04839    0.30844  -3.399 0.000676 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 268.78  on 227  degrees of freedom
## Residual deviance: 251.93  on 225  degrees of freedom
## AIC: 257.93
## 
## Number of Fisher Scoring iterations: 4

# Optional: Calculate odds ratios and confidence intervals
exp(cbind(Odds_Ratio = coef(logistic_model), confint(logistic_model)))

## Waiting for profiling to be done...

##             Odds_Ratio     2.5 %     97.5 %
## (Intercept)  1.6737524 0.1701245 17.3552243
## age          1.0324022 0.9986430  1.0678140
## sex          0.3505024 0.1899317  0.6385775
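
The odds ratios above are the exponentiated log-odds coefficients. A small sketch, using the coefficients copied by hand from the summary output (the variable names are illustrative only), shows the conversion and how a coefficient scales for a 10-year age difference:

```r
# Coefficients copied from the logistic regression summary (log-odds scale)
b_age <- 0.03189
b_sex <- -1.04839

or_sex    <- exp(b_sex)        # odds ratio for sex = 2 vs. sex = 1
or_age_10 <- exp(10 * b_age)   # odds ratio for a 10-year age difference

round(c(or_sex = or_sex, or_age_10 = or_age_10), 3)
```

Because the model is linear on the log-odds scale, effects combine multiplicatively on the odds scale.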

# survival is already loaded and lung is already in the workspace.
# Calling data(lung) here only produced a "data set 'lung' not found" warning,
# so the existing lung data frame (including the death column) is reused.

# Convert the status variable to a factor (since logistic regression requires a binary outcome)
lung$status <- as.factor(lung$status)

# Fit the logistic regression model
logistic_model <- glm(status ~ age + time, data = lung, family = binomial)

# Display the summary of the model
summary(logistic_model)

## 
## Call:
## glm(formula = status ~ age + time, family = binomial, data = lung)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.6772741  1.0629153  -0.637   0.5240  
## age          0.0352921  0.0167466   2.107   0.0351 *
## time        -0.0016807  0.0006951  -2.418   0.0156 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 268.78  on 227  degrees of freedom
## Residual deviance: 257.87  on 225  degrees of freedom
## AIC: 263.87
## 
## Number of Fisher Scoring iterations: 4

# Extract the coefficients
coefficients <- coef(logistic_model)

# Convert the coefficients to odds ratios
odds_ratios <- exp(coefficients)

# Display the odds ratios
odds_ratios

## (Intercept)         age        time 
##   0.5079999   1.0359223   0.9983207

table(lung$death)

## 
##   0   1 
##  63 165

Cox

# survival is already loaded; the lung data frame from above is reused
# (data(lung) only produced a "data set 'lung' not found" warning)

# Create a dummy variable for status (1 = dead, 0 = censored)
lung$status_dummy <- ifelse(lung$status == 2, 1, 0)

# Fit the Cox proportional hazards model
cox_model <- coxph(Surv(time, status_dummy) ~ age + sex, data = lung)

# Display the summary of the model
summary(cox_model)

## Call:
## coxph(formula = Surv(time, status_dummy) ~ age + sex, data = lung)
## 
##   n= 228, number of events= 165 
## 
##          coef exp(coef)  se(coef)      z Pr(>|z|)   
## age  0.017045  1.017191  0.009223  1.848  0.06459 . 
## sex -0.513219  0.598566  0.167458 -3.065  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## age    1.0172     0.9831    0.9990    1.0357
## sex    0.5986     1.6707    0.4311    0.8311
## 
## Concordance= 0.603  (se = 0.025 )
## Likelihood ratio test= 14.12  on 2 df,   p=9e-04
## Wald test            = 13.47  on 2 df,   p=0.001
## Score (logrank) test = 13.72  on 2 df,   p=0.001
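
To connect the coefficient table to the hazard-ratio table, the following sketch recomputes the hazard ratio for sex and its Wald 95% confidence interval from the printed coefficient and standard error (values copied by hand from the output above):

```r
# Coefficient and standard error for sex, copied from summary(cox_model)
b_sex  <- -0.513219
se_sex <- 0.167458

hr_sex <- exp(b_sex)                                     # hazard ratio
ci_sex <- exp(b_sex + c(-1, 1) * qnorm(0.975) * se_sex)  # Wald 95% CI

round(c(HR = hr_sex, lower = ci_sex[1], upper = ci_sex[2]), 4)
```

The values should agree with the exp(coef), lower .95, and upper .95 columns above, up to rounding.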

A short example of Kaplan-Meier estimator

Setup

library(survival)

tt <- c(7,6,6,5,2,4)
cens <- c(0,1,0,0,1,1)
Surv(tt,cens)

## [1] 7+ 6  6+ 5+ 2  4

Kaplan-Meier estimator

result.km <- survfit(Surv(tt,cens)~1, conf.type="log-log")
result.km

## Call: survfit(formula = Surv(tt, cens) ~ 1, conf.type = "log-log")
## 
##      n events median 0.95LCL 0.95UCL
## [1,] 6      3      6       2      NA

summary(result.km)

## Call: survfit(formula = Surv(tt, cens) ~ 1, conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     2      6       1    0.833   0.152       0.2731        0.975
##     4      5       1    0.667   0.192       0.1946        0.904
##     6      3       1    0.444   0.222       0.0662        0.785
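
The survival column can be reproduced by hand. This sketch recomputes the Kaplan-Meier estimate from the same six observations using only base R (the helper names n_risk, n_event, and km are introduced here for illustration):

```r
tt   <- c(7, 6, 6, 5, 2, 4)
cens <- c(0, 1, 0, 0, 1, 1)

# Distinct times at which an event (cens == 1) was observed
event_times <- sort(unique(tt[cens == 1]))

# Number at risk just before each event time, and number of events at it
n_risk  <- sapply(event_times, function(t) sum(tt >= t))
n_event <- sapply(event_times, function(t) sum(tt == t & cens == 1))

# Kaplan-Meier estimate: running product of (1 - d_i / n_i)
km <- cumprod(1 - n_event / n_risk)

data.frame(time = event_times, n.risk = n_risk,
           n.event = n_event, survival = round(km, 3))
```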

Plots

KM survival curves

plot(result.km,
     ylab = "Survival probability",
     xlab = "Time",
     mark.time = TRUE,
     main = "KM survival curve")
# Dotted reference line at S(t) = 0.5 (median survival)
abline(h = 0.5, col = "sienna", lty = 3)

KM cumulative hazard curve

plot(result.km,
     ylab = "Cumulative hazard",
     xlab = "Time",
     mark.time = TRUE,
     fun = "cumhaz",
     main = "KM cumulative hazard curve")
# S(t) = 0.5 corresponds to a cumulative hazard of log(2), about 0.69
abline(h = log(2), col = "sienna", lty = 3)

Nelson-Aalen estimator

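# type = "fh" requests the Fleming-Harrington survival estimate, exp(-H(t)),
# where H(t) is the Nelson-Aalen cumulative hazard
result.fh <- survfit(Surv(tt,cens)~1, conf.type="log-log", type="fh")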
result.fh

## Call: survfit(formula = Surv(tt, cens) ~ 1, conf.type = "log-log", 
##     type = "fh")
## 
##      n events median 0.95LCL 0.95UCL
## [1,] 6      3      6       2      NA

summary(result.fh)

## Call: survfit(formula = Surv(tt, cens) ~ 1, conf.type = "log-log", 
##     type = "fh")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     2      6       1    0.846   0.141        0.306        0.977
##     4      5       1    0.693   0.180        0.229        0.913
##     6      3       1    0.497   0.210        0.101        0.807
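
These survival values follow from the Nelson-Aalen cumulative hazard, the running sum of d_i / n_i, transformed as exp(-H(t)). A base-R sketch on the same six observations (helper names are illustrative):

```r
tt   <- c(7, 6, 6, 5, 2, 4)
cens <- c(0, 1, 0, 0, 1, 1)

event_times <- sort(unique(tt[cens == 1]))
n_risk  <- sapply(event_times, function(t) sum(tt >= t))
n_event <- sapply(event_times, function(t) sum(tt == t & cens == 1))

# Nelson-Aalen cumulative hazard and the implied survival estimate
na_cumhaz <- cumsum(n_event / n_risk)
fh_surv   <- exp(-na_cumhaz)

data.frame(time = event_times,
           cumhaz = round(na_cumhaz, 3),
           survival = round(fh_surv, 3))
```

The resulting survival values are slightly higher than the Kaplan-Meier values at every event time, which is consistent with the two summary tables above.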

# NA survival curve
plot(result.fh,
     ylab = "Survival probability",
     xlab = "Time",
     mark.time = TRUE,
     main = "NA survival curve")
abline(h = 0.5, col = "sienna", lty = 3)

# NA cumulative hazard curve
plot(result.fh,
     ylab = "Cumulative hazard",
     xlab = "Time",
     mark.time = TRUE,
     fun = "cumhaz",
     main = "NA cumulative hazard curve")
# S(t) = 0.5 corresponds to a cumulative hazard of log(2), about 0.69
abline(h = log(2), col = "sienna", lty = 3)

Modeling With Addicts dataset

The dataset originates from an Australian study by Caplehorn and Bell (1991) and is featured in the survival analysis textbook by Kleinbaum and Klein. The primary goal is to analyze factors influencing retention time in methadone clinics, such as clinic type, methadone dose, and prison history. This dataset is widely used for illustrating Kaplan-Meier survival curves, Cox proportional hazards models, and other survival analysis techniques.

We are interested in either 1) the hazard of dropping out of the clinic or 2) the time (in days) until the person dropped out of the clinic or was censored. The predictor of interest is CLINIC (coded 1 or 2) for two methadone clinics for heroin addicts. Covariates include DOSE (continuous) for methadone dose (mg/day) and PRISON (coded 1 if the patient has a prison record and 0 if not).

Description of Variables

ID (id): Each patient has a unique numeric ID to distinguish them in the dataset.
Clinic (clinic): Indicates which methadone clinic the patient attended. Values are coded as: 1: Clinic 1; 2: Clinic 2
Status (status): Represents whether the patient experienced the event of interest (dropout) or was censored. Values are coded as:
- 0: Censored (the patient remained in treatment at the end of the study period).
- 1: Event occurred (the patient dropped out of treatment).
Survival Time (survt): The time (in days) from admission to either dropout or censoring. This is the primary time-to-event variable used for survival analysis.
Prison Record (prison): Indicates whether the patient had a history of incarceration. Values are coded as:
- 1: Yes, has a prison record.
- 0: No, does not have a prison record.
Methadone Dose (dose): The maximum daily dose of methadone prescribed to the patient during treatment, measured in milligrams per day.

# Install and load necessary packages
# install.packages("haven")
library(haven)
library(survival)
library(eha)
library(tidyverse)


# Load the addicts dataset
addicts <- read_dta("http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta")

# Preview the dataset
head(addicts)

## # A tibble: 6 × 6
##      id clinic status survt prison  dose
##   <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
## 1     1      1      1   428      0    50
## 2     2      1      1   275      1    55
## 3     3      1      1   262      0    55
## 4     4      1      1   183      0    30
## 5     5      1      1   259      1    65
## 6     6      1      1   714      0    55

summary(addicts)

##        id             clinic          status           survt       
##  Min.   :  1.00   Min.   :1.000   Min.   :0.0000   Min.   :   2.0  
##  1st Qu.: 65.25   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 171.2  
##  Median :131.50   Median :1.000   Median :1.0000   Median : 367.5  
##  Mean   :134.13   Mean   :1.315   Mean   :0.6303   Mean   : 402.6  
##  3rd Qu.:205.75   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 585.5  
##  Max.   :266.00   Max.   :2.000   Max.   :1.0000   Max.   :1076.0  
##      prison            dose      
##  Min.   :0.0000   Min.   : 20.0  
##  1st Qu.:0.0000   1st Qu.: 50.0  
##  Median :0.0000   Median : 60.0  
##  Mean   :0.4664   Mean   : 60.4  
##  3rd Qu.:1.0000   3rd Qu.: 70.0  
##  Max.   :1.0000   Max.   :110.0

Description

#install.packages("summarytools")

# Load required libraries
library(tidyverse)      # For data manipulation
library(summarytools)  # Optional: For detailed summary tables

# View the structure of the dataset
str(addicts)

## tibble [238 × 6] (S3: tbl_df/tbl/data.frame)
##  $ id    : num [1:238] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "label")= chr "Subject ID"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ clinic: num [1:238] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Coded 1 or 2"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ status: num [1:238] 1 1 1 1 1 1 1 0 1 1 ...
##   ..- attr(*, "label")= chr "status (0=censored, 1=endpoint)"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ survt : num [1:238] 428 275 262 183 259 714 438 796 892 393 ...
##   ..- attr(*, "label")= chr "survival time in days"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ prison: num [1:238] 0 1 0 0 1 0 1 1 0 1 ...
##   ..- attr(*, "label")= chr "0=none, 1=prison record"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ dose  : num [1:238] 50 55 55 30 65 55 65 60 50 65 ...
##   ..- attr(*, "label")= chr "methadone dose (mg/day)"
##   ..- attr(*, "format.stata")= chr "%10.0g"

# Basic summary statistics for all variables
summary(addicts)

##        id             clinic          status           survt       
##  Min.   :  1.00   Min.   :1.000   Min.   :0.0000   Min.   :   2.0  
##  1st Qu.: 65.25   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 171.2  
##  Median :131.50   Median :1.000   Median :1.0000   Median : 367.5  
##  Mean   :134.13   Mean   :1.315   Mean   :0.6303   Mean   : 402.6  
##  3rd Qu.:205.75   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 585.5  
##  Max.   :266.00   Max.   :2.000   Max.   :1.0000   Max.   :1076.0  
##      prison            dose      
##  Min.   :0.0000   Min.   : 20.0  
##  1st Qu.:0.0000   1st Qu.: 50.0  
##  Median :0.0000   Median : 60.0  
##  Mean   :0.4664   Mean   : 60.4  
##  3rd Qu.:1.0000   3rd Qu.: 70.0  
##  Max.   :1.0000   Max.   :110.0

# Descriptive statistics for numeric variables
numeric_summary <- addicts %>%
  select_if(is.numeric) %>%
  summarise_all(list(
    Mean = mean,
    Median = median,
    SD = sd,
    Min = min,
    Max = max,
    NAs = ~sum(is.na(.))
  ))
print(numeric_summary)

## # A tibble: 1 × 36
##   id_Mean clinic_Mean status_Mean survt_Mean prison_Mean dose_Mean id_Median
##     <dbl>       <dbl>       <dbl>      <dbl>       <dbl>     <dbl>     <dbl>
## 1    134.        1.32       0.630       403.       0.466      60.4      132.
## # ℹ 29 more variables: clinic_Median <dbl>, status_Median <dbl>,
## #   survt_Median <dbl>, prison_Median <dbl>, dose_Median <dbl>, id_SD <dbl>,
## #   clinic_SD <dbl>, status_SD <dbl>, survt_SD <dbl>, prison_SD <dbl>,
## #   dose_SD <dbl>, id_Min <dbl>, clinic_Min <dbl>, status_Min <dbl>,
## #   survt_Min <dbl>, prison_Min <dbl>, dose_Min <dbl>, id_Max <dbl>,
## #   clinic_Max <dbl>, status_Max <dbl>, survt_Max <dbl>, prison_Max <dbl>,
## #   dose_Max <dbl>, id_NAs <int>, clinic_NAs <int>, status_NAs <int>, …

# Frequency tables for categorical variables (clinic and prison)
table(addicts$clinic)

## 
##   1   2 
## 163  75

table(addicts$prison)

## 
##   0   1 
## 127 111

# Optional: Use summarytools for a detailed report
library(summarytools)

# Generate a detailed descriptive statistics table
descr(addicts)

## Descriptive Statistics  
## addicts  
## N: 238  
## 
##                     clinic     dose       id   prison   status     survt
## ----------------- -------- -------- -------- -------- -------- ---------
##              Mean     1.32    60.40   134.13     0.47     0.63    402.57
##           Std.Dev     0.47    14.45    79.29     0.50     0.48    267.85
##               Min     1.00    20.00     1.00     0.00     0.00      2.00
##                Q1     1.00    50.00    65.00     0.00     0.00    170.00
##            Median     1.00    60.00   131.50     0.00     1.00    367.50
##                Q3     2.00    70.00   206.00     1.00     1.00    587.00
##               Max     2.00   110.00   266.00     1.00     1.00   1076.00
##               MAD     0.00    14.83   104.52     0.00     0.00    306.16
##               IQR     1.00    20.00   140.50     1.00     1.00    414.25
##                CV     0.35     0.24     0.59     1.07     0.77      0.67
##          Skewness     0.79     0.26    -0.01     0.13    -0.54      0.37
##       SE.Skewness     0.16     0.16     0.16     0.16     0.16      0.16
##          Kurtosis    -1.38     0.08    -1.32    -1.99    -1.72     -0.87
##           N.Valid   238.00   238.00   238.00   238.00   238.00    238.00
##                 N   238.00   238.00   238.00   238.00   238.00    238.00
##         Pct.Valid   100.00   100.00   100.00   100.00   100.00    100.00

GLM: linear regression

# Named lm_fit rather than lm to avoid masking the base lm() function
lm_fit <- glm(survt ~ clinic + dose + prison, data = addicts, family = "gaussian")

b_lm <- coef(lm_fit)

summary(lm_fit)

## 
## Call:
## glm(formula = survt ~ clinic + dose + prison, family = "gaussian", 
##     data = addicts)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -109.405     76.216  -1.435  0.15249    
## clinic        86.527     33.725   2.566  0.01092 *  
## dose           7.244      1.087   6.666 1.87e-10 ***
## prison       -84.357     31.072  -2.715  0.00712 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 57139.85)
## 
##     Null deviance: 17003606  on 237  degrees of freedom
## Residual deviance: 13370725  on 234  degrees of freedom
## AIC: 3288.3
## 
## Number of Fisher Scoring iterations: 2

Model specification

\[ y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \epsilon, \;\; \epsilon \sim N(0, \sigma^2) \]

Estimated equation

time to dropout ($t$) = (-109.4045583) + (86.5268733)$\times$clinic + (7.2438949)$\times$dose + (-84.3568952)$\times$prison

Interpretations
- What is the metric of $y, b_0, b_1,$ and $b_2$, respectively?
- What is the interpretation when 1) $b_i=0$, 2) $b_i<0$, or 3) $b_i>0$?
- How would you compare the time to dropout between two groups of people below? Is it additive or multiplicative?
  - What is the estimated time to dropout ($t_{x_1=1}$) for those with clinic=1, dose=50, and prison=0?
  - What is the estimated time to dropout ($t_{x_1=2}$) for those with clinic=2, dose=50, and prison=0?
  - What is the estimated time to dropout ($t_{x_1=1}$) for those with clinic=1, dose=50, and prison=1?
  - What is the estimated time to dropout ($t_{x_1=2}$) for those with clinic=2, dose=50, and prison=1?
- Compare the following two groups
  - A group with clinic=1, dose=10, and prison=1 vs. another group with clinic=1, dose=50, and prison=1
  - A group with clinic=1, dose=10, and prison=1 vs. another group with clinic=1, dose=10, and prison=0
- Interpret $b_0, b_1,$ and $b_2$, respectively.
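
One way to work through the group comparisons is to plug the covariate values into the estimated equation. The sketch below uses the coefficients copied by hand from the linear model summary above (rounded as printed); the object names are illustrative:

```r
# Coefficients copied from the linear model summary (days scale)
b <- c(intercept = -109.405, clinic = 86.527, dose = 7.244, prison = -84.357)

# The four groups in the questions: clinic 1 or 2, dose 50, prison 0 or 1
groups <- expand.grid(clinic = 1:2, dose = 50, prison = 0:1)
groups$pred_days <- b[["intercept"]] +
  b[["clinic"]] * groups$clinic +
  b[["dose"]]   * groups$dose +
  b[["prison"]] * groups$prison
groups

# Clinic 2 minus clinic 1, computed separately within each prison group
clinic_gap <- with(groups, pred_days[clinic == 2] - pred_days[clinic == 1])
clinic_gap
```

The gap between clinics is the same in both prison groups and equals the clinic coefficient, which is what an additive comparison means in this model.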
The following model includes an interaction term between clinic and prison. In this model specification, prison is a (mediator, moderator) of the association between clinic and time to dropout.

lm_int <- glm(survt~ clinic + dose + prison + clinic*prison, 
                  data=addicts, 
                  family = "gaussian")
b_lm_int = coef(lm_int)
summary(lm_int)

## 
## Call:
## glm(formula = survt ~ clinic + dose + prison + clinic * prison, 
##     family = "gaussian", data = addicts)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -60.688     84.631  -0.717   0.4740    
## clinic          44.359     46.492   0.954   0.3410    
## dose             7.350      1.088   6.756 1.12e-10 ***
## prison        -200.229     93.392  -2.144   0.0331 *  
## clinic:prison   87.989     66.891   1.315   0.1897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 56962.08)
## 
##     Null deviance: 17003606  on 237  degrees of freedom
## Residual deviance: 13272164  on 233  degrees of freedom
## AIC: 3288.5
## 
## Number of Fisher Scoring iterations: 2

State the estimated equation: time to dropout ($t$) = (-60.6876553) + (44.3587802)$\times$clinic + (7.3504292)$\times$dose + (-200.2294217)$\times$prison + (87.9891836)$\times$clinic*prison
Compare the time to dropout between clinic=1 ($t_{x_1=1}$) and clinic=2 ($t_{x_1=2}$) with consideration of an interaction term.
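
With the interaction term, the difference between clinics is no longer a single number: it depends on prison. A short sketch using the coefficients copied from the interaction model output above (object names are illustrative):

```r
# Coefficients copied from summary(lm_int)
b_clinic        <- 44.359
b_clinic_prison <- 87.989

# Difference in predicted days, clinic 2 vs. clinic 1, at each prison value
prison <- c(0, 1)
clinic_effect <- b_clinic + b_clinic_prison * prison

data.frame(prison, clinic_effect)
```

Note that the interaction coefficient has a large standard error in the output above, so the apparent difference between the two prison groups is estimated imprecisely.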

GLM: Logistic regression

logi <- glm(status~ clinic + dose + prison, 
                data=addicts, 
                family = "binomial")
b_logi = coef(logi)
summary(logi)

## 
## Call:
## glm(formula = status ~ clinic + dose + prison, family = "binomial", 
##     data = addicts)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.22797    0.78182   5.408 6.38e-08 ***
## clinic      -1.54175    0.30493  -5.056 4.28e-07 ***
## dose        -0.02630    0.01048  -2.509   0.0121 *  
## prison      -0.04155    0.29257  -0.142   0.8871    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 313.60  on 237  degrees of freedom
## Residual deviance: 276.33  on 234  degrees of freedom
## AIC: 284.33
## 
## Number of Fisher Scoring iterations: 4

Model specification

\[\ln\left(\frac{p}{1-p}\right)=b_0 + b_1x_1+b_2x_2+b_3x_3, \;\;\;\; y \sim Bernoulli(p), \;\; p = P(y=1)\]

Estimated equation

\[\ln(\frac{p(DropOut)}{1-p(DropOut)}) = (4.2279749) + (-1.5417485)\times clinic + (-0.026296) \times dose + (-0.0415542) \times prison\]

Interpretations

What is the metric of $y, b_0, b_1,$ and $b_2$, respectively?
What is the difference between coefficients and $\exp$(coefficients)? Specify the metric.
What is the interpretation when 1) $b_i=0$, 2) $b_i<0$, or 3) $b_i>0$?
What is the interpretation when 1) $\exp(b_i)=1$, 2) $\exp(b_i)<1$ or 3) $\exp(b_i)>1$?
How would you compare the odds of dropout between two groups of people below? Is it additive or multiplicative?
- What are the estimated odds of dropout ($\frac{p(DropOut)}{1-p(DropOut)}$) for those with clinic=1, dose=50, and prison=1?
- What are the estimated odds of dropout ($\frac{p(DropOut)}{1-p(DropOut)}$) for those with clinic=2, dose=50, and prison=1?
Compare the following two groups:
- A group with clinic=1, dose=100, and prison=1 vs. another group with clinic=1, dose=200, and prison=1
- A group with clinic=2, dose=100, and prison=1 vs. another group with clinic=2, dose=100, and prison=0
Interpret $b_0, b_1,$ and $b_2$, respectively.
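
As a sketch of the odds comparison, the estimated equation can be evaluated for the two groups and exponentiated. Coefficients are copied by hand from the logistic model summary above; log_odds() is a hypothetical helper defined only for this example:

```r
# Coefficients copied from summary(logi) (log-odds scale)
b <- c(intercept = 4.22797, clinic = -1.54175, dose = -0.02630, prison = -0.04155)

# Linear predictor (log odds of dropout) for given covariate values
log_odds <- function(clinic, dose, prison) {
  b[["intercept"]] + b[["clinic"]] * clinic +
    b[["dose"]] * dose + b[["prison"]] * prison
}

odds_c1 <- exp(log_odds(clinic = 1, dose = 50, prison = 1))
odds_c2 <- exp(log_odds(clinic = 2, dose = 50, prison = 1))

# The ratio of the two odds equals exp(b_clinic), whatever dose and prison are
c(odds_c1 = odds_c1, odds_c2 = odds_c2, ratio = odds_c2 / odds_c1)
exp(b[["clinic"]])
```

The comparison is multiplicative on the odds scale: moving from clinic 1 to clinic 2 multiplies the odds by exp(b_1), while on the log-odds scale the same change is additive.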
The following model includes an interaction term between clinic and prison. In this model specification, prison is a (mediator, moderator) of the association between clinic and the odds of dropout.

logi_int <- glm(status~ clinic + dose + prison + clinic*prison, 
                data=addicts, 
                family = "binomial")
b_logi_int = coef(logi_int)
summary(logi_int)

## 
## Call:
## glm(formula = status ~ clinic + dose + prison + clinic * prison, 
##     family = "binomial", data = addicts)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    4.11777    0.85788   4.800 1.59e-06 ***
## clinic        -1.45279    0.42049  -3.455  0.00055 ***
## dose          -0.02646    0.01048  -2.524  0.01159 *  
## prison         0.21200    0.88104   0.241  0.80985    
## clinic:prison -0.18671    0.61194  -0.305  0.76028    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 313.60  on 237  degrees of freedom
## Residual deviance: 276.24  on 233  degrees of freedom
## AIC: 286.24
## 
## Number of Fisher Scoring iterations: 4

State the estimated equation: Log odds of dropout ($\ln(\frac{p(DropOut)}{1-p(DropOut)})$) = (4.11777) + (-1.45279)$\times$clinic + (-0.02646)$\times$dose + (0.21200)$\times$prison + (-0.18671)$\times$clinic$\times$prison
Compare the odds of dropout between clinic=1 and clinic=2 with consideration of the interaction term.
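
In the interaction model the clinic odds ratio depends on prison. A minimal sketch, with coefficients copied by hand from the summary(logi_int) output above (object names are illustrative):

```r
# Coefficients copied from summary(logi_int) (log-odds scale)
b_clinic        <- -1.45279
b_clinic_prison <- -0.18671

# Odds ratio for clinic 2 vs. clinic 1 at each value of prison
prison <- c(0, 1)
or_clinic <- exp(b_clinic + b_clinic_prison * prison)

data.frame(prison, or_clinic)
```

The interaction coefficient is small relative to its standard error in the output above, so the two odds ratios are not clearly distinguishable in these data.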

Non-parametric models: KM

# Create a survival object
km_surv_obj <- Surv(time = addicts$survt, event = addicts$status == 1)
km_surv_obj[1:20]

##  [1] 428  275  262  183  259  714  438  796+ 892  393  161+ 836  523  612  212 
## [16] 399  771  514  512  624

# Fit Kaplan-Meier survival curves stratified by clinic
km_fit <- survfit(km_surv_obj ~ clinic, data = addicts, conf.type = "log-log")


# Plot Kaplan-Meier curves using base R
plot(km_fit, col = c("blue", "red"), lty = 1:2, xlab = "Time (days)", ylab = "Survival Probability",
     main = "Kaplan-Meier Survival Curves by Clinic")
legend("bottomleft", legend = c("Clinic 1", "Clinic 2"), col = c("blue", "red"), lty = 1:2)

Model specification

\[ \hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right) \], where $t_i$ are the distinct observed event times, $d_i$ is the number of events at time $t_i$, and $n_i$ is the number of individuals at risk just before $t_i$.

Estimated equation

km_fit

## Call: survfit(formula = km_surv_obj ~ clinic, data = addicts, conf.type = "log-log")
## 
##            n events median 0.95LCL 0.95UCL
## clinic=1 163    122    428     341     512
## clinic=2  75     28     NA     661      NA

summary(km_fit)

## Call: survfit(formula = km_surv_obj ~ clinic, data = addicts, conf.type = "log-log")
## 
##                 clinic=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     7    162       1   0.9938 0.00615      0.95699       0.9991
##    17    161       1   0.9877 0.00868      0.95154       0.9969
##    19    160       1   0.9815 0.01059      0.94369       0.9940
##    29    157       1   0.9752 0.01223      0.93535       0.9906
##    30    156       1   0.9690 0.01366      0.92708       0.9870
##    33    155       1   0.9627 0.01493      0.91892       0.9831
##    35    154       1   0.9565 0.01609      0.91087       0.9790
##    37    153       1   0.9502 0.01716      0.90293       0.9748
##    41    152       1   0.9440 0.01815      0.89509       0.9704
##    47    151       1   0.9377 0.01907      0.88734       0.9660
##    49    150       1   0.9315 0.01994      0.87967       0.9615
##    50    149       1   0.9252 0.02077      0.87207       0.9568
##    59    147       1   0.9189 0.02156      0.86446       0.9521
##    62    146       1   0.9126 0.02231      0.85692       0.9473
##    67    144       1   0.9063 0.02304      0.84937       0.9424
##    75    143       1   0.9000 0.02373      0.84188       0.9375
##    84    142       1   0.8936 0.02440      0.83444       0.9325
##    90    141       1   0.8873 0.02503      0.82704       0.9274
##    95    140       1   0.8809 0.02564      0.81969       0.9224
##    96    139       1   0.8746 0.02623      0.81239       0.9172
##   117    135       1   0.8681 0.02683      0.80491       0.9120
##   126    134       1   0.8616 0.02740      0.79748       0.9067
##   127    133       1   0.8552 0.02795      0.79009       0.9013
##   129    132       1   0.8487 0.02848      0.78274       0.8959
##   136    130       1   0.8422 0.02899      0.77535       0.8905
##   145    129       1   0.8356 0.02950      0.76800       0.8850
##   147    128       1   0.8291 0.02998      0.76068       0.8795
##   150    126       1   0.8225 0.03045      0.75332       0.8739
##   157    124       1   0.8159 0.03092      0.74592       0.8683
##   160    123       1   0.8093 0.03138      0.73856       0.8626
##   167    121       1   0.8026 0.03182      0.73114       0.8569
##   168    120       1   0.7959 0.03225      0.72376       0.8511
##   175    119       1   0.7892 0.03267      0.71641       0.8453
##   176    117       1   0.7824 0.03308      0.70901       0.8394
##   180    116       2   0.7690 0.03385      0.69429       0.8276
##   181    114       1   0.7622 0.03422      0.68697       0.8217
##   183    113       1   0.7555 0.03458      0.67968       0.8158
##   192    112       1   0.7487 0.03492      0.67241       0.8098
##   193    111       1   0.7420 0.03525      0.66517       0.8038
##   204    110       1   0.7352 0.03557      0.65795       0.7977
##   205    108       1   0.7284 0.03589      0.65067       0.7916
##   207    107       1   0.7216 0.03619      0.64341       0.7855
##   209    106       1   0.7148 0.03648      0.63618       0.7794
##   212    104       2   0.7011 0.03705      0.62161       0.7670
##   216    102       1   0.6942 0.03732      0.61436       0.7607
##   223    101       1   0.6873 0.03758      0.60713       0.7545
##   237    100       1   0.6804 0.03783      0.59992       0.7482
##   244     99       1   0.6736 0.03807      0.59273       0.7419
##   247     98       1   0.6667 0.03829      0.58557       0.7356
##   257     97       1   0.6598 0.03851      0.57842       0.7292
##   258     96       1   0.6530 0.03872      0.57129       0.7229
##   259     95       1   0.6461 0.03892      0.56418       0.7165
##   262     94       2   0.6323 0.03928      0.55002       0.7037
##   275     92       1   0.6255 0.03945      0.54296       0.6973
##   293     90       1   0.6185 0.03962      0.53583       0.6908
##   294     89       1   0.6116 0.03978      0.52872       0.6842
##   299     88       1   0.6046 0.03993      0.52163       0.6777
##   302     87       1   0.5977 0.04007      0.51456       0.6712
##   314     86       1   0.5907 0.04020      0.50750       0.6646
##   337     83       1   0.5836 0.04035      0.50026       0.6579
##   341     81       1   0.5764 0.04049      0.49294       0.6511
##   348     78       1   0.5690 0.04063      0.48541       0.6441
##   350     77       1   0.5616 0.04077      0.47791       0.6371
##   358     76       1   0.5542 0.04090      0.47043       0.6301
##   367     75       1   0.5468 0.04102      0.46297       0.6230
##   368     74       1   0.5394 0.04112      0.45554       0.6160
##   376     73       1   0.5321 0.04122      0.44812       0.6089
##   386     72       1   0.5247 0.04130      0.44073       0.6018
##   393     71       1   0.5173 0.04138      0.43336       0.5947
##   394     70       1   0.5099 0.04144      0.42601       0.5876
##   399     69       1   0.5025 0.04149      0.41868       0.5805
##   428     66       1   0.4949 0.04156      0.41111       0.5731
##   434     65       1   0.4873 0.04161      0.40357       0.5657
##   438     64       1   0.4797 0.04165      0.39605       0.5583
##   452     62       1   0.4719 0.04169      0.38841       0.5508
##   457     61       1   0.4642 0.04172      0.38079       0.5433
##   465     59       1   0.4563 0.04175      0.37305       0.5356
##   482     56       1   0.4482 0.04179      0.36501       0.5277
##   489     55       1   0.4400 0.04182      0.35699       0.5198
##   496     54       1   0.4319 0.04183      0.34901       0.5118
##   504     53       1   0.4237 0.04183      0.34106       0.5039
##   512     52       1   0.4156 0.04181      0.33315       0.4958
##   514     51       1   0.4074 0.04177      0.32526       0.4878
##   517     50       1   0.3993 0.04173      0.31740       0.4797
##   518     48       1   0.3910 0.04168      0.30938       0.4715
##   522     47       1   0.3826 0.04161      0.30140       0.4632
##   523     46       2   0.3660 0.04143      0.28553       0.4466
##   532     44       1   0.3577 0.04132      0.27765       0.4383
##   533     43       1   0.3494 0.04119      0.26981       0.4299
##   546     40       1   0.3406 0.04107      0.26153       0.4211
##   550     39       1   0.3319 0.04094      0.25329       0.4124
##   560     38       1   0.3232 0.04078      0.24510       0.4035
##   563     37       1   0.3144 0.04060      0.23695       0.3947
##   581     33       1   0.3049 0.04048      0.22794       0.3852
##   591     31       1   0.2951 0.04035      0.21865       0.3753
##   612     29       2   0.2747 0.04005      0.19953       0.3550
##   624     26       1   0.2641 0.03988      0.18965       0.3444
##   646     25       1   0.2536 0.03966      0.17987       0.3337
##   652     24       1   0.2430 0.03939      0.17020       0.3230
##   667     23       1   0.2325 0.03907      0.16064       0.3122
##   679     22       1   0.2219 0.03869      0.15118       0.3012
##   683     21       1   0.2113 0.03827      0.14184       0.2902
##   714     20       1   0.2008 0.03778      0.13261       0.2791
##   739     19       1   0.1902 0.03724      0.12350       0.2679
##   749     18       1   0.1796 0.03664      0.11451       0.2566
##   755     17       1   0.1691 0.03598      0.10565       0.2452
##   760     16       1   0.1585 0.03525      0.09692       0.2337
##   771     15       1   0.1479 0.03444      0.08834       0.2220
##   774     14       1   0.1374 0.03357      0.07991       0.2102
##   785     13       1   0.1268 0.03260      0.07165       0.1983
##   821     10       2   0.1014 0.03062      0.05164       0.1708
##   836      7       1   0.0869 0.02948      0.04051       0.1556
##   837      6       1   0.0725 0.02790      0.03022       0.1396
##   857      4       1   0.0543 0.02615      0.01784       0.1216
##   892      3       1   0.0362 0.02286      0.00809       0.1017
##   899      2       1   0.0181 0.01717      0.00171       0.0801
## 
##                 clinic=2 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    13     74       1    0.986  0.0134        0.908        0.998
##    26     73       1    0.973  0.0189        0.896        0.993
##    35     72       1    0.959  0.0229        0.880        0.987
##    41     71       1    0.946  0.0263        0.862        0.979
##    79     68       1    0.932  0.0294        0.844        0.971
##   109     66       1    0.918  0.0321        0.826        0.962
##   122     65       1    0.904  0.0346        0.809        0.953
##   143     64       1    0.890  0.0368        0.791        0.943
##   149     62       1    0.875  0.0389        0.774        0.933
##   161     61       1    0.861  0.0408        0.757        0.923
##   170     60       1    0.847  0.0426        0.740        0.912
##   190     59       1    0.832  0.0442        0.723        0.901
##   216     58       1    0.818  0.0457        0.707        0.890
##   231     56       1    0.803  0.0472        0.690        0.879
##   232     55       1    0.789  0.0486        0.674        0.867
##   268     54       2    0.759  0.0510        0.642        0.843
##   280     52       1    0.745  0.0520        0.626        0.831
##   286     51       1    0.730  0.0530        0.610        0.819
##   322     50       1    0.716  0.0539        0.594        0.806
##   366     47       1    0.700  0.0549        0.578        0.794
##   389     45       1    0.685  0.0558        0.561        0.780
##   450     43       1    0.669  0.0568        0.544        0.767
##   460     41       1    0.653  0.0577        0.527        0.753
##   540     35       1    0.634  0.0590        0.507        0.737
##   661     23       1    0.606  0.0625        0.473        0.716
##   708     19       1    0.575  0.0669        0.433        0.693
##   878     10       1    0.517  0.0812        0.349        0.661
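The survival column in the tables above follows the Kaplan-Meier product-limit rule: at each event time the running estimate is multiplied by 1 - n.event/n.risk. The std.err column is consistent with Greenwood's formula. As a minimal sketch in base R, the first three rows of the clinic=2 table can be reproduced by hand; the counts are copied from that table and the variable names are illustrative.

```r
# First three event times for clinic=2, copied from the table above
n_risk  <- c(74, 73, 72)
n_event <- c(1, 1, 1)

# Product-limit estimate: cumulative product of the conditional survival
# probabilities 1 - d/n at each event time
surv <- cumprod(1 - n_event / n_risk)

# Greenwood's formula for the standard error of the survival estimate
std_err <- surv * sqrt(cumsum(n_event / (n_risk * (n_risk - n_event))))

round(cbind(n_risk, n_event, surv, std_err), 4)
```

Rounded to the precision of the printout, these values should agree with the survival and std.err columns for times 13, 26, and 35.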

summary(km_fit, times=c(0,100,200,300,400,500,600,700,800,900,1000))

## Call: survfit(formula = km_surv_obj ~ clinic, data = addicts, conf.type = "log-log")
## 
##                 clinic=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    163       0   1.0000  0.0000      1.00000       1.0000
##   100    137      20   0.8746  0.0262      0.81239       0.9172
##   200    110      20   0.7420  0.0353      0.66517       0.8038
##   300     87      20   0.6046  0.0399      0.52163       0.6777
##   400     68      14   0.5025  0.0415      0.41868       0.5805
##   500     53       9   0.4319  0.0418      0.34901       0.5118
##   600     30      16   0.2951  0.0403      0.21865       0.3753
##   700     20       8   0.2113  0.0383      0.14184       0.2902
##   800     10       8   0.1268  0.0326      0.07165       0.1983
##   900      1       7   0.0181  0.0172      0.00171       0.0801
## 
##                 clinic=2 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0     75       0    1.000  0.0000        1.000        1.000
##   100     66       5    0.932  0.0294        0.844        0.971
##   200     58       7    0.832  0.0442        0.723        0.901
##   300     50       7    0.730  0.0530        0.610        0.819
##   400     43       3    0.685  0.0558        0.561        0.780
##   500     39       2    0.653  0.0577        0.527        0.753
##   600     27       1    0.634  0.0590        0.507        0.737
##   700     19       1    0.606  0.0625        0.473        0.716
##   800     11       1    0.575  0.0669        0.433        0.693
##   900      7       1    0.517  0.0812        0.349        0.661
##  1000      3       0    0.517  0.0812        0.349        0.661

Visualization

#install.packages("survminer")
library(survminer)

## Loading required package: ggpubr

## 
## Attaching package: 'survminer'

## The following object is masked from 'package:survival':
## 
##     myeloma

# Optional: Enhanced KM plot using survminer
ggsurvplot(km_fit, data = addicts,
           conf.int = TRUE,          # Add confidence intervals
           pval = TRUE,              # Add log-rank test p-value
           risk.table = TRUE,        # Add risk table below the plot
           xlab = "Time (days)", 
           ylab = "Survival Probability",
           title = "Kaplan-Meier Survival Curves by Clinic",
           legend.labs = c("Clinic 1", "Clinic 2"),
           palette = c("blue", "red"))

Log rank test

#log rank test
survdiff(Surv(survt,status)~clinic, data=addicts, rho=0) #log-rank

## Call:
## survdiff(formula = Surv(survt, status) ~ clinic, data = addicts, 
##     rho = 0)
## 
##            N Observed Expected (O-E)^2/E (O-E)^2/V
## clinic=1 163      122     90.9      10.6      27.9
## clinic=2  75       28     59.1      16.4      27.9
## 
##  Chisq= 27.9  on 1 degrees of freedom, p= 1e-07
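
The (O-E)^2/E column and the p-value can be checked from the printed counts. The sketch below uses only base R and the rounded values shown in the output, so it agrees with the printout only to about one decimal place. The reported chi-square itself uses (O-E)^2/V, whose variance term cannot be recomputed from this table alone.

```r
# Observed and expected event counts, copied from the survdiff output above
observed <- c(clinic1 = 122, clinic2 = 28)
expected <- c(clinic1 = 90.9, clinic2 = 59.1)   # rounded in the printout

# Per-group contribution reported in the (O-E)^2/E column
oe2_over_e <- (observed - expected)^2 / expected
round(oe2_over_e, 1)

# Upper-tail p-value for the reported log-rank statistic on 1 degree of freedom
p_value <- pchisq(27.9, df = 1, lower.tail = FALSE)
signif(p_value, 1)
```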

Stratified Log rank test

# stratified log-rank test (strata defined by prison)
survdiff(Surv(survt,status)~clinic + strata(prison), data=addicts, rho=0) # rho = 0 gives the log-rank test

## Call:
## survdiff(formula = Surv(survt, status) ~ clinic + strata(prison), 
##     data = addicts, rho = 0)
## 
##            N Observed Expected (O-E)^2/E (O-E)^2/V
## clinic=1 163      122     91.7      10.0      26.9
## clinic=2  75       28     58.3      15.8      26.9
## 
##  Chisq= 26.9  on 1 degrees of freedom, p= 2e-07

Semi-parametric models: Cox Proportional Hazards Model

Readings for the section

Clark TG, Bradburn MJ, Love SB, Altman DG. Survival Analysis Part I: Basic concepts and first analyses. British Journal of Cancer. 2003;89(2):232-238. doi: 10.1038/sj.bjc.6601118
Bradburn MJ, Clark TG, Love SB, Altman DG. Survival Analysis Part II: Multivariate data analysis – an introduction to concepts and methods. British Journal of Cancer. 2003;89(3):431-436. doi:10.1038/sj.bjc.6601119
John H. Beigel et al., 2020. Remdesivir for the Treatment of Covid-19: Final Report. N Engl J Med 2020; 383:1813-1826. https://doi.org/10.1056/NEJMoa2007764
Baden LR, El Sahly HM, Essink B, et al. Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine. N Engl J Med. Published online December 30, 2020:NEJMoa2035389. doi:10.1056/NEJMoa2035389

Notations

\[h(t, X) = h_0(t) \exp\left( \sum_{i=1}^{p} \beta_i X_i \right), \;\; \text{where}\; X = (X_1, X_2, \ldots, X_p) \]

Proportional hazard (PH) assumption

$h_0(t)$: the baseline hazard is a function of $t$ but not of the $X$'s.
When all the $X$'s are equal to $0$, the formula reduces to the baseline hazard function, $h_0(t)$, because $e^0 = 1$.
When no $X$'s are in the model, the formula likewise reduces to the baseline hazard function, $h_0(t)$.
$\exp\left( \sum_{i=1}^{p} \beta_i X_i \right)$: the exponential component is a function of the $X$'s but not of $t$ (i.e., the $X$'s are time-independent variables).
A time-independent variable is any variable whose value for a given individual does not change over time (e.g., sex, race/ethnicity).
It may be appropriate to treat Age or Height as time-independent in the analysis if their values do not change much over time or if the effect of such variables on survival risk depends essentially on the value at only one measurement.
It is also possible to consider $X$'s that do involve $t$; such $X$'s are called time-dependent variables. The resulting model is the so-called extended Cox model, which no longer satisfies the proportional hazards assumption.

Why we call this a semi-parametric model

No distributional assumption on $h_0(t)$ (the nonparametric part) + the proportional hazards assumption on the covariate effects (the parametric part)
The formulation of a likelihood function is based on the distribution of the outcome.
The Cox PH model does not impose any assumption on the distribution of the outcome, the time to event. It uses only the observed order of the failure times, so estimation is based on a partial likelihood rather than a full likelihood.
If a distributional assumption is imposed on the time to event, the model is a parametric survival model.
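
To make the point about ordering concrete, the sketch below writes out the Cox partial log-likelihood for one covariate with no tied event times: each event contributes $\beta x_j - \ln \sum_{l \in R(t_j)} \exp(\beta x_l)$, where $R(t_j)$ is the set of subjects still at risk at $t_j$. The data are made up for illustration and the function name is hypothetical. Because only the risk sets enter the formula, any order-preserving transformation of time (here, the log) leaves the value unchanged.

```r
# Toy data (made up for illustration): five subjects, no tied times
time   <- c(2, 4, 5, 7, 9)
status <- c(1, 0, 1, 1, 0)   # 1 = event, 0 = censored
x      <- c(1, 0, 1, 0, 0)

# Partial log-likelihood for a single covariate, assuming no tied event times
cox_partial_loglik <- function(beta, time, status, x) {
  event_idx <- which(status == 1)
  contrib <- vapply(event_idx, function(j) {
    at_risk <- time >= time[j]          # risk set at the j-th event time
    beta * x[j] - log(sum(exp(beta * x[at_risk])))
  }, numeric(1))
  sum(contrib)
}

ll_raw <- cox_partial_loglik(0.5, time, status, x)
ll_log <- cox_partial_loglik(0.5, log(time), status, x)
all.equal(ll_raw, ll_log)   # same risk sets, same partial likelihood
```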

Why the Cox PH model is so popular

The Cox PH model is robust: its results will closely approximate the results from the correct parametric model.

Even though the baseline hazard is not specified, reasonably good estimates of regression coefficients, hazard ratios of interest, and adjusted survival curves can be obtained for a wide variety of data situations.
We would prefer to use a parametric model if we were sure of the correct model. However, we may not be completely certain that a given parametric model is appropriate.
When in doubt, the Cox model is a “safe” choice.

Along with robustness, the specification of the Cox PH model has several other useful properties.

The exponential part of the model ensures that the fitted model always gives non-negative estimated hazards (unlike a linear model, which can yield negative values).
The measure of effect, the hazard ratio, is calculated without having to estimate the baseline hazard function.
With a minimum of assumptions, we can obtain the primary information about a hazard ratio and a survival curve.
Compared with the logistic model, the Cox PH model incorporates survival time and censoring information.

Comparisons with the crude and adjusted models

# Load the required packages
library(survival)
library(haven)   # provides read_dta() for Stata files

# Load the addicts dataset
addicts <- read_dta("http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta")

# Define the survival outcome (status == 1 marks the event)
Y=Surv(addicts$survt,addicts$status==1)

# Fit the crude and adjusted Cox proportional hazards models
crude_cox <- coxph(Y~ prison , data=addicts)
adj_cox <- coxph(Y~ prison + dose , data=addicts)

# Display the summary of each model
summary(crude_cox)

## Call:
## coxph(formula = Y ~ prison, data = addicts)
## 
##   n= 238, number of events= 150 
## 
##          coef exp(coef) se(coef)     z Pr(>|z|)
## prison 0.1838    1.2018   0.1642 1.119    0.263
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## prison     1.202     0.8321    0.8711     1.658
## 
## Concordance= 0.536  (se = 0.023 )
## Likelihood ratio test= 1.25  on 1 df,   p=0.3
## Wald test            = 1.25  on 1 df,   p=0.3
## Score (logrank) test = 1.26  on 1 df,   p=0.3

summary(adj_cox)

## Call:
## coxph(formula = Y ~ prison + dose, data = addicts)
## 
##   n= 238, number of events= 150 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)    
## prison  0.18965   1.20883  0.16427  1.155    0.248    
## dose   -0.03608   0.96457  0.00600 -6.013 1.83e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## prison    1.2088     0.8272    0.8761     1.668
## dose      0.9646     1.0367    0.9533     0.976
## 
## Concordance= 0.663  (se = 0.025 )
## Likelihood ratio test= 38.21  on 2 df,   p=5e-09
## Wald test            = 37.15  on 2 df,   p=9e-09
## Score (logrank) test = 37.48  on 2 df,   p=7e-09

Similar to other analytic models, we fit both crude and adjusted models.

In principle, choose confounders based on theory rather than on empirical assessment.
Comparing the crude and adjusted estimates is often used to assess whether confounding exists (e.g., the 10% change-in-estimate rule).
Report both models even if their fits do not differ.
Test statistic for comparing nested models: the difference in $-2LL$ (i.e., $-2 \ln L$), which follows a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters.
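
As a rough check of both ideas, the numbers printed above are enough to compare the crude and adjusted models by hand. This base R sketch assumes only the values copied from the two summary() outputs; the variable names are illustrative.

```r
# Values copied from the summary() output above
lr_crude <- 1.25    # LR test, crude model (1 df)
lr_adj   <- 38.21   # LR test, adjusted model (2 df)

# Both LR statistics are measured against the same null model, so their
# difference equals the difference in -2 log-likelihood between the models
lr_diff <- lr_adj - lr_crude
df_diff <- 2 - 1
p_value <- pchisq(lr_diff, df = df_diff, lower.tail = FALSE)

# 10% change-in-estimate rule applied to the prison hazard ratio
hr_crude   <- 1.2018
hr_adj     <- 1.2088
pct_change <- 100 * (hr_adj - hr_crude) / hr_crude

c(lr_diff = lr_diff, p_value = p_value, pct_change = pct_change)
```

Adding dose improves the fit substantially, yet the prison hazard ratio changes by well under 10%, so by this rule dose does not appear to confound the prison effect. With the fitted objects in memory, anova(crude_cox, adj_cox) should give the same likelihood ratio comparison.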

First, let’s examine the model fit statistics.

Wald statistic: $z = {coef}/{se(coef)}$ is approximately standard normal under the null hypothesis that the coefficient is zero
Likelihood ratio (LR) statistics: “In general, the LR and Wald statistics may not give exactly the same answer. Statisticians have shown that of the two test procedures, the LR statistic has better statistical properties, so when in doubt, you should use the LR test.”
Concordance (https://cran.r-project.org/web/packages/survival/vignettes/concordance.pdf)
Score (logrank) test

Now, let’s examine coefficients.

As in other regression outputs, we have point estimates, standard errors, p-values, and confidence intervals.
Note that there is no $\beta_0$ term: the intercept is absorbed into the baseline hazard $h_0(t)$.
coef: $\log(\text{hazard ratio})$
$\exp(coef)$: hazard ratio (HR). In the adjusted model, $\exp(0.1897) = 1.2088$, so the hazard for the test group (prison=1) is about 1.2 times the hazard for the comparison group (prison=0), holding dose constant.
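
The remaining columns of the coefficient table follow from coef and se(coef) alone. A minimal base R sketch, using the prison row of the adjusted model copied from the output above:

```r
# Adjusted-model estimates for prison, copied from the output above
coef_prison <- 0.18965
se_prison   <- 0.16427

z  <- coef_prison / se_prison                 # Wald statistic
p  <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value
hr <- exp(coef_prison)                        # hazard ratio
ci <- exp(coef_prison + c(-1, 1) * qnorm(0.975) * se_prison)  # 95% CI for the HR

round(c(z = z, p = p, HR = hr, lower = ci[1], upper = ci[2]), 4)
```

The confidence interval is built on the log-hazard scale and then exponentiated, which is why it is not symmetric around the hazard ratio.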

More about Hazard ratio

Hazard ratio (HR) ($e^{\hat{\beta}}$)
In general, an HR is defined as the hazard for one individual divided by the hazard for a different individual. The two individuals being compared can be distinguished by their values for the set of predictors, that is, the $X^*$'s vs. the $X$'s. Therefore,

\[\hat{HR} = \frac{\hat{h}(t, X^*)}{\hat{h}(t, X)} = \exp \left( \sum_{i=1}^{p} \hat{\beta}_i (X_i^* - X_i) \right)\]

Example: When $X_1$ denotes $(0, 1)$ exposure status, then $X_1^{*} = 1$, $X_1 = 0$, thus

\[\hat{HR} = \exp \left( \sum_{i=1}^{p} \hat{\beta}_i (X_i^* - X_i) \right) = \exp [\hat{\beta}_1 (1 - 0)] = \exp (\hat{\beta}_1)\]

As with an odds ratio, it is easier to interpret an HR that exceeds the null value of 1 than an HR that is less than 1. Thus, the $X$'s are typically coded so that the group with the larger hazard corresponds to $X^*$, and the group with the smaller hazard corresponds to $X$.

Example

library(eha)

Cox.PH <- coxreg(Y ~ clinic + dose + prison,
                 data=addicts)
b_coxph = coef(Cox.PH)
Cox.PH

## Call:
## coxreg(formula = Y ~ clinic + dose + prison, data = addicts)
## 
## Covariate             Mean       Coef     Rel.Risk   S.E.    Wald p
## clinic                1.378    -1.010     0.364     0.215     0.000 
## dose                 64.317    -0.035     0.965     0.006     0.000 
## prison                0.418     0.327     1.386     0.167     0.051 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -673.26 
## LR test statistic         64.56 
## Degrees of freedom        3 
## Overall p-value           6.22835e-14

plot(Cox.PH, main = "Cumulative hazard function")

Model specification

\[\ln h(t)= \ln h_0(t) + b_1x_1+b_2x_2+b_3x_3\]

or \[h(t)=h_0 (t)\exp(b_1x_1+b_2x_2+b_3x_3)\]

Estimated equation

\[\ln h(t) = \ln h_0(t) + (-1.0098959)\times \text{clinic} + (-0.0353692) \times \text{dose} + (0.326555) \times \text{prison}\] or \[h(t)=h_0 (t)\exp((-1.0098959)\times \text{clinic} + (-0.0353692) \times \text{dose} + (0.326555)\times \text{prison})\]
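
Because $h_0(t)$ cancels when two covariate patterns are compared, hazard ratios can be read directly off the estimated equation. The sketch below uses the coefficients from the coxreg output above; the helper name hazard_ratio and the covariate patterns are illustrative choices.

```r
# Coefficients copied from the coxreg output above
b <- c(clinic = -1.0098959, dose = -0.0353692, prison = 0.326555)

# Hazard ratio comparing covariate pattern x_star with pattern x
# (both given in the order clinic, dose, prison)
hazard_ratio <- function(x_star, x, coefs) exp(sum(coefs * (x_star - x)))

# Clinic 2 vs. clinic 1 at the same dose and prison status
hr_clinic <- hazard_ratio(c(2, 50, 1), c(1, 50, 1), b)

# A 10-unit higher dose, same clinic and prison status
hr_dose10 <- hazard_ratio(c(1, 60, 1), c(1, 50, 1), b)

round(c(hr_clinic = hr_clinic, hr_dose10 = hr_dose10), 3)
```

The comparison is multiplicative: the hazard in one group is the other group's hazard times the hazard ratio, whatever the value of $h_0(t)$.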

Interpretations

What is the metric of $y, b_1, b_2,$ and $b_3$, respectively?
What is the difference between coefficients and $\exp$(coefficients)? Specify the metric.
What is the interpretation when 1) $b_i=0$, 2) $b_i<0$, or 3) $b_i>0$?
What is the interpretation when 1) $\exp(b_i)=1$, 2) $\exp(b_i)<1$ or 3) $\exp(b_i)>1$?
How would you compare the hazard of dropout between two groups of people below? Is it additive or multiplicative?
- What is the estimated hazard of dropout ($h(t)$ for clinic=1) for those with clinic=1, dose=50, and prison=1?
- What is the estimated hazard of dropout ($h(t)$ for clinic=2) for those with clinic=2, dose=50, and prison=1?
Compare the following two groups:
- A group with clinic=1, dose=100, and prison=1 vs. another group with clinic=1, dose=200, and prison=1
- A group with clinic=2, dose=100, and prison=1 vs. another group with clinic=2, dose=100, and prison=0
Interpret $b_1, b_2,$ and $b_3$, respectively.
The following model includes an interaction term between clinic and prison.

Cox.PH_int <- coxreg(Y ~ clinic + dose + prison + clinic*prison,
                 data=addicts)
b_coxph_int = coef(Cox.PH_int)
Cox.PH_int

## Call:
## coxreg(formula = Y ~ clinic + dose + prison + clinic * prison, 
##     data = addicts)
## 
## Covariate             Mean       Coef     Rel.Risk   S.E.    Wald p
## clinic                1.378    -0.655     0.519     0.289     0.023 
## dose                 64.317    -0.037     0.964     0.007     0.000 
## prison                0.418     1.164     3.203     0.540     0.031 
## clinic:prison    
##    :                         -0.699     0.497     0.429     0.103 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -671.93 
## LR test statistic         67.22 
## Degrees of freedom        4 
## Overall p-value           8.78186e-14

In this model specification, prison is a (mediator, moderator) on the association between clinic and time to dropout.
State the estimated equation: \[h(t) = h_0(t)\exp((-0.6551914)\times \text{clinic} + (-0.0369766) \times \text{dose} + (1.1640273) \times \text{prison} + (-0.6993124) \times \text{clinic} \times \text{prison})\]
Compare the hazard of dropout between clinic=1 ($h(t \mid x_1=1)$) and clinic=2 ($h(t \mid x_1=2)$), taking the interaction term into account.
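
With an interaction term, the clinic hazard ratio is no longer a single number: it depends on the value of prison. A minimal sketch, using the coefficients from the interaction model output above (the function name is illustrative):

```r
# Coefficients copied from the interaction model output above
b_clinic <- -0.6551914
b_inter  <- -0.6993124   # clinic:prison

# HR for clinic=2 vs. clinic=1 at a given prison value, dose held fixed:
# exp(b_clinic * (2 - 1) + b_inter * (2 - 1) * prison)
hr_clinic_by_prison <- function(prison) exp(b_clinic + b_inter * prison)

round(c(prison0 = hr_clinic_by_prison(0), prison1 = hr_clinic_by_prison(1)), 3)
```

The two hazard ratios differ by the factor exp(b_inter), which is the Rel.Risk reported for the clinic:prison row.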

Proportional hazard (PH) assumption: Revisited

Recall that

\[ \hat{HR} = \frac{\hat{h}(t, X^*)}{\hat{h}(t, X)} = \frac{\hat{h}_0(t) \exp(\sum_{i=1}^{p} \hat{\beta}_i X_i^*)}{\hat{h}_0(t) \exp(\sum_{i=1}^{p} \hat{\beta}_i X_i)} = \frac{\exp(\sum_{i=1}^{p} \hat{\beta}_i X_i^*)}{\exp(\sum_{i=1}^{p} \hat{\beta}_i X_i)} = \exp\left( {\sum_{i=1}^{p} \hat{\beta}_i (X_i^* - X_i )}\right) \]
Notice that the baseline hazard function appears in both the numerator and denominator of the hazard ratio and cancels out of the formula.

The final expression for the hazard ratio therefore involves the estimated coefficients $\hat{\beta}_i$ and the values of $X^*$ and $X$ for each variable. However, because the baseline hazard has canceled out, the final expression does not involve time $t$.
Thus, once the model is fitted and the values for $X^*$ and $X$ are specified, the value of the exponential expression for the estimated hazard ratio is a constant, which does not depend on time $t$:

\[ \hat{HR} = \frac{\hat{h}(t, X^*)}{\hat{h}(t, X)} = \exp \left( \sum_{i=1}^{p} \hat{\beta}_i (X_i^* - X_i) \right) = \hat{\theta}, \quad \text{therefore} \quad \hat{h}(t, X^*) = \hat{\theta}\, \hat{h}(t, X) \]

The last expression indicates that the hazard function for one individual is proportional to the hazard function for another individual, where the proportionality constant is $\hat{\theta}$, which does not depend on time $t$.
In a Cox PH model with a single $(0, 1)$ predictor $X_1$, $\hat{\theta} = e^{\hat{\beta}_1}$.
When the PH assumption is not met (e.g., the hazards cross), a Cox PH model is inappropriate and an alternative model (e.g., the extended Cox model) should be used.

Evaluating the Proportional hazard (PH) assumption

The Cox PH model assumes that the hazard ratio comparing any two specifications of predictors is constant over time. Equivalently, this means that the hazard for one individual is proportional to the hazard for any other individual, where the proportionality constant is independent of time.

Graphical evaluation

The PH assumption is not met if the hazard curves cross for two or more categories of a predictor of interest. However, even if the hazard functions do not cross, it is possible that the PH assumption is not met. Thus, rather than checking for crossing hazards, we must use other approaches to evaluate the reasonableness of the PH assumption.

Comparing estimated $-\ln(-\ln)$ survivor curves over different (combinations of) categories of variables
- Parallel curves, say comparing males with females, indicate that the PH assumption is satisfied
- By definition, $-\ln(-\ln \hat{S}(t)) = -\ln\left( \int_{0}^{t} h(u) \, du \right)$
  - The scale of an estimated survival curve ($\hat{S}$) ranges between 0 and 1, whereas the corresponding scale for $-\ln(-\ln \hat{S})$ ranges between $-\infty$ and $+\infty$
  - By empirical plots, we mean plotting log–log survival curves based on Kaplan–Meier (KM) estimates that do not assume an underlying Cox model.
  - Alternatively, one could plot log–log survival curves which have been adjusted for predictors already assumed to satisfy the PH assumption but have not included the predictor being assessed in a PH model.
  - If observed and predicted curves are "visually" parallel, then the PH assumption is reasonable
- Assessing the PH assumption for variables one-at-a-time
- Assessing the PH assumption after adjusting for other variables

plot(survfit(Y ~ clinic, data=addicts), col=c("black", "red"), fun="cloglog")
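
Why parallel curves indicate proportional hazards: under the PH model, $S(t \mid X) = S_0(t)^{\exp(\beta X)}$, so $\ln(-\ln S(t \mid X)) = \beta X + \ln(-\ln S_0(t))$ and two groups differ by the constant $\beta$ at every $t$. The sketch below checks this numerically in base R. The Weibull baseline and the value of beta are hypothetical, chosen only for illustration. Note that fun="cloglog" in plot.survfit draws $\ln(-\ln S)$ against time on a log scale, which flips the sign of the curves but not their parallelism.

```r
# Hypothetical Weibull baseline survivor function and log hazard ratio
t_grid <- seq(50, 900, by = 50)
S0     <- exp(-(t_grid / 500)^1.3)
beta   <- -1.01

# Under PH, the survivor function for X = 1 is the baseline raised to exp(beta)
S1 <- S0^exp(beta)

# The vertical gap between the ln(-ln S) curves equals beta at every time point
gap <- log(-log(S1)) - log(-log(S0))
range(gap)
```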

Goodness-of-fit (GOF)

This approach is more objective than graphical evaluation because it relies on a statistical test. A non-significant (i.e., large) $p$-value from a large-sample $z$ or $\chi^2$ statistic suggests that the PH assumption is reasonable, whereas a small $p$-value suggests that the variable being tested does not satisfy this assumption.
Schoenfeld residuals

## Fit a model assuming PH for all variables
adj_cox <- coxph(Surv(time, status_dummy) ~ sex + age, data = lung)

## Use cox.zph. The survival times are transformed to ranks ([2] selects "rank").
res.zph <- cox.zph(adj_cox, transform = c("km","rank","identity")[2])

## Print test results
res.zph

##        chisq df    p
## sex    2.378  1 0.12
## age    0.137  1 0.71
## GLOBAL 2.475  2 0.29

## Plot the scaled Schoenfeld residuals against (transformed) time.
## Plotting can be useful. A non-horizontal trend means changes in HR over time

plot(res.zph)

Stratified Cox models

If the proportional hazards assumption is violated for the variable CLINIC but met for PRISON and DOSE, a stratified Cox model can be fitted with CLINIC as the stratification variable. The coxph function accepts a strata() term in the model formula. First we define the response variable Y with the Surv function, and then the coxph function is used to run a stratified Cox model (code and output shown below):

# Load the addicts dataset
addicts <- read_dta("http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta")

Y=Surv(addicts$survt,addicts$status==1)
coxph(Y~ prison + dose + strata(clinic),data=addicts)

## Call:
## coxph(formula = Y ~ prison + dose + strata(clinic), data = addicts)
## 
##             coef exp(coef)  se(coef)      z        p
## prison  0.389605  1.476397  0.168930  2.306   0.0211
## dose   -0.035115  0.965495  0.006465 -5.432 5.59e-08
## 
## Likelihood ratio test=33.91  on 2 df, p=4.322e-08
## n= 238, number of events= 150

Interaction terms for CLINIC can be included directly in the model formula by including product terms using the : operator (clinic:prison and clinic:dose) (code and output follow)

coxph(Y~ prison + dose + clinic:prison + clinic:dose +strata(clinic),
data=addicts)

## Call:
## coxph(formula = Y ~ prison + dose + clinic:prison + clinic:dose + 
##     strata(clinic), data = addicts)
## 
##                    coef exp(coef)  se(coef)      z      p
## prison         1.085836  2.961914  0.538636  2.016 0.0438
## dose          -0.034635  0.965958  0.019797 -1.750 0.0802
## prison:clinic -0.582989  0.558227  0.428135 -1.362 0.1733
## dose:clinic   -0.001164  0.998837  0.014570 -0.080 0.9363
## 
## Likelihood ratio test=35.77  on 4 df, p=3.222e-07
## n= 238, number of events= 150

Suppose we wish to estimate the hazard ratio for PRISON=1 vs. PRISON=0 for CLINIC=2. This hazard ratio can be estimated by exponentiating the coefficient for PRISON plus 2 times the coefficient for the CLINIC × PRISON interaction term. This expression is obtained by substituting the appropriate values into the hazard in both the numerator (for PRISON=1) and denominator (for PRISON=0):

\[ HR = \frac{h_0(t) \exp[(1)\beta_1 + \beta_2 \text{DOSE} + (2)(1)\beta_3 + \beta_4 \text{CLINIC} \times \text{DOSE}]}{h_0(t) \exp[(0)\beta_1 + \beta_2 \text{DOSE} + (2)(0)\beta_3 + \beta_4 \text{CLINIC} \times \text{DOSE}]} = \exp(\beta_1 + 2\beta_3) \]

The resulting hazard ratio, $\exp(\beta_1 + 2\beta_3)$, is an exponentiated linear combination of parameters, where $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ are the coefficients for PRISON, DOSE, CLINIC × PRISON, and CLINIC × DOSE, respectively. Unfortunately, R does not have a built-in counterpart to the lincom command that Stata provides or the estimate statement that SAS provides for calculating a linear combination of parameter estimates. However, an approach that can be used in any statistical software package in such a situation is to recode the variable(s) of interest so that the desired estimate is no longer a linear combination of parameter estimates.

In this example, we are interested in the hazard ratio for PRISON=1 versus PRISON=0 for CLINIC=2. We can define a new variable CLINIC2 = CLINIC - 2, so that when CLINIC=2, CLINIC2=0.

addicts$clinic2=addicts$clinic-2
summary(coxph(Y~ prison + dose + clinic2:prison + clinic2:dose
+ strata(clinic2), data=addicts))

## Call:
## coxph(formula = Y ~ prison + dose + clinic2:prison + clinic2:dose + 
##     strata(clinic2), data = addicts)
## 
##   n= 238, number of events= 150 
## 
##                     coef exp(coef)  se(coef)      z Pr(>|z|)   
## prison         -0.080143  0.922985  0.384305 -0.209  0.83481   
## dose           -0.036964  0.963711  0.012346 -2.994  0.00275 **
## prison:clinic2 -0.582989  0.558227  0.428135 -1.362  0.17329   
## dose:clinic2   -0.001164  0.998837  0.014570 -0.080  0.93632   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## prison            0.9230      1.083    0.4346    1.9603
## dose              0.9637      1.038    0.9407    0.9873
## prison:clinic2    0.5582      1.791    0.2412    1.2919
## dose:clinic2      0.9988      1.001    0.9707    1.0278
## 
## Concordance= 0.649  (se = 0.026 )
## Likelihood ratio test= 35.77  on 4 df,   p=3e-07
## Wald test            = 34.09  on 4 df,   p=7e-07
## Score (logrank) test = 34.97  on 4 df,   p=5e-07

The first line of code defines a new variable CLINIC2. CLINIC2 is used in the stratified Cox model rather than CLINIC. We are interested in the hazard ratio for PRISON=1 vs PRISON=0 for CLINIC2=0. When CLINIC2=0, the product terms cancel and the hazard ratio reduces to $exp(β_1)$.

The second line of code applies the summary function to the coxph function. The summary function applied in this way produces additional output including 95% confidence intervals for the hazard ratios.

The estimate for $\exp(\beta_1)$ can be found in the second table: $\exp(coef)$ for prison is 0.9230. The lower and upper 95% confidence limits are 0.4346 and 1.9603, respectively. If we had not recoded the variable CLINIC, the problem would have been more complicated: we would have had to use the variance–covariance matrix (which can be obtained with the vcov function) to calculate a 95% confidence interval for this hazard ratio.
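
The point estimate from the recoded model can be cross-checked against the original CLINIC coding, and the vcov route can be sketched as a small helper. The function lincom_hr below is hypothetical (it is not part of the survival package); it relies only on coef() and vcov(), which are available for coxph fits.

```r
# Coefficients copied from the model that uses the original CLINIC coding
b_prison        <- 1.085836
b_prison_clinic <- -0.582989

# HR for PRISON=1 vs. PRISON=0 at CLINIC=2: exp(b_prison + 2 * b_prison_clinic)
hr_clinic2 <- exp(b_prison + 2 * b_prison_clinic)
round(hr_clinic2, 4)

# Hypothetical helper: HR and Wald confidence interval for a linear
# combination w'beta, using coef() and vcov() from a fitted model object
lincom_hr <- function(fit, w, level = 0.95) {
  est <- sum(w * coef(fit))
  se  <- sqrt(drop(t(w) %*% vcov(fit) %*% w))
  z   <- qnorm(1 - (1 - level) / 2)
  c(HR = exp(est), lower = exp(est - z * se), upper = exp(est + z * se))
}
```

If the interaction model with the original coding were stored in an object (say fit_int, a name assumed here), lincom_hr(fit_int, c(1, 0, 2, 0)) would weight the coefficients in the order prison, dose, prison:clinic, dose:clinic and should give the same hazard ratio and interval as the recoded model.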

Time-dependent variable approaches

The Cox model is extended to contain product (i.e., interaction) terms involving the time-independent variable being assessed and some function of time. If the coefficient of the product term turns out to be significant, we can conclude that the PH assumption is violated.

Using a one-at-a-time model, $h(t, X) = h_0(t) \exp(\beta X + \delta (X \times g(t)))$, we assess the PH assumption by testing for the significance of the product term. The null hypothesis is therefore “$\delta$ equal to zero.” Note that if the null hypothesis is true, the model reduces to a Cox PH model containing the single variable $X$. The test can be carried out using either a Wald statistic or a likelihood ratio statistic.
To assess the PH assumption for several predictors simultaneously, the form of the extended model is \[ h(t, X) = h_0(t) \exp \left( \sum_{i=1}^{p} (\beta_i X_i + \delta_i (X_i \times g_i(t))) \right), \] where $g_i(t)$ is a function of time for the $i$th predictor.
This model contains the predictors being assessed as main effect terms and also as product terms with some function of time. Note that different predictors may require different functions of time; hence, the notation $g_i(t)$ is used to define the time function for the $i$th predictor. With the above model, we test the PH assumption simultaneously by assessing the null hypothesis that all the $\delta_i$ coefficients are equal to zero. This requires a likelihood ratio chi-square statistic with $p$ degrees of freedom, where $p$ denotes the number of predictors being assessed. The LR statistic is the difference between the log likelihood statistic (i.e., $-2 \ln L$) for the PH model and the log likelihood statistic for the extended Cox model. Note that under the null hypothesis, the model reduces to the Cox PH model.
If the above test is found to be significant, then we can conclude that the PH assumption is not satisfied for at least one of the predictors in the model. To determine which predictor(s) do not satisfy the PH assumption, we could proceed by backward elimination of nonsignificant product terms until a final model is attained.
The primary drawback of using an extended Cox model to assess the PH assumption concerns the choice of the functions $g_i(t)$ for the time-dependent product terms in the model. This choice is typically not clear-cut, and it is possible that different choices, such as $g(t)$ equal to $t$ versus $\log t$ versus a Heaviside function, may result in different conclusions about whether the PH assumption is satisfied.
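A minimal sketch of the one-at-a-time approach, assuming the lung data and the choice $g(t) = \log t$ (both are illustrative assumptions, not the analysis in the text). The `tt()` mechanism in `coxph` builds the product term $X \times g(t)$; the Wald test is read from the coefficient table and the LR test compares the two partial log likelihoods.

```r
library(survival)

# PH model for sex, adjusted for age
fit_ph <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Extended model: adds the product term sex x g(t), with g(t) = log(t)
fit_tt <- coxph(Surv(time, status) ~ age + sex + tt(sex), data = lung,
                tt = function(x, t, ...) x * log(t))

# Wald test of delta = 0: the row for the product term
summary(fit_tt)$coefficients["tt(sex)", ]

# LR test of delta = 0: difference in -2 log L between the PH and extended models
lr_stat <- 2 * (fit_tt$loglik[2] - fit_ph$loglik[2])
lr_p    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
c(LR = lr_stat, p = lr_p)
```

Repeating the fit with a different `tt` function (for example `x * t`) is a quick way to see how sensitive the conclusion is to the choice of $g(t)$.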

Parametric Models

A note for PH and AFT models

Proportional hazards (PH) parameterization

\[ h(t|x) = h_0(t) \cdot \exp(\mathbf{x}^\top \boldsymbol{\beta}) \]

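Accelerated failure time (AFT) parameterization (survreg convention, $\log T = \mathbf{x}^\top \boldsymbol{\beta}_{AFT} + \sigma \varepsilon$, so that positive coefficients lengthen survival):

\[ S(t|x) = S_0(t \cdot \exp(-\mathbf{x}^\top \boldsymbol{\beta}_{AFT})) \]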

Key Equations:

PH coefficients: $\beta_{PH} = -\beta_{AFT}/\sigma$
Shape parameter: $\gamma = 1/\sigma$
Hazard Ratio: $HR = \exp(\beta_{PH})$
Event Time Ratio: $ETR = \exp(-\beta_{PH}/\gamma)$
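As a rough check of the key equations, the sketch below converts Weibull AFT estimates from `survreg` into PH-scale quantities. It assumes the lung data and the same three covariates used in the examples that follow; the variable names are illustrative.

```r
library(survival)

# Weibull AFT fit on the log-time scale
fit <- survreg(Surv(time, status) ~ age + sex + ph.ecog,
               data = lung, dist = "weibull")

sigma    <- fit$scale        # survreg scale parameter
shape    <- 1 / sigma        # Weibull shape, gamma = 1 / sigma
beta_aft <- coef(fit)[-1]    # drop the intercept
beta_ph  <- -beta_aft / sigma

# Hazard ratios and event time ratios side by side
cbind(beta_AFT = beta_aft,
      beta_PH  = beta_ph,
      HR       = exp(beta_ph),
      ETR      = exp(-beta_ph / shape))
```

The ETR column equals `exp(beta_aft)` by construction, which is why the AFT scale is said to have a direct interpretation.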

Feature	AFT	PH
Parameterization	Log-time	Hazard ratio
Time Acceleration	Direct interpretation	Requires conversion
Diagnostic Tools	Residual plots	Stratified analysis
Covariate Effects	Event Time Ratios	Hazard Ratios

Examples
- Cox PH

# Load required library and data
library(survival)
data(lung)

## Warning in data(lung): data set 'lung' not found

# In lung, status is coded 1 = censored, 2 = dead; recode it to 0/1:
lung$status <- as.numeric(lung$status == 2)  # Convert to 0/1 (censored/event)

# Create a survival object
surv_obj <- with(lung, Surv(time, status))

# Fit a Cox proportional hazards model
cox_model <- coxph(surv_obj ~ age + sex + ph.ecog, data = lung)

# Summarize the model
summary(cox_model)

## Call:
## coxph(formula = surv_obj ~ age + sex + ph.ecog, data = lung)
## 
##   n= 227, number of events= 164 
##    (1 observation deleted due to missingness)
## 
##              coef exp(coef)  se(coef)      z Pr(>|z|)    
## age      0.011067  1.011128  0.009267  1.194 0.232416    
## sex     -0.552612  0.575445  0.167739 -3.294 0.000986 ***
## ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## age        1.0111     0.9890    0.9929    1.0297
## sex        0.5754     1.7378    0.4142    0.7994
## ph.ecog    1.5900     0.6289    1.2727    1.9864
## 
## Concordance= 0.637  (se = 0.025 )
## Likelihood ratio test= 30.5  on 3 df,   p=1e-06
## Wald test            = 29.93  on 3 df,   p=1e-06
## Score (logrank) test = 30.5  on 3 df,   p=1e-06

# Plot the predicted survival curve (by default, at the mean covariate values)
plot(survfit(cox_model), xlab = "Time", ylab = "Survival Probability",
     main = "Cox Proportional Hazards Model")

Accelerated Failure Time (AFT) Model

# Fit an AFT model with Weibull distribution
aft_model <- survreg(Surv(time, status) ~ age + sex + ph.ecog, data = lung, dist = "weibull")

# Summarize the model
summary(aft_model)

## 
## Call:
## survreg(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung, 
##     dist = "weibull")
##                Value Std. Error     z       p
## (Intercept)  6.27344    0.45358 13.83 < 2e-16
## age         -0.00748    0.00676 -1.11  0.2690
## sex          0.40109    0.12373  3.24  0.0012
## ph.ecog     -0.33964    0.08348 -4.07 4.7e-05
## Log(scale)  -0.31319    0.06135 -5.11 3.3e-07
## 
## Scale= 0.731 
## 
## Weibull distribution
## Loglik(model)= -1132.4   Loglik(intercept only)= -1147.4
##  Chisq= 29.98 on 3 degrees of freedom, p= 1.4e-06 
## Number of Newton-Raphson Iterations: 5 
## n=227 (1 observation deleted due to missingness)

# Extract scale parameter and coefficients
scale_param <- aft_model$scale
coefficients <- aft_model$coefficients

# Plot Kaplan-Meier curve and overlay Weibull fit
km_fit <- survfit(Surv(time, status) ~ 1, data = lung)
plot(km_fit, xlab = "Time", ylab = "Survival Probability", conf.int = TRUE,
     main = "Kaplan-Meier Curve with Weibull Fit")

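# Overlay the Weibull baseline survival curve (all covariates set to 0)
# survreg's scale is the reciprocal of the Weibull shape, so the exponent is 1 / scale_param
time_seq <- seq(0, max(lung$time), length.out = 100)
weibull_surv <- exp(-((time_seq / exp(coefficients[1]))^(1 / scale_param)))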
lines(time_seq, weibull_surv, col = "red", lwd = 2)
legend("topright", legend = c("Kaplan-Meier", "Weibull Fit"), col = c("black", "red"), lty = 1)

Shape and Scale

In parametric survival models, the shape and scale parameters play key roles in defining the distribution of survival times and the behavior of the hazard function. Below is an explanation of these parameters, particularly in the context of commonly used distributions like the Weibull model.

Shape Parameter

The shape parameter ($\beta$ or $k$, depending on notation) controls the form of the hazard function over time. It determines whether the hazard rate (the risk of an event occurring at a given time) is constant, increasing, or decreasing.

For the Weibull distribution, the hazard function is:

\[ h(t) = \frac{\beta}{\lambda} \left(\frac{t}{\lambda}\right)^{\beta - 1} \]

where $\beta$ is the shape parameter and $\lambda$ is the scale parameter.

Interpretation of Shape ($\beta$):
- $\beta = 1$: Constant hazard rate (reduces to exponential distribution).
- $\beta > 1$: Increasing hazard rate over time (e.g., aging-related risks).
- $\beta < 1$: Decreasing hazard rate over time (e.g., early failures in mechanical systems).
Practical Interpretation

The shape parameter reflects how risk evolves: in medical studies, a higher shape parameter may indicate that risks increase with age or disease progression.
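The three cases can be checked numerically. The sketch below evaluates the Weibull hazard from the formula above on a small grid; the function name `weibull_hazard` and the chosen values are illustrative assumptions.

```r
# Weibull hazard: h(t) = (shape / scale) * (t / scale)^(shape - 1)
weibull_hazard <- function(t, shape, scale) {
  (shape / scale) * (t / scale)^(shape - 1)
}

t_grid <- c(0.5, 1, 2, 4)

# One column per shape value, all with scale = 2
haz <- sapply(c(decreasing = 0.5, constant = 1, increasing = 2),
              function(b) weibull_hazard(t_grid, shape = b, scale = 2))
rownames(haz) <- paste0("t=", t_grid)
haz
```

Reading down each column shows the hazard falling, staying flat, or rising with time.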

Scale Parameter

The scale parameter ($\lambda$) stretches or compresses the time axis. It determines how quickly events occur on average.

For the Weibull distribution, the survival function is:

\[ S(t) = e^{-(t / \lambda)^\beta} \]

where $\lambda > 0$ is the scale parameter.

Interpretation of Scale ($\lambda$):
- Larger values of $\lambda$: Longer survival times (events occur more slowly).
- Smaller values of $\lambda$: Shorter survival times (events occur more quickly).
Practical Interpretation

The scale parameter reflects the “average” time to event: in medical studies, a larger scale parameter might indicate longer expected survival.
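One way to make "stretching the time axis" concrete is to look at the median, which for the Weibull is $\lambda (\ln 2)^{1/\beta}$ and therefore scales linearly with $\lambda$. The values below are arbitrary illustrations.

```r
shape  <- 1.5
scales <- c(1, 2, 4)

# Closed-form Weibull median: scale * log(2)^(1 / shape)
medians <- scales * log(2)^(1 / shape)

# Cross-check against R's Weibull quantile function
all.equal(medians, qweibull(0.5, shape = shape, scale = scales))

# Survival at t = scale is always exp(-1), whatever the shape
surv_at_scale <- pweibull(scales, shape = shape, scale = scales, lower.tail = FALSE)
surv_at_scale
```

Doubling the scale doubles the median, while the shape of the curve is unchanged.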

Combined Effect on Hazard Function

The interaction between shape and scale parameters determines the overall behavior of the hazard function:

For example, in a Weibull model:
- If $\beta > 1$ and $\lambda$ is large, hazards increase over time but events are delayed.
- If $\beta < 1$ and $\lambda$ is small, hazards decrease over time with rapid early events.

Examples

Exponential Distribution (Special Case of Weibull)

When $\beta = 1$, the Weibull model reduces to the exponential model:

\[ h(t) = \frac{1}{\lambda} \]

This implies a constant hazard rate, independent of time.

Weibull Model with Increasing Hazard

For $t = 1, 2, 3$, $S(t) = e^{-(t/2)^2}$:

Shape ($\beta = 2.0$): Increasing hazard.
Scale ($\lambda = 2.0$): Moderate survival duration.
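This example can be evaluated directly. The sketch below computes the survival and hazard at $t = 1, 2, 3$ (evaluation points chosen for illustration) with shape 2 and scale 2.

```r
t_vals <- c(1, 2, 3)
shape  <- 2
scale  <- 2

# S(t) = exp(-(t / scale)^shape)
surv <- exp(-(t_vals / scale)^shape)

# h(t) = (shape / scale) * (t / scale)^(shape - 1), which is t / 2 here
haz <- (shape / scale) * (t_vals / scale)^(shape - 1)

data.frame(t = t_vals, S = round(surv, 4), h = haz)
```

Survival drops from about 0.78 at $t=1$ to about 0.11 at $t=3$, while the hazard rises linearly from 0.5 to 1.5.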

Summary Table

Parameter	Role	Interpretation
Shape ($\beta$)	Determines hazard trend	Constant ($=1$), increasing ($>1$), or decreasing ($<1$)
Scale ($\lambda$)	Stretches/compresses time axis	Larger values: longer survival times; smaller values: shorter survival times

These parameters provide flexibility for modeling various real-world scenarios in survival analysis.

Exponential, PH

Exp.PH <- phreg(Y ~ clinic + dose + prison,
                data=addicts, shape=1, dist="weibull", param="survreg")
b_Exp.PH = coef(Exp.PH)
Exp.PH

## Call:
## phreg(formula = Y ~ clinic + dose + prison, data = addicts, dist = "weibull", 
##     shape = 1, param = "survreg")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## clinic              1.378    -0.881     0.415     0.211     0.000 
## dose               64.317    -0.029     0.971     0.006     0.000 
## prison              0.418     0.253     1.287     0.165     0.125 
## 
## log(scale)                    3.684               0.431     0.000 
## 
##  Shape is fixed at  1 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood       -1094 
## LR test statistic         49.91 
## Degrees of freedom        3 
## Overall p-value           8.34931e-11

plot(Exp.PH)

Model specification

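\[\ln h(t)= b_0 + b_1x_1+b_2x_2+b_3x_3\]

or \[h(t)=h_0 \exp(b_1x_1+b_2x_2+b_3x_3), \quad h_0=\exp(b_0)\] Note: a hazard model has no error term, and for the exponential model the baseline hazard is constant over time. We will get back to $h_0(t)$ later.

Estimated equation

The output reports log(scale), and the exponential baseline hazard is 1/scale, so $b_0 = -\log(scale) = -3.6843409$: \[\ln(h(t)) = (-3.6843409) + (-0.8805819)\times clinic + (-0.0289167) \times dose + (0.2526491) \times prison\] or \[h(t)=\exp((-3.6843409) + (-0.8805819)\times clinic + (-0.0289167) \times dose + (0.2526491) \times prison)\]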
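Because the model is multiplicative on the hazard scale, comparing two covariate profiles only involves the differences in their covariates, and the intercept cancels. The sketch below uses the coefficients reported in the output above; the helper `hazard_ratio` and the covariate values are illustrative assumptions, not part of eha.

```r
# Coefficients copied from the exponential PH output above
b <- c(clinic = -0.8805819, dose = -0.0289167, prison = 0.2526491)

# Hypothetical helper: hazard ratio of profile x_a relative to profile x_b
hazard_ratio <- function(x_a, x_b, coefs = b) {
  exp(sum(coefs * (x_a[names(coefs)] - x_b[names(coefs)])))
}

# Clinic 2 vs clinic 1, other covariates equal
hr_clinic <- hazard_ratio(c(clinic = 2, dose = 60, prison = 0),
                          c(clinic = 1, dose = 60, prison = 0))

# A 10-unit higher dose, other covariates equal
hr_dose10 <- hazard_ratio(c(clinic = 1, dose = 70, prison = 0),
                          c(clinic = 1, dose = 60, prison = 0))

c(clinic_2_vs_1 = hr_clinic, dose_plus_10 = hr_dose10)
```

The first ratio reproduces the Exp(Coef) column for clinic, and the second equals the one-unit dose ratio raised to the tenth power.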

Interpretations

What is the metric of $y, b_0, b_1,$ and $b_2$, respectively?
What is the difference between coefficients and $\exp$(coefficients)? Specify the metric.
What is the interpretation when 1) $b_i=0$, 2) $b_i<0$, or 3) $b_i>0$?
What is the interpretation when 1) $\exp(b_i)=1$, 2) $\exp(b_i)<1$ or 3) $\exp(b_i)>1$?
How would you compare the hazard of dropout between two groups of people below? Is it additive or multiplicative?
- What is the estimated hazard of dropout ($h(t)$ for clinic=1) for those with clinic=1, dose=50, and prison=1?
- What is the estimated hazard of dropout ($h(t)$ for clinic=2) for those with clinic=2, dose=50, and prison=1?
Compare the following two groups:
- A group with clinic=1, dose=100, and prison=1 vs. another group with clinic=1, dose=200, and prison=1
- A group with clinic=2, dose=100, and prison=1 vs. another group with clinic=2, dose=100, and prison=0
Interpret $b_0, b_1,$ and $b_2$, respectively.
The following model includes an interaction term between clinic and prison. In this model specification, prison is a (mediator, moderator) on the association between clinic and time to dropout.

Exp.PH_int <- phreg(Y ~ clinic + dose + prison + clinic*prison,
                data=addicts, shape=1, dist="weibull", param="survreg")
b_Exp.PH_int = coef(Exp.PH_int)
Exp.PH_int

## Call:
## phreg(formula = Y ~ clinic + dose + prison + clinic * prison, 
##     data = addicts, dist = "weibull", shape = 1, param = "survreg")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## clinic              1.378    -0.670     0.512     0.288     0.020 
## dose               64.317    -0.030     0.971     0.006     0.000 
## prison              0.418     0.754     2.127     0.529     0.154 
## clinic:prison    
##    :                         -0.421     0.656     0.423     0.319 
## 
## log(scale)                    3.893               0.475     0.000 
## 
##  Shape is fixed at  1 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1093.5 
## LR test statistic         50.91 
## Degrees of freedom        4 
## Overall p-value           2.33502e-10

State the estimated equation (the intercept is $-\log(scale)$, because the exponential baseline hazard is 1/scale): \[h(t) = \exp((-3.8931281) + (-0.6703443)\times clinic + (-0.0295687) \times dose + (0.7544933) \times prison + (-0.4213606) \times clinic \times prison)\]
Compare the hazard of dropout between clinic=1 ($h_{x_1=1}$) and clinic=2 ($h_{x_1=2}$) with consideration of an interaction term.

Exponential, AFT

Exp.AFT <- aftreg(Y ~ clinic + dose + prison,
                data=addicts, shape=1, dist="weibull", param="survreg")

## Warning in aftreg(Y ~ clinic + dose + prison, data = addicts, shape = 1, :
## 'survreg' is a deprecated argument value

a_Exp.AFT = coef(Exp.AFT)
Exp.AFT

## Call:
## aftreg(formula = Y ~ clinic + dose + prison, data = addicts, 
##     dist = "weibull", shape = 1, param = "survreg")
## 
## Covariate          W.mean      Coef Life-Expn  se(Coef)    Wald p
## clinic              1.378     0.881     2.412     0.211     0.000 
## dose               64.317     0.029     1.029     0.006     0.000 
## prison              0.418    -0.253     0.777     0.165     0.125 
## 
## Baseline parameters:
## log(scale)                    3.684               0.431     0.000 
## Baseline life expectancy:  NA 
## 
##  Shape is fixed at  1 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood       -1094 
## LR test statistic         49.9 
## Degrees of freedom        3 
## Overall p-value           8.34931e-11

#plot(Exp.AFT)
#check.dist(Cox.PH, Exp.AFT)

Model specification

\[\ln (t)= a_0 + a_1x_1 + a_2x_2 + a_3x_3 + \epsilon\] Note: we will get back to $h_0(t)$ later.

Estimated equation

\[\ln(t) = (3.6843122) + (0.880538)\times clinic + (0.0289172) \times dose + (-0.2526278) \times prison\]
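On the AFT scale, exponentiated coefficients are time ratios rather than hazard ratios. The sketch below uses the coefficients reported above and, as a side check, compares them with the exponential PH coefficients from the earlier output; for the exponential model the two sets should differ only in sign.

```r
# Coefficients copied from the exponential AFT output above
a <- c(clinic = 0.880538, dose = 0.0289172, prison = -0.2526278)

# Time ratios: multiplicative change in time to dropout per one-unit increase
time_ratio <- exp(a)
round(time_ratio, 3)

# Exponential PH coefficients copied from the earlier phreg output
b_ph <- c(clinic = -0.8805819, dose = -0.0289167, prison = 0.2526491)

# With shape fixed at 1, a = -b up to numerical error in the two fits
max_gap <- max(abs(a + b_ph))
max_gap
```

The rounded time ratios match the Life-Expn column of the output.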

Interpretations

What is the metric of $y, a_0, a_1,$ and $a_2$, respectively?
What is the difference between coefficients and $\exp$(coefficients)? Specify the metric.
What is the interpretation when 1) $a_i=0$, 2) $a_i<0$, or 3) $a_i>0$?
What is the interpretation when 1) $\exp(a_i)=1$, 2) $\exp(a_i)<1$ or 3) $\exp(a_i)>1$?
How would you compare the time to dropout between two groups of people below? Is it additive or multiplicative?
- What is the estimated time to dropout ($t$ for clinic=1) for those with clinic=1, dose=50, and prison=1?
- What is the estimated time to dropout ($t$ for clinic=2) for those with clinic=2, dose=50, and prison=1?
Compare the following two groups:
- A group with clinic=1, dose=100, and prison=1 vs. another group with clinic=1, dose=200, and prison=1
- A group with clinic=2, dose=100, and prison=1 vs. another group with clinic=2, dose=100, and prison=0
Interpret $a_0, a_1,$ and $a_2$, respectively.
The following model includes an interaction term between clinic and prison. In this model specification, prison is a (mediator, moderator) on the association between clinic and time to dropout.

Exp.AFT_int <- aftreg(Y ~ clinic + dose + prison + clinic*prison,
                data=addicts, shape=1, dist="weibull", param="survreg")

## Warning in aftreg(Y ~ clinic + dose + prison + clinic * prison, data = addicts,
## : 'survreg' is a deprecated argument value

a_Exp.AFT_int = coef(Exp.AFT_int)
Exp.AFT_int

## Call:
## aftreg(formula = Y ~ clinic + dose + prison + clinic * prison, 
##     data = addicts, dist = "weibull", shape = 1, param = "survreg")
## 
## Covariate          W.mean      Coef Life-Expn  se(Coef)    Wald p
## clinic              1.378     0.670     1.955     0.288     0.020 
## dose               64.317     0.030     1.030     0.006     0.000 
## prison              0.418    -0.754     0.470     0.529     0.154 
## clinic:prison    
##    :                          0.421     1.524     0.423     0.319 
## 
## Baseline parameters:
## log(scale)                    3.893               0.475     0.000 
## Baseline life expectancy:  NA 
## 
##  Shape is fixed at  1 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1093.5 
## LR test statistic         50.9 
## Degrees of freedom        4 
## Overall p-value           2.33502e-10

State the estimated equation: \[\ln(t) = (3.8931535) + (0.6703699)\times clinic + (0.0295675) \times dose + (-0.7544644) \times prison + (0.4213244) \times clinic \times prison\]
Compare the time to dropout between clinic=1 ($t_{x_1=1}$) and clinic=2 ($t_{x_1=2}$) with consideration of an interaction term.

Weibull, PH

Weib.PH <- phreg(Y ~ clinic + dose + prison,
                 data=addicts, dist="weibull", param="survreg")
b_Weib.PH = coef(Weib.PH)
Weib.PH

## Call:
## phreg(formula = Y ~ clinic + dose + prison, data = addicts, dist = "weibull", 
##     param = "survreg")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## clinic              1.378    -0.972     0.379     0.212     0.000 
## dose               64.317    -0.033     0.967     0.006     0.000 
## prison              0.418     0.314     1.369     0.166     0.058 
## 
## log(scale)                    4.105               0.328     0.000 
## log(shape)                    0.315               0.068     0.000 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1084.5 
## LR test statistic         60.89 
## Degrees of freedom        3 
## Overall p-value           3.8014e-13

plot(Weib.PH)

Model specification

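\[\ln h(t)= \ln h_0(t) + b_1x_1+b_2x_2+b_3x_3\]

or \[h(t)=h_0 (t)\exp(b_1x_1+b_2x_2+b_3x_3)\] Note: a hazard model has no error term. For the Weibull model the baseline hazard $h_0(t) = \frac{p}{\lambda}\left(\frac{t}{\lambda}\right)^{p-1}$ changes over time through the shape $p$ and scale $\lambda$; we will get back to $h_0(t)$ later.

Estimated equation

\[\ln(h(t)) = \ln h_0(t) + (-0.9715245)\times clinic + (-0.0334675) \times dose + (0.3144143) \times prison\] or \[h(t)=h_0(t)\exp((-0.9715245)\times clinic + (-0.0334675) \times dose + (0.3144143)\times prison)\] where the baseline hazard $h_0(t)$ is defined by the reported log(scale) = 4.1048451 and log(shape) = 0.315.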

Weibull, AFT

Weib.AFT <- aftreg(Y ~ clinic + dose + prison,
                   data=addicts, dist="weibull", param="survreg")

## Warning in aftreg(Y ~ clinic + dose + prison, data = addicts, dist = "weibull",
## : 'survreg' is a deprecated argument value

a_Weib.AFT = coef(Weib.AFT)
Weib.AFT

## Call:
## aftreg(formula = Y ~ clinic + dose + prison, data = addicts, 
##     dist = "weibull", param = "survreg")
## 
## Covariate          W.mean      Coef Life-Expn  se(Coef)    Wald p
## clinic              1.378     0.709     2.032     0.157     0.000 
## dose               64.317     0.024     1.025     0.005     0.000 
## prison              0.418    -0.230     0.795     0.121     0.057 
## 
## Baseline parameters:
## log(scale)                    4.105               0.328     0.000 
## log(shape)                    0.315               0.068     0.000 
## Baseline life expectancy:  55.5 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1084.5 
## LR test statistic         60.9 
## Degrees of freedom        3 
## Overall p-value           3.8014e-13

plot(Weib.AFT)

#check.dist(Cox.PH, Weib.AFT)

Model specification

\[\ln (t)= a_0 + a_1x_1 + a_2x_2 + a_3x_3 + \epsilon\]

Estimated equation

\[\ln(t) = (4.1051824) + (0.7089269)\times clinic + (0.0244211) \times dose + (-0.2296602) \times prison\]
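The Weibull PH and AFT fits describe the same model in two parameterizations, linked by $\beta_{PH} = -\beta_{AFT} \times$ shape. The sketch below checks this with the coefficients reported in the two outputs above; the shape is taken as exp(0.315) from the rounded log(shape), so small discrepancies are expected.

```r
# Coefficients copied from the Weibull AFT and Weibull PH outputs above
a_aft <- c(clinic = 0.7089269, dose = 0.0244211, prison = -0.2296602)
b_ph  <- c(clinic = -0.9715245, dose = -0.0334675, prison = 0.3144143)

# Shape from the reported (rounded) log(shape)
shape <- exp(0.315)

# PH coefficients implied by the AFT fit
implied_b_ph <- -a_aft * shape

round(cbind(reported = b_ph, implied = implied_b_ph), 4)
```

The two columns agree to about three decimal places, which is what the rounding of log(shape) allows.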

Log-logistic, Proportional Odds

Llogis.PO <- phreg(Y ~ clinic + dose + prison,
                     data=addicts, dist="loglogistic", param="survreg")
b_Llogis.PO = coef(Llogis.PO)
Llogis.PO

## Call:
## phreg(formula = Y ~ clinic + dose + prison, data = addicts, dist = "loglogistic", 
##     param = "survreg")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## (Intercept)                  18.490             781.991     0.981 
## clinic              1.378    -0.972     0.378     0.212     0.000 
## dose               64.317    -0.033     0.967     0.006     0.000 
## prison              0.418     0.314     1.369     0.166     0.058 
## 
## log(scale)                   17.598             570.780     0.975 
## log(shape)                    0.315               0.068     0.000 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1084.5 
## LR test statistic         60.89 
## Degrees of freedom        3 
## Overall p-value           3.78697e-13

Model specification

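\[\ln h(t)= \ln h_0(t) + b_0 + b_1x_1+b_2x_2+b_3x_3\]

or \[h(t)=h_0 (t)\exp(b_0 + b_1x_1+b_2x_2+b_3x_3)\]

Estimated equation

\[\ln(h(t)) = \ln h_0(t) + (18.4898785) + (-0.9715453)\times clinic + (-0.0334685) \times dose + (0.3144273) \times prison\] or \[h(t)=h_0(t)\exp((18.4898785) + (-0.9715453)\times clinic + (-0.0334685) \times dose + (0.3144273)\times prison)\] Note that the intercept and log(scale) have very large standard errors (781.991 and 570.780), so they are not separately well identified in this fit.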

Please note that the log-logistic model is a proportional odds model, not a proportional hazards model.

\[S(t)=\frac{1}{1+\lambda t^p} \]
\[1-S(t) = 1- \frac{1}{1+\lambda t^p}= \frac{1+\lambda t^p - 1}{1+\lambda t^p} = \frac{\lambda t^p}{1+\lambda t^p} \]
\[ Survival\;\;odds\; (SO) = \frac{S(t)}{1-S(t)}= \frac{\frac{1}{1+\lambda t^p}}{\frac{\lambda t^p}{1+\lambda t^p}}= \frac{1}{\lambda t^p}\]
\[ Failure\;\; odds\; (FO) = \frac{1-S(t)}{S(t)} = \lambda t^p \]
\[ \ln (FO) = \ln(\lambda t^p)=\ln(\lambda)+p \times \ln(t) \]
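The last line of the derivation says that the log failure odds are linear in $\ln(t)$ with intercept $\ln(\lambda)$ and slope $p$. A small numerical check, using arbitrary illustrative values for $\lambda$ and $p$:

```r
lambda <- 0.01
p      <- 1.7
t_grid <- c(10, 50, 100, 500)

# Log-logistic survival and failure odds
surv         <- 1 / (1 + lambda * t_grid^p)
failure_odds <- (1 - surv) / surv

# Regress log odds on log time: intercept should be log(lambda), slope should be p
odds_fit <- lm(log(failure_odds) ~ log(t_grid))
coef(odds_fit)
c(log_lambda = log(lambda), p = p)
```

This linearity is the basis of the usual graphical check for the log-logistic assumption: plotting estimated log failure odds against log time should give roughly a straight line.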

Log-logistic, AFT

Llogis.AFT <- aftreg(Y ~ clinic + dose + prison,
                     data=addicts, dist="loglogistic", param="survreg")

## Warning in aftreg(Y ~ clinic + dose + prison, data = addicts, dist =
## "loglogistic", : 'survreg' is a deprecated argument value

a_Llogis.AFT = coef(Llogis.AFT)
Llogis.AFT

## Call:
## aftreg(formula = Y ~ clinic + dose + prison, data = addicts, 
##     dist = "loglogistic", param = "survreg")
## 
## Covariate          W.mean      Coef Life-Expn  se(Coef)    Wald p
## clinic              1.378     0.581     1.787     0.172     0.001 
## dose               64.317     0.032     1.032     0.006     0.000 
## prison              0.418    -0.291     0.747     0.144     0.043 
## 
## Baseline parameters:
## log(scale)                    3.563               0.389     0.000 
## log(shape)                    0.533               0.069     0.000 
## Baseline life expectancy:  NA 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1093.9 
## LR test statistic         52.2 
## Degrees of freedom        3 
## Overall p-value           2.73866e-11

plot(Llogis.AFT)

#check.dist(Cox.PH, Llogis.AFT)

Model specification

\[\ln (t)= a_0 + a_1x_1 + a_2x_2 + a_3x_3 + \epsilon\]

Estimated equation

\[\ln(t) = (3.5634386) + (0.5805364)\times clinic + (0.0316117) \times dose + (-0.2912455) \times prison\]

Gompertz, PH

G.PH <- phreg(Y ~ clinic + dose + prison,
              data=addicts, dist="gompertz")
G.PH

## Call:
## phreg(formula = Y ~ clinic + dose + prison, data = addicts, dist = "gompertz")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## clinic              1.378    -1.030     0.357     0.214     0.000 
## dose               64.317    -0.035     0.965     0.006     0.000 
## prison              0.418     0.327     1.386     0.166     0.050 
## 
## log(scale)                    6.262               0.195     0.000 
## log(shape)                    2.508               0.664     0.000 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1081.5 
## LR test statistic         65.61 
## Degrees of freedom        3 
## Overall p-value           3.70814e-14

plot(G.PH)

#check.dist(Cox.PH, G.PH)

Gompertz, AFT

G.AFT <- aftreg(Y ~ clinic + dose + prison,
              data=addicts, dist="gompertz", param="survreg")

## Warning in aftreg(Y ~ clinic + dose + prison, data = addicts, dist =
## "gompertz", : 'survreg' is a deprecated argument value

G.AFT

## Call:
## aftreg(formula = Y ~ clinic + dose + prison, data = addicts, 
##     dist = "gompertz", param = "survreg")
## 
## Covariate          W.mean      Coef Life-Expn  se(Coef)    Wald p
## clinic              1.378     0.726     2.067     0.153     0.000 
## dose               64.317     0.019     1.019     0.004     0.000 
## prison              0.418    -0.199     0.819     0.098     0.043 
## 
## Baseline parameters:
## log(scale)                    4.437               0.271     0.000 
## log(shape)                   -0.573               0.336     0.088 
## Baseline life expectancy:  72.6 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1081.6 
## LR test statistic         65.5 
## Degrees of freedom        3 
## Overall p-value           4.00791e-14

plot(G.AFT)

#check.dist(Cox.PH, G.AFT)

Lognormal, AFT

Lognormal.AFT <- aftreg(Y ~ clinic + dose + prison,
                        data=addicts, dist="lognormal", param="survreg")

## Warning in aftreg(Y ~ clinic + dose + prison, data = addicts, dist =
## "lognormal", : 'survreg' is a deprecated argument value

Lognormal.AFT

## Call:
## aftreg(formula = Y ~ clinic + dose + prison, data = addicts, 
##     dist = "lognormal", param = "survreg")
## 
## Covariate          W.mean      Coef Life-Expn  se(Coef)    Wald p
## clinic              1.378     0.576     1.780     0.176     0.001 
## dose               64.317     0.034     1.034     0.006     0.000 
## prison              0.418    -0.309     0.734     0.154     0.045 
## 
## Baseline parameters:
## log(scale)                    3.407               0.398     0.000 
## log(shape)                   -0.075               0.059     0.207 
## Baseline life expectancy:  NA 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1097.8 
## LR test statistic         51.9 
## Degrees of freedom        3 
## Overall p-value           3.22167e-11

plot(Lognormal.AFT)

#check.dist(Cox.PH, Lognormal.AFT)

Plotting survival curves for exp, weibull, and loglogistic

Exponential, AFT

Exp.AFT <- flexsurvreg(Y ~ clinic + dose + prison,
                data=addicts, dist="exponential")
Exp.AFT

## Call:
## flexsurvreg(formula = Y ~ clinic + dose + prison, data = addicts, 
##     dist = "exponential")
## 
## Estimates: 
##         data mean  est       L95%      U95%      se        exp(est)  L95%    
## rate          NA    0.02511   0.01080   0.05842   0.01082        NA        NA
## clinic   1.31513   -0.88058  -1.29340  -0.46776   0.21063   0.41454   0.27434
## dose    60.39916   -0.02892  -0.04096  -0.01687   0.00614   0.97150   0.95987
## prison   0.46639    0.25265  -0.07052   0.57582   0.16489   1.28743   0.93191
##         U95%    
## rate          NA
## clinic   0.62640
## dose     0.98327
## prison   1.77859
## 
## N = 238,  Events: 150,  Censored: 88
## Total time at risk: 95812
## Log-likelihood = -1093.971, df = 4
## AIC = 2195.942

Weibull, AFT

Weib.AFT <- flexsurvreg(Y ~ clinic + dose + prison,
                 data=addicts, dist="weibull")
Weib.AFT

## Call:
## flexsurvreg(formula = Y ~ clinic + dose + prison, data = addicts, 
##     dist = "weibull")
## 
## Estimates: 
##         data mean  est        L95%       U95%       se         exp(est) 
## shape          NA    1.37019    1.20026    1.56418    0.09257         NA
## scale          NA   60.63335   31.87629  115.33346   19.89127         NA
## clinic    1.31513    0.70904    0.40089    1.01720    0.15722    2.03204
## dose     60.39916    0.02443    0.01543    0.03342    0.00459    1.02473
## prison    0.46639   -0.22947   -0.46621    0.00727    0.12079    0.79496
##         L95%       U95%     
## shape          NA         NA
## scale          NA         NA
## clinic    1.49315    2.76543
## dose      1.01555    1.03399
## prison    0.62738    1.00730
## 
## N = 238,  Events: 150,  Censored: 88
## Total time at risk: 95812
## Log-likelihood = -1084.477, df = 5
## AIC = 2178.953

Log-logistic, AFT

Llogis.AFT <- flexsurvreg(Y ~ clinic + dose + prison,
                   data=addicts, dist="llogis")
Llogis.AFT

## Call:
## flexsurvreg(formula = Y ~ clinic + dose + prison, data = addicts, 
##     dist = "llogis")
## 
## Estimates: 
##         data mean  est       L95%      U95%      se        exp(est)  L95%    
## shape         NA    1.70428   1.48980   1.94964   0.11696        NA        NA
## scale         NA   35.27831  16.44372  75.68597  13.73944        NA        NA
## clinic   1.31513    0.58060   0.24432   0.91687   0.17157   1.78711   1.27675
## dose    60.39916    0.03161   0.02079   0.04244   0.00552   1.03212   1.02101
## prison   0.46639   -0.29127  -0.57344  -0.00910   0.14397   0.74731   0.56358
##         U95%    
## shape         NA
## scale         NA
## clinic   2.50146
## dose     1.04335
## prison   0.99094
## 
## N = 238,  Events: 150,  Censored: 88
## Total time at risk: 95812
## Log-likelihood = -1093.915, df = 5
## AIC = 2197.83

plot(Llogis.AFT, ci=FALSE, conf.int=FALSE, ylab="Survival", xlab="Time")
lines(Weib.AFT, col="blue", ci=FALSE)
lines(Exp.AFT, col="green", ci=FALSE)
legend("topright", lty=c(1,1,1), lwd=c(2,2,2), col=c("green", "blue", "red"),
       c("Exp", "Weibull", "Loglogistic"))

Final notes

The quantity $\exp(\beta_1)$ is a ratio, thus:
- when $\exp(\beta_1)=1$: there is no difference in hazards between the two groups;
- when $\exp(\beta_1)<1$: the hazard of dropout in the comparison group (i.e., numerator) is lower than the hazard of dropout in the reference group (i.e., denominator) by $(1-\exp(\beta_1)) \times 100\%$;
- when $\exp(\beta_1)>1$: the hazard of dropout in the comparison group (i.e., numerator) is $\exp(\beta_1)$ times the hazard of dropout in the reference group (i.e., denominator), that is, higher by $(\exp(\beta_1)-1) \times 100\%$.

The same rule applies to logistic regression and other multiplicative models; simply replace “hazard” with “odds” (or the appropriate metric in the model). How about an additive model (e.g., linear regression)?

Notes for survival model selection¹

Akaike’s information criterion (AIC) provides an approach for comparing the fit of models with different underlying distributions, making use of the -2 log likelihood statistic.
- The AIC statistic is calculated as: -2 log likelihood + 2$p$ (where $p$ is the number of parameters in the model).
- A smaller AIC statistic suggests a better fit.
- The addition of 2 times $p$ can be thought of as a penalty if nonpredictive parameters are added to the model.
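- Nested vs. non-nested models: for nested models, the likelihood ratio test is generally considered preferable to the AIC; for non-nested models (e.g., models with different underlying distributions), the LR test does not apply and the AIC is used instead.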
Figure: “AIC Example”
- Likelihood ratio (LR) test: compute the difference between the log likelihood statistic of the reduced model (with fewer parameters to estimate) and the log likelihood statistic of the full model (with more parameters to estimate). In general, the LR statistic can be written in the form $-2 \ln L_R - (-2 \ln L_F)$, where $R$ denotes the reduced model and $F$ denotes the full model. The test statistic has a $\chi$-square distribution with $p$ degrees of freedom, where $p$ denotes the number of additional parameters being assessed.
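Both criteria can be computed by hand from the log likelihoods and parameter counts printed in the flexsurvreg outputs above. The sketch below does so; the exponential model is the Weibull model with shape fixed at 1, so that pair is nested and an LR test with 1 degree of freedom applies.

```r
# Log likelihoods and parameter counts copied from the flexsurvreg outputs above
loglik <- c(exponential = -1093.971, weibull = -1084.477, loglogistic = -1093.915)
n_par  <- c(exponential = 4,         weibull = 5,         loglogistic = 5)

# AIC = -2 log L + 2p; smaller is better
aic <- -2 * loglik + 2 * n_par
sort(aic)

# LR test: exponential (reduced) vs Weibull (full), one extra parameter (shape)
lr_stat <- unname(-2 * loglik["exponential"] - (-2 * loglik["weibull"]))
lr_p    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
c(LR = lr_stat, p = lr_p)
```

The hand-computed AIC values agree with the printed ones up to rounding of the log likelihoods. The Weibull and log-logistic models are not nested, so only the AIC comparison applies to that pair.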
The exponential distribution is characterized by the fact that it lacks memory. In other words, items whose life lengths follow an exponential distribution do not age; no matter how old they are, if they are alive they are as good as new. This concept is not useful when it comes to human lives, but the life lengths of electronic components are often modeled by the exponential distribution in reliability theory.
Although the exponential distribution is not useful for describing whole human lives, it may be useful for short segments of life. At least it will be a good approximation if the segment is short enough. This is the idea behind the piecewise constant hazards distribution. Its definition involves a partition of the time (age) axis, and one positive constant (the hazard level) corresponding to each interval. Note that the last interval will be open, with infinite length; only a finite number of cut points are allowed. The definition of the hazard function $h(t)$ becomes, with the cuts denoted $\mathbf{t}=(t_1 < \cdots < t_n)$ and the levels denoted $\mathbf{h}=(h_1, \dots, h_{n+1})$: \[h(t;\mathbf{t},\mathbf{h})= \begin{cases} h_1, & t \le t_1; \\ h_i, & t_{i-1} < t \le t_i,\ i=2,\dots,n; \\ h_{n+1}, & t_n<t. \end{cases}\]
- In this definition, the number of levels must be exactly one more than the number of cut points.
- Note that, despite the fact that the hazard function is not continuous, the other functions are. They are not differentiable at the cut points, though. The piecewise constant distribution is very flexible. It can be made arbitrarily close to any continuous distribution by increasing the number of cut points and choosing the levels appropriately. Parametric proportional hazards modeling with the pch distribution is a serious competitor to the Cox regression model, especially with large data sets.
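The definition can be sketched in a few lines of base R. The function names `pch_hazard` and `pch_cumhaz` and the example cuts and levels are illustrative assumptions (they are not eha functions); intervals are taken as open on the left and closed on the right, with time starting at 0.

```r
# Piecewise constant hazard: level i applies on (t_{i-1}, t_i]
pch_hazard <- function(t, cuts, haz_levels) {
  stopifnot(length(haz_levels) == length(cuts) + 1)
  haz_levels[findInterval(t, cuts, left.open = TRUE) + 1]
}

# Cumulative hazard: sum of level x time spent in each interval up to t
pch_cumhaz <- function(t, cuts, haz_levels) {
  stopifnot(length(haz_levels) == length(cuts) + 1)
  lower <- c(0, cuts)
  upper <- c(cuts, Inf)
  sapply(t, function(ti) sum(haz_levels * pmax(0, pmin(ti, upper) - lower)))
}

cuts       <- c(1, 3)            # two cut points
haz_levels <- c(0.1, 0.2, 0.4)   # three levels, one more than the cuts

pch_hazard(c(0.5, 1, 2, 3, 5), cuts, haz_levels)

# Survival is continuous even though the hazard jumps at the cuts
exp(-pch_cumhaz(c(1, 3, 4), cuts, haz_levels))
```

The hazard is a step function, while the cumulative hazard is piecewise linear and the survival function continuous, as noted above.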

Piecewise constant hazards function, AFT

PCH.AFT1 <- pchreg(Y ~ clinic + dose + prison, data=addicts, cuts=1:1000)
PCH.AFT1

## Call:
## pchreg(formula = Y ~ clinic + dose + prison, data = addicts, 
##     cuts = 1:1000)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## clinic              1.378    -1.009     0.365     0.215     0.000 
## dose               64.317    -0.035     0.965     0.006     0.000 
## prison              0.418     0.327     1.386     0.167     0.051 
## 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -809.54 
## LR test statistic         64.52 
## Degrees of freedom        3 
## Overall p-value           6.36158e-14

plot(PCH.AFT1)

check.dist(Cox.PH, PCH.AFT1)

The Weibull distribution is a very popular parametric model for survival data. It was described in detail by Waloddi Weibull (Weibull 1951), although it was known earlier. It is one of the so-called extreme-value distributions, and as such it is very useful in reliability theory. It is becoming popular in demographic applications, but in mortality studies it is wise to avoid it for old-age mortality (the hazard grows too slowly) and for mortality at ages 0–15 years (U-shaped hazards, which the Weibull model does not allow).
The lognormal distribution is connected to the normal distribution through the exponential function: If $X$ is normally distributed, then $Y = \exp(X)$ is lognormally distributed. Conversely, if $Y$ is lognormally distributed, then $X = \log(Y)$ is normally distributed. The lognormal distribution has the interesting property that the hazard function is first increasing, then decreasing, in contrast to the Weibull distribution which only allows for monotone (increasing or decreasing) hazard functions.
The loglogistic distribution is very close to the lognormal but has a heavier right tail. Its advantage over the lognormal is that the hazard function has a closed form.
The Gompertz distribution is useful for modeling old-age mortality. Its hazard function is exponentially increasing. The Gompertz distribution was generalized by Makeham (Makeham 1860) into the Gompertz-Makeham distribution; the generalization consists of adding a positive constant to the Gompertz hazard function.
The Gamma distribution is another generalization of the exponential distribution. It is popular in modeling shared frailty.
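To make the contrast in hazard shapes concrete, the sketch below computes hazards as density divided by survival, using base R distribution functions. The parameter values are arbitrary illustrations, not estimates from any data set.

```r
# Hazard = density / survival, using base R distribution functions.
haz_weibull <- function(t, shape, scale) {
  dweibull(t, shape, scale) / pweibull(t, shape, scale, lower.tail = FALSE)
}
haz_lnorm <- function(t, meanlog, sdlog) {
  dlnorm(t, meanlog, sdlog) / plnorm(t, meanlog, sdlog, lower.tail = FALSE)
}

t_grid  <- seq(0.05, 20, by = 0.05)
h_weib  <- haz_weibull(t_grid, shape = 1.5, scale = 5)
h_lnorm <- haz_lnorm(t_grid, meanlog = 0, sdlog = 1)

# Weibull with shape > 1: the hazard increases over the whole grid.
all(diff(h_weib) > 0)

# Lognormal: the hazard rises to a peak and then declines.
t_peak <- t_grid[which.max(h_lnorm)]
t_peak
```

Drawing the two curves, for example with plot(t_grid, h_lnorm, type = "l"), shows the monotone Weibull hazard against the hump-shaped lognormal hazard.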
When you have no idea of what the baseline hazard looks like, use Cox regression. Exponential regression can be used to fit models in which the hazard varies with time, and that may be a reasonable thing to do, especially if you want to verify the fit of another parametric model. For instance, pretend that you have strong reason to believe that the formulation ought to be Weibull. Even after fitting a Weibull model, you could use the exponential model with dummy variables for intervals to verify that the Weibull fit was reasonable.
Which test should we use?² There is no simple answer, so let us instead understand how to determine the answer in particular cases.
- The advantage of the modeling-the-effect approaches is that you can control for the effects of other variables. For instance, we would know that patients vary in age, and we would know age also affects outcome. In a carefully controlled experiment, we could ignore that effect because the average ages (and the distribution of age) of the control and experimental groups would be the same.
- The disadvantage of the modeling-the-effect approaches is that you could model the effect incorrectly in two ways. You could model incorrectly the effect of other variables, or you could mismodel the effect itself, for example, by stating its functional form incorrectly.
- Effects of the form “apply the treatment and get an overall improvement” are often not simple. Effects can vary with other covariates (being perhaps larger for males than for females), and effects can vary with time, which is to say, with aspects that change over time and that are not measured. For instance, a treatment might involve surgery, after which there may be a greater risk, followed by a lesser risk in the future.
- It is because of these concerns that looking at graphs is useful, whether you are engaging in parametric or semiparametric modeling (although, when doing semiparametric modeling, you can only indirectly look at the hazard function by looking at the cumulative hazard or survival function).
- In most real circumstances, you will be forced into parametric or semiparametric analysis. Nonparametric analysis is useful when the experiment has been carefully controlled, although even controlled experiments are sometimes not adequately controlled. Nonparametric analysis is always a useful starting point. In nonexperimental situations in the presence of covariates, you do this more as a data description technique than in hopes of producing any final analysis that you can believe. You, as a researcher, should be able to describe the survival experience, say, as reflected in a graph of the survivor function or cumulative hazard function for your data, ignoring the complications of confounding variables and the like. Before disentangling reality, you need to be able to describe the reality that you are starting with.
- So, our position is that you will likely be forced into parameterizing the effect. This is perhaps due more to our past analysis experiences. In a well-designed, controlled experiment, however, there is nothing wrong with not parameterizing the effect and stopping at nonparametric analysis.
- If you do need to continue, should you parameterize the hazard function? On this issue, different researchers feel differently. We are favorably disposed to parametric analysis when you have good reason to believe that the hazard function ought to follow a certain shape. Imposing a hazard function is an excellent way of improving the efficiency of your estimates and helping to avoid being misled by the fortuity of chance. On the other hand, when you do not have good deductive reasons to know the shape of the hazard, you should use semiparametric analysis.
- When choosing between a semiparametric and a parametric analysis, you must also take into consideration what information you are trying to obtain. If all you care about are hazard ratios (parameter effects) in a PH model, then you are probably better off with a semiparametric analysis. If you are interested in predicting the time to failure, however, some sort of parametric assumption as to the hazard is necessary. Here, even if you do not have deductive knowledge as to the shape of the hazard, you can try several functional forms of the hazard and compare them. You can use the piecewise exponential model to “nonparametrically” check the validity of any parametric form you wish to posit.
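The piecewise exponential check mentioned above can be sketched as a Poisson regression on follow-up time split into intervals. The code uses the `lung` data from the survival package; the cut points (180, 365, and 545 days), the covariates, and all object names are our own illustrative choices.

```r
library(survival)

# Event indicator (lung codes status as 1 = censored, 2 = dead).
d <- transform(lung, dead = as.integer(status == 2))

# Split each subject's follow-up at the cut points (in days); `interval`
# records which piece of the time axis a row belongs to.
cuts <- c(180, 365, 545)
d_split <- survSplit(Surv(time, dead) ~ age + sex, data = d,
                     cut = cuts, episode = "interval")

# Piecewise exponential model: Poisson regression with one dummy variable
# per interval and log(exposure time) as the offset.
fit_pe <- glm(dead ~ factor(interval) + age + sex +
                offset(log(time - tstart)),
              family = poisson, data = d_split)

# Weibull fit (survreg works in the AFT metric) and Cox fit for comparison.
fit_weib <- survreg(Surv(time, dead) ~ age + sex, data = d, dist = "weibull")
fit_cox  <- coxph(Surv(time, dead) ~ age + sex, data = d)

# Log baseline hazard level in each interval (the first one is the intercept).
b <- coef(fit_pe)
log_levels <- b[1] + c(0, b[grep("interval", names(b))])
exp(log_levels)

# Covariate effects on the log hazard scale; survreg coefficients are
# converted to the PH scale as -coef / scale.
rbind(piecewise = b[c("age", "sex")],
      cox       = coef(fit_cox),
      weibull   = -coef(fit_weib)[c("age", "sex")] / fit_weib$scale)
```

If the Weibull form is reasonable, the interval-specific hazard levels should change smoothly and monotonically over time, and the covariate effects from the three fits should be close to each other.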

Accounting for heterogeneity³

Suppose that we collect data measuring time (variable $time$) from the onset of risk at time zero until occurrence of an event of interest (variable $fail$) on patients from different hospitals (variable $hospital$). We want to study patients’ survival as a function of some risk factors, say age and gender (variable $age$ and $gender$).

We can estimate the effect of predictors on survival by fitting a Cox model. \[h(t) = h_0(t) \exp(age\times x_1 + gender\times x_2)\]
- In this model, we ignore the fact that patients come from different hospitals and therefore assume that hospitals have no effect on the results.
  - If we believe that there might be a group effect (e.g., the effect of a hospital), we should take it into account in the analysis.

There are various ways of adjusting for group effects. (When we say that subjects are correlated, we mean that subjects’ failure times are correlated within groups, or that groups are heterogeneous.) Each approach depends on the nature of the grouping of subjects and on the assumptions we are willing to make about the effect of grouping on subjects’ survival.

Stratified model
- Suppose we identified a fixed number of hospitals and then sampled our patients within each hospital; that is, we stratified on hospitals in our sampling design. Then we can adjust for the homogeneity of patients within a stratum (a hospital) using a stratified Cox model. \[h_g(t) = h_{0g}(t)\exp(age\times x_1 + gender\times x_2), \;\; where \;\;g=1, \cdots, n\]
- The same logic applies to
  - the situation when we believe that there is possible dependence among patients within a hospital. Subjects might be correlated, either because of how we sampled our data or because of some other reasons specific to the nature of the grouping (e.g., the survey-specific approach with the svy: prefix in Stata), or
  - we want to allow baseline hazards to be different for each hospital rather than constraining them to be multiplicative versions of each other. If your main focus is on the effect of other predictors (e.g., age and gender), you may benefit from accounting for the group-specific effects in a more general way by stratifying on the group.
Random effect model
- Alternatively, we can model correlation by assuming that it is induced by an unobserved hospital-level random effect, or frailty, and by specifying the distribution of this random effect (only for parametric models). The effect of a hospital is assumed to be random and to have a multiplicative effect on the hazard function. Here the effect of a hospital is directly incorporated into the hazard function, resulting in a different model specification for the survival data: a shared frailty model. As such, both point estimates and their standard errors will change. For example, with a gamma frailty, the effect of a hospital has mean 1 and variance $\theta$. If the estimated $\hat{\theta}$ is not significantly different from zero, the correlation due to hospitals can be ignored. \[h(t) = h_0(t) \exp(age\times x_1 + gender\times x_2) \;\;with\;\; frailty(hospital)\]
Fixed effect model
- Suppose we are only interested in the effect of our observed hospitals rather than in making inferences about the effect of all hospitals based on the observed random sample of hospitals. In this case, the effects of all hospitals are treated as fixed, and we estimate them by including hospital indicators in the model. We assume that the hospitals have a direct multiplicative effect on the hazard function. That is, all patients share the same baseline hazard function, and the effect of a hospital multiplies this baseline hazard function up or down depending on the sign of the estimated coefficient for the hospital indicator. \[h(t) = h_0(t) \exp(age\times x_1 + gender\times x_2 + hospital \times x_3)\]
Interaction with stratification
- You may include an interaction term “hospital*age”, which will result in a different model: the effect of a hospital is absorbed in the baseline hazard but the effect of $age$ is allowed to vary with hospitals. \[h_g(t) = h_{0g}(t) \exp(age\times x_1 + gender\times x_2 + hospital \times age \times x_3), \;\; where\;\; g=1, \cdots, n\]

In sum, there is no definitive recommendation on how to account for the group effect and on which model is the most appropriate when analyzing data.
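The alternatives above can be sketched with coxph. As an assumption made only for illustration, we treat the enrolling institution (`inst`) in the `lung` data as the “hospital” and keep institutions with at least 10 patients so that every group contributes events; the object names are ours.

```r
library(survival)

# Illustration only: use the enrolling institution as the "hospital" and
# keep institutions with at least 10 patients.
inst_size <- table(lung$inst)
d <- subset(lung, inst %in% names(inst_size)[inst_size >= 10])
d$hospital <- factor(d$inst)

# Model that ignores hospitals.
fit_pooled <- coxph(Surv(time, status) ~ age + sex, data = d)

# Stratified model: a separate baseline hazard for each hospital.
fit_strata <- coxph(Surv(time, status) ~ age + sex + strata(hospital),
                    data = d)

# Random effect (shared gamma frailty) model.
fit_frail <- coxph(Surv(time, status) ~ age + sex +
                     frailty(hospital, distribution = "gamma"), data = d)

# Fixed effect model: hospital indicators multiply a common baseline hazard.
fit_fixed <- coxph(Surv(time, status) ~ age + sex + hospital, data = d)

# Compare the age and sex coefficients across the four specifications.
sapply(list(pooled = fit_pooled, strata = fit_strata,
            frailty = fit_frail, fixed = fit_fixed),
       function(f) coef(f)[c("age", "sex")])
```

Comparing the columns shows how much the estimated effects of age and sex depend on the way the group effect is handled.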

Robust standard error (aka empirical standard error, sandwich estimator)

A widely used technique for adjusting for the correlation among outcomes on the same subject is called robust estimation (also referred to as empirical estimation). This technique essentially involves adjusting the estimated variances of the regression coefficients obtained for a fitted model to account for misspecification of the assumed correlation structure.

library(survival)

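fram <- read.csv("frmgham2.csv", header = TRUE)
attach(fram)   # makes TIMECVD and CVD visible to Surv() below
#head(fram)

# Survival outcome: follow-up time TIMECVD, with CVD == 1 as the event
Y=Surv(TIMECVD, CVD==1)

# Naive Cox model: treats all rows as independent observations
mod1=coxph(Y ~ BMI + factor(SEX), data=fram)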
summary(mod1)

## Call:
## coxph(formula = Y ~ BMI + factor(SEX), data = fram)
## 
##   n= 11575, number of events= 2879 
##    (52 observations deleted due to missingness)
## 
##                   coef exp(coef)  se(coef)      z Pr(>|z|)    
## BMI           0.047292  1.048428  0.004396  10.76   <2e-16 ***
## factor(SEX)2 -0.704816  0.494200  0.037834 -18.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## BMI             1.0484     0.9538    1.0394    1.0575
## factor(SEX)2    0.4942     2.0235    0.4589    0.5322
## 
## Concordance= 0.618  (se = 0.005 )
## Likelihood ratio test= 487.5  on 2 df,   p=<2e-16
## Wald test            = 476.4  on 2 df,   p=<2e-16
## Score (logrank) test = 495.1  on 2 df,   p=<2e-16

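# Same model, with id=RANDID marking rows that belong to the same participant;
# the output then also reports robust (sandwich) standard errors
mod1_id=coxph(Y ~ BMI + factor(SEX), id=RANDID, data=fram)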
summary(mod1_id)

## Call:
## coxph(formula = Y ~ BMI + factor(SEX), data = fram, id = RANDID)
## 
##   n= 11575, number of events= 2879 
##    (52 observations deleted due to missingness)
## 
##                   coef exp(coef)  se(coef) robust se      z Pr(>|z|)    
## BMI           0.047292  1.048428  0.004396  0.007177   6.59 4.41e-11 ***
## factor(SEX)2 -0.704816  0.494200  0.037834  0.062677 -11.24  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## BMI             1.0484     0.9538    1.0338    1.0633
## factor(SEX)2    0.4942     2.0235    0.4371    0.5588
## 
## Concordance= 0.618  (se = 0.008 )
## Likelihood ratio test= 487.5  on 2 df,   p=<2e-16
## Wald test            = 187.1  on 2 df,   p=<2e-16
## Score (logrank) test = 495.1  on 2 df,   p=<2e-16,   Robust = 174.5  p=<2e-16
## 
##   (Note: the likelihood ratio and score tests assume independence of
##      observations within a cluster, the Wald and robust score tests do not).
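Because the Framingham file is not distributed with R, the following self-contained sketch repeats the comparison on the `bladder` data from the survival package, where each patient (`id`) contributes several rows. The covariates are chosen only for illustration. For a coxph fit with a cluster term, the `naive.var` component holds the model-based variance and `var` holds the robust one.

```r
library(survival)

# Each patient (id) contributes several rows, so rows from the same
# patient are correlated.
fit_naive  <- coxph(Surv(stop, event) ~ rx + size + number, data = bladder)
fit_robust <- coxph(Surv(stop, event) ~ rx + size + number + cluster(id),
                    data = bladder)

# Point estimates are identical; only the variance estimate changes.
se_naive  <- sqrt(diag(fit_robust$naive.var))  # model-based
se_robust <- sqrt(diag(fit_robust$var))        # sandwich
cbind(coef = coef(fit_robust), se_naive, se_robust)
```

As in the Framingham output above, the coefficients do not move; the robust correction only changes the standard errors and therefore the Wald tests and confidence intervals.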

Full parameterization with baseline hazard estimates⁴

There are occasions when we want to estimate the baseline hazard:

Estimating $h_0(t)$, and thus $S_0(t)$, is the major distinction between semi-parametric and parametric modeling.
Stratified analysis may require fitting a different baseline hazard for each stratum. Recall the COVID-19 papers that stratified by data collection site, country, or disease severity; each stratum is likely to have a different baseline hazard. Stratification is required⁵ when
- the proportionality assumption does not hold for a factor covariate
- a factor has too many levels, so that it is inappropriate to treat it as an ordinary factor
- the data are matched

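The full hazard function for the Weibull PH model is \[h(t)=\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)pt^{p-1}\] Integrating the hazard gives the cumulative hazard $H(t)=\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t^p$. Therefore, in terms of $S(t)=\exp(-H(t))$, \[ S(t)=\exp(-\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t^p) \]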

With the Weibull PH model estimates from the “Addicts” dataset, and writing $\hat{p}=\exp(0.3149526)$ and $\hat{b}_0 = -\hat{p} \times 4.1048451$, this model is specified as \[\hat{h}(t)=\exp(\hat{b}_0 + (-0.9715245)\times clinic + (-0.0334675) \times dose + (0.3144143) \times prison)\, \hat{p}\, t^{\hat{p}-1}\]

\[ \hat{S}(t)=\exp(-\exp(\hat{b}_0 + (-0.9715245)\times clinic + (-0.0334675) \times dose + (0.3144143) \times prison)\, t^{\hat{p}}) \]

In the eha package in R, the baseline parameters are presented in the output as log(scale) (here 4.1048451) and log(shape) (here 0.3149526, so that the shape parameter is $p=\exp(0.3149526)$). eha writes the Weibull baseline hazard as $h_0(t)=\frac{p}{\lambda}\left(\frac{t}{\lambda}\right)^{p-1}$ with scale $\lambda$, so the intercept in the formulas above corresponds to $b_0 = -p \log \lambda$ rather than to log(scale) itself. Please note that the parameterization and naming differ across statistical packages and R packages.

For the exponential model ($p=1$ in the Weibull model), the full hazard function is \[h(t)=\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)pt^{p-1}=\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)\] Therefore, in terms of $S(t)$, \[ S(t)=\exp(-\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t^p)=\exp(-\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t) \]

For the log-logistic model, the full hazard function is \[h(t)=\frac{\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)pt^{p-1}}{1+\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t^{p}}\] Therefore, in terms of $S(t)$, \[ S(t)=\frac{1}{1+\exp(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)t^{p}} \]
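The link between each hazard and its survivor function, $S(t)=\exp(-H(t))$ with $H(t)$ the integrated hazard, can be checked numerically. The parameter values below are hypothetical and are not estimates from the Addicts data.

```r
# Illustrative (hypothetical) parameter values.
b0 <- -5; b1 <- 0.5; p <- 1.4
x1 <- 1
lp <- b0 + b1 * x1          # linear predictor including the intercept

# Weibull PH model
h_weib <- function(t) exp(lp) * p * t^(p - 1)
S_weib <- function(t) exp(-exp(lp) * t^p)

# Log-logistic model
h_llog <- function(t) exp(lp) * p * t^(p - 1) / (1 + exp(lp) * t^p)
S_llog <- function(t) 1 / (1 + exp(lp) * t^p)

# Check S(t) = exp(-H(t)), with H(t) obtained by numerical integration.
t0 <- 20
H_weib <- integrate(h_weib, 0, t0)$value
H_llog <- integrate(h_llog, 0, t0)$value

# Both differences should be close to zero.
c(weibull     = exp(-H_weib) - S_weib(t0),
  loglogistic = exp(-H_llog) - S_llog(t0))
```

Setting p to 1 in the Weibull functions gives the exponential model, with a hazard that no longer depends on t.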

Dealing with baseline hazard estimates⁶

Let us assume that in a follow-up study, the cohort is not homogeneous but instead consists of two equally sized groups with differing hazard rates. Assume further that we have no indication of which group an individual belongs to, and that members of both groups follow an exponential life length distribution: \[ h_1(t)=\lambda_1, \;\; t>0 \] \[ h_2(t)=\lambda_2, \;\; t>0 \] This implies that the corresponding survival functions $S_1$ and $S_2$ are \[ S_1(t)=e^{-\lambda_1 t}, \;\; t>0 \] \[ S_2(t)=e^{-\lambda_2 t}, \;\; t>0 \] and a randomly chosen individual will follow the “population mortality” $S$, which is a mixture of the two distributions: \[ S(t)=\frac{1}{2} S_1(t) + \frac{1}{2} S_2(t), \;\; t>0.\]

Let us calculate the hazard function for this mixture. We start by finding the density function $f(t)$: \[ f(t)=-\frac{dS(t)}{dt} =\frac{1}{2} (\lambda_1 e^{-\lambda_1 t} + \lambda_2 e^{-\lambda_2 t}), \;\; t>0\] Then, by the definition of $h(t)$, we get \[ h(t) = \frac{f(t)}{S(t)} = \omega (t) \lambda_1 + (1- \omega (t)) \lambda_2, \;\; t>0\] with \[\omega(t)=\frac{e^{-\lambda_1 t}}{e^{-\lambda_1 t} + e^{-\lambda_2 t}} \] It is easy to see that as $t \rightarrow \infty$

$\omega (t) \rightarrow 0$, when $\lambda_1 > \lambda_2$
$\omega (t) \rightarrow \frac{1}{2}$, when $\lambda_1 = \lambda_2$
$\omega (t) \rightarrow 1$, when $\lambda_1 < \lambda_2$

implying that

\[h(t) \rightarrow \min(\lambda_1, \lambda_2), \;\; t \rightarrow \infty\] The following figure shows the population hazard function (solid line). The dashed lines are the hazard functions of each group, with $\lambda_1=1$ and $\lambda_2=2$.

“Population hazard function”
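The population hazard in the figure can be recomputed directly from the formulas above; this is a minimal sketch with $\lambda_1=1$ and $\lambda_2=2$, and the function names are ours.

```r
lambda1 <- 1
lambda2 <- 2

# Weight of group 1 among the survivors at time t.
omega <- function(t) {
  exp(-lambda1 * t) / (exp(-lambda1 * t) + exp(-lambda2 * t))
}

# Population hazard: weighted average of the two group hazards.
h_pop <- function(t) omega(t) * lambda1 + (1 - omega(t)) * lambda2

t_grid <- seq(0, 10, by = 0.1)
h_grid <- h_pop(t_grid)

# Starts at the average of the two hazards and declines toward the smaller one.
h_pop(c(0, 1, 5, 10))
```

Although each group has a constant hazard, the population hazard declines over time because the high-risk group is depleted first.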

Frailty Models⁷

Frailty models in survival analysis correspond to hierarchical models in linear or generalized linear models. They are also called mixed effects models. They contain an extra random component designed to account for individual(or subgroup)-level differences in the hazard otherwise unaccounted for by the model. The frailty, $\alpha$, is a multiplicative effect on the hazard assumed to follow some distribution. The hazard function conditional on the frailty can be expressed as $h(t|\alpha)=\alpha [h(t)]$.

Simple Frailty Model

Vaupel et al. (1979) described an individual frailty model, \[h(t;x,Z)=h_0(t)Z e^{\beta x}, \;\; t>0,\] where $Z$ is assumed to be drawn independently for each individual. Hazard rates for “random survivors” are not proportional, but converging (to each other) if the frailty distribution has finite variance. Thus, the problem may be less pronounced in AFT than in PH regression.
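The convergence of population hazards can be illustrated with a gamma frailty, which is our choice for this sketch because it gives closed forms. Assuming $Z$ is gamma distributed with mean 1 and variance $\theta$, and assuming a constant baseline hazard $\lambda$, the population survivor function is $(1+\theta e^{\beta x}\lambda t)^{-1/\theta}$. All parameter values are hypothetical.

```r
theta  <- 0.5     # frailty variance (assumed)
lambda <- 0.1     # constant baseline hazard (assumed)
beta   <- log(2)  # conditional hazard ratio of 2 for x = 1 versus x = 0

# Closed-form population survival and hazard under gamma frailty.
S_pop <- function(t, x) (1 + theta * exp(beta * x) * lambda * t)^(-1 / theta)
h_pop <- function(t, x) {
  exp(beta * x) * lambda / (1 + theta * exp(beta * x) * lambda * t)
}

# Numerical check: average the conditional survival exp(-Z * H(t)) over Z.
S_num <- function(t, x) {
  integrate(function(z) exp(-z * exp(beta * x) * lambda * t) *
              dgamma(z, shape = 1 / theta, rate = 1 / theta),
            lower = 0, upper = Inf)$value
}
c(closed = S_pop(5, 1), numeric = S_num(5, 1))

# Population hazard ratio: starts at exp(beta) and shrinks toward 1.
hr_pop <- function(t) h_pop(t, 1) / h_pop(t, 0)
hr_pop(c(0, 10, 100, 1000))
```

The conditional hazard ratio is 2 at every time point, but the ratio observed among survivors drifts toward 1, which is the non-proportionality described above.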

Shared Frailty Model

Frailty models work best when there is a natural grouping of the data, so that observations from the same group are dependent (share the same frailty), while two individual survival times from different groups can be regarded as independent. Such a model may be described as \[h_i(t;x)=h_{i0}(t) e^{\beta x}, \;\;i=1,\dots, s; \;\;t>0,\]

which simply is a stratified Cox regression model. By assuming \[h_{i0}(t)=Z_i h_0(t),\;\; i=1,\dots,s;\;\;t>0,\] the traditional multivariate frailty model emerges. Here it is assumed that $Z_1,\dots,Z_s$ are independent and identically distributed ($iid$), usually with a lognormal distribution. From this we get, with $U_i = \log(Z_i)$, \[h_i(t;x)=h_0(t) e^{\beta x +U_i}, \;\;i=1,\dots,s;\;\;t>0.\] In this formulation, $U_1,\dots, U_s$ are $iid$ normal with mean zero and unknown variance $\sigma^2$. Another popular choice of distribution for the $Z_i$ is the gamma distribution.

Cox model with and without frailty

The frailty() term in the survival package offers three choices for the distribution of the frailty: the gamma, Gaussian, and $t$ distributions. The variance (theta) of the frailty component is a parameter typically estimated by the model. If theta = 0, then there is no frailty.

First, we rerun a stratified Cox model without frailty. The stratified variable is CLINIC while PRISON and DOSE are predictor variables. A stratified Cox model is appropriate if the PH assumption is violated for CLINIC and met for PRISON and DOSE and our interest is in estimating a hazard ratio for PRISON or DOSE.

Y=Surv(addicts$survt,addicts$status==1)
coxph(Y~ prison + dose + strata(clinic),
      data=addicts)

## Call:
## coxph(formula = Y ~ prison + dose + strata(clinic), data = addicts)
## 
##             coef exp(coef)  se(coef)      z        p
## prison  0.389605  1.476397  0.168930  2.306   0.0211
## dose   -0.035115  0.965495  0.006465 -5.432 5.59e-08
## 
## Likelihood ratio test=33.91  on 2 df, p=4.322e-08
## n= 238, number of events= 150

The estimated hazard ratio for PRISON=1 versus PRISON=0 is $\exp(0.3896) = 1.476$. Next we illustrate how to include a frailty component in this model.

coxph(Y~ prison + dose + strata(clinic)
      + frailty(id, distribution="gamma"), 
      data=addicts)

## Warning in coxpenal.fit(X, Y, istrat, offset, init = init, control, weights =
## weights, : Inner loop failed to coverge for iterations 2

## Call:
## coxph(formula = Y ~ prison + dose + strata(clinic) + frailty(id, 
##     distribution = "gamma"), data = addicts)
## 
##                               coef se(coef)      se2    Chisq   DF       p
## prison                     0.39003  0.16916  0.16893  5.31590 1.00   0.021
## dose                      -0.03517  0.00647  0.00647 29.50946 1.00 5.6e-08
## frailty(id, distribution                              0.34134 0.32   0.314
## 
## Iterations: 5 outer, 41 Newton-Raphson
##      Variance of random effect= 0.00227   I-likelihood = -597.5 
## Degrees of freedom for terms= 1.0 1.0 0.3 
## Likelihood ratio test=34.6  on 2.32 df, p=5e-08
## n= 238, number of events= 150

The term + frailty(id, distribution="gamma") is included in the model formula. The first argument of the frailty function is the variable id and indicates that the unmeasured heterogeneity (the frailty) is at the individual level. The second argument indicates that the distribution of the random component is the gamma distribution.

Under the table of parameter estimates, the output indicates that the variance of the random effect is 0.00227. The p-value for the frailty component, 0.314, is given in the third row and rightmost column of the table and indicates that the frailty component is not significant. We find no evidence that the variance of the random component differs from zero in this model (i.e., no evidence of frailty). The parameter estimates for PRISON and DOSE changed minimally in this model compared to the model previously run without the frailty.

Now, suppose the variable CLINIC was unmeasured. Next we consider a Cox model (without frailty) that does not contain CLINIC.

coxph(Y~ prison + dose, data=addicts)

## Call:
## coxph(formula = Y ~ prison + dose, data = addicts)
## 
##            coef exp(coef) se(coef)      z        p
## prison  0.18965   1.20883  0.16427  1.155    0.248
## dose   -0.03608   0.96457  0.00600 -6.013 1.83e-09
## 
## Likelihood ratio test=38.21  on 2 df, p=5.045e-09
## n= 238, number of events= 150

The estimated hazard ratio for PRISON=1 versus PRISON=0 is $\exp(0.1897) = 1.209$ as compared to $\exp(0.3896) = 1.476$ that was observed in the model that contained CLINIC as a stratified variable. In previous sections CLINIC was shown to be an important predictor that violates the proportional hazards assumption. If CLINIC was unaccounted for (as in the model above), there may be a source of unobserved heterogeneity that a frailty component might address.

The next model omits CLINIC but includes a frailty component and the predictors PRISON and DOSE. We also use the summary function to get exponentiated estimates.

summary(coxph(Y~ prison + dose 
      + frailty(id, distribution="gamma"),
      data=addicts))

## Call:
## coxph(formula = Y ~ prison + dose + frailty(id, distribution = "gamma"), 
##     data = addicts)
## 
##   n= 238, number of events= 150 
## 
##                           coef     se(coef) se2     Chisq DF    p      
## prison                     0.41441 0.221604 0.17590   3.5  1.00 6.1e-02
## dose                      -0.05166 0.008448 0.00699  37.4  1.00 9.6e-10
## frailty(id, distribution                            100.5 69.34 8.6e-03
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## prison    1.5135     0.6607    0.9803    2.3367
## dose      0.9496     1.0530    0.9341    0.9655
## 
## Iterations: 6 outer, 44 Newton-Raphson
##      Variance of random effect= 0.6495364   I-likelihood = -685.4 
## Degrees of freedom for terms=  0.6  0.7 69.3 
## Concordance= 0.854  (se = 0.015 )
## Likelihood ratio test= 190.4  on 70.65 df,   p=6e-13

The variance of the frailty component is estimated at 0.65 compared to 0.00227 for the model that we showed previously that contained CLINIC as the stratified variable. The p-value for the frailty is highly significant at 8.6e–3 = 0.0086. The hazard ratio for the effect of PRISON is $\exp(0.4144) = 1.51$. The summary function can be applied to the coxph function to get R to exponentiate the parameter estimates (with 95% CI) when a frailty component is included in a Cox model.

It is interesting that the estimated hazard ratio for PRISON (1.51) obtained in this model (without CLINIC but with the frailty component) is closer to the corresponding hazard ratio obtained from the model that included CLINIC (1.476) compared to the one that did not include CLINIC (1.209). In this example, the frailty component might be accounting to some extent for the fact that CLINIC was omitted from the model.

Stratified model

A simple way to eliminate the effect of clustering is to stratify on the clusters.

The drawback with a stratified analysis is that it is not possible to estimate the effect of covariates that are constant within clusters.
Notice also that the hazard functions for groups (e.g., males and females) differ only insofar as they have different baseline hazard functions, namely, $h_{01}(t)$ for females and $h_{02}(t)$ for males. However, the coefficients $\beta_i$s are the same for both female and male models.

Generalized stratified models

\[h_g(t,X)=h_{0g}(t) \exp[\beta_1X_1+\beta_2X_2+ \cdots+\beta_p X_p]\]

$g=1,2,\dots,k^*,$ strata defined from $Z^*$, which has $k^*$ categories
$Z^*$ is not included in the model
$X_i$s are included in the model
Hazard ratio is same for each stratum

Cox PH models with or without stratification⁸

If the proportional hazards assumption is violated for the variable CLINIC but met for PRISON and DOSE, a stratified Cox model can be fitted with CLINIC as the stratification variable. The coxph function accepts a strata() term in the model formula. First we define the response variable $Y$ with the Surv function, and then the coxph function is used to run a stratified Cox model (code and output shown below):

Y=Surv(addicts$survt,addicts$status==1)
coxph(Y~ prison + dose + strata(clinic),data=addicts)

## Call:
## coxph(formula = Y ~ prison + dose + strata(clinic), data = addicts)
## 
##             coef exp(coef)  se(coef)      z        p
## prison  0.389605  1.476397  0.168930  2.306   0.0211
## dose   -0.035115  0.965495  0.006465 -5.432 5.59e-08
## 
## Likelihood ratio test=33.91  on 2 df, p=4.322e-08
## n= 238, number of events= 150

Interaction terms for CLINIC can be included directly in the model formula by including product terms using the : operator (clinic:prison and clinic:dose) (code and output follow):

coxph(Y~ prison + dose + clinic:prison + clinic:dose +strata(clinic),
data=addicts)

## Call:
## coxph(formula = Y ~ prison + dose + clinic:prison + clinic:dose + 
##     strata(clinic), data = addicts)
## 
##                    coef exp(coef)  se(coef)      z      p
## prison         1.085836  2.961914  0.538636  2.016 0.0438
## dose          -0.034635  0.965958  0.019797 -1.750 0.0802
## prison:clinic -0.582989  0.558227  0.428135 -1.362 0.1733
## dose:clinic   -0.001164  0.998837  0.014570 -0.080 0.9363
## 
## Likelihood ratio test=35.77  on 4 df, p=3.222e-07
## n= 238, number of events= 150

\[HR=\frac{h_0(t) \exp[1\beta_1+\beta_2 DOSE + (2)(1)\beta_3 + \beta_4 clinic \times DOSE]}{h_0(t) \exp[(0)\beta_1+\beta_2 DOSE + (2)(0)\beta_3 + \beta_4 clinic \times DOSE]}=\exp(\beta_1 + 2 \beta_3)\]

The resulting hazard ratio, $\exp(\beta_1 + 2 \beta_3)$, is an exponentiated linear combination of parameters. Unfortunately, base R does not have an equivalent of the lincom command that Stata provides or the estimate statement that SAS provides for calculating a linear combination of parameter estimates. However, an approach that can be used in any statistical software package for such a situation is to recode the variable(s) of interest so that the desired estimate is no longer a linear combination of parameter estimates.

In this example, we are interested in the hazard ratio for PRISON=1 versus PRISON=0 when CLINIC=2. We can define a new variable CLINIC2 = CLINIC - 2, so that CLINIC2=0 when CLINIC=2.

addicts$clinic2=addicts$clinic-2
summary(coxph(Y~ prison + dose + clinic2:prison + clinic2:dose 
+ strata(clinic2), data=addicts))

## Call:
## coxph(formula = Y ~ prison + dose + clinic2:prison + clinic2:dose + 
##     strata(clinic2), data = addicts)
## 
##   n= 238, number of events= 150 
## 
##                     coef exp(coef)  se(coef)      z Pr(>|z|)   
## prison         -0.080143  0.922985  0.384305 -0.209  0.83481   
## dose           -0.036964  0.963711  0.012346 -2.994  0.00275 **
## prison:clinic2 -0.582989  0.558227  0.428135 -1.362  0.17329   
## dose:clinic2   -0.001164  0.998837  0.014570 -0.080  0.93632   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## prison            0.9230      1.083    0.4346    1.9603
## dose              0.9637      1.038    0.9407    0.9873
## prison:clinic2    0.5582      1.791    0.2412    1.2919
## dose:clinic2      0.9988      1.001    0.9707    1.0278
## 
## Concordance= 0.649  (se = 0.026 )
## Likelihood ratio test= 35.77  on 4 df,   p=3e-07
## Wald test            = 34.09  on 4 df,   p=7e-07
## Score (logrank) test = 34.97  on 4 df,   p=5e-07

The estimate for $\exp(\beta_1)$ can be found in the second table: $\exp(coef)$ for prison = 0.9230. The lower and upper confidence limits are 0.4346 and 1.9603, respectively. If we had not recoded the variable CLINIC, the problem would have been more complicated, in that we would have had to use the variance-covariance matrix (which can be obtained with the vcov function) to calculate a 95% confidence interval for this hazard ratio.
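For comparison, here is a minimal sketch of the variance-covariance route next to the recoding route. It uses the `lung` data from the survival package rather than the Addicts data, and the model (age, sex, and their interaction, evaluated at age 60) is a hypothetical illustration.

```r
library(survival)

# Interaction model on the built-in lung data (illustration only).
fit <- coxph(Surv(time, status) ~ age + sex + age:sex, data = lung)

# Log hazard ratio for sex = 2 versus sex = 1 at age 60:
# beta_sex + 60 * beta_age:sex, written as weights on the coefficients.
w <- setNames(rep(0, length(coef(fit))), names(coef(fit)))
w["sex"]     <- 1
w["age:sex"] <- 60

est <- sum(w * coef(fit))
se  <- sqrt(drop(t(w) %*% vcov(fit) %*% w))
exp(c(HR = est,
      lower = est - qnorm(0.975) * se,
      upper = est + qnorm(0.975) * se))

# Recoding approach: centre age at 60 so that the coefficient of sex is
# directly the log hazard ratio at age 60.
lung60 <- transform(lung, age60 = age - 60)
fit60  <- coxph(Surv(time, status) ~ age60 + sex + age60:sex, data = lung60)
est60  <- coef(fit60)["sex"]
se60   <- sqrt(vcov(fit60)["sex", "sex"])
exp(c(HR = unname(est60),
      lower = unname(est60 - qnorm(0.975) * se60),
      upper = unname(est60 + qnorm(0.975) * se60)))
```

Both routes should give the same hazard ratio and confidence interval up to numerical precision; the recoding route simply moves the reference point so that no linear combination is needed.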

Weibull PH models with or without stratification

Weib.PH_st <- phreg(Y ~ clinic + dose,
                 data=addicts, dist="weibull", param="survreg")
b_Weib.PH_st = coef(Weib.PH_st)
Weib.PH_st

## Call:
## phreg(formula = Y ~ clinic + dose, data = addicts, dist = "weibull", 
##     param = "survreg")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## clinic              1.378    -0.925     0.397     0.211     0.000 
## dose               64.317    -0.033     0.968     0.006     0.000 
## 
## log(scale)                    4.066               0.326     0.000 
## log(shape)                    0.304               0.068     0.000 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1086.3 
## LR test statistic         57.34 
## Degrees of freedom        2 
## Overall p-value           3.54494e-13

Weib.PH_st <- phreg(Y ~ clinic + dose + strata(prison),
                 data=addicts, dist="weibull", param="survreg")
b_Weib.PH_st = coef(Weib.PH_st)
Weib.PH_st

## Call:
## phreg(formula = Y ~ clinic + dose + strata(prison), data = addicts, 
##     dist = "weibull", param = "survreg")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## clinic              1.378    -0.953     0.385     0.213     0.000 
## dose               64.317    -0.034     0.967     0.006     0.000 
## 
## log(scale):1                  4.256               0.332     0.000 
## log(shape):1                  0.388               0.093     0.000 
## log(scale):2                  3.700               0.408     0.000 
## log(shape):2                  0.241               0.098     0.013 
## 
## Events                    150 
## Total time at risk         95812 
## Max. log. likelihood      -1083.9 
## LR test statistic         59.36 
## Degrees of freedom        2 
## Overall p-value           1.28675e-13

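Compared with the unstratified fit, the stratified model estimates a separate Weibull scale and shape for each prison stratum. The coefficients of clinic and dose change only slightly (for clinic, from -0.925 to -0.953), and the maximized log-likelihood rises from -1086.3 to -1083.9. The fitted baseline functions of the stratified model can be plotted: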

plot(Weib.PH_st)
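The unstratified model is the special case of the stratified model in which both prison strata share one scale and one shape, so the two maximized log-likelihoods can be compared with a likelihood ratio test. The sketch below is only an illustration built from the rounded values printed above (-1086.3 and -1083.9), so the result is approximate; the variable names are placeholders, not objects created earlier.

```r
# Maximized log-likelihoods copied (rounded) from the two phreg outputs above
loglik_unstratified <- -1086.3   # one Weibull baseline for everyone
loglik_stratified   <- -1083.9   # separate scale and shape per prison stratum

# The stratified model has two extra baseline parameters (one scale, one shape)
lr_stat <- 2 * (loglik_stratified - loglik_unstratified)
p_value <- pchisq(lr_stat, df = 2, lower.tail = FALSE)
round(c(LR = lr_stat, p = p_value), 3)

# Shape parameters on the natural scale: exp(log(shape))
round(exp(c(unstratified = 0.304, stratum1 = 0.388, stratum2 = 0.241)), 2)
```

With these rounded inputs the statistic is about 4.8 on 2 degrees of freedom (p of roughly 0.09), so the stratum-specific baselines improve the fit only modestly.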

Time varying covariates⁹

Example: Stanford heart transplant study

Initial analysis (Clark et al, 1971. Annals of Internal Medicine: https://doi.org/10.7326/0003-4819-75-1-15)

library(survival)
ht00 <- coxph(Surv(futime,fustat)~ transplant + age + surgery, 
              data=jasa)
summary(ht00)

## Call:
## coxph(formula = Surv(futime, fustat) ~ transplant + age + surgery, 
##     data = jasa)
## 
##   n= 103, number of events= 75 
## 
##                coef exp(coef) se(coef)      z Pr(>|z|)    
## transplant -1.71711   0.17958  0.27853 -6.165 7.05e-10 ***
## age         0.05889   1.06065  0.01505  3.913 9.12e-05 ***
## surgery    -0.41902   0.65769  0.37118 -1.129    0.259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## transplant    0.1796     5.5684    0.1040     0.310
## age           1.0607     0.9428    1.0298     1.092
## surgery       0.6577     1.5205    0.3177     1.361
## 
## Concordance= 0.732  (se = 0.031 )
## Likelihood ratio test= 45.85  on 3 df,   p=6e-10
## Wald test            = 47.15  on 3 df,   p=3e-10
## Score (logrank) test = 52.63  on 3 df,   p=2e-11

coef_ht00 <- coef(summary(ht00))
exp.ci_ht00 <- summary(ht00)$conf.int

The key covariate is “transplant”, which takes the value 1 for patients who received a heart transplant and 0 for those who did not.
The estimated transplant coefficient was $-1.7171129$; $\exp(-1.7171129)=0.1795839$; $95\% \; c.i.=(0.1040357, 0.3099933)$; $p=7.0520176\times 10^{-10}$, which taken at face value suggests that transplants are extremely effective.
However, the problem here is that receipt of a transplant is a time-dependent covariate: patients who received a transplant had to live long enough to receive it. Essentially, the above analysis only shows that patients who live longer (i.e., long enough to receive a transplant) have longer lives than patients who don’t live as long, which is of course a tautology.

Revised analysis: a “landmark” approach

A simple fix is to define a “landmark” time to divide patients into two groups. In this approach, patients who received the intervention prior to the landmark go into the intervention group and those who did not are placed in the comparison group.
Key requirements of this approach are that
- only patients who survive up to the landmark are included in the study, and
- all patients (in particular, those in the comparison group) remain in their originally assigned group regardless of what happens in the future, i.e., after the landmark.
We first select those patients who lived at least 30 days (79 of the 103 patients lived this long).
- Of these 79 patients, 33 had a transplant within 30 days, and 46 did not.
- Of these 46, 30 subsequently had a heart transplant, but we still count them in the “no transplant within 30 days” group.
- In this way we have created a variable, transplant30, which has a fixed value (that is, it does not change over time) for all patients in our set of 30-day survivors.

ind30 <- jasa$futime >= 30
transplant30 <- (jasa$transplant == 1) & (jasa$wait.time < 30)
ht01 <- coxph(Surv(futime,fustat)~ transplant30 + age + surgery, 
              data=jasa, subset=ind30)
summary(ht01)

## Call:
## coxph(formula = Surv(futime, fustat) ~ transplant30 + age + surgery, 
##     data = jasa, subset = ind30)
## 
##   n= 79, number of events= 52 
## 
##                      coef exp(coef) se(coef)      z Pr(>|z|)  
## transplant30TRUE -0.04214   0.95874  0.28377 -0.148   0.8820  
## age               0.03720   1.03790  0.01714  2.170   0.0300 *
## surgery          -0.81966   0.44058  0.41297 -1.985   0.0472 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## transplant30TRUE    0.9587     1.0430    0.5497    1.6720
## age                 1.0379     0.9635    1.0036    1.0734
## surgery             0.4406     2.2697    0.1961    0.9898
## 
## Concordance= 0.618  (se = 0.044 )
## Likelihood ratio test= 9.5  on 3 df,   p=0.02
## Wald test            = 8.61  on 3 df,   p=0.03
## Score (logrank) test = 8.94  on 3 df,   p=0.03

coef_ht01 <- coef(summary(ht01))
exp.ci_ht01 <- summary(ht01)$conf.int

The coefficient for transplant30 (a true/false indicator for transplant within the first 30 days) was $-0.0421374$; $\exp(-0.0421374)=0.958738$; $95\% \; c.i.=(0.5497392, 1.6720267)$; $p=0.8819539$. It is no longer statistically significant, so among 30-day survivors there is no evidence of a difference in survival between those who received a transplant within 30 days and those who did not.
Limitations of this approach
- There is no guidance as to when to set the landmark (e.g., why 30 not 45 days?)
- We discard almost a quarter of the patients from the analysis
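To see how much the first limitation matters, the landmark fit can be repeated for several candidate landmarks. The following is a minimal sketch, assuming the jasa data from the survival package as used above; the helper landmark_fit and the landmark days 15, 45, and 60 are illustrative choices, not part of the original analysis.

```r
library(survival)  # provides coxph(), Surv(), and the jasa data

# Hypothetical helper: refit the landmark model for a given landmark day
landmark_fit <- function(landmark_day) {
  d <- jasa[jasa$futime >= landmark_day, ]   # keep landmark survivors only
  # 1 if transplanted before the landmark; wait.time is NA without a transplant
  d$tx_by_landmark <- as.integer(d$transplant == 1 & !is.na(d$wait.time) &
                                   d$wait.time < landmark_day)
  fit <- coxph(Surv(futime, fustat) ~ tx_by_landmark + age + surgery, data = d)
  est <- summary(fit)$coefficients["tx_by_landmark", ]
  c(landmark = landmark_day, n = nrow(d),
    coef = unname(est["coef"]), p = unname(est["Pr(>|z|)"]))
}

landmark_results <- t(sapply(c(15, 30, 45, 60), landmark_fit))
round(landmark_results, 3)
```

The 30-day row should reproduce the fit above (n = 79, coefficient about -0.042); the other rows show how the sample size and the estimate move with the choice of landmark.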

Revised analysis: a time-dependent variable

Example

To clarify this approach, let’s use a small set of data

id <- 1:nrow(jasa)
jasaT <- data.frame(id, jasa)
id.simple <- c(2,5,10,12,28,95)
heart.simple <- jasaT[id.simple, c("id", "wait.time", "futime", "fustat", "transplant")]
heart.simple

##    id wait.time futime fustat transplant
## 2   2        NA      5      1          0
## 5   5        NA     17      1          0
## 10 10        11     57      1          1
## 12 12        NA      7      1          0
## 28 28        70     71      1          1
## 95 95         1     15      1          1

In this simple dataset, all of the patients died within the follow-up time (fustat=1 for all patients). We may model the data incorrectly (ignoring the fact that “transplant” is time dependent) as follows:

# Ignore that transplant is time dependent
ht02 <- coxph(Surv(futime,fustat)~ transplant, data=heart.simple)
summary(ht02)

## Call:
## coxph(formula = Surv(futime, fustat) ~ transplant, data = heart.simple)
## 
##   n= 6, number of events= 6 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)
## transplant -1.6878    0.1849   1.1718 -1.44     0.15
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## transplant    0.1849      5.408    0.0186     1.838
## 
## Concordance= 0.733  (se = 0.077 )
## Likelihood ratio test= 2.47  on 1 df,   p=0.1
## Wald test            = 2.07  on 1 df,   p=0.1
## Score (logrank) test = 2.56  on 1 df,   p=0.1

coef_ht02 <- coef(summary(ht02))
exp.ci_ht02 <- summary(ht02)$conf.int

Sample of six patients from the Stanford heart transplant dataset

The “incorrect” estimate of the transplant coefficient was $-1.6878463$; $\exp(-1.6878463)=0.1849174$; $95\% \; c.i.=(0.0186016, 1.8382481)$; $p=0.149753$. With only six patients the estimate is not statistically significant, but its direction and size again suggest a large benefit of transplant.
To do this correctly, we need to modify the partial likelihood function to accommodate these types of variables. Essentially, at each failure time, there are a certain number of patients at risk, and one fails. However, the contribution of each subject can change from one failure time to the next. The hazard function is given by $h(t)=h_0 (t)e^{z_k(t_i)\beta}$, where the covariate $z_k(t_i)$ is the value of the time-varying covariate for the $k^{th}$ subject at time $t_i$.
The partial likelihood is \[L(\beta)=\prod_{i=1}^{D} \frac{\psi_{ii}}{\sum\limits_{k \in R_i} \psi_{ki}}, \quad \text{where } \psi_{ki}=e^{z_k(t_i)\beta}, \] $D$ is the number of failure times, and $R_i$ is the set of subjects at risk at $t_i$.
When the covariates are fixed at time 0, $z_k(t_i)=z_k$ for all failure times $t_i$, and the denominator at each failure time can be computed from the previous one by deleting the value of $\psi$ for the subject (or subjects) that failed.
With a time dependent covariate, by contrast, the entire denominator has to be recalculated at each failure time, since the values of the covariates for each subject may change from one failure time to the next.
- For example, patient #2 is the first to fail, at $t=5$. At this time, all six patients are at risk, but only one, patient #95, has had a transplant. So the denominator for the first factor is $5+e^{\beta}$, and the numerator is 1, since it was a non-transplant patient who died.
- Patient #12 is the next to die, at time $t=7$, and none of the five patients in the risk set have changed their covariate value, so the second factor is $1/(4+e^{\beta})$.
- But when the third death occurs (patient #95, at $t=15$), one of the other patients (#10) has switched from being a non-transplant patient to one who has had a transplant. There are now four patients at risk, of which two (#10 and #95) are transplant patients. The denominator is thus $2+2e^{\beta}$ and the numerator is $e^{\beta}$, since it was a transplant patient who died.
- Therefore, the full partial likelihood in this example is \[L(\beta)=\frac{1}{5+e^{\beta}}\cdot\frac{1}{4+e^{\beta}}\cdot\frac{e^{\beta}}{2+2e^{\beta}}\cdot\frac{1}{2+e^{\beta}}\cdot\frac{e^{\beta}}{1+e^{\beta}}\cdot\frac{e^{\beta}}{e^{\beta}} \]
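As a check on the derivation, the six-factor partial likelihood above can be coded directly and maximized numerically. This is a minimal sketch in base R; the function name log_partial_lik and the search interval are illustrative choices.

```r
# Log partial likelihood for the six-patient example, written factor by factor;
# beta is the transplant coefficient
log_partial_lik <- function(beta) {
  eb <- exp(beta)
  log(1  / (5 + eb)) +       # t = 5:  patient 2 dies, not transplanted
  log(1  / (4 + eb)) +       # t = 7:  patient 12 dies, not transplanted
  log(eb / (2 + 2 * eb)) +   # t = 15: patient 95 dies, transplanted
  log(1  / (2 + eb)) +       # t = 17: patient 5 dies, not transplanted
  log(eb / (1 + eb)) +       # t = 57: patient 10 dies, transplanted
  log(eb / eb)               # t = 71: patient 28 dies, the only one at risk
}

# Maximize over beta; the log partial likelihood is concave, so one peak exists
beta_hat <- optimize(log_partial_lik, interval = c(-5, 5),
                     maximum = TRUE, tol = 1e-8)$maximum
round(c(beta = beta_hat, hazard_ratio = exp(beta_hat)), 4)
```

The maximizer should be close to 0.2846 (hazard ratio about 1.33), the value that coxph reports for the start-stop version of these data.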
Essentially, this approach divides the time data for patients who had a heart transplant into two time periods, one before the transplant and one after.
- For example, patient #10 was a non-transplant patient from entry until day 11. Since that patient received a transplant at that time, the future for that patient, had he or she not received a transplant, is unknown. Thus, we censor that portion of the patient’s life experience at $t=11$.
- Following the transplant, we start a new record for patient #10. This second piece of the record is left-truncated (i.e., patient’s survival experience with the transplant starts at that point) at time $t=11$, and a death is recorded at time $t=57$.
- For the first part of this patient’s experience, the ‘start’ time is 0, and the ‘stop’ time is 11, which is recorded as a censored observation. For the second piece of that patient’s experience, the start time is 11 and the stop time is 57.
- Thus, to put the data in start-stop format, the record of every patient with no transplant is carried forward as is, whereas the record of each patient who received a transplant is split into pre-transplant and post-transplant records.
- The tmerge function in the survival package simplifies this conversion.

# Accounting for time-dependent covariates
sdata <- tmerge(heart.simple, heart.simple, id=id, death=event(futime,fustat), 
                transpl=tdc(wait.time))
sdata

##   id wait.time futime fustat transplant tstart tstop death transpl
## 1  2        NA      5      1          0      0     5     1       0
## 2  5        NA     17      1          0      0    17     1       0
## 3 10        11     57      1          1      0    11     0       0
## 4 10        11     57      1          1     11    57     1       1
## 5 12        NA      7      1          0      0     7     1       0
## 6 28        70     71      1          1      0    70     0       0
## 7 28        70     71      1          1     70    71     1       1
## 8 95         1     15      1          1      0     1     0       0
## 9 95         1     15      1          1      1    15     1       1

heart.simple.counting <- sdata[,-(2:5)] # drop columns 2 through 5
heart.simple.counting

##   id tstart tstop death transpl
## 1  2      0     5     1       0
## 2  5      0    17     1       0
## 3 10      0    11     0       0
## 4 10     11    57     1       1
## 5 12      0     7     1       0
## 6 28      0    70     0       0
## 7 28     70    71     1       1
## 8 95      0     1     0       0
## 9 95      1    15     1       1

Start-stop counting process

Using this start-stop dataset, we fit a Cox model.

ht03 <- coxph(Surv(tstart,tstop, death)~ transpl, 
              data=heart.simple.counting)
summary(ht03)

## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ transpl, data = heart.simple.counting)
## 
##   n= 9, number of events= 6 
## 
##           coef exp(coef) se(coef)     z Pr(>|z|)
## transpl 0.2846    1.3292   0.9609 0.296    0.767
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## transpl     1.329     0.7523    0.2021      8.74
## 
## Concordance= 0.5  (se = 0.082 )
## Likelihood ratio test= 0.09  on 1 df,   p=0.8
## Wald test            = 0.09  on 1 df,   p=0.8
## Score (logrank) test = 0.09  on 1 df,   p=0.8

coef_ht03 <- coef(summary(ht03))
exp.ci_ht03 <- summary(ht03)$conf.int

The “corrected” estimate of the transplant coefficient was $0.2845586$; $\exp(0.2845586)=1.3291752$; $95\% \; c.i.=(0.2021337, 8.7402878)$; $p=0.7671316$, providing no evidence in this six-patient sample that transplant affects survival.

Now, we apply this approach accounting for time dependent covariates to the full dataset.¹⁰

jasa$subject <- 1:nrow(jasa) #we need an identifier variable
tdata <- with(jasa, data.frame(subject = subject,
                               futime= pmax(.5, fu.date - accept.dt),
                               txtime= ifelse(tx.date== fu.date,
                                              (tx.date -accept.dt) -.5,
                                              (tx.date - accept.dt)),
                               fustat = fustat
))
xdata <- tmerge(jasa, tdata, id=subject,
                death = event(futime, fustat),
                transplant = tdc(txtime),
                options= list(idname="subject"))

## Warning in tmerge(jasa, tdata, id = subject, death = event(futime, fustat), :
## replacement of variable 'transplant'

ht04 <- coxph(Surv(tstart, tstop, death) ~ transplant + age + surgery,
              data= xdata, ties="breslow")
summary(ht04)

## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ transplant + age + 
##     surgery, data = xdata, ties = "breslow")
## 
##   n= 170, number of events= 75 
## 
##                coef exp(coef) se(coef)      z Pr(>|z|)  
## transplant  0.01238   1.01246  0.30815  0.040   0.9680  
## age         0.03055   1.03102  0.01390  2.198   0.0279 *
## surgery    -0.77155   0.46230  0.35967 -2.145   0.0319 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## transplant    1.0125     0.9877    0.5534    1.8522
## age           1.0310     0.9699    1.0033    1.0595
## surgery       0.4623     2.1631    0.2284    0.9356
## 
## Concordance= 0.6  (se = 0.036 )
## Likelihood ratio test= 10.68  on 3 df,   p=0.01
## Wald test            = 9.65  on 3 df,   p=0.02
## Score (logrank) test = 9.97  on 3 df,   p=0.02

coef_ht04 <- coef(summary(ht04))
exp.ci_ht04 <- summary(ht04)$conf.int

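The “corrected” estimate of the transplant coefficient was $0.0123801$; $\exp(0.0123801)=1.0124571$; $95\% \; c.i.=(0.5534473, 1.8521533)$; $p=0.9679534$, providing no evidence that transplant affects survival once its timing is taken into account.
The table below places the naive fit (ht00, first five columns) beside the time-dependent fit (ht04, last five columns):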

SHS <- cbind(coef_ht00, coef_ht04)
SHS

##                  coef exp(coef)   se(coef)         z     Pr(>|z|)        coef
## transplant -1.7171129 0.1795839 0.27852978 -6.164917 7.052018e-10  0.01238011
## age         0.0588863 1.0606546 0.01504927  3.912902 9.119349e-05  0.03054984
## surgery    -0.4190159 0.6576938 0.37117505 -1.128890 2.589442e-01 -0.77154747
##            exp(coef)   se(coef)          z   Pr(>|z|)
## transplant 1.0124571 0.30815300  0.0401752 0.96795345
## age        1.0310213 0.01389822  2.1981113 0.02794117
## surgery    0.4622971 0.35967149 -2.1451449 0.03194126

Example: Addicts dataset¹¹

In contrast to Stata, SAS, and SPSS, in order to run an extended Cox model in R, the analytic dataset must be in the counting process (start, stop) format. Unfortunately, the addicts dataset is not in that format, so it needs to be altered in order to include a time-varying covariate. This can be accomplished with the survSplit function. The survSplit function can create a dataset that provides multiple observations for the same subject allowing a subject’s covariate to change values from observation to observation. The user supplies the time cutpoint(s).

The most general choice for time cutpoints that can accommodate the modeling of any time-varying covariate is a vector of time cutpoints that includes all event times in the data. The variable SURVT in the addicts dataset contains each individual’s time to event or time to censoring. The following code creates a new analytic dataset (called addicts.cp) which puts the addicts data in the counting process format using the survSplit function:

#addicts <- read.csv(file="addicts.csv", header = TRUE)

addicts.cp=survSplit(addicts,cut=addicts$survt[addicts$status==1],
end="survt", event="status",start="start",id="iid")

The first argument of the survSplit function specifies the data frame (addicts) to be converted to the counting process format. The cut=addicts$survt[addicts$status==1] option specifies that the time cutpoints are the values of the SURVT variable for which the STATUS variable equals 1 (i.e., keeping the event times but omitting censoring times). The event="status" option specifies STATUS as the variable indicating whether the individual had an event or was censored. The start="start" option creates a new variable called START. This newly defined variable for the starting time of each observation is necessary for the data to be in counting process (start, stop) format. The end="survt" option defines SURVT as the stop variable (i.e., the time-to-event variable). The id="iid" option creates a new variable called IID that identifies the individual to whom each observation belongs. The survSplit function creates multiple observations for individuals at risk at multiple time points. The dataset addicts.cp created above contains 18,708 observations built from the 238 observations in the addicts dataset (the code nrow(addicts.cp) returns the number of observations).
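The effect of survSplit is easier to see on a very small dataset. The sketch below uses a made-up three-subject data frame (toy, with invented values) and the same call style as above; it is an illustration only and does not use the addicts data.

```r
library(survival)  # provides survSplit()

# Made-up data: subjects 1 and 3 have events, subject 2 is censored
toy <- data.frame(survt  = c(5, 8, 12),
                  status = c(1, 0, 1),
                  dose   = c(40, 60, 80))

# Cut at every event time (5 and 12), as was done for addicts.cp
toy.cp <- survSplit(toy, cut = toy$survt[toy$status == 1],
                    end = "survt", event = "status",
                    start = "start", id = "iid")
toy.cp
nrow(toy.cp)
```

Subject 1 contributes one interval, (0, 5], while subjects 2 and 3 are each split at day 5 into two intervals, giving five rows in total; STATUS equals 1 only on the final interval of a subject who had an event.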

Suppose the PH assumption was violated for the variable DOSE and we were interested in defining a time-varying covariate as the product of DOSE and the natural log of time (SURVT). This variable can easily be defined if the dataset is in counting process form with time cutpoints at each event time as shown below:

addicts.cp$logtdose=addicts.cp$dose*log(addicts.cp$survt)

We now have a new variable in the dataset (called LOGTDOSE = DOSE × ln(T)) that varies over time. We print the dataset for one individual (iid=106) who had an event at time=35 days. Rather than print all the variables, we request a subset of them with the c function:

addicts.cp[addicts.cp$iid==106,
c("iid","start","survt","status","dose","logtdose")]

##      iid start survt status dose  logtdose
## 9802 106     0     7      0   40  77.83641
## 9803 106     7    13      0   40 102.59797
## 9804 106    13    17      0   40 113.32853
## 9805 106    17    19      0   40 117.77756
## 9806 106    19    26      0   40 130.32386
## 9807 106    26    29      0   40 134.69183
## 9808 106    29    30      0   40 136.04790
## 9809 106    30    33      0   40 139.86030
## 9810 106    33    35      1   40 142.21392
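The LOGTDOSE column in this listing can be reproduced by hand, which confirms that the variable is DOSE multiplied by the natural log of each interval's stop time. A short sketch with the stop times copied from the listing (the vector names are placeholders):

```r
# Stop times (SURVT) for subject iid = 106, copied from the listing above
stop_times <- c(7, 13, 17, 19, 26, 29, 30, 33, 35)
dose <- 40

# LOGTDOSE = DOSE * ln(stop time of the interval)
logtdose <- dose * log(stop_times)
round(logtdose, 5)
```

The first and last values, 40 × ln(7) and 40 × ln(35), correspond to 77.83641 and 142.21392 in the printed data.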

The variable LOGTDOSE is time dependent as its values increase with time as expected. The variable SURVT lists all the event times in the addicts dataset up to day 35 when this individual had an event. Notice STATUS=1 when the event occurred and STATUS=0 prior to the event. Next we run an extended Cox model including the predictors PRISON, DOSE, and CLINIC and the time-dependent variable LOGTDOSE:

coxph(Surv(addicts.cp$start,addicts.cp$survt,addicts.cp$status) ~
prison + dose + clinic + logtdose + cluster(iid), data=addicts.cp)

## Call:
## coxph(formula = Surv(addicts.cp$start, addicts.cp$survt, addicts.cp$status) ~ 
##     prison + dose + clinic + logtdose, data = addicts.cp, cluster = iid)
## 
##               coef exp(coef)  se(coef) robust se      z       p
## prison    0.340633  1.405837  0.167474  0.159717  2.133 0.03295
## dose     -0.082625  0.920696  0.035984  0.029601 -2.791 0.00525
## clinic   -1.019875  0.360640  0.215416  0.236365 -4.315 1.6e-05
## logtdose  0.008615  1.008652  0.006455  0.005248  1.642 0.10068
## 
## Likelihood ratio test=66.35  on 4 df, p=1.338e-13
## n= 18708, number of events= 150

The Surv function now takes three arguments: the start variable (called START), the stop variable (called SURVT), and the status variable (called STATUS). The term cluster(iid) in the model formula indicates that there are multiple observations (clusters) from the same subject and requests robust standard errors for the coefficient estimates. These robust standard errors are designed to account for the non-independence of observations from the same subject. The Wald test z statistic of 1.64 (p = 0.10) is not significant for LOGTDOSE, providing no evidence that the proportional hazards assumption is violated for DOSE.
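In this model the log hazard ratio for a one-unit increase in DOSE at time t is the DOSE coefficient plus the LOGTDOSE coefficient times ln(t). The sketch below evaluates that expression at a few arbitrary days, using the point estimates printed above; the function name hr_dose and the chosen days are illustrative.

```r
# Point estimates copied from the extended Cox model output above
b_dose     <- -0.082625   # coefficient of DOSE
b_logtdose <-  0.008615   # coefficient of LOGTDOSE = DOSE * ln(t)

# Hazard ratio per one-unit increase in DOSE at time t (in days)
hr_dose <- function(t) exp(b_dose + b_logtdose * log(t))

days <- c(30, 180, 365, 730)
round(setNames(hr_dose(days), paste0("day_", days)), 3)
```

The estimated hazard ratio drifts toward 1 as time increases (about 0.97 at one year), but because the LOGTDOSE coefficient is not significant, this drift is not distinguishable from a constant hazard ratio.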

Next we run an extended Cox model with heaviside functions for CLINIC defined about the time cutpoint of 365 days. We could use the dataset that we just created, addicts.cp, but since there is now only one cutpoint, we illustrate how to create a dataset in counting process format with only one cutpoint. The new dataset (called addicts.cp365) will have 360 observations compared to 18,708 in the dataset we previously had created called addicts.cp.

addicts.cp365=survSplit(addicts,cut=365,end="survt",
event="status",start="start",id="iid")

The cut=365 option in the survSplit function requests that day 365 be the only cutpoint. Next we create the two time dependent variables (HV1 and HV2). HV1 is defined to equal the value of CLINIC if survival time is less than 365 days and 0 otherwise. HV2 is defined to equal 0 if survival time is less than 365 days and equal the value of CLINIC otherwise:

addicts.cp365$hv1=addicts.cp365$clinic*(addicts.cp365$start<365)
addicts.cp365$hv2=addicts.cp365$clinic*(addicts.cp365$start>=365)
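The same construction can be tried on a few made-up rows to see how the logical comparison acts as a 0/1 switch. The values below are invented for illustration and are not taken from the addicts data.

```r
# Made-up rows: two clinics, each observed in both time periods
clinic <- c(1, 2, 1, 2)
start  <- c(0, 0, 365, 365)

# TRUE/FALSE is treated as 1/0 when multiplied by a number
hv1 <- clinic * (start < 365)    # CLINIC before day 365, otherwise 0
hv2 <- clinic * (start >= 365)   # CLINIC from day 365 on, otherwise 0
data.frame(clinic, start, hv1, hv2)
```

Exactly one of HV1 and HV2 is nonzero in each row, and their sum always equals CLINIC.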

The conditional statements (addicts.cp365$start<365) and (addicts.cp365$start>=365) take the value 1 if true and 0 if false, and are then multiplied by the variable CLINIC to define HV1 and HV2. Next we sort the dataset by the variables IID and START. This is not a necessary step, but it is easier to view and understand the data when multiple observations from the same subject are consecutive. The order function sorts the dataset:

addicts.cp365=addicts.cp365[order(addicts.cp365$iid,addicts.cp365$start), ]

Next we print the first 10 observations for selected variables:

addicts.cp365[1:10,c('iid','start','survt','status','clinic','hv1','hv2')]

##    iid start survt status clinic hv1 hv2
## 1    1     0   365      0      1   1   0
## 2    1   365   428      1      1   0   1
## 3    2     0   275      1      1   1   0
## 4    3     0   262      1      1   1   0
## 5    4     0   183      1      1   1   0
## 6    5     0   259      1      1   1   0
## 7    6     0   365      0      1   1   0
## 8    6   365   714      1      1   0   1
## 9    7     0   365      0      1   1   0
## 10   7   365   438      1      1   0   1

Notice that the IID variable is sorted in numerical order (1, 2, 3, ...); had it been stored as a character variable, it would instead sort in “alphabetical” order (1, 10, 100, ...). The first subject (IID=1) had an event at 428 days, so was censored (STATUS=0) during the first time interval (0, 365) but had an event (STATUS=1) during the second interval (365, 428). This subject has the value CLINIC=1, thus has the time-dependent values HV1=1 and HV2=0 over the first interval and HV1=0 and HV2=1 over the second interval.

Before running an extended Cox model with these heaviside functions we define an object (called Y365) for the response variable using the Surv function. This object is then used in the coxph model formula. It is not necessary to explicitly define this object and we did not do so for the previous extended Cox model that we ran containing LOGTDOSE, but the code is more readable with the notation for the response variable simplified.

Y365=Surv(addicts.cp365$start,addicts.cp365$survt,addicts.cp365$status)

Next we run the model with two heaviside functions:

coxph(Y365 ~ prison + dose + hv1 + hv2 + cluster(iid), data=addicts.cp365)

## Call:
## coxph(formula = Y365 ~ prison + dose + hv1 + hv2, data = addicts.cp365, 
##     cluster = iid)
## 
##             coef exp(coef)  se(coef) robust se      z        p
## prison  0.377951  1.459291  0.168415  0.167650  2.254   0.0242
## dose   -0.035480  0.965142  0.006435  0.006520 -5.442 5.27e-08
## hv1    -0.459373  0.631679  0.255290  0.259983 -1.767   0.0772
## hv2    -1.830517  0.160331  0.385954  0.398376 -4.595 4.33e-06
## 
## Likelihood ratio test=74.25  on 4 df, p=2.868e-15
## n= 360, number of events= 150

The estimated hazard ratio (CLINIC=2 vs. CLINIC=1) is 0.632 for days $< 365$ and 0.160 for days $\ge 365$ (found in the second numeric column under exp(coef)). If we wish to match the SAS, Stata, and SPSS output, we could run the model without robust standard errors and use the method=”breslow” to handle simultaneous events (ties) in the Cox likelihood.

coxph(Y365 ~ prison + dose + hv1 + hv2, data=addicts.cp365, method="breslow")

## Call:
## coxph(formula = Y365 ~ prison + dose + hv1 + hv2, data = addicts.cp365, 
##     method = "breslow")
## 
##             coef exp(coef)  se(coef)      z        p
## prison  0.377704  1.458931  0.168402  2.243   0.0249
## dose   -0.035512  0.965112  0.006435 -5.518 3.43e-08
## hv1    -0.459563  0.631560  0.255291 -1.800   0.0718
## hv2    -1.828228  0.160698  0.385946 -4.737 2.17e-06
## 
## Likelihood ratio test=74.17  on 4 df, p=2.978e-15
## n= 360, number of events= 150

To run an equivalent model with one heaviside function, we need to include the CLINIC variable in the model:

# One heaviside function: clinic carries the effect before day 365,
# hv2 the additional effect from day 365 onward
coxph(Y365 ~ prison + dose + clinic + hv2 + cluster(iid), data=addicts.cp365)

The coefficient estimates differ from those of the model with two heaviside functions, but the estimated hazard ratios are the same. The estimated hazard ratio (CLINIC=2 vs. CLINIC=1) is 0.632 for days $< 365$ (exponentiate the coefficient for CLINIC: $\exp(-0.4594)=0.632$). To estimate the hazard ratio for days $\ge 365$, we sum the coefficient estimates for CLINIC and HV2 and then exponentiate: $\exp(-0.4594 + (-1.3711)) = \exp(-1.8305) = 0.160$. The significant p-value for the estimated coefficient of HV2 (p = 3.6e-03, i.e., p = 0.0036) suggests that the hazard ratios for CLINIC for the two different time periods are not equal. In other words, the significant p-value provides evidence that the proportional hazards assumption is violated for CLINIC.
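The arithmetic linking the one-heaviside and two-heaviside parameterizations can be checked in a few lines. The coefficient values below are the rounded estimates quoted in the text, and the variable names are placeholders.

```r
# Rounded estimates quoted in the text for the one-heaviside model
b_clinic <- -0.4594   # coefficient of CLINIC (effect before day 365)
b_hv2    <- -1.3711   # coefficient of HV2 (additional effect from day 365 on)

hr_before_365 <- exp(b_clinic)           # hazard ratio for days < 365
hr_from_365   <- exp(b_clinic + b_hv2)   # hazard ratio for days >= 365
round(c(before = hr_before_365, after = hr_from_365), 3)
```

Both values agree with the exp(coef) entries for HV1 and HV2 in the two-heaviside model (0.632 and 0.160).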

Time varying variables that increase linearly with time¹²

A common source of confusion is whether or not one could treat patient age as a time dependent variable. The age at entry as a covariate in survival analysis is a fixed quantity at time 0, by definition.
But we know that the age of a patient increases in lock step with time itself. Can we treat increasing age of patient as a time dependent variable? The answer is yes, but doing so has no effect on the model.

#coxph(Surv(time, status==2)~age, data=lung)

#coxph(Surv(time, status==2)~ tt(age), data=lung,
#      tt=function(x,t, ...)
#      {age<-x + t/365.25})

There is no change at all in the fitted values.
Let us denote age at entry into the trial by $z(0)$ and current age by $z(t)=z(0)+t$. Then the hazard function is given by \[h(t)=h_0(t)e^{\beta z(t)}={h_0(t)e^{\beta t}\cdot e^{\beta z(0)}}\]
At the same time, in the partial likelihood, the time dependent part, $e^{\beta t}$, appears in both the numerator and the denominator of each factor, as does the baseline hazard. Both cancel, leaving only the age at entry variable $z(0)$. Thus, the coefficient $\beta$ for the time dependent model is identical to that from the non-time dependent model.
The same happens with any time dependent covariate that increases in lock step with time.
However, if the variable doesn’t change at a constant rate, this equivalence no longer holds.
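The equivalence can be checked numerically by running the two commented-out models and comparing their coefficients. This is a minimal sketch assuming the lung data from the survival package; the object names fit_fixed and fit_current are illustrative.

```r
library(survival)  # provides coxph(), Surv(), tt(), and the lung data

# Age at entry as a fixed covariate
fit_fixed <- coxph(Surv(time, status == 2) ~ age, data = lung)

# Current age as a time-dependent covariate: age at entry plus elapsed years
fit_current <- coxph(Surv(time, status == 2) ~ tt(age), data = lung,
                     tt = function(x, t, ...) x + t / 365.25)

# The two coefficients should agree up to numerical tolerance
coef_diff <- abs(unname(coef(fit_fixed)) - unname(coef(fit_current)))
c(fixed = unname(coef(fit_fixed)), current = unname(coef(fit_current)),
  difference = coef_diff)
```

Within every risk set all subjects share the same elapsed time, so the added term cancels from each factor of the partial likelihood and the difference is zero apart from rounding in the iterations.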

Time varying coefficients¹³

An alternative way of modeling non-proportional hazards is to allow the coefficients for a particular covariate to vary with time. Specifically, if there is only one covariate, we have $h(t)=h_0(t)e^{z_k\beta(t)}$, where it is $\beta$ that varies with time, rather than the covariate $z_k$ as in the previous section.
Characterizing the functional form of the non-proportional hazards is a much harder problem than simply testing for a difference.
Although here it is the coefficient $\beta$ that is changing rather than the covariate $z$, we may model this by defining a new time dependent variable with fixed coefficients that achieves the same effect.

Example: Pancreatic data

The dataset “pancreatic” in the “asaur” package consists of pancreatic cancer data from a Phase II clinical trial where the primary outcome of interest is progression-free survival (PFS). This quantity is defined as the time from entry into a clinical trial until progression or death, whichever comes first.
The data consist of, for each patient, the stage, classified as “LAPC (locally advanced pancreatic cancer)” or “MPC (metastatic pancreatic cancer)”, the date of entry into the clinical trial, the date of death (all of the patients in this study died), and the date of progression, if that was observed before death.

#install.packages("asaur")
library(asaur)
head(pancreatic)

##   stage    onstudy progression      death
## 1     M 12/16/2005    2/2/2006 10/19/2006
## 2     M   1/6/2006   2/26/2006  4/19/2006
## 3    LA   2/3/2006    8/2/2006  1/19/2007
## 4     M  3/30/2006           .  5/11/2006
## 5    LA  4/27/2006   3/11/2007  5/29/2007
## 6     M   5/7/2006   6/25/2006 10/11/2006

Patient #4, for example, died with no recorded progression (the progression date is missing), so that person’s PFS is time to death. For the five other patients in this list, the PFS is time to the date of progression.
To calculate the PFS,

attach(pancreatic)
# convert the text dates into R dates; the years have four digits, so use %Y rather than %y
pdd <- as.Date(as.character(progression), format="%m/%d/%Y")
odd <- as.Date(as.character(onstudy), format="%m/%d/%Y")
ddd <- as.Date(as.character(death), format="%m/%d/%Y")

pd <- (pdd - odd)   # days from study entry to progression (NA if no progression was recorded)
od <- (ddd - odd)   # days from study entry to death

pfs <- pd
pfs[is.na(pfs)] <- od[is.na(pfs)]   # no recorded progression: PFS is time to death

head(pfs)

## Time differences in days
## [1]  48  51 180  42 318  49

plot(survfit(Surv(pfs)~stage), xlab="Time in days", col=c("blue", "red"), lwd=2)
legend("topright", legend=c("Locally advanced PC","Metastatic PC"), col=c("blue","red"), lwd=2)

The log-rank test is

survdiff(Surv(pfs)~stage, rho=0)

For the six patients listed above, the computed values match the pfs column of pancreatic2, so this test is expected to agree with the log-rank result for pancreatic2 reported at the end of this example.

The PFS is already calculated in the pancreatic2 dataset.

# pancreatic data2
head(pancreatic2)

##   pfs  os status stage
## 1  48 307      1     M
## 2  51 103      1     M
## 3 180 350      1    LA
## 4  42  42      1     M
## 5 318 397      1    LA
## 6  49 157      1     M

stage.n <- rep(0, nrow(pancreatic2))
stage.n[pancreatic2$stage=="M"]<-1
result.panc <- coxph(Surv(pfs)~stage.n, data=pancreatic2)
result.panc

## Call:
## coxph(formula = Surv(pfs) ~ stage.n, data = pancreatic2)
## 
##           coef exp(coef) se(coef)    z     p
## stage.n 0.5931    1.8095   0.4007 1.48 0.139
## 
## Likelihood ratio test=2.43  on 1 df, p=0.1188
## n= 41, number of events= 41

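The estimated coefficient for stage (0.593; hazard ratio 1.81, approximate 95% Wald CI 0.83 to 3.97, p = 0.139) and the overall model fit (likelihood ratio test = 2.43 on 1 d.f., p = 0.119) show little evidence of a group difference.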
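As a rough check on how such summary numbers arise, the Wald quantities can be recomputed from the printed coefficient and standard error. This is only a sketch based on the rounded values in the output above; the names `b`, `se`, `hr`, `ci`, `z`, and `p_val` are introduced here for illustration.

```r
# Values copied from the coxph output above
b  <- 0.5931   # coef for stage.n
se <- 0.4007   # se(coef)

hr    <- exp(b)                                   # hazard ratio
ci    <- exp(b + c(-1, 1) * qnorm(0.975) * se)    # approximate 95% Wald interval
z     <- b / se                                   # Wald statistic
p_val <- 2 * pnorm(-abs(z))                       # two-sided Wald p-value

round(c(HR = hr, lower = ci[1], upper = ci[2], z = z, p = p_val), 3)
```

When the fitted object is available, `summary(result.panc)` reports the same hazard ratio together with its 95% confidence limits.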
The following Schoenfeld residual plot suggests that the hazard ratio is not constant over time.

result.sch.resid <- cox.zph(result.panc)
plot(result.sch.resid)

One way of dealing with this is to use the Prentice modification of the Wilcoxon test (using “rho=1” in the “survdiff” function).
An alternative is to accommodate the changing hazard ratio by defining a time-dependent covariate, $g(t)=z\cdot \log(t)$.
In “coxph”, the “tt” argument supplies this time-transform function.

result.panc2.tt <- coxph(Surv(pfs)~stage.n + tt(stage.n), data=pancreatic2,
                         tt=function(x,t, ...) x*log(t))
result.panc2.tt

## Call:
## coxph(formula = Surv(pfs) ~ stage.n + tt(stage.n), data = pancreatic2, 
##     tt = function(x, t, ...) x * log(t))
## 
##                 coef exp(coef) se(coef)      z      p
## stage.n       6.0096  407.3394   3.0598  1.964 0.0495
## tt(stage.n)  -1.0858    0.3376   0.5889 -1.844 0.0652
## 
## Likelihood ratio test=6.33  on 2 df, p=0.04229
## n= 41, number of events= 41

The fitted function is $\beta(t)= 6.01 - 1.09\cdot\log(t)$.
- The coefficient of the time-dependent term was estimated as -1.086 (SE 0.589; approximate 95% Wald CI -2.24 to 0.07; p = 0.065).
- The model fit, likelihood ratio test = 6.33 on 2 d.f., p = 0.042, indicates that the group indicator combined with a time-varying hazard ratio yields evidence of a group difference.
- This is consistent with the weighted log-rank test (e.g., rho=1).
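To see what a log hazard ratio of the form $\beta_0 + \beta_1 \log(t)$ implies, one can evaluate it at a few follow-up times using the coefficients printed above. This is an illustrative sketch only: the standard errors are large, so the values far from the bulk of the data should not be over-interpreted, and the names `b0`, `b1`, `log_hr`, and `t_cross` are introduced here.

```r
# Coefficients copied from the coxph output with tt = x * log(t)
b0 <- 6.0096    # stage.n
b1 <- -1.0858   # tt(stage.n), multiplies log(t)

# Time-varying log hazard ratio (metastatic vs. locally advanced)
log_hr <- function(t) b0 + b1 * log(t)

days <- c(30, 60, 120, 240, 360)
data.frame(days = days, hazard_ratio = round(exp(log_hr(days)), 2))

# Time at which the fitted hazard ratio equals 1 (log hazard ratio = 0)
t_cross <- exp(-b0 / b1)
t_cross
```

Under this fit the estimated hazard ratio starts well above 1 and falls below 1 after roughly 250 days, which is the pattern a weighted test that emphasizes early differences is designed to pick up.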

survdiff(Surv(pfs)~stage, rho=0, data=pancreatic2)

## Call:
## survdiff(formula = Surv(pfs) ~ stage, data = pancreatic2, rho = 0)
## 
##           N Observed Expected (O-E)^2/E (O-E)^2/V
## stage=LA  8        8     12.3      1.49      2.25
## stage=M  33       33     28.7      0.64      2.25
## 
##  Chisq= 2.2  on 1 degrees of freedom, p= 0.1

survdiff(Surv(pfs)~stage, rho=1, data=pancreatic2)

## Call:
## survdiff(formula = Surv(pfs) ~ stage, data = pancreatic2, rho = 1)
## 
##           N Observed Expected (O-E)^2/E (O-E)^2/V
## stage=LA  8     2.34     5.88     2.128      4.71
## stage=M  33    18.76    15.22     0.822      4.71
## 
##  Chisq= 4.7  on 1 degrees of freedom, p= 0.03

We can also check this with a Schoenfeld residual plot.

result.sch.resid <- cox.zph(result.panc, transform = function(pfs) log(pfs))
plot(result.sch.resid)
abline(coef(result.panc2.tt), col="red")

In this plot, the curved line is a loess (smooth) curve through the residuals. The tick marks on the horizontal axis follow a logarithmic scale, as specified by the “transform” argument in the “cox.zph” function. The red line is from the fitted time-transform function, not from a fit to the residuals; it is a log function whose plot appears straight because the horizontal axis is on a logarithmic scale. This function indicates that, overall, the log hazard ratio decreases over time. Other time-dependent functions may not yield this result. For example, if $g(t)=z\cdot t$, we get a non-significant result (likelihood ratio test = 4.56 on 2 d.f., p = 0.10) for the effect of “stage.n” on survival.

result.panc3.tt <- coxph(Surv(pfs)~stage.n + tt(stage.n), data=pancreatic2, 
                         tt=function(x,t, ...) x*t)
result.panc3.tt

## Call:
## coxph(formula = Surv(pfs) ~ stage.n + tt(stage.n), data = pancreatic2, 
##     tt = function(x, t, ...) x * t)
## 
##                  coef exp(coef)  se(coef)      z      p
## stage.n      1.278099  3.589808  0.661027  1.934 0.0532
## tt(stage.n) -0.003656  0.996350  0.002532 -1.444 0.1487
## 
## Likelihood ratio test=4.56  on 2 df, p=0.1025
## n= 41, number of events= 41

Left Truncation¹⁴

Left truncation can arise in a clinical trial setting when the question of interest is time from diagnosis (rather than time from enrollment) to death or censoring.¹⁵
The same considerations arise in a comparative clinical trial. For example,
- consider data from a hypothetical trial of six patients, three receiving an experimental treatment and three receiving a standard therapy.
- The time “tt” represents the time from entry into the trial until death or censoring, “status” indicates whether or not a death was observed, and “grp” indicates which group the patient is in. The time “backtime” refers to the backwards recurrence time, that is, the time before entry when the patient was diagnosed.
A key decision in any survival analysis is where to define the starting point for determining an individual’s “true” survival time, which we call time 0. Depending on the study, choices for time 0 might be: the time the subject enters the study, the time the subject begins treatment, the time of disease onset, the time of diagnosis, a point in calendar time, the time of a seminal event (e.g., surgery), birth, or conception. If we define time 0 at birth, then an individual’s survival time is represented by their age.
Time 0 is not necessarily the time point where a subject’s survival time is first observed (which we call time $t_0$). For example, if survival time is measured by age at follow-up and a subject enters the study at age 45, then $t_0=45$ years for this subject. In this example, the subject’s survival time has been left-truncated at $t_0=45$.
Left truncation at time $t_0$ is defined as follows
- The subject is not observed from time 0 to $t_0$. If the subject has the event before time $t_0$, then that subject is not included in the study. If the subject has the event after time $t_0$, the subject is included in the study but with the caveat that the subject was not at risk to be an observed event until time $t_0$.
  - The first type of left truncation occurs if the subject has the event before $t_0$ and thus is not included in the study. If, for example, the exposure (E) under study causes individuals to die before they could enter the study, this could lead to a (selective) survival bias that would underestimate the effect of exposure.
  - The second type of left truncation occurs if the subject survives beyond time $t_0$ (i.e., $t > t_0$). This is required in order for the subject to have his/her survival time observed.
- Thus, a condition of the subject’s entry into the study is that they survive until time $t_0$. If they do not meet that condition, then their left truncation is of the first type and they are not included in the study. If they do survive past time $t_0$, then their left truncation is of the second type.
Left truncation (of both types) at time $t$ is commonly confused with left censorship at time $t$. Let’s assume a study starting from time 0.
- Right-censored: Suppose a subject is lost to follow-up after 10 years of observation. The time of event is not observed because it happened after the 10th year. This subject is right censored at 10 years because the event happened to the right of 10 on the time line (i.e., $t > 10$).
- Left-censored: Suppose a subject had an event before the $10^{th}$ year but the exact time of the event is unknown. This subject is left-censored at 10 years (i.e., $t < 10$).
- Interval-censored. Suppose a subject had an event between the $8^{th}$ and $10^{th}$ year (exact time unknown). This subject is interval censored (i.e., $8 < t < 10$).
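The three censoring types above can be written down with the `Surv` function from the survival package. This is a minimal sketch for a single hypothetical subject; the object names `right`, `left`, and `interval` are introduced here for illustration.

```r
library(survival)

# Right-censored at 10 years: the event occurs after year 10
# (event = 0 means the time is a censoring time)
right <- Surv(time = 10, event = 0)

# Left-censored at 10 years: the event occurred before year 10
left <- Surv(time = 10, event = 0, type = "left")

# Interval-censored: the event occurred between years 8 and 10
interval <- Surv(time = 8, time2 = 10, type = "interval2")

right
left
interval
```

Left truncation is a different thing: it is expressed through the entry time in the start-stop form `Surv(entry, exit, event)`, which says when a subject begins to be at risk of being observed rather than what is known about the event time.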
We now compare two approaches of measuring survival time. One approach is to measure survival time as time-on-study and the other is to measure survival time as age-at-follow-up until either an event or censorship. The choice of approach determines the risk set at the time of each event.
- Consider the data shown on the left on four different subjects, for each of whom we have identified time-on-follow-up ($t$), whether failed or censored ($d$), age at study entry ($a_0$), and age at the end of follow-up ($a$). Note that $t$ is simply $a - a_0$, the difference between age at the end of follow-up and age at study entry.
- Using time-on-study (i.e., from entry into the study) as the time scale, the data layout based on ordered follow-up times is shown on the left, below which is shown a graphical representation that follows each subject from the time of study entry.

Survival times and backwards recurrence times for data from a comparative clinical trial. Patients marked “T” received the experimental treatment, and those marked “C” received the standard therapy.

Re-aligned data with left truncation

Using time-on-study as the time scale can give a different view of the survival data (i.e., a closed cohort) than found when using age as the time scale (i.e., an open cohort). So which time scale should be used and how do we make such a decision in general?
- To answer this question, a key issue is to determine whether all subjects in the study first begin to be at risk for the outcome at the time they enter the study.
- Suppose the study is a clinical trial to compare, say, treatment and placebo groups, and subjects start to be followed shortly after random allocation into one of these two groups. Then, it may be reasonable to assume that study subjects begin to be at risk for the outcome upon entry into the study. In such a situation, using time-on-study as the time scale is typically appropriate. Further, covariates of interest may be controlled for by stratification and/or being entered into a regression model (e.g., Cox PH model) as predictors in addition to the treatment status variable.
- Suppose, instead of the above scenario, the study is observational (i.e., not a clinical trial) and subjects are already at risk for the outcome prior to their study entry. Also, suppose the time or age at which subjects first became at risk is unknown. For example, the subjects may all have high blood pressure when the study begins and are then followed until a coronary event occurs (or censorship); such subjects already had high blood pressure when recruited for the study, but the date or their age when their high blood pressure condition was first diagnosed is assumed unknown. In this situation, it seems reasonable that the time at risk prior to study entry (tr), which is unknown, contributes to the true survival time (T) for the individual, although only the observed time-on-study (t) is actually available to analyze. The individual’s true (i.e., total) survival time is therefore underestimated by the time-on-study information (obtained from study entry), i.e., the true survival time is left-truncated. So, for the situation where we have left-truncated survival data, the use of time-on-study follow-up times that ignores unknown delayed entry time may be questioned.
One way to account for the age difference at entry would be simply to control for age at entry (i.e., $a_0$) as a covariate in one’s survival analysis by adding the variable $a_0$ to a Cox PH model. This approach is reasonable provided the model is specified correctly (e.g., the proportional hazards assumption is met for age). Alternatively, considering subjects I and J, who have entered at the same time but are 9 years different in age, we might consider using age as the time scale to represent a subject’s potential for failure, which we now describe.
At this point, we might again ask, when, if at all, would using a model based on $h(a, X)$ be preferable to simply using a model of the form $h(t, X, a_0)$, where $t$ denotes time-on-follow-up and $a_0$ denotes age at entry?
- The answer is that “it depends”. Moreover, in many situations, it might not matter, since use of either model form will often lead to essentially the same results, provided the model is well-specified in each case.
- Using $h(a, X)$ may be preferable if age is a much stronger determinant of the outcome than time-on-study, i.e., age at event may have a larger effect on the hazard than time-on-study. Also, because age is taken into account in an unspecified baseline hazard $h_0(a)$, a more effective control of age may result that avoids the possibility of misspecifying the way the age at entry ($a_0$) might be entered into a time-on-study model, e.g., using only a linear term when a quadratic term such as $a_0^2$ is also required for model adequacy.
- Using $h(t, X, a_0)$ may be preferable if time-on-study is a stronger determinant of the outcome than age at the event, as in a randomized clinical trial. Also, a time-on-study model would seem appropriate if age at entry ($a_0$) is “effectively controlled” (e.g., using a quadratic term if necessary) or is stratified in the model.
We now specify several alternative forms that a Cox PH model might take to account for age-truncated (i.e., left-truncated) survival data.
Time-on-study: these models are time-on-study models that control for age at entry ($a_0$), but do so differently.
- Model 0: $h(t,x)=h_0(t)\exp[\sum \beta_i x_i]$, unadjusted for $a_0$
- Model 1: $h(t,x, a_0)=h_0(t)\exp[\sum \beta_i x_i+\gamma_1 a_0]$, adjusted for $a_0$ as linear covariate
- Model 2: $h(t,x, a_0)=h_0(t)\exp[\sum \beta_i x_i+\gamma_1 a_0 + \gamma_2 a_0^2]$, adjusted for $a_0$ as quadratic covariate
- Model 3: $h_g(t,x)=h_{0g}(t)\exp[\sum \beta_i x_i]$, stratified by $a_0$ or birth cohort, $g=1, \cdots, s$
  - Model 0 is the least appropriate since this model uses time-on-study as the outcome and does not adjust for age at entry ($a_0$) in any way.
  - Model 1 controls for $a_0$ as a covariate and assumes a linear effect of $a_0$. Model 2, in contrast, assumes that $a_0$ has both linear and quadratic effects. Model 3 stratifies on either $a_0$ or on birth cohort defined from $a_0$. Model 3 is called a Stratified Cox (SC) PH model.
  - Models 1–3 are all reasonable if we assume that study subjects begin to be at risk upon study entry, as in a randomized clinical trial. Moreover, even for an observational study design in which subjects have different ages at entry, these models may appear justifiable if they provide effective control of $a_0$.
  - Model 3 controls for entry age by stratifying either on age at entry ($a_0$) or on birth cohort based on $a_0$. Model 3 provides an alternative way to control for age without explicitly putting $a_0$ as a covariate in the model (as was done in Models 1 and 2). If we stratify by birth cohort instead of by $a_0$, we can account for possible advances in medical management in later birth cohorts. Nevertheless, stratifying by age at entry or stratifying by birth cohort would likely give similar results unless enrollment happens over a long period of time. In the latter case, we recommend stratifying on birth cohort.
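The time-on-study specifications (Models 0–3) can be sketched in R with `coxph`. The data below are simulated purely for illustration: the variable names (`time`, `d`, `x`, `a0c`, `cohort`), the effect sizes, and the 12-year administrative censoring are assumptions, not values from the text. Entry age is centered at 55 before squaring, which leaves Model 2 equivalent to a quadratic in $a_0$ but is numerically more stable.

```r
library(survival)

# Simulated (hypothetical) cohort: entry age a0, binary exposure x,
# follow-up administratively censored at 12 years
set.seed(1)
n  <- 300
a0 <- runif(n, 40, 70)
x  <- rbinom(n, 1, 0.5)
t_event <- rexp(n, rate = 0.05 * exp(0.5 * x + 0.03 * (a0 - 55)))

dat <- data.frame(
  time   = pmin(t_event, 12),            # time-on-study
  d      = as.integer(t_event <= 12),    # 1 = event observed, 0 = censored
  x      = x,
  a0c    = a0 - 55,                      # centered entry age
  cohort = cut(a0, breaks = c(40, 50, 60, 70), include.lowest = TRUE)
)

m0 <- coxph(Surv(time, d) ~ x, data = dat)                     # Model 0: unadjusted
m1 <- coxph(Surv(time, d) ~ x + a0c, data = dat)               # Model 1: linear entry age
m2 <- coxph(Surv(time, d) ~ x + a0c + I(a0c^2), data = dat)    # Model 2: quadratic entry age
m3 <- coxph(Surv(time, d) ~ x + strata(cohort), data = dat)    # Model 3: stratified

# Exposure coefficient under each specification
sapply(list(M0 = m0, M1 = m1, M2 = m2, M3 = m3), function(m) coef(m)["x"])
```

Comparing the exposure coefficient across the four fits is one simple way to see how much the handling of entry age matters in a given dataset.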
Age as time scale: these models use age-at-event or censorship rather than time-on-study as the outcome variable. These models differ in the way the baseline hazard function is specified.
- Model 4: $h(a,x)=h_0(a)\exp[\sum \beta_i x_i]$, unadjusted for left truncation at $a_0$
- Model 5: $h(a,x)=h_0(a|a_0)\exp[\sum \beta_i x_i]$, adjusted for left truncation at $a_0$
- Model 6: $h_g(a,x)=h_{0g}(a|a_0)\exp[\sum \beta_i x_i]$, adjusted for left truncation at $a_0$ and stratified by birth cohort, $g=1, \cdots, s$
  - Model 4 uses $h_0(a)$ to indicate that, although age is the outcome, the model does not adjust for left truncation at the entry age ($a_0$). In effect, this baseline hazard assumes that each subject’s observed risk period started at birth. In other words, Model 4 allows keeping in the risk set $R(a_P)$ any subject (e.g., subject Q in the figure at left) who was not under study at age $a_P$ but who enrolled later (at age $a_{0Q}$). Here, subject Q is in the risk set $R(a_P)$ because we assume he is at risk from birth (Age=0) when subject P fails at $a_P$. The data layout with ordered failure ages is thus a closed cohort that starts with all subjects in the risk set at birth. If Model 4 were applied to our previous example involving four subjects, subjects J and K would be incorrectly included in the risk set $R(a=67)$ when subject H failed, even though both these subjects were enrolled after age 67. This model inappropriately assumes that all subjects were at risk from birth; it does not adjust for age-truncation.
  - Model 5, on the other hand, accounts for left truncation by age at entry. The baseline hazard $h_0(a|a_0)$ is used to indicate that the data layout with ordered failure ages is an open cohort. For this model, the risk set $R(a)$ at time a contains only those subjects who are under study at age $a$. If Model 5 was applied to our previous example, subjects J and K, who had not enrolled when subject H failed at 67, would not be in the risk set $R(a=67)$. Also, subjects H and I, who were no longer in the study when subject K failed at age 78, would not be in the risk set $R(a=78)$.
  - Model 6 is similar to Model 3 stratified on birth cohort. However, Model 6 adjusts for age truncation, whereas Model 3 does not. As with Model 3, Model 6 is intended to account for possible advances in medical management in later birth cohorts. Model 6 would not be necessary if we are considering a study in which everyone is enrolled within a short period of time.
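The difference between the Model 4 and Model 5 risk sets can be made concrete with a small calculation. The ages below are illustrative values chosen to agree with the description of subjects H, I, J, and K in the text (the original table is not reproduced here), and `risk_set` is a helper written for this sketch.

```r
# Illustrative values, assumed for this sketch
subj <- data.frame(
  id = c("H", "I", "J", "K"),
  a0 = c(65, 65, 74, 75),   # age at study entry
  a  = c(67, 71, 80, 78),   # age at event or censoring
  d  = c(1, 0, 0, 1)        # 1 = failed, 0 = censored
)

# Risk set at a given age. With truncated = FALSE everyone is treated as
# at risk from birth (Model 4); with truncated = TRUE only subjects who
# have already entered the study are included (Model 5).
risk_set <- function(age, data, truncated = TRUE) {
  entered <- if (truncated) data$a0 < age else rep(TRUE, nrow(data))
  as.character(data$id[entered & data$a >= age])
}

risk_set(67, subj, truncated = FALSE)  # Model 4: "H" "I" "J" "K"
risk_set(67, subj, truncated = TRUE)   # Model 5: "H" "I"
risk_set(78, subj, truncated = TRUE)   # Model 5: "J" "K"
```

With these values, J and K are wrongly counted as at risk at age 67 when entry ages are ignored, and drop out of that risk set once left truncation is respected.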
In summary, of the seven models we have presented, Models 0 and 4 are inappropriate because Model 0 does not account for age at all and Model 4 ignores age truncation by incorrectly assuming that all study subjects were observed for the outcome from birth. The other five models (i.e., 1–3, 5, 6) all adjust for age at study entry in some way. A logical question at this point is whether in practice, it makes a difference which model is used to analyze age-truncated survival data?
The above question was actually addressed by Pencina et al. (Statist. Med., 2007) by comparing Models 1–6 above in terms of the estimated regression coefficients they produce. These authors consider Model 5, the age-truncated age scale model, to be “possibly the most appropriate refinement” to account for age-truncation. They also view time-on-study Models 1 and 2, which use linear and/or quadratic terms to adjust for entry age as a covariate as “attempts to approximate” Model 5.
Nevertheless, by considering numerical simulations as well as four practical examples from the Framingham Heart Study, Pencina et al. conclude that correct adjustment for the age at entry is crucial in reducing bias of the estimated coefficients. The unadjusted age-scale model (Model 4) is inferior to any of the five other models considered, regardless of their choice of time scale. Moreover, if correct adjustment for age at entry is made, as in Models 1–3, 5, and 6, their analyses suggest that there exists little if any practical or meaningful difference in the estimated regression coefficients depending on the choice of time scale.
To illustrate, we show on the left results from Pencina et al. corresponding to Models 1–6 applied to 12-year follow-up Framingham Heart Study data¹⁶. The outcome considered here is coronary heart disease (CHD) in men. These results focus on two risk factors measured at baseline: diabetes mellitus status and education status, the latter categorized into two groups defined by post-high-school education (yes/no). The estimated regression coefficients (separately) relating these two risk factors to CHD outcome are presented in the table.
As expected, the table shows a substantial difference in the coefficient of the risk group variable estimated by the unadjusted age-scale model (Model 4) and the five other models. Moreover, the results for Models 1–3, 5, and 6 are all quite similar.
Pencina et al. also point out that the directions of coefficients for these five models are in the directions anticipated conceptually, e.g., diabetes coefficients are positive, whereas education coefficients are negative. The quadratic baseline age term (Model 2) was significant for both CHD risk factors. This suggests potential misspecification in the modeling of the relationship between CHD and age introduced by Model 1, which treats entry age as linear. However, its inclusion in the time-on-study model did not materially influence the magnitude or significance of the estimated exposure variable (diabetes or education status) coefficient.
When using age as the time scale and accounting for age truncation (i.e., using Model 5 above), the data layout requires the counting process (CP) start–stop format, with $a_0$ as the start variable and $a$ as the stop variable. However, since we are not considering recurrent events data here, the CP format for age-truncated survival data has a simpler form, involving only one line of data for each study subject, as shown on the left. The computer code needed to program the analysis is described in the Computer Appendix for STATA, SAS, SPSS, or R packages.
Note that the CP format corresponding to Model 4, which assumes the starting time is birth, would modify the Model 5 layout by letting $a_0 = 0$ in the $a_0$ column for all subjects. Nevertheless, this layout would be equivalent to the “standard” layout that omits the $a_0$ column and simply treats the $a$ column data as time-on-study information. Again, since Model 4 appears to be inferior to the other models, we caution the reader not to use this format unless the risk period was observed since birth.
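In R, the one-line-per-subject CP layout maps directly onto the three-argument form `Surv(start, stop, event)`. The sketch below uses simulated data with hypothetical column names (`start0`, `a0`, `a`, `d`, `x`); it fits Model 5 and then checks the equivalence described above between the Model 4 CP layout (start fixed at 0) and the standard layout.

```r
library(survival)

# Hypothetical layout: one row per subject, a0 = entry age, a = exit age
set.seed(2)
n  <- 300
a0 <- runif(n, 40, 70)
x  <- rbinom(n, 1, 0.5)
fu <- pmin(rexp(n, rate = 0.05 * exp(0.5 * x)), 12)   # follow-up, censored at 12 years
cp <- data.frame(start0 = 0, a0 = a0, a = a0 + fu,
                 d = as.integer(fu < 12), x = x)

# Model 5: age as the time scale, left-truncated at the entry age
m5 <- coxph(Surv(a0, a, d) ~ x, data = cp)

# Model 4: start fixed at birth (0) for everyone ...
m4_cp  <- coxph(Surv(start0, a, d) ~ x, data = cp)
# ... which matches the standard layout without a start column
m4_std <- coxph(Surv(a, d) ~ x, data = cp)

c(model5     = unname(coef(m5)),
  model4_cp  = unname(coef(m4_cp)),
  model4_std = unname(coef(m4_std)))
```

The two Model 4 fits should agree up to numerical tolerance, while Model 5 can differ because its risk sets exclude subjects who have not yet entered.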

Example: Hypothetical data

tt <- c(6,7,10,15,19,25)
status <- c(1,0,1,1,0,1)
grp <- c(0,0,1,0,1,1)
backtime <- c(-3,-11,-3,-7,-10,-5)

The data are plotted in the following figure.

Survival times and backwards recurrence times for data from a comparative clinical trial. Patients marked “T” received the experimental treatment, and those marked “C” received the standard therapy.
The standard way to compare the two groups is to ignore the backwards recurrence times.

coxph(Surv(tt,status)~grp)

## Call:
## coxph(formula = Surv(tt, status) ~ grp)
## 
##        coef exp(coef) se(coef)     z     p
## grp -1.3261    0.2655   1.2509 -1.06 0.289
## 
## Likelihood ratio test=1.21  on 1 df, p=0.2715
## n= 6, number of events= 4

This result shows that the experimental group has a lower estimated hazard than the control group (hazard ratio 0.27), but this difference is not statistically significant (Wald z = -1.06, p = 0.289; likelihood ratio test = 1.21 on 1 d.f., p = 0.27).
There is nothing wrong with this standard and widely used method: since there is no reason to believe that the backwards recurrence times would differ between the two groups, there should be no concern about bias.
However, in some circumstances one may wish to compare survival times measured from the time of diagnosis, and then it is essential to account for the left truncation.
The data can be re-configured so that the diagnosis occurs at time 0 as follows.

tm.enter <- -backtime
tm.exit <- tt - backtime

These data are plotted as follows.

Re-aligned data with left truncation

Using the full survival times with left-truncated data leads to a similar non-significant treatment difference conclusion.

coxph(Surv(tm.enter, tm.exit, status)~grp)

## Call:
## coxph(formula = Surv(tm.enter, tm.exit, status) ~ grp)
## 
##       coef exp(coef) se(coef)      z     p
## grp -1.073     0.342    1.235 -0.869 0.385
## 
## Likelihood ratio test=0.81  on 1 df, p=0.3677
## n= 6, number of events= 4
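The two analyses differ only in who is counted as at risk at each death. A short calculation with the same six patients makes this visible; the helper names (`event_times_trial`, `n_risk_trial`, and so on) are introduced here for illustration.

```r
# Data from the hypothetical trial above
tt       <- c(6, 7, 10, 15, 19, 25)
status   <- c(1, 0, 1, 1, 0, 1)
backtime <- c(-3, -11, -3, -7, -10, -5)
tm.enter <- -backtime
tm.exit  <- tt - backtime

# Observed death times on each time scale
event_times_trial <- sort(tt[status == 1])        # time from trial entry
event_times_diag  <- sort(tm.exit[status == 1])   # time from diagnosis

# Number at risk at each death time
n_risk_trial <- sapply(event_times_trial, function(s) sum(tt >= s))
n_risk_diag  <- sapply(event_times_diag,
                       function(s) sum(tm.enter < s & tm.exit >= s))

rbind(time = event_times_trial, at_risk = n_risk_trial)  # at_risk: 6 4 3 1
rbind(time = event_times_diag,  at_risk = n_risk_diag)   # at_risk: 4 5 3 1
```

On the diagnosis time scale, two patients have not yet entered the trial at the first death (day 9 after diagnosis), so they are excluded from that risk set. This is what the three-argument `Surv(tm.enter, tm.exit, status)` call encodes.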

Example: Channing House data (in the “asaur” package in R)

A serious problem arises with left-truncated data if the risk set becomes empty at an early survival time.
The Channing House dataset contains information on 96 men and 361 women who entered the Channing House retirement community, located in Palo Alto, CA. For each subject, the variable “entry” is the age (in months) at which the person entered Channing House, and “exit” is the age at which the person either died, left the community, or was still alive at the time the data were analyzed. The variable “cens” is 1 if the person had died and 0 otherwise.
This dataset is subject to left truncation because subjects who die at older ages are more likely to have enrolled in the center than subjects who died at younger ages. Thus, to obtain an unbiased estimate of the age distribution, it is necessary to treat “entry” as a left truncation time.

head(ChanningHouse)

##    sex entry exit time cens
## 1 Male   782  909  127    1
## 2 Male  1020 1128  108    1
## 3 Male   856  969  113    1
## 4 Male   915  957   42    1
## 5 Male   863  983  120    1
## 6 Male   906 1012  106    1

ChanningHouse <- within(ChanningHouse, 
                        {
                          entryYears <- entry/12
                          exitYears <- exit/12
                        })
head(ChanningHouse)

##    sex entry exit time cens exitYears entryYears
## 1 Male   782  909  127    1  75.75000   65.16667
## 2 Male  1020 1128  108    1  94.00000   85.00000
## 3 Male   856  969  113    1  80.75000   71.33333
## 4 Male   915  957   42    1  79.75000   76.25000
## 5 Male   863  983  120    1  81.91667   71.91667
## 6 Male   906 1012  106    1  84.33333   75.50000

ChanningMales <- ChanningHouse[ChanningHouse$sex == "Male",]
head(ChanningMales)

##    sex entry exit time cens exitYears entryYears
## 1 Male   782  909  127    1  75.75000   65.16667
## 2 Male  1020 1128  108    1  94.00000   85.00000
## 3 Male   856  969  113    1  80.75000   71.33333
## 4 Male   915  957   42    1  79.75000   76.25000
## 5 Male   863  983  120    1  81.91667   71.91667
## 6 Male   906 1012  106    1  84.33333   75.50000

Next we estimate the survival distribution for men using first the Kaplan-Meier estimator and then the Nelson-Aalen estimator (requested with type="fleming-harrington"), and plot them.

result.km <- survfit(Surv(entryYears, exitYears, cens)~1, data=ChanningMales)
result.km

## Call: survfit(formula = Surv(entryYears, exitYears, cens) ~ 1, data = ChanningMales)
## 
##      records n.max n.start events median 0.95LCL 0.95UCL
## [1,]      96    39       2     46   64.9    64.8      NA

plot(result.km, xlim=c(64,101), xlab="Age", 
     ylim=c(0.0, 1.0), ylab="Survival probability", conf.int = F)

result.naa <- survfit(Surv(entryYears, exitYears, cens)~1,
                      data=ChanningMales, type="fleming-harrington")
result.naa

## Call: survfit(formula = Surv(entryYears, exitYears, cens) ~ 1, data = ChanningMales, 
##     type = "fleming-harrington")
## 
##      records n.max n.start events median 0.95LCL 0.95UCL
## [1,]      96    39       2     46   65.1    64.8      90

lines(result.naa, col="blue", conf.int=F)

result.km.68 <- survfit(Surv(entryYears, exitYears, cens)~1,
                        start.time=68, data=ChanningMales)
result.km.68

## Call: survfit(formula = Surv(entryYears, exitYears, cens) ~ 1, data = ChanningMales, 
##     start.time = 68)
## 
##      records n.max n.start events median 0.95LCL 0.95UCL
## [1,]      94    39      12     44   84.1    80.5    86.9

lines(result.km.68, col="green", conf.int=F)
legend("topright", legend=c("KM", "NAA", "KM 68 and older"), 
       lty=1, col=c("black","blue", "green"))

The black curve is the KM estimate. It plunges to zero at age 65 because, at this early age, the size of the risk set is small and in fact reduces to 0. This forces the survival curve to zero. And, since the KM curve is a cumulative product, once it reaches zero it can never move away from zero.
The NAA estimate, shown in blue, is based on exponentiating a cumulative sum, so it doesn’t share this problem of going to zero early on. Still, it does take an early plunge, also due to the small size of the risk set at the younger ages. The problem here is that there is too little data to accurately estimate the overall survival distribution of men.
Instead, we can condition on men reaching the age of 68, using the “start.time” option, and estimate the survival among that cohort (the green line). This survival curve is much better behaved. So the only solution to the problem of a small risk set with left-truncated data is to select a realistic target (here, survival of men conditional on living to age 68) for which there is sufficient data to obtain a valid estimate.
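The contrast between the two estimators can be reproduced with a few made-up numbers. In this sketch the risk-set sizes and death counts are hypothetical, chosen so that the single subject at risk at the second age dies; `km` and `fh` are names introduced for illustration.

```r
# Hypothetical risk-set sizes and death counts at five ordered ages;
# at the second age the only subject at risk dies
n_risk  <- c(2, 1, 12, 20, 25)
n_event <- c(1, 1, 1, 2, 1)

km <- cumprod(1 - n_event / n_risk)     # Kaplan-Meier: cumulative product
fh <- exp(-cumsum(n_event / n_risk))    # exponentiated Nelson-Aalen cumulative hazard

round(data.frame(n_risk, n_event, km, fh), 3)
```

The product form hits zero at the second age and stays there no matter how large the later risk sets are, whereas the exponentiated cumulative sum drops sharply but remains positive.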
Cox model using all subjects, with entry age as the left-truncation time

coxph(Surv(entryYears, exitYears, cens)~ sex,
      data=ChanningHouse)

## Call:
## coxph(formula = Surv(entryYears, exitYears, cens) ~ sex, data = ChanningHouse)
## 
##           coef exp(coef) se(coef)     z      p
## sexMale 0.3219    1.3798   0.1733 1.857 0.0633
## 
## Likelihood ratio test=3.28  on 1 df, p=0.07021
## n= 457, number of events= 175

Cox model restricted to subjects followed to age 68 or older

channing68 <- ChanningHouse[ChanningHouse$exitYears >= 68,]
coxph(Surv(entryYears, exitYears, cens)~ sex,
      data=channing68)

## Call:
## coxph(formula = Surv(entryYears, exitYears, cens) ~ sex, data = channing68)
## 
##           coef exp(coef) se(coef)     z     p
## sexMale 0.2733    1.3143   0.1762 1.552 0.121
## 
## Likelihood ratio test=2.3  on 1 df, p=0.1292
## n= 451, number of events= 172

Weibull model using all subjects, with entry age as the left-truncation time

library(eha)
phreg(Surv(entryYears, exitYears, cens) ~ sex,
      data=ChanningHouse, dist="weibull", param="survreg")

## Call:
## phreg(formula = Surv(entryYears, exitYears, cens) ~ sex, data = ChanningHouse, 
##     dist = "weibull", param = "survreg")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## sex 
##           Female    0.807     0         1           (reference)
##             Male    0.193     0.355     1.427     0.172     0.039 
## 
## log(scale)                    4.476               0.011     0.000 
## log(shape)                    2.185               0.111     0.000 
## 
## Events                    175 
## Total time at risk        3088.3 
## Max. log. likelihood      -642.63 
## LR test statistic         4.04 
## Degrees of freedom        1 
## Overall p-value           0.0445388

Weibull model restricted to subjects followed to age 68 or older

phreg(Surv(entryYears, exitYears, cens) ~ sex,
      data=channing68, dist="weibull", param="survreg")

## Call:
## phreg(formula = Surv(entryYears, exitYears, cens) ~ sex, data = channing68, 
##     dist = "weibull", param = "survreg")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## sex 
##           Female    0.808     0         1           (reference)
##             Male    0.192     0.318     1.375     0.175     0.068 
## 
## log(scale)                    4.480               0.010     0.000 
## log(shape)                    2.256               0.104     0.000 
## 
## Events                    172 
## Total time at risk        3074.3 
## Max. log. likelihood      -629.16 
## LR test statistic         3.15 
## Degrees of freedom        1 
## Overall p-value           0.0761057

Likelihood and partial likelihood

Maximum likelihood estimation

Maximum likelihood estimation (MLE, ML) is a general approach to estimation that has become popular in many different areas of application.
There are two reasons for this popularity.
- ML produces estimators that have good large-sample properties. Provided that certain regularity conditions are met, ML estimators are consistent, asymptotically efficient, and asymptotically normal.
  - Consistency: the estimates converge in probability to the true values as the sample gets larger, implying that the estimates will be approximately unbiased in large samples
  - Asymptotically efficient: In large samples, the estimates will have standard errors that are approximately at least as small as those for any other estimation method
  - Asymptotically normal: the sampling distribution of the estimates will be approximately normal in large samples, implying that we can use the normal and chi-square distributions to compute confidence intervals and p-values.
- It is often straightforward to derive ML estimators when there are no other obvious possibilities.
  - One case that ML handles nicely is data with censored observations. (OLS leads to larger standard errors, and there is little available theory to justify the construction of hypothesis tests or confidence intervals.)
The basic principle of ML is to choose as estimates those values that will maximize the probability of observing what we have, in fact, observed.
- The first step is to write down a formula for the probability of the data as a function of the unknown parameters (i.e., constructing the likelihood function).
- The second step is to find the values of the unknown parameters that make the value of this formula as large as possible (i.e., maximization).
MLE
- Assume that we have $n$ independent individuals $(i=1,2,\dots,n)$.
- For each individual $i$, the data consist of three parts: $t_i$, $\delta_i$, and $x_i$, where
  - $t_i$ is the time of the event or the time of censoring;
  - $\delta_i$ is an indicator variable with a value of 1 if $t_i$ is uncensored or 0 if right censored; and
  - $x_i = [1\; x_{i1}\; \dots \;x_{ik}]$ is a vector of covariate values (the 1 is for the intercept); for simplicity, we treat them as fixed rather than random
- Suppose that all the observations are uncensored. Because we are assuming independence, it follows that the probability of the entire data is found by taking the product of the probabilities of the data for every individual. Because $t_i$ is assumed to be measured on a continuum, the probability that it will take on any specific value is 0.
- Instead, we represent the probability of each observation by the probability density function (p.d.f.), $f(t_i)$. Thus, the probability (or likelihood) of the data is given by the following expression, where $\prod$ indicates repeated multiplication: \[L=\prod_{i=1}^{n} f_i(t_i)\]
  - Note that $f_i$ is subscripted to indicate that each individual has a different p.d.f. that depends on the covariates.
- To proceed further, we need to substitute an expression for $f_i(t_i)$ that involves covariates and the unknown parameters.
  - Before we do this, however, let’s see how this likelihood is altered if we have censored cases.
  - If an individual is censored at time $t_i$, all we know is that the individual’s event time is greater than $t_i$. But the probability of an event time greater than $t_i$ is given by the survivor function $S(t)$ evaluated at time $t_i$. Now suppose that we have $r$ uncensored observations and $n-r$ censored observations.
  - If we arrange the data so that all the uncensored cases come first, we can write the likelihood as \[L=\prod_{i=1}^{r}f_i(t_i) \prod_{i=r+1}^{n} S_i(t_i)\]
  - where, again, we subscript the survivor function to indicate that it depends on the covariates. Using the censoring indicator $\delta$, we can equivalently write this as \[L=\prod_{i=1}^{n}[f_i(t_i)]^{\delta_i} [S_i(t_i)]^{1-\delta_i}\]
  - Here $\delta_i$ acts as a switch, turning the appropriate function on or off, depending on whether the observation is censored. As a result, we do not need to order the observations by censoring status. This last expression, which applies to all models with right-censored data, shows how censored and uncensored cases are combined in ML estimation.
- Once we choose a particular model, we can substitute appropriate expressions for the p.d.f. and the survivor function.
  - For example, the exponential model is $f_i(t_i) = \lambda_i e^{-\lambda_i t_i}$ and $S_i(t_i)=e^{-\lambda_i t_i}$, where $\lambda_i=\exp(-\beta x_i)$ and $\beta$ is a vector of coefficients.
  - Substituting, we get \[L=\prod_{i=1}^{n} [\lambda_i e^{-\lambda_i t_i}]^{\delta_i}[e^{-\lambda_i t_i}]^{1-\delta_i}=\prod_{i=1}^{n}\lambda_i^{\delta_i}e^{-\lambda_i t_i}\]
- Although this expression can be maximized directly, it is generally easier to work with the natural logarithm of the likelihood function because products get converted into sums and exponents become coefficients. Because the logarithm is an increasing function, whatever maximizes the logarithm also maximizes the original function.
- Taking the logarithm of the likelihood, we get \[\log L= \sum_{i=1}^{n} \delta_i \log \lambda_i - \sum_{i=1}^{n} \lambda_i t_i = -\beta \sum_{i=1}^{n} \delta_i x_i - \sum_{i=1}^{n} t_i e^{-\beta x_i}\]
- Now we are ready for step 2, finding values of $\beta$ that make this expression as large as possible. There are many different methods for maximizing functions like this. One well-known approach is to find the derivative of the function with respect to $\beta$, set the derivative equal to 0, and then solve for $\beta$.
  - Taking the derivative and setting it equal to 0 gives us \[\sum_{i=1}^{n} \delta_i x_i = \sum_{i=1}^{n} x_i t_i e^{-\beta x_i}\]
  - Because $x_i$ is a vector, this is actually a system of $k+1$ equations, one for each element of $\beta$. While these equations are not terribly complicated, the problem is that they involve nonlinear functions of $\beta$. Consequently, except in special cases (like a single dichotomous $x$ variable), there is no explicit solution.
  - Instead, we have to rely on iterative methods, which amount to successive approximations to the solution until the approximations converge to the correct value.
  - Again, there are many different methods for doing this (e.g., the Newton-Raphson algorithm). All give the same solution, but they differ in such factors as speed of convergence, sensitivity to starting values, and computational difficulty at each iteration.
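As a minimal sketch of the two steps above, the following R code writes the censored exponential log-likelihood as a function of $\beta$ and maximizes it numerically. The simulated data, the variable names, and the choice of optim with the BFGS method are assumptions made for illustration only. Because the single covariate is dichotomous, the numerical answer can be compared with the explicit solution.

```r
# Simulated right-censored exponential data (hypothetical values, for illustration)
set.seed(1)
n <- 200
x <- rbinom(n, 1, 0.5)                        # one dichotomous covariate
beta_true <- c(1.0, 0.5)                      # assumed intercept and slope
event_time <- rexp(n, rate = exp(-(beta_true[1] + beta_true[2] * x)))
cens_time <- runif(n, 0, 8)
time <- pmin(event_time, cens_time)           # observed time t_i
delta <- as.numeric(event_time <= cens_time)  # 1 = event, 0 = right censored
X <- cbind(1, x)                              # covariate vectors x_i with intercept

# Step 1: negative log-likelihood with lambda_i = exp(-beta x_i)
negloglik <- function(beta) {
  eta <- drop(X %*% beta)
  sum(delta * eta) + sum(time * exp(-eta))
}
# Gradient of the negative log-likelihood; it is zero at the score equations
neggrad <- function(beta) {
  eta <- drop(X %*% beta)
  drop(crossprod(X, delta - time * exp(-eta)))
}

# Step 2: iterative maximization (a quasi-Newton method here)
fit <- optim(c(0, 0), negloglik, gr = neggrad, method = "BFGS")

# Explicit solution for a dichotomous x: within each group,
# the estimated rate is events divided by total time at risk
lam0 <- sum(delta[x == 0]) / sum(time[x == 0])
lam1 <- sum(delta[x == 1]) / sum(time[x == 1])
closed_form <- c(-log(lam0), log(lam0) - log(lam1))

rbind(optim = fit$par, closed_form = closed_form)
```

The two rows should agree to several decimals; with continuous covariates only the iterative route is available.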

Partial likelihood estimation

\[h_i(t)=h_0(t)\exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})\]

The likelihood function for the proportional hazards model of this equation can be factored into two parts:

- one part depends on both $h_0(t)$ and $\beta=[\beta_1, \beta_2,\dots,\beta_k]^{'}$, the vector of coefficients
- the other part depends on $\beta$ alone

What partial likelihood does, in effect, is discard the first part and treat the second part - the partial likelihood function - as though it were an ordinary likelihood function. You get estimates by finding values of $\beta$ that maximize the partial likelihood.

Because there is some information about $\beta$ in the discarded portion of the likelihood function, the resulting estimates are not fully efficient. Their standard errors are larger than they would be if you used the entire likelihood function to obtain the estimates. In most cases, however, the loss of efficiency is quite small.
What we gain in return is robustness because the estimates have good properties regardless of the actual shape of the baseline hazard function.
To be specific, partial likelihood estimates still have two of the three standard properties of ML estimates: they are consistent and asymptotically normal. In other words, in large samples they are approximately unbiased and their sampling distribution is approximately normal.
Another interesting property of partial likelihood estimates is that they depend only on the ranks of the event times, not their numerical values. This implies that any monotonic transformation of the event times will leave the coefficient estimates unchanged.
- For example, we could add a constant to everyone’s event time, multiply the result by a constant, take the logarithm, and then take the square root - all without producing the slightest change in the coefficients or their standard errors.
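The rank property can be checked directly. The sketch below refits a Cox model on the lung data after applying the chain of increasing transformations just described; the constants 10 and 3 and the covariates age and sex are arbitrary choices for illustration.

```r
library(survival)

# Cox model on the original time scale
fit_raw <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Add a constant, multiply by a constant, take the log, then the square root.
# Every step is strictly increasing, so the ordering of the times is preserved.
lung2 <- transform(lung, time2 = sqrt(log(3 * (time + 10))))
fit_tr <- coxph(Surv(time2, status) ~ age + sex, data = lung2)

rbind(original = coef(fit_raw), transformed = coef(fit_tr))
all.equal(coef(fit_raw), coef(fit_tr))
```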

Now let’s take a closer look at how the partial likelihood works.

Using the same notation as MLE,
- Assume that we have $n$ independent individuals $(i=1,2,\dots,n)$.
- For each individual $i$, the data consist of three parts: $t_i$, $\delta_i$, and $x_i$, where
  - $t_i$ is the time of the event or the time of censoring;
  - $\delta_i$ is an indicator variable with a value of 1 if $t_i$ is uncensored or 0 if right censored; and
  - $x_i = [1\; x_{i1}\; \dots \;x_{ik}]$ is a vector of covariate values (the 1 is for the intercept) (for simplicity, we treat them as fixed rather than random)
we can write the partial likelihood as a product of the likelihoods for all the events that are observed. Thus if $J$ is the number of events, \[PL=\prod_{j=1}^{J}L_j\]
- where $L_j$ is the likelihood for the $j^{th}$ event.
Next we need to see how the individual $L_j$s are constructed. This is best explained by way of an example.
- First, we arrange data in ascending order by survival time, which is convenient for constructing the partial likelihood.
- Let’s say that the first death occurred to patient 1 in month 5. To construct the partial likelihood ($L_1$) for this event, we ask the following question: Given that a death occurred in month 5, what is the probability that it happened to patient 1 rather than to one of the other patients? The answer is the hazard for patient 1 at month 5 divided by the sum of the hazards for all the patients who were at risk of death in that same month. Suppose that all 45 patients were at risk of death at month 5, so the probability is \[L_1=\frac{h_1(5)}{h_1(5)+h_2(5)+\dots +h_{45}(5)}\]
- The second death occurred to patient 2 in month 8. Again we ask, given that a death occurred in month 8, what is the probability that it occurred to patient 2 rather than to one of the other patients at risk? Patient 1 is no longer at risk of death because she already died. So $L_2$ has the same form as $L_1$, but the hazard for patient 1 is removed from the denominator: \[L_2=\frac{h_2(8)}{h_2(8)+h_3(8)+\dots +h_{45}(8)}\]
- The set of all individuals who are at risk at a given point in time is often referred to as the risk set. At time 8, the risk set consists of patients 2 through 45, inclusive.
- We continue in this way for each successive death, deleting from the denominator the hazards for all those who have already died. Also deleted from the denominator are those who have been censored at an earlier point in time.
Until now, we made no assumptions about the form of the hazard function. Now, we invoke the proportional hazards model and substitute the expression for the hazard into the expression for $L_1$, \[L_1=\frac{h_0(5)\exp[\beta x_1]}{h_0(5)\exp[\beta x_1]+h_0(5)\exp[\beta x_2]+\dots +h_0(5)\exp[\beta x_{45}]}\] where $x_i$ is the value of $x$ for the $i^{th}$ patient.
- This leads to a considerable simplification because the unspecified function $h_0(5)$ is common to every term in the expression. Canceling, we get \[L_1=\frac{\exp[\beta x_1]}{\exp[\beta x_1]+\exp[\beta x_2]+\dots +\exp[\beta x_{45}]}\]
- It is this cancellation of the $h_0$s that makes it possible to estimate the $\beta$ coefficients without having to specify the baseline hazard function.
We can also see that the partial likelihood depends only on the order of the event times, not on their exact values.
- Although the first death occurred in month 5, $L_1$ would be exactly the same if it had occurred at any time from 0 up to (but not including) 8, the month of the second event.
- Similarly, $L_2$ would have been the same if the second death had occurred any time greater than 5 and less than 10 (the month of the third death).
Therefore, a general expression for the partial likelihood for data with time-invariant covariates from a proportional hazards model is \[PL=\prod_{i=1}^{n}\left[\frac{\exp(\beta x_i)}{\sum_{j=1}^{n} Y_{ij}\exp(\beta x_j)}\right]^{\delta_i}\]
- where $Y_{ij}=1$ if $t_j \ge t_i$; and $Y_{ij}=0$ if $t_j < t_i$
- The $Y$s are just a convenient mechanism for excluding from the denominator those individuals who have already experienced the event and are, thus, not part of the risk set.
- Although this expression has the product taken over all individuals rather than over all events, the terms corresponding to censored observations are effectively excluded because $\delta_i=0$ for those cases.
- This expression is not valid for tied event times, but it does allow for ties between one event time and one or more censoring times.
Once the partial likelihood is constructed, we can maximize it with respect to $\beta$ just like an ordinary likelihood. It is convenient to work with its logarithm, which is \[\log PL=\sum_{i=1}^{n} \delta_i \left[\beta x_i - \log \sum_{j=1}^{n} Y_{ij}\exp(\beta x_j)\right]\]
Most partial likelihood programs use some version of the Newton-Raphson algorithm to maximize this function with respect to $\beta$.
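The log partial likelihood can be coded almost verbatim. The sketch below uses a small made-up dataset with no tied event times (all values are hypothetical), evaluates $\log PL$ with the risk-set indicator $Y_{ij}$ written as time >= time[i], and compares the maximizer with coxph. The one-dimensional optimize call stands in for Newton-Raphson and is an assumption of this example.

```r
library(survival)

# Hypothetical data: no tied event times, one dichotomous covariate
time  <- c(5, 8, 10, 12, 15, 18, 21, 25)
delta <- c(1, 1,  1,  0,  1,  0,  1,  1)   # 1 = event, 0 = censored
x     <- c(1, 0,  1,  1,  0,  1,  0,  0)

# log PL = sum over events of [beta x_i - log(sum over risk set of exp(beta x_j))]
log_pl <- function(beta) {
  sum(sapply(which(delta == 1), function(i) {
    at_risk <- time >= time[i]             # Y_ij = 1 for subjects still at risk
    beta * x[i] - log(sum(exp(beta * x[at_risk])))
  }))
}

by_hand <- optimize(log_pl, interval = c(-5, 5), maximum = TRUE, tol = 1e-8)
fit <- coxph(Surv(time, delta) ~ x)

c(by_hand = by_hand$maximum, coxph = unname(coef(fit)))
c(by_hand = log_pl(unname(coef(fit))), coxph = fit$loglik[2])
```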
To account for time-varying covariates¹⁷, we need to modify the partial likelihood function. Essentially, at each failure time, there are a certain number of patients at risk, and one fails. However, the contributions of each subject can change from one failure time to the next. The hazard function is given by $h(t)=h_0 (t)e^{z_k(t_i)\beta}$, where $z_k(t_i)$ is the value of the time-varying covariate for the $k^{th}$ subject at time $t_i$.

Sample of six patients from the Stanford heart transplant dataset

The partial likelihood is \[L(\beta)=\prod_{i=1}^{D} \frac{\psi_{ii}}{\sum\limits_{k \in R_i} \psi_{ki}}, \;\;\; \text{where}\;\; \psi_{ki}=e^{z_k(t_i)\beta} \] Here $D$ is the number of failure times and $R_i$ is the risk set at failure time $t_i$.
When the covariates are fixed at time 0, so that $z_k(t_i)=z_k$ for all failure times $t_i$, the denominator at each failure time can be computed by successively deleting, as time passes, the value of $\psi_i$ for the subject (or subjects) that failed at that time.
With a time dependent covariate, by contrast, the entire denominator has to be recalculated at each failure time, since the values of the covariates for each subject may change from one failure time to the next.
- For example, patient #2 is the first to fail, at $t=5$. At this time, all six patients are at risk, but only one, patient #95, has had a transplant. So the denominator for the first factor is $5+e^{\beta}$, and the numerator is 1, since it was a non-transplant patient who died.
- Patient #12 is the next to die, at time $t=7$, and none of the patients in the risk set have changed their covariate value.
- But when the third death occurs (patient #95, at $t=15$), one of the other patients (#10) has switched from being a non-transplant patient to one who has had a transplant. There are now four patients at risk, of which two (#10 and #95) are transplant patients. The denominator is thus $2+2e^{\beta}$ and the numerator is $e^{\beta}$, since it was a transplant patient that died.
- Therefore, the full partial likelihood in this example is \[L(\beta)=\frac{1}{5+e^{\beta}}\cdot\frac{1}{4+e^{\beta}}\cdot\frac{e^{\beta}}{2+2e^{\beta}}\cdot\frac{1}{2+e^{\beta}}\cdot\frac{e^{\beta}}{1+e^{\beta}}\cdot\frac{e^{\beta}}{e^{\beta}}\]
Essentially, this approach divides the time data for patients who had a heart transplant into two time periods, one before the transplant and one after.
- For example, patient #10 was a non-transplant patient from entry until day 11. Since that patient received a transplant at that time, the future for that patient, had he or she not received a transplant, is unknown. Thus, we censor that portion of the patient’s life experience at $t=11$.
- Following the transplant, we start a new record for patient #10. This second piece of the record is left-truncated (i.e., patient’s survival experience with the transplant starts at that point) at time $t=11$, and a death is recorded at time $t=57$.
- For the first part of this patient’s experience, the ‘start’ time is 0, and the ‘stop’ time is 11, which is recorded as a censored observation. For the second piece of that patient’s experience, the start time is 11 and the stop time is 57.
- Thus, to put the data in start-stop format, the record of every patient with no transplant is carried forward as is, whereas the record of each patient who received a transplant is split into pre-transplant and post-transplant records.
- The “tmerge” function in the R survival package simplifies this conversion.
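The start-stop idea can be checked against the six-factor partial likelihood above. In the sketch below the start-stop records are typed in by hand rather than produced by tmerge. The rows for the two patients not described in the text (labeled #5 and #28 here) and the transplant day of patient #95 are hypothetical values, chosen only so that the risk sets reproduce the six factors of $L(\beta)$.

```r
library(survival)

# Start-stop records: transplant patients contribute two rows each.
# Patients #5 and #28 and the day-1 transplant of #95 are assumed values.
heart6 <- data.frame(
  id         = c(2,  5, 10, 10, 12, 28, 28, 95, 95),
  start      = c(0,  0,  0, 11,  0,  0, 70,  0,  1),
  stop       = c(5, 17, 11, 57,  7, 70, 71,  1, 15),
  death      = c(1,  1,  0,  1,  1,  0,  1,  0,  1),
  transplant = c(0,  0,  0,  1,  0,  0,  1,  0,  1)
)

# Log of the six-factor partial likelihood, one term per death,
# ordered by death time (5, 7, 15, 17, 57, 71)
died_tx <- c(0, 0, 1, 0, 1, 1)   # 1 if the patient who died had a transplant
n_non   <- c(5, 4, 2, 2, 1, 0)   # non-transplant patients in the risk set
n_tx    <- c(1, 1, 2, 1, 1, 1)   # transplant patients in the risk set
log_pl <- function(beta) sum(died_tx * beta - log(n_non + n_tx * exp(beta)))

by_hand <- optimize(log_pl, interval = c(-5, 5), maximum = TRUE, tol = 1e-8)
fit <- coxph(Surv(start, stop, death) ~ transplant, data = heart6)

c(by_hand = by_hand$maximum, coxph = unname(coef(fit)))
```

Both approaches maximize the same function, so the two estimates should coincide up to numerical tolerance.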

Start-stop counting process

Dependence among the observations

A common reaction to the methods described here is that there must be something wrong. In general, when multiple observations are created for a single individual, it’s reasonable to suppose that those observations are not independent, thereby violating a basic assumption used to construct the likelihood function. The consequence of dependence is usually biased standard error estimates and inflated test statistics. Even worse, there are different numbers of observations for different individuals, so some appear to get more weight than others.
While concern about dependence is often legitimate, it is not applicable here. In this case, the creation of multiple observations is not an ad-hoc method; rather, it follows directly from factoring the likelihood function for the data. The basic idea is this: in its original form, the likelihood for data with no censoring can be written as a product of probabilities over all $n$ observations, as follows: \[\prod_{i=1}^{n} P(T_i = t_i)\] where $T_i$ is the random variable and $t_i$ is the particular value observed for individual $i$. Each of the probabilities can be factored in the following way. If $t_i=5$, we have \[P(T_i=5)=P_{i5}(1-P_{i4})(1-P_{i3})(1-P_{i2})(1-P_{i1})\] where $P_{it}$ is the conditional probability of an event at time $t$, given that an event has not already occurred. This factorization follows directly from the definition of conditional probability. Each of the five terms behaves as if it came from a distinct, independent observation.
This lack of dependency holds only when no individual has more than one event. When events are repeatable, there is a real problem of dependence. But the problem is neither more nor less serious than it is for other methods of survival analysis.
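A small numerical sketch of this factorization, using made-up conditional probabilities $P_{it}$ for a single individual: the probability of an event at time 5 equals the product of five conditional terms, and its logarithm equals the sum of five Bernoulli log-likelihood contributions, one per created observation.

```r
# Hypothetical discrete-time hazards P_it for one individual, t = 1, ..., 5
p <- c(0.10, 0.15, 0.20, 0.25, 0.30)

# P(T = t) = P_t * product over s < t of (1 - P_s)
surv_before <- c(1, cumprod(1 - p))[seq_along(p)]  # no event before time t
prob_event  <- p * surv_before

prob_event[5]                       # P(T = 5)

# The same quantity as five separate binary observations (0, 0, 0, 0, 1)
loglik_rows <- dbinom(c(0, 0, 0, 0, 1), size = 1, prob = p, log = TRUE)
exp(sum(loglik_rows))

# Event at some time up to 5, or survival beyond 5: probabilities sum to 1
sum(prob_event) + prod(1 - p)
```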

Tied or Discrete Data Analysis¹⁸

Example: Recidivism in the U.S.¹⁹

The dataset considered here is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months. The data are available in binary format from the Stata website and consist of 1445 observations on 18 variables.

- workprg: an indicator of participation in a work program
- priors: the number of previous convictions
- tserved: the time served, rounded to months
- felon: an indicator of felony sentences
- alcohol: an indicator of alcohol problems
- drugs: an indicator of drug use history
- black: an indicator for African Americans
- married: an indicator if married when incarcerated
- educ: the number of years of schooling
- age: age in months
- durat: time in months until return to prison or end of follow-up
- cens: the censoring indicator, coded 1 if the observation was censored (i.e., the individual had not returned to prison)

Data management

library(survival)
library(dplyr)
library(foreign)
recid <- read.dta("https://www.stata.com/data/jwooldridge/eacsap/recid.dta")

## Warning in read.dta("https://www.stata.com/data/jwooldridge/eacsap/recid.dta"):
## cannot read factor labels from Stata 5 files

head(recid)

##   black alcohol drugs super married felon workprg property person priors educ
## 1     0       1     0     1       1     0       1        0      0      0    7
## 2     1       0     0     1       0     1       1        1      0      0   12
## 3     0       0     0     0       0     0       1        1      0      0    9
## 4     0       0     1     1       0     1       1        1      0      2    9
## 5     0       0     1     1       0     0       0        0      0      0    9
## 6     1       0     0     1       0     0       1        0      0      1   12
##   rules age tserved follow durat cens   ldurat
## 1     2 441      30     72    72    1 4.276666
## 2     0 307      19     75    75    1 4.317488
## 3     5 262      27     81     9    0 2.197225
## 4     3 253      38     76    25    0 3.218876
## 5     0 244       4     81    81    1 4.394449
## 6     0 277      13     79    79    1 4.369448

recid$fail <- 1 - recid$cens

recidx <- survSplit(recid, cut = seq(12, 60, 12), 
                    start = "t0", end = "durat", 
                    event = "fail", 
                    episode = "interval")
labels <- paste("(",seq(0,60,12),",",c(seq(12,60,12),81), "]",sep="")

recidx <- mutate(recidx, exposure = durat - t0, 
                 interval = factor(interval + 1, labels = labels))

mf <- Surv(durat, fail) ~ workprg + priors + tserved + felon + 
  alcohol + drugs + black + married + educ + age

The treatment of ties

Breslow’s method, the standard formula for partial likelihood estimation with tied data, is often a poor approximation when there are many ties.
This problem was remedied by two exact methods, one that assumed that ties result from imprecise measurement and another that assumed that events really occur at the same (discrete) time.
Efron’s method provides a good approximation to the exact methods.

The Exact method

Let’s begin with the exact method because its underlying model is probably more plausible for most applications. Since events can occur at any point in time, it’s reasonable to suppose that ties are merely the result of imprecise measurement of time and that there is a true but unknown time ordering for the tied events.
If we knew that ordering, we could construct the partial likelihood in the usual way. In the absence of any knowledge of that ordering, however, we have to consider all the possibilities.
For example, with five tied events, there are $5!=120$ different possible orderings.
- Let’s denote each of those possibilities by $A_i$, where $i=1, 2, \dots, 120$.
- What we want is the probability of the union of those possibilities, that is, $P$($A_1$ or $A_2$ or $\dots$ or $A_{120}$).
- Now a fundamental law of probability theory is that the probability of the union of a set of mutually exclusive events is just the sum of the probabilities for each of the events.
- Therefore, we can write, for example, the likelihood $L_8$ for five events tied at time 8 as \[L_8 = \sum_{i=1}^{120} P(A_i)\]
- Each of these 120 probabilities is just a standard partial likelihood.
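- Suppose, for example, that we arbitrarily label the five events at time 8 with the numbers 8, 9, 10, 11, and 12, and suppose further that $A_1$ denotes the ordering $\{8, 9, 10, 11, 12\}$. Writing $S=\sum_{j \in R_8} e^{\beta x_j}$ for the sum over everyone at risk at time 8, we then have \[P(A_1)=\frac{e^{\beta x_8}}{S}\cdot\frac{e^{\beta x_9}}{S-e^{\beta x_8}}\cdot\frac{e^{\beta x_{10}}}{S-e^{\beta x_8}-e^{\beta x_9}}\cdots\frac{e^{\beta x_{12}}}{S-e^{\beta x_8}-e^{\beta x_9}-e^{\beta x_{10}}-e^{\beta x_{11}}}\]
- On the other hand, if $A_2$ denotes the ordering $\{9, 8, 10, 11, 12\}$, we have \[P(A_2)=\frac{e^{\beta x_9}}{S}\cdot\frac{e^{\beta x_8}}{S-e^{\beta x_9}}\cdot\frac{e^{\beta x_{10}}}{S-e^{\beta x_8}-e^{\beta x_9}}\cdots\frac{e^{\beta x_{12}}}{S-e^{\beta x_8}-e^{\beta x_9}-e^{\beta x_{10}}-e^{\beta x_{11}}}\]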
- We continue in this way for the rest of the combinations.
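The sum over orderings can be sketched in a few lines of R for a small number of ties. The helper names perms and exact_tie_lik, the covariate values, and the choice of three tied events among eight subjects at risk are all hypothetical; brute-force enumeration is feasible only for small tied sets, since the number of orderings grows factorially.

```r
# All orderings of a vector, one per row (d! rows, so small d only)
perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(k) cbind(v[k], perms(v[-k]))))
}

# Exact likelihood contribution of one tied event time:
# the sum of the ordinary partial likelihoods over every possible ordering
exact_tie_lik <- function(beta, x_risk, tied) {
  w <- exp(beta * x_risk)             # exp(beta x_j) for everyone at risk
  orderings <- perms(tied)
  sum(apply(orderings, 1, function(ord) {
    # denominator shrinks as each earlier event in this ordering leaves the risk set
    denom <- sum(w) - c(0, cumsum(w[ord]))[seq_along(ord)]
    prod(w[ord] / denom)
  }))
}

x_risk <- c(1, 0, 1, 0, 0, 1, 0, 1)   # hypothetical covariate values, 8 at risk
tied   <- c(1, 2, 3)                  # positions of the 3 subjects with tied events

exact_tie_lik(0.5, x_risk, tied)
# With beta = 0 every subject is exchangeable, so the result is 1 / choose(8, 3)
c(exact_tie_lik(0, x_risk, tied), 1 / choose(8, 3))
```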

The Breslow and Efron method

Early recognition of these computational difficulties in the exact method led to the development of approximations.
If the exact methods are too time-consuming, use the Efron approximation. It is nearly always better than the Breslow method, with virtually no increase in computing time.
Farewell and Prentice (1980) showed that the Breslow approximation deteriorates as the number of ties at a particular point in time becomes a large proportion of the number of cases at risk.
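The text does not write out the two approximations, so the sketch below uses their usual textbook forms as an assumption: for $d$ tied events, Breslow divides the product of the tied subjects' terms by the full risk-set sum raised to the power $d$, while Efron subtracts a fraction $m/d$ of the tied subjects' sum from the $m$-th denominator. Both omit the constant factor $d!$ that appears when summing over orderings, which does not affect the maximizing $\beta$. The function name tie_approx and the data are hypothetical.

```r
# Breslow and Efron contributions for a single tied event time
tie_approx <- function(beta, x_risk, tied) {
  w <- exp(beta * x_risk)
  d <- length(tied)
  S <- sum(w)                 # sum over the whole risk set
  S_tied <- sum(w[tied])      # sum over the tied subjects only
  num <- prod(w[tied])
  c(breslow = num / S^d,
    efron   = num / prod(S - (0:(d - 1)) / d * S_tied))
}

x_risk <- c(1, 0, 1, 0, 0, 1, 0, 1)   # hypothetical covariate values, 8 at risk
tied   <- c(1, 2, 3)                  # 3 tied events

tie_approx(0.5, x_risk, tied)
# With beta = 0: Breslow gives 1/8^3, Efron gives 1/(8 * 7 * 6)
tie_approx(0, x_risk, tied)
```

The Breslow denominator never shrinks within the tied set, which is why it drifts further from the exact value as the tied events make up a larger share of the risk set.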

Comparisons

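Let us compare all available methods of handling ties. As is often the case, the Efron method comes closer to the exact partial likelihood estimate with substantially less computational effort, although in this application all methods yield very similar results. In R’s coxph, ties = “exact” requests the exact partial likelihood, which corresponds to the discrete method described in the next section rather than to the exact method for imprecisely measured continuous times. Note also that recidx holds one row per episode, so Surv(durat, fail) treats every episode as starting at time 0; the counting-process form Surv(t0, durat, fail), or the unsplit recid data, would reproduce the subject-level risk sets.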

cox_efron <- coxph(mf, data = recidx, ties="efron")
cox_breslow <- coxph(mf, data = recidx, ties="breslow")
cox_exact <- coxph(mf, data = recidx, ties="exact")

data.frame(exactp = coef(cox_exact),
           efron = coef(cox_efron), 
           breslow = coef(cox_breslow))

##               exactp        efron      breslow
## workprg  0.111590748  0.111560134  0.111337070
## priors   0.096271298  0.095985297  0.095859808
## tserved  0.015595528  0.015558389  0.015519980
## felon   -0.334451818 -0.333671514 -0.333261160
## alcohol  0.478596659  0.477865298  0.477164506
## drugs    0.327465984  0.327094665  0.326467040
## black    0.504462343  0.503957605  0.503016140
## married -0.153975705 -0.153542571 -0.153523788
## educ    -0.024847512 -0.024770080 -0.024746660
## age     -0.004199204 -0.004195258 -0.004187421

The discrete method

The discrete method is also an exact method but one based on a fundamentally different model.
In fact, this is NOT a proportional hazards model at all. The model does fall within the framework of Cox regression, however, because it was proposed by Cox in his original 1972 paper and because the estimation method is a form of partial likelihood.
Unlike the exact model, which assumes that ties are merely the result of imprecise measurement of time, the discrete model assumes that time is really discrete.
When two or more events appear to happen at the same time, there is no underlying ordering - they really happened at the same time.
Cox’s model for discrete-time data can be described as follows. The time variable $t$ can only take on integer values. Let $P_{it}$ be the conditional probability that individual $i$ has an event at time $t$, given that an event has not already occurred to that individual.
This probability is sometimes called the discrete-time hazard. The model says that $P_{it}$ is related to the covariates by a logistic regression equation: \[\log\left[\frac{P_{it}}{1-P_{it}}\right]=\beta_{0t} + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}\]
The expression on the left side of the equation is the logit or log-odds of $P_{it}$. On the right side, we have a linear function of the covariates, plus a term $\beta_{0t}$ that represents a set of constants that can vary arbitrarily from one time point to another.
This model can be described as a proportional odds model. The odds that individual $i$ has an event at time $t$ (given that $i$ did not already have an event) is $O_{it}=\frac{P_{it}}{1-P_{it}}$.
The model implies that the ratio of the odds for any two individuals $\frac{O_{it}}{O_{jt}}$ does not depend on time (although it may vary with covariates)
Estimation with partial likelihood: $PL=\prod_{j=1}^{J} L_j$, where $L_j$ is the partial likelihood of the $j^{th}$ event time.
- This approach can be very cumbersome. Let’s say that at time 1, 22 people had events out of 100 people who were at risk. To get $L_1$, we ask the question: given that 22 events occurred, what is the probability that they occurred to these particular 22 people rather than to some different set of 22 people from among the 100 at risk? How many different ways are there of selecting 22 people out of a set of 100? It’s $\binom{100}{22} \approx 7.3321 \times 10 ^{21}$.
- In general, for a given set $q$, let $\psi_q$ be the product of the odds for all the individuals in that set. Thus, if the individuals who actually experienced events are labeled $i=1$ to $n$, we have \[\psi_1 = \prod_{i=1}^{n} O_{i1}\]
- We can then write \[L_1 = \frac{\psi_1}{\psi_1+\psi_2+ \cdots + \psi_q}\]
- This looks like a simple expression, but there are trillions of terms being summed in the denominator. Fortunately, there is a recursive algorithm that makes it practical, even with substantial numbers of ties.
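For a small risk set the denominator can be enumerated directly, which makes the structure of $L_1$ concrete. In this sketch the function name discrete_tie_lik and the data are hypothetical, and combn lists every subset of the required size. The odds are written as $\exp(\beta x)$ because the time-specific constant is common to every subset and cancels from the ratio.

```r
# Number of ways to choose 22 people out of 100
choose(100, 22)

# Discrete-method contribution for one event time by brute-force enumeration
discrete_tie_lik <- function(beta, x_risk, tied) {
  odds <- exp(beta * x_risk)                    # odds up to a common factor
  subsets <- combn(length(x_risk), length(tied))
  psi <- apply(subsets, 2, function(s) prod(odds[s]))  # one psi per candidate set
  prod(odds[tied]) / sum(psi)
}

x_risk <- c(1, 0, 1, 0, 0, 1, 0, 1)   # hypothetical covariate values, 8 at risk
tied   <- c(1, 2, 3)                  # the 3 subjects who had events

discrete_tie_lik(0.5, x_risk, tied)
# With beta = 0 all sets are equally likely: 1 / choose(8, 3)
c(discrete_tie_lik(0, x_risk, tied), 1 / choose(8, 3))
```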
Estimation with maximum likelihood method
- The basic approach
  - Each individual’s survival history is broken down into a set of discrete time units that are treated as distinct observations
  - After pooling these observations, the next step is to estimate a binary regression model predicting whether an event did or did not occur in each time unit
  - Covariates are allowed to vary over time from one time unit to another
- This approach has two versions depending on the form of the binary regression model:
  - By specifying a logit link, we get estimates of the discrete-time proportional odds model (this model is identical to the model estimated when we specify “ties=discrete” in SAS PROC PHREG)
  - By specifying a complementary log-log link, we get estimates of an underlying proportional hazards model in continuous time (this is identical to the model that is estimated when we specify the “ties=exact” option in SAS PROC PHREG)
- Advantages
  - This method does not rely on approximations
  - The computations are manageable even with large data sets
  - This method is particularly good at handling large numbers of time-dependent covariates
  - This method makes it easy to test hypotheses about the dependence of the hazard on time
- This approach is similar to those of the piecewise exponential model and the counting process. The main difference is that those methods assume that we know the exact time of the event within a given interval. By contrast, the discrete model presumes that we know only that an event occurred within a given interval.

Continuous and Discrete Models²⁰

Let’s have another look at the recidivism data. We will split duration into single years with an open-ended category at 5+ and fit a piecewise exponential model with the same covariates as Wooldridge.

We will then treat the data as discrete, assuming that all we know is that recidivism occurred somewhere in the year. We will fit a binary data model with a logit link, which corresponds to the discrete time model, and one with a complementary log-log link, which corresponds to a grouped continuous time model.

A Piecewise Exponential Model²¹

This model is equivalent to a Poisson regression. A Poisson regression for a positive mean is a GLM that assumes a Poisson distribution for $Y$ and uses the log link function. GLMs for the Poisson mean can use the identity link, but it is more common to model the log of the mean. Like the linear predictor $\beta_0 + \beta_1 x_1$, the log of the mean can take any real-number value.

\[\log \mu = \beta_0 + \beta_1 x_1\]

The mean satisfies the exponential relationship \[\mu = \exp(\beta_0 + \beta_1 x_1) = \exp(\beta_0)\exp(\beta_1 x_1)\]
A one-unit increase in $x$ has a multiplicative impact of $\exp(\beta_1)$ on $\mu$: the mean of $Y$ at $x+1$ equals the mean of $Y$ at $x$ multiplied by $\exp(\beta_1)$. If $\beta_1=0$, then $\exp(\beta_1)=\exp(0)=1$ and the multiplicative factor is 1, so the mean of $Y$ does not change as $x$ changes. If $\beta_1 >0$, then $\exp(\beta_1)>1$, and the mean of $Y$ increases as $x$ increases. If $\beta_1 <0$, then $\exp(\beta_1)<1$, and the mean of $Y$ decreases as $x$ increases.
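The equivalence with Poisson regression can be illustrated in the simplest case of a single interval, where the piecewise exponential model reduces to the exponential model. The sketch below uses simulated data (all values hypothetical) and fits a Poisson GLM to the event indicator with log exposure as an offset, then compares it with survreg. Because survreg models the log of survival time rather than the log hazard, its coefficients carry the opposite sign.

```r
library(survival)

# Simulated right-censored exponential data (hypothetical values)
set.seed(2)
n <- 300
x <- rnorm(n)
event_time <- rexp(n, rate = exp(-1 + 0.5 * x))   # assumed hazard exp(-1 + 0.5 x)
cens_time <- runif(n, 0, 6)
time <- pmin(event_time, cens_time)
fail <- as.numeric(event_time <= cens_time)

# Poisson GLM: event indicator as the count, log exposure time as the offset
pois <- glm(fail ~ x + offset(log(time)), family = poisson)

# Exponential survival model on the log-time scale
aft <- survreg(Surv(time, fail) ~ x, dist = "exponential")

cbind(poisson = coef(pois), minus_survreg = -coef(aft))
exp(coef(pois)["x"])   # multiplicative effect of a one-unit increase in x on the hazard
```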
Overdispersion
- Count data often vary more than we would expect if the response distribution truly were Poisson. A common cause of overdispersion is heterogeneity among subjects. If the variance equals the mean when all relevant variables are controlled, it exceeds the mean when only a subset of those variables is controlled.
- Overdispersion is not an issue in ordinary regression models assuming normally distributed $Y$, because the normal has a separate parameter from the mean (i.e., the variance, $\sigma^2$) to describe variability. However, the variance equals the mean in the Poisson distribution, thus overdispersion is common in applying Poisson GLMs to counts.
Negative binomial regression
- When the Poisson means follow a gamma distribution, unconditionally the distribution is the negative binomial.
- The negative binomial is another distribution that is concentrated on the nonnegative integers. Unlike the Poisson, it has an additional parameter such that the variance can exceed the mean. \[E(Y)=\mu, \;\;\; Var(Y)=\mu + D\mu^2\]
- The index, $D$, which is nonnegative, is called a dispersion parameter. The negative binomial distribution arises as a type of mixture of Poisson distributions. Greater heterogeneity in the Poisson means results in a larger value of $D$. As $D \to 0$, $Var(Y) \to \mu$ and the negative binomial distribution converges to the Poisson.
- Negative binomial GLMs for counts express $\mu$ in terms of explanatory variables. Most common is the log link, as in Poisson loglinear models, but sometimes the identity link is adequate. It is common to assume that the dispersion parameter $D$ takes the same value at all predictor values, much as regression models for a normal response take the variance parameter to be constant.
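The gamma mixture behind the negative binomial can be checked by simulation. In this sketch the values of $\mu$ and $D$ are arbitrary choices for illustration: Poisson means are drawn from a gamma distribution with mean $\mu$ and variance $D\mu^2$, and the resulting counts should have variance close to $\mu + D\mu^2$.

```r
set.seed(3)
n  <- 1e5
mu <- 4      # mean (assumed value)
D  <- 0.5    # dispersion parameter (assumed value)

# Heterogeneous Poisson means: gamma with E = mu and Var = D * mu^2
lambda <- rgamma(n, shape = 1 / D, scale = D * mu)
y <- rpois(n, lambda)

c(mean = mean(y), variance = var(y), theory = mu + D * mu^2)
```

The sample variance is about three times the sample mean here, which a plain Poisson model could not reproduce.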

mmf <- fail ~  interval + workprg + priors + tserved + 
  felon + alcohol + drugs + black + married + educ + age

pwe <- glm(mmf, offset = log(exposure), data = recidx, family = poisson)
coef(summary(pwe))

##                     Estimate  Std. Error     z value     Pr(>|z|)
## (Intercept)     -3.830127469 0.280267334 -13.6659789 1.621090e-42
## interval(12,24]  0.036531989 0.109361775   0.3340471 7.383440e-01
## interval(24,36] -0.373815644 0.129611909  -2.8841150 3.925154e-03
## interval(36,48] -0.811543632 0.156401452  -5.1888497 2.115971e-07
## interval(48,60] -0.938231113 0.168321156  -5.5740534 2.488794e-08
## interval(60,81] -1.547177936 0.203348918  -7.6084886 2.773196e-14
## workprg          0.083829106 0.090794162   0.9232874 3.558575e-01
## priors           0.087245826 0.013473463   6.4753825 9.457203e-11
## tserved          0.013008862 0.001685901   7.7162667 1.197865e-14
## felon           -0.283925203 0.106148770  -2.6747856 7.477705e-03
## alcohol          0.432442493 0.105721133   4.0904073 4.306163e-05
## drugs            0.274714115 0.097863462   2.8071162 4.998720e-03
## black            0.433555955 0.088362277   4.9065729 9.268154e-07
## married         -0.154047742 0.109211869  -1.4105403 1.583802e-01
## educ            -0.021416177 0.019444026  -1.1014271 2.707108e-01
## age             -0.003580003 0.000522249  -6.8549738 7.132557e-12

A Logit Model

For a discrete-time survival analysis we have to make sure we only include intervals with complete exposure, where we can classify the outcome as failure or survival. The convicts were released between July 1, 1977 and June 30, 1978 and the data were collected in April 1984, so the length of observation ranges between 70 and 81 months. We therefore restrict our attention to 5 years or 60 months. (We could go up to 6 years or 72 months for some convicts, but unfortunately we don’t have the date of release, so we can’t identify these cases and must censor everyone at 60.)

recidx <- filter(recidx, interval != "(60,81]")
logit <- glm(mmf, data = recidx, family = binomial)  # no offset
coef(summary(logit))

##                     Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept)     -1.140802599 0.3084159337 -3.6989094 2.165279e-04
## interval(12,24]  0.030528163 0.1193582701  0.2557692 7.981291e-01
## interval(24,36] -0.413140262 0.1384064532 -2.9849783 2.835984e-03
## interval(36,48] -0.864148699 0.1639957690 -5.2693353 1.369186e-07
## interval(48,60] -0.993662524 0.1756321916 -5.6576332 1.534747e-08
## workprg          0.110988653 0.1003087410  1.1064704 2.685230e-01
## priors           0.099292063 0.0164653717  6.0303566 1.635983e-09
## tserved          0.014922136 0.0021429307  6.9634244 3.320994e-12
## felon           -0.319662098 0.1178116529 -2.7133318 6.661038e-03
## alcohol          0.472499810 0.1184176515  3.9901130 6.604183e-05
## drugs            0.316729032 0.1086092071  2.9162264 3.542934e-03
## black            0.458027506 0.0973977193  4.7026512 2.568049e-06
## married         -0.204807338 0.1204592720 -1.7002206 8.908944e-02
## educ            -0.026725931 0.0215052145 -1.2427651 2.139544e-01
## age             -0.004023087 0.0005840427 -6.8883431 5.644594e-12
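
The discrete-time models rely on a person-period layout with one row per subject per interval. One way to build such a layout is `survSplit()` from the survival package. The sketch below uses a hypothetical three-subject data frame `toy` rather than the recidivism data; only the column names `durat` and `fail` and the 12-month cut points mirror the analysis above, and how `recidx` itself was constructed is not shown in this section.

```r
library(survival)

# Hypothetical data: duration in months and failure indicator
toy <- data.frame(id = 1:3, durat = c(8, 30, 70), fail = c(1, 1, 0))

# Split each subject's follow-up at 12-month boundaries;
# `interval` numbers the resulting episodes 1, 2, ...
pp <- survSplit(Surv(durat, fail) ~ ., data = toy,
                cut = c(12, 24, 36, 48, 60), episode = "interval")

# Months at risk within each interval (the Poisson offset uses its log)
pp$exposure <- pp$durat - pp$tstart

# Keep only the first five intervals (months 0 to 60) for a discrete-time model
pp60 <- pp[pp$interval <= 5, ]
print(pp60)
```

In this layout the failure indicator is 1 only in the interval where the event occurs, so a binomial GLM on the rows models the conditional probability of failure in an interval given survival to its start.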

A Complementary Log-log Model

Finally, we use a complementary log-log link:

cloglog <- glm(mmf, data = recidx, family = binomial(link = cloglog))
coef(summary(cloglog))

##                     Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept)     -1.238795113 0.2893607427 -4.2811444 1.859347e-05
## interval(12,24]  0.021613951 0.1095604758  0.1972787 8.436094e-01
## interval(24,36] -0.392613793 0.1297681301 -3.0255024 2.482204e-03
## interval(36,48] -0.824996440 0.1566132100 -5.2677321 1.381194e-07
## interval(48,60] -0.948338328 0.1684247392 -5.6306356 1.795467e-08
## workprg          0.104466422 0.0934228228  1.1182109 2.634769e-01
## priors           0.088706984 0.0145113085  6.1129556 9.780261e-10
## tserved          0.013266906 0.0018142064  7.3127875 2.616567e-13
## felon           -0.288542238 0.1096491770 -2.6315039 8.500789e-03
## alcohol          0.439780479 0.1090998881  4.0309893 5.554258e-05
## drugs            0.299102966 0.1003895869  2.9794222 2.887925e-03
## black            0.427210098 0.0910947168  4.6897352 2.735589e-06
## married         -0.183040394 0.1136073451 -1.6111669 1.071433e-01
## educ            -0.023334468 0.0202349457 -1.1531767 2.488379e-01
## age             -0.003851008 0.0005486362 -7.0192372 2.230827e-12

Comparison of Estimates

cbind(coef(pwe)[-6], coef(cloglog), coef(logit))

##                         [,1]         [,2]         [,3]
## (Intercept)     -3.830127469 -1.238795113 -1.140802599
## interval(12,24]  0.036531989  0.021613951  0.030528163
## interval(24,36] -0.373815644 -0.392613793 -0.413140262
## interval(36,48] -0.811543632 -0.824996440 -0.864148699
## interval(48,60] -0.938231113 -0.948338328 -0.993662524
## workprg          0.083829106  0.104466422  0.110988653
## priors           0.087245826  0.088706984  0.099292063
## tserved          0.013008862  0.013266906  0.014922136
## felon           -0.283925203 -0.288542238 -0.319662098
## alcohol          0.432442493  0.439780479  0.472499810
## drugs            0.274714115  0.299102966  0.316729032
## black            0.433555955  0.427210098  0.458027506
## married         -0.154047742 -0.183040394 -0.204807338
## educ            -0.021416177 -0.023334468 -0.026725931
## age             -0.003580003 -0.003851008 -0.004023087

As one would expect, the estimates of the relative risks based on the c-log-log link are closer to the continuous time estimates than those based on the logit link.
This result makes sense because the piecewise exponential and c-log-log link models are estimating the same continuous time hazard, one from continuous and one from grouped data, while the logit model is estimating a discrete time hazard.
Recall that in a continuous time model the relative risk multiplies the hazard or instantaneous failure rate, whereas in a discrete time logit model it multiplies the conditional odds of failure at a given time (or in a given time interval) given survival to that time (or the start of the interval). Interpretation of the results should take this fact into account.
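
A small numerical check makes the difference between the two links concrete. Under a constant hazard $\lambda$ within an interval of width $w$, the probability of failing in the interval is $1-\exp(-\lambda w)$, so the complementary log-log of that probability is $\log\lambda + \log w$ and a hazard ratio carries over exactly. The values of `lambda0`, `hr`, and `w` below are arbitrary assumptions chosen for illustration.

```r
lambda0 <- 0.02   # baseline monthly hazard (assumption)
hr      <- 1.5    # hazard ratio between two groups (assumption)
w       <- 12     # interval width in months (assumption)

# Probability of failing within the interval in each group
p0 <- 1 - exp(-lambda0 * w)
p1 <- 1 - exp(-lambda0 * hr * w)

cloglog_link <- function(p) log(-log(1 - p))

d_cloglog <- cloglog_link(p1) - cloglog_link(p0)   # difference on the c-log-log scale
d_logit   <- qlogis(p1) - qlogis(p0)               # difference on the logit scale

print(c(log_hr = log(hr), cloglog = d_cloglog, logit = d_logit))
```

The c-log-log difference reproduces `log(hr)`, whereas the logit difference is larger in magnitude, which matches the pattern in the comparison table where the logit coefficients tend to be further from zero.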

All three approaches, however, lead to similar predicted survival probabilities.
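
As a rough arithmetic check of this claim, the printed intercepts and interval effects can be converted into five-year survival probabilities for the reference profile in which every covariate equals zero. This is a sketch under assumptions: exposure is taken to be measured in months with 12 months per interval, and the all-zero profile is an extrapolation chosen only because it needs no covariate values.

```r
# Intercept and interval effects copied from the comparison table above
b_pwe     <- c(-3.830127469, 0.036531989, -0.373815644, -0.811543632, -0.938231113)
b_cloglog <- c(-1.238795113, 0.021613951, -0.392613793, -0.824996440, -0.948338328)
b_logit   <- c(-1.140802599, 0.030528163, -0.413140262, -0.864148699, -0.993662524)

# Linear predictor for each of the five 12-month intervals
eta <- function(b) b[1] + c(0, b[-1])

surv60 <- c(
  pwe     = exp(-sum(12 * exp(eta(b_pwe)))),   # monthly rate times 12 months
  cloglog = prod(exp(-exp(eta(b_cloglog)))),   # 1 - p = exp(-exp(eta))
  logit   = prod(1 - plogis(eta(b_logit)))     # 1 - p = 1 / (1 + exp(eta))
)
print(round(surv60, 3))
```

For this profile the three survival probabilities should fall within a few percentage points of one another, even though the coefficients are on different scales.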

Interval censoring²²

Discrete data are often the result of interval-censoring. Events might happen in a continuous range of time, but they can only be observed at discrete moments (e.g., longitudinal data by waves).

The modeling paradigm for interval-censored survival data is essentially the same as for non-interval-censored data. Interpretation and presentation of the results of a fitted proportional hazards model is identical for the two types of data.
However, model building with interval-censored data uses the binary regression likelihood if the intervals are the same for all subjects. This implies that model-building details, such as variable selection, identification of the scale of continuous covariates, and inclusion of interactions, use techniques based on binary regression modeling with the complementary log-log model.

Conditional logistic regression and stratified Cox model²³

Under a particular data structure, the log-likelihood for a conditional logistic regression model is the same as the log-likelihood from a Cox model. A stratified Cox model with each case/control group assigned to its own stratum, time set to a constant, status coded 1 = case and 0 = control, and the exact partial likelihood has the same likelihood formula as a conditional logistic regression. The clogit routine creates the necessary dummy variable of times (all 1) and the strata, then calls coxph.
- Stratified Cox model

cox1 <- coxph(Surv(durat, fail) ~ workprg + married + educ + age + strata(black), 
             data = recidx, ties="efron")
coef(summary(cox1))

##                 coef exp(coef)     se(coef)         z     Pr(>|z|)
## workprg  0.156200755 1.1690609 0.0897549296  1.740303 8.180586e-02
## married -0.323341332 0.7237268 0.1133119425 -2.853550 4.323368e-03
## educ    -0.049860264 0.9513624 0.0190833630 -2.612761 8.981412e-03
## age     -0.001898137 0.9981037 0.0004532662 -4.187687 2.818123e-05

- Conditional logistic model

clogit <- clogit(fail ~ workprg + married + educ + age + strata(black),
                 data = recidx, method=c("efron"))
coef(summary(clogit))

##                 coef exp(coef)     se(coef)         z     Pr(>|z|)
## workprg  0.151890146 1.1640324 0.0897545659  1.692283 9.059199e-02
## married -0.310120691 0.7333584 0.1132727550 -2.737822 6.184746e-03
## educ    -0.047916781 0.9532131 0.0191298600 -2.504816 1.225151e-02
## age     -0.001812212 0.9981894 0.0004512763 -4.015749 5.925726e-05
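
The equivalence described above requires constant time and the exact partial likelihood. The two fits just shown use observed durations and the Efron approximation, so their coefficients are close but not identical. A minimal sketch of the exact equivalence follows, using the matched case-control data `infert` that ships with R instead of the recidivism data; the helper column `const_time` is a hypothetical name introduced only for this illustration.

```r
library(survival)

# Conditional logistic regression on matched sets (stratum = matched set)
fit_clogit <- clogit(case ~ spontaneous + induced + strata(stratum),
                     data = infert, method = "exact")

# The same model written as a stratified Cox fit:
# constant time, case indicator as status, exact partial likelihood
infert2 <- infert
infert2$const_time <- 1   # hypothetical helper column
fit_cox <- coxph(Surv(const_time, case) ~ spontaneous + induced + strata(stratum),
                 data = infert2, method = "exact")

print(cbind(clogit = coef(fit_clogit), coxph = coef(fit_cox)))
```

The two columns should agree, since this is the construction that clogit performs internally before calling coxph.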

Competing risks and Multistate models I²⁴

Recurrent event

Up to this point, we have assumed that the event of interest can occur only once for a given subject. However, in many research scenarios in which the event of interest is not death, a subject may experience an event several times over follow-up. Examples of recurrent event data include:

- Multiple episodes of relapse from remission, comparing different treatments for leukemia patients.
- Recurrent heart attacks of coronary patients being treated for heart disease.
- Recurrence of bladder cancer tumors in a cohort of patients randomized to one of two treatment groups.
- Multiple events of deteriorating visual acuity in patients with baseline macular degeneration, where each recurrent event is considered a more clinically advanced stage of a previous event.

An objective for such data is to assess the relationship of relevant predictors to the rate at which events occur, allowing for multiple events per subject.

The approach to analysis typically used when recurrent events are treated as identical is called the Counting Process Approach (Andersen et al., 1993).
When recurrent events involve different disease categories and/or the order of events is considered important, a number of alternative approaches to analysis have been proposed that involve using stratified Cox (SC) models.

The counting process model and method

\[h(t,x)=h_0(t)\exp[\sum \beta_i x_i]\]

The Cox PH model requires

- The PH assumption for each $x_i$
- Consideration of either a stratified or an extended Cox model if the PH assumption is not satisfied
- The extended Cox model for time-dependent variables

In nonrecurrent event data,

- Subjects are removed from the risk set at the time of failure or censorship
- Different lines of data are treated as independent because they come from different subjects

Whereas, in recurrent event data,

- Subjects with more than one time interval remain in the risk set until the last interval is completed
- Different lines of data are treated as independent even though several outcomes come from the same subject

Stratified Cox approaches

The “strata” variable for each approach treats the time interval number as a categorical variable. For example, if the maximum number of failures that occur on any given subject in the dataset is, say, 4, then time interval #1 is assigned to stratum 1, time interval #2 to stratum 2, and so on.
Both Stratified CP and Gap Time approaches focus on survival time between two events. Stratified CP uses the actual times of the two events from study entry, whereas Gap Time starts survival time at 0 for the earlier event and stops at the later event.
- The stratified CP approach uses the exact same (start, stop) data layout format used for the CP approach, except that for Stratified CP, an SC model is used rather than a standard (unstratified) PH model.
- The Gap Time approach also uses a (start, stop) data layout, but the start value is always 0 and the stop value is the time interval length since the previous event.
The Marginal approach, in contrast to each conditional approach, focuses on total survival time from study entry until the occurrence of a specific (e.g., kth) event; this approach is suggested when recurrent events are viewed to be of different types.
- The Marginal approach uses the standard (nonrecurrent event) data layout instead of the (start, stop) layout
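
The three data layouts can be compared on a single subject. The sketch below uses a hypothetical subject with events at months 12 and 16 and censoring at month 18, and assumes a maximum of `K = 4` events per subject for the Marginal layout; the data frame and column names are illustrative and are not part of any package.

```r
# Counting process (CP) layout: actual times from study entry
cp <- data.frame(id = 1, interval = 1:3,
                 start = c(0, 12, 16), stop = c(12, 16, 18),
                 event = c(1, 1, 0))

# Gap Time layout: the clock restarts at 0 after each event
gap <- cp
gap$stop  <- cp$stop - cp$start
gap$start <- 0

# Marginal layout: one row per possible event, time measured from study entry;
# strata beyond the subject's last interval carry the final follow-up time, censored
K <- 4
n_pad <- K - nrow(cp)
marginal <- data.frame(id = 1, stratum = 1:K,
                       time  = c(cp$stop, rep(max(cp$stop), n_pad)),
                       event = c(cp$event, rep(0, n_pad)))

print(cp)
print(gap)
print(marginal)
```

In the Marginal layout the subject contributes to every stratum, including the fourth, even though the third event was never observed.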

The stratified CP approach: Example

The modeling of recurrent events is illustrated with the bladder cancer dataset (bladder.csv). Recurrent events are represented in the data with multiple observations for subjects having multiple events. The data layout for the bladder cancer dataset is in the counting process (start, stop) format with time intervals defined for each observation. The read.csv function is used to read the data from a CSV file into an R data frame.

##### data format
bladder <- read.csv(file="bladder.csv", header = TRUE)

bladder[12:20,]

##    ID EVENT INTERVAL INTTIME START STOP TX NUM SIZE
## 12 10     1        1      12     0   12  0   1    1
## 13 10     1        2       4    12   16  0   1    1
## 14 10     0        3       2    16   18  0   1    1
## 15 11     0        1      23     0   23  0   3    3
## 16 12     1        1      10     0   10  0   1    3
## 17 12     1        2       5    10   15  0   1    3
## 18 12     0        3       8    15   23  0   1    3
## 19 13     1        1       3     0    3  0   1    1
## 20 13     1        2      13     3   16  0   1    1

There are three observations for ID=10, one observation for ID=11, three observations for ID=12, and two observations for ID=13. The variables START and STOP represent the time interval for the risk period specific to that observation. The variable EVENT indicates whether an event (coded 1) occurred. The first three observations indicate that the subject with ID=10 had an event at 12 months, another event at 16 months, and was censored at 18 months.

Recall we analyzed data in the counting process format when we ran extended Cox models. We saw how a subject’s covariate can change values from time-interval to time-interval. With the bladder dataset, the (start,stop) data format provides a way to indicate that a subject experienced multiple events.

The coxph function can be used to run Cox models with recurrent events. First, we’ll define a response variable using the Surv function (called $Y$):

library(survival)
Y=Surv(bladder$START,bladder$STOP,bladder$EVENT==1)

## Warning in Surv(bladder$START, bladder$STOP, bladder$EVENT == 1): Stop time
## must be > start time, NA created

The Surv function requires three arguments with data in the counting process format: the start variable (called START), the stop variable (called STOP), and the status variable (called EVENT). The code bladder$EVENT==1 indicates that an event is coded 1. R recognizes the value 1 as the default coding of an event, so it was not necessary to state this explicitly in the Surv function as we did. The warning arises because at least one observation has a stop time that is not greater than its start time; Surv sets that observation to NA, and it is the one observation reported as deleted due to missingness in the model output. Next, a recurrent-events Cox model is run with the predictors: treatment status (TX), initial number of tumors (NUM), and the initial size of tumors (SIZE):

coxph(Y ~ TX + NUM + SIZE + cluster(ID), data=bladder)

## Call:
## coxph(formula = Y ~ TX + NUM + SIZE, data = bladder, cluster = ID)
## 
##          coef exp(coef) se(coef) robust se      z       p
## TX   -0.41164   0.66256  0.19989   0.24876 -1.655 0.09798
## NUM   0.16367   1.17782  0.04777   0.05842  2.801 0.00509
## SIZE -0.04108   0.95975  0.07029   0.07421 -0.554 0.57991
## 
## Likelihood ratio test=14.66  on 3 df, p=0.002127
## n= 190, number of events= 112 
##    (1 observation deleted due to missingness)

The term + cluster(ID) in the model formula requests robust standard errors for the parameter estimates, which allow for correlation among the observations on the same subject.

The treatment variable (TX) is coded 1 for treatment with thiotepa and 0 for the placebo. The estimated hazard ratio (TX=1 vs. TX=0) is 0.663 (with a p-value of 0.0980). There are two sets of standard errors presented in the table under the columns labeled se(coef) and robust se. The p-values and z-test statistics in this table are calculated using the robust standard errors. We could obtain additional model output (including 95% CIs) by applying the summary function to the fitted coxph object.
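
A sketch of how the additional output might be extracted is given below. It assumes the `bladder2` data set distributed with the survival package (variables `rx`, `number`, `size`, `id`, already in (start, stop) format) as a stand-in for the file read above; it is a related but not identical version of the data, and `rx` is coded 1 = placebo, 2 = thiotepa there, so the numbers will not match the output shown.

```r
library(survival)

# Counting process model with robust (cluster) standard errors
fit_cp <- coxph(Surv(start, stop, event) ~ rx + number + size + cluster(id),
                data = bladder2)

s <- summary(fit_cp)

# Hazard ratios with 95% confidence limits (based on the robust SEs)
print(s$conf.int)

# Model-based and robust standard errors side by side
print(s$coefficients[, c("se(coef)", "robust se")])
```

Storing the fit in an object, rather than printing it directly, also makes components such as `coef(fit_cp)` and `vcov(fit_cp)` available for later use.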

A stratified Cox model can also be run using the data in this format with the variable INTERVAL as the stratified variable. The stratified variable indicates whether the subject was at risk for their first, second, third, or fourth event. This approach is called a Stratified CP recurrent event model and is used if the investigator wants to distinguish the order in which recurrent events occur. The bladder data is in the proper format to run this model.

coxph(Y ~ TX + NUM + SIZE + strata(INTERVAL) + cluster(ID), data=bladder)

## Call:
## coxph(formula = Y ~ TX + NUM + SIZE + strata(INTERVAL), data = bladder, 
##     cluster = ID)
## 
##           coef exp(coef)  se(coef) robust se      z      p
## TX   -0.333489  0.716420  0.216168  0.204787 -1.628 0.1034
## NUM   0.119617  1.127065  0.053338  0.051387  2.328 0.0199
## SIZE -0.008495  0.991541  0.072762  0.061635 -0.138 0.8904
## 
## Likelihood ratio test=6.51  on 3 df, p=0.08928
## n= 190, number of events= 112 
##    (1 observation deleted due to missingness)

The only additional code from the previous model is the term + strata(INTERVAL) in the model formula, which indicates that INTERVAL is the stratified variable. Interaction terms between the treatment variable (TX) and the stratified variable could be created to examine whether the effect of treatment differed for the 1st, 2nd, 3rd, or 4th event.

The Gap Time approach: Example

Another stratified approach (called Gap Time) is a slight variation of the Stratified CP approach. The difference is in the way the time intervals for the recurrent events are defined. There is no difference in the time intervals when subjects are at risk for their first event. However, with the Gap Time approach, the starting time at risk gets reset to zero for each subsequent event. To run a Gap Time model, we need to create two new (start, stop) variables in the bladder dataset, which we’ll call START2 and STOP2:

bladder$START2=0
bladder$STOP2=bladder$STOP - bladder$START

The first of the two newly defined variables (START2) is always zero. The second (STOP2) is defined as the time between each event (STOP–START). To print a subset of these variables, we can use the data.frame function. The attach function allows variables in the bladder dataset to be listed without the bladder$ prefix (code and output for printing the 12th–20th observation below).

attach(bladder)
data.frame(ID,EVENT,START,STOP,START2,STOP2)[12:20, ]

##    ID EVENT START STOP START2 STOP2
## 12 10     1     0   12      0    12
## 13 10     1    12   16      0     4
## 14 10     0    16   18      0     2
## 15 11     0     0   23      0    23
## 16 12     1     0   10      0    10
## 17 12     1    10   15      0     5
## 18 12     0    15   23      0     8
## 19 13     1     0    3      0     3
## 20 13     1     3   16      0    13

Next we need to reset our response variable using the Surv function by changing our time intervals from (START, STOP) to (START2, STOP2):

Y2=Surv(bladder$START2,bladder$STOP2,bladder$EVENT)

## Warning in Surv(bladder$START2, bladder$STOP2, bladder$EVENT): Stop time must
## be > start time, NA created

Next we run a Gap Time model with the bladder data, using code similar to that used for the Stratified CP model except that Y2 rather than Y is the response variable.

coxph(Y2 ~ TX + NUM + SIZE + strata(INTERVAL) + cluster(ID),data=bladder)

## Call:
## coxph(formula = Y2 ~ TX + NUM + SIZE + strata(INTERVAL), data = bladder, 
##     cluster = ID)
## 
##           coef exp(coef)  se(coef) robust se      z       p
## TX   -0.279005  0.756536  0.207348  0.215624 -1.294 0.19569
## NUM   0.158046  1.171220  0.051942  0.050940  3.103 0.00192
## SIZE  0.007415  1.007443  0.070023  0.064333  0.115 0.90824
## 
## Likelihood ratio test=9.33  on 3 df, p=0.02517
## n= 190, number of events= 112 
##    (1 observation deleted due to missingness)

The results from the Gap Time approach differ slightly from those obtained using the Stratified CP approach.
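
To see the three recurrent-event approaches side by side, the fits can be stored in a list and their hazard ratios tabulated. The sketch below again assumes the `bladder2` data set from the survival package as a stand-in (with `enum` playing the role of INTERVAL and `rx`, `number`, `size` the roles of TX, NUM, SIZE); the column `gap` is a hypothetical name, and the estimates will differ from those printed above because the data are not identical.

```r
library(survival)

d <- bladder2
d$gap <- d$stop - d$start   # time since the previous event (or study entry)

fits <- list(
  CP           = coxph(Surv(start, stop, event) ~ rx + number + size + cluster(id),
                       data = d),
  StratifiedCP = coxph(Surv(start, stop, event) ~ rx + number + size +
                         strata(enum) + cluster(id), data = d),
  GapTime      = coxph(Surv(gap, event) ~ rx + number + size +
                         strata(enum) + cluster(id), data = d)
)

# Hazard ratios: one column per approach, one row per covariate
hr <- sapply(fits, function(f) exp(coef(f)))
print(round(hr, 3))
```

Writing the Gap Time response as `Surv(gap, event)` is equivalent to a (start, stop) response whose start is always zero.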

Until now we have considered survival times with a single, well-defined outcome, such as death or some other event. In some applications, however, a patient may potentially experience multiple events, only the first-occurring of which can be observed. For example, we may be interested in time from diagnosis with prostate cancer until death from that disease (Cause 1) or death from some other cause (Cause 2), but for a particular patient we can only observe the time to the first event. Of course, a patient may also be censored if he is still alive at the last follow-up time.

If interest centers on a particular outcome, time to prostate cancer death, for example, a simplistic analysis method would be to treat death from other causes as a type of censoring. This approach has the advantage that implementing it is straightforward using the survival analysis methods we have discussed. However, a key assumption about censoring is that it is independent of the event in question. In most competing risk applications, this assumption may be questionable, and in some cases may be quite unrealistic.

Furthermore, it is not possible to test the independence assumption using only the competing risks data. The only hope of evaluating the accuracy of the assumption would be to examine other data or appeal to theories concerning the etiology of the various death causes. Consequently, interpretation of survival analyses in the presence of competing risks will always be subject to at least some ambiguity due to uncertainty about the degree of dependence among the competing outcomes.

Kaplan-Meier Estimation with Competing Risks

We begin by estimating a survival curve in a single sample in the presence of competing events. The simplest method would be to select each event in turn as the primary event and to treat the other as a censoring event. However, to obtain unbiased estimates of survival curves, this simplistic method would require the usually false assumption that the two causes of death are independent. We may illustrate this problem by considering prostate cancer patients aged 80 and over diagnosed with stage T2 poorly differentiated prostate cancer. We define indicator variables “status.other” and “status.prost”, and then select the subset “prostateSurvival.highrisk” as follows, using the “prostate survival” data.

library(asaur)
prostateSurvival <- within(prostateSurvival, {
  status.prost <- as.numeric(status == 1)  # 1 = death from prostate cancer
  status.other <- as.numeric(status == 2)  # 1 = death from other causes
})
attach(prostateSurvival)

## The following object is masked _by_ .GlobalEnv:
## 
##     status

## The following object is masked from pancreatic:
## 
##     stage

prostateSurvival.highrisk <- 
  prostateSurvival [{{grade=="poor"} & {stage=="T2"} & {ageGroup=="80+"}},]
head(prostateSurvival.highrisk)

##    grade stage ageGroup survTime status status.other status.prost
## 13  poor    T2      80+       21      0            0            0
## 38  poor    T2      80+      105      0            0            0
## 41  poor    T2      80+        2      1            0            1
## 47  poor    T2      80+       67      2            1            0
## 78  poor    T2      80+        2      0            0            0
## 93  poor    T2      80+       60      2            1            0

Let us consider two analyses, one with death due to other causes (status = 2) as censored, and the other with death due to prostate cancer (status = 1) as censored. We set these up as follows:

status.prost <- {prostateSurvival.highrisk$status==1}
status.other <- {prostateSurvival.highrisk$status==2}

The Kaplan-Meier estimate of survival, defined as time to death from prostate cancer (with other causes of death treated as censored), is obtained as follows:

result.prostate.km <- survfit(Surv(survTime, event = status.prost) ~ 1, data=prostateSurvival.highrisk)

Similarly, to estimate survival for time to death from other causes, we have

result.other.km <- survfit(Surv(survTime, event = status.other) ~ 1, data=prostateSurvival.highrisk)

To illustrate the problem with this analysis, let us first extract the Kaplan-Meier survival curve for death from other causes:

surv.other.km <- result.other.km$surv
time.km <-result.other.km$time/12

Now let’s extract the corresponding survival curve for death from prostate cancer, and then express it as a cumulative incidence function, which is one minus the survival curve (also known as the cumulative distribution function):

surv.prost.km <- result.prostate.km$surv
cumDist.prost.km <- 1 - surv.prost.km

Now we may plot both on the same graph, using the plot option ‘type = “s”’ to produce step functions:

plot(cumDist.prost.km ~ time.km, type="s", 
     ylim=c(0,1), lwd=2, 
     xlab="Years from prostate cancer diagnosis", col="blue")
lines(surv.other.km~time.km, type="s", col="green", lwd=2)

The resulting plot shows that the two curves cross. At 10 years, for example, the probability of dying of prostate cancer is 0.46, and of other causes it is 0.88. The fact that the sum of these two probabilities exceeds one demonstrates that these estimates, viewed as probabilities that a particular patient would die of prostate cancer or something else, are severely biased. One might be tempted to view these curves as estimates of the probability of death from one cause if the other cause were eliminated as a possibility, but such an exercise would require the assumption that the causes be independent. This assumption cannot be tested from the data, and in any case the meaning of the resulting estimates would be purely hypothetical.
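
The same phenomenon can be reproduced with simulated data, where the true probabilities are known. The sketch below is an illustration under assumptions: two latent exponential failure times with arbitrary rates 0.05 and 0.15, no other censoring, and an arbitrary evaluation time of 20; none of these values come from the prostate cancer data.

```r
library(survival)

set.seed(42)                 # arbitrary seed
n  <- 2000                   # number of simulated subjects (assumption)
t1 <- rexp(n, rate = 0.05)   # latent time to death from cause 1
t2 <- rexp(n, rate = 0.15)   # latent time to death from cause 2

time  <- pmin(t1, t2)        # only the first event is observed
cause <- ifelse(t1 < t2, 1, 2)

# Naive approach: treat the other cause as censoring
km1 <- survfit(Surv(time, cause == 1) ~ 1)
km2 <- survfit(Surv(time, cause == 2) ~ 1)

at <- 20
naive1 <- 1 - summary(km1, times = at)$surv
naive2 <- 1 - summary(km2, times = at)$surv

# Observed proportions that actually died of each cause by time `at`
actual1 <- mean(time <= at & cause == 1)
actual2 <- mean(time <= at & cause == 2)

print(c(naive_sum = naive1 + naive2, actual_sum = actual1 + actual2))
```

The naive one-minus-Kaplan-Meier values add up to more than one, while the observed proportions cannot, even though the two latent times here are independent by construction.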

Cause-Specific Hazards and Cumulative Incidence Functions

“Subject can die of only one of $K$ causes”

To develop a formal model to accommodate competing risks, let us suppose that there are $K$ distinct causes of death, which we may diagram as in Figure.

The distinguishing feature of this competing causes framework is that each subject can experience at most one of the $K$ causes of death; the times at which the subject would have experienced the remaining causes are thus unobserved. This framework can also accommodate applications with non-fatal events, as long as all of the events are mutually exclusive. With competing risks, it is helpful to define, for each cause of interest, a function known as the cumulative incidence function, also called the sub-distribution function. This is the cumulative probability that an individual dies from that particular cause by time $t$, and is given by

\[F_j(t)=P(T \le t, C=j)=\int_0^t h_j(u)S(u)\,du\]

This function is similar to the cumulative distribution function in that it is always increasing (or more precisely, non-decreasing). But unlike a cumulative distribution function, it goes, in the limit, to the probability of death from that particular cause, rather than to 1. Formally, we have

\[F_j(\infty)=P(C=j)\]

The cause-specific hazard is defined in a manner similar to the hazard function, but now it is the instantaneous rate at which a specific event occurs at time $t$, given that the individual has survived that long:

\[h_j(t)=\lim_{\delta \to 0} \left(\frac{P(t<T<t+\delta, C=j \mid T>t)}{\delta}\right)\]

If we add up all of the cause-specific hazards at a particular time, we get the hazard function

\[h(t)=\sum_{j=1}^{K} h_j(t)\]

That is, the risk of death at a particular time is the sum of the risks of all of the specific causes of death at that time.
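
These definitions can be checked numerically in the simplest case of two constant cause-specific hazards. The values `h1 = 0.03` and `h2 = 0.07` are arbitrary assumptions; with constant hazards the overall survival function is $S(t)=\exp(-(h_1+h_2)t)$ and the limit $F_j(\infty)$ equals $h_j/(h_1+h_2)$.

```r
h1 <- 0.03   # cause-specific hazard for cause 1 (assumption)
h2 <- 0.07   # cause-specific hazard for cause 2 (assumption)

# Overall survival: the overall hazard is the sum of the cause-specific hazards
S <- function(u) exp(-(h1 + h2) * u)

# Cumulative incidence: integral of h_j(u) * S(u) from 0 to t
cif <- function(t, hj) {
  integrate(function(u) hj * S(u), lower = 0, upper = t)$value
}

F1_10  <- cif(10, h1)
F2_10  <- cif(10, h2)
F1_inf <- cif(Inf, h1)

# One minus the "survival" obtained by treating cause 2 as censoring
naive1_10 <- 1 - exp(-h1 * 10)

print(c(F1_10 = F1_10, F2_10 = F2_10, any_cause_10 = 1 - S(10),
        F1_inf = F1_inf, naive1_10 = naive1_10))
```

The two cumulative incidences add up to the probability of death from any cause, and the naive quantity exceeds the cumulative incidence for cause 1 because it ignores subjects removed by the competing cause.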

Suppose now that we have $D$ distinct ordered failure times $t_1,t_2,\dots, t_D$. We may estimate the hazard at the $i^{th}$ time $t_i$ using $\hat{h}(t_i)=\frac{d_i}{n_i}$, as we have seen in previous chapters. The cause-specific hazard for the $k^{th}$ cause may be written in a similar form as $\hat{h}_k(t_i)=\frac{d_{ik}}{n_i}$. This is just the number of events of type $k$ at that time divided by the number at risk at that time. The sum over all cause-specific hazards is the overall hazard, $\hat{h}(t_i)=\frac{\sum_k d_{ik}}{n_i}$. The probability of failure from any cause at time $t_i$ is the product of $\hat{S}(t_{i-1})$, the probability of being alive just before $t_i$, and $\hat{h}(t_i)$, the risk of dying at $t_i$. Similarly, the probability of failure due to cause $k$ at that time is $\hat{S}(t_{i-1})\hat{h}_k(t_i)$. The sub-distribution function, or cumulative incidence function, is the probability of dying of cause $k$ by time $t_i$. This is the sum of all probabilities of dying of this cause up to time $t_i$ and is given by

\[\hat{F}_k(t)=\sum_{t_i \le t} \hat{S}(t_{i-1})\hat{h}_k(t_i)\]

That is, once we have an estimate of the overall survival function $\hat{S}(t)$, we can obtain the cumulative incidence function for a particular cause by summing over the product of this and the cause-specific hazards for that cause.

To illustrate this methodology, let us consider a simple hypothetical data set with six observations and two possible causes of death, displayed in Fig. 9.3. Denoting the event types with the numbers 1 and 2, and the censored observations with the number 0, we may enter the data into R as follows:

“Competing risk survival data”

# install.packages("mstate")
library(survival)
library(mstate)

tt <- c(2,7,5,3,4,6)
status <- c(1,2,1,2,0,0)

We first compute the overall survival distribution,

status.any <- as.numeric(status >= 1)
result.any <-survfit(Surv(tt,status.any)~1)
result.any$surv

## [1] 0.8333333 0.6666667 0.6666667 0.4444444 0.4444444 0.0000000

We compute the cumulative incidence functions as in the following table.

“Competing risk survival data”

For example, the probability of event type 1 at the first time $(t=2)$ is given by $1.000 \times \frac{1}{6}=0.167$. This is also the estimate of the cumulative incidence function at this time. The probability of an event of this type at time $t=5$ is $0.667 \times \frac{1}{3}=0.222$, where 0.667 is the overall survival just before $t=5$ and 3 subjects are still at risk. The cumulative incidence for this event at time $t=5$ is then $0.167+0.222=0.389$. These results may be more easily obtained using the “Cuminc” function in the “mstate” R package:

ci <- Cuminc(time=tt, status=status)
ci

##   time      Surv      CI.1      CI.2    seSurv    seCI.1    seCI.2
## 1    2 0.8333333 0.1666667 0.0000000 0.1521452 0.1521452 0.0000000
## 2    3 0.6666667 0.1666667 0.1666667 0.1924501 0.1521452 0.1521452
## 3    5 0.4444444 0.3888889 0.1666667 0.2222222 0.2187224 0.1521452
## 4    7 0.0000000 0.3888889 0.6111111 0.0000000 0.2187224 0.2187224

The standard errors for the survival curve are computed using Greenwood’s formula and the standard errors for the cumulative incidence functions are computed in an analogous manner.
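
The point estimates in this output can be reproduced step by step from the formula for $\hat{F}_k(t)$ using base R only. The sketch below recomputes them for the six-observation data set; the object names (`n_risk`, `S_prev`, and so on) are illustrative choices.

```r
tt     <- c(2, 7, 5, 3, 4, 6)
status <- c(1, 2, 1, 2, 0, 0)   # 0 = censored, 1 and 2 = event types

# Distinct event times and the number at risk at each
times  <- sort(unique(tt[status > 0]))
n_risk <- sapply(times, function(t) sum(tt >= t))

# Number of events of each type at each event time
d1 <- sapply(times, function(t) sum(tt == t & status == 1))
d2 <- sapply(times, function(t) sum(tt == t & status == 2))

# Overall Kaplan-Meier survival and its value just before each event time
S      <- cumprod(1 - (d1 + d2) / n_risk)
S_prev <- c(1, head(S, -1))

# Cumulative incidence: running sum of S(t_{i-1}) * h_k(t_i)
CI1 <- cumsum(S_prev * d1 / n_risk)
CI2 <- cumsum(S_prev * d2 / n_risk)

print(round(data.frame(time = times, Surv = S, CI.1 = CI1, CI.2 = CI2), 7))
```

The columns should agree with `Surv`, `CI.1`, and `CI.2` in the Cuminc output, and at every time the three values sum to one.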

Cumulative Incidence Functions for Prostate Cancer Data

Returning to the prostate cancer example of Fig. 9.1, we may now estimate the competing risks cumulative incidence functions using the “Cuminc” function in the R package “mstate” as follows:

library(mstate)
ci.prostate <- Cuminc(time=prostateSurvival.highrisk$survTime,
                      status=prostateSurvival.highrisk$status)
head(ci.prostate)

##   time      Surv        CI.1        CI.2      seSurv      seCI.1      seCI.2
## 1    0 1.0000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## 2    1 0.9940758 0.000000000 0.005924171 0.002641510 0.000000000 0.002641510
## 3    2 0.9880511 0.006024702 0.005924171 0.003756151 0.002686199 0.002641510
## 4    3 0.9843644 0.008482541 0.007153090 0.004303185 0.003192654 0.002910105
## 5    4 0.9831168 0.009730151 0.007153090 0.004474935 0.003423736 0.002910105
## 6    5 0.9780751 0.014771775 0.007153090 0.005112934 0.004233796 0.002910105

The first column, “time” is the time in months. The column “Surv” is the Kaplan-Meier survival estimate for time to death from any cause (prostate or something else). The next two columns are the cumulative incidence function estimates for causes 1 (prostate) and 2 (other). The remaining columns are standard errors of the respective estimates. We may plot the cause-specific cumulative incidence functions as follows:

ci1 <- ci.prostate$CI.1 # CI.1 is for prostate cancer
ci2 <- ci.prostate$CI.2 # CI.2 is for other causes
times <- ci.prostate$time/12 # convert months to years
Rci2 <- 1 - ci2

We may plot the cumulative incidence function for death from prostate cancer in solid blue and one minus the cumulative incidence function for death from other causes in solid green, together with the previous Kaplan-Meier-based estimates as thin lines of the same (but lighter) colors:

plot(Rci2 ~ times, type="s", ylim=c(0,1), lwd=2, col="green", 
     xlab="Time in years", ylab="Survival probability")
lines(ci1 ~ times, type="s", lwd=2, col="blue")
lines(surv.other.km ~ time.km, type="s", col="lightgreen", lwd=1)
lines(cumDist.prost.km ~ time.km, type="s", col="lightblue", lwd=1)

The resulting figure clearly illustrates the value of displaying competing risks cumulative incidence functions. These curves represent estimates of the actual probabilities that a patient will die of a particular cause, rather than hypothetical probabilities that he would die of one cause in the absence of the other.

A common way to display competing risk cumulative incidence curves is via a stacked plot, as shown in Fig. 9.5. The lower, blue curve represents the cumulative probability of death from prostate cancer, and the difference between the blue and upper, green curve represents the probability of death from other causes. The sum of the two probabilities of death, i.e. the upper, green curve, represents the cumulative probability of death from any cause, and is equal to one minus the Kaplan-Meier survival curve for death from any cause.

ci1 <- ci.prostate$CI.1 # CI.1 is for prostate cancer
ci2 <- ci.prostate$CI.2 # CI.2 is for other causes
times <- ci.prostate$time/12 # convert months to years
sumci <- ci1 + ci2 # cumulative probability of death from any cause

plot(sumci ~ times, type="s", ylim=c(0,1), lwd=2, col="green", 
     xlab="Years from prostate cancer diagnosis", 
     ylab="probability patient has died")
lines(ci1 ~ times, type="s", lwd=2, col="blue")
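
The identity stated above (the sum of the cause-specific cumulative incidence functions equals one minus the Kaplan-Meier curve for death from any cause) can be checked numerically. The following is a minimal sketch in base R on a small made-up data set; the vectors “time” and “status” are hypothetical, already sorted, and contain no tied times.

```r
# Hypothetical toy data: status 0 = censored, 1 = cause 1, 2 = cause 2
time   <- c(2, 3, 4, 5, 6, 8, 9, 10)
status <- c(1, 2, 0, 1, 2, 1, 0, 2)

n.risk  <- length(time):1                  # times are sorted and untied
d.any   <- as.numeric(status > 0)          # death from any cause
km      <- cumprod(1 - d.any / n.risk)     # Kaplan-Meier estimate, any cause
km.prev <- c(1, head(km, -1))              # survival just before each time

# Cumulative incidence: running sum of S(t-) times the cause-specific hazard increment
ci1 <- cumsum(km.prev * (status == 1) / n.risk)
ci2 <- cumsum(km.prev * (status == 2) / n.risk)

all.equal(ci1 + ci2, 1 - km)               # compares the stacked curve with 1 - KM
```

Each term of the sum is weighted by the overall survival just before the event time, which is why the two curves add up to the all-cause death probability.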

Regression Methods for Cause-Specific Hazards

When there is a single outcome of interest, the Cox proportional hazards model provides an elegant method for accommodating covariate information. However, modeling covariate information for competing risks data presents special challenges, since it is difficult to define precisely the hazard function on which the covariates should operate. The first method we will discuss is the most direct. We will illustrate using the prostate cancer data, this time restricting our attention (for now) to patients with stage T2 prostate cancer. Essentially, we will study the effects of the remaining covariates (grade and age) on prostate cancer death, treating other causes of death as censoring indicators, and vice versa for the effects of the covariates on other causes of death. We set up the data as follows:

prostateSurvival.T2 <- prostateSurvival[prostateSurvival$stage=="T2",]
attach(prostateSurvival.T2)

## The following objects are masked _by_ .GlobalEnv:
## 
##     status, status.other, status.prost

## The following objects are masked from prostateSurvival:
## 
##     ageGroup, grade, stage, status, status.other, status.prost,
##     survTime

## The following object is masked from pancreatic:
## 
##     stage

We then fit a standard Cox model for prostate cancer death as follows:

result.prostate <- coxph(Surv(survTime, status.prost) ~ grade + ageGroup,
                         data=prostateSurvival.T2)
summary(result.prostate)

## Call:
## coxph(formula = Surv(survTime, status.prost) ~ grade + ageGroup, 
##     data = prostateSurvival.T2)
## 
##   n= 5920, number of events= 410 
## 
##                  coef exp(coef) se(coef)      z Pr(>|z|)    
## gradepoor      1.2199    3.3867   0.1004 12.154  < 2e-16 ***
## ageGroup70-74 -0.2860    0.7513   0.2595 -1.102   0.2704    
## ageGroup75-79  0.4027    1.4958   0.2257  1.784   0.0744 .  
## ageGroup80+    0.9728    2.6454   0.2148  4.529 5.92e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##               exp(coef) exp(-coef) lower .95 upper .95
## gradepoor        3.3867     0.2953    2.7819     4.123
## ageGroup70-74    0.7513     1.3310    0.4518     1.249
## ageGroup75-79    1.4958     0.6685    0.9611     2.328
## ageGroup80+      2.6454     0.3780    1.7364     4.030
## 
## Concordance= 0.74  (se = 0.012 )
## Likelihood ratio test= 252.4  on 4 df,   p=<2e-16
## Wald test            = 243.6  on 4 df,   p=<2e-16
## Score (logrank) test = 278.9  on 4 df,   p=<2e-16

These results show that patients having poorly differentiated disease (grade = poor) have much worse prognosis than do patients with moderately differentiated disease (the reference group here), with a log-hazard ratio of 1.2199, i.e. a hazard ratio of 3.39. These results also show that the hazard of dying from prostate cancer tends to increase with age at diagnosis (the reference is the youngest age group, 65–69): the estimate for ages 70–74 is not significantly different from zero, while the hazard is clearly elevated for patients aged 80 and over.
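
The link between the “coef” column and the hazard ratio table can be reproduced by hand. This short sketch uses only the two numbers printed above for “gradepoor”; the variable names are chosen here for illustration.

```r
# Values copied from the summary() output above for "gradepoor"
b  <- 1.2199     # log hazard ratio (coef)
se <- 0.1004     # standard error (se(coef))

hr    <- exp(b)                                   # hazard ratio, poor vs. moderate grade
ci.hr <- exp(b + c(-1, 1) * qnorm(0.975) * se)    # Wald 95% confidence interval
round(c(HR = hr, lower = ci.hr[1], upper = ci.hr[2]), 3)
```

Up to rounding of the printed coefficient, these values agree with the “exp(coef)”, “lower .95” and “upper .95” entries in the output.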

Considering death from other causes as the event of interest, we have

result.other <- coxph(Surv(survTime, status.other) ~ grade + ageGroup,
                      data=prostateSurvival.T2)
summary(result.other)

## Call:
## coxph(formula = Surv(survTime, status.other) ~ grade + ageGroup, 
##     data = prostateSurvival.T2)
## 
##   n= 5920, number of events= 1345 
## 
##                  coef exp(coef) se(coef)     z Pr(>|z|)    
## gradepoor     0.28104   1.32451  0.05875 4.784 1.72e-06 ***
## ageGroup70-74 0.09462   1.09924  0.12492 0.757  0.44879    
## ageGroup75-79 0.31330   1.36793  0.11709 2.676  0.00746 ** 
## ageGroup80+   0.79012   2.20367  0.11204 7.052 1.76e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##               exp(coef) exp(-coef) lower .95 upper .95
## gradepoor         1.325     0.7550    1.1805     1.486
## ageGroup70-74     1.099     0.9097    0.8605     1.404
## ageGroup75-79     1.368     0.7310    1.0874     1.721
## ageGroup80+       2.204     0.4538    1.7692     2.745
## 
## Concordance= 0.583  (se = 0.008 )
## Likelihood ratio test= 159.6  on 4 df,   p=<2e-16
## Wald test            = 159.3  on 4 df,   p=<2e-16
## Score (logrank) test = 164.6  on 4 df,   p=<2e-16

Taken at face value, these results indicate that patients with poorly differentiated cancer have a higher risk of death from non-prostate-cancer related disease than do those with moderately differentiated disease. While the log hazard ratio is much smaller than with prostate cancer death as the outcome (0.28104 vs. 1.2199), one might expect that cancer grade wouldn’t have any effect on death from non-prostate-cancer causes. These hazard ratios refer to hazard functions for death from prostate cancer and for death from other causes, and these are assumed to be operating independently. As we have discussed previously, these assumptions are highly suspect, and it is unclear to what extent the hazard functions that have been estimated correspond to actual (and unobservable) hazards.

To address this issue, Fine and Gray developed an alternative method for modeling covariate data with competing risks. Instead of defining the effects of covariates on the cause-specific hazards, they define a “sub-distribution hazard”

\[\bar{h}_k(t)=\lim_{\delta \to 0}\frac{P(t<T_k<t+\delta|E)}{\delta}\] where the conditional event is given by \[E=\{T_k>t\}\;\text{or}\;\{T_{k^{'}}\le t\;\text{and}\;k^{'} \ne k\}\] That is, the sub-distribution hazard for cause $k$, like the definition of the ordinary hazard function, is essentially the probability that the failure time lies in a small interval at $t$ conditional on an event $E$, divided by the length of that small interval. The difference is that, in addition to referring to the $k^{th}$ failure time, the conditioning set specifies not only that $T_k > t$ but also allows inclusion of events other than the $k^{th}$ event in question, in which case we must have $T_{k^{'}} \le t$. Thus, when computing these sub-distribution hazards, the risk set includes not only those currently alive and at risk for the $k^{th}$ event type, but also those who died earlier of other causes.

Consider, for example, the data in Fig. 9.3. The risk set for death from Cause #2 (triangles) at time $t=7$ consists not only of Patient 2, the sole patient still alive at that time, but also of Patients 1 and 3, since they died of Cause #1 (squares) earlier. Patient 4 is not in the risk set for death from Cause #2 at time $t=7$ since that person died earlier from Cause #2, the same cause as Patient 2. Patients 5 and 6 also are not in the risk set at this time since they were censored. The sub-distribution hazard may be written in a more compact equivalent form as

\[\bar{h}_k(t)=-\frac{d \log(1-F_k(t))}{dt}\].
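
The risk-set rule described above can be written out in a few lines of base R. This is a sketch under stated assumptions: the six-patient data set below is hypothetical, with times invented so that only the ordering of events mimics the description in the text (the values are not read off Fig. 9.3), and it follows the simplified rule given there.

```r
# Hypothetical six-patient data set (times invented for illustration only)
id     <- 1:6
time   <- c(3, 7, 5, 4, 2, 6)
status <- c(1, 2, 1, 2, 0, 0)     # 0 = censored, 1 = Cause #1, 2 = Cause #2

t0 <- 7                            # time at which the risk set is evaluated
k  <- 2                            # cause of interest

still.at.risk <- time >= t0                              # event-free just before t0
earlier.other <- time < t0 & status != 0 & status != k   # died earlier of another cause
risk.set <- id[still.at.risk | earlier.other]
risk.set
```

Patients who died earlier of the competing cause stay in the risk set, while patients who died earlier of the cause of interest, or who were censored, do not.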

The Fine and Gray method uses these sub-distribution hazards for modeling the effects of covariates on a specific cause of death analogously to the Cox model,

\[\bar{h}_k(t;z,\beta)=\bar{h}_{0k}(t)e^{z\beta}\].

That is, the sub-distribution hazard for a subject with covariates $z$ is proportional to a baseline sub-distribution hazard function $\bar{h}_{0k}(t)$.
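
One consequence of this model follows from integrating the compact form of the sub-distribution hazard given earlier. Writing $F_{0k}(t)$ for the baseline cumulative incidence function (notation introduced here for illustration, with $F_k(0)=0$),

\[1-F_k(t;z)=\exp\left(-\int_0^t \bar{h}_{0k}(u)e^{z\beta}\,du\right)=\left[1-F_{0k}(t)\right]^{\exp(z\beta)}\].

Thus, under the model, a positive coefficient corresponds to a higher cumulative incidence of cause $k$ at every time $t$, which is why the Fine and Gray coefficients can be read as effects on the cumulative incidence function itself.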

The Fine and Gray methods are implemented in the “crr” function in the R package “cmprsk”. Before we can use the competing risk function “crr” in this package, we need to put the covariates into a model matrix using the “model.matrix” function. Using our attached data set “prostateSurvival.T2”, we do this as follows:

#install.packages("cmprsk")
library(cmprsk)
cov.matrix <- model.matrix(~ grade + ageGroup, data=prostateSurvival.T2)
head(cov.matrix)

##    (Intercept) gradepoor ageGroup70-74 ageGroup75-79 ageGroup80+
## 4            1         0             1             0           0
## 6            1         1             0             1           0
## 10           1         1             0             1           0
## 13           1         1             0             0           1
## 15           1         0             0             1           0
## 18           1         0             0             1           0

cov.matrix.use <- cov.matrix[,-1] # drop the first column
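
To see what “model.matrix” produces without loading the prostate data, here is a minimal sketch on a hypothetical four-row data frame; the object names “toy”, “X” and “X.use” and the factor levels are assumptions made for illustration.

```r
# Hypothetical toy data frame with the same two factors (levels assumed for illustration)
toy <- data.frame(
  grade    = factor(c("mode", "poor", "poor", "mode"), levels = c("mode", "poor")),
  ageGroup = factor(c("65-69", "70-74", "75-79", "80+"))
)

X <- model.matrix(~ grade + ageGroup, data = toy)   # dummy (treatment) coding
X.use <- X[, -1, drop = FALSE]                      # remove the intercept column
colnames(X.use)
```

The intercept column is dropped because, as in the Cox model, the baseline level is absorbed into the baseline (sub-distribution) hazard rather than estimated as a coefficient.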

We obtain estimates for death from prostate cancer as follows, dropping the first (intercept) column of the covariate matrix:

#result.prostate.crr <- crr(survTime, status, cov1=cov.matrix[,-1], failcode=1)
#summary(result.prostate.crr)

“Death from prostate cancer”

The argument “failcode=1” refers to death from prostate cancer. For death from other causes, we use “failcode=2”,

#result.other.crr <- crr(survTime, status, cov1=cov.matrix[,-1], failcode=2)
#summary(result.other.crr)

“Death from other causes”

Again we see that poorly differentiated patients have higher risk for death from other causes (log sub-distribution hazard ratio = 0.126), but the effect size is smaller than we obtained from the Putter et al. method (log hazard ratio 0.281). The estimated effect on death from prostate cancer of having poorly differentiated disease is similar for both methods (log hazard ratio of 1.22 for Putter et al. vs. 1.132 for Fine and Gray).

Comparing the Effects of Covariates on Different Causes of Death

An advantage of the Putter et al. method over the Fine and Gray method is the ease with which we can compare the effects of a covariate on, for example, death from prostate cancer and death from other causes. For example, we know that the risk of both causes of death increases with age. But does the effect of age differ for these two causes? To answer this question, we first need to convert the data set from the original one, where each patient has his own row, into one where each patient’s data is split into separate rows, one for each cause of death. In the prostate cancer case, we need to create, for each patient, two rows, one for death from prostate cancer and one for death from other causes. To simplify this process, we can use utilities in the “mstate” package. This package is capable of handling complex multistate survival models, but can also be used to set up competing risks as a special case. We begin by setting up a “transition” matrix using the function “trans.comprisk”,

library(mstate)
tmat <- trans.comprisk(2, names = c("event-free", "prostate", "other"))
tmat

##             to
## from         event-free prostate other
##   event-free         NA        1     2
##   prostate           NA       NA    NA
##   other              NA       NA    NA

The first argument is the number of specific outcomes, and the second argument (“names”) gives the name of the censored outcome and the two other outcomes. The resulting matrix states that a patient’s status can change from “event-free” to either “prostate” or “other”, these latter two being causes of death. The other entries of the matrix simply state that once a patient dies of one cause, they cannot change to another cause or return to the “event-free” status. Next, we use the function “msprep” to create the new data set, and examine the first few rows:

#attach(prostateSurvival.T2)
#prostate.long <- msprep(time = cbind(NA, survTime, survTime),
#                        status = cbind(NA, status.prost, status.other),
#                        keep = data.frame(grade, ageGroup), trans = tmat)
#head(prostate.long)

In this “msprep” function, the argument “time” consists of three columns, each corresponding to one of the states defined by the “tmat” transition matrix. The first, “event-free”, state is represented by a placeholder, “NA”; the second and third by the survival times for time to death from prostate cancer and from other causes. In our data set, both are represented by the “survTime” vector. The two times are distinguished in the next argument, “status”. This also has three columns. The first is a placeholder, “NA”, as before; the second is the censoring indicator for prostate cancer (“status.prost”), and the third is for other causes (“status.other”). These latter two variables were defined earlier from the “status” column of the data frame “prostateSurvival.T2”. Finally, the transition matrix is defined by “trans = tmat”. Note that the variables “survTime”, “grade”, and “ageGroup” from the “prostateSurvival.T2” data frame are available to us because we have previously attached it.

The output data set has twice as many rows as the original “prostateSurvival.T2” data frame. The first column, “id”, refers to the patient number in the original data; here, each is repeated twice. For our purposes, we can ignore the columns “from” and “to”. The column “trans” will be important, because it contains an indicator of the cause of death; here “1” refers to death from prostate cancer and “2” refers to death from other causes. The “Tstart” column contains all 0’s, since for our data, “time = 0” indicates the diagnosis with prostate cancer. We can ignore “Tstop”, and use the “time” column as the survival time and the “status” column as the censoring indicator. Note that for each patient, there are two entries for “status”. Both can be 0, or one can be 1 and the other 0; they can’t both be 1 because each patient can die of only one cause, not both. Finally, the last two columns are covariate columns we carried over from the original “prostateSurvival.T2” data frame. Each original value appears twice, since each patient has one covariate value, regardless of their cause of death.
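
The layout just described can be imitated in base R. The following is a sketch of the idea only, not the “msprep” implementation: the data frame “wide” and its three patients are hypothetical, and covariate columns are omitted.

```r
# Hypothetical wide data: status 0 = censored, 1 = prostate death, 2 = other death
wide <- data.frame(id = 1:3, survTime = c(5, 8, 12), status = c(1, 2, 0))

# One block of rows per cause k; "status" becomes the indicator for that cause
long <- do.call(rbind, lapply(1:2, function(k) {
  data.frame(id = wide$id, trans = k, Tstart = 0, time = wide$survTime,
             status = as.integer(wide$status == k))
}))
long <- long[order(long$id, long$trans), ]   # keep each patient's two rows together
rownames(long) <- NULL
long
```

Each patient contributes one row per cause, and at most one of the two “status” entries is 1, matching the description above.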

We may obtain a summary of the numbers of events of each type as follows:

#events(prostate.long)$Frequencies

These results indicate that there are 410 deaths due to prostate cancer, 1345 due to other causes, and 4165 censored observations, for 5920 total. (We may ignore the second and third rows, which are relevant only for multistate models.)

To show how to use our newly expanded data set, we can use it to reproduce our analysis from the previous section. To obtain these estimates of the effects of covariates on prostate-specific and other death causes, we use separate commands, one for “trans = 1” (prostate cancer) and the other for “trans = 2” (other causes of death), as follows:

#summary(coxph(Surv(time, status) ~ grade + ageGroup, data=prostate.long, subset={trans==1}))

#summary(coxph(Surv(time, status) ~ grade + ageGroup, data=prostate.long, subset={trans==2}))

The results (not shown) are identical to what we obtained before.

If we stratify on cause of death using “strata(trans)”, we get estimates of the covariate effects under the assumption that each covariate has the same effect on both causes of death,

#summary(coxph(Surv(time, status) ~ grade + ageGroup + strata(trans),
#              data=prostate.long))

“Results”

In this example, this model wouldn’t be appropriate, since we would expect that cancer grade affects prostate cancer death differently than it does death from other causes. To test this formally, we fit the following model:

#summary(coxph(Surv(time, status) ~ 
#                grade*factor(trans) + ageGroup + strata(trans),
#              data=prostate.long))

“Results”

The coefficient estimate 1.239 for “gradepoor” is the effect of grade on prostate cancer death, and is similar to the estimate we got earlier (1.220) for prostate cancer death alone. Here, however, we also have an estimate in the last row for the difference between the effect on prostate cancer death and death from other causes. This is the interaction between a grade of “poor” and cause “2” (other death). The estimate, -0.963, which is highly statistically significant, represents the additional effect of poor grade on risk of death from other causes relative to its effect on prostate cancer death. Specifically, the hazard ratio for poor versus moderate grade for death from other causes is exp(-0.963) ≈ 0.38 times the corresponding hazard ratio for death from prostate cancer.
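
The arithmetic behind the interaction term can be sketched using only the two coefficient estimates quoted above; the variable names are chosen here for illustration.

```r
# Coefficient estimates quoted in the text
b.grade <- 1.239     # effect of poor grade on prostate cancer death (log hazard ratio)
b.inter <- -0.963    # interaction: poor grade by cause 2 (other death)

b.other  <- b.grade + b.inter   # implied effect of poor grade on death from other causes
hr.prost <- exp(b.grade)        # hazard ratio for prostate cancer death
hr.other <- exp(b.other)        # hazard ratio for death from other causes
c(b.other = b.other, ratio.of.HRs = hr.other / hr.prost)
```

The implied log hazard ratio for other causes, 1.239 - 0.963 = 0.276, is close to the 0.281 obtained earlier from the separate model for death from other causes.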

We have determined that having a poor grade of prostate cancer strongly affects the risk of dying from prostate cancer, and this effect is much stronger on the risk of death from prostate cancer than on the risk of death from other causes. We may next ask how increasing age affects the risk of dying from prostate cancer and of other causes. Unsurprisingly, the trend is clear in both cases, as we have seen above. But is the effect any different on these two causes? We can answer this by examining the interaction between age group and cause of death as follows:

#summary(coxph(Surv(time, status) ~ 
#                (grade + ageGroup)*trans + ageGroup + strata(trans),
#              data=prostate.long))

“Results”

The results are in the last three rows of parameter estimates. None of these differences are statistically significant, so we find no evidence that the effect of age differs between the two causes of death, after adjusting for grade.

Competing risks and Multistate models II²⁵

Introduction

Standard survival data measure the time span from some time origin until the occurrence of one type of event. If several types of events occur, a model describing progression to each of these competing risks is needed. Multi-state models generalize competing risks models by also describing transitions to intermediate events. Methods to analyze such models have been developed over the last two decades. Fortunately, most of the analyses can be performed within the standard statistical packages, but may require some extra effort with respect to data preparation and programming. This tutorial aims to review statistical methods for the analysis of competing risks and multi-state models. Although some conceptual issues are covered, the emphasis is on practical issues like data preparation, estimation of the effect of covariates, and estimation of cumulative incidence functions and state and transition probabilities. Examples of analysis with standard software are shown.

Data preparation

The data used in Section 4 of the tutorial are 2204 patients transplanted at the EBMT between 1995 and 1998. These data are included in the mstate package.
The R documentation for these data (EBMT platelet recovery data, from the European Society for Blood and Marrow Transplantation) reads:
- Description: A data frame of 2204 patients transplanted at the EBMT between 1995 and 1998. These data were used in Section 4 of the tutorial on competing risks and multi-state models (Putter, Fiocco & Geskus, 2007). The included variables are
  - id: Patient identification number
  - prtime: Time in days from transplantation to platelet recovery or last follow-up
  - prstat: Platelet recovery status; 1 = platelet recovery, 0 = censored
  - rfstime: Time in days from transplantation to relapse or death or last follow-up (relapse-free survival time)
  - rfsstat: Relapse-free survival status; 1 = relapsed or dead, 0 = censored
  - dissub: Disease subclassification; factor with levels “AML”, “ALL”, “CML”
  - age: Patient age at transplant; factor with levels “<=20”, “20-40”, “>40”
  - drmatch: Donor-recipient gender match; factor with levels “No gender mismatch”, “Gender mismatch”
  - tcd: T-cell depletion; factor with levels “No TCD”, “TCD”
- Source: We acknowledge the European Society for Blood and Marrow Transplantation (EBMT) for making available these data. Disclaimer: these data were simplified for the purpose of illustration of the analysis of competing risks and multi-state models and do not reflect any real life situation. No clinical conclusions should be drawn from these data.
- References: Putter H, Fiocco M, Geskus RB (2007). Tutorial in biostatistics: Competing risks and multi-state models. Statistics in Medicine 26, 2389-2430.

library(survival)
library(dplyr)
library(mstate)

data(ebmt3)
head(ebmt3)

##   id prtime prstat rfstime rfsstat dissub   age            drmatch    tcd
## 1  1     23      1     744       0    CML   >40    Gender mismatch No TCD
## 2  2     35      1     360       1    CML   >40 No gender mismatch No TCD
## 3  3     26      1     135       1    CML   >40 No gender mismatch No TCD
## 4  4     22      1     995       0    AML 20-40 No gender mismatch No TCD
## 5  5     29      1     422       1    AML 20-40 No gender mismatch No TCD
## 6  6     38      1     119       1    ALL   >40 No gender mismatch No TCD

#help(ebmt3)

Let us first have a look at the distribution of the covariates, starting with disease subclassification:

n <- nrow(ebmt3)
table(ebmt3$dissub)

## 
## AML ALL CML 
## 853 447 904

round(100 * table(ebmt3$dissub)/n)

## 
## AML ALL CML 
##  39  20  41

table(ebmt3$age)

## 
##  <=20 20-40   >40 
##   419  1057   728

round(100 * table(ebmt3$age)/n)

## 
##  <=20 20-40   >40 
##    19    48    33

table(ebmt3$drmatch)

## 
## No gender mismatch    Gender mismatch 
##               1648                556

round(100 * table(ebmt3$drmatch)/n)

## 
## No gender mismatch    Gender mismatch 
##                 75                 25

table(ebmt3$tcd)

## 
## No TCD    TCD 
##   1928    276

round(100 * table(ebmt3$tcd)/n)

## 
## No TCD    TCD 
##     87     13

The first step in a multi-state model analysis is to set up the transition matrix.

The European Society for Blood and Marrow Transplantation (EBMT) illness-death model

The transition matrix specifies which direct transitions are possible (entries with NA denote impossible transitions) and assigns numbers to the transitions for future reference. This can be done explicitly:

tmat <- matrix(NA, 3, 3)
tmat[1, 2:3] <- 1:2
tmat[2, 3] <- 3
dimnames(tmat) <- list(from = c("Tx", "PR", "RelDeath"), 
                       to = c("Tx", "PR", "RelDeath"))
tmat

##           to
## from       Tx PR RelDeath
##   Tx       NA  1        2
##   PR       NA NA        3
##   RelDeath NA NA       NA

Steven McKinney has kindly provided a convenient function transMat to define transition matrices. The same transition matrix may be constructed as follows.

tmat <- transMat(x = list(c(2, 3), c(3), c()), names = c("Tx", "PR", "RelDeath"))
tmat

##           to
## from       Tx PR RelDeath
##   Tx       NA  1        2
##   PR       NA NA        3
##   RelDeath NA NA       NA

For common multi-state models, such as the illness-death model and competing risks models, there is a built-in function to obtain these transition matrices more easily.

tmat <- trans.illdeath(names = c("Tx", "PR", "RelDeath"))
tmat

##           to
## from       Tx PR RelDeath
##   Tx       NA  1        2
##   PR       NA NA        3
##   RelDeath NA NA       NA

The function paths can be used to give a list of all possible paths through the multi-state model. This function should not be used for transition matrices specifying a multi-state model with loops, since there will be infinitely many paths. At the moment there is no check for the presence of loops, but this will be included shortly.

paths(tmat)

##      [,1] [,2] [,3]
## [1,]    1   NA   NA
## [2,]    1    2   NA
## [3,]    1    2    3
## [4,]    1    3   NA

Time in the ebmt3 data is reported in days; before doing any analysis, we first convert this to years.

ebmt3$prtime <- ebmt3$prtime/365.25
ebmt3$rfstime <- ebmt3$rfstime/365.25

In order to prepare data in long format, we specify the names of the covariates that we are interested in modeling. Note that I am adding prtime, which is not really a covariate but specifies the time of platelet recovery. The purpose of this will become clear later. The specified covariates are to be retained in the dataset in long format (this is the argument keep), which we are going to call msbmt. For the original dataset ebmt3, each row corresponds to a single patient. For the long format data msbmt, each row will correspond to a transition for which a patient is at risk.

covs <- c("dissub", "age", "drmatch", "tcd", "prtime")
msbmt <- msprep(time = c(NA, "prtime", "rfstime"), 
                status = c(NA,"prstat", "rfsstat"), 
                data = ebmt3, 
                trans = tmat, 
                keep = covs)

The result is an S3 object of class msdata and data.frame. An msdata object is actually only a data frame with a trans attribute holding the transition matrix used to define it. A print method has been defined for msdata objects, which also prints the transition matrix if requested (set argument trans to TRUE, default is FALSE).

head(msbmt)

## An object of class 'msdata'
## 
## Data:
##   id from to trans     Tstart      Tstop       time status dissub age
## 1  1    1  2     1 0.00000000 0.06297057 0.06297057      1    CML >40
## 2  1    1  3     2 0.00000000 0.06297057 0.06297057      0    CML >40
## 3  1    2  3     3 0.06297057 2.03696099 1.97399042      0    CML >40
## 4  2    1  2     1 0.00000000 0.09582478 0.09582478      1    CML >40
## 5  2    1  3     2 0.00000000 0.09582478 0.09582478      0    CML >40
## 6  2    2  3     3 0.09582478 0.98562628 0.88980151      1    CML >40
##              drmatch    tcd     prtime
## 1    Gender mismatch No TCD 0.06297057
## 2    Gender mismatch No TCD 0.06297057
## 3    Gender mismatch No TCD 0.06297057
## 4 No gender mismatch No TCD 0.09582478
## 5 No gender mismatch No TCD 0.09582478
## 6 No gender mismatch No TCD 0.09582478
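
To see where these rows come from, the three rows of patient 1 can be rebuilt by hand from the raw values shown earlier in head(ebmt3) (platelet recovery at day 23, censored for relapse or death at day 744). This is a sketch in base R, not a description of how msprep works internally; the object name patient1 is chosen for illustration.

```r
# Patient 1 from head(ebmt3): prtime = 23 days (prstat = 1), rfstime = 744 days (rfsstat = 0)
prtime  <- 23 / 365.25      # days converted to years
rfstime <- 744 / 365.25
prstat  <- 1
rfsstat <- 0

patient1 <- data.frame(
  trans  = 1:3,
  Tstart = c(0, 0, prtime),               # transition 3 starts at platelet recovery
  Tstop  = c(prtime, prtime, rfstime),
  status = c(prstat, 0, rfsstat)          # PR occurred first, so transition 2 is censored
)
patient1$time <- patient1$Tstop - patient1$Tstart
patient1
```

The patient is at risk for transitions 1 and 2 from time 0 until platelet recovery, and for transition 3 only from that moment on, which is what the Tstart and Tstop columns encode.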

In the above call of msprep, the time and status arguments specify the column names in the data ebmt3 corresponding to the three states in the multi-state model. Since all the patients start in state 1 at time 0, the time and status arguments corresponding to the first state do not really have a value. In such cases, the corresponding elements of time and status may be given the value NA. An alternative way of specifying time and status (and keep as well) is as matrices of dimension $n \times S$ with $S$ the number of states (and $n \times p$ with $p$ the number of covariates for keep). The data argument doesn’t need to be specified then.
The number of events in the data can be summarized with the function events.

events(msbmt)

## $Frequencies
##           to
## from         Tx   PR RelDeath no event total entering
##   Tx          0 1169      458      577           2204
##   PR          0    0      383      786           1169
##   RelDeath    0    0        0      841            841
## 
## $Proportions
##           to
## from              Tx        PR  RelDeath  no event
##   Tx       0.0000000 0.5303993 0.2078040 0.2617967
##   PR       0.0000000 0.0000000 0.3276305 0.6723695
##   RelDeath 0.0000000 0.0000000 0.0000000 1.0000000

For regression purposes, we now add transition-specific covariates to the dataset. For a numerical covariate cov, the names of the expanded (transition-specific) covariates are cov.1, cov.2 etc. The extension .i refers to transition number i. First, we define these transition-specific covariates as a separate dataset, by setting append to FALSE.

expcovs <- expand.covs(msbmt, covs[2:3], append = FALSE)
head(expcovs)

##   age20.40.1 age20.40.2 age20.40.3 age.40.1 age.40.2 age.40.3
## 1          0          0          0        1        0        0
## 2          0          0          0        0        1        0
## 3          0          0          0        0        0        1
## 4          0          0          0        1        0        0
## 5          0          0          0        0        1        0
## 6          0          0          0        0        0        1
##   drmatchGender.mismatch.1 drmatchGender.mismatch.2 drmatchGender.mismatch.3
## 1                        1                        0                        0
## 2                        0                        1                        0
## 3                        0                        0                        1
## 4                        0                        0                        0
## 5                        0                        0                        0
## 6                        0                        0                        0
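
The pattern in this output can be reproduced in base R: a transition-specific covariate equals the original covariate on rows belonging to that transition and zero elsewhere. The sketch below illustrates the idea only and is not the expand.covs implementation; the data frame long and the column name mismatch are hypothetical.

```r
# Hypothetical long-format fragment: two patients, three transitions each
long <- data.frame(id = rep(1:2, each = 3), trans = rep(1:3, times = 2),
                   mismatch = rep(c(1, 0), each = 3))   # 1 = gender mismatch

# cov.i equals the covariate on rows of transition i and 0 elsewhere
for (i in 1:3) {
  long[[paste0("mismatch.", i)]] <- long$mismatch * (long$trans == i)
}
long
```

The three new columns show the same diagonal pattern as the drmatchGender.mismatch columns printed above for the first six rows.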

We see that this expanded covariates dataset is quite large, and that the covariate names are quite long. For categorical covariates, the default names of the expanded covariates are a combination of the covariate name, the level (similar to the names of the regression coefficients that you see in regression output), followed by the transition number, in such a way that the combination is allowed as column name. If these names are too long, the user may set the value of longnames (default=TRUE) to FALSE. In this case, the covariate name is followed by 1, 2 etc, before the transition number. In case of a covariate with only two levels, the covariate name is just followed by the transition number. Confident that this will work out, we also set append to TRUE (default), which will append the expanded covariates to the dataset.

msbmt <- expand.covs(msbmt, covs, append = TRUE, longnames = FALSE)
head(msbmt)

## An object of class 'msdata'
## 
## Data:
##   id from to trans     Tstart      Tstop       time status dissub age
## 1  1    1  2     1 0.00000000 0.06297057 0.06297057      1    CML >40
## 2  1    1  3     2 0.00000000 0.06297057 0.06297057      0    CML >40
## 3  1    2  3     3 0.06297057 2.03696099 1.97399042      0    CML >40
## 4  2    1  2     1 0.00000000 0.09582478 0.09582478      1    CML >40
## 5  2    1  3     2 0.00000000 0.09582478 0.09582478      0    CML >40
## 6  2    2  3     3 0.09582478 0.98562628 0.88980151      1    CML >40
##              drmatch    tcd     prtime dissub1.1 dissub1.2 dissub1.3 dissub2.1
## 1    Gender mismatch No TCD 0.06297057         0         0         0         1
## 2    Gender mismatch No TCD 0.06297057         0         0         0         0
## 3    Gender mismatch No TCD 0.06297057         0         0         0         0
## 4 No gender mismatch No TCD 0.09582478         0         0         0         1
## 5 No gender mismatch No TCD 0.09582478         0         0         0         0
## 6 No gender mismatch No TCD 0.09582478         0         0         0         0
##   dissub2.2 dissub2.3 age1.1 age1.2 age1.3 age2.1 age2.2 age2.3 drmatch.1
## 1         0         0      0      0      0      1      0      0         1
## 2         1         0      0      0      0      0      1      0         0
## 3         0         1      0      0      0      0      0      1         0
## 4         0         0      0      0      0      1      0      0         0
## 5         1         0      0      0      0      0      1      0         0
## 6         0         1      0      0      0      0      0      1         0
##   drmatch.2 drmatch.3 tcd.1 tcd.2 tcd.3   prtime.1   prtime.2   prtime.3
## 1         0         0     0     0     0 0.06297057 0.00000000 0.00000000
## 2         1         0     0     0     0 0.00000000 0.06297057 0.00000000
## 3         0         1     0     0     0 0.00000000 0.00000000 0.06297057
## 4         0         0     0     0     0 0.09582478 0.00000000 0.00000000
## 5         0         0     0     0     0 0.00000000 0.09582478 0.00000000
## 6         0         0     0     0     0 0.00000000 0.00000000 0.09582478

The names are indeed quite a bit shorter. The downside, however, is that we need to remember for ourselves to which category, for instance, the number 1 in age1.2 corresponds (age 20-40, with $\le$ 20 as reference category).
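To make the naming scheme concrete, the following base R sketch rebuilds the short-name age columns by hand for the three rows of a single patient aged >40. It is only an illustration of the idea (dummy variable times transition indicator) and not the code that expand.covs runs; the toy data frame d is an assumption made for this example.

```r
# Toy long-format data: one patient, one row per transition (illustrative only)
d <- data.frame(
  trans = 1:3,
  age   = factor(rep(">40", 3), levels = c("<=20", "20-40", ">40"))
)

# Treatment-contrast dummies for age; the reference level "<=20" is dropped,
# so column 1 is "20-40" (age1) and column 2 is ">40" (age2)
dummies <- model.matrix(~ age, data = d)[, -1, drop = FALSE]

# A transition-specific covariate is the dummy multiplied by the
# indicator of the transition the row belongs to
for (k in 1:3) {
  for (j in seq_len(ncol(dummies))) {
    d[[paste0("age", j, ".", k)]] <- dummies[, j] * (d$trans == k)
  }
}
d
```

For this patient, age2.1, age2.2 and age2.3 are each 1 in exactly one row and all age1 columns are 0, which matches the pattern in the first three rows of the output above.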
In the disease/recovery process, often more than one type of event plays a role. Usually, one type of event can be singled out as the event of interest. The other event types may prevent the event of interest from occurring. Leukaemia relapse or AIDS may be unobservable because the person died before the diagnosis of these events. Caution is needed in estimating the probability of the event of interest occurring in the presence of these so-called competing risks. Treating the events of the competing causes as censored observations will lead to a bias in the Kaplan-Meier estimate if one of the fundamental assumptions underlying the Kaplan-Meier estimator is violated: the assumption of independence of the time to event and the censoring distributions. The Cox proportional hazards model can still be used, but the interpretation of the results is different. This will be outlined in some detail in Section 3.
In other situations, another event may substantially change the risk of the event of interest occurring. If one is only interested in the event of interest as a first event, the other event can still be seen as competing. Often, one is also interested in what happens after the first non-fatal event. Then intermediate event types provide more detailed information on the disease/recovery process and allow for more precision in predicting the prognosis of patients. For a leukaemia patient, if the event of interest is death, then relapse is an intermediate event that is worth modelling and that does not prevent death. Such non-fatal events during the disease course can be seen as transitions from one state to another. The time origin is characterized by a transition into an initial, transient, state, such as the start of treatment; the endpoint is an ‘absorbing’ final transition. Instead of survival data or time-to-event data, data on the history of events is available. Multi-state models provide a framework that allows for the analysis of such event history data. They are an extension of competing risk models, since they extend the analysis to what happens after the first event. Multi-state models are the subject of Section 4.
Several of the ideas presented in the sections on competing risks and multi-state models can also be found in Reference [1]. For more information on competing risks and multi-state models we refer to the relevant chapters in the textbooks [2-7]. A recent issue of Statistical Methods in Medical Research, entirely devoted to multi-state models, is also of interest; see e.g. References [1, 8, 9].
This tutorial reviews statistical methods for the analysis of competing risks and multi-state models. Fortunately, the theory that has been developed over the past two decades for the analysis of right censored survival data can be applied to competing risks and multi-state models as well, and often most of the analyses can be performed within the standard statistical packages, although they may require some extra effort with respect to data preparation and programming. Section 2 introduces background and notation needed for the sequel of the paper and discusses the implications of the (lack of) independence between the censoring and time-to-event distributions. Sections 3 and 4 discuss competing risks and multi-state models respectively. Each of these sections is concluded with a subsection on available software. We illustrate estimation and modelling aspects of competing risks and multi-state models using the statistical package R [10]. The full code for the analyses performed in this tutorial as well as the data used are available at http://www.msbi.nl/multistate.

Multi-state models

The class of multi-state models forms an extension to that of competing risks models. Competing risks models deal with one initial state and several mutually exclusive absorbing states. Typically, the disease or recovery process of a patient will also consist of intermediate events that can neither be classified as initial states nor as final states. For example, in many cancer studies, after surgery of the primary tumour, the tumour may recur in the vicinity of the primary tumour (local recurrence), or at distant locations (distant metastasis). These events may occur in any order (although local recurrence usually precedes distant metastasis) and patients may die before or after experiencing local recurrence or distant metastasis.

A multistate model for breast cancer

Another example is that HIV infected individuals may develop AIDS, but may also experience a switch to SI phenotype. If the SI switch occurs first, it may change the risk to progress to AIDS.
This is an example of a special class of multi-state models, called illness-death models. In this class of models individuals start out as healthy; this initial state will be denoted by state 1. They may become ill (move to state 2) and afterwards they may die (state 3). In principle they may also recover from their illness and become healthy again, i.e. move back to state 1. If this is possible the model is called a bi-directional illness-death model. Individuals may also die without first becoming ill (this is a direct transition from state 1 to state 3).

The illness-death model

Preliminaries

Typically, a multi-state model contains one initial state, which we will assign the number 1. In the above examples, this state is entered at the moment of surgery for cancer, bone marrow transplantation and HIV infection respectively. Some states represent an endpoint; when a patient enters such a state, he or she will remain there or one is not interested in what happens after this state has been reached. We call these states final or absorbing states (the latter terminology comes from the theory of Markov chains and processes). The absorbing states in our examples are death (in the cancer example), relapse and death (BMT), AIDS (HIV/AIDS). States that are neither initial nor absorbing states are called intermediate or transient states (again borrowed from Markov chain theory); strictly speaking, the initial state is also transient.
Transitions are represented by arrows going from one state to another. When we assign numbers to all states, we represent a transition from state $i$ to state $j$ by $i \to j$. If $T$ denotes the time of reaching state $j$ from state $i$, we denote the hazard rate (transition intensity) of the $i \to j$ transition by \[h_{ij}(t)=\lim_{\Delta t \to 0} \frac{P(t \le T < t+\Delta t \mid T\ge t)}{\Delta t}\]
We define the cumulative hazard for transition $i \to j$ by \[H_{ij}(t)=\int_{0}^{t} h_{ij}(s)\,ds\]
In the above definition, the question remains: what is $t$, or more precisely, what is the time scale to which $t$ refers? Two approaches are in frequent use, which we shall denote here by the ‘clock forward’ and ‘clock reset’ approaches.
Clock forward: Time $t$ refers to the time since the patient entered the initial state. The clock keeps moving forward for the patient, also when intermediate events occur.
Clock reset: Time $t$ in $h_{ij}(t)$ refers to the time since entry in state $i$, also called backward recurrence time. The clock is reset to 0 each time the patient enters a new state.
The difference between the two approaches is illustrated in the Figure.

Illustration of the ‘clock forward’ and ‘clock reset’ approach.

The upper half shows the dates of surgery and subsequent events for a cancer patient. At 13 May 2005, the patient is still alive. The lower picture shows the patient time-scale, first in the ‘clock forward’ approach, where time is measured from date of surgery, then in the ‘clock reset’ approach, where time intervals between state visits are recorded. In both instances the patient is censored for the last event, due to the end of follow-up.
A property that is often assumed in practice is that the multi-state model is a Markov model. Loosely speaking, the Markov property states that the future depends on the history only through the present. For a multi-state model this means that, given the present state and the event history of a patient, the next state to be visited and the time at which this will occur will only depend on the present state. Strictly speaking, only ‘clock forward’ models can be Markov models; for ‘clock reset’ models the Markov property cannot hold since the time scale itself depends on the history through the time since the current state was reached. However, if it is assumed that the sojourn times depend on the history of the process only through the present state and the time since entry of that state, the resulting multi-state model forms a sequence of embedded Markov models, called a Markov renewal model or also a semi-Markov model.

Data Manipulation

Data structure: long format

The data contains a patient identification column patid and a transition column, as well as a from and a to column specifying in which state the transition starts and in which state it ends. Furthermore, it contains a start and a stop time to indicate when the patient started and stopped being at risk for that transition, and a status to denote whether or not (1 and 0, respectively) the patient reached the to state. Patients 1 and 2 are represented by three rows each, one for each of the transitions going out from state 1. Patient 1 has status = 0 for each of these transitions; patient 2 has status = 1 only for the $1 \to 5$ transition (surgery to death). Patient 3 has these same three initial rows as well. After a local recurrence (status = 1 for the $1 \to 2$ transition), two more rows are added, corresponding to the two transitions ($2 \to 4$ and $2 \to 5$) going out from state 2. The start time for these transitions is 2.25, the stop time is 6.75. This is an example of delayed entry or left truncation (Section 2); patient 3 becomes at risk for the transitions $2 \to 4$ and $2 \to 5$ after 2.25 years. The variable status has value 0 for the $2 \to 5$ transition and 1 for the $2 \to 4$ transition. One final row is added after the patient has reached state 4 (local recurrence and distant metastasis). The only transition going out from state 4 is the $4 \to 5$ transition. The start time is 6.75 and the stop time is 11.36 years, the end of follow-up of that patient. Since the patient is still alive (censored), status = 0 for that row. One column time is added for modelling the ‘clock reset’ approach; it is simply defined by time = stop - start. If time-fixed covariates are also recorded, the values are simply replicated for each row corresponding to the same patient.
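As a concrete illustration, the six rows described above for patient 3 can be written out as a small data frame. This is a sketch built from the description only: the third transition out of state 1 is assumed to be $1 \to 3$ (distant metastasis), and the transition column is omitted because its numbering is not given here.

```r
# Long-format rows for patient 3, following the description in the text.
# Assumption: the third transition out of state 1 is 1 -> 3.
p3 <- data.frame(
  patid  = 3,
  from   = c(1, 1, 1, 2, 2, 4),
  to     = c(2, 3, 5, 4, 5, 5),
  start  = c(0, 0, 0, 2.25, 2.25, 6.75),
  stop   = c(2.25, 2.25, 2.25, 6.75, 6.75, 11.36),
  status = c(1, 0, 0, 1, 0, 0)
)

# 'Clock reset' time scale: time spent at risk since entering the from state
p3$time <- p3$stop - p3$start
p3
```

In the ‘clock forward’ approach the rows for the $2 \to 4$ and $2 \to 5$ transitions enter the risk set at 2.25 years (delayed entry), whereas in the ‘clock reset’ approach the same rows contribute a time at risk of 4.5 years starting from 0.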

Estimation

We will illustrate estimation of the effect of prognostic factors on the transition rates in multi-state models, using the simplest non-trivial multi-state model, the illness-death model. Some aspects that play a role and that we will try to cover here are:

- which baseline hazards (for the different transitions) to choose proportional;
- whether to use the ‘clock forward’ or ‘clock reset’ approach;
- whether to use a (semi-)Markov or a state arrival extended (semi-)Markov model.

We will use data from the European Blood and Marrow Transplant registry (EBMT) for illustration in this and the next subsection. The data consists of 2204 patients in this registry who received bone marrow transplantation between 1995 and 1998, and who had complete information on the prognostic factors considered here. These factors are summarized in the table below.

The European Society for Blood and Marrow Transplantation (EBMT) illness-death model

The multi-state model that we shall use for illustration here and in the next subsection is the bone marrow transplantation illness-death model. Here, the ‘illness’ state corresponds to platelet recovery and ‘death’ corresponds to relapse or death. The model is illustrated in Figure 13 along with the number of events. We can see that for 1169 of 2204 patients (53 per cent), platelet levels returned to normal levels; 383 of these 1169 (33 per cent) subsequently relapsed or died, the remaining 786 (67 per cent) did not relapse or die after platelet recovery. There were 458 patients (21 per cent) that relapsed or died without platelet recovery prior to relapse or death. Finally, 577 (26 per cent) of all 2204 patients did not experience any event in our data.
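The counts and percentages quoted above can be checked with a few lines of R. The variable names below are chosen for this sketch only; the numbers are those given in the text.

```r
# Event counts as quoted in the text
n_total    <- 2204   # all patients
n_pr       <- 1169   # platelet recovery
n_pr_event <- 383    # relapse or death after platelet recovery
n_direct   <- 458    # relapse or death without prior platelet recovery
n_none     <- 577    # no event at all

# The three first-event groups must add up to the total
stopifnot(n_pr + n_direct + n_none == n_total)

pct <- c(
  recovered         = 100 * n_pr / n_total,
  event_after_pr    = 100 * n_pr_event / n_pr,
  no_event_after_pr = 100 * (n_pr - n_pr_event) / n_pr,
  event_without_pr  = 100 * n_direct / n_total,
  no_event          = 100 * n_none / n_total
)
round(pct)
```

Note that the second and third percentages use the 1169 patients with platelet recovery as denominator, while the others use all 2204 patients.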
In Cox proportional hazards notation, the model for the $i \to j$ transition is \[h_{ij}(t \mid x)=h_{ij,0}(t)\exp(\beta_{ij}^{\top} x)\] where $h_{ij,0}(t)$ is the baseline hazard of the transition and $\beta_{ij}$ is the vector of transition-specific regression coefficients.
- For estimation in the $1 \to 2$ transition, it suffices, in long format, to select only the rows corresponding to the $1 \to 2$ transition and to fit a Cox regression to the selected data.
- For estimation of regression parameters for the $2 \to 3$ transition (platelet recovery → relapse or death), it is important to realize that patients are at risk only after entering state 2 (delayed entry).
- The estimates of $\beta_{ij}$, their standard errors and $p$-values are reported in the ‘Markov stratified hazards’ column of the table below. The most important findings are the higher relapse/death rates for older patients (particularly older than 40), both before and after platelet recovery, the lower platelet recovery rate for CML patients, and the increased platelet recovery rate for patients receiving T-cell depletion.
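The first point above can be illustrated on simulated data: fitting a Cox model to the rows of one transition only gives the same coefficient as a single model on all rows with transition-specific covariates and separate baseline hazards per transition. The sketch below uses only the survival package; the data, the covariate x and the names x.1 and x.2 are made up for the illustration and have nothing to do with the EBMT data.

```r
library(survival)

set.seed(1)
n <- 200
x <- rbinom(n, 1, 0.5)                       # hypothetical binary covariate

# Two competing transitions out of the initial state, administrative censoring at 3
t12  <- rexp(n, rate = 0.5 * exp(0.5 * x))
t13  <- rexp(n, rate = 0.3 * exp(-0.3 * x))
cens <- 3
t1   <- pmin(t12, t13, cens)
s12  <- as.numeric(t12 <= pmin(t13, cens))
s13  <- as.numeric(t13 < t12 & t13 <= cens)

# Long format: one row per patient per transition
long <- rbind(
  data.frame(id = 1:n, trans = 1, Tstart = 0, Tstop = t1, status = s12, x = x),
  data.frame(id = 1:n, trans = 2, Tstart = 0, Tstop = t1, status = s13, x = x)
)

# Transition-specific covariates
long$x.1 <- long$x * (long$trans == 1)
long$x.2 <- long$x * (long$trans == 2)

# One stratified model on all rows ...
fit_strat <- coxph(Surv(Tstart, Tstop, status) ~ x.1 + x.2 + strata(trans),
                   data = long, method = "breslow")
# ... versus a model on the rows of transition 1 only
fit_sub <- coxph(Surv(Tstart, Tstop, status) ~ x,
                 data = subset(long, trans == 1), method = "breslow")

c(stratified = unname(coef(fit_strat)["x.1"]),
  subset     = unname(coef(fit_sub)["x"]))
```

The two estimates agree up to numerical tolerance because the stratified partial likelihood factorizes over transitions. The single-model formulation becomes useful once baseline hazards or covariate effects are shared between transitions.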

Parameter estimates in different models; ‘clock forward’ approach

R: clock-forward models

After having prepared the data in long format, estimation of covariate effects using Cox regression is straightforward using the coxph function of the survival package. This is not at all a feature of the mstate package, other than that msprep has facilitated preparation of the data. Let us consider the Markov model, where we assume different effects of the covariates for different transitions; hence we use the transition-specific covariates obtained by expand.covs. The delayed entry aspect of this model for transition 3 (see discussion in the tutorial) is achieved by specifying Surv(Tstart,Tstop,status), where (this is reflected in the long format data) Tstart is the time of entry in the state, and Tstop the event or censoring time, depending on the value of status. We consider first the model without any proportionality assumption on the baseline hazards; this is achieved by adding strata(trans) to the formula, which estimates separate baseline hazards for different values of trans (the transitions). The results appear in the left column of Table III of the tutorial.

c1 <- coxph(Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 +
                age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
                age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
                age1.3 + age2.3 + drmatch.3 + tcd.3 + strata(trans), data = msbmt,
                method = "breslow")
c1

## Call:
## coxph(formula = Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 + 
##     age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + 
##     age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + 
##     age1.3 + age2.3 + drmatch.3 + tcd.3 + strata(trans), data = msbmt, 
##     method = "breslow")
## 
##               coef exp(coef) se(coef)      z        p
## dissub1.1 -0.04359   0.95734  0.07789 -0.560 0.575698
## dissub2.1 -0.29724   0.74287  0.06800 -4.371 1.23e-05
## age1.1    -0.16461   0.84822  0.07905 -2.082 0.037317
## age2.1    -0.08979   0.91412  0.08647 -1.038 0.299075
## drmatch.1  0.04575   1.04681  0.06660  0.687 0.492127
## tcd.1      0.42907   1.53583  0.08043  5.335 9.57e-08
## dissub1.2  0.25589   1.29161  0.13520  1.893 0.058411
## dissub2.2  0.01675   1.01689  0.10838  0.155 0.877188
## age1.2     0.25516   1.29067  0.15103  1.689 0.091127
## age2.2     0.52649   1.69298  0.15790  3.334 0.000855
## drmatch.2 -0.07525   0.92751  0.11028 -0.682 0.495006
## tcd.2      0.29673   1.34545  0.15007  1.977 0.048006
## dissub1.3  0.13646   1.14621  0.14804  0.922 0.356634
## dissub2.3  0.24692   1.28007  0.11685  2.113 0.034596
## age1.3     0.06156   1.06350  0.15343  0.401 0.688239
## age2.3     0.58075   1.78737  0.16014  3.627 0.000287
## drmatch.3  0.17280   1.18863  0.11452  1.509 0.131315
## tcd.3      0.20088   1.22248  0.12636  1.590 0.111873
## 
## Likelihood ratio test=117.7  on 18 df, p=< 2.2e-16
## n= 5577, number of events= 2010

The next model considered is the Markov model where the transition hazards into relapse or death (these correspond to transitions 2 and 3) are assumed to be proportional. For this purpose transition 1 (transplantation -> platelet recovery) belongs to one stratum and transitions 2 (transplantation -> relapse/death) and 3 (platelet recovery -> relapse/death) belong to a second stratum. Transitions 2 and 3 have the same receiving state, hence the same value of to, so the two strata can be distinguished by the variable to in our dataset. In order to distinguish between transitions 2 and 3, we introduce a time-dependent covariate pr that indicates whether or not platelet recovery has already occurred. For transition 2 (Tx -> RelDeath) the value of pr equals 0, while for transition 3 (PR -> RelDeath) the value of pr equals 1. Results are found in the middle of Table III of the tutorial.

msbmt$pr <- 0
msbmt$pr[msbmt$trans == 3] <- 1
c2 <- coxph(Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 +
                age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
                age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
                age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + strata(to), data = msbmt,
                method = "breslow")
c2

## Call:
## coxph(formula = Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 + 
##     age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + 
##     age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + 
##     age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + strata(to), data = msbmt, 
##     method = "breslow")
## 
##                coef exp(coef)  se(coef)      z        p
## dissub1.1 -0.043592  0.957345  0.077887 -0.560 0.575698
## dissub2.1 -0.297240  0.742866  0.067996 -4.371 1.23e-05
## age1.1    -0.164613  0.848222  0.079054 -2.082 0.037317
## age2.1    -0.089790  0.914123  0.086468 -1.038 0.299075
## drmatch.1  0.045751  1.046814  0.066602  0.687 0.492127
## tcd.1      0.429071  1.535831  0.080432  5.335 9.57e-08
## dissub1.2  0.260968  1.298186  0.135182  1.930 0.053546
## dissub2.2  0.003637  1.003644  0.108368  0.034 0.973226
## age1.2     0.250894  1.285174  0.151057  1.661 0.096727
## age2.2     0.525790  1.691796  0.157895  3.330 0.000868
## drmatch.2 -0.072067  0.930469  0.110260 -0.654 0.513364
## tcd.2      0.318537  1.375114  0.149970  2.124 0.033669
## dissub1.3  0.139811  1.150056  0.147981  0.945 0.344767
## dissub2.3  0.250328  1.284447  0.116788  2.143 0.032078
## age1.3     0.055559  1.057131  0.153372  0.362 0.717166
## age2.3     0.562484  1.755027  0.159970  3.516 0.000438
## drmatch.3  0.169149  1.184297  0.114446  1.478 0.139414
## tcd.3      0.211029  1.234948  0.126198  1.672 0.094484
## pr        -0.378633  0.684797  0.211523 -1.790 0.073449
## 
## Likelihood ratio test=135.3  on 19 df, p=< 2.2e-16
## n= 5577, number of events= 2010
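
The pr row of this output can be reproduced by hand from the coefficient and its standard error. The sketch below copies the two numbers from the output above; the 95 per cent confidence interval is an addition that the output itself does not print.

```r
# Coefficient and standard error for pr, copied from the output above
b  <- -0.378633
se <- 0.211523

hr <- exp(b)                                   # hazard ratio
ci <- exp(b + c(-1, 1) * qnorm(0.975) * se)    # Wald 95% confidence interval
z  <- b / se                                   # Wald statistic
p  <- 2 * pnorm(-abs(z))                       # two-sided p-value

round(c(HR = hr, lower = ci[1], upper = ci[2], z = z, p = p), 3)
```

The interval contains 1, in line with a p-value just above 0.05.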

For a discussion of the results we again refer to the tutorial. The hazard ratio of pr (0.685) and its $p$-value (0.073) suggest a beneficial effect of platelet recovery on relapse-free survival, although the effect does not reach significance at the 5 per cent level. Later on we will look at the corresponding baseline transition intensities for these two models and see as a graphical check that the assumption of proportionality of the baseline hazards for transitions 2 and 3 is reasonable. This can also be tested formally using the function cox.zph (part of the survival package, not of mstate).

cox.zph(c2)

##              chisq df       p
## dissub1.1 2.46e+01  1 6.9e-07
## dissub2.1 9.68e+00  1 0.00187
## age1.1    1.05e-01  1 0.74633
## age2.1    6.48e+00  1 0.01092
## drmatch.1 6.99e+00  1 0.00821
## tcd.1     1.41e+01  1 0.00017
## dissub1.2 5.43e+00  1 0.01975
## dissub2.2 4.43e+00  1 0.03535
## age1.2    4.79e+00  1 0.02863
## age2.2    1.46e+00  1 0.22647
## drmatch.2 1.12e-01  1 0.73759
## tcd.2     1.07e+00  1 0.30179
## dissub1.3 4.93e-05  1 0.99440
## dissub2.3 2.41e+01  1 9.4e-07
## age1.3    2.64e+00  1 0.10394
## age2.3    6.80e+00  1 0.00913
## drmatch.3 4.65e+00  1 0.03109
## tcd.3     1.83e+01  1 1.9e-05
## pr        1.64e+01  1 5.2e-05
## GLOBAL    1.17e+02 19 4.8e-16

The reference analysis found no evidence of non-proportionality of the baseline transition intensities of transitions 2 and 3 (p=0.496 for pr). In the output shown above, however, the test for pr gives p=5.2e-05, which does point to non-proportionality; the difference presumably reflects a different version of cox.zph. There is strong evidence that the proportional hazards assumption for dissub2 (CML vs AML) is violated, at least for the transitions into relapse and death. This makes sense clinically, since CML and AML are two diseases with completely different biological pathways. It would have been much better to study separate multi-state models for the three disease subclassifications. However, since the purpose of this manuscript is to illustrate the use of mstate, we will blatantly ignore the clear evidence of non-proportionality for the disease subclassifications.
Building on the Markov PH model, we can investigate whether the time at which a patient arrived in state 2 (PR) influences the subsequent RFS rate, that is, the transition hazard of PR -> RelDeath. Here the purpose of expanding prtime becomes apparent. Since prtime only makes sense for transition 3 (PR -> RelDeath), we need the transition-specific covariate of prtime for transition 3, which is prtime.3. The corresponding model is termed the “state arrival extended Markov PH” model in the tutorial, and appears on the right of Table III.

c3 <- coxph(Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 +
                age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 +
                age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 +
                age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + prtime.3 + strata(to),
                data = msbmt, method = "breslow")
c3

## Call:
## coxph(formula = Surv(Tstart, Tstop, status) ~ dissub1.1 + dissub2.1 + 
##     age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + 
##     age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + 
##     age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + prtime.3 + strata(to), 
##     data = msbmt, method = "breslow")
## 
##                coef exp(coef)  se(coef)      z        p
## dissub1.1 -0.043592  0.957345  0.077887 -0.560 0.575698
## dissub2.1 -0.297240  0.742866  0.067996 -4.371 1.23e-05
## age1.1    -0.164613  0.848222  0.079054 -2.082 0.037317
## age2.1    -0.089790  0.914123  0.086468 -1.038 0.299075
## drmatch.1  0.045751  1.046814  0.066602  0.687 0.492127
## tcd.1      0.429071  1.535831  0.080432  5.335 9.57e-08
## dissub1.2  0.260899  1.298097  0.135182  1.930 0.053609
## dissub2.2  0.003761  1.003768  0.108368  0.035 0.972315
## age1.2     0.250952  1.285248  0.151056  1.661 0.096649
## age2.2     0.525772  1.691764  0.157894  3.330 0.000869
## drmatch.2 -0.072088  0.930449  0.110260 -0.654 0.513238
## tcd.2      0.318238  1.374703  0.149971  2.122 0.033838
## dissub1.3  0.132021  1.141132  0.148849  0.887 0.375109
## dissub2.3  0.251811  1.286353  0.116823  2.155 0.031123
## age1.3     0.058227  1.059956  0.153426  0.380 0.704306
## age2.3     0.565752  1.760771  0.160011  3.536 0.000407
## drmatch.3  0.166817  1.181538  0.114556  1.456 0.145334
## tcd.3      0.207404  1.230480  0.126431  1.640 0.100911
## pr        -0.406872  0.665729  0.219075 -1.857 0.063279
## prtime.3   0.295226  1.343430  0.594952  0.496 0.619741
## 
## Likelihood ratio test=135.5  on 20 df, p=< 2.2e-16
## n= 5577, number of events= 2010

R: clock-reset models

Parameter estimates in different models; ‘clock reset’ approach

The clock-reset models may be obtained very similarly to those of the clock-forward models. The only difference is that Surv(Tstart,Tstop,status) is replaced by Surv(time,status).

c4 <- coxph(Surv(time, status) ~ dissub1.1 + dissub2.1 + age1.1 +
                age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + age1.2 +
                age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + age1.3 +
                age2.3 + drmatch.3 + tcd.3 + strata(trans), data = msbmt,
                method = "breslow")
c4

## Call:
## coxph(formula = Surv(time, status) ~ dissub1.1 + dissub2.1 + 
##     age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + 
##     age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + 
##     age1.3 + age2.3 + drmatch.3 + tcd.3 + strata(trans), data = msbmt, 
##     method = "breslow")
## 
##               coef exp(coef) se(coef)      z        p
## dissub1.1 -0.04359   0.95734  0.07789 -0.560 0.575698
## dissub2.1 -0.29724   0.74287  0.06800 -4.371 1.23e-05
## age1.1    -0.16461   0.84822  0.07905 -2.082 0.037317
## age2.1    -0.08979   0.91412  0.08647 -1.038 0.299075
## drmatch.1  0.04575   1.04681  0.06660  0.687 0.492127
## tcd.1      0.42907   1.53583  0.08043  5.335 9.57e-08
## dissub1.2  0.25589   1.29161  0.13520  1.893 0.058411
## dissub2.2  0.01675   1.01689  0.10838  0.155 0.877188
## age1.2     0.25516   1.29067  0.15103  1.689 0.091127
## age2.2     0.52649   1.69298  0.15790  3.334 0.000855
## drmatch.2 -0.07525   0.92751  0.11028 -0.682 0.495006
## tcd.2      0.29673   1.34545  0.15007  1.977 0.048006
## dissub1.3  0.12026   1.12779  0.14793  0.813 0.416269
## dissub2.3  0.25245   1.28717  0.11685  2.160 0.030737
## age1.3     0.06541   1.06760  0.15338  0.426 0.669773
## age2.3     0.58154   1.78880  0.16002  3.634 0.000279
## drmatch.3  0.16974   1.18499  0.11453  1.482 0.138341
## tcd.3      0.19676   1.21745  0.12633  1.557 0.119365
## 
## Likelihood ratio test=118.1  on 18 df, p=< 2.2e-16
## n= 5577, number of events= 2010

In the clock-forward model c3, the influence of the time at which platelet recovery occurred seems small and is not significant (p=0.62 for prtime.3, last row of that output).
The use of Surv(time,status) reflects the fact (recall that in our long format data each row corresponds to a transition) that for each transition the time starts at 0, rather than at Tstart, the time since start of study at which the state has been entered. The clock-reset counterparts of the Markov PH model (c5) and of the state arrival extended Markov PH model (c6) are fitted in the same way as c4; in c6 the effect of prtime.3 is again not significant (p=0.26).

c5 <- coxph(Surv(time, status) ~ dissub1.1 + dissub2.1 + age1.1 +
                age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + age1.2 +
                age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + age1.3 +
                age2.3 + drmatch.3 + tcd.3 + pr + strata(to), data = msbmt,
                method = "breslow")
c5

## Call:
## coxph(formula = Surv(time, status) ~ dissub1.1 + dissub2.1 + 
##     age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + 
##     age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + 
##     age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + strata(to), data = msbmt, 
##     method = "breslow")
## 
##                coef exp(coef)  se(coef)      z        p
## dissub1.1 -0.043592  0.957345  0.077887 -0.560 0.575698
## dissub2.1 -0.297240  0.742866  0.067996 -4.371 1.23e-05
## age1.1    -0.164613  0.848222  0.079054 -2.082 0.037317
## age2.1    -0.089790  0.914123  0.086468 -1.038 0.299075
## drmatch.1  0.045751  1.046814  0.066602  0.687 0.492127
## tcd.1      0.429071  1.535831  0.080432  5.335 9.57e-08
## dissub1.2  0.258695  1.295239  0.135188  1.914 0.055672
## dissub2.2  0.008247  1.008281  0.108339  0.076 0.939324
## age1.2     0.252081  1.286701  0.151041  1.669 0.095126
## age2.2     0.527568  1.694805  0.157887  3.341 0.000833
## drmatch.2 -0.072862  0.929729  0.110261 -0.661 0.508733
## tcd.2      0.310010  1.363439  0.149921  2.068 0.038657
## dissub1.3  0.117420  1.124592  0.147863  0.794 0.427128
## dissub2.3  0.253025  1.287915  0.116781  2.167 0.030260
## age1.3     0.063925  1.066013  0.153279  0.417 0.676641
## age2.3     0.574319  1.775920  0.159857  3.593 0.000327
## drmatch.3  0.164298  1.178565  0.114469  1.435 0.151199
## tcd.3      0.200781  1.222357  0.126189  1.591 0.111583
## pr        -0.415591  0.659950  0.210949 -1.970 0.048827
## 
## Likelihood ratio test=142.1  on 19 df, p=< 2.2e-16
## n= 5577, number of events= 2010

c6 <- coxph(Surv(time, status) ~ dissub1.1 + dissub2.1 + age1.1 +
                age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + age1.2 +
                age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + age1.3 +
                age2.3 + drmatch.3 + tcd.3 + pr + prtime.3 + strata(to),
                data = msbmt, method = "breslow")
c6

## Call:
## coxph(formula = Surv(time, status) ~ dissub1.1 + dissub2.1 + 
##     age1.1 + age2.1 + drmatch.1 + tcd.1 + dissub1.2 + dissub2.2 + 
##     age1.2 + age2.2 + drmatch.2 + tcd.2 + dissub1.3 + dissub2.3 + 
##     age1.3 + age2.3 + drmatch.3 + tcd.3 + pr + prtime.3 + strata(to), 
##     data = msbmt, method = "breslow")
## 
##                coef exp(coef)  se(coef)      z        p
## dissub1.1 -0.043592  0.957345  0.077887 -0.560 0.575698
## dissub2.1 -0.297240  0.742866  0.067996 -4.371 1.23e-05
## age1.1    -0.164613  0.848222  0.079054 -2.082 0.037317
## age2.1    -0.089790  0.914123  0.086468 -1.038 0.299075
## drmatch.1  0.045751  1.046814  0.066602  0.687 0.492127
## tcd.1      0.429071  1.535831  0.080432  5.335 9.57e-08
## dissub1.2  0.258710  1.295258  0.135188  1.914 0.055658
## dissub2.2  0.008234  1.008268  0.108339  0.076 0.939419
## age1.2     0.252110  1.286737  0.151041  1.669 0.095089
## age2.2     0.527581  1.694827  0.157887  3.342 0.000833
## drmatch.2 -0.072845  0.929744  0.110261 -0.661 0.508829
## tcd.2      0.310030  1.363465  0.149921  2.068 0.038644
## dissub1.3  0.137509  1.147412  0.148842  0.924 0.355560
## dissub2.3  0.249439  1.283306  0.116828  2.135 0.032754
## age1.3     0.058214  1.059942  0.153456  0.379 0.704425
## age2.3     0.567443  1.763752  0.160190  3.542 0.000397
## drmatch.3  0.170000  1.185305  0.114543  1.484 0.137766
## tcd.3      0.209244  1.232746  0.126368  1.656 0.097755
## pr        -0.350169  0.704569  0.218695 -1.601 0.109339
## prtime.3  -0.658183  0.517791  0.584642 -1.126 0.260255
## 
## Likelihood ratio test=143.5  on 20 df, p=< 2.2e-16
## n= 5577, number of events= 2010

Prediction²⁶

In the preceding subsection, we have modelled the effects of covariates on the transition hazard. In Section 3 on competing risks we have already seen that effects on the cumulative incidence function may be different from what the regression coefficients suggest. In a multi-state setting, this becomes even more of an issue, since intermediate events also contribute to effects on the cumulative scale. This subsection is devoted to estimation of cumulative effects, or prediction, to answer clinically important questions such as the following in our example:
- Given a bone marrow transplantation patient whose platelets have recovered after 60 days and who has had no further events at one year post-transplant, what is then the probability of surviving relapse-free for 2 more years? How does this probability compare to a patient whose platelets have not yet recovered?

In order to obtain prediction probabilities in the context of the Markov multi-state models discussed in the previous section, basically two steps are involved. The first is to use the estimated parameters and baseline transition hazards and the covariate values of a patient of interest, to obtain patient-specific transition hazards for that patient, for each of the transitions in the multi-state model. This is what the function msfit is designed to do. The second step is to use the resulting patient-specific transition hazards (and variances and covariances) as input for probtrans to obtain (patient-specific) transition probabilities.
I will first show how msfit can be used to obtain the baseline hazards associated with the Markov stratified and PH models. The hazards of the Markov stratified models (and their variances and covariances) are obtained by first creating a new dataset containing the (expanded) covariates along with their values (in this case 0). This is very similar to the use of survfit from the survival package. The important difference is that for one patient, this newdata data frame needs to have exactly one line for each transition. When transition-specific covariates have been used in the model, the easiest way to obtain such a data frame is to first create a data frame with the basic covariates and then use expand.covs to obtain the transition-specific covariates. Since expand.covs expects an msdata object, we set the class of the newdata data frame to msdata explicitly. We also copy the levels of the categorical covariates before expanding, although this is not really necessary here.

newd <- data.frame(dissub = rep(0, 3), age = rep(0, 3), 
                   drmatch = rep(0,3), tcd = rep(0, 3), trans = 1:3)
newd$dissub <- factor(newd$dissub, levels = 0:2, labels = levels(ebmt3$dissub))
newd$age <- factor(newd$age, levels = 0:2, labels = levels(ebmt3$age))
newd$drmatch <- factor(newd$drmatch, levels = 0:1, labels = levels(ebmt3$drmatch))
newd$tcd <- factor(newd$tcd, levels = 0:1, labels = levels(ebmt3$tcd))
attr(newd, "trans") <- tmat
class(newd) <- c("msdata", "data.frame")
newd <- expand.covs(newd, covs[1:4], longnames = FALSE)
newd$strata = 1:3
newd

## An object of class 'msdata'
## 
## Data:
##   dissub  age            drmatch    tcd trans dissub1.1 dissub1.2 dissub1.3
## 1    AML <=20 No gender mismatch No TCD     1         0         0         0
## 2    AML <=20 No gender mismatch No TCD     2         0         0         0
## 3    AML <=20 No gender mismatch No TCD     3         0         0         0
##   dissub2.1 dissub2.2 dissub2.3 age1.1 age1.2 age1.3 age2.1 age2.2 age2.3
## 1         0         0         0      0      0      0      0      0      0
## 2         0         0         0      0      0      0      0      0      0
## 3         0         0         0      0      0      0      0      0      0
##   drmatch.1 drmatch.2 drmatch.3 tcd.1 tcd.2 tcd.3 strata
## 1         0         0         0     0     0     0      1
## 2         0         0         0     0     0     0      2
## 3         0         0         0     0     0     0      3

The last command where the column strata is added is important and points to a second major difference between survfit and msfit. The newdata data frame needs to have a column strata specifying to which stratum in the coxph object each transition belongs. Here each transition corresponds to a separate stratum, so we specify 1, 2, and 3.
To obtain an estimate of the baseline cumulative hazard for the “stratified hazards” model, msfit can be called with the first Cox model, c1, as input model, and newd as newdata argument.

msf1 <- msfit(c1, newdata = newd, trans = tmat)

The result is an object of class msfit, which is a list with three items, Haz, varHaz, and trans. The item trans records the transition matrix used when constructing the msfit object. Haz contains the estimated cumulative hazard for each of the transitions for the particular patient specified in newd, while varHaz contains the estimated variances of these cumulative hazards, as well as the covariances for each combination of two transitions. All are evaluated at the time points at which any event in any transition occurs, possibly augmented with the largest (non-event) time point in the data. The summary method for msfit objects gives a convenient overview. If we also would like to have a look at the covariances, we could set the argument variance equal to TRUE.

summary(msf1)

## 
## Transition 1 (head and tail):
##          time          Haz        seHaz        lower       upper
## 1 0.002737851 0.0005277714 0.0005290102 7.400248e-05 0.003763964
## 2 0.008213552 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 3 0.010951403 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 4 0.016427105 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 5 0.019164956 0.0015857558 0.0009219748 5.073865e-04 0.004956027
## 6 0.021902806 0.0015857558 0.0009219748 5.073865e-04 0.004956027
## 
## ...
##         time       Haz      seHaz     lower    upper
## 500 6.253251 0.9513165 0.07182285 0.8204662 1.103035
## 501 6.357290 0.9513165 0.07182285 0.8204662 1.103035
## 502 6.362765 0.9513165 0.07182285 0.8204662 1.103035
## 503 6.798084 0.9513165 0.07182285 0.8204662 1.103035
## 504 7.110198 0.9513165 0.07182285 0.8204662 1.103035
## 505 7.731691 0.9513165 0.07182285 0.8204662 1.103035
## 
## Transition 2 (head and tail):
##            time          Haz        seHaz        lower       upper
## 506 0.002737851 0.0003046955 0.0003077143 4.209506e-05 0.002205469
## 507 0.008213552 0.0003046955 0.0003077143 4.209506e-05 0.002205469
## 508 0.010951403 0.0006097444 0.0004396591 1.483833e-04 0.002505594
## 509 0.016427105 0.0012203981 0.0006340496 4.408243e-04 0.003378606
## 510 0.019164956 0.0018316171 0.0007912068 7.854882e-04 0.004271001
## 511 0.021902806 0.0024438486 0.0009303805 1.158829e-03 0.005153820
## 
## ...
##          time       Haz      seHaz     lower     upper
## 1005 6.253251 0.5020560 0.08219369 0.3642490 0.6919997
## 1006 6.357290 0.5020560 0.08219369 0.3642490 0.6919997
## 1007 6.362765 0.5248419 0.08821373 0.3775385 0.7296182
## 1008 6.798084 0.5248419 0.08821373 0.3775385 0.7296182
## 1009 7.110198 0.5248419 0.08821373 0.3775385 0.7296182
## 1010 7.731691 0.5248419 0.08821373 0.3775385 0.7296182
## 
## Transition 3 (head and tail):
##             time Haz seHaz lower upper
## 1011 0.002737851   0     0     0     0
## 1012 0.008213552   0     0     0     0
## 1013 0.010951403   0     0     0     0
## 1014 0.016427105   0     0     0     0
## 1015 0.019164956   0     0     0     0
## 1016 0.021902806   0     0     0     0
## 
## ...
##          time       Haz      seHaz     lower     upper
## 1510 6.253251 0.3291154 0.05058502 0.2435110 0.4448133
## 1511 6.357290 0.3427115 0.05413323 0.2514645 0.4670688
## 1512 6.362765 0.3427115 0.05413323 0.2514645 0.4670688
## 1513 6.798084 0.3693677 0.06340696 0.2638388 0.5171055
## 1514 7.110198 0.4647197 0.12159613 0.2782724 0.7760899
## 1515 7.731691 0.4647197 0.12159613 0.2782724 0.7760899
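
The lower and upper columns in this summary can be checked by hand. The sketch below assumes a 95% interval constructed on the log scale (an assumption inferred from the printed numbers, not read from the mstate source) and uses only the last printed row for transition 1.

```r
# Last printed row for transition 1 in summary(msf1)
haz   <- 0.9513165   # estimated cumulative hazard
sehaz <- 0.07182285  # its standard error

# Assumption: 95% interval on the log scale,
# exp(log(H) +/- z * se(H) / H), where se(H) / H is the
# delta-method standard error of log(H).
z     <- qnorm(0.975)
lower <- haz * exp(-z * sehaz / haz)
upper <- haz * exp( z * sehaz / haz)
round(c(lower = lower, upper = upper), 6)
```

Under this assumption the two values should agree with the printed lower and upper limits up to rounding, which also explains why the intervals are not symmetric around Haz.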

Let us have a closer look at some of the variances and covariances as well.

vH1 <- msf1$varHaz
head(vH1[vH1$trans1 == 1 & vH1$trans2 == 1, ])

##          time       varHaz trans1 trans2
## 1 0.002737851 2.798518e-07      1      1
## 2 0.008213552 5.629062e-07      1      1
## 3 0.010951403 5.629062e-07      1      1
## 4 0.016427105 5.629062e-07      1      1
## 5 0.019164956 8.500376e-07      1      1
## 6 0.021902806 8.500376e-07      1      1

tail(vH1[vH1$trans1 == 1 & vH1$trans2 == 1, ])

##         time      varHaz trans1 trans2
## 500 6.253251 0.005158522      1      1
## 501 6.357290 0.005158522      1      1
## 502 6.362765 0.005158522      1      1
## 503 6.798084 0.005158522      1      1
## 504 7.110198 0.005158522      1      1
## 505 7.731691 0.005158522      1      1

tail(vH1[vH1$trans1 == 1 & vH1$trans2 == 2, ])

##          time varHaz trans1 trans2
## 1005 6.253251      0      1      2
## 1006 6.357290      0      1      2
## 1007 6.362765      0      1      2
## 1008 6.798084      0      1      2
## 1009 7.110198      0      1      2
## 1010 7.731691      0      1      2

tail(vH1[vH1$trans1 == 1 & vH1$trans2 == 3, ])

##          time varHaz trans1 trans2
## 1510 6.253251      0      1      3
## 1511 6.357290      0      1      3
## 1512 6.362765      0      1      3
## 1513 6.798084      0      1      3
## 1514 7.110198      0      1      3
## 1515 7.731691      0      1      3

tail(vH1[vH1$trans1 == 2 & vH1$trans2 == 3, ])

##          time varHaz trans1 trans2
## 2520 6.253251      0      2      3
## 2521 6.357290      0      2      3
## 2522 6.362765      0      2      3
## 2523 6.798084      0      2      3
## 2524 7.110198      0      2      3
## 2525 7.731691      0      2      3

Note that the covariances of the estimated cumulative hazards are practically (apart from rounding errors) 0. Theoretically, they should be 0, because with separate strata and separate covariate effects for the different transitions, the three transitions could in fact have been estimated as three separate Cox models (this would give exactly the same results).
The estimated baseline cumulative hazards for the Markov PH model are obtained in mostly the same way. The only exception is the specification of the strata column in newd. Instead of taking the values 1, 2, and 3 for the three transitions, it takes the values 1, 2, 2, to indicate that transition 1 corresponds to stratum 1, and both transitions 2 and 3 correspond to stratum 2 (the order of the strata as defined in the coxph object). The time-dependent covariate pr also needs to be included, taking the value 0 for transitions 1 and 2, and 1 for transition 3.

newd$strata = c(1, 2, 2)
newd$pr <- c(0, 0, 1)
msf2 <- msfit(c2, newdata = newd, trans = tmat)
summary(msf2)

## 
## Transition 1 (head and tail):
##          time          Haz        seHaz        lower       upper
## 1 0.002737851 0.0005277714 0.0005290102 7.400248e-05 0.003763964
## 2 0.008213552 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 3 0.010951403 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 4 0.016427105 0.0010560892 0.0007502708 2.624139e-04 0.004250249
## 5 0.019164956 0.0015857558 0.0009219748 5.073865e-04 0.004956027
## 6 0.021902806 0.0015857558 0.0009219748 5.073865e-04 0.004956027
## 
## ...
##         time       Haz      seHaz     lower    upper
## 500 6.253251 0.9513165 0.07182285 0.8204662 1.103035
## 501 6.357290 0.9513165 0.07182285 0.8204662 1.103035
## 502 6.362765 0.9513165 0.07182285 0.8204662 1.103035
## 503 6.798084 0.9513165 0.07182285 0.8204662 1.103035
## 504 7.110198 0.9513165 0.07182285 0.8204662 1.103035
## 505 7.731691 0.9513165 0.07182285 0.8204662 1.103035
## 
## Transition 2 (head and tail):
##            time          Haz        seHaz        lower       upper
## 506 0.002737851 0.0003053084 0.0003083331 4.217979e-05 0.002209902
## 507 0.008213552 0.0003053084 0.0003083331 4.217979e-05 0.002209902
## 508 0.010951403 0.0006107971 0.0004404176 1.486397e-04 0.002509915
## 509 0.016427105 0.0012223306 0.0006350522 4.415233e-04 0.003383948
## 510 0.019164956 0.0018344413 0.0007924245 7.867013e-04 0.004277576
## 511 0.021902806 0.0024473467 0.0009317088 1.160491e-03 0.005161183
## 
## ...
##          time       Haz      seHaz     lower     upper
## 1005 6.253251 0.5040408 0.07806657 0.3720749 0.6828118
## 1006 6.357290 0.5146993 0.08030652 0.3790914 0.6988167
## 1007 6.362765 0.5255361 0.08256535 0.3862540 0.7150431
## 1008 6.798084 0.5476683 0.08851937 0.3989682 0.7517906
## 1009 7.110198 0.6357669 0.13427464 0.4202651 0.9617730
## 1010 7.731691 0.6357669 0.13427464 0.4202651 0.9617730
## 
## Transition 3 (head and tail):
##             time          Haz        seHaz        lower       upper
## 1011 0.002737851 0.0002090742 0.0002116301 2.875366e-05 0.001520225
## 1012 0.008213552 0.0002090742 0.0002116301 2.875366e-05 0.001520225
## 1013 0.010951403 0.0004182719 0.0003029499 1.011445e-04 0.001729717
## 1014 0.016427105 0.0008370481 0.0004386272 2.997137e-04 0.002337729
## 1015 0.019164956 0.0012562195 0.0005493845 5.330994e-04 0.002960212
## 1016 0.021902806 0.0016759351 0.0006481990 7.853066e-04 0.003576640
## 
## ...
##          time       Haz      seHaz     lower     upper
## 1510 6.253251 0.3451655 0.05260815 0.2560308 0.4653317
## 1511 6.357290 0.3524644 0.05411648 0.2608699 0.4762189
## 1512 6.362765 0.3598855 0.05563688 0.2658103 0.4872555
## 1513 6.798084 0.3750415 0.05964162 0.2746095 0.5122042
## 1514 7.110198 0.4353712 0.09072076 0.2893943 0.6549820
## 1515 7.731691 0.4353712 0.09072076 0.2893943 0.6549820

vH2 <- msf2$varHaz
tail(vH2[vH2$trans1 == 1 & vH2$trans2 == 2, ])

##          time varHaz trans1 trans2
## 1005 6.253251      0      1      2
## 1006 6.357290      0      1      2
## 1007 6.362765      0      1      2
## 1008 6.798084      0      1      2
## 1009 7.110198      0      1      2
## 1010 7.731691      0      1      2

tail(vH2[vH2$trans1 == 1 & vH2$trans2 == 3, ])

##          time varHaz trans1 trans2
## 1510 6.253251      0      1      3
## 1511 6.357290      0      1      3
## 1512 6.362765      0      1      3
## 1513 6.798084      0      1      3
## 1514 7.110198      0      1      3
## 1515 7.731691      0      1      3

tail(vH2[vH2$trans1 == 2 & vH2$trans2 == 3, ])

##          time       varHaz trans1 trans2
## 2520 6.253251 0.0004142378      2      3
## 2521 6.357290 0.0005227029      2      3
## 2522 6.362765 0.0006348311      2      3
## 2523 6.798084 0.0011112104      2      3
## 2524 7.110198 0.0088628795      2      3
## 2525 7.731691 0.0088628795      2      3

Note that the estimated cumulative hazards and variances for transition 1 are identical to those from msf1. We saw earlier that the estimated regression coefficients were also identical for the Markov stratified and the Markov PH models. Note also that the variance of the cumulative hazard of transition 3 (and 2, not shown) is smaller than with msf1. The cumulative hazard estimates of transitions 1 and 2 are still uncorrelated (as are those of 1 and 3), but those of transitions 2 and 3 are correlated now, because they share a common baseline.
Let us compare the baseline hazards of the Markov stratified and PH models graphically. For this we use the plot method for msfit objects. Figure 1 corresponds to Figure 14 in the tutorial.

par(mfrow = c(1, 2))
plot(msf1, cols = rep(1, 3), lwd = 2, lty = 1:3, xlab = "Years since transplant",
        ylab = "Stratified baseline hazards", legend.pos = c(2, 0.9))
plot(msf2, cols = rep(1, 3), lwd = 2, lty = 1:3, xlab = "Years since transplant",
        ylab = "Proportional baseline hazards", legend.pos = c(2, 0.9))

par(mfrow = c(1, 1))

Figure 1: Baseline cumulative hazard curves for the EBMT illness-death model. On the left the Markov stratified hazards model, on the right the Markov PH model.
Define the multi-state model as $X(t)$, a random process taking values in $\{1, \dots, S\}$ ($S$ being the number of states). We are interested in estimating so-called transition probabilities $P_{gh}(s,t)=P(X(t)=h \mid X(s)=g)$, possibly depending on covariates. For instance, $P_{13}(0, t)$ indicates the probability of having relapsed/died (state 3) by time $t$, given that the individual was alive without relapse or platelet recovery (state 1) at time $s = 0$. By fixing $s$ and varying $t$, we can predict the future behavior of the multi-state model given the present at time $s$. For Markov models, these probabilities will depend only on the state at time $s$, not on what happened before. For these Markov models there is a powerful relation between these transition probabilities and the transition intensities, given by

\[P(s,t)=\prod_{u \in (s,t]} (I+d\Lambda(u))\]

Here $P(s,t)$ is an $S \times S$ matrix with as $(g, h)$ element the $P_{gh}(s,t)$ in which we are interested, and $\Lambda(t)$ is an $S \times S$ matrix with as off-diagonal $(g, h)$ elements the transition intensities $\Lambda_{gh}(t)$ of transition $g \to h$. If such a direct transition is not possible, then $\Lambda_{gh}(t)=0$. The diagonal elements of $\Lambda(t)$ are defined as $\Lambda_{gg}(t) = -\sum_{h \neq g} \Lambda_{gh}(t)$, i.e. as minus the sum of the transition intensities of the transitions out from state $g$. Finally, $I$ is the $S \times S$ identity matrix. This equation describes a theoretical relation between the true underlying transition intensities and transition probabilities. The product is a so-called product integral (Andersen et al. 1993) when the transition intensities are continuous.
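As an illustration of this definition, consider the three-state illness-death model of our example (state 1: transplant, state 2: platelet recovery, state 3: relapse/death), with the three transitions $1 \to 2$, $1 \to 3$ and $2 \to 3$. Written out, the matrix then takes the form

\[\Lambda(t)=\begin{pmatrix} -(\Lambda_{12}(t)+\Lambda_{13}(t)) & \Lambda_{12}(t) & \Lambda_{13}(t) \\ 0 & -\Lambda_{23}(t) & \Lambda_{23}(t) \\ 0 & 0 & 0 \end{pmatrix}\]

Each row sums to zero, and the row of zeros for state 3 reflects that it is absorbing.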
We already have estimates of all the transition intensities. If we gather these in a matrix and plug them into the product-integral formula above, we get

\[\hat{P}(s,t)=\prod_{u \in (s,t]} (I+d\hat{\Lambda}(u))\]

as an estimate of the transition probabilities. This estimator is called the Aalen-Johansen estimator, and it is implemented in probtrans. By working with matrices, we immediately get all the transition probabilities from all the starting states $g$ to all the receiving states $h$ in one go. When we fix $s$, we can calculate all these transition probabilities by forward matrix multiplications using the simple recursive relation

\[\hat{P}(s,t+)=\hat{P}(s,t) \cdot (I+d\hat{\Lambda}(t+))\]

Here $t+$ denotes the first event time point after $t$. Andersen et al. (1993) and de Wreede et al. (2009) also describe recursive formulas for the covariance matrix of $\hat{P}(s,t)$, with and without covariates, which are implemented in mstate.
Let us see all this theory in action and let us recreate Figure 15 of the tutorial. For this we need to calculate transition probabilities for a baseline patient, based on the Markov PH model. We thus use msf2 as input for probtrans. By default, probtrans uses forward prediction, which means that $s$ is kept fixed and $t > s$. The argument predt specifies either $s$ or $t$. In this case (forward prediction) it specifies $s$. From version 0.2.3 on, probtrans no longer needs a trans argument, but takes that from the trans item of the msfit object.

pt <- probtrans(msf2, predt = 0)

The result of probtrans is a probtrans object, which is a list, where item [[i]] contains predictions from state $i$. Each item of the list is a data frame with time containing all event time points, pstate1, pstate2, etc. the probabilities of being in state 1, 2, etc., and finally se1, se2, etc. the standard errors of these estimated probabilities. The item [[3]] contains predictions $\hat{P}_{3h}(0, t)$ (we chose $s = 0$) starting from the RelDeath state, which is absorbing.

head(pt[[3]])

##          time pstate1 pstate2 pstate3 se1 se2 se3
## 1 0.000000000       0       0       1   0   0   0
## 2 0.002737851       0       0       1   0   0   0
## 3 0.008213552       0       0       1   0   0   0
## 4 0.010951403       0       0       1   0   0   0
## 5 0.016427105       0       0       1   0   0   0
## 6 0.019164956       0       0       1   0   0   0

tail(pt[[3]])

##         time pstate1 pstate2 pstate3 se1 se2 se3
## 501 6.253251       0       0       1   0   0   0
## 502 6.357290       0       0       1   0   0   0
## 503 6.362765       0       0       1   0   0   0
## 504 6.798084       0       0       1   0   0   0
## 505 7.110198       0       0       1   0   0   0
## 506 7.731691       0       0       1   0   0   0

We see that these prediction probabilities are not so interesting; the probabilities are all 0 or 1, and, since there is no randomness, all the SE’s are 0. Item [[2]] contains predictions $\hat{P}_{2h}(0, t)$ from state 2.
It is easier to use the summary method for probtrans objects. The user may specify a from argument, specifying from which state the predictions are to be printed. The summary method prints a selection, the head and tail by default unless there are fewer than 12 time points. When complete is set to TRUE, predictions for all time points are printed. If the from argument is missing in the function call, then predictions from all states are printed.

#Output Suppressed
#summary(pt, from = 2)

From state 2 it is only possible to visit state 3 or to remain in state 2. The probability of going to state 1 is 0. The predictions $\hat{P}_{1h}(0, t)$ from state 1 in [[1]] are perhaps of most interest here.

#Output Suppressed 
#summary(pt, from = 1)

But we see that we do not have enough information to create Figure 15 of the tutorial, since the probability of the relapse/death state (pstate3) does not distinguish between relapse/death before or after platelet recovery. The remedy is actually easy in this case. Consider a different multi-state model with two RelDeath states, the first one (state 3) after platelet recovery, the second one (state 4) without platelet recovery. The transition matrix of this multi-state model is defined as

tmat2 <- transMat(x = list(c(2, 4), c(3), c(), c()))
tmat2

##          to
## from      State 1 State 2 State 3 State 4
##   State 1      NA       1      NA       2
##   State 2      NA      NA       3      NA
##   State 3      NA      NA      NA      NA
##   State 4      NA      NA      NA      NA

The multi-state model has four states and the same three transitions as before. If we apply probtrans to this new multi-state model with the same estimated cumulative hazards and standard errors as before, we get exactly what we want. Thus, we just have to call probtrans with the old msf2 and the new tmat2. From version 0.2.3 on, since the transition matrix is in the msfit object, we just need to replace the trans item of msf2 by tmat2. In the elements of the resulting lists, pstate3 will indicate the probability of relapse/death after platelet recovery and pstate4 the probability of relapse/death without platelet recovery.

msf2$trans <- tmat2
pt <- probtrans(msf2, predt = 0)
summary(pt, from = 1)

## 
## Prediction from state 1 (head and tail):
##          time   pstate1      pstate2      pstate3      pstate4          se1
## 1 0.000000000 1.0000000 0.0000000000 0.000000e+00 0.0000000000 0.0000000000
## 2 0.002737851 0.9991669 0.0005277714 0.000000e+00 0.0003053084 0.0006117979
## 3 0.008213552 0.9986390 0.0010556490 0.000000e+00 0.0003053084 0.0008100529
## 4 0.010951403 0.9983340 0.0010554282 2.208393e-07 0.0006103813 0.0008685356
## 5 0.016427105 0.9977235 0.0010549862 6.628276e-07 0.0012208961 0.0009807157
## 6 0.019164956 0.9965843 0.0015830048 1.105048e-06 0.0018316132 0.0012115670
##            se2          se3          se4    lower1       lower2       lower3
## 1 0.0000000000 0.000000e+00 0.0000000000 1.0000000 0.000000e+00 0.000000e+00
## 2 0.0005285695 1.116923e-07 0.0003080762 0.9979685 7.412369e-05 0.000000e+00
## 3 0.0007492497 1.116923e-07 0.0003080762 0.9970526 2.626497e-04 0.000000e+00
## 4 0.0007490930 2.989514e-07 0.0004397978 0.9966331 2.625948e-04 1.555250e-08
## 5 0.0007487794 6.308958e-07 0.0006336859 0.9958031 2.624848e-04 1.026138e-07
## 6 0.0009191199 1.032427e-06 0.0007900509 0.9942125 5.072942e-04 1.770590e-07
##         lower4    upper1      upper2       upper3      upper4
## 1 0.0000000000 1.0000000 0.000000000 0.000000e+00 0.000000000
## 2 0.0000422494 1.0000000 0.003757809          NaN 0.002206261
## 3 0.0000422494 1.0000000 0.004242894          NaN 0.002206261
## 4 0.0001486912 1.0000000 0.004242006 3.135832e-06 0.002505631
## 5 0.0004414450 0.9996475 0.004240230 4.281495e-06 0.003376609
## 6 0.0007864573 0.9989617 0.004939745 6.896741e-06 0.004265720
## 
## ...
##         time   pstate1   pstate2   pstate3   pstate4        se1        se2
## 501 6.253251 0.2308531 0.4336481 0.1681264 0.1673724 0.02448884 0.02974526
## 502 6.357290 0.2283925 0.4304829 0.1712916 0.1698330 0.02460675 0.03002904
## 503 6.362765 0.2259175 0.4272883 0.1744862 0.1723080 0.02472281 0.03031296
## 504 6.798084 0.2209174 0.4208123 0.1809622 0.1773081 0.02518284 0.03119272
## 505 7.110198 0.2014549 0.3954248 0.2063497 0.1967706 0.03067690 0.03987257
## 506 7.731691 0.2014549 0.3954248 0.2063497 0.1967706 0.03067690 0.03987257
##            se3        se4    lower1    lower2    lower3    lower4    upper1
## 501 0.02379684 0.02100629 0.1875169 0.3790974 0.1273960 0.1308738 0.2842045
## 502 0.02430502 0.02136056 0.1849160 0.3754732 0.1297050 0.1327282 0.2820911
## 503 0.02480762 0.02170882 0.1823058 0.3718215 0.1320509 0.1346059 0.2799621
## 504 0.02616939 0.02264879 0.1766850 0.3639092 0.1362993 0.1380380 0.2762233
## 505 0.03690104 0.02987965 0.1494719 0.3245138 0.1453401 0.1461185 0.2715164
## 506 0.03690104 0.02987965 0.1494719 0.3245138 0.1453401 0.1461185 0.2715164
##        upper2    upper3    upper4
## 501 0.4960483 0.2218790 0.2140499
## 502 0.4935519 0.2262118 0.2173106
## 503 0.4910294 0.2305584 0.2205703
## 504 0.4866130 0.2402604 0.2277500
## 505 0.4818309 0.2929694 0.2649813
## 506 0.4818309 0.2929694 0.2649813
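
The first rows of this output can be reproduced by hand with the forward recursion. The sketch below is only an illustration, not the probtrans implementation: it types in the cumulative hazards printed by summary(msf2) at the first three event times and assumes the transition numbering of tmat2 (transition 1: state 1 to 2, transition 2: state 1 to 4, transition 3: state 2 to 3). The object names are chosen for this example.

```r
# Cumulative hazards from summary(msf2) at the first three event times
# (rows: time points; columns: transitions 1, 2, 3)
Haz <- rbind(
  c(0.0005277714, 0.0003053084, 0.0002090742),
  c(0.0010560892, 0.0003053084, 0.0002090742),
  c(0.0010560892, 0.0006107971, 0.0004182719)
)
from <- c(1, 1, 2)  # starting state of each transition (as in tmat2)
to   <- c(2, 4, 3)  # receiving state of each transition (as in tmat2)
S    <- 4           # number of states

# Increments of the cumulative hazards at each event time
dHaz <- diff(rbind(0, Haz))

# Forward recursion: P(0, t+) = P(0, t) %*% (I + dLambda(t+))
P <- diag(S)
for (k in seq_len(nrow(dHaz))) {
  dL <- matrix(0, S, S)
  dL[cbind(from, to)] <- dHaz[k, ]  # off-diagonal increments
  diag(dL) <- -rowSums(dL)          # diagonal: minus the row sums
  P <- P %*% (diag(S) + dL)
}
round(P[1, ], 10)  # predictions from state 1 at time 0.010951403
```

The first row of P should agree with row 4 of the summary above (time 0.010951403), up to rounding of the typed-in hazards. Standard errors are not computed in this sketch.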

The reader may check that the pstate3 and pstate4 probabilities of this new Aalen-Johansen estimator sum up to the pstate3 probability of the result of the previous call to probtrans, and that the pstate1 and pstate2 probabilities are unchanged.
Figure 2 contains a plot of pt. For this we use the plot method for probtrans objects.

plot(pt, ord = c(2, 3, 4, 1), lwd = 2, xlab = "Years since transplant",
     ylab = "Prediction probabilities", cex = 0.75, 
     legend = c("Alive in remission, no PR",
                "Alive in remission, PR", "Relapse or death after PR",
                "Relapse or death without PR"))

Figure 2: Stacked prediction probabilities at $s = 0$ for a reference patient. PR stands for platelet recovery.
The argument from determines from which state the transition probabilities are to be plotted. The default is from state 1, which is what we want, so the from argument is omitted here. The default type of the plot method for probtrans objects is a “stacked” plot, for which the difference between two adjacent lines represents the probability of being in a state. The argument ord specifies the order in which the state probabilities are stacked. The present order, 2, 3, 4, 1, allows states 2 and 3 (states with platelet recovery) and states 3 and 4 (death states) to be combined visually. Other plot types are “filled”, which is like “stacked”, but uses colors to fill the space between adjacent lines, “single”, which simply plots the transition probabilities as different lines in a single plot, and “separate”, which uses separate plots for the transition probabilities.
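
To make the stacking concrete, the following sketch computes the line heights at the last time point from the probabilities printed in the summary above. It assumes that the stacked lines are the cumulative sums of the state probabilities taken in the order given by ord; the object names are chosen for this example.

```r
# Last printed row of summary(pt, from = 1) for the four-state model
p <- c(pstate1 = 0.2014549, pstate2 = 0.3954248,
       pstate3 = 0.2063497, pstate4 = 0.1967706)

ord <- c(2, 3, 4, 1)        # stacking order used in the plot call
heights <- cumsum(p[ord])   # assumed heights of the stacked lines
heights
```

The gap between consecutive heights is the probability of the corresponding state, and the top line sits at 1 (up to rounding) because the four probabilities sum to one.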
To obtain the predictions $\hat{P}_{1h}(s,t)$ for $s=0.5$, which are plotted in Figure 16 of the tutorial, we simply change the value of predt in the call to probtrans.

pt <- probtrans(msf2, predt = 0.5)
summary(pt, from = 1)

## 
## Prediction from state 1 (head and tail):
##        time   pstate1     pstate2      pstate3     pstate4         se1
## 1 0.5000000 1.0000000 0.000000000 0.000000e+00 0.000000000 0.000000000
## 2 0.5010267 0.9985898 0.000000000 0.000000e+00 0.001410218 0.003237571
## 3 0.5037645 0.9976488 0.000000000 0.000000e+00 0.002351164 0.004183373
## 4 0.5065024 0.9955387 0.001639506 0.000000e+00 0.002821775 0.006169060
## 5 0.5092402 0.9938957 0.003282495 0.000000e+00 0.002821775 0.007422321
## 6 0.5119781 0.9915469 0.003277183 5.312169e-06 0.005170580 0.008513835
##           se2          se3         se4    lower1       lower2     lower3
## 1 0.000000000 0.000000e+00 0.000000000 1.0000000 0.000000e+00 0.0000e+00
## 2 0.000000000 0.000000e+00 0.003237571 0.9922644 0.000000e+00 0.0000e+00
## 3 0.000000000 0.000000e+00 0.004183373 0.9894832 0.000000e+00 0.0000e+00
## 4 0.004136138 2.101143e-06 0.004583357 0.9835207 1.167630e-05 0.0000e+00
## 5 0.005848968 2.101143e-06 0.004583357 0.9794542 9.987955e-05 0.0000e+00
## 6 0.005839510 1.353036e-05 0.006209919 0.9749997 9.971745e-05 3.6076e-08
##         lower4 upper1    upper2       upper3     upper4
## 1 0.000000e+00      1 0.0000000 0.0000000000 0.00000000
## 2 1.567120e-05      1 0.0000000 0.0000000000 0.12690255
## 3 7.190497e-05      1 0.0000000 0.0000000000 0.07687883
## 4 1.169315e-04      1 0.2302081          NaN 0.06809471
## 5 1.169315e-04      1 0.1078777          NaN 0.06809471
## 6 4.911765e-04      1 0.1077036 0.0007822136 0.05443032
## 
## ...
##         time   pstate1    pstate2     pstate3   pstate4        se1        se2
## 330 6.253251 0.6872018 0.02597812 0.005991102 0.2808290 0.05248379 0.01448894
## 331 6.357290 0.6798772 0.02578851 0.006180714 0.2881535 0.05348008 0.01438691
## 332 6.362765 0.6725095 0.02559713 0.006372091 0.2955212 0.05445049 0.01428397
## 333 6.798084 0.6576254 0.02520918 0.006760043 0.3104053 0.05723289 0.01407791
## 334 7.110198 0.5996895 0.02368832 0.008280903 0.3683412 0.07993696 0.01332734
## 335 7.731691 0.5996895 0.02368832 0.008280903 0.3683412 0.07993696 0.01332734
##             se3        se4    lower1      lower2      lower3    lower4
## 330 0.003565503 0.05117341 0.5916642 0.008706862 0.001866073 0.1964870
## 331 0.003675647 0.05224080 0.5827386 0.008640867 0.001926781 0.2019786
## 332 0.003786522 0.05327926 0.5738257 0.008574230 0.001988236 0.2075517
## 333 0.004019125 0.05620683 0.5544966 0.008437438 0.002108021 0.2176694
## 334 0.005060910 0.07944552 0.4618104 0.007863898 0.002499552 0.2413567
## 335 0.005060910 0.07944552 0.4618104 0.007863898 0.002499552 0.2413567
##        upper1     upper2     upper3    upper4
## 330 0.7981661 0.07750930 0.01923468 0.4013749
## 331 0.7932082 0.07696533 0.01982645 0.4110953
## 332 0.7881646 0.07641656 0.02042190 0.4207761
## 333 0.7799349 0.07531940 0.02167824 0.4426505
## 334 0.7787343 0.07135602 0.02743426 0.5621360
## 335 0.7787343 0.07135602 0.02743426 0.5621360

The result now contains only time points $t \ge 0.5$. Figure 3 contains a plot of pt.

plot(pt, ord = c(2, 3, 4, 1), lwd = 2, xlab = "Years since transplant",
     ylab = "Prediction probabilities", cex = 0.75, 
     legend = c("Alive in remission, no PR",
                "Alive in remission, PR", "Relapse or death after PR",
                "Relapse or death without PR"))

Figure 3: Stacked prediction probabilities at $s = 0.5$ for a reference patient.

Figure 17 of the tutorial distinguishes between three patients, one being the good old (or rather young) reference patient, for which we have already calculated the probabilities, one for a patient in the age category 20-40, and one for a patient older than 40. To obtain prediction probabilities for the latter two patients as well, we have to repeat part of the calculations, changing only the value of age in the newdata data frame.

msf2$trans <- tmat
msf.20 <- msf2 # copy msfit result for reference (young) patient
newd <- newd[,1:5] # use the basic covariates of the reference patient
newd2 <- newd
newd2$age <- 1
newd2$age <- factor(newd2$age,levels=0:2,labels=levels(ebmt3$age))
attr(newd2, "trans") <- tmat
class(newd2) <- c("msdata","data.frame")
newd2 <- expand.covs(newd2,covs[1:4],longnames=FALSE)
newd2$strata=c(1,2,2)
newd2$pr <- c(0,0,1)
msf.2040 <- msfit(c2, newdata=newd2, trans=tmat)
newd3 <- newd
newd3$age <- 2
newd3$age <- factor(newd3$age,levels=0:2,labels=levels(ebmt3$age))
attr(newd3, "trans") <- tmat
class(newd3) <- c("msdata","data.frame")
newd3 <- expand.covs(newd3,covs[1:4],longnames=FALSE)
newd3$strata=c(1,2,2)
newd3$pr <- c(0,0,1)
msf.40 <- msfit(c2, newdata=newd3, trans=tmat)
pt.20 <- probtrans(msf.20,predt=0) # original young (<= 20) patient
pt.201 <- pt.20[[1]]; pt.202 <- pt.20[[2]]
pt.2040 <- probtrans(msf.2040,predt=0) # patient 20-40
pt.20401 <- pt.2040[[1]]; pt.20402 <- pt.2040[[2]]
pt.40 <- probtrans(msf.40,predt=0) # patient > 40
pt.401 <- pt.40[[1]]; pt.402 <- pt.40[[2]]

The 5-years transition probabilities $P_{13}(0, 5)$ and $P_{23}(0, 5)$ are estimated as 0.30275 and 0.26210 respectively.

pt.201[488:489,] # 5 years falls between 488th and 489th time point

##         time   pstate1   pstate2   pstate3        se1        se2        se3
## 488 4.985626 0.2452605 0.4519872 0.3027523 0.02411439 0.02853645 0.02693539
## 489 5.084189 0.2445602 0.4511034 0.3043365 0.02412385 0.02858110 0.02707436

pt.202[488:489,] # 5-years probabilities

##         time pstate1   pstate2   pstate3 se1        se2        se3
## 488 4.985626       0 0.7378970 0.2621030   0 0.03339911 0.03339911
## 489 5.084189       0 0.7364541 0.2635459   0 0.03356217 0.03356217
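Rather than locating the row by eye, the estimate at a given time can be looked up programmatically. The following is a minimal sketch in base R; `pt_toy` and `value_at` are hypothetical names, and `pt_toy` is a small stand-in for one element of a probtrans result (its two middle rows copy the output above, the outer two rows are made up).

```r
# Hypothetical stand-in for one element of a probtrans result.
# The two middle rows copy the printed output; the outer rows are invented.
pt_toy <- data.frame(time    = c(4.90, 4.985626, 5.084189, 5.20),
                     pstate3 = c(0.3010, 0.3027523, 0.3043365, 0.3050))

# The estimates are right-continuous step functions, so the value at t
# is the one stored in the last row with time <= t.
value_at <- function(pt, t) {
  idx <- findInterval(t, pt$time)   # number of time points <= t
  if (idx == 0) stop("t lies before the first time point")
  pt[idx, ]
}

value_at(pt_toy, 5)$pstate3   # 0.3027523
```

With a real probtrans element in place of `pt_toy`, the same helper should return row 488 for t = 5.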

Figure 4 shows relapse-free survival probabilities without distinction between before or after platelet recovery, so we can use the first transition matrix tmat. The probabilities we want are $1 - \hat{P}_{13}(0, t)$ and $1 - \hat{P}_{23}(0, t)$, the first one conditioning on being in state 1 (transplantation, i.e. no PR), the second on being in state 2 (PR).

plot(pt.201$time, 1 - pt.201$pstate3, ylim = c(0.425, 1), type = "s",
     lwd = 2, col = "red", xlab = "Years since transplant", ylab = "Relapse-free survival")
lines(pt.20401$time, 1 - pt.20401$pstate3, type = "s", lwd = 2,
      col = "blue")
lines(pt.401$time, 1 - pt.401$pstate3, type = "s", lwd = 2, col = "green")
lines(pt.202$time, 1 - pt.202$pstate3, type = "s", lwd = 2, col = "red",
      lty = 2)
lines(pt.20402$time, 1 - pt.20402$pstate3, type = "s", lwd = 2,
      col = "blue", lty = 2)
lines(pt.402$time, 1 - pt.402$pstate3, type = "s", lwd = 2, col = "green",
      lty = 2)
legend(6, 1, c("no PR", "PR"), lwd = 2, lty = 1:2, xjust = 1,
      bty = "n")
legend("topright", c("<=20", "20-40", ">40"), lwd = 2, 
       col = c("red", "blue", "green"), bty = "n")

Figure 4: Predicted relapse-free survival probabilities for three patients in different age categories, given platelet recovery (dashed) and given no platelet recovery (solid). The time of prediction was at transplant (note: in the tutorial this was at 1 month after transplant).
It is also possible to do prediction with a fixed horizon. This should not be understood as attempting to predict the past. It means that in our prediction probabilities $P_{gh}(s, t)$, we fix $t$, a time horizon, and we want to study how $P_{gh}(s, t)$ changes as more and more information on a patient becomes available. From a computational point of view this just means that the order of the matrix multiplication in (2) is reversed. We will plot $1 -\hat{P}_{13}(s, 5)$ and $1 -\hat{P}_{23}(s, 5)$, the 5-years relapse-free survival probabilities given that the patient is in state 1 (no PR) and in state 2 (PR), respectively, for the same three patients as before.

pt.20 <- probtrans(msf.20, direction = "fixedhorizon", predt = 5)
pt.201 <- pt.20[[1]]
pt.202 <- pt.20[[2]]
head(pt.201)

##          time   pstate1   pstate2   pstate3        se1        se2        se3
## 1 0.000000000 0.2452605 0.4519872 0.3027523 0.02411439 0.02853645 0.02693539
## 2 0.002737851 0.2454650 0.4519742 0.3025608 0.02413403 0.02854695 0.02694328
## 3 0.008213552 0.2455948 0.4518230 0.3025823 0.02414644 0.02854909 0.02694380
## 4 0.010951403 0.2456698 0.4519611 0.3023691 0.02415369 0.02855746 0.02695114
## 5 0.016427105 0.2458201 0.4522376 0.3019422 0.02416821 0.02857418 0.02696574
## 6 0.019164956 0.2461011 0.4523628 0.3015361 0.02419520 0.02859303 0.02698076

head(pt.202)

##          time pstate1   pstate2   pstate3 se1        se2        se3
## 1 0.000000000       0 0.7378970 0.2621030   0 0.03339911 0.03339911
## 2 0.002737851       0 0.7380513 0.2619487   0 0.03340572 0.03340572
## 3 0.008213552       0 0.7380513 0.2619487   0 0.03340572 0.03340572
## 4 0.010951403       0 0.7382057 0.2617943   0 0.03341233 0.03341233
## 5 0.016427105       0 0.7385150 0.2614850   0 0.03342551 0.03342551
## 6 0.019164956       0 0.7388247 0.2611753   0 0.03343863 0.03343863

Here item [[1]] gives estimates $\hat{P}_{1h}(s,5)$ and [[2]] gives estimates $\hat{P}_{2h}(s,5)$. For item [[g]], the column time gives the different values of $s$, and pstate1 etc. give the estimated probabilities of being in state 1 etc. at 5 years, conditional on being in state $g$ at time $s$. In pt.201 we recognize at time $s = 0$ the value 0.30275 as $\hat{P}_{13}(0,5)$, and in pt.202 we see 0.26210 as $\hat{P}_{23}(0,5)$. The backward transition probabilities for the other two patients are calculated similarly.

pt.2040 <- probtrans(msf.2040, direction = "fixedhorizon", predt = 5)
pt.20401 <- pt.2040[[1]]
pt.20402 <- pt.2040[[2]]
pt.40 <- probtrans(msf.40, direction = "fixedhorizon", predt = 5)
pt.401 <- pt.40[[1]]
pt.402 <- pt.40[[2]]

As mentioned before, at $s = 0$ these probabilities are the same as the five-years probabilities of Figure 4, and as $s$ approaches 5, the probabilities approach 1, since both $\hat{P}_{13}(s,5)$ and $\hat{P}_{23}(s,5)$ approach 0. Figure 5 shows 5-years relapse-free survival probabilities, both with and without platelet recovery, with the prediction time $s$ varying.

plot(pt.201$time, 1 - pt.201$pstate3, ylim = c(0.425, 1), type = "s",
     lwd = 2, col = "red", xlab = "Years since transplant", ylab = "Relapse-free survival")
lines(pt.20401$time, 1 - pt.20401$pstate3, type = "s", lwd = 2,
     col = "blue")
lines(pt.401$time, 1 - pt.401$pstate3, type = "s", lwd = 2, col = "green")
lines(pt.202$time, 1 - pt.202$pstate3, type = "s", lwd = 2, col = "red",
     lty = 2)
lines(pt.20402$time, 1 - pt.20402$pstate3, type = "s", lwd = 2,
     col = "blue", lty = 2)
lines(pt.402$time, 1 - pt.402$pstate3, type = "s", lwd = 2, col = "green",
     lty = 2)
legend("topleft", c("<=20", "20-40", ">40"), lwd = 2, 
       col = c("red", "blue", "green"), bty = "n")
legend(1, 1, c("no PR", "PR"), lwd = 2, lty = 1:2, bty = "n")
title(main = "Backward prediction")

* Figure 5: Predicted probabilities of 5-years relapse-free survival, conditional on being alive without relapse with (PR) and without platelet recovery (no PR). Patients in three age categories.
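The remark that fixed-horizon prediction only reverses the order of the matrix multiplication can be illustrated with a toy product-integral. This is a sketch with made-up hazard increments for a three-state model (1 -> 2, 1 -> 3, 2 -> 3), not the computation mstate performs internally; `make_step`, `steps`, `forward` and `fixed` are hypothetical names.

```r
# Toy event times and hazard increments; all numbers are invented.
times <- c(1, 2, 3, 4)

# Build I + dA(u) for one event time from the three transition increments.
make_step <- function(a12, a13, a23) {
  dA <- matrix(0, 3, 3)
  dA[1, 2] <- a12
  dA[1, 3] <- a13
  dA[2, 3] <- a23
  diag(dA) <- -rowSums(dA)   # each row of dA sums to zero
  diag(3) + dA
}
steps <- list(make_step(0.10, 0.05, 0.00),
              make_step(0.20, 0.00, 0.10),
              make_step(0.00, 0.10, 0.05),
              make_step(0.15, 0.05, 0.20))

# Forward prediction from s = 0: right-multiply as t increases.
forward <- vector("list", length(times))
P <- diag(3)
for (j in seq_along(times)) {
  P <- P %*% steps[[j]]
  forward[[j]] <- P          # P(0, times[j])
}

# Fixed horizon t = 4: left-multiply as s decreases.
fixed <- vector("list", length(times))
P <- diag(3)
for (j in rev(seq_along(times))) {
  P <- steps[[j]] %*% P
  fixed[[j]] <- P            # P(s, 4) for s just before times[j]
}

# Both directions agree where they overlap: P(0, 4).
all.equal(fixed[[1]], forward[[4]])
```

Each matrix in `fixed` has rows summing to one, and row g of `fixed[[j]]` plays the role of item [[g]] in the probtrans output: the state occupation probabilities at the horizon, given state g at the earlier time.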

Competing risks

Competing risks concern the situation where more than one cause of failure is possible. If failures are different causes of death, only the first of these to occur is observed. In other situations, observations after the first failure may be observable, but not of interest. We can represent a competing risks model graphically with an initial state (alive or more generally event-free) and a number of different endpoints.

A competing risks situation with K causes of failure

The subject of competing risks goes as far back as the 18th century, when Bernoulli [12] studied the possible consequences of eradication of smallpox on mortality rates. Indeed, the problem of estimation of failure probabilities after elimination (or modification) of one of the competing risks has been of great importance and was the subject of much debate in the 1970s [13, 14].
The central criticism is the assumption that upon removal of one cause of failure, the risks of failure from the remaining causes are unchanged. While this may be a reasonable assumption in the industrial setting, in human studies it will rarely be true.
In some cases, each failure type is equally important. In other cases, one failure type can be singled out as the event of interest, while the remaining failure types are of less importance. One is then interested in the probability of failing from the cause of interest in the presence of competing risks (or, as in the first example, each of the death causes in turn is the cause of interest, with all the other death causes taken as competing risks).
One method that is often used to estimate this failure probability is the Kaplan–Meier estimate, where the failures from the competing causes are treated as censored observations.
However, this naive Kaplan–Meier is biased. The basic issue in competing risks models that results in the bias of the naive Kaplan–Meier estimator is the violation of one of the assumptions underlying the Kaplan–Meier estimator: the assumption of independence of the censoring distribution, i.e. the distribution of the time to the competing events.
- If the competing event time distributions were independent of the distribution of time to the event of interest, this would imply that at each point in time the hazard of the event of interest is the same for subjects that have not yet failed and are still under follow-up as for subjects that have experienced a competing event by that time.
- However, a subject that is censored because of failure from a competing risk will with certainty NOT experience the event of interest. Since subjects that will never fail are treated as if they could fail (they are censored), the naive Kaplan–Meier overestimates the probability of failure (and hence underestimates the corresponding survival probability).
The bias is greater when the competition is heavier, i.e. when the hazard of the competing events is larger. This is different from censoring due to end of study or loss to follow-up. In the latter situations, individuals may still fail at a later time point.
For illustration of several concepts and techniques we will use data from 329 homosexual men from the Amsterdam Cohort Studies on HIV infection and AIDS [15]. During the course of HIV infection, the so-called syncytium inducing (SI) HIV phenotype appears in many individuals. Prognosis is strongly impaired after the appearance of this SI phenotype [16]. Little is known about factors that induce the appearance of SI phenotype. When analysing time to SI appearance before AIDS diagnosis, AIDS acts as a competing event.
Using a naive KM approach, for time to AIDS, all individuals in which SI phenotype appeared first were treated as censored, while for SI appearance, all AIDS diagnoses were treated as censored.

Estimation

The fundamental concept in competing risks models is the cause-specific hazard function, the hazard of failing from a given cause in the presence of the competing events.
Let’s review:
- The hazard is defined as \[h(t)=\lim_{\Delta t \to 0} \frac{P(t \le T < t+\Delta t|T\ge t)}{\Delta t}\]
- The survival function $S(t)=P(T>t)$ is related to the hazard through \[h(t)=\frac{1}{S(t)}\lim_{\Delta t \to 0} \frac{S(t)-S(t+\Delta t)}{\Delta t}=-\frac{d \log S(t)}{dt}\]
  - The cumulative hazard is defined as \[H(t)=\int_{0}^{t} h(s)ds\]
  - The survival function can be found from the cumulative hazard through the relation \[S(t)=\exp(-H(t))\]
The observable data in competing risks models is represented by the time of failure $T$, the cause of failure $D$, and possibly a covariate vector $Z$. The cause-specific hazard of cause $k$ is \[h_k(t)=\lim_{\Delta t \to 0} \frac{P(t \le T < t+\Delta t, D=k|T\ge t)}{\Delta t}\]

\[H_k(t)=\int_{0}^{t} h_k(s)ds\] \[S_k(t)=\exp(-H_k(t))\]

Since the overall hazard is the sum of the cause-specific hazards, $h(t)=\sum_{k=1}^{K} h_k(t)$, the overall survival function is \[S(t)=\exp(-\sum_{k=1}^{K} H_k(t))\]
This survival function does have an interpretation: it is the probability of not having failed from any cause at time $t$. The cumulative incidence function of cause $k$, $I_k(t)$, is defined as the probability $P(T \le t, D=k)$ of failing from cause $k$ before or at time $t$. \[I_k(t)=\int_{0}^{t} h_k(s)S(s)ds\]
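A quick numerical check of these relations may help. The sketch below assumes two causes with constant cause-specific hazards (the values 0.10 and 0.05 are arbitrary), in which case the integral has the closed form $I_k(t)=\frac{h_k}{h_1+h_2}(1-S(t))$. The names `lam1`, `lam2`, `S_all`, `cif_num` and `cif_closed` are illustrative only.

```r
# Assumed constant cause-specific hazards (arbitrary illustrative values).
lam1 <- 0.10
lam2 <- 0.05

# Overall survival S(t) = exp(-(H_1(t) + H_2(t))).
S_all <- function(t) exp(-(lam1 + lam2) * t)

# Cumulative incidence by numerical integration of h_k(s) * S(s).
cif_num <- function(lam, t) {
  integrate(function(s) lam * S_all(s), lower = 0, upper = t)$value
}

# Closed form available under constant hazards.
cif_closed <- function(lam, t) lam / (lam1 + lam2) * (1 - S_all(t))

t0 <- 10
c(I1 = cif_num(lam1, t0), I2 = cif_num(lam2, t0), S = S_all(t0))

# 1 - S_1(t) = 1 - exp(-H_1(t)) is larger than I_1(t): it ignores that
# subjects who failed from cause 2 can no longer fail from cause 1.
c(one_minus_S1 = 1 - exp(-lam1 * t0), I1 = cif_closed(lam1, t0))
```

The three quantities in the first vector add up to one, which is the continuous-time counterpart of the stacked plots shown later.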
Alternatively, the latent failure time approach focuses on the joint distribution of the times to the $K$ different events, as described by the joint survival function \[\bar{S}(t_1, \dots, t_K)=P(\tilde{T}_1 > t_1, \dots, \tilde{T}_K > t_K)\]
- The marginal distribution $S^{k}(t)=P(\tilde{T}_k > t)=\bar{S}(0,\dots,0,t,0,\dots,0)$ then defines a marginal hazard function as above. A fundamental problem with this approach is that, without additional assumptions, the joint survival function is not identifiable from the observed data (a single failure time for each subject). For any joint survival function with arbitrary dependence between the different failure time distributions, one can find a different joint survival function with independent failure time distributions, which has the same cause-specific hazards. The implications of this are that the joint survival function is not identifiable, nor are the marginal distributions. It is even impossible to test for independence of the marginal failure time distributions.
Estimation of the cumulative incidence functions.
- Let $0 < t_1 < t_2 < \dots < t_N$ be the ordered distinct time points at which failures of any cause occur. Let $d_{kj}$ denote the number of patients failing from cause $k$ at $t_j$, and let $d_j = \sum_{k=1}^{K} d_{kj}$ denote the total number of failures (from any cause) at $t_j$. In the absence of ties only one of the $d_{kj}$ equals 1 for a given $j$, and $d_j = 1$.
- The formulas are also valid, however, in the presence of ties. Let $n_j$ be the number of patients at risk (i.e. that are still in follow-up and have not failed from any cause) at time $t_j$. The overall survival probability $S(t)$ at $t$ can be estimated, without considering the cause of failure, by the Kaplan–Meier estimator \[\hat{S}(t)=\prod_{j:t_j \le t} (1-\frac{d_j}{n_j})\]
- A discretized version of the cause-specific hazard is the proportion of subjects at risk that fail from cause $k$, \[h_k(t_j)=P(T=t_j, D=k|T>t_{j-1})\]
- This quantity would be estimated by \[\hat{h}_k(t_j)=\frac{d_{kj}}{n_j}\]
- Thus, \[\hat{S}(t)=\prod_{j:t_j \le t} (1-\sum_{k=1}^{K} \hat{h}_k(t_j))\]
- The unconditional probability of failing from cause $k$ at $t_j$, $p_k(t_j)=P(T=t_j, D=k)$ is the product of the hazard and the probability of being event-free at $t_j$ and is estimated as \[\hat{p}_k (t_j)=\hat{h}_k (t_j) \hat{S}(t_{j-1})\]
Finally, the cumulative incidence $I_{k}(t)$ of cause $k$ at $t$ is estimated as the sum of these terms over all time points up to and including $t$; in summary \[\hat{I}_k (t)=\sum_{j:t_j \le t} \hat{p}_k (t_j),\] \[\hat{p}_k (t_j)=\hat{h}_k (t_j) \hat{S}(t_{j-1}),\] \[\hat{h}_k(t_j)=\frac{d_{kj}}{n_j}\]
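The estimation steps above can be traced by hand on a small data set. The sketch below uses eight made-up observations with two causes (0 = censored) and base R only; the object names (`toy_time`, `haz1`, `cif1`, and so on) are illustrative and are not part of mstate.

```r
# Made-up competing risks data: distinct times, cause 0 = censored.
toy_time  <- c(1, 2, 3, 4, 5, 6, 7, 8)
toy_cause <- c(1, 2, 1, 0, 2, 1, 0, 2)

# Distinct failure times t_j, number at risk n_j, failures d_kj per cause.
tj <- sort(unique(toy_time[toy_cause != 0]))
nj <- sapply(tj, function(t) sum(toy_time >= t))
d1 <- sapply(tj, function(t) sum(toy_time == t & toy_cause == 1))
d2 <- sapply(tj, function(t) sum(toy_time == t & toy_cause == 2))

# Discrete cause-specific hazards and overall Kaplan-Meier survival.
haz1 <- d1 / nj
haz2 <- d2 / nj
surv <- cumprod(1 - haz1 - haz2)
surv_prev <- c(1, head(surv, -1))     # S(t_{j-1}), with S(t_0) = 1

# Cumulative incidence: running sum of hazard times previous survival.
cif1 <- cumsum(haz1 * surv_prev)
cif2 <- cumsum(haz2 * surv_prev)

# Naive Kaplan-Meier for cause 1, treating cause 2 failures as censored.
naive1 <- 1 - cumprod(1 - haz1)

data.frame(tj, nj, surv, cif1, cif2, naive1)
tail(cif1, 1)   # 0.40625 (= 13/32)
```

At every time point `surv + cif1 + cif2` equals one, whereas `naive1` ends at about 0.51, above the cumulative incidence of 0.41. This is the overestimation by the naive Kaplan-Meier described earlier.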
The effect of covariates on disease progression is most often modelled using the Cox proportional hazards model. In its simplest form, the hazard for a subject with covariate values $x = (x_1, \dots, x_p)$ is assumed to be \[h(t|x)=h_0(t)\exp(\beta x)\]
Assuming all event times are distinct, the parameter vector $\beta$ is found by maximizing the partial likelihood. This is a product, over the event times, of a quotient that compares the hazard of the individual with the event at $t_j$ to the hazard of all the individuals at risk at $t_j$ (the risk set $R_j$): \[L(\beta)=\prod_{j=1}^{N} \frac{\exp(\beta x_j)}{\sum_{l \in R_j}\exp(\beta x_l)}\]
Note that the baseline hazard cancels out. The estimate $\hat{\beta}$ is used in Breslow’s estimate of the baseline cumulative hazard \[\hat{H}_0(t)=\sum_{j:t_j \le t} \frac{1}{\sum_{l \in R_j}\exp(\hat{\beta}x_l)}\]
Sometimes, one may want to allow the baseline hazard to be different across subgroups $h = 1, \dots, m,$ called strata: \[h_h(t|x)=h_{h,0}(t)\exp(\beta x)\]
Parameter estimation in this stratified Cox model is performed by maximization of the product of the partial likelihoods per stratum \[L(\beta)=\prod_{h=1}^{m} L_h(\beta)\] with \[L_h(\beta)=\prod_{j=1}^{N} \frac{\exp(\beta x_j)}{\sum_{l \in R_{h,j}}\exp(\beta x_l)}\]
Here, the product in $L_h(\beta)$ is only taken over the event times from individuals in stratum $h$, and $R_{h,j}$ denotes the risk set at event time $t_j$ in stratum $h$. If all relative risk parameters $\beta$ are allowed to differ per stratum, then the $L_h(\beta_h)$ have nothing in common and fitting such a stratified Cox model boils down to fitting $m$ different Cox models, i.e. one per stratum.
The results from a Cox model, which models effects of covariates on the hazard, can also be used to describe cumulative effects. For the moment, assume that only effects of time-fixed covariates have been modelled. If an individual has covariate values $x$, then his or her survival curve is estimated as \[\hat{S}(t|x)=\exp(-\hat{H}_0(t) \exp(\hat{\beta} x))=\hat{S}_0(t)^{\exp(\hat{\beta} x)}\] where $\hat{S}_0(t)=\exp(-\hat{H}_0(t))$ is the estimated baseline survival function.
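To make the partial likelihood, Breslow’s estimator and the resulting survival curve concrete, the following base R sketch evaluates them directly on six made-up observations with one binary covariate. It is meant as an illustration of the formulas, not as a replacement for coxph; all object names are hypothetical.

```r
# Made-up data: distinct times, status 1 = event, one binary covariate.
cox_time   <- c(2, 3, 5, 7, 8, 11)
cox_status <- c(1, 1, 0, 1, 1, 0)
cox_x      <- c(1, 0, 1, 0, 1, 0)

# Negative log partial likelihood: for each event, compare the subject's
# relative hazard with the sum over the risk set at that event time.
neg_log_pl <- function(beta) {
  events <- which(cox_status == 1)
  -sum(sapply(events, function(j) {
    at_risk <- cox_time >= cox_time[j]
    beta * cox_x[j] - log(sum(exp(beta * cox_x[at_risk])))
  }))
}

# One-dimensional maximization of the partial likelihood.
beta_hat <- optimize(neg_log_pl, interval = c(-5, 5))$minimum

# Breslow estimate of the baseline cumulative hazard at the event times.
ev_times <- sort(cox_time[cox_status == 1])
dH0 <- sapply(ev_times, function(t) {
  1 / sum(exp(beta_hat * cox_x[cox_time >= t]))
})
H0 <- cumsum(dH0)

# Baseline survival and the curve for a subject with x = 1.
S0   <- exp(-H0)
S_x1 <- S0 ^ exp(beta_hat * 1)

data.frame(ev_times, H0, S0, S_x1)
```

Fitting the same toy data with coxph (with method = "breslow") should give a coefficient close to `beta_hat`, up to the tolerance of `optimize`.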

Illustration of the steps used in estimating the cumulative incidence functions for AIDS and SI appearance in the SI data

Example: The following table illustrates the steps in estimating the cumulative incidence functions for AIDS and SI appearance in the SI data. For example, at time $t_j$ = 0.112, SI appeared in one individual. The estimated overall survival at the previous time point is 1 (there was no earlier event), and the estimate of the failure rate $\hat{h}_2(0.112)$ is 1/329 = 0.0030. Since the overall survival is one, 0.0030 is also the estimate of the unconditional probability $\hat{p}_2(0.112)$. The first AIDS event occurs at time 1.440. At this time, 309 patients are at risk. The estimated overall survival at the previous time point 1.437 is 0.9723, and the estimate of the failure rate $\hat{h}_1(1.440)$ is 1/309 = 0.0032, yielding 0.9723 $\times$ 0.0032 = 0.0031 for the estimated unconditional failure probability.
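The arithmetic in this example can be reproduced in a few lines of R, using only the numbers quoted in the text (329 and 309 at risk, previous overall survival 1 and 0.9723). The variable names are illustrative.

```r
# First SI event at t = 0.112: 329 at risk, previous overall survival 1.
h2_hat <- 1 / 329
p2_hat <- 1 * h2_hat

# First AIDS event at t = 1.440: 309 at risk, previous survival 0.9723.
h1_hat <- 1 / 309
p1_hat <- 0.9723 * h1_hat

round(c(h2 = h2_hat, p2 = p2_hat, h1 = h1_hat, p1 = p1_hat), 4)
# 0.0030 0.0030 0.0032 0.0031
```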

Demonstration

Data Manipulation

The data used in the tutorial is available in mstate under the name aidssi. As a preliminary, we introduce two ways of representing the same data.
The first of these is the standard way of representing competing risks data. Consider the first six patients of the SI data set, in regular format:

data(aidssi)
si <- aidssi # Just a shorter name
head(si)

##   patnr   time status      cause ccr5
## 1     1  9.106      1       AIDS   WW
## 2     2 11.039      0 event-free   WM
## 3     3  2.234      1       AIDS   WW
## 4     4  9.878      2         SI   WM
## 5     5  3.819      1       AIDS   WW
## 6     6  6.801      1       AIDS   WW

table(si$status)

## 
##   0   1   2 
## 107 114 108

Here a single time and cause variable are used to indicate time of failure (or censoring) and cause of failure. The variable status is just a numeric representation of cause. The whole data set represented in this format will be called si.
An alternative way of representing the same data is in long format (the SI data set in long format is called silong). We will see later that this representation allows for more flexibility in modelling the effect of covariates.
To prepare data in long format, it is possible to use msprep. In this case there is not a huge advantage in using msprep; the long data may just as easily be prepared directly. Nevertheless we will illustrate the use of msprep to obtain data in long format. The function trans.comprisk prepares a transition matrix for competing risks models. The first argument is the number of causes of failure; in the names argument a character vector of length three (the total number of states in the multi-state model including the failure-free state) may be given. The transition matrix has three states, with state 1 being the failure-free state and the subsequent states representing the different causes of failure.

tmat <- trans.comprisk(2, names = c("event-free", "AIDS", "SI"))
tmat

##             to
## from         event-free AIDS SI
##   event-free         NA    1  2
##   AIDS               NA   NA NA
##   SI                 NA   NA NA
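For readers who want to see what such a transition matrix amounts to, the same structure can be built by hand in base R. This is only a sketch of the layout shown above (the name `tmat_manual` is hypothetical); it is not how trans.comprisk is implemented.

```r
# Number of causes and state names, as in the call above.
n_causes    <- 2
state_names <- c("event-free", "AIDS", "SI")

# Start with all transitions impossible (NA), then number the transitions
# from the failure-free state to each cause of failure.
tmat_manual <- matrix(NA, n_causes + 1, n_causes + 1,
                      dimnames = list(from = state_names, to = state_names))
tmat_manual[1, 2:(n_causes + 1)] <- seq_len(n_causes)

tmat_manual
```

Only the first row contains transition numbers, because the failure states are absorbing in a competing risks model.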

Now follows the actual call to msprep.

si$stat1 <- as.numeric(si$status == 1)
si$stat2 <- as.numeric(si$status == 2)
silong <- msprep(time = c(NA, "time", "time"), 
                 status = c(NA, "stat1", "stat2"), 
                 data = si, keep = "ccr5", 
                 trans = tmat)

We can use events to check whether the number of events from original data (si) corresponds with long data.

events(silong)

## $Frequencies
##             to
## from         event-free AIDS  SI no event total entering
##   event-free          0  114 108      107            329
##   AIDS                0    0   0      114            114
##   SI                  0    0   0      108            108
## 
## $Proportions
##             to
## from         event-free      AIDS        SI  no event
##   event-free  0.0000000 0.3465046 0.3282675 0.3252280
##   AIDS        0.0000000 0.0000000 0.0000000 1.0000000
##   SI          0.0000000 0.0000000 0.0000000 1.0000000

For the regression analyses to be performed later we add transition-specific covariates. In the context of competing risks one could call them cause-specific covariates. Since the factor levels of CCR5 are quite short we keep the default setting (TRUE) of longnames.

silong <- expand.covs(silong, "ccr5")
silong[1:8, ]

## An object of class 'msdata'
## 
## Data:
##   id from to trans Tstart  Tstop   time status ccr5 ccr5WM.1 ccr5WM.2
## 1  1    1  2     1      0  9.106  9.106      1   WW        0        0
## 2  1    1  3     2      0  9.106  9.106      0   WW        0        0
## 3  2    1  2     1      0 11.039 11.039      0   WM        1        0
## 4  2    1  3     2      0 11.039 11.039      0   WM        0        1
## 5  3    1  2     1      0  2.234  2.234      1   WW        0        0
## 6  3    1  3     2      0  2.234  2.234      0   WW        0        0
## 7  4    1  2     1      0  9.878  9.878      0   WM        1        0
## 8  4    1  3     2      0  9.878  9.878      1   WM        0        1

If there are $K$ competing events, each individual needs $K$ rows in the new data file, one for each possible cause of failure. A column (trans in the example) is used to denote the event type or failure cause that the row refers to. The value of the time variable is identical over the $K$ rows of an individual. The status variable changes. Instead of values $0, 1, \dots, K$, it now has the value 1 if the corresponding event type is the one that occurred, and it has the value 0 otherwise. Any covariates are simply replicated for each patient over the $K$ rows of that individual. We have also introduced two extra dummy variables ccr5WM.1 and ccr5WM.2. They have the value 0 except for mutant (WM) genotypes for the cause that they correspond to (i.e. for a patient with the mutant genotype, ccr5WM.1 = 1 for the first cause, ‘AIDS’, ccr5WM.2 = 1 for the second cause, ‘SI’).
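As noted earlier, the long data may just as easily be prepared directly. The sketch below does this in base R for the first four patients listed above, without msprep or expand.covs. The names `wide` and `long` are hypothetical, and the result only contains the columns discussed here, not the full msdata structure.

```r
# First four patients of the SI data, copied from the printed output.
wide <- data.frame(
  patnr  = 1:4,
  time   = c(9.106, 11.039, 2.234, 9.878),
  status = c(1, 0, 1, 2),
  ccr5   = factor(c("WW", "WM", "WW", "WM"), levels = c("WW", "WM"))
)
n_causes <- 2

# One row per patient and cause: replicate each row K times.
long <- wide[rep(seq_len(nrow(wide)), each = n_causes), ]
long$trans <- rep(seq_len(n_causes), times = nrow(wide))

# status becomes 1 only on the row of the cause that actually occurred.
long$status <- as.numeric(long$status == long$trans)

# Cause-specific dummies: 1 for WM genotype on the row of that cause.
long$ccr5WM.1 <- as.numeric(long$ccr5 == "WM" & long$trans == 1)
long$ccr5WM.2 <- as.numeric(long$ccr5 == "WM" & long$trans == 2)

rownames(long) <- NULL
long
```

The status, ccr5WM.1 and ccr5WM.2 columns match the first eight rows of silong printed above.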

Naive Kaplan-Meier

To illustrate the fact that naive Kaplan-Meiers are biased estimators of the probabilities of failing from the different causes of failure, we just make use of the functions in the survival package. I am using coxph below; this could probably be done more quickly.

c1 <- coxph(Surv(time, status) ~ 1, data = silong, 
            subset = (trans == 1), method = "breslow")
c2 <- coxph(Surv(time, status) ~ 1, data = silong, 
            subset = (trans == 2), method = "breslow")
h1 <- survfit(c1)
h1 <- data.frame(time = h1$time, surv = h1$surv)
h2 <- survfit(c2)
h2 <- data.frame(time = h2$time, surv = h2$surv)

These naive Kaplan-Meier curves are shown in Figure 6 (Figure 2 in the tutorial). The Kaplan-Meier estimate of AIDS is plotted as a survival curve, while that of SI appearance is shown as a distribution function. There is some extra code to chop the time at 13 years. This was just done to make the picture prettier.

idx1 <- (h1$time<13) # this restricts the plot to the first 13 years
plot(c(0,h1$time[idx1],13),c(1,h1$surv[idx1],min(h1$surv[idx1])),type="s",
     xlim=c(0,13),ylim=c(0,1),xlab="Years from HIV infection",ylab="Probability",lwd=2)
idx2 <- (h2$time<13)
lines(c(0,h2$time[idx2],13),c(0,1-h2$surv[idx2],max(1-h2$surv[idx2])),type="s",lwd=2)
text(8,0.71,adj=0,"AIDS")
text(8,0.32,adj=0,"SI")

* Figure 6: Estimated survival curve for AIDS and probability of SI appearance, based on the naive Kaplan-Meier estimator.

Cumulative incidence functions

Cumulative incidence functions can be computed using the function Cuminc. It takes as main arguments time and status, which can be provided as vectors

ci <- Cuminc(time = si$time, status = si$status)

or, alternatively, as column names representing time and status, along with a data argument containing these column names.

ci <- Cuminc(time = "time", status = "status", data = aidssi)

The result is a data frame containing the failure-free probabilities (Surv) and the cumulative incidence functions with their standard errors. Other arguments allow you to specify the codes for the causes of failure and a group identifier.

head(ci)

##    time      Surv CI.1        CI.2      seSurv seCI.1      seCI.2
## 1 0.112 0.9969605    0 0.003039514 0.003034891      0 0.003034891
## 2 0.137 0.9939210    0 0.006079027 0.004285436      0 0.004285436
## 3 0.474 0.9908628    0 0.009137246 0.005251290      0 0.005251290
## 4 0.824 0.9877760    0 0.012224046 0.006074796      0 0.006074796
## 5 0.884 0.9846795    0 0.015320522 0.006799283      0 0.006799283
## 6 0.969 0.9815830    0 0.018416998 0.007449696      0 0.007449696

tail(ci)

##       time      Surv      CI.1      CI.2     seSurv     seCI.1     seCI.2
## 211 11.943 0.2312339 0.4035707 0.3651954 0.02638091 0.02978948 0.02881464
## 212 12.129 0.2266092 0.4081954 0.3651954 0.02625552 0.02989297 0.02881464
## 213 12.400 0.2219845 0.4081954 0.3698201 0.02612382 0.02989297 0.02896110
## 214 12.936 0.2165702 0.4081954 0.3752344 0.02604167 0.02989297 0.02919663
## 215 13.361 0.2067261 0.4180395 0.3752344 0.02665370 0.03089977 0.02919663
## 216 13.936 0.0000000 0.4180395 0.5819605 0.00000000 0.03089977 0.03089977
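A useful sanity check on this output is that, at every time point, the failure-free probability and the cumulative incidences add up to one. The sketch below verifies this on three rows copied from the tail shown above (`ci_tail` is a hypothetical name); with the full result one could apply the same row sum to ci itself.

```r
# Rows 211 to 213 of the printed output, copied by hand.
ci_tail <- data.frame(
  time = c(11.943, 12.129, 12.400),
  Surv = c(0.2312339, 0.2266092, 0.2219845),
  CI.1 = c(0.4035707, 0.4081954, 0.4081954),
  CI.2 = c(0.3651954, 0.3651954, 0.3698201)
)

# Surv + CI.1 + CI.2 should equal 1 up to rounding of the printed digits.
rowSums(ci_tail[, c("Surv", "CI.1", "CI.2")])
```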

The cumulative incidence functions just obtained can be used to reproduce Figure 3 of the tutorial.

idx0 <- (ci$time < 13)
plot(c(0, ci$time[idx0], 13), c(1, 1 - ci$CI.1[idx0], 
                                min(1 - ci$CI.1[idx0])), 
     type = "s", xlim = c(0, 13), ylim = c(0, 1), 
     xlab = "Years from HIV infection", 
     ylab = "Probability", lwd = 2)
idx1 <- (h1$time < 13)
lines(c(0, h1$time[idx1], 13), c(1, h1$surv[idx1], min(h1$surv[idx1])),
      type = "s", lwd = 2, col = 8)
lines(c(0, ci$time[idx0], 13), c(0, ci$CI.2[idx0], max(ci$CI.2[idx0])),
      type = "s", lwd = 2)
idx2 <- (h2$time < 13)
lines(c(0, h2$time[idx2], 13), c(0, 1 - h2$surv[idx2], 
                                 max(1 - h2$surv[idx2])), 
      type = "s", lwd = 2, col = 8)
text(8, 0.77, adj = 0, "AIDS")
text(8, 0.275, adj = 0, "SI")

The figure indicates estimates of probabilities of AIDS and SI appearance, based on the naive Kaplan-Meier (grey) and on cumulative incidence functions (black).
The stacked plots are shown as follows.

idx0 <- (ci$time < 13)
plot(c(0, ci$time[idx0]), c(0, ci$CI.1[idx0]), type = "s", 
     xlim = c(0,13), 
     ylim = c(0, 1), xlab = "Years from HIV infection", ylab = "Probability",
     lwd = 2)
lines(c(0, ci$time[idx0]), c(0, ci$CI.1[idx0] + ci$CI.2[idx0]),
      type = "s", lwd = 2)
text(13, 0.5 * max(ci$CI.1[idx0]), adj = 1, "AIDS")
text(13, max(ci$CI.1[idx0]) + 0.5 * max(ci$CI.2[idx0]), adj = 1, "SI")
text(13, 0.5 + 0.5 * max(ci$CI.1[idx0]) + 0.5 * max(ci$CI.2[idx0]),
     adj = 1, "Event-free")

* The figure indicates cumulative incidence curves of AIDS and SI appearance. The cumulative incidence functions are stacked; the distances between two curves represent the probabilities of the different events.

Modelling and estimating covariate effects

Just like in standard survival analysis, the effect of one or two binary covariates is most easily investigated by estimating cumulative incidence curves non-parametrically and testing whether the curves differ by covariate value. Gray [20] developed a log-rank type test for equality of cumulative incidence curves.
In proportional hazards regression on the cause-specific hazards, we model the cause-specific hazard of cause $k$ for a subject with covariate vector $x$ as \[h_k(t|x)=h_{k,0}(t)\exp(\beta_k x)\] where $h_{k,0}(t)$ is the baseline cause-specific hazard of cause $k$, and the vector $\beta_k$ represents the covariate effects on cause $k$.
The analysis is completely standard, but the interpretation requires caution. At each time some person moves to state $k$, the covariate values of this individual are compared with the covariates of all other individuals still event-free and in follow-up. Persons who move to another state are censored at their transition time.
In this subsection we shall illustrate the use of R in carrying out some of the regression analyses based on the SI data set. A specific deletion in the C–C chemokine receptor 5 gene (CCR5 $\Delta$ 32) has been associated with reduced susceptibility to HIV infection and delayed AIDS progression. Since NSI viruses use CCR5 for cell entry, whereas SI viruses can also use C-X-C chemokine receptor 4 (CXCR4), the latter virus type may have an advantage in persons with the deletion.
Therefore, we investigate whether in persons with the deletion the SI phenotype appears more rapidly. This question has been addressed using standard survival analysis techniques, which implicitly assumed that a switch to SI and progression to AIDS are independent mechanisms. The CCR5 genotype is incorporated in the SI data set through the covariate ccr5. Persons without the deletion (‘wild type’) have WW, the reference category, whereas individuals who have the deletion on one of the chromosomes have WM (individuals with the deletion on both chromosomes were not present in our data).
Let us look at the effect of CCR5 (classified as wild-type (WW) or mutant (WM)) on AIDS and SI appearance. A total of 259 out of 324 patients (80 per cent) had the wild-type variant, while 65 patients (20 per cent) had the mutant variant. Five patients had unknown CCR5 genotype.
- Using the original dataset, we can apply ordinary Cox regression for cause 1 (AIDS), taking only the AIDS cases as events. This is done by specifying status==1 below (observations with status=0 (true censorings) and status=2 (SI) are treated as censorings). Similarly for cause 2 (SI appearance), where status==2 indicates that only failures due to SI appearance are to be treated as events.

coxph(Surv(time, status == 1) ~ ccr5, data = si) # AIDS

## Call:
## coxph(formula = Surv(time, status == 1) ~ ccr5, data = si)
## 
##           coef exp(coef) se(coef)      z        p
## ccr5WM -1.2358    0.2906   0.3071 -4.024 5.72e-05
## 
## Likelihood ratio test=21.98  on 1 df, p=2.756e-06
## n= 324, number of events= 113 
##    (5 observations deleted due to missingness)

coxph(Surv(time, status == 2) ~ ccr5, data = si) # SI appearance

## Call:
## coxph(formula = Surv(time, status == 2) ~ ccr5, data = si)
## 
##           coef exp(coef) se(coef)      z     p
## ccr5WM -0.2542    0.7755   0.2380 -1.068 0.286
## 
## Likelihood ratio test=1.19  on 1 df, p=0.2748
## n= 324, number of events= 107 
##    (5 observations deleted due to missingness)

The estimated coefficient for the mutant with respect to the wild-type variant for AIDS was -1.24 (SE 0.31), giving a significant protective effect of the mutant variant (hazard ratio (HR) = 0.29, P $<$ 0.0001). The effect of CCR5 on SI appearance was not significant (coefficient -0.25, SE 0.24, HR 0.78, p = 0.29).
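The hazard ratio, Wald statistic and p-value quoted here follow directly from the coefficient and its standard error. The sketch below recomputes them from the rounded values in the printed output and adds a 95 per cent Wald confidence interval, which is not shown in the output above; the variable names are illustrative.

```r
# Coefficient and standard error for AIDS, copied from the printed output.
coef_aids <- -1.2358
se_aids   <- 0.3071

hr_aids <- exp(coef_aids)              # hazard ratio
z_aids  <- coef_aids / se_aids         # Wald statistic
p_aids  <- 2 * pnorm(-abs(z_aids))     # two-sided p-value

# 95% Wald confidence interval for the hazard ratio.
ci_aids <- exp(coef_aids + c(-1, 1) * qnorm(0.975) * se_aids)

round(hr_aids, 4)   # 0.2906
round(z_aids, 3)    # -4.024
p_aids
ci_aids
```

With a fitted coxph object, `summary()` reports the same interval in its conf.int component, so this manual step is only needed when working from printed coefficients.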
The same analysis can be performed using the long format dataset silong in several ways. For instance, as separate Cox regressions.

coxph(Surv(time, status) ~ ccr5, data = silong, 
      subset = (trans == 1), method = "breslow")

## Call:
## coxph(formula = Surv(time, status) ~ ccr5, data = silong, subset = (trans == 
##     1), method = "breslow")
## 
##           coef exp(coef) se(coef)      z        p
## ccr5WM -1.2358    0.2906   0.3071 -4.024 5.73e-05
## 
## Likelihood ratio test=21.98  on 1 df, p=2.758e-06
## n= 324, number of events= 113 
##    (5 observations deleted due to missingness)

coxph(Surv(time, status) ~ ccr5, data = silong, 
      subset = (trans == 2), method = "breslow")

## Call:
## coxph(formula = Surv(time, status) ~ ccr5, data = silong, subset = (trans == 
##     2), method = "breslow")
## 
##           coef exp(coef) se(coef)      z     p
## ccr5WM -0.2542    0.7755   0.2380 -1.068 0.286
## 
## Likelihood ratio test=1.19  on 1 df, p=0.2748
## n= 324, number of events= 107 
##    (5 observations deleted due to missingness)

Another is to use the cause-specific dummies ccr5WM.1 and ccr5WM.2, which gives an attractively simple analysis:

coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + strata(trans),
      data = silong)

## Call:
## coxph(formula = Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + strata(trans), 
##     data = silong)
## 
##             coef exp(coef) se(coef)      z        p
## ccr5WM.1 -1.2358    0.2906   0.3071 -4.024 5.72e-05
## ccr5WM.2 -0.2542    0.7755   0.2380 -1.068    0.286
## 
## Likelihood ratio test=23.17  on 2 df, p=9.294e-06
## n= 648, number of events= 220 
##    (10 observations deleted due to missingness)
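
To make the role of the cause-specific dummies concrete, the sketch below builds a small long-format data set by hand. The toy values and the column name event are hypothetical (in silong the event indicator is called status, and the expansion is normally done by mstate); only the construction of ccr5WM.1 and ccr5WM.2 mirrors the analysis above.

```r
# Toy wide-format data (hypothetical values): one row per subject,
# status 0 = censored, 1 = AIDS, 2 = SI appearance.
wide <- data.frame(id     = 1:4,
                   time   = c(2.5, 4.0, 6.1, 9.3),
                   status = c(1, 2, 0, 1),
                   ccr5   = factor(c("WW", "WM", "WM", "WW"),
                                   levels = c("WW", "WM")))

# Stack one copy of the data per cause; 'trans' labels the cause.
long <- rbind(cbind(wide, trans = 1), cbind(wide, trans = 2))
long <- long[order(long$id, long$trans), ]

# The event indicator is 1 only on the row whose cause matches
# the observed failure type.
long$event <- as.numeric(long$status == long$trans)

# Cause-specific dummies: the CCR5 indicator is switched on only in
# the rows of the corresponding cause.
wm <- as.numeric(long$ccr5 == "WM")
long$ccr5WM.1 <- wm * (long$trans == 1)
long$ccr5WM.2 <- wm * (long$trans == 2)
long
```

Each subject contributes two rows, which is why the long-format fits report twice as many observations as the wide-format fits.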

The n = 648 reported here is the number of rows in the long data set without missing data (two times 324); R notes that 10 observations were deleted because of missing covariate values.
The same model can also be fitted with a covariate by cause interaction, that is, an interaction between CCR5 and the cause (transition) variable, while still stratifying by cause.

coxph(Surv(time, status) ~ ccr5 * factor(trans) +
        strata(trans),
      data = silong)

## Call:
## coxph(formula = Surv(time, status) ~ ccr5 * factor(trans) + strata(trans), 
##     data = silong)
## 
##                          coef exp(coef) se(coef)      z        p
## ccr5WM                -1.2358    0.2906   0.3071 -4.024 5.72e-05
## factor(trans)2             NA        NA   0.0000     NA       NA
## ccr5WM:factor(trans)2  0.9816    2.6688   0.3886  2.526   0.0115
## 
## Likelihood ratio test=23.17  on 2 df, p=9.294e-06
## n= 648, number of events= 220 
##    (10 observations deleted due to missingness)

Now we see the advantage of the long format. The notation allows the effect of a covariate to differ by failure cause, and it also makes it possible either to assume that the effects of CCR5 are identical for the different causes or to test for equality of the effects of CCR5 on AIDS and SI appearance. The coefficient -1.236 is (as before) the effect of CCR5 on AIDS. The interaction coefficient 0.982 represents the difference between the effects of CCR5 on the two cause-specific hazards. This CCR5 genotype by cause interaction is significant (p = 0.0115), indicating that the effect of CCR5 differs between AIDS and SI appearance. The effect of CCR5 on SI appearance is thus -1.236 + 0.982 = -0.254, as before. Note that the row of NAs in the output above arises because the main effect of cause cannot be estimated: the two baseline cause-specific hazards are both freely estimated through strata(trans).
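Written out, the interaction model takes the following form. The notation is an assumption made for this sketch: $h_k(t \mid Z)$ denotes the cause-specific hazard of cause $k$ ($k = 1$ for AIDS, $k = 2$ for SI appearance), $Z = 1$ for the WM genotype and $Z = 0$ for WW. \[h_k(t \mid Z) = h_{k,0}(t)\exp\bigl(\beta Z + \gamma Z\,\mathbf{1}\{k=2\}\bigr), \qquad k = 1, 2\]
The effect on AIDS is then $\beta_1 = \beta$ and the effect on SI appearance is $\beta_2 = \beta + \gamma$. With the estimates above, $\hat\beta = -1.2358$ and $\hat\gamma = 0.9816$, so \[\hat\beta_2 = -1.2358 + 0.9816 = -0.2542, \qquad \exp(\hat\gamma) = 2.6688 \approx \frac{0.7755}{0.2906}\]
That is, $\exp(\hat\gamma)$ is the ratio of the two cause-specific hazard ratios, and the Wald test of $\gamma = 0$ ($z = 0.9816 / 0.3886 = 2.526$) is the test for equality of the CCR5 effects on the two causes.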
If we were willing to assume that CCR5 has the same effect on both cause-specific hazards, we could fit a stratified model with a single CCR5 coefficient, as below. The significant interaction in the previous model indicates that this assumption is not appropriate here, so the fit is shown for illustration only.

coxph(Surv(time, status) ~ ccr5 + strata(trans), data = silong)

## Call:
## coxph(formula = Surv(time, status) ~ ccr5 + strata(trans), data = silong)
## 
##           coef exp(coef) se(coef)     z        p
## ccr5WM -0.7012    0.4960   0.1860 -3.77 0.000163
## 
## Likelihood ratio test=16.46  on 1 df, p=4.972e-05
## n= 648, number of events= 220 
##    (10 observations deleted due to missingness)
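
The statement that a common CCR5 effect is not supported can also be checked with a likelihood ratio test between the common-effect model and the model with separate effects. The sketch below only reuses the likelihood ratio statistics printed in the outputs above, so the result is subject to their rounding; the variable names are hypothetical.

```r
# Likelihood ratio statistics copied from the coxph() outputs above.
# Both are relative to the same null model (no covariates, stratified
# by trans), so their difference compares the two nested models.
lr_separate <- 23.17  # ccr5WM.1 + ccr5WM.2 + strata(trans), 2 df
lr_common   <- 16.46  # ccr5 + strata(trans), 1 df

lr_diff <- lr_separate - lr_common
df_diff <- 2 - 1
p_value <- pchisq(lr_diff, df = df_diff, lower.tail = FALSE)
c(statistic = lr_diff, df = df_diff, p = p_value)
```

The resulting p-value is below 0.01, which agrees with the Wald test of the interaction term (p = 0.0115) in the earlier output.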

There are two alternative ways yielding the same result.
First, we can actually leave out the strata term. The reason is that in both strata the risk sets as well as the covariate values (here ccr5) are equal.

coxph(Surv(time, status) ~ ccr5, data = silong)

## Call:
## coxph(formula = Surv(time, status) ~ ccr5, data = silong)
## 
##           coef exp(coef) se(coef)      z        p
## ccr5WM -0.7012    0.4960   0.1860 -3.771 0.000163
## 
## Likelihood ratio test=16.46  on 1 df, p=4.964e-05
## n= 648, number of events= 220 
##    (10 observations deleted due to missingness)

Second, since the strata term is not needed we can use si. (Please note that the actual estimated baseline hazards may be different, whether or not the strata term is used.)

coxph(Surv(time, status != 0) ~ ccr5, data = si)

## Call:
## coxph(formula = Surv(time, status != 0) ~ ccr5, data = si)
## 
##           coef exp(coef) se(coef)      z        p
## ccr5WM -0.7013    0.4959   0.1860 -3.771 0.000163
## 
## Likelihood ratio test=16.47  on 1 df, p=4.953e-05
## n= 324, number of events= 220 
##    (5 observations deleted due to missingness)

Finally, we show the analyses under the assumption that the baseline cause-specific hazards for AIDS and SI appearance are proportional. Cause is now used not as a stratum but as a covariate, for which a relative risk parameter is estimated.
This is generally not a realistic assumption and is made here for illustration purposes only.

coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + factor(trans),
      data = silong)

## Call:
## coxph(formula = Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + factor(trans), 
##     data = silong)
## 
##                   coef exp(coef) se(coef)      z       p
## ccr5WM.1       -1.1664    0.3115   0.3063 -3.808 0.00014
## ccr5WM.2       -0.3316    0.7178   0.2366 -1.401 0.16112
## factor(trans)2 -0.1843    0.8317   0.1477 -1.248 0.21201
## 
## Likelihood ratio test=21.54  on 3 df, p=8.124e-05
## n= 648, number of events= 220 
##    (10 observations deleted due to missingness)

The coefficient -0.184 and its hazard ratio 0.832 would indicate that (under the assumption of the cause-specific hazards being proportional) the baseline cause-specific hazard of SI appearance is somewhat smaller than that of AIDS, though not significant (p = 0.21).
Even though the assumption of proportional baseline cause-specific hazards will often be unrealistic, this proportional risk model has the nice property that the probability of an individual failing of cause k follows a logistic model.
The same model can again be written with a covariate by cause (transition) interaction.

coxph(Surv(time, status) ~ ccr5 * factor(trans), data = silong)

## Call:
## coxph(formula = Surv(time, status) ~ ccr5 * factor(trans), data = silong)
## 
##                          coef exp(coef) se(coef)      z       p
## ccr5WM                -1.1664    0.3115   0.3063 -3.808 0.00014
## factor(trans)2        -0.1843    0.8317   0.1477 -1.248 0.21201
## ccr5WM:factor(trans)2  0.8348    2.3044   0.3855  2.165 0.03035
## 
## Likelihood ratio test=21.54  on 3 df, p=8.124e-05
## n= 648, number of events= 220 
##    (10 observations deleted due to missingness)
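
The logistic property mentioned above can be illustrated with the fitted coefficients. If the baseline hazard of SI appearance is exp(alpha) times that of AIDS, then, given that a failure occurs, the probability that it is SI appearance rather than AIDS is the inverse logit of alpha + (beta_2 - beta_1) Z. The sketch below is only a numerical illustration under that assumption; the variable names are hypothetical.

```r
# Coefficients copied from the proportional-baseline fit above
alpha <- -0.1843   # factor(trans)2: log ratio of baseline hazards (SI vs AIDS)
delta <-  0.8348   # ccr5WM:factor(trans)2, i.e. beta_2 - beta_1

# Probability that a failure is SI appearance rather than AIDS,
# given that a failure occurs; plogis() is the inverse logit.
p_si_given_event <- c(WW = plogis(alpha), WM = plogis(alpha + delta))
round(p_si_given_event, 3)
```

Under this model a failure in the WM group is more likely to be SI appearance than a failure in the WW group, because the mutant genotype protects much more strongly against AIDS than against SI appearance.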

Note that, even though patients are replicated in the long format, it is not necessary to use robust standard errors. Adding a cluster(id) term to any of the previous analyses of the silong dataset leaves the coefficient estimates unchanged, and the robust standard errors are close to the model-based ones. For instance,

coxph(Surv(time, status) ~ ccr5 * factor(trans) + cluster(id),
      data = silong)

## Call:
## coxph(formula = Surv(time, status) ~ ccr5 + factor(trans) + ccr5:factor(trans), 
##     data = silong, cluster = id)
## 
##                          coef exp(coef) se(coef) robust se      z        p
## ccr5WM                -1.1664    0.3115   0.3063    0.2928 -3.983 6.81e-05
## factor(trans)2        -0.1843    0.8317   0.1477    0.1477 -1.248   0.2121
## ccr5WM:factor(trans)2  0.8348    2.3044   0.3855    0.3855  2.165   0.0304
## 
## Likelihood ratio test=21.54  on 3 df, p=8.124e-05
## n= 648, number of events= 220 
##    (10 observations deleted due to missingness)

The covariate effects are proportional for the cause-specific hazards. In the absence of competing risks this would mean that the survival functions for different values of the covariates are related through a simple formula. If $S_1$ and $S_2$ are the survival functions for covariate values $x_1$ and $x_2$, and $\beta$ is the regression coefficient, then \[S_2(t)=S_1(t)^{\exp(\beta (x_2 - x_1))}\]
However, in the presence of competing risks, when the effect of the same covariates are also modelled for other causes of failure, this relation does not extend to cumulative incidence functions.
The reason is that the cumulative incidence function for cause $k$ not only depends on the hazard of cause $k$, but also on the hazards of all other causes. Hence the relation of the cumulative incidence functions of cause $k$ for two different covariate values not only depends on the effect of the covariate on cause $k$, but also on the effects of the covariate on all other causes and on the baseline hazards of all other causes.
As a result, the simple effect of a covariate on the cause-specific hazard of cause $k$ can be quite unpredictable when expressed in terms of the cumulative incidence function.
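
The point can be made with a small numerical sketch. It assumes constant cause-specific hazards, for which the cumulative incidence has a closed form; the baseline rates of 0.05 per year are invented for illustration, and only the hazard ratios 0.29 and 0.78 are borrowed from the fits above.

```r
# Cumulative incidence of cause k under constant cause-specific hazards:
# I_k(t) = h_k / h_total * (1 - exp(-h_total * t))
cif_const <- function(years, h_k, h_total) {
  h_k / h_total * (1 - exp(-h_total * years))
}

# Assumed baseline (x = 0) hazards per year, purely for illustration
h1_0 <- 0.05; h2_0 <- 0.05
# Hazard ratios for x = 1: strong effect on cause 1, mild on cause 2
hr1 <- 0.29; hr2 <- 0.78
h1_1 <- hr1 * h1_0; h2_1 <- hr2 * h2_0

years   <- c(1, 5, 10, 20, 40)
cif2_x0 <- cif_const(years, h2_0, h1_0 + h2_0)
cif2_x1 <- cif_const(years, h2_1, h1_1 + h2_1)
round(data.frame(years, cif2_x0, cif2_x1, diff = cif2_x1 - cif2_x0), 3)
```

Although the hazard ratio for cause 2 is below 1 at every time point, the cumulative incidence of cause 2 is lower for x = 1 only at the early time points and higher at the later ones, because the covariate removes much of the competition from cause 1.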
To examine these associations, we plot the predicted cumulative incidence functions.
Figure 5 shows the estimated cumulative incidence functions for both wild-type and mutant variants of CCR5 based on the above regression model and formulas for AIDS and for SI appearance.
In order to obtain predicted cumulative incidences, msprep is useful. First let us store our analysis with separate covariate effects for the two causes.

c1 <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + strata(trans),
            data = silong, method = "breslow")

If we want the predicted cumulative incidences for an individual with CCR5 wild-type (WW), we make a newdata data frame containing the (transition-specific) covariate values for each of the transitions for the individual of interest. Then we apply msfit as illustrated earlier in the context of multi-state models.

WW <- data.frame(ccr5WM.1 = c(0, 0), ccr5WM.2 = c(0, 0), 
                 trans = c(1,2), strata = c(1, 2))
msf.WW <- msfit(c1, WW, trans = tmat)

And finally, to obtain the cumulative incidences we apply probtrans. Item [[1]] is selected because the prediction starts from state 1 (event-free) at time s = 0.

pt.WW <- probtrans(msf.WW, 0)[[1]]

Similarly for an individual with the CCR5 mutant (WM) genotype.

WM <- data.frame(ccr5WM.1 = c(1, 0), ccr5WM.2 = c(0, 1), 
                 trans = c(1, 2), strata = c(1, 2))
msf.WM <- msfit(c1, WM, trans = tmat)
pt.WM <- probtrans(msf.WM, 0)[[1]]

We now plot these cumulative incidence curves for AIDS (pstate2) and SI appearance (pstate3), for wild-type (WW) and mutant (WM).

idx1 <- (pt.WW$time < 13)
idx2 <- (pt.WM$time < 13)
plot(c(0, pt.WW$time[idx1]), c(0, pt.WW$pstate2[idx1]), type = "s",
     ylim = c(0, 0.5), xlab = "Years from HIV infection", ylab = "Probability",
     lwd = 2)
lines(c(0, pt.WM$time[idx2]), c(0, pt.WM$pstate2[idx2]), type = "s",
     lwd = 2, col = 8)
title(main = "AIDS")
text(9.2, 0.345, "WW", adj = 0, cex = 0.75)
text(9.2, 0.125, "WM", adj = 0, cex = 0.75)

This figure represents cumulative incidence functions for AIDS, for wildtype (WW) and mutant (WM) CCR5 genotype, based on a proportional hazards model on the cause-specific hazards.

plot(c(0, pt.WW$time[idx1]), c(0, pt.WW$pstate3[idx1]), type = "s",
     ylim = c(0, 0.5), xlab = "Years from HIV infection", ylab = "Probability",
     lwd = 2)
lines(c(0, pt.WM$time[idx2]), c(0, pt.WM$pstate3[idx2]), type = "s",
     lwd = 2, col = 8)
title(main = "SI appearance")
text(7.5, 0.31, "WW", adj = 0, cex = 0.75)
text(7.5, 0.245, "WM", adj = 0, cex = 0.75)

This figure represents cumulative incidence functions for SI appearance, for wildtype (WW) and mutant (WM) CCR5 genotype, based on a proportional hazards model on the cause-specific hazards.
From the two figures above, we can make the following observations:
- While the protective effect of the mutant WM on AIDS is clear, on close inspection the effect of CCR5 on the probability of SI appearance is not what would be expected in a standard situation without competing risks.
- In that standard situation, since the hazard ratio is 0.78, patients with the mutant genotype would have a consistently lower probability of SI appearance, and the difference in SI probabilities between mutant and wild-type would increase with time.
- Here, although the probability of SI appearance is initially lower for the mutant WM, after approximately 9 years the difference decreases rather than increases, and after 11 years the cumulative incidence functions of SI appearance for WW and WM cross. This happens because, although the hazard of SI appearance is lower for WM, the hazard of AIDS is also lower for WM, and that effect is much stronger. Both the effect of the covariate on the competing risk and the baseline hazard of the competing risk influence the effect of the covariate on the cumulative incidence of the event of interest.
- The fact that the baseline hazard of the competing risk matters is perhaps unexpected, so we illustrate in two ways that the baseline hazard of AIDS (i.e. the hazard for the wild-type WW) plays an important role here.
In the first illustration, shown in the following figure, we consider a somewhat idealized situation with a population of 10000 individuals with the wild-type WW genotype and 10000 individuals with the mutant WM genotype.

The difference between covariate effects on cause-specific hazards and cumulative incidence explained

We assume that WW individuals have a constant failure rate of 30 per cent at discrete time points, for both endpoints. The mutation WM is protective for the cause-specific hazard to SI appearance (hazard ratio 0.90). However, it is even more protective for AIDS diagnosis (hazard ratio 0.33). This latter aspect causes more individuals to remain at risk after the first round for WM. Hence, in the second round, SI appears in more individuals with WM than in individuals with WW (1701 to 1200).
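
The two rounds described above can be tallied in a few lines. This is a sketch under one stated assumption: the AIDS failure rate for WM is taken as 10 per cent per round (the hazard ratio 0.33 applied as roughly one third of 30 per cent), which reproduces the count of 1701. The function name si_by_round is hypothetical.

```r
# Per-round failure probabilities. The WM AIDS rate of 0.10 assumes the
# hazard ratio 0.33 acts as roughly one third of 0.30; 0.27 = 0.90 * 0.30.
rates <- list(WW = c(aids = 0.30, si = 0.30),
              WM = c(aids = 0.10, si = 0.27))

# Number of SI appearances in each round, starting from n at risk
si_by_round <- function(rate, n = 10000, rounds = 2) {
  at_risk <- n
  si <- numeric(rounds)
  for (r in seq_len(rounds)) {
    si[r] <- at_risk * rate[["si"]]
    at_risk <- at_risk - at_risk * (rate[["aids"]] + rate[["si"]])
  }
  si
}

si_counts <- sapply(rates, si_by_round)  # rows = rounds, columns = genotype
si_counts
colSums(si_counts) / 10000               # cumulative incidence of SI after round 2
```

In the first round SI appears in 3000 WW and 2700 WM individuals; in the second round the order reverses (1200 versus 1701), giving cumulative incidences of 0.42 for WW and 0.4401 for WM.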
As a result, after the second round, the cumulative incidence of SI appearance is higher for individuals with the WM genotype than for individuals with the WW genotype. The second illustration of this phenomenon is given by the figures below, which show what would happen if we were to change the baseline hazard of AIDS by multiplying the estimate from the data by different multiplication factors, while keeping everything else (the baseline cause-specific hazard of SI appearance, and the effects of CCR5 on both cause-specific hazards) the same.
The sub-plot with factor = 0 corresponds to the standard Cox regression in the absence of the competing risk ‘AIDS’. Here the difference in probabilities of SI appearance between wild-type and mutant indeed increases with time. As the competition from AIDS is increased, the higher cause-specific hazard of SI appearance for WW compared to WM is offset by an increasingly smaller contribution from the overall survival $S(s) = \exp(-(H_{AIDS}(s)+ H_{SI}(s)))$ for WW, where the contribution of AIDS, $H_{AIDS}(s)$, increases as the multiplication factor increases. At first this results in a crossing of the cumulative incidence curves (see e.g. factor = 1; such a crossing is not possible in the absence of competing risks), which occurs earlier with increasing multiplication factor. With factor = 4, the effect of CCR5 on the cumulative incidence of SI appearance is the reverse of what the hazard ratio of 0.78 of WM with respect to WW seems to suggest.
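The trade-off described here can be read off the formula for the cumulative incidence. In the notation assumed for this sketch, $h_{SI}(s)$ is the cause-specific hazard of SI appearance, $H_{AIDS}$ and $H_{SI}$ are the cumulative cause-specific hazards, and $f$ is the multiplication factor applied to the hazard of AIDS: \[I_{SI}(t) = \int_0^t h_{SI}(s)\,S(s)\,ds, \qquad S(s) = \exp\bigl(-(f\,H_{AIDS}(s) + H_{SI}(s))\bigr)\]
For $f = 0$ this reduces to $I_{SI}(t) = 1 - \exp(-H_{SI}(t))$, the usual relation between hazard and failure probability. For $f > 0$ the integrand combines a larger $h_{SI}(s)$ for WW with a smaller $S(s)$ for WW, and the second factor shrinks faster as $f$ grows.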
This illustration, that the same cause-specific hazard ratio may have different effects on the cumulative incidences, can be reproduced by replacing the appropriate parts of the cumulative hazard of AIDS (trans = 1) and calling probtrans. We are interested in SI appearance and adjust the hazard of the competing risk (AIDS) while keeping everything else the same. We multiply the baseline hazard of AIDS by the factors ff = 0, 0.5, 1, 1.5, 2, 4; the result is shown below.

ffs <- c(0, 0.5, 1, 1.5, 2, 4)
newmsf.WW <- msf.WW
newmsf.WM <- msf.WM

par(mfrow = c(2, 3))
for (ff in ffs) {
  newmsf.WW$Haz$Haz[newmsf.WW$Haz$trans == 1] <- ff * msf.WW$Haz$Haz[msf.WW$Haz$trans == 1]
  pt.WW <- probtrans(newmsf.WW, 0, variance = FALSE)[[1]]
  newmsf.WM$Haz$Haz[newmsf.WM$Haz$trans == 1] <- ff * msf.WM$Haz$Haz[msf.WM$Haz$trans == 1]
  pt.WM <- probtrans(newmsf.WM, 0, variance = FALSE)[[1]]
  idx1 <- (pt.WW$time < 13)
  idx2 <- (pt.WM$time < 13)
  plot(c(0, pt.WW$time[idx1]), c(0, pt.WW$pstate3[idx1]), type = "s",
       ylim = c(0, 0.52), xlab = "Years from HIV infection",
       ylab = "Probability", lwd = 2)
  lines(c(0, pt.WM$time[idx2]), c(0, pt.WM$pstate3[idx2]),
       type = "s", lwd = 2, col = 8)
  title(main = paste("Factor =", ff))
}

par(mfrow = c(1, 1))

This figure represents cumulative incidence functions for SI appearance, for CCR5 wild-type WW (black) and mutant WM (grey). The baseline hazard of AIDS was multiplied by different factors, while keeping everything else the same.
The use of the long format, in particular in combination with cause-specific dummies (ccr5WM.1 and ccr5WM.2 in our example) and stratified Cox regression, offers great flexibility in modelling the effect of covariates on the cause-specific intensity rates, while using standard statistical software.
Several authors have suggested that robust estimates of standard errors should be used in order to correct for the correlation caused by multiplication of the data set. However, each individual still has at most one event, so that standard estimates of the standard error do suffice.

Regression on cumulative incidence functions

In order to avoid the highly nonlinear effects of covariates on the cumulative incidence functions when modelling is done on the cause-specific hazards, Fine and Gray introduced a way to regress directly on cumulative incidence functions. In analogy with the relation between hazard and survival, they defined a subdistribution hazard \[\bar{h}_k(t)=-\frac{d \log(1-I_k(t))}{dt}\]
This is not the cause-specific hazard. In terms of estimates of this quantity, the difference is in the risk set. For the cause-specific hazard, the risk set decreases at each time point at which there is a failure of another cause. For $\bar{h}_k(t)$, persons who fail from another cause remain in the risk set. If there is no censoring, they remain in the risk set forever and once these individuals are given a censoring time that is larger than all event times, the analysis becomes completely standard. If there is censoring, they remain in the risk set until their potential censoring time, which is not observed if they experienced another event before. With administrative censoring, the potential censoring time is still known. If individuals may also be lost to follow-up, a censoring distribution is estimated from the data. Fine and Gray imposed a proportional hazards assumption on the subdistribution hazards: \[\bar{h}_k(t|x)=\bar{h}_{k,0}(t)\exp(\beta_k x)\]
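
The difference between the two risk sets can be seen in a toy example without censoring. The data values below are hypothetical and the code is only a sketch of the counting rule described above, not of the weighting used when censoring is present.

```r
# Toy data without censoring (hypothetical values):
# status 1 = event of interest, 2 = competing event.
time   <- c(1, 2, 3, 4, 5)
status <- c(2, 1, 2, 1, 1)

event_times <- time[status == 1]

# Cause-specific hazard: everyone who has not yet failed from any cause.
n_cause_specific <- sapply(event_times, function(t) sum(time >= t))

# Subdistribution hazard: the same subjects plus those who already
# failed from the competing cause (they stay in the risk set).
n_subdist <- sapply(event_times,
                    function(t) sum(time >= t | (status == 2 & time < t)))

data.frame(event_time = event_times, n_cause_specific, n_subdist)
```

At each event time the subdistribution risk set is larger by the number of earlier competing events, which is what makes the resulting hazard correspond directly to the cumulative incidence.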
Estimation follows the partial likelihood approach used in a standard Cox model. In a later paper, Fine extended this idea to other link functions using an estimating equations approach.
Fine and Gray regression on cumulative incidence functions is not implemented in mstate, but in the R package cmprsk. Using cmprsk we obtain the following results (after removing the five subjects with missing CCR5 covariate values and making ccr5 numeric).

library(cmprsk)
sic <- si[!is.na(si$ccr5),]
ftime <- sic$time
fstatus <- sic$status
cov <- as.numeric(sic$ccr5)-1
# for failures of type 1 (AIDS)
z1 <- crr(ftime,fstatus,cov)

z1

## convergence:  TRUE 
## coefficients:
##   cov1 
## -1.004 
## standard errors:
## [1] 0.295
## two-sided p-values:
##    cov1 
## 0.00066

# for failures of type 2 (SI)
z2 <- crr(ftime,fstatus,cov,failcode=2)
z2

## convergence:  TRUE 
## coefficients:
##    cov1 
## 0.02359 
## standard errors:
## [1] 0.2266
## two-sided p-values:
## cov1 
## 0.92

The protective effect of the mutant WM genotype on AIDS is again apparent (P = 0.0007). Note that the effect of the mutant WM genotype on SI appearance has reversed compared to regression on cause-specific hazards, though it is very far from significant.
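
For interpretation it can help to put the Fine and Gray coefficients on the hazard ratio scale. The sketch below recomputes the subdistribution hazard ratios and Wald p-values from the printed crr() coefficients and standard errors; the data frame fg and its column names are hypothetical.

```r
# Coefficients and standard errors copied from the crr() output above
fg <- data.frame(cause = c("AIDS", "SI appearance"),
                 coef  = c(-1.004, 0.02359),
                 se    = c(0.295, 0.2266))

fg$sHR <- exp(fg$coef)            # subdistribution hazard ratio, WM vs WW
fg$z   <- fg$coef / fg$se         # Wald statistic
fg$p   <- 2 * pnorm(-abs(fg$z))   # two-sided p-value
fg
```

The subdistribution hazard ratio is about 0.37 for AIDS and about 1.02 for SI appearance, compared with cause-specific hazard ratios of 0.29 and 0.78; the two kinds of ratio answer different questions and are not expected to coincide.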
The following Figures show the predicted cumulative incidence curves for time to AIDS and time to SI appearance based on the Fine and Gray results. Note that the cumulative incidence curves of SI appearance for CCR5 wild-type and mutant do not cross and that the cumulative incidence curve of the mutant is above that of the wild-type.

z1.pr <- predict(z1,matrix(c(0,1),2,1))
# this will contain predicted cum inc curves, both for WW (2nd column) and WM (3rd)
z2.pr <- predict(z2,matrix(c(0,1),2,1))
# Standard plots, not shown
par(mfrow=c(1,2))
plot(z1.pr,lty=1,lwd=2,color=c(8,1))
plot(z2.pr,lty=1,lwd=2,color=c(8,1))

par(mfrow=c(1,1))
## AIDS
n1 <- nrow(z1.pr) # remove last jump
plot(c(0,z1.pr[-n1,1]),c(0,z1.pr[-n1,2]),type="s",ylim=c(0,0.5),
     xlab="Years from HIV infection",ylab="Probability",lwd=2)
lines(c(0,z1.pr[-n1,1]),c(0,z1.pr[-n1,3]),type="s",lwd=2,col=8)
title(main="AIDS")
text(9.3,0.35,"WW",adj=0,cex=0.75)
text(9.3,0.14,"WM",adj=0,cex=0.75)

## SI appearance
n2 <- nrow(z2.pr) # again remove last jump
plot(c(0,z2.pr[-n2,1]),c(0,z2.pr[-n2,2]),type="s",ylim=c(0,0.5),
     xlab="Years from HIV infection",ylab="Probability",lwd=2)
lines(c(0,z2.pr[-n2,1]),c(0,z2.pr[-n2,3]),type="s",lwd=2,col=8)
title(main="SI appearance")
text(7.9,0.28,"WW",adj=0,cex=0.75)
text(7.9,0.31,"WM",adj=0,cex=0.75)

This figure represents cumulative incidence functions for AIDS (left) and SI appearance (right), for CCR5 wild-type WW and mutant WM, based on the Fine and Gray model.
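
That the Fine and Gray curves for the two genotypes do not cross follows from the proportional subdistribution hazards assumption, under which 1 - I(t | x) = (1 - I_0(t))^exp(beta x). The sketch below applies this relation to an invented grid of baseline (WW) cumulative incidence values; only the two coefficients are taken from the crr() fits, and the function name cif_under_ph is hypothetical.

```r
# Assumed baseline (WW) cumulative incidence values, for illustration only
cif_ww <- seq(0, 0.45, by = 0.05)

# Under proportional subdistribution hazards:
# 1 - I(t | x) = (1 - I_0(t))^exp(beta * x)
cif_under_ph <- function(cif0, beta, x = 1) 1 - (1 - cif0)^exp(beta * x)

cif_wm_si   <- cif_under_ph(cif_ww, beta =  0.02359)  # SI appearance
cif_wm_aids <- cif_under_ph(cif_ww, beta = -1.004)    # AIDS
round(cbind(cif_ww, cif_wm_si, cif_wm_aids), 3)
```

With a positive coefficient the WM curve lies on or above the WW curve at every baseline value, and with a negative coefficient it lies on or below it, so the ordering of the curves never changes over time.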
To judge the “fit” of the cause-specific and Fine and Gray regression models we estimate the cumulative incidence curves nonparametrically, i.e., separately for the two subgroups with WW and WM CCR5 genotypes. Here we can use the group argument of Cuminc.

ci <- Cuminc(si$time, si$status, group = si$ccr5)
ci.WW <- ci[ci$group == "WW", ]
ci.WM <- ci[ci$group == "WM", ]
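
For readers who want to see what a nonparametric cumulative incidence estimate involves, the sketch below computes it by hand for a toy sample: at each event time the increment for cause k is the overall survival just before that time multiplied by the proportion of the risk set failing from cause k. The data are hypothetical and have no ties; the column names CI.1 and CI.2 merely mimic those used with Cuminc above, and this is not the package's implementation.

```r
# Toy data (hypothetical values, no ties):
# status 0 = censored, 1 = cause 1, 2 = cause 2
time   <- c(1, 2, 3, 4, 5, 6)
status <- c(1, 2, 0, 1, 2, 1)

ord <- order(time)
time <- time[ord]; status <- status[ord]
n_risk <- length(time) - seq_along(time) + 1   # at risk just before each time

# Overall (all-cause) survival just after and just before each time point
surv_after  <- cumprod(1 - (status != 0) / n_risk)
surv_before <- c(1, head(surv_after, -1))

# Increment of the cumulative incidence: S(t-) * d_k / n
ci_1 <- cumsum(surv_before * (status == 1) / n_risk)
ci_2 <- cumsum(surv_before * (status == 2) / n_risk)
data.frame(time, n_risk, surv = surv_after, CI.1 = ci_1, CI.2 = ci_2)
```

At every time point the two cumulative incidences and the overall survival add up to 1, which is a convenient check on any such calculation.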

We show these nonparametric estimates in Figure 12 (Figure 9 in the tutorial).

idx1 <- (ci.WW$time < 13)
idx2 <- (ci.WM$time < 13)
plot(c(0, ci.WW$time[idx1]), c(0, ci.WW$CI.1[idx1]), type = "s",
     ylim = c(0, 0.5), xlab = "Years from HIV infection", ylab = "Probability",
     lwd = 2)
lines(c(0, ci.WM$time[idx2]), c(0, ci.WM$CI.1[idx2]), type = "s",
     lwd = 2, col = 8)
title(main = "AIDS")
text(9.3, 0.35, "WW", adj = 0, cex = 0.75)
text(9.3, 0.11, "WM", adj = 0, cex = 0.75)

This figure represents nonparametric cumulative incidence functions for AIDS, for CCR5 wild-type WW and mutant WM.

plot(c(0, ci.WW$time[idx1]), c(0, ci.WW$CI.2[idx1]), type = "s",
     ylim = c(0, 0.5), xlab = "Years from HIV infection", ylab = "Probability",
     lwd = 2)
lines(c(0, ci.WM$time[idx2]), c(0, ci.WM$CI.2[idx2]), type = "s",
     lwd = 2, col = 8)
title(main = "SI appearance")
text(7.9, 0.32, "WW", adj = 0, cex = 0.75)
text(7.9, 0.245, "WM", adj = 0, cex = 0.75)

This figure represents nonparametric cumulative incidence functions for SI appearance, for CCR5 wild-type WW and mutant WM.
As far as we know the Fine and Gray regression does not yet allow the flexibility (e.g. in testing for or assuming equality of covariate effects across different causes) of regression on cause-specific hazards. Also, it is not clear how left truncated data or time-dependent covariates can be included in their approach.

Goran Brostrom, Event History Analysis with R; Kleinbaum and Klein, Survival Analysis
This section is a summary from Cleves et al, An Introduction to Survival Analysis Using Stata
Cleves MA. An Introduction to Survival Analysis Using Stata. 3rd ed. Stata Press; 2010.
This section is a summary from 1) Cleves et al, An Introduction to Survival Analysis Using Stata, 2) Goran Brostrom, Event History Analysis with R, and 3) Kleinbaum and Klein, Survival Analysis
Goran Brostrom, Event History Analysis with R
This section is a summarized excerpt from Goran Brostrom, Event History Analysis with R
This section is a summarized excerpt from Goran Brostrom, Event History Analysis with R and Kleinbaum and Klein, Survival Analysis
This section is an excerpt and summary from Kleinbaum and Klein, Survival Analysis
Moore, Dirk. 2016. Applied Survival Analysis Using R
For more details, see Therneau et al., 2020. Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model (https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf)
Kleinbaum and Klein, Survival Analysis: A Self-Learning Text, 3rd edition
Moore, Dirk. 2016. Applied Survival Analysis Using R
Moore, Dirk. 2016. Applied Survival Analysis Using R
Moore, Dirk. 2016. Applied Survival Analysis Using R; Kleinbaum and Klein, Survival Analysis: A Self-Learning Text, 3rd edition
Right truncation is another form of length-biased sampling, but it is much more difficult to accommodate than left truncation.
Pencina MJ, Larson MG, D’Agostino RB. Choice of time scale and its effect on significance of predictors in longitudinal studies. Statistics in Medicine. 2007;26(6):1343-1359. doi:https://doi.org/10.1002/sim.2699
Moore, Dirk. 2016. Applied Survival Analysis Using R
Allison, P. Survival Analysis Using SAS. 2nd ed.
Germán Rodríguez, Survival Analysis, https://data.princeton.edu/pop509/recid1
Germán Rodríguez, Survival Analysis, https://data.princeton.edu/pop509/recid3
Agresti, Categorical Data Analysis; Germán Rodríguez, Survival Analysis, https://data.princeton.edu/pop509/recid3
Hosmer, Lemeshow, and May. Applied Survival Analysis
Conditional logistic regression. https://rdrr.io/cran/survival/man/clogit.html
Moore DF. Applied Survival Analysis Using R. Springer International Publishing; 2016. doi:10.1007/978-3-319-31245-3; Revised details of the R code are available in Putter H. 2020. Tutorial in biostatistics: Competing risks and multi-state models Analyses using the mstate package. (https://cran.r-project.org/web/packages/mstate/vignettes/Tutorial.pdf)
Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Statistics in Medicine. 2007;26(11):2389-2430. doi:https://doi.org/10.1002/sim.2712; Putter H. Tutorial in biostatistics: Competing risks and multi-state models Analyses using the mstate package.
For theoretical background, please refer to Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Statistics in Medicine. 2007;26(11):2389-2430. doi:https://doi.org/10.1002/sim.2712; Putter H. Tutorial in biostatistics: Competing risks and multi-state models Analyses using the mstate package. :64.

Parameter	Role	Interpretation
Shape (\(\beta\))	Determines hazard trend	Constant (\(=1\)), increasing (\(>1\)), or decreasing (\(<1\))
Scale (\(\lambda\))	Stretches/compresses time axis	Larger values: longer survival times; smaller values: shorter survival times

Survival Analyses with R

2025-02-20
