A note on competing risks in survival data analysis

Competing risks are quite common in epidemiological studies, and various modeling approaches have been developed to assess the impact of covariates on outcome in the presence of these competing risks. These methods include the cause-specific hazard model, the Fine-Gray model (also known as the sub-distribution hazard model), the multi-states model, the mixture model, and augmented data methods.

Among these approaches, the cause-specific hazard model and the Fine-Gray model are perhaps the most widely used methods for addressing competing risks. In this note, I will provide an overview of the fundamental concepts underlying the cause-specific hazard model and the Fine-Gray model. Additionally, I will outline the step-by-step procedures for analyzing survival data using these two methods, both in R and SAS.

1. Probability density function (PDF) and cumulative distribution function (CDF) of a random variable $X$.

The PDF and CDF are the most fundamental functions in statistics. (Here, we treat PDF and PMF as the same.)

The PDF is defined as:

\[f_X(x)=\lim_{\Delta x\rightarrow 0^+} \frac{P(x < X \leq x+\Delta x)}{\Delta x}\tag{1}\]

The CDF is defined as

\[F_X(x) = P(X \leq x), \textrm{ for all }x \in \mathbb{R} \tag{2}\] The relationship between PDF and CDF can be derived as:

\[f_X(x)=\lim_{\Delta x\rightarrow 0^+} \frac{F_X(x+\Delta x)-F_X(x)}{\Delta x}=\frac{dF_X(x)}{dx}=F'_X(x)\tag{3}\]

The survival function is defined as: \[S_X(x)=1-F_X(x)=F(X\ge x)\tag{4}\] In survival analysis, we usually use $T$ and $t$ to replace $X$ and $x$. From then on, we use $T$ and $t$ to represent the random variable and a realization of the random variable.

Another important function in survival analysis is the hazard function. When there are no competing risks, the hazard function is defined as:

\[h(t)=\lim_{\Delta t\rightarrow 0^+} \frac{P(t < T \leq t+\Delta t|T\ge t)}{\Delta t}\tag{5}\] Note that equations $(5)$ and $(1)$ may look similar, but they are very different. The equations $(5)$ and $(1)$ have the following relationship:

Remember, \[P(A|B)=\frac{P(A \cap B)}{P(B)}\]

Therefore, from equation $(5)$

\[\begin{align*} h(t)&=\lim_{\Delta t\rightarrow 0^+} \frac{P(t < T \leq t+\Delta t|T\ge t)}{\Delta t}\\ &=\lim_{\Delta t\rightarrow 0^+} \frac{P(t < T \leq t+\Delta t) \cap P(T\ge t))}{\Delta t P(T\ge t)}\\ &\text{note, the set $(t < T \leq t+\Delta t)$ is within the set $(T\ge t)$}\\ &=\frac{\lim_{\Delta t\rightarrow 0^+}\frac{P(t < T \leq t+\Delta t))}{ \Delta t}}{P(T\ge t)}\\ &\text{by equation (1) and (4)}\\ &=\frac{f(t)}{S(t)} \end{align*}\]

Therefore, we can see that the hazard function $h(t)$ is neither a probability density function nor a probability.

2 Functions in presence of competing risks

2.1 The cumulative incidence function (CIF)

When we refer to CIF, we should remember that it is the $k^{th}$ cumulative incidence function. It is defined as:

\[CIF_k(t)=P(T\le t,D=k) \tag{6}\] Here, $D$ is a random variable denoting the type of events that can occur, and $k$ is the $k^{th}$ event.

Note that CIF may resemble equation $(2)$, the cumulative distribution function (CDF); both represent probabilities, but they are not the same.

Remember : \[P(A\cap B)=P(A)P(B|A)\] then

\[\begin{align*} CIF_k(t)&=P(T\le t,D=k)\\ &=P(\{T\le t\}\cap \{D=k\})\\ &=P(T\le t)P(D=k|T\le t)\\ &=CDF*P(D=k|T\le t) \end{align*}\]

2.2 Cause-specific hazard function

The cause-specific hazard function is defined as:

\[h(t)_k^{cs}=\lim_{\Delta t\rightarrow 0^+} \frac{P(t < T \leq t+\Delta t,D=k|T\ge t)}{\Delta t}\tag{7}\] Note that cause-specific hazard equation $(7)$ and equation $(5)$, the hazard function without competing risks, may appear similar, but they are not the same.

2.3 Subdistribution hazard function

The subdistribution hazard function was defined by Fine and Gray; therefore, models using the subdistribution hazard function are also called Fine and Gray models.

\[h(t)_k^{sd}=\lim_{\Delta t\rightarrow 0^+} \frac{P(t < T \leq t+\Delta t,D=k|T\ge t \cup (T<t \cap D \neq k ))}{\Delta t}\tag{8}\]

Note that the set $(T\ge t )\cup (T<t \cap D \neq k )$ is the risk set, including those who are currently event-free as well as those who have previously experienced a competing event, such as death. This differs from the risk set for the cause-specific hazard function in $(7)$, which only includes those who are currently free from any competing events.

3. Modelling cause-specific hazard and the subdistribution hazard (Fine and Gray)

3.1 cause-specific hazard Cox model

The model can be specified as:

\[h(t)_k^{cs}=h_{k0}^{cs}(t)exp(\beta'Z)\tag{9}\] Here, $h_{k0}^{cs}(t)$ is the baseline cause-specific hazard of cause $k$, $Z$ represents the covariates, and $\beta$ represents the coefficients of the covariates.

The coefficients of the above Cox regression model are estimated by maximizing the following partial likelihood function:

\[L(\beta)=\prod_{i=1}^n\left[\frac{\exp(\beta'Z_i)}{\sum_{j\in R_i}\exp(\beta'Z_j)}\right]^{\delta_i}\tag{10}\] Here, $\delta_i$ is an indicator function: when there is an event of interest, $\delta_i=1$; otherwise, $\delta_i=0$.

From equation $(10)$, it’s important to note that only subjects who experience the event of interest contribute to the partial likelihood. However, censored subjects also influence the partial likelihood by impacting the denominator of equation $(10)$. This is because the risk set changes, and we only include subjects within the risk sets $R_i$ for whom we sum $exp(\beta'Z_j)$. These risk sets consist of subjects who have not experienced the event or have not been censored before time $t_i$, which are distinct times of failure (events) denoted as $t_1 < t_2 < \ldots$.

The implementation of the cause-specific model for competing risk models is relatively straightforward. We assign the event of interest a value of 1 and censor all other competing events with a value of 0. For example, in a bone fracture study where death is considered a competing event for bone fracture, we simply censor the cases where death occurs when constructing the cause-specific hazard regression model.

3.2 subdistribution hazard Cox model(Fine and Gray)

The model can be specified as:

\[h(t)_k^{sd}=h_{k0}^{sd}(t)exp(\beta'Z)\tag{11}\] here $h_{k0}^{sd}(t)$is the baseline of the subdistribution hazard of cause $k$

The coefficients of the above Fine and Gray Cox model are estimated by maximizing the following partial likelihood function:

\[L(\beta)=\prod_{i=1}^n\left[\frac{\exp(\beta'Z_i)}{\sum_{j\in R_i}w_{ij}\exp(\beta'Z_j)}\right]^{\delta_i}\tag{12}\] Note that the risk set $R_i$ at $t_i$ includes patients who are still at risk for the event of interest and also patients who experience a competing event before $t_i$，$w_{ij}$ is weight, subjects who experience no event of interest before $t_i$ are given a weight equals to 1, whereas subjects who experience competing events before $t_i$ are given a weight $w_{ij}=\frac{G(t_i)}{G(min(t_j,t_i))}$ where $G$ is the Kaplan-Meier estimate of the survival function.

4. Analyzing real data with competing risks

4.1 The data set

I will utilize the Byar1980 dataset to illustrate the analysis of the effect of covariates on outcomes in survival analysis, accounting for the presence of competing risks. These data originate from a randomized controlled trial in which men diagnosed with either stage 3 or stage 4 prostate cancer were randomly assigned to receive either estrogen therapy (at various dosage levels) or a placebo. The dataset consists of 502 observations and encompasses 17 variables. These variables are:

• Stage: Tumor stage

• Trt: Treatment assigned (0.2 mg estrogen, 1.0 mg estrogen, 5.0 mg estrogen, placebo)

• Time: Time on study (months)

• Status: Status at end of study. One category is alive; all other categories correspond to dead, broken down by cause of death.

• Age: Age in years

• wt: Weight index. Weight (kg) - Height (cm) + 200

• pf: Patient’s activity/performance level (normal activity, in bed < 50% daytime, in bed > 50% daytime, confined to bed)

• hx: History of cardiovascular disease (1=yes, 0=no)

• sbp: Systolic blood pressure (cm Hg)

• dbp: Diastolic blood pressure (cm Hg)

• ekg: Result of electrocardiogram (normal, benign, various disease indicators)

• hg: Serum hemoglobin (g/100mL)

• sz: Size of primary tumor (cm2)

• sg: Combined index of tumor stage and grade

• ap: Serum prostatic acid phosphatase (ng/mL)

• bm: Bone metastasis (1=yes, 0=no)

• Date: Calendar date patient was enrolled in study

The dataset can be down loaded from:https://hbiostat.org/data/ (search BYAR)

Or directly download from here :https://raw.githubusercontent.com/enwuliu/studynotes/main/Byar1980.txt

filename foo url "https://raw.githubusercontent.com/enwuliu/studynotes/main/Byar1980.txt";
data WORK.BYAR    ;
infile foo delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;
         informat Stage best32. ;
         informat trt $15. ;
         informat Time best32. ;
         informat Status $60. ;
         informat age best32. ;
         informat wt best32. ;
         informat pf $20. ;
         informat hx best32. ;
         informat sbp best32. ;
         informat dbp best32. ;
         informat ekg $33. ;
         informat hg best32. ;
         informat sz best32. ;
         informat sg best32. ;
         informat ap best32. ;
         informat bm best32. ;
         informat Date yymmdd10. ;
         format Stage best12. ;
         format trt $15. ;
         format Time best12. ;
         format Status $60. ;
         format age best12. ;
         format wt best12. ;
         format pf $20. ;
         format hx best12. ;
         format sbp best12. ;
         format dbp best12. ;
         format ekg $33. ;
         format hg best12. ;
         format sz best12. ;
         format sg best12. ;
         format ap best12. ;
         format bm best12. ;
         format Date yymmdd10. ;
input
                  Stage
                  trt  $
                  Time
                  Status  $
                  age
                  wt
                  pf  $
                  hx
                  sbp
                  dbp
                  ekg  $
                  hg
                  sz
                  sg
                  ap
                  bm
                  Date
      ;
run;

For illustrative purposes, I have created a new dataset containing the following variables:

death: death=1, if died from prostatic cancer; death=2 if died from other causes; death=0, if alive.
time: follow up time to death=1, 2 or alive.
RX: treatment group, RX=0 for placebo; RX=1 for 0.2 mg estrogen; RX=2 for 1.0 mg estrogen, or 5.0 mg estrogen.
Size:Size of primary tumor, size=1 if sz$$30, size=0, if sz<30.
Age: Age in year

Hence, within this dataset, we have categorical variable “RX”, binary variable “size”, and continuous variable “age” as covariates.

SAS code to get the data set:

data byar_competing(keep=RX death time age size);
set byar;
if status='dead - prostatic ca' then death=1;
if status in ('dead - cerebrovascular','dead - heart or vascular','dead - pulmonary embolus','dead - respiratory disease','dead - unknown cause','dead - unspecified non-ca','dead - other specific non-ca','dead - other ca') then death=2;
if status in ('alive') then death=0;
if trt in ('1.0 mg estrogen','5.0 mg estrogen') then RX=2;
if trt='0.2 mg estrogen' then RX=1;
if trt ='placebo' then RX=0;
if sz<30 then size=0;
if sz>=30 then size=1;
run;

4.2 Cause-specific hazard model using SAS.

Suppose our primary focus is on cases where death is attributed to prostatic cancer (death = 1), then we will censor all instances of death caused by other factors (death = 2). To accomplish this, the following SAS code is employed. The syntax “time*death(0,2)” indicates to SAS that we want to treat both death = 0 and death = 2 as censored events.

proc phreg data=byar_competing;
class RX(ref='0');
model time*death(0,2) = RX age size/rl; 
run;

We got the following results: it shows censored values are 0 and 2.

Another approach is to re-code the death variable within the dataset and then use the standard proc phreg procedure:


data byar_competing2;
set byar_competing;
if death=1 then death2=1;
if death in (0,2) then death2=0; /****death2 is the new status variable***/
run;
/*******The code is the same as an usual Cox model*****/
proc phreg data=byar_competing2;
class RX(ref='0');
model time*death2(0)= RX age size/rl;
run;

In this scenario, it’s worth noting that we obtained the exact hazard ratios as previously mentioned. However, it’s important to highlight that the censoring variable is now represented as “death2,” with a value of 0 indicating censorship for all other competing events.

4.2 Cause-specific hazard model using R.

The following R code performs a cause-specific hazard model by censoring other competing risks, and the results closely align with those obtained from SAS:

library(survival)
library(cmprsk)

byar<- read.csv("https://raw.githubusercontent.com/enwuliu/studynotes/main/byar1980.csv",header=T)

byar$death<-ifelse(byar$Status=='dead - prostatic ca',1,ifelse(byar$Status=='alive',0,2))

byar$size<-ifelse(byar$sz>=30,1,0) # make a binary tumor size

byar$RX<-ifelse(byar$trt=='placebo',0,ifelse(byar$trt=='0.2 mg estrogen',1,2)) #three group, need to make two dummy variables

byar$death2<-ifelse(byar$death==1,1,0) #death2 for cause-specific hazard model

byar$RX1<-ifelse(byar$RX==1,1,0) # make two dummy variables for the three treatment groups
byar$RX2<-ifelse(byar$RX==2,1,0)

byar2<-subset(byar,select=c(death, death2,Time, RX1,RX2,age,size))

# Cause-specific hazard for prostatic cancer.

cox.1 <- coxph(Surv(Time,death2) ~ RX1+ RX2+age +size,method="breslow", data=byar2)

summary(cox.1)

## Call:
## coxph(formula = Surv(Time, death2) ~ RX1 + RX2 + age + size, 
##     data = byar2, method = "breslow")
## 
##   n= 502, number of events= 130 
## 
##          coef exp(coef) se(coef)      z Pr(>|z|)    
## RX1   0.10762   1.11363  0.22615  0.476   0.6341    
## RX2  -0.53809   0.58386  0.21748 -2.474   0.0134 *  
## age  -0.01712   0.98303  0.01224 -1.399   0.1618    
## size  1.59460   4.92638  0.19955  7.991 1.34e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      exp(coef) exp(-coef) lower .95 upper .95
## RX1     1.1136      0.898    0.7149    1.7348
## RX2     0.5839      1.713    0.3812    0.8942
## age     0.9830      1.017    0.9597    1.0069
## size    4.9264      0.203    3.3317    7.2843
## 
## Concordance= 0.662  (se = 0.027 )
## Likelihood ratio test= 58.56  on 4 df,   p=6e-12
## Wald test            = 72.49  on 4 df,   p=7e-15
## Score (logrank) test = 84.29  on 4 df,   p=<2e-16

4.2 Fine-Gray model using SAS.

Using SAS to perform a Fine-Gray model is straightforward, as we can specify the event of interest using the “eventcode=” option in the MODEL statement. This allows us to focus on modeling the subdistribution hazard for the event of interest while accounting for competing risks

proc phreg data=byar_competing;
class RX(ref='0');
model time*death(0) = RX age size/rl eventcode=1;
run;

The followings are parts of SAS output:

4.3 Fine-Gray model using R.

The following R code preforms Fine-Gray model

cov.mat <-cbind(byar2$RX1,byar2$RX2,byar2$age,byar2$size) #prepare design matrix for the regression
crr.1 <- crr(byar2$Time,byar2$death,cov.mat,failcode=1,cencode=0)
summary(crr.1)

## Competing Risks Regression
## 
## Call:
## crr(ftime = byar2$Time, fstatus = byar2$death, cov1 = cov.mat, 
##     failcode = 1, cencode = 0)
## 
##             coef exp(coef) se(coef)      z p-value
## cov.mat1  0.1494     1.161   0.2281  0.655 5.1e-01
## cov.mat2 -0.4403     0.644   0.2169 -2.030 4.2e-02
## cov.mat3 -0.0289     0.972   0.0122 -2.378 1.7e-02
## cov.mat4  1.3747     3.954   0.2012  6.832 8.4e-12
## 
##          exp(coef) exp(-coef)  2.5% 97.5%
## cov.mat1     1.161      0.861 0.742 1.816
## cov.mat2     0.644      1.553 0.421 0.985
## cov.mat3     0.972      1.029 0.949 0.995
## cov.mat4     3.954      0.253 2.665 5.865
## 
## Num. cases = 502
## Pseudo Log-likelihood = -757 
## Pseudo likelihood ratio test = 53.3  on 4 df,

The results are similar with SAS.

4.4 Plot cumulative incidence function using R

We can use the following code to plot the overall Cumulative Incidence Function (CIF) for the event of interest and competing events.

cif1 <- cuminc(ftime=byar2$Time,fstatus=byar2$death,cencode=0)

plot(cif1$"1 1"$time,cif1$"1 1"$est,type="l", ylim=c(0,0.7), xlab="Survival time (months)",ylab="Probability", lty=1,col="red")

title("Cumulative Incidence functions")
lines(cif1$"1 2"$time,cif1$"1 2"$est,type="l",lty=1,col="blue")

legend("topleft",legend = c("Prostatic cancer death (CIF)","Other death (CIF)"),lty = c(1,1),
       col = c("red","blue"), bty="n")

5. References

1.Byar DP, Green SB (1980). The choice of treatment for cancer patients based on covariate information. Bulletin du Cancer, 67:477-488.

2.Austin, Peter C., Douglas S. Lee, and Jason P. Fine. “Introduction to the analysis of survival data in the presence of competing risks.” Circulation 133, no. 6 (2016): 601-609.

3.So, Ying, Guixian Lin, and Gordon Johnston. “Using the PHREG procedure to analyze competing-risks data.” SAS Global Forum. Vol. 2014. 2014.

4.Kleinbaum, David G., and Mitchel Klein. Survival analysis a self-learning text. Springer, 1996.